Installing and Using Hive 2.3.2 on a CentOS 7.3.x + Hadoop v2.9.0 Cluster

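Preface

Apache Hive requires an existing Hadoop cluster. Hive only needs to be installed on the cluster's NameNode host; it does not have to be installed on the DataNode machines.

Editing the configuration files does not require Hadoop to be running, but this guide uses Hadoop's hdfs commands, which only work against a running cluster, and Hive itself will not start unless Hadoop is running normally. It is therefore best to start the Hadoop cluster before you begin.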

The software versions used in this installation are:

CentOS 7.3.x
Hadoop 2.9.0 (cluster)
JDK 8
Hive 2.3.2

For how to install a Hadoop cluster on CentOS 7.3.x, see my blog post: CentOS 7.3.x + Hadoop 2.9.0 Cluster Setup in Practice

1. Download Apache Hive

Download page: http://hive.apache.org/downloads.html

Pick one of the mirrors listed on that page, preferring one close to you. This guide uses version 2.3.2, downloaded from:

http://ftp.cuhk.edu.hk/pub/packages/apache.org/hive/hive-2.3.2/apache-hive-2.3.2-bin.tar.gz

2. Install Apache Hive

2.1. Upload and extract

Download apache-hive-2.3.2-bin.tar.gz to the Hadoop NameNode host and extract it under /opt.

# cp apache-hive-2.3.2-bin.tar.gz /opt
# cd /opt ; tar zxvf apache-hive-2.3.2-bin.tar.gz

2.2. Configure environment variables

# vim /etc/profile
# Append the following lines to the end of the file:
export HIVE_HOME=/opt/apache-hive-2.3.2-bin/
export PATH=$PATH:$HIVE_HOME/bin
# . /etc/profile
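
If the profile is edited more than once, the export lines can easily end up duplicated. The following is a minimal sketch of an idempotent alternative to editing by hand. The helper name append_once is hypothetical (it is not part of Hive or CentOS), and the sketch assumes a POSIX grep.

```shell
# append_once FILE LINE
# Hypothetical helper: appends LINE to FILE only when FILE does not
# already contain exactly that line, so reruns do not duplicate it.
append_once() {
  local file="$1" line="$2"
  # -x matches whole lines, -F treats LINE as a fixed string, -q is quiet.
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

For example, as root: append_once /etc/profile 'export HIVE_HOME=/opt/apache-hive-2.3.2-bin/' followed by append_once /etc/profile 'export PATH=$PATH:$HIVE_HOME/bin'. The single quotes keep $PATH and $HIVE_HOME unexpanded so that they are written literally into the file.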

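2.3. Configure Hive for Hadoop HDFS

2.3.1 hive-site.xml configuration

Go to $HIVE_HOME/conf and copy hive-default.xml.template to a new file named hive-site.xml:

cd $HIVE_HOME/conf ; cp hive-default.xml.template hive-site.xml

Set the following properties in hive-site.xml. The directories can be changed to suit your own environment.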

<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/data/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
<property>
    <name>hive.exec.scratchdir</name>
    <value>/data/hive/tmp</value>
    <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
  </property>

<property>
    <name>hive.druid.broker.address.default</name>
    <value>10.70.27.12:8082</value>
    <description>
      Address of the Druid broker. If we are querying Druid from Hive,this address needs to be
      declared
    </description>
  </property>
  <property>
    <name>hive.druid.coordinator.address.default</name>
    <value>10.70.27.8:8081</value>
    <description>Address of the Druid coordinator. It is used to check the load status of newly created segments</description>
  </property>

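Use the hdfs command to create the /data/hive/warehouse directory specified in the configuration above:

# Create the directory /data/hive/warehouse
# $HADOOP_HOME/bin/hdfs dfs -mkdir -p /data/hive/warehouse
# Give the new directory read and write permissions
# $HADOOP_HOME/bin/hdfs dfs -chmod 777 /data/hive/warehouse
# Check the resulting permissions
# $HADOOP_HOME/bin/hdfs dfs -ls /data/hive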
Found 1 items
drwxrwxrwx  - root supergroup     0 2018-03-19 20:25 /data/hive/warehouse

# Create the directory /data/hive/tmp in the same way
# $HADOOP_HOME/bin/hdfs dfs -mkdir -p /data/hive/tmp
# Give /data/hive/tmp read and write permissions
# $HADOOP_HOME/bin/hdfs dfs -chmod 777 /data/hive/tmp
# Check the directories that were created
# $HADOOP_HOME/bin/hdfs dfs -ls /data/hive/
Found 2 items
drwxrwxrwx  - root supergroup     0 2018-03-19 20:32 /data/hive/tmp
drwxrwxrwx  - root supergroup     0 2018-03-19 20:25 /data/hive/warehouse
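
The two directories are created with the same three commands, so the steps can be folded into one loop. This is only a sketch: the function name prepare_hive_dirs and the HDFS_BIN override variable are hypothetical, and it assumes HADOOP_HOME points at the Hadoop installation when HDFS_BIN is not set.

```shell
# prepare_hive_dirs DIR...
# Hypothetical helper: creates each HDFS directory, opens its permissions
# to 777 as in the steps above, then lists the directories themselves.
prepare_hive_dirs() {
  # HDFS_BIN is an assumed override; by default use $HADOOP_HOME/bin/hdfs.
  local hdfs="${HDFS_BIN:-${HADOOP_HOME:-}/bin/hdfs}"
  local dir
  for dir in "$@"; do
    "$hdfs" dfs -mkdir -p "$dir" || return 1
    "$hdfs" dfs -chmod 777 "$dir" || return 1
  done
  # -d lists the directories as plain entries instead of their contents.
  "$hdfs" dfs -ls -d "$@"
}
```

With the Hadoop cluster running, it would be called as: prepare_hive_dirs /data/hive/warehouse /data/hive/tmp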

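2.3.2 Change the temporary directories in $HIVE_HOME/conf/hive-site.xml

- Edit $HIVE_HOME/conf/hive-site.xml as follows, then create the tmp directory.

1. Replace every occurrence of ${system:java.io.tmpdir} in the file with /opt/apache-hive-2.3.2-bin/tmp

2. Replace every occurrence of ${system:user.name} in the file with root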

[root@apollo conf]# cd $HIVE_HOME
[root@apollo hive]# mkdir tmp
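
The template contains many occurrences of both placeholders, so doing the replacement with sed is less error-prone than editing by hand. Below is a minimal sketch under stated assumptions: the function name replace_hive_placeholders is hypothetical, and the replacement values must not contain the characters |, & or backslash, because they are pasted into a sed expression.

```shell
# replace_hive_placeholders FILE TMP_DIR USER
# Hypothetical helper: rewrites FILE in place and keeps a FILE.bak copy.
replace_hive_placeholders() {
  local site_file="$1" tmp_dir="$2" user_name="$3"
  cp "$site_file" "$site_file.bak" || return 1
  # [$] and [.] match the literal characters; '|' is the sed delimiter
  # so that the slashes in TMP_DIR need no escaping.
  sed -e 's|[$]{system:java[.]io[.]tmpdir}|'"$tmp_dir"'|g' \
      -e 's|[$]{system:user[.]name}|'"$user_name"'|g' \
      "$site_file.bak" > "$site_file"
}
```

For the values used in this guide, the call would be: replace_hive_placeholders "$HIVE_HOME/conf/hive-site.xml" /opt/apache-hive-2.3.2-bin/tmp root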

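2.4 Install and configure MySQL

2.4.1. Install MySQL

For installing MySQL on CentOS 7, see: Installing MySQL 5.7 on CentOS 7 from RPM packages. It is not repeated here.

2.4.2. Copy the MySQL driver into Hive's lib directory

Download MySQL Connector/J from the official site:

https://dev.mysql.com/downloads/connector/j/

This guide uses mysql-connector-java-5.1.46.tar.gz. Copy the driver jar into the Hive installation as follows: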

# tar zxvf mysql-connector-java-5.1.46.tar.gz
# cd mysql-connector-java-5.1.46; cp mysql-connector-java-5.1.46.jar $HIVE_HOME/lib

2.4.3 Change the database settings in hive-site.xml

Edit $HIVE_HOME/conf/hive-site.xml as follows.

Search for javax.jdo.option.ConnectionURL and set its value to the address of the MySQL server:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://10.70.27.12:3306/hive?createDatabaseIfNotExist=true</value>
  <description>
    JDBC connect string for a JDBC metastore.
    To use SSL to encrypt/authenticate the connection,provide database-specific SSL flag in the connection URL.
    For example,jdbc:postgresql://myhost/db?ssl=true for postgres database.
  </description>
</property>

Search for javax.jdo.option.ConnectionDriverName and set its value to the MySQL driver class name:

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

Search for javax.jdo.option.ConnectionUserName and set its value to the MySQL login name:

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>Username to use against metastore database</description>
</property>

Search for javax.jdo.option.ConnectionPassword and set its value to the MySQL login password:

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive888</value>
  <description>password to use against metastore database</description>
</property>

Search for hive.metastore.schema.verification and set its value to false:

<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
  <description>
    Enforce metastore schema version consistency.
    True: Verify that version information stored in is compatible with one from Hive jars.  Also disable automatic
          schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
          proper metastore schema migration. (Default)
    False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
  </description>
</property>
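
Because hive-site.xml is several thousand lines long, it helps to read the edited values back before moving on. The sketch below is one way to do that with awk. The function name hive_site_value is hypothetical, and it assumes each <name> line is followed by its <value> on a later line, as in the template; a property written entirely on one line would not be found.

```shell
# hive_site_value NAME FILE
# Hypothetical helper: prints the <value> that follows <name>NAME</name>.
hive_site_value() {
  local name="$1" file="$2"
  awk -v want="<name>$name</name>" '
    index($0, want) { found = 1; next }   # remember that the name was seen
    found && /<value>/ {
      sub(/^.*<value>/, "")               # strip everything up to the value
      sub(/<\/value>.*$/, "")             # strip the closing tag and the rest
      print
      exit
    }
  ' "$file"
}
```

For example, hive_site_value javax.jdo.option.ConnectionDriverName "$HIVE_HOME/conf/hive-site.xml" should print the driver class configured above.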

2.4.4 Create hive-env.sh in $HIVE_HOME/conf

# cd $HIVE_HOME/conf
# Copy hive-env.sh.template to a new file named hive-env.sh
# cp hive-env.sh.template hive-env.sh
# Open hive-env.sh and add the following lines
# vim hive-env.sh
export HADOOP_HOME=/opt/hadoop-2.9.0
export HIVE_CONF_DIR=/opt/apache-hive-2.3.2-bin/conf
export HIVE_AUX_JARS_PATH=/opt/apache-hive-2.3.2-bin/lib

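3. Start and test

3.1. Initialize the MySQL database

First log in to MySQL as root to create the user, the database and the grants. After logging in, run:

mysql> create user 'hive'@'%' identified by 'hive888';
mysql> create database hive DEFAULT CHARSET utf8 COLLATE utf8_general_ci;
mysql> GRANT ALL ON hive.* TO 'hive'@'%';
mysql> flush privileges;
mysql> quit;

Then change to $HIVE_HOME/bin:
# cd $HIVE_HOME/bin
# Initialize the metastore schema:
# schematool -initSchema -dbType mysql

Output: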

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-2.3.2-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.9.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:mysql://10.70.27.12:3306/hive?createDatabaseIfNotExist=true
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: root
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Initialization script completed
schemaTool completed

After it succeeds, check the MySQL database:
# mysql -uroot -p<yourpassword>

mysql> use hive;
Database changed

mysql> show tables;

3.2. Start Hive

# $HIVE_HOME/bin/hive

.....

hive>

3.3. Test

3.3.1. List the available functions:

hive> show functions;
OK
...
hive> describe database bigtreetrial;
OK
bigtreetrial  hdfs://hadoopServer3:9000/data/hive/warehouse/bigtreetrial.db  root  USER
Time taken: 0.02 seconds,Fetched: 1 row(s)
hive>

3.3.2. Show the details of the sum function:

hive> desc function sum;
OK
sum(x) - Returns the sum of a set of numbers
Time taken: 0.008 seconds,Fetched: 1 row(s)

3.3.3. Create a test database and table

# Create the database
hive> create database bigtreeTrial;
# Create the table
hive> use bigtreeTrial;
hive> create table student(id int,name string) row format delimited fields terminated by '\t';
hive> desc student;
OK
id                      int                                         
name                    string                                      
Time taken: 0.114 seconds,Fetched: 2 row(s)
hive> select * from student;
OK
Time taken: 1.089 seconds

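3.3.4. Load a file into the table

3.3.4.1. Create a file under $HIVE_HOME

# cd $HIVE_HOME
Create the file student.dat:
# touch student.dat
Add the following content to the file, separating the two fields on each line with a tab character to match the table's '\t' delimiter: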
[root@apollo hive]# vim student.dat
001   daniel
002   bill
003   bruce
004   xin
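
The student table is declared with fields terminated by '\t', and an editor configured to expand tabs into spaces would silently produce a file that Hive cannot split into columns. As a hedged alternative to typing the file in vim, the sketch below writes the same four rows with printf so that the separator is guaranteed to be a real tab. The function name write_student_file is hypothetical.

```shell
# write_student_file PATH
# Hypothetical helper: writes the four sample rows as id<TAB>name lines.
write_student_file() {
  local out="$1"
  # printf reuses the format for each pair of arguments, one row per pair.
  printf '%s\t%s\n' 001 daniel 002 bill 003 bruce 004 xin > "$out"
}
```

For the path used in the next step, call it as: write_student_file /opt/apache-hive-2.3.2-bin/student.dat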


3.3.4.2. Load the data

hive> load data local inpath '/opt/apache-hive-2.3.2-bin/student.dat' into table bigtreeTrial.student;
Loading data to table bigtreetrial.student
OK
Time taken: 4.844 seconds

3.3.4.3 Check that the data was loaded

hive> use bigtreeTrial;
OK
Time taken: 0.022 seconds
hive> select * from student;
OK
1    daniel
2    bill
3    bruce
4    xin
Time taken: 1.143 seconds,Fetched: 4 row(s)
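
The same check can be run without entering the interactive shell, which is convenient in scripts. The sketch below wraps the Hive CLI's -e option (run a query string) and -S option (silent mode). The function name hive_query and the HIVE_BIN override variable are hypothetical, and HIVE_HOME is assumed to be set as in section 2.2.

```shell
# hive_query SQL
# Hypothetical helper: runs one statement through the Hive CLI and prints
# only the query result.
hive_query() {
  # HIVE_BIN is an assumed override; by default use $HIVE_HOME/bin/hive.
  local hive="${HIVE_BIN:-${HIVE_HOME:-}/bin/hive}"
  "$hive" -S -e "$1"
}
```

For example: hive_query 'select * from bigtreeTrial.student;'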

3.3.4.4 View the data in HDFS

# $HADOOP_HOME/bin/hdfs dfs -ls /data/hive/warehouse
Found 1 items
drwxrwxrwx - root supergroup 0 2018-03-20 11:40 /data/hive/warehouse/bigtreetrial.db
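3.4. View the newly written HDFS data in the web UI

Open the following link (the Hadoop NameNode) in a browser to see Hive's data in HDFS.

http://10.70.27.3:50070/explorer.html#/data/hive/warehouse

Note: open http://10.70.27.3:50070 first, then choose Utilities -> Browse File System from the menu on the far right, enter / and click Go. From there you can browse the HDFS contents step by step.

3.5. Check the hive database in MySQL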

1 row in set (0.00 sec)

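4. Building and patching (optional)
This step is not part of installation or configuration.
After this installation, we found while using Hive that its Druid integration had a problem that required patching Hive. The official Hive release at the time did not include the patch, so we had to apply it ourselves.
The symptom was as follows:
Druid broker log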
==============
2018-03-23T03:20:00,992 INFO [qtp2119918107-144] io.druid.java.util.emitter.core.LoggingEmitter - Event [{"Feed":"metrics","timestamp":"2018-03-23T03:20:00.992Z","service":"druid/broker","host":"10.70.27.12 :8082","version":"0.12.0","metric":"query/bytes","value":389,"context":"{\"queryId\":\"ae955617-7f55-4db6-a239-16a8acc85316\"}","dataSource":"druid_metrics","duration":"PT9223372036854775.807S","hasFilters":"false","id":"ae955617-7f55-4db6-a239-16a8acc85316","interval":["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],"remoteAddress":"10.70.27.3","success":"true","type":"segmentMetadata"}]
2018-03-23T03:17:24,459 ERROR [qtp2119918107-147] com.sun.jersey.spi.container.ContainerResponse -
The RuntimeException could not be mapped to a response,re-throwing to the HTTP container
java.lang.IllegalArgumentException: Invalid format: "1900-01-01T08:05:43.000 08:05:43" is malformed at " 08:05:43"
at org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:945) ~[joda-time-2.9.9.jar:2.9.9]
at org.joda.time.convert.StringConverter.setInto(StringConverter.java:212) ~[joda-time-2.9.9.jar:2.9.9]
at org.joda.time.base.BaseInterval.<init>(BaseInterval.java:200) ~[joda-time-2.9.9.jar:2.9.9]
at org.joda.time.Interval.<init>(Interval.java:289) ~[joda-time-2.9.9.jar:2.9.9]
at io.druid.java.util.common.Intervals.of(Intervals.java:38) ~[java-util-0.12.0.jar:0.12.0]
at io.druid.server.ClientInfoResource.getQueryTargets(ClientInfoResource.java:303) ~[druid-server-0.12.0.jar:0.12.0]
Fix:
https://issues.apache.org/jira/browse/HIVE-16576
This fix is only included from 3.0.0 onward, so for now the patch has to be applied by hand. The steps are as follows:
4.1 Download the Hive source code
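This guide uses apache-hive-2.3.2-src.tar.gz. Copy the downloaded source package to a CentOS Linux host.
# tar zxvf apache-hive-2.3.2-src.tar.gz
The patch needed here is at:
https://issues.apache.org/jira/browse/HIVE-16576

Edit the source code according to the patch and save the changes. Then continue to the next step to build.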

4.2 Build from source

Note: JDK 8 and Maven must be installed on this machine.

1. Add the following to Maven's /usr/share/maven/conf/settings.xml to speed up the build.

<mirrors>
    <mirror>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>

2. Run the build:

# cd apache-hive-2.3.2-src ; mvn clean package -Pdist -DskipTests

The build takes a fairly long time. When it finishes:

# cd ./packaging/target/

This directory now contains the newly built apache-hive-2.3.2-bin.tar.gz.

5. (Optional) Integrating Hive with Druid (0.9.x and later)

5.1 Integration JIRA: @L_403_10@
5.2 Official integration page: https://cwiki.apache.org/confluence/display/Hive/Druid+Integration

Step 1: Configure and start the Tranquility server

Download tranquility-distribution-0.8.2.tar to /opt/, extract it, and edit conf/server.json:

# tar xvf tranquility-distribution-0.8.2.tar

# cd /opt/tranquility-distribution-0.8.2/conf

# vi server.json

{
  "dataSources" : {
    "pageviews" : {
      "spec" : {
        "dataSchema" : {
          "dataSource" : "pageviews",
          "parser" : {
            "type" : "string",
            "parseSpec" : {
              "timestampSpec" : {
                "format" : "auto",
                "column" : "time"
              },
              "dimensionsSpec" : {
                "dimensions" : ["url", "user"]
              },
              "format" : "json"
            }
          },
          "granularitySpec" : {
            "type" : "uniform",
            "segmentGranularity" : "hour",
            "queryGranularity" : "none"
          },
          "metricsSpec" : [
            {"name" : "views", "type" : "count"},
            {"name" : "latencyMs", "type" : "doubleSum", "fieldName" : "latencyMs"}
          ]
        },
        "ioConfig" : {
          "type" : "realtime"
        },
        "tuningConfig" : {
          "type" : "realtime",
          "maxRowsInMemory" : "100000",
          "intermediatePersistPeriod" : "PT10M",
          "windowPeriod" : "PT10M"
        }
      },
      "properties" : {
        "task.partitions" : "1",
        "task.replicants" : "1"
      }
    }
  },
  "properties" : {
    "zookeeper.connect" : "10.70.27.8:2181,10.70.27.10:2181,10.70.27.12:2181",
    "druid.discovery.curator.path" : "/druid/discovery",
    "druid.selectors.indexing.serviceName" : "druid/overlord",
    "http.port" : "8200",
    "http.threads" : "8"
  }
}

Start the Tranquility server:

# cd /opt/tranquility-distribution-0.8.2 ;./bin/tranquility server conf/server.json

....

2018-03-28 02:00:24,210 [main] INFO o.e.jetty.server.ServerConnector - Started ServerConnector@406ca9fc{HTTP/1.1}{0.0.0.0:8200}
2018-03-28 02:00:24,210 [main] INFO org.eclipse.jetty.server.Server - Started @3868ms

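Step 2: Send data to the Tranquility server

POST: http://10.70.27.8:8200/v1/post/pageviews

// 10.70.27.8 is the address of the host running the Tranquility server. pageviews is the name of the data source defined in the configuration file above.

Content type: text/plain (raw body)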

{"time": "2018-03-27T12:42:49Z","url": "/foo/bar","user": "billhongbin","latencyMs": 45}
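
The configuration above sets windowPeriod to PT10M, so an event whose timestamp is far from the current time is likely to be dropped rather than indexed. The sketch below builds the same kind of event stamped with the current UTC time and posts it with curl. The function names make_pageview_event and post_pageview are hypothetical, and the Content-Type of application/json is an assumption that differs from the text/plain used above.

```shell
# make_pageview_event URL USER LATENCY_MS
# Hypothetical helper: prints one JSON event stamped with the current UTC time.
make_pageview_event() {
  local url="$1" user="$2" latency_ms="$3"
  printf '{"time": "%s", "url": "%s", "user": "%s", "latencyMs": %s}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$url" "$user" "$latency_ms"
}

# post_pageview HOST PORT URL USER LATENCY_MS
# Hypothetical helper: posts one event to the pageviews data source.
post_pageview() {
  local endpoint="http://$1:$2/v1/post/pageviews"
  shift 2
  # --data-binary @- sends the event read from standard input unmodified.
  make_pageview_event "$@" |
    curl -s -X POST -H 'Content-Type: application/json' --data-binary @- "$endpoint"
}
```

With the server from step 1 running, a call would look like: post_pageview 10.70.27.8 8200 /foo/bar billhongbin 45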

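Step 3: Check the Druid task

http://<overlord server IP>:8090/console.html

The console shows that the task has completed.

Step 4: Download Hive and configure its Druid settings (the hive.druid.* properties shown in section 2.3.1)

Step 5: Query the data from Hive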

# /opt/apache-hive-2.3.2-bin/bin/hive

hive> show databases;
OK
bigtreetrial
default
Time taken: 3.255 seconds,Fetched: 2 row(s)

hive> use bigtreetrial;

hive> CREATE EXTERNAL TABLE bill_druid_table
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "pageviews");

hive> describe formatted bill_druid_table;

OK
# col_name data_type comment

__time timestamp from deserializer
latencyms string from deserializer
url string from deserializer
user string from deserializer
views bigint from deserializer

# Detailed Table Information
Database: bigtreetrial
Owner: root
CreateTime: Tue Mar 27 20:48:43 CST 2018
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://hadoopServer3:9000/data/hive/warehouse/bigtreetrial.db/bill_druid_table

Table Type: EXTERNAL_TABLE

Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
EXTERNAL TRUE
druid.datasource pageviews
numFiles 0
numRows 0
rawDataSize 0
storage_handler org.apache.hadoop.hive.druid.DruidStorageHandler
totalSize 0

transient_lastDdlTime 1522154923

# Storage Information
SerDe Library: org.apache.hadoop.hive.druid.serde.DruidSerDe
InputFormat: null
OutputFormat: null
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1

Time taken: 0.046 seconds,Fetched: 37 row(s)

hive> select * from bill_druid_table;
OK
2018-03-28 11:37:04 NULL /datang/machine billtang 1
2018-03-28 11:37:04 NULL /datang/machine tiger 1
2018-03-28 12:42:15 NULL /datang/machine billtang 1
2018-03-28 12:48:15 NULL /datang/machine billtang 1
2018-03-28 12:48:15 NULL /sina/machine bigtree 1
Time taken: 2.037 seconds,Fetched: 5 row(s)