Hive之——整合MySQL存储元数据信息及基本操作示例

云计算 waitig 592℃ 百度已收录 0评论

一、概述

Hive在实际工作过程中,需要一个统一的存储媒介来存储元数据信息,这个存储媒介通常可以是MySQL数据,那么如何将Hive的元数据信息存储到MySQL中呢,今天,就带着大家一起学习大数据入门系列的Hive篇——整合MySQL存储元数据信息。

二、环境整合

1、安装MySQL数据库

大家可以参考博文《MySQL之——CentOS6.5 编译安装MySQL5.6.16》或者《MySQL之——RPM方式安装MySQL5.6》

2、安装Hadoop

(1) 伪分布式安装

请参考博文:《Hadoop之——Hadoop2.4.1伪分布搭建》

(2) 集群安装

请参考博文《Hadoop之——CentOS + hadoop2.5.2分布式环境配置》

(3) 高可用集群安装

请参考博文《Hadoop之——Hadoop2.5.2 HA高可靠性集群搭建(Hadoop+Zookeeper)前期准备》和《Hadoop之——Hadoop2.5.2
HA高可靠性集群搭建(Hadoop+Zookeeper)》

3、安装Hive

这里我用的hive版本是hive-0.12.0.tar.gz,也可以使用其他版本的Hive.

1.1上传并解压

tar -zxvf hive-0.12.0.tar.gz

1.2 修改配置文件

进入到HIVE_HOME的conf目录下,修改文件hive-default.xml.template文件名称为hive-site.xml并编辑里面的内容如下:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->

<configuration>

<!-- WARNING!!! This file is provided for documentation purposes ONLY!     -->
<!-- WARNING!!! Any changes you make to this file will be ignored by Hive. -->
<!-- WARNING!!! You must make your changes in hive-site.xml instead.       -->


<!-- Hive Execution Parameters -->
<property>
	<name>javax.jdo.option.ConnectionURL</name>
	<value>jdbc:mysql://127.0.0.1:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
	<name>javax.jdo.option.ConnectionDriverName</name>
	<value>com.mysql.jdbc.Driver</value>
</property>
<property>
	<name>javax.jdo.option.ConnectionUserName</name>
	<value>root</value>
</property>
<property>
	<name>javax.jdo.option.ConnectionPassword</name>
	<value>root</value>
</property>


</configuration>

并将MySQL驱动Jar包放到HIVE_HOME的lib目录下。

配置好后启动hive即可。

三、Hive基本操作示例

1、在hive当中创建两张表

create table trade_detail (id bigint, account string, income double, expenses double, time string) row format delimited fields terminated by '\t';
create table user_info (id bigint, account string, name  string, age int) row format delimited fields terminated by '\t';

2、将mysq当中的数据直接导入到hive当中

sqoop import --connect jdbc:mysql://192.168.1.10:3306/lyz --username root --password 123 --table trade_detail --hive-import --hive-overwrite --hive-table trade_detail --fields-terminated-by '\t'
sqoop import --connect jdbc:mysql://192.168.1.10:3306/lyz --username root --password 123 --table user_info --hive-import --hive-overwrite --hive-table user_info --fields-terminated-by '\t'

3、创建一个result表保存前一个sql执行的结果

create table result row format delimited fields terminated by '\t' as select t2.account, t2.name, t1.income, t1.expenses, t1.surplus from user_info t2 join (select account, sum(income) as income, sum(expenses) as expenses, sum(income-expenses) as surplus from trade_detail group by account) t1 on (t1.account = t2.account);

4、将本地文件系统上的数据导入到HIVE当中

create table user (id int, name string) row format delimited fields terminated by '\t'
load data local inpath '/root/user.txt' into table user;

5、创建外部表

Hive内部表和外部表的区别是,当Hive删除表的时候,内部表会将所有相关的数据库/表文件删除,外部表则不会删除;外部表用关键字external标识。

create external table stubak (id int, name string) row format delimited fields terminated by '\t' location '/stubak';

6、创建分区表

普通表和分区表区别:有大量数据增加的需要建分区表

create table book (id bigint, name string) partitioned by (pubdate string) row format delimited fields terminated by '\t'; 

7、分区表加载数据

load data local inpath './book.txt' overwrite into table book partition (pubdate='2017-10-21'); 

四、其他各种操作示例

set hive.cli.print.header=true;

CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;

//sequencefile
create table tab_ip_seq(id int,name string,ip string,country string) 
    row format delimited
    fields terminated by ','
    stored as sequencefile;
    
//使用select语句来批量插入数据
insert overwrite table tab_ip_seq select * from tab_ext;


//create & load
create table tab_ip(id int,name string,ip string,country string) 
    row format delimited
    fields terminated by ','
    stored as textfile;
    
//从本地导入数据到hive的表中(实质就是将文件上传到hdfs中hive管理目录下)
load data local inpath '/home/hadoop/ip.txt' into table tab_ext;

//从hdfs上导入数据到hive表中(实质就是将文件从原始目录移动到hive管理的目录下)
load data inpath 'hdfs://ns1/aa/bb/data.log' into table tab_user;


//external外部表
CREATE EXTERNAL TABLE tab_ip_ext(id int, name string,
     ip STRING,
     country STRING)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE
 LOCATION '/external/user';
 

// CTAS  根据select语句建表结构
CREATE TABLE tab_ip_ctas
   AS
SELECT id new_id, name new_name, ip new_ip,country new_country
FROM tab_ip_ext
SORT BY new_id;


//CLUSTER <--相对高级一点,你可以放在有精力的时候才去学习>
create table tab_ip_cluster(id int,name string,ip string,country string)
clustered by(id) into 3 buckets;

load data local inpath '/home/hadoop/ip.txt' overwrite into table tab_ip_cluster;
set hive.enforce.bucketing=true;
insert into table tab_ip_cluster select * from tab_ip;

select * from tab_ip_cluster tablesample(bucket 2 out of 3 on id); 



//PARTITION  分区表
create table tab_ip_part(id int,name string,ip string,country string) 
    partitioned by (year string)
    row format delimited fields terminated by ',';
    
load data local inpath '/home/hadoop/data.log' overwrite into table tab_ip_part
     partition(year='1990');
    
    
load data local inpath '/home/hadoop/data2.log' overwrite into table tab_ip_part
     partition(year='2000');

select * from tab_ip_part;

select * from tab_ip_part  where part_flag='part2';
select count(*) from tab_ip_part  where part_flag='part2';


alter table tab_ip change id id_alter string;
ALTER TABLE tab_cts ADD PARTITION (partCol = 'dt') location '/external/hive/dt';

show partitions tab_ip_part;

//insert from select   通过select语句批量插入数据到别的表
create table tab_ip_like like tab_ip;
insert overwrite table tab_ip_like
    select * from tab_ip;
   
//write to hdfs  将结果写入到hdfs的文件中
insert overwrite local directory '/home/hadoop/hivetemp/test.txt' select * from tab_ip_part where part_flag='part1';    
insert overwrite directory '/hiveout.txt' select * from tab_ip_part where part_flag='part1';

//cli shell  通过shell执行hive的hql语句
hive -S -e 'select country,count(*) from tab_ext' > /home/hadoop/hivetemp/e.txt  

select * from tab_ext sort by id desc limit 5;

select a.ip,b.book from tab_ext a join tab_ip_book b on(a.name=b.name);


//array 
create table tab_array(a array<int>,b array<string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

select a[0] from tab_array;
select * from tab_array where array_contains(b,'word');
insert into table tab_array select array(0),array(name,ip) from tab_ext t; 

//map
create table tab_map(name string,info map<string,string>)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

load data local inpath '/home/hadoop/hivetemp/tab_map.txt' overwrite into table tab_map;
insert into table tab_map select name,map('name',name,'ip',ip) from tab_ext; 

//struct
create table tab_struct(name string,info struct<age:int,tel:string,addr:string>)
row format delimited
fields terminated by '\t'
collection items terminated by ','

load data local inpath '/home/hadoop/hivetemp/tab_st.txt' overwrite into table tab_struct;
insert into table tab_struct select name,named_struct('age',id,'tel',name,'addr',country) from tab_ext;



//UDF
select if(id=1,first,no-first),name from tab_ext;

hive>add jar /home/hadoop/myudf.jar;
hive>CREATE TEMPORARY FUNCTION fanyi AS 'cn.lyz.hive.Fanyi';
select id,name,ip,fanyi(country) from tab_ip_ext;


本文由【waitig】发表在等英博客
本文固定链接:Hive之——整合MySQL存储元数据信息及基本操作示例
欢迎关注本站官方公众号,每日都有干货分享!
等英博客官方公众号
点赞 (0)分享 (0)