Hive数据类型

基本数据类型【1】

数据类型	所占字节
tinyint	1字节整数
SMALLINT	2字节整数
INT/INTRGER	4字节整数
BIGINT	8字节整数
FLOAT	4字节单精度浮点数
DOUBLE	8字节双精度浮点数
DATE
STRING

复杂数据类型

数据类型	描述	示例
ARRAY	一组有序字段。字段类型必须相同	Array(1,2)
MAP	一组无序的键值对，键的类型必须是原子的，值可以是任何类型，同一个映射的键的类型必须相同，值的类型也必须相同	Map('a',1,'b',2)
STRUT	一组命名的字段，字段类型可以不同	Struct('a',1,1,0)

分区表

hive中没有索引，导致查询速度很慢，如果不设置分区，则每次查询会全表扫描

如果该表为分区表时，在where条件中必须对分区字段进行限制
如果该表为分区表且为全量表，需要注意分区字段的限制情况
- 如：where dt=get_dt_date(get_date(-1))
  and dt between '20210813' and '20210814' 是不行的
- 全量表：一个dt就是截至对应dt及以前的历史全量数据，所以只需要拿最新的dt就是最新全量数据。如果对全量分区表的分区字段（dt）进行范围选择，会导致报错。 （订单表一般是全量表，或者说除了流量、曝光这些发生了之后状态就不会再变化的表是增量表以外，其他基本都是全量表）
- 增量表：一个dt只是当天的数据，一天天累计下来，所以就叫增量表。（流量表一般是增量表）
Q:非分区表的全量表是否可以直接对dt进行范围的选择呢？

执行顺序

from->where->group by->having->select->order by->limit

时间函数

hive数据库中存储的时间是string类型的(string是字符串，int是整数)

把时间戳转为日期

from_unixtime(paytime,'yyyy-MM-dd hh:mm:ss')

转化UNIX时间戳（从1970-01-01 00:00:00 UTC到指定时间的秒数）到当前时区的时间格式，返回值: string

select from_unixtime(1323308943,'yyyy-MM-dd HH:mm:ss');
--2011-12-08 09:49:03
select from_unixtime(1323308943,'yyyyMMdd');
--20111208
select from_unixtime(1323308943,'yyyy-MM-dd');
--2011-12-08
select from_unixtime(1323308943,'yyyy-MM');
--2011-12

把日期转为时间戳

unix_timestamp(string date)

select unix_timestamp()
--1323309615(获得当前时区的UNIX时间戳)

select unix_timestamp('2011-12-07 13:01:03')
--1323234063(转换格式为“yyyy-MM-dd HH:mm:ss“的日期到UNIX时间戳。如果转化失败，则返回0。)

select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss');
--1323234063

select unix_timestamp('2011-12-07 13:05','yyyy-MM-dd HH:mm');
--1323234300

select unix_timestamp('2011-12','yyyy-MM');
--1322668800（转换pattern格式的日期到UNIX时间戳。如果转化失败，则返回0）

只能识别 2011-12-07 13:01:03 这种完全格式的时间,返回值: bigint

日期转换函数

date_format（date/timestamp/string ts, string fmt）

select date_format('2015-04-08', 'y');
--2015
select date_format('2015-04-08', 'yyyy');
--2015
select date_format('2015-04-08', 'yyyy-MM');
--2015-04
select date_format('2015-04-08 10:10:01', 'yyyy-MM');
--2015-04
select date_format('2015-04-08', 'yyyy-MM-dd');
--2015-04-08

日期时间转日期

to_date(string timestamp)

说明: 返回日期时间字段中的日期部分。返回值:string,year()、month()、day（）、month（）、minute（）、second（）同理。

select to_date('2011-12-08 10:03:01');
--2011-12-08
select to_date('2011-12-08');
--2011-12-08
select to_date('2011-12');
--NULL

日期转周

weekofyear (string date)

说明: 返回日期在当前的周数,返回值: int

select weekofyear('2011-12-08 10:03:01');
--49

计算日期间隔

datediff(string enddate,string startdate)
结束时间-开始时间

日期减少函数

date_sub(stringstartdate,int_days)

日期增加 函数

date_add(string,startdate,int days)

格式转换函数

CAST (expression AS data_type)

CAST()函数将任何类型的值转换为具有指定类型的值。目标类型可以是以下类型之一：BINARY，CHAR，DATE，DATETIME，TIME，DECIMAL，SIGNED，UNSIGNED

cast(create_time as timestamp)
--把create_time创建时间转化成timestamp类型

条件函数

case when
if

字符串函数

截取 函数

substr(string,int start,int len)如果不指定截取长度，则从起始位一直截取到最后

json提取 函数

string型字符串，存储为json格式：get_json_object(string json_string,'\(字段名')（'\)字段名）为string path

在json字符串中提取出想要的字段和值

--如在字段extra1(string)：{"systemtype": "android", "education": "doctor", "marriage_status": "1", "phonebrand": "VIVO"} ；提取出extra1和extra2中的手机品牌:

select get_json_object(extra1,'$.phonebrand') as phone_brand,count(distinct user_id)
from user_info
group by get_json_object(extra1,'$.phonebrand');

map<string,string>,存储为json格式：**extea2['字段名']

替换函数

regexp_replace（string,'要被替代的','替代的'）

regexp_replace(substr(create_time,1,10),'-','')
regexp_replace(to_date(create_time),'-','')

替换NULL值函数

if null():IFNULL函数接受两个参数，如果不为NULL则返回第一个参数，否则返回第二个参数
coalesce():COALESCE函数需要许多参数，并返回第一个非NULL参数。如果所有参数都为NULL，则COALESCE函数返回NULL,如果参数的数量为2，则与if null相同

聚合统计 函数

sum()、count()等（sum(),avg(),max(),count(),min()这五个聚合函数之间不可以互相嵌套。如不允许嵌套组合avg(count(*))；）

多表连接查询

在连接查询时，先去重，再做表连接，先连接再去重执行效率会低。
在表的数据量级很大的时候先分别查询出结果再进行连接效率会快些。
多表关联时尽量key相同，会当成同一个mr任务执行

应用技巧

where 1=0

where 1=0; 这个条件始终为false，结果不会返回任何数据，只有表结构，可用于快速建表或者观察表结构

连接字段类型

在连接表时，需要注意表的连接条件的字段类型是否是一致的，如果出现不一致的情况，需要进行转换

如：表1和表2的连接条件为user_id，如果表1的user_id是bigint型，表2的user_id是string型，如果直接连接，则会导致数据出现的膨胀的情况，则需要对数据进行转换，如cast(user_id as bigint),，将string类型的user_id转换成整型的user_id。

中位数

percentile_approx(字段名,0.5)

优化

order by优化【3】

order by会引发全局排序，使用distribute和sort进行分组排序效率更快（sort by 替换order by，配合distribute by一起使用）

select * from store cluster by merid;

select * 
from store 
distribute by 
merid sort by merid asc

distinct优化【5】

能使用group by代替distinct就不要使用distinct,group by 代替distinct去重数据，会有多个mapreduce执行，大数据量情况下比较好用。

join优化【4】

小表前置，因为hive在解析sql的时候会把第一个表放进内存

参考

【1】Hive基本简介和基础语句：https://zhuanlan.zhihu.com/p/154379734
【2】在互联网公司实习中使用Hive sql的一些体会和注意点：https://zhuanlan.zhihu.com/p/214254848
【3】hive中distribute by 、sort by 、cluster by 、order by的区别：https://www.jianshu.com/p/193b69210a82
【4】hive中sql优化有哪些：https://zhuanlan.zhihu.com/p/133455414
【5】hive group by distinct区别以及性能比较：https://blog.csdn.net/xiaoshunzi111/article/details/68484426

Hive SQL&Spark SQL笔记

Hive数据类型

基本数据类型【1】

复杂数据类型

分区表

执行顺序

时间函数

把时间戳转为日期

把日期转为时间戳

日期转换函数

日期时间转日期

日期转周

计算日期间隔

日期减少函数

日期增加 函数

格式转换函数

条件函数

字符串函数

截取 函数

json提取 函数

替换函数

替换NULL值函数

聚合统计 函数

多表连接查询

应用技巧

where 1=0

连接字段类型

中位数

优化

order by优化【3】

distinct优化【5】

join优化【4】

参考

相关推荐