博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
Hive Outline - Part I (UDF, JOIN, SELECT, Map-Red)
阅读量:6000 次
发布时间:2019-06-20

本文共 6877 字,大约阅读时间需要 22 分钟。

hot3.png

The Outline of Hive

User-Defined Functions

- UDF

UDF is one-one row function,for example, Substring,trim. A customized class extended from org.apache.hadoop.hive.ql.exec.UDF, And implement method 'evaluate' with multiple signature body. This is not declared as interface method, just because it might need multiple parameters,

this is uncertain on application context. for Example:

public class strip extends UDF {  public Text evaluate(...){}}

- UDAF

UDAF is many-one rows (aggregation) function, example, Count,Max. The class extends org.apache.hadoop.hive.ql.exec.UDAF. And The class need multiple nested class that extends from org.apache.hadoop.hive.ql.exec.UDADEvaluator, like:

 class maxint extends UDAF {  public static class maxintEvaluator extends UDAFEvaluator{   //very-first   void init(){}      // for each row in partial   bool iterator(){}      // finish a partial   terminatePartial(){}      //merge partial   bool merge(partialresult);      //final result   doubleWritable terminate(){}  }  public static class maxlongEvaluator extends UDAFEvaluator{  } }

 

- UDTF 

UDTF is one-many (table-generated) function, for example explode(column),

 The class extends org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
 The class implement initialize, process, close() method.
 using with lateral view,like:

 select src.id, mytable.col1, mytable.col2 from src lateral view explode_map(properties) mytable as col1, col2;

Query Hive

Questions

1. What's LATERAL VIEW? Lateral View is using with split/explode together, which are able to explode Array/Map/Structure to multiple rows. but not able to use join, sort, order and etc. Very limited.

  • View

generate by CVAS (Create View AS Select), it's just a saved query, no cached View in hive. it can show by 'Show tables;', and using 'describe extended view' for details.

  • SubQuery, must have alias, how many levels not sure, need verify on practise.

Select a,b From (   select... ) aliashere Group by a,b

  • Joins,

    Hive support Inner Join, Left/Right/Full Outer join, On syntax only support eaqual condition, other condition not supported.

    The algorithm will use Reduce Only, which will try to cache left table (FROM table) , and transfer right table (JOIN table). So better set FROM table smaller, this will get best in practise.

       'Explain/Explain Extended' can use rule-based query optimizer, and tell more details about       the query. cost-based optimizer might be adedd in futrue.

    -- Semi Join, which is to repolace IN syntax, example, the right table only be referenced in On condition, and it can't be referenced in SELECT and other key words, for example.

 

Select * from TA left SEMI Join TB on (TA.id = TB.id);

    -- Map-side Join

    If the table is a small data set, then Hive able to optimize the table by In-Memory in mapper, to enhance the speed, that's so called Map size, since it's different with Reduce side jon. Using C style comment, this looks wierd, I guess some unkown reason the developer using this syntax, such as this is a hint, but not a rule. because some condition hive need to justice if it's ready to use map-side join.

SELECT /*+ MAPJOIN(tb2) */ tb1.field1,tb2.filed1FROM student tb1JOIN teacher tb2ON(tb1.teacherId = tb2.id)

In above sample, Table teacher is smaller data set than student.

Without MAPJOIN, hive might use auto convert to optimize common JOIN, some Configurations:

hive.auto.convert.join = true; this enable Auto convert in Hive, if the table is taken as small table, the value is as below setting shows.

hive.smalltable.filesize=10M; small table size standard.

hive.hashtable.max.memory.usage =50; if Map task need memory exceed this percentage, the map task will terminate.

Normally, LEFT OUTER JOIN tb1, and RIGHT OUTER JOIN tb1, try to chck if tb1 is ready to be hashtable.

However, FULL OUTER JOIN never try this, it will use reduce only algorithm. see Joins for details.

- Index

Hive actually create a table, ordered by index fields, and remember the offset to acclerate speed.

in my case the Index Table named by Database__Table_Index__.

CREATE INDEX table02_index ON TABLE table02 (column3) AS 'COMPACT' WITH DEFERRED REBUILD;ALTER INDEX table02_index ON table2 REBUILD;SHOW FORMATTED INDEX ON table02;DROP INDEX table02_index ON table02;

Note: In previous version of Hive, I still see bucket which is a hash map to enhance map-side join, but now it looks deprecated because Index is a more friendly feature to cover.

- Order By, order by is bad performance without 'limt N' clause.

Difference between Sort By and Order By

Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.

Note: It may be confusing as to the difference between SORT BY alone of a single column and CLUSTER BY. The difference is that CLUSTER BY partitions by the field and SORT BY if there are multiple reducers partitions randomly in order to distribute data (and load) uniformly across the reducers.

Basically, the data in each reducer will be sorted according to the order that the user specified. 

FROM (FROM (FROM src            SELECT TRANSFORM(value)            USING 'mapper'            AS value, count) mapped      SELECT cast(value as double) AS value, cast(count as int) AS count      SORT BY value, count) sortedSELECT TRANSFORM(value, count)USING 'reducer'AS whatever

Syntax of Cluster By and Distribute By

Cluster By and Distribute By are used mainly with the . But, it is sometimes useful in SELECT statements if there is a need to partition and sort the output of a query for subsequent queries.

Cluster By is a short-cut for both Distribute By and Sort By.

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.

SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC

Below table run Map-Reduce job and have DISTRIBUTE BY and SORT BY in different fields, if it's same, you may only specify 'CLUSTER BY'.

FROM (  FROM pv_users  MAP ( pv_users.userid, pv_users.date )  USING 'map_script.py'  AS c1, c2, c3  DISTRIBUTE BY c2  SORT BY c2, c1) map_outputINSERT OVERWRITE TABLE pv_users_reduced  REDUCE ( map_output.c1, map_output.c2, map_output.c3 )  USING 'reduce_script.py'  AS date, count;

转载于:https://my.oschina.net/zhujinbao/blog/301604

你可能感兴趣的文章
对象复制
查看>>
Mongodb内嵌数组的完全匹配查询
查看>>
MyBatis学习笔记(四) 注解
查看>>
hihoCoder #1015 : KMP算法【KMP裸题,板子】
查看>>
用sublime 3搭建php 运行环境
查看>>
主机安装
查看>>
windows基础(二)
查看>>
Mysql/Mariadb 升级注意事项
查看>>
event.preventDefault() 和 return false 都可以终止程序,二者有什么异同点?(转)
查看>>
简单实现验证码
查看>>
游戏 找CALL技巧 突破口
查看>>
java程序员面试经典问题总汇
查看>>
css position:absolute 如何居中对齐
查看>>
MySQL各种日期类型与整型(转)
查看>>
linux:sed高级命令之n、N(转)
查看>>
mass Framework class模块v12
查看>>
Android错误-error: Found text " " where item tag is expected
查看>>
PHP语言中global和$GLOBALS[]的分析 之二
查看>>
dom 的添加或事件绑定
查看>>
Basic Oracle For Developer & Beginner
查看>>