Hive SQL解析/执行计划生成流程分析
最近在研究Impala,还是先回顾下Hive的SQL执行流程吧。 Hive有三种用户接口: cli (Command line interface) bin/hive或bin/hive –service cli 命令行方式(默认) hive-server/hive-server2 bin/hive –service hiveserver 或bin/hive –service hiveserve
最近在研究Impala,还是先回顾下Hive的SQL执行流程吧。
Hive有三种用户接口:
cli (Command line interface) | bin/hive或bin/hive –service cli | 命令行方式(默认) |
hive-server/hive-server2 | bin/hive –service hiveserver 或bin/hive –service hiveserver2 | 通过JDBC/ODBC和Thrift访问(Impala通过这种方式借用hive-metastore) |
hwi (Hive web interface) | bin/hive –service hwi | 通过浏览器访问 |
在hive shell中输入“show tables;”实际执行的是:
bin/hadoop jar hive/lib/hive-cli-0.9.0.jar org.apache.hadoop.hive.cli.CliDriver -e 'SHOW TABLES;'
CLI入口函数:cli.CliDriver.main()
读入参数->建立SessionState并导入配置->处理输入文件中指令CliDriver.processFile();或交互型指令CliDriver.processLine()->解析输入CliDriver.processCmd()
(1) 如果是quit或者exit,退出
(2) 以source开头的,读取外部文件并执行文件中的HiveQL
(3) !开头的命令,执行操作系统命令(如!ls,列出当前目录的文件信息)
(4) list,列出jar/file/archive
(5) 其他命令,则生成调用相应的CommandProcessor处理,进入CliDriver.processLocalCmd()
下面看看CliDriver.processLocalCmd()这个函数:
set/dfs/add/delete指令交给指定的CommandProcessor处理,其余的交给org.apache.hadoop.hive.ql.Driver.run()处理
org.apache.hadoop.hive.ql.Driver类是查询的起点,run()方法会先后调用compile()和execute()两个函数来完成查询,所以一个command的查询分为compile和execute两个阶段。
Compile
(1)利用antlr生成的HiveLexer.java和HiveParser.java类,将HiveQL转换成抽象语法树(AST)。
首先使用antlr工具将srcqlsrcjavaorgapachehadoophiveqlparsehive.g编译成以下几个文件:HiveParser.java,?Hive.tokens,?Hive__.g,?HiveLexer.java
HiveLexer.java和HiveParser.java分别是词法和语法分析类文件,Hive__.g是HiveLexer.java对应的词法分析规范,Hive.tokens定义了词法分析后所有的token。
然后沿着“Driver.compile()->ParseDriver.parse(command, ctx)->HiveParserX.statement()->antlr中的API”这个调用关系把输入的HiveQL转化成ASTNode类型的语法树。HiveParserX是由antlr生成的HiveParser类的子类。
(2)利用对应的SemanticAnalyzer类,将AST树转换成Map-reduce task。主要分为三个步骤:
a) AST -> operator DAG
b) optimize operator DAG
c) oprator DAG -> Map-reduce task
首先接着上一步生成的语法树ASTNode,?SemanticAnalyzerFactory会根据ASTNode的token类型生成不同的SemanticAnalyzer (所有这些SemanticAnalyzer都继承自BaseSemanticAnalyzer)
1) ExplainSemanticAnalyzer
2) LoadSemanticAnalyzer
3) ExportSemanticAnalyzer
4) DDLSemanticAnalyzer
5) FunctionSemanticAnalyzer
6) SemanticAnalyzer
然后调用BaseSemanticAnalyzer.analyze()->BaseSemanticAnalyzer. analyzeInternal()。
下面以最常见的select * from table类型的查询为例,进入的子类是SemanticAnalyzer. analyzeInternal(),这个函数的逻辑如下:
1) doPhase1():将sql语句中涉及到的各种信息存储起来,存到QB中去,留着后面用。
2) getMetaData():获取元数据信息,主要是sql中涉及到的 表 和 元数据 的关联
3) genPlan():生成operator tree/DAG
4) optimize:优化,对operator tree/DAG 进行一些优化操作,例如列剪枝等(目前只能做rule-based optimize,不能做cost-based optimize)
5) genMapRedTasks():将operator tree/DAG 通过一定的规则生成若干相互依赖的MR任务
Execute
将Compile阶段生成的task信息序列化到plan.xml,然后启动map-reduce,在configure时反序列化plan.xml
实例分析:
在hive中有这样一张表:
uid |
fruit_name |
count |
a |
apple |
5 |
a |
orange |
3 |
a |
apple |
2 |
b |
banana |
1 |
执行如下的查询:
SELECT uid, SUM(count) FROM logs GROUP BY uid
通过explain命令可以查看执行计划:
EXPLAIN SELECT uid, SUM(count) FROM logs GROUP BY uid;
依照hive.g的语法规则,生成AST如下:
ABSTRACT SYNTAX TREE: ( TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) ( TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) ( TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count))) ) (TOK_GROUPBY (TOK_TABLE_OR_COL uid)) ) )
生成的执行计划operator tree/DAG如下:
STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: logs TableScan // 扫描表 alias: logs Select Operator //选择字段 expressions: expr: uid type: string expr: count type: int outputColumnNames: uid, count Group By Operator //在map端先做一次聚合,减少shuffle数据量 aggregations: expr: sum(count) //聚合函数 bucketGroup: false keys: expr: uid type: string mode: hash outputColumnNames: _col0, _col1 Reduce Output Operator //输出key,value给reduce key expressions: expr: _col0 type: string sort order: + Map-reduce partition columns: expr: _col0 type: string tag: -1 value expressions: expr: _col1 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: sum(VALUE._col0) //聚合 bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial outputColumnNames: _col0, _col1 Select Operator //选择字段 expressions: expr: _col0 type: string expr: _col1 type: bigint outputColumnNames: _col0, _col1 File Output Operator //输出到文件 compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1
Hive优化策略:
1. 去除查询中不需要的column
2. Where条件判断等在TableScan阶段就进行过滤
3. 利用Partition信息,只读取符合条件的Partition
4. Map端join,以大表作驱动,小表载入所有mapper内存中
5. 调整Join顺序,确保以大表作为驱动表
6. 对于数据分布不均衡的表Group by时,为避免数据集中到少数的reducer上,分成两个map-reduce阶段。第一个阶段先用Distinct列进行shuffle,然后在reduce端部分聚合,减小数据规模,第二个map-reduce阶段再按group-by列聚合。
7. 在map端用hash进行部分聚合,减小reduce端数据处理规模。
参考文献:
http://fatkun.com/2013/01/hive-group-by.html
原文地址:Hive SQL解析/执行计划生成流程分析, 感谢原作者分享。

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics











HQL and SQL are compared in the Hibernate framework: HQL (1. Object-oriented syntax, 2. Database-independent queries, 3. Type safety), while SQL directly operates the database (1. Database-independent standards, 2. Complex executable queries and data manipulation).

With the popularity of mobile Internet, Toutiao has become one of the most popular news information platforms in my country. Many users hope to have multiple accounts on the Toutiao platform to meet different needs. So, how to open multiple Toutiao accounts? This article will introduce in detail the method and application process of opening multiple Toutiao accounts. 1. How to open multiple Toutiao accounts? The method of opening multiple Toutiao accounts is as follows: On the Toutiao platform, users can register accounts through different mobile phone numbers. Each mobile phone number can only register one Toutiao account, which means that users can use multiple mobile phone numbers to register multiple accounts. 2. Email registration: Use different email addresses to register a Toutiao account. Similar to mobile phone number registration, each email address can also register a Toutiao account. 3. Log in with third-party account

[Analysis of the meaning and usage of midpoint in PHP] In PHP, midpoint (.) is a commonly used operator used to connect two strings or properties or methods of objects. In this article, we’ll take a deep dive into the meaning and usage of midpoints in PHP, illustrating them with concrete code examples. 1. Connect string midpoint operator. The most common usage in PHP is to connect two strings. By placing . between two strings, you can splice them together to form a new string. $string1=&qu

In today's fast-paced society, sleep quality problems are plaguing more and more people. In order to improve users' sleep quality, a group of special sleep anchors appeared on the Douyin platform. They interact with users through live broadcasts, share sleep tips, and provide relaxing music and sounds to help viewers fall asleep peacefully. So, are these sleep anchors profitable? This article will focus on this issue. 1. Are Douyin sleep anchors profitable? Douyin sleep anchors can indeed earn certain profits. First, they can receive gifts and transfers through the tipping function in the live broadcast room, and these benefits depend on their number of fans and audience satisfaction. Secondly, the Douyin platform will give the anchor a certain share based on the number of views, likes, shares and other data of the live broadcast. Some sleep anchors will also

Zhongguancun News: On the morning of April 18, Huawei suddenly announced that the P70 series of mobile phones are officially on sale under the Pioneer Plan. Friends who want to buy should be prepared to take action. According to past practice, Huawei's flagship mobile phones are very popular and will always be out of stock. . This time the Huawei P70 series has been renamed Pura, which means pure. Previously, Huawei's Yu Chengdong said: Since 2012, Huawei's P series smartphones have been like loyal partners, accompanying hundreds of millions of users around the world to spend countless precious moments and jointly witness the beauty and excitement of life. He deeply felt that the trust and love given by every user who chooses Huawei's P series is tantamount to a powerful driving force, always inspiring Huawei to move forward firmly on the road of innovation. Pura means pure.

Analysis of the Impact of MySQL Connection Number on Database Performance With the continuous development of Internet applications, databases have become an important data storage and management tool to support application systems. In the database system, the number of connections is an important concept, which is directly related to the performance and stability of the database system. This article will start from the perspective of MySQL database, explore the impact of the number of connections on database performance, and analyze it through specific code examples. 1. What is the number of connections? The number of connections refers to the number of client connections supported by the database system at the same time. It can also be managed

Analysis of new features of Win11: How to skip logging in to a Microsoft account. With the release of Windows 11, many users have found that it brings more convenience and new features. However, some users may not like having their system tied to a Microsoft account and wish to skip this step. This article will introduce some methods to help users skip logging in to a Microsoft account in Windows 11 and achieve a more private and autonomous experience. First, let’s understand why some users are reluctant to log in to their Microsoft account. On the one hand, some users worry that they

Title: Is Tencent’s main programming language Go: An in-depth analysis. As China’s leading technology company, Tencent has always attracted much attention in its choice of programming languages. In recent years, some people believe that Tencent mainly adopts Go as its main programming language. This article will conduct an in-depth analysis of whether Tencent's main programming language is Go, and give specific code examples to support this view. 1. Application of Go language in Tencent Go is an open source programming language developed by Google. Its efficiency, concurrency and simplicity are loved by many developers.
