Data Analysis with Apache Hadoop, Impala, and MySQL

Jun 01, 2016 pm 01:14 PM

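Apache Hadoop is a widely used data analytics platform that is reliable, efficient, and scalable. Alexander Rubin of Percona recently published a blog post describing how he exported a table from MySQL to Hadoop, loaded the data into Cloudera Impala, and ran reports on it.

The cluster in Rubin's test has six data nodes, with the following specifications: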

Purpose                                   | Server specification
NameNode, DataNode, Hive metastore, etc.  | 2x PowerEdge 2950, 2x L5335 CPU @ 2.00GHz, 8 cores, 16GB RAM, RAID 10 with 8 SAS drives
DataNode only                             | 4x PowerEdge SC1425, 2x Xeon CPU @ 3.00GHz, 2 cores, 8GB RAM, single 4TB drive

Exporting the data

There are many ways to export data from MySQL to Hadoop. In this example Rubin simply dumped the ontime table to a text file:

select * into outfile '/tmp/ontime.psv'
FIELDS TERMINATED BY ','
from ontime;

You can use "|" or any other character as the delimiter. Alternatively, the following simple script downloads the data directly from www.transtats.bts.gov.

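# Download and unpack one archive per month for every year from 1988 to 2013
for y in {1988..2013}
do
        for i in {1..12}
        do
                u="http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_${y}_${i}.zip"
                # -o writes wget's log messages to ontime.log
                wget "$u" -o ontime.log
                unzip On_Time_On_Time_Performance_${y}_${i}.zip
        done
done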
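As a hedged variant of the loop above, the sketch below only prints the download URLs instead of fetching them, which makes it easy to review the list before starting several hundred downloads. The function name ontime_urls is an illustrative assumption and not part of Rubin's script; the URL pattern and the year range are taken from the script above.

```shell
# Print one download URL per year/month combination (dry run, no network access).
ontime_urls() {
  y=1988
  while [ "$y" -le 2013 ]; do
    m=1
    while [ "$m" -le 12 ]; do
      echo "http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_${y}_${m}.zip"
      m=$((m + 1))
    done
    y=$((y + 1))
  done
}

# Show the first few URLs as a sanity check.
ontime_urls | head -n 3
```

Saving the output to a file (for example, ontime_urls > urls.txt, a hypothetical file name) and passing it to wget -i urls.txt -a ontime.log would then fetch every archive while appending to a single log instead of overwriting it.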

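Loading into Hadoop HDFS

Rubin first loaded the data into HDFS as a set of files. Hive or Impala works with the directory into which the data was imported and concatenates all the files in it. In Rubin's example, he created the /data/ontime/ directory in HDFS and then copied every local file matching the On_Time_On_Time_Performance_*.csv pattern into that directory.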

$ hdfs dfs -mkdir /data/ontime/
$ hdfs -v dfs -copyFromLocal On_Time_On_Time_Performance_*.csv /data/ontime/
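After the copy finishes, it can be worth confirming that the files actually landed in HDFS. The following is a minimal sketch, assuming the hdfs client is on the PATH; the function name check_hdfs_dir is hypothetical, and the directory is the one created above.

```shell
# Report file count and total size for an HDFS directory, if the hdfs client exists.
check_hdfs_dir() {
  dir=$1
  if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -count "$dir"      # columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
    hdfs dfs -du -s -h "$dir"   # summarized size in human-readable units
  else
    echo "hdfs client not found; skipping check of $dir"
  fi
}

check_hdfs_dir /data/ontime/
```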

Creating an external table in Impala

Once all the data files have been loaded, the next step is to create an external table:

CREATE EXTERNAL TABLE ontime_csv (
YearD int,
Quarter tinyint,
MonthD tinyint,
DayofMonth tinyint,
DayOfWeek tinyint,
FlightDate string,
UniqueCarrier string,
AirlineID int,
Carrier string,
TailNum string,
FlightNum string,
OriginAirportID int,
OriginAirportSeqID int,
OriginCityMarketID int,
Origin string,
OriginCityName string,
OriginState string,
OriginStateFips string,
OriginStateName string,
OriginWac int,
DestAirportID int,
DestAirportSeqID int,
DestCityMarketID int,
Dest string,
...
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/ontime';

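Note the EXTERNAL keyword and the LOCATION clause; the latter points to a directory in HDFS, not to a file. Impala only creates the metadata and does not modify the underlying data. The table can be queried immediately after it is created. The SQL Rubin ran in this example is: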

> select yeard, count(*) from ontime_csv group by yeard;

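This query took 131.38 seconds. Note that, unlike in MySQL, GROUP BY does not sort the rows; add an ORDER BY yeard clause if you need sorted output. The execution plan also shows that Impala has to scan about 45.68GB of files.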
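For sorted output, the same aggregate can be run with an explicit ORDER BY. The sketch below assumes the external table is the ontime_csv table defined earlier and that impala-shell can reach an Impala daemon with its default connection settings; when impala-shell is not installed, it only prints the query.

```shell
# Same aggregate as above, with ORDER BY so that rows come back sorted by year.
query='SELECT yeard, count(*) FROM ontime_csv GROUP BY yeard ORDER BY yeard;'

if command -v impala-shell >/dev/null 2>&1; then
  impala-shell -q "$query"
else
  echo "impala-shell not found; query would be: $query"
fi
```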

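Column-oriented formats and compression in Impala

Impala's biggest advantage is its support for column-oriented formats and compression. Rubin tried the newer Parquet format with the Snappy compression algorithm. Because the table in this example is very large, a columnar format is the better choice. To use Parquet, the data first has to be loaded into a Parquet table, which is easy when the table already exists in Impala and the files are already in HDFS. In this example, importing roughly 150 million rows took about 729 seconds. Running the same query against the new table then took only 4.17 seconds and scanned far less data: the compressed data is only 3.95GB.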
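The conversion step could look roughly like the following. This is a sketch, not necessarily the exact statements Rubin ran: it assumes the source table is named ontime_csv as in the DDL earlier, takes the target name ontime_parquet_snappy from the query later in this article, and assumes impala-shell can reach an Impala daemon with its default connection settings. The statements are written to a temporary file first so they can be reviewed before they are executed.

```shell
# Write the conversion statements to a temporary file.
sql_file=$(mktemp)
cat > "$sql_file" <<'SQL'
-- New table with the same columns as the text table, stored as Parquet.
CREATE TABLE ontime_parquet_snappy LIKE ontime_csv STORED AS PARQUET;
-- Snappy compression for Parquet writes
-- (older Impala releases used the name PARQUET_COMPRESSION_CODEC).
SET COMPRESSION_CODEC=snappy;
-- Copy every row from the text table into the Parquet table.
INSERT INTO ontime_parquet_snappy SELECT * FROM ontime_csv;
SQL

if command -v impala-shell >/dev/null 2>&1; then
  impala-shell -f "$sql_file"
else
  echo "impala-shell not found; statements saved in $sql_file"
fi
```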

A more complex Impala query

select
   min(yeard), max(yeard), Carrier, count(*) as cnt,
   sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed,
   round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate
FROM ontime_parquet_snappy
WHERE
   DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI')
   and DestState not in ('AK', 'HI', 'PR', 'VI')
   and flightdate GROUP by carrier
HAVING cnt > 100000 and max(yeard) > 1990
ORDER by rate DESC
LIMIT 1000;

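Note: Impala does not support the sum(ArrDelayMinutes>30) syntax in this query; use sum(if(ArrDelayMinutes>30, 1, 0)) instead. The query is also deliberately designed so that it cannot benefit from indexes: most of the conditions filter out less than 30% of the data.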
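To make the conditional-count idiom concrete, here is a small sketch that computes the same three figures per carrier (total flights, flights delayed by more than 30 minutes, and the delayed rate) over a handful of made-up rows with awk. The sample rows and the delay_stats function name are hypothetical; only the 30-minute threshold comes from the query.

```shell
# Input: lines of "carrier,arrival_delay_minutes".
# Output: carrier, total flights, flights delayed > 30 min, delayed rate.
delay_stats() {
  awk -F, '
    { cnt[$1]++; if ($2 > 30) delayed[$1]++ }   # same test as if(ArrDelayMinutes>30, 1, 0)
    END { for (c in cnt) printf "%s %d %d %.2f\n", c, cnt[c], delayed[c], delayed[c] / cnt[c] }
  ' | sort
}

printf '%s\n' 'AA,45' 'AA,10' 'AA,31' 'DL,5' 'DL,60' | delay_stats
# AA 3 2 0.67
# DL 2 1 0.50
```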

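The query took 15.28 seconds, far faster than the original MySQL results (15 min 56.40 sec without parallel execution, 5 min 47 sec with it). Of course, this is not an "apples-to-apples" comparison: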

  • MySQL scans 45GB of data, while Impala with Parquet scans only 3.5GB
  • MySQL runs on a single server, while Hadoop and Impala run in parallel on 6 servers
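A quick back-of-the-envelope calculation puts these figures side by side. The helper name ratio is an assumption for illustration; the inputs are the timings and data sizes quoted in this article, with the MySQL times converted to seconds.

```shell
# Print how many times larger "before" is than "after", to one decimal place.
ratio() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

ratio 956.40 15.28   # MySQL non-parallel (15 min 56.40 s) vs Impala -> 62.6
ratio 347 15.28      # MySQL parallel (5 min 47 s) vs Impala        -> 22.7
ratio 45 3.5         # data scanned in GB, MySQL vs Impala/Parquet  -> 12.9
```

The data-volume ratio alone is about 13x, which suggests that a good part of the speedup comes from the columnar, compressed format rather than from parallelism alone.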

Even so, the performance of Hadoop and Impala is impressive, and because the platform also scales out, it can be very helpful in big data analytics scenarios.


Thanks to Cui Kang for reviewing this article.
