Running Mahout KMeans Clustering on Hadoop

WBOY

Jun 07, 2016, 04:30 PM

hadoop mahout analyze clustering

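The previous article, "Mahout and Clustering Analysis", walked through the steps of clustering with Mahout and, as a worked example, used K-Means to run a co-followed clustering analysis on the shared-follower data of Weibo celebrities. Mahout can run in two modes: local and Hadoop. Local mode runs on the user's single machine like any ordinary program, but it cannot take full advantage of Mahout's strengths. This article shows how to run the Mahout clustering program on a Hadoop cluster (in practice, the author used a pseudo-distributed Hadoop installation rather than a real cluster).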

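Configuring the Mahout Runtime Environment

Mahout's runtime configuration can be set in $MAHOUT_HOME/bin/mahout, which is in fact Mahout's command-line launcher script. This is similar to Hadoop, with one difference: Hadoop also provides a dedicated hadoop-env.sh file under $HADOOP_HOME/conf for configuring environment variables, whereas Mahout ships no such file in its conf directory.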

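MAHOUT_LOCAL and HADOOP_CONF_DIR

These two parameters are the key to controlling whether Mahout runs locally or on Hadoop.

According to the $MAHOUT_HOME/bin/mahout file, as long as MAHOUT_LOCAL is set to a non-empty string, Mahout runs in local mode regardless of whether HADOOP_CONF_DIR and HADOOP_HOME are set. In other words, for Mahout to run on Hadoop, MAHOUT_LOCAL must be empty.

HADOOP_CONF_DIR specifies the Hadoop configuration that Mahout uses in Hadoop mode. It usually points to the conf directory under $HADOOP_HOME.

In addition, JAVA_HOME or MAHOUT_JAVA_HOME should be set, and the Hadoop executables must be added to PATH.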
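
The rule above can be paraphrased as a small shell function. This is only a sketch under stated assumptions: `mahout_run_mode` is a hypothetical helper, not code taken from bin/mahout, and treating a missing hadoop executable as local mode is an assumption rather than something this article states.

```shell
# Hypothetical helper (not part of Mahout): predicts the run mode from the
# current environment, following the rule described in this section.
mahout_run_mode() {
  if [ -n "${MAHOUT_LOCAL:-}" ]; then
    # Any non-empty value forces local mode, whatever else is set.
    echo "local"
  elif command -v hadoop >/dev/null 2>&1; then
    # MAHOUT_LOCAL is empty and the hadoop executable is reachable via PATH.
    echo "hadoop"
  else
    # Assumption: without a reachable hadoop executable there is nothing to
    # submit the job to, so this case is treated as local.
    echo "local"
  fi
}

mahout_run_mode
```

Running it in the same shell that will launch mahout gives a quick hint about which mode to expect before any job is started.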

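In summary:

1. Add the JAVA_HOME variable, either directly in $MAHOUT_HOME/bin/mahout or in the user's shell profile (for example ~/.bashrc).

2. Set MAHOUT_HOME and add the Hadoop executables to PATH.

The ~/.bashrc settings for these two steps are as follows: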

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
#export HADOOP_HOME=/home/yoyzhou/workspace/hadoop-1.1.2
export MAHOUT_HOME=/home/yoyzhou/workspace/mahout-0.7
export PATH=$PATH:/home/yoyzhou/workspace/hadoop-1.1.2/bin:$MAHOUT_HOME/bin

After editing ~/.bashrc, restart the terminal for the changes to take effect.

3. Edit $MAHOUT_HOME/bin/mahout and set HADOOP_CONF_DIR to $HADOOP_HOME/conf:

HADOOP_CONF_DIR=/home/yoyzhou/workspace/hadoop-1.1.2/conf
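
Before launching mahout, the shell environment from the steps above can be sanity-checked. The following is a minimal sketch: `check_mahout_env` is a hypothetical helper, and it only inspects variables visible to the current shell, so it cannot see a HADOOP_CONF_DIR that is assigned inside bin/mahout.

```shell
# Hypothetical pre-flight check (not part of Mahout): reports which of the
# settings described above are missing from the current shell environment.
check_mahout_env() {
  rc=0
  if [ -z "${JAVA_HOME:-}" ] && [ -z "${MAHOUT_JAVA_HOME:-}" ]; then
    echo "missing: JAVA_HOME or MAHOUT_JAVA_HOME"
    rc=1
  fi
  if [ -z "${MAHOUT_HOME:-}" ]; then
    echo "missing: MAHOUT_HOME"
    rc=1
  fi
  if ! command -v hadoop >/dev/null 2>&1; then
    echo "missing: hadoop executable on PATH"
    rc=1
  fi
  return "$rc"
}

check_mahout_env || echo "environment is incomplete"
```

The function returns a non-zero status when anything is missing, so it can guard a launch, for example `check_mahout_env && mahout`.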

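Replace the Hadoop and Mahout home directories with the paths on your own system. Once everything is set, restart the terminal and type mahout on the command line. If you see the following messages, Mahout's Hadoop mode is configured correctly.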

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. 
Running on hadoop...
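
The same check can be scripted by searching the startup output for the message quoted above. This is a sketch only: `reports_hadoop_mode` is a hypothetical helper, and it assumes the banner text matches the version shown in this article.

```shell
# Hypothetical helper: reads mahout's startup output on stdin and succeeds
# only if it contains the Hadoop-mode message quoted above.
reports_hadoop_mode() {
  grep -q "Running on hadoop"
}

# Demonstration on the sample output from this article.
printf '%s\n' \
  "MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath." \
  "Running on hadoop..." | reports_hadoop_mode && echo "Hadoop mode"
```

Against a real installation the input would come from the launcher itself, for example `mahout 2>&1 | reports_hadoop_mode`.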

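To run in local mode instead, add a statement to $MAHOUT_HOME/bin/mahout that sets MAHOUT_LOCAL to a non-empty value.

The Mahout Command Line

Mahout provides a command-line entry point for each of its data mining algorithms, along with a set of utilities for data analysis and processing. Typing mahout in a terminal lists these commands. Part of that output is shown below: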

....
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  canopy: : Canopy clustering
  cat: : Print a file or resource as the logistic regression models would see it
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  ....
  fkmeans: : Fuzzy K-means clustering
  fpg: : Frequent Pattern Growth
  hmmpredict: : Generate random sequence of observations by given HMM
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
....
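
When only the program names are needed, the listing can be filtered. The sketch below assumes every entry follows the `name: : description` layout shown above; `list_mahout_programs` is a hypothetical helper, not a Mahout command.

```shell
# Hypothetical helper: reads the program listing on stdin and prints only the
# names of entries that look like "  name: : description".
list_mahout_programs() {
  sed -n 's/^ *\([A-Za-z0-9_.][A-Za-z0-9_.]*\): : .*$/\1/p'
}

# Demonstration on a few lines copied from the listing above.
list_mahout_programs <<'EOF'
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  canopy: : Canopy clustering
  kmeans: : K-means clustering
EOF
```

With a working installation the input would be piped from the launcher, for example `mahout 2>&1 | list_mahout_programs`.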

Mahout kmeans

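In the previous article we started the KMeans algorithm directly from a Mahout program by calling KMeansDriver.run(). That approach is very useful for debugging locally, but in a real project, whether running in Hadoop mode or locally, launching Mahout's algorithms from the command line is more appropriate. The benefit is that we only need to supply input data in the format the algorithm requires in order to take advantage of Mahout's distributed processing. In this example, using kmeans only requires preparing the data in the input format Mahout kmeans expects and then calling mahout kmeans [options] on the command line.

Typing mahout kmeans without any arguments makes Mahout print the usage of the kmeans command.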

Usage:
 [--input <input> --output <output> --distanceMeasure <distanceMeasure>
--clusters <clusters> --numClusters <k> --convergenceDelta <convergenceDelta>
--maxIter <maxIter> --overwrite --clustering --method <method>
--outlierThreshold <clusterClassificationThreshold> --help --tempDir <tempDir>
--startPhase <startPhase> --endPhase <endPhase>]
--clusters (-c) clusters    The input centroids, as Vectors.  Must be a
                            SequenceFile of Writable, Cluster/Canopy.  If k is
                            also specified, then a random set of vectors will
                            be selected and written out to this path first

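The relevant parameters were covered in the previous article.

The steps are as follows:

1. Convert the data into Mahout vectors (Vector)
2. Convert the Mahout vectors into a Hadoop SequenceFile
3. Create K initial centroids [optional]
4. Copy the SequenceFile of Mahout vectors to HDFS
5. Run `mahout kmeans [options]`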
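
Steps 4 and 5 can be sketched as a short script; steps 1 to 3 are done in the Java program from the previous article and are not covered here. The layout is an assumption: a local data directory containing vectors/ and clusters/ subdirectories, uploaded to the same relative path on HDFS. `run` and `upload_and_cluster` are hypothetical helpers, and with DRY_RUN=1 the commands are only printed, not executed.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Steps 4 and 5: copy the prepared data to HDFS, then start kmeans.
# Assumption: $1 is a local directory holding vectors/ (and clusters/, when
# initial centroids were created in step 3); $2 is the HDFS output path.
# Any further arguments are passed through to mahout kmeans.
upload_and_cluster() {
  data_dir="$1"
  output_dir="$2"
  shift 2
  run hadoop fs -put "$data_dir" "$data_dir" &&
    run mahout kmeans -i "$data_dir/vectors" -c "$data_dir/clusters" \
      -o "$output_dir" "$@"
}

# Dry run: show the two commands without contacting Hadoop.
DRY_RUN=1
upload_and_cluster data output -x 10 -ow -cl
```

Setting DRY_RUN to 0 runs the commands for real, which requires a reachable Hadoop installation and a destination path that does not yet exist on HDFS.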

The following command runs kmeans clustering with CosineDistanceMeasure on the Mahout vector data in the data/vectors directory and saves the results in the output directory.

mahout kmeans -i data/vectors -o output -c data/clusters \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-x 10 -ow -cd 0.001 -cl
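
For readability, the same invocation can be written with the long option names from the usage text (-i is --input, -o is --output, -c is --clusters, -dm is --distanceMeasure, -x is --maxIter, -cd is --convergenceDelta, -ow is --overwrite, -cl is --clustering). The sketch below only prints the command line; `kmeans_command` is a hypothetical helper, and the fixed values 10 and 0.001 simply mirror the example above.

```shell
# Hypothetical helper: prints a mahout kmeans command line using long option
# names. A fourth argument adds --numClusters, in which case (per the usage
# text) a random set of vectors is written to the --clusters path first.
kmeans_command() {
  input="$1"; output="$2"; clusters="$3"; k="${4:-}"
  cmd="mahout kmeans --input $input --output $output --clusters $clusters"
  if [ -n "$k" ]; then
    cmd="$cmd --numClusters $k"
  fi
  cmd="$cmd --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure"
  cmd="$cmd --maxIter 10 --convergenceDelta 0.001 --overwrite --clustering"
  echo "$cmd"
}

kmeans_command data/vectors output data/clusters
kmeans_command data/vectors output data/clusters 5
```

The second call corresponds to skipping step 3 and letting Mahout choose the initial centroids itself.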

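More detailed command-line options can be found on the Mahout wiki page k-means-commandline.

Summary

This article first described how to configure Mahout to run on Hadoop, and then showed how to use the mahout kmeans command line to run a clustering analysis on Hadoop.

Original article: Running Mahout KMeans Clustering on Hadoop. Thanks to the original author for sharing.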
