Running Mahout KMeans Clustering on Hadoop
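The previous article, “Mahout与聚类分析” (Mahout and Cluster Analysis), walked through the steps of clustering with Mahout and used K-Means to run a co-followed clustering analysis on data about which Weibo celebrities are followed together. Mahout can run in two modes: local and Hadoop. Local mode runs on the user's single machine like any ordinary program, but it does not take full advantage of Mahout's strengths. This article shows how to run our Mahout clustering program on a Hadoop cluster (in practice the author used a pseudo-distributed Hadoop setup rather than a real Hadoop cluster).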
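Configuring the Mahout Runtime Environment
Mahout's runtime configuration can be set in $MAHOUT_HOME/bin/mahout, which is in fact Mahout's command-line launch script. This resembles Hadoop, with one difference: Hadoop also provides a dedicated hadoop-env.sh file under $HADOOP_HOME/conf for configuring environment variables, whereas Mahout provides no such file in its conf directory.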
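MAHOUT_LOCAL and HADOOP_CONF_DIR
These two parameters are the key to controlling whether Mahout runs locally or on Hadoop.
The $MAHOUT_HOME/bin/mahout file states that as long as MAHOUT_LOCAL is set to a non-empty value (not empty string), Mahout runs in local mode regardless of whether HADOOP_CONF_DIR and HADOOP_HOME are set. In other words, for Mahout to run on Hadoop, MAHOUT_LOCAL must be empty.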
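The rule can be illustrated with a few lines of shell. This is a simplified sketch of the behaviour described above, not the actual contents of bin/mahout; the function name mahout_mode is made up for this example, and it covers only the MAHOUT_LOCAL rule.

```shell
# Simplified illustration of the rule described above (not the real
# bin/mahout script): a non-empty MAHOUT_LOCAL forces local mode,
# no matter what HADOOP_CONF_DIR or HADOOP_HOME contain.
mahout_mode() {
  if [ -n "${MAHOUT_LOCAL:-}" ]; then
    echo "local"
  else
    echo "hadoop"
  fi
}

# Print the mode implied by the current environment.
mahout_mode
```

Note that the test is only whether the value is empty, so any non-empty string selects local mode.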
HADOOP_CONF_DIR specifies the Hadoop configuration that Mahout uses in Hadoop mode; it usually points to the conf directory under $HADOOP_HOME.
In addition, we should set JAVA_HOME or MAHOUT_JAVA_HOME, and the Hadoop executables must be added to PATH.
To summarize:
1. Add the JAVA_HOME variable, either directly in $MAHOUT_HOME/bin/mahout or in the user's bash profile (for example ~/.bashrc).
2. Set MAHOUT_HOME and add the Hadoop executables to PATH.
In ~/.bashrc, these two steps look like this:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
#export HADOOP_HOME=/home/yoyzhou/workspace/hadoop-1.1.2
export MAHOUT_HOME=/home/yoyzhou/workspace/mahout-0.7
export PATH=$PATH:/home/yoyzhou/workspace/hadoop-1.1.2/bin:$MAHOUT_HOME/bin
After editing ~/.bashrc, restart the terminal for the changes to take effect.
3. Edit $MAHOUT_HOME/bin/mahout and set HADOOP_CONF_DIR to $HADOOP_HOME/conf:
HADOOP_CONF_DIR=/home/yoyzhou/workspace/hadoop-1.1.2/conf
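Readers should replace the Hadoop and Mahout home directories with the paths on their own system. Once everything is set, restart the terminal and type mahout on the command line. If you see the following message, Mahout's Hadoop mode is configured correctly.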
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. Running on hadoop...
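If that message does not appear, it can help to check the prerequisites listed above one by one. The following is a minimal sketch; the function name check_mahout_env is hypothetical and not part of Mahout or Hadoop, and it only inspects the environment of the current shell.

```shell
# Report which of the prerequisites described above are missing from the
# current shell environment. Returns non-zero if anything is missing.
check_mahout_env() {
  local status=0
  # Either JAVA_HOME or MAHOUT_JAVA_HOME is enough.
  if [ -z "${JAVA_HOME:-}" ] && [ -z "${MAHOUT_JAVA_HOME:-}" ]; then
    echo "missing: JAVA_HOME or MAHOUT_JAVA_HOME"
    status=1
  fi
  if [ -z "${MAHOUT_HOME:-}" ]; then
    echo "missing: MAHOUT_HOME"
    status=1
  fi
  # Both launch scripts should be reachable through PATH.
  for tool in hadoop mahout; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool executable on PATH"
      status=1
    fi
  done
  return "$status"
}

check_mahout_env || echo "environment incomplete"
```

HADOOP_CONF_DIR is not checked here because in this setup it is assigned inside $MAHOUT_HOME/bin/mahout rather than exported from ~/.bashrc.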
To run in local mode instead, simply add a statement to $MAHOUT_HOME/bin/mahout that sets MAHOUT_LOCAL to a non-empty value.
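Such a statement could look like the line below. The value true is only an assumption chosen for readability; according to the rule above, any non-empty string has the same effect.

```shell
# Hypothetical line added near the top of $MAHOUT_HOME/bin/mahout.
# Any non-empty value selects local mode; "true" is just a readable choice.
MAHOUT_LOCAL=true
```

Commenting the line out again leaves MAHOUT_LOCAL empty (assuming it is not set elsewhere in the environment), so Mahout goes back to Hadoop mode.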
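The Mahout Command Line
Mahout provides a command-line entry point for each of its data mining algorithms, along with a set of utilities for data analysis and processing. The available commands are listed when you type mahout in a terminal. Part of that output is shown below: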
....
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  canopy: : Canopy clustering
  cat: : Print a file or resource as the logistic regression models would see it
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  ....
  fkmeans: : Fuzzy K-means clustering
  fpg: : Frequent Pattern Growth
  hmmpredict: : Generate random sequence of observations by given HMM
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
  ....
Mahout kmeans
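In the previous article we launched the KMeans algorithm directly from a Mahout program by calling KMeansDriver.run(). That approach is very useful for debugging locally, but in a real project, whether running in Hadoop mode or locally, launching Mahout's algorithms from the command line is more appropriate. The benefit is that we only need to supply input data in the form the algorithm requires in order to take advantage of Mahout's distributed processing. In this example, using the kmeans algorithm only requires preparing the data in the input format that Mahout kmeans expects and then calling mahout kmeans [options] on the command line.
Typing mahout kmeans with no arguments makes Mahout list the usage of the kmeans algorithm on the command line.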
Usage:
 [--input --output --distanceMeasure --clusters --numClusters
 --convergenceDelta --maxIter --overwrite --clustering --method
 --outlierThreshold --help --tempDir --startPhase --endPhase ]
  --clusters (-c) clusters    The input centroids, as Vectors. Must be a
                              SequenceFile of Writable, Cluster/Canopy. If k is
                              also specified, then a random set of vectors will
                              be selected and written out to this path first
The relevant parameters were covered in the previous article.
The concrete steps are as follows:
1. Convert the data into Mahout vectors (Vector)
2. Convert the Mahout vectors into a Hadoop SequenceFile
3. Create K initial centroids [optional]
4. Copy the SequenceFile of Mahout vectors to HDFS
5. Run mahout kmeans [options]
The following command runs kmeans clustering on the Mahout vector data in the data/vectors directory using CosineDistanceMeasure, and saves the results in the output directory.
mahout kmeans -i data/vectors -o output -c data/clusters \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -x 10 -ow -cd 0.001 -cl
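Steps 4 and 5 of the list above can be sketched as a small bash script. This is only an illustration under several assumptions: the vectors and initial centroids already exist locally under data/vectors and data/clusters, the same relative paths are used on HDFS, and the helper names run and upload_and_cluster are made up for this example. By default the script only prints the commands (DRY_RUN=1), so it can be inspected without a Hadoop installation.

```shell
#!/usr/bin/env bash
# DRY_RUN=1 (the default here) only prints each command;
# set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1: local data directory containing vectors/ and clusters/ (assumed layout)
# $2: output directory for the clustering result
upload_and_cluster() {
  local data_dir="$1" output_dir="$2"
  local -a kmeans_args
  kmeans_args=(
    -i "$data_dir/vectors"     # --input
    -o "$output_dir"           # --output
    -c "$data_dir/clusters"    # --clusters (initial centroids)
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure  # --distanceMeasure
    -x 10                      # --maxIter
    -ow                        # --overwrite
    -cd 0.001                  # --convergenceDelta
    -cl                        # --clustering
  )
  run hadoop fs -put "$data_dir" "$data_dir"   # step 4: copy the data to HDFS
  run mahout kmeans "${kmeans_args[@]}"        # step 5: run k-means
}

upload_and_cluster data output
```

The comments next to each short option give the corresponding long option name from the usage text shown earlier.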
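More detailed command-line options can be found on the Mahout wiki page k-means-commandline.
Summary
This article first described how to configure Mahout's Hadoop runtime environment, and then showed how to use the mahout kmeans command to run the clustering analysis on Hadoop.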
Original article: 让Mahout KMeans聚类分析运行在Hadoop上 (Running Mahout KMeans Clustering on Hadoop). Thanks to the original author for sharing.
