Table of Contents
Script 1 - Characterize Collection
Script 2 - MongoDB Schema Generator
Script 3 – Twitter Hourly Coffee Tweets
Next Steps
Home Database Mysql Tutorial MongoDB and Hadoop: A Step-by Step Tutorial Using

MongoDB and Hadoop: A Step-by Step Tutorial Using

Jun 07, 2016 pm 04:29 PM
and hadoop mongodb

The following is a guest post from Jeremy Karn. This article is excerpted from MongoDB + Hadoop: A Step-by-Step Tutorial. Jeremy is a cofounder at Mortar Data, a Hadoop-as-a-service provider, and creator of mortar, an open source framework

The following is a guest post from Jeremy Karn. This article is excerpted from ‘MongoDB + Hadoop: A Step-by-Step Tutorial’. Jeremy is a cofounder at Mortar Data, a Hadoop-as-a-service provider, and creator of mortar, an open source framework for data processing.

People who are worried about scalability often find themselves looking at two tools: MongoDB for storing large amounts of data easily and Hadoop for processing that data. But a common question is: “How do I combine these two to really get the most out of my data?”

Here’s a step-by-step tutorial that will get you up and running with MongoDB and Hadoop in a matter of minutes. And the best part about this tutorial is that at the end you’ll be ready to jump right into using your own MongoDB data with Hadoop.

For this tutorial you’ll be using Apache Pig, a high-level data flow language that compiles down into Hadoop MapReduce jobs. It was designed to be easy to learn and simple to write. If you’ve written SQL, Pig will feel familiar, it is like procedural SQL.

To run your Hadoop jobs, you’re going to use a free Mortar account. Mortar provides Hadoop as a service, which means you can run your jobs without worrying about how to set up and manage a multi-node Hadoop cluster.

To get started, we’ve already set up a small MongoDB instance on MongoLab, populated it with a random sampling of Twitter data from a single day (around 120,000 tweets), and created a read-only user for you.

We’ve also set up a public Github repo with a Mortar project that has three Pig scripts ready to run. Here’s what you need to do:

If you don’t already have a free Github account - create one.? You’ll need a github username in step 4.

  1. Sign into (or create) your free Mortar account.
  2. After you receive the confirmation email, log into Mortar at https://app.mortardata.com.
  3. Install?the Mortar Development Framework:?
    gem install mortar
    Copy after login
  4. Clone the example git project and register it as a mortar project:?
    git clone git@github.com:mortardata/mongo-pig-examples.git
    Copy after login
    cd mongo-pig-examples
    Copy after login
    mortar register mongo-pig-examples
    Copy after login

Script 1 - Characterize Collection

If you’re like most MongoDB users, you may not have a great sense of the different fields, data types, or values in your collection. We built characterize_collection.pig to deeply inspect your collection to extract that information.

From the base directory of the mongo-pig-examples project you just cloned take a look at pigscripts/characterize_collection.pig. It loads all the data in the collection as a map, sends the map to Python (udfs/python/mongo_util.py) to gather a bunch of metadata, calculates some basic information about the collection, and then it writes the results out to an S3 bucket.

To see this script in action let’s run it on a 4 node Hadoop cluster. In your terminal (from the base directory of your mongo-pig-examples project) run:

mortar run characterize_collection --clustersize 4
Copy after login

This job will take about 10 minutes to finish. You can monitor the job’s status on the command line or by going to https://app.mortardata.com/jobs?

Once the job has finished, you’ll receive an email with a link to your job results. Clicking on this link will bring you into the Mortar web app, where you can download the results from s3. The output is described at the top of the characterize_collection script but as an example you can scroll down the output and find:

…
user.is_translator	2	false	unicode	118806
user.is_translator	2	true	unicode	31
user.lang	26	en	unicode	114108
user.lang	26	es	unicode	3462
user.lang	26	fr	unicode	532
user.lang	26	pt	unicode	281
user.lang	26	ja	unicode	79
user.listed_count	398	0	int	73757
user.listed_count	398	1	int	18518
Copy after login

Looking at the values for user.lang - we see that there are 26 unique values for the field in our dataset. The most common was “en” with 114108 occurrences, the next most common was “es” with 3462 occurrences, and so on. To see the full results without running the job you can view the output file here.

Script 2 - MongoDB Schema Generator

It can be tricky to properly declare MongoDB’s highly nested schemas in Pig. Now, Pig is graceful—it can roll without a schema, or with inconsistent, or incorrect schemas. But it’s easier to read and write your Pig code if you have a schema because it allows you (and the Pig optimizer) to focus on just the relevant data.

So this next script automatically generates a Pig schema by examining your MongoDB collection. If you don’t need the whole schema, you can easily edit it to keep just the fields you want.

Running this script is similar to running the previous one. If you ran the Characterize Collection script in the past hour, the same cluster you used for that job should still be running. In that case, you can just run:

mortar run mongo_schema_generator
Copy after login

If you don’t have a cluster that’s still running, just run the job on a new 4 node cluster like this:

mortar run mongo_schema_generator --clustersize 4
Copy after login

Script 3 – Twitter Hourly Coffee Tweets

Using a Twitter coffee tweets script (pigscripts/hourly_coffee_tweets.pig), we’re going to demonstrate how we can use a small subset of the fields in our MongoDB collection. For our example, we’ll look at how often the word “coffee” is tweeted throughout the day. As with the Mongo Schema Generator script, you can run this job on an existing cluster or start up a new one.

Next Steps

If you already have a mongo instance/cluster based in US-East EC2, the first two example scripts should run on one of your collections with only minor modifications. You’ll just need to:

  1. Update the MongoLoader connection strings in the pig scripts to connect to your MongoDB collections with one of your own users. If your mongo instance is on a non-standard port (any port other than 27017), just email us at support@mortardata.com to allow your Mortar account to access that port.
  2. If you’d like your jobs to write to one of your own S3 buckets, you can update the AWS keys associated with your Mortar account by following these instructions to enable s3 access.
  3. If you run out of free cluster hours with Mortar, you can upgrade your account to get additional free hours each month.
  4. You can find more resources for learning Pig here
  5. If you have any questions or feedback, please contact us at support@mortardata.com or ping us on in-app chat at app.mortardata.com
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Java Tutorial
1664
14
PHP Tutorial
1266
29
C# Tutorial
1239
24
Use Composer to solve the dilemma of recommendation systems: andres-montanez/recommendations-bundle Use Composer to solve the dilemma of recommendation systems: andres-montanez/recommendations-bundle Apr 18, 2025 am 11:48 AM

When developing an e-commerce website, I encountered a difficult problem: how to provide users with personalized product recommendations. Initially, I tried some simple recommendation algorithms, but the results were not ideal, and user satisfaction was also affected. In order to improve the accuracy and efficiency of the recommendation system, I decided to adopt a more professional solution. Finally, I installed andres-montanez/recommendations-bundle through Composer, which not only solved my problem, but also greatly improved the performance of the recommendation system. You can learn composer through the following address:

How to configure MongoDB automatic expansion on Debian How to configure MongoDB automatic expansion on Debian Apr 02, 2025 am 07:36 AM

This article introduces how to configure MongoDB on Debian system to achieve automatic expansion. The main steps include setting up the MongoDB replica set and disk space monitoring. 1. MongoDB installation First, make sure that MongoDB is installed on the Debian system. Install using the following command: sudoaptupdatesudoaptinstall-ymongodb-org 2. Configuring MongoDB replica set MongoDB replica set ensures high availability and data redundancy, which is the basis for achieving automatic capacity expansion. Start MongoDB service: sudosystemctlstartmongodsudosys

Navicat's method to view MongoDB database password Navicat's method to view MongoDB database password Apr 08, 2025 pm 09:39 PM

It is impossible to view MongoDB password directly through Navicat because it is stored as hash values. How to retrieve lost passwords: 1. Reset passwords; 2. Check configuration files (may contain hash values); 3. Check codes (may hardcode passwords).

What is the CentOS MongoDB backup strategy? What is the CentOS MongoDB backup strategy? Apr 14, 2025 pm 04:51 PM

Detailed explanation of MongoDB efficient backup strategy under CentOS system This article will introduce in detail the various strategies for implementing MongoDB backup on CentOS system to ensure data security and business continuity. We will cover manual backups, timed backups, automated script backups, and backup methods in Docker container environments, and provide best practices for backup file management. Manual backup: Use the mongodump command to perform manual full backup, for example: mongodump-hlocalhost:27017-u username-p password-d database name-o/backup directory This command will export the data and metadata of the specified database to the specified backup directory.

How to choose a database for GitLab on CentOS How to choose a database for GitLab on CentOS Apr 14, 2025 pm 04:48 PM

GitLab Database Deployment Guide on CentOS System Selecting the right database is a key step in successfully deploying GitLab. GitLab is compatible with a variety of databases, including MySQL, PostgreSQL, and MongoDB. This article will explain in detail how to select and configure these databases. Database selection recommendation MySQL: a widely used relational database management system (RDBMS), with stable performance and suitable for most GitLab deployment scenarios. PostgreSQL: Powerful open source RDBMS, supports complex queries and advanced features, suitable for handling large data sets. MongoDB: Popular NoSQL database, good at handling sea

How to encrypt data in Debian MongoDB How to encrypt data in Debian MongoDB Apr 12, 2025 pm 08:03 PM

Encrypting MongoDB database on a Debian system requires following the following steps: Step 1: Install MongoDB First, make sure your Debian system has MongoDB installed. If not, please refer to the official MongoDB document for installation: https://docs.mongodb.com/manual/tutorial/install-mongodb-on-debian/Step 2: Generate the encryption key file Create a file containing the encryption key and set the correct permissions: ddif=/dev/urandomof=/etc/mongodb-keyfilebs=512

How to set up users in mongodb How to set up users in mongodb Apr 12, 2025 am 08:51 AM

To set up a MongoDB user, follow these steps: 1. Connect to the server and create an administrator user. 2. Create a database to grant users access. 3. Use the createUser command to create a user and specify their role and database access rights. 4. Use the getUsers command to check the created user. 5. Optionally set other permissions or grant users permissions to a specific collection.

MongoDB and relational database: a comprehensive comparison MongoDB and relational database: a comprehensive comparison Apr 08, 2025 pm 06:30 PM

MongoDB and relational database: In-depth comparison This article will explore in-depth the differences between NoSQL database MongoDB and traditional relational databases (such as MySQL and SQLServer). Relational databases use table structures of rows and columns to organize data, while MongoDB uses flexible document-oriented models to better suit the needs of modern applications. Mainly differentiates data structures: Relational databases use predefined schema tables to store data, and relationships between tables are established through primary keys and foreign keys; MongoDB uses JSON-like BSON documents to store them in a collection, and each document structure can be independently changed to achieve pattern-free design. Architectural design: Relational databases need to pre-defined fixed schema; MongoDB supports

See all articles