Table of Contents
Hash-based sharding in an existing collection
Sharding a new collection
Manual chunk assignment and other caveats
Home Database Mysql Tutorial New Hash-based Sharding Feature in MongoDB 2.4

New Hash-based Sharding Feature in MongoDB 2.4

Jun 07, 2016 pm 04:30 PM
new sharding

Lots of MongoDB users enjoy the flexibility of custom shard keys in organizing a sharded collection’s documents. For certain common workloads though, like key/value lookup, using the natural choice of _id as a shard key isn’t optimal bec

Lots of MongoDB users enjoy the flexibility of custom shard keys in organizing a sharded collection’s documents. For certain common workloads though, like key/value lookup, using the natural choice of _id as a shard key isn’t optimal because default ObjectId’s are ascending, resulting in poor write distribution. ?Creating randomized _ids or choosing another well-distributed field is always possible, but this adds complexity to an app and is another place where something could go wrong.

To help keep these simple workloads simple, in 2.4 MongoDB added the new Hash-based shard key feature. ?The idea behind Hash-based shard keys is that MongoDB will do the work to randomize data distribution for you, based on whatever kind of document identifier you like. ?So long as the identifier has a high cardinality, the documents in your collection will be spread evenly across the shards of your cluster. ?For heavy workloads with lots of individual document writes or reads (e.g. key/value), this is usually the best choice. ?For workloads where getting ranges of documents is more important (i.e. find recent documents from all users), other choices of shard key may be better suited.

Hash-based sharding in an existing collection

To start off with Hash-based sharding, you need the name of the collection you’d like to shard and the name of the hashed “identifier" field for the documents in the collection. ?For example, we might want to create a sharded “mydb.webcrawler" collection, where each document is usually found by a “url" field. ?We can populate the collection with sample data using:

shell$ wget http://en.wikipedia.org/wiki/Web_crawler -O web_crawler.html
shell$ mongo 
connecting to: /test
> use mydb
switched to db mydb
> cat("web_crawler.html").split("\n").forEach( function(line){
... var regex = /a href="http://blog.mongodb.org/post/\""([^\"]*)\"/; if (regex.test(line)) { db.webcrawler.insert({ "url" : regex.exec(line)[1] }); }})
> db.webcrawler.find()
...
{ "_id" : ObjectId("5162fba3ad5a8e56b7b36020"), "url" : "/wiki/OWASP" }
{ "_id" : ObjectId("5162fba3ad5a8e56b7b3603d"), "url" : "/wiki/Image_retrieval" }
{ "_id" : ObjectId("5162fba3ad5a8e56b7b3603e"), "url" : "/wiki/Video_search_engine" }
{ "_id" : ObjectId("5162fba3ad5a8e56b7b3603f"), "url" : "/wiki/Enterprise_search" }
{ "_id" : ObjectId("5162fba3ad5a8e56b7b36040"), "url" : "/wiki/Semantic_search" }
...
Copy after login

Just for this example, we multiply this data ~x2000 (otherwise we won’t get any pre-splitting in the collection because it’s too small):

> for (var i = 0; i 
<p><span>Next, we create a hashed index on this field:</span></p>
<pre class="brush:php;toolbar:false">> db.webcrawler.ensureIndex({ url : "hashed" })
Copy after login

As usual, the creation of the hashed index doesn’t prevent other types of indices from being created as well.

Then we shard the “mydb.webcrawler" collection using the same field as a Hash-based shard key:

> db.printShardingStatus(true)
--- Sharding Status ---
sharding version: {
  "_id" : 1,
  "version" : 3,
  "minCompatibleVersion" : 3,
  "currentVersion" : 4,
  "clusterId" : ObjectId("5163032a622c051263c7b8ce")
}
shards:
  {  "_id" : "test-rs0",  "host" : "test-rs0/nuwen:31100,nuwen:31101" }
  {  "_id" : "test-rs1",  "host" : "test-rs1/nuwen:31200,nuwen:31201" }
  {  "_id" : "test-rs2",  "host" : "test-rs2/nuwen:31300,nuwen:31301" }
  {  "_id" : "test-rs3",  "host" : "test-rs3/nuwen:31400,nuwen:31401" }
databases:
  {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
  {  "_id" : "mydb",  "partitioned" : true,  "primary" : "test-rs0" }
      mydb.webcrawler
          shard key: { "url" : "hashed" }
          chunks:
              test-rs0    4
          { "url" : { "$minKey" : 1 } } -->> { "url" : NumberLong("-4837773290201122847") } on : test-rs0 { "t" : 1, "i" : 3 }
          { "url" : NumberLong("-4837773290201122847") } -->> { "url" : NumberLong("-2329535691089872938") } on : test-rs0 { "t" : 1, "i" : 4 }
          { "url" : NumberLong("-2329535691089872938") } -->> { "url" : NumberLong("3244151849123193853") } on : test-rs0 { "t" : 1, "i" : 1 }
          { "url" : NumberLong("3244151849123193853") } -->> { "url" : { "$maxKey" : 1 } } on : test-rs0 { "t" : 1, "i" : 2 }
Copy after login

you can see that the chunk boundaries are 64-bit integers (generated by hashing the “url" field). ?When inserts or queries target particular urls, the query can get routed using the url hash to the correct chunk.

Sharding a new collection

Above we’ve sharded an existing collection, which will result in all the chunks of a collection initially living on the same shard. ?The balancer takes care of moving the chunks around, as usual, until we get an even distribution of data.

Much of the time though, it’s better to shard the collection before we add our data - this way MongoDB doesn’t have to worry about moving around existing data. ?Users of sharded collections are familiar with pre-splitting - where empty chunks can be quickly balanced around a cluster before data is added. ?When sharding a new collection using Hash-based shard keys, MongoDB will take care of the presplitting for you. Similarly sized ranges of the Hash-based key are distributed to each existing shard, which means that no initial balancing is needed (unless of course new shards are added).

Let’s see what happens when we shard a new collection webcrawler_empty the same way:

> sh.stopBalancer()
Waiting for active hosts...
Waiting for the balancer lock...
Waiting again for active hosts after balancer is off...
> db.webcrawler_empty.ensureIndex({ url : "hashed" })
> sh.shardCollection("mydb.webcrawler_empty", { url : "hashed" })
{ "collectionsharded" : "mydb.webcrawler_empty", "ok" : 1 }
> db.printShardingStatus(true)
--- Sharding Status ---
...
      mydb.webcrawler_empty
          shard key: { "url" : "hashed" }
          chunks:
              test-rs0    2
              test-rs1    2
              test-rs2    2
              test-rs3    2
          { "url" : { "$minKey" : 1 } } -->> { "url" : NumberLong("-6917529027641081850") } on : test-rs0 { "t" : 4, "i" : 2 }
          { "url" : NumberLong("-6917529027641081850") } -->> { "url" : NumberLong("-4611686018427387900") } on : test-rs0 { "t" : 4, "i" : 3 }
          { "url" : NumberLong("-4611686018427387900") } -->> { "url" : NumberLong("-2305843009213693950") } on : test-rs1 { "t" : 4, "i" : 4 }
          { "url" : NumberLong("-2305843009213693950") } -->> { "url" : NumberLong(0) } on : test-rs1 { "t" : 4, "i" : 5 }
          { "url" : NumberLong(0) } -->> { "url" : NumberLong("2305843009213693950") } on : test-rs2 { "t" : 4, "i" : 6 }
          { "url" : NumberLong("2305843009213693950") } -->> { "url" : NumberLong("4611686018427387900") } on : test-rs2 { "t" : 4, "i" : 7 }
          { "url" : NumberLong("4611686018427387900") } -->> { "url" : NumberLong("6917529027641081850") } on : test-rs3 { "t" : 4, "i" : 8 }
          { "url" : NumberLong("6917529027641081850") } -->> { "url" : { "$maxKey" : 1 } } on : test-rs3 { "t" : 4, "i" : 9 }
Copy after login

As you can see, the new empty collection is already well-distributed and ready to use. ?Be aware though - any balancing currently in progress can interfere with moving the empty initial chunks off the initial shard, balancing will take priority (hence the initial stopBalancer step). Like before, eventually the balancer will distribute all empty chunks anyway, but if you are preparing for a immediate data load it’s probably best to stop the balancer beforehand.

That’s it - you now have a pre-split collection on four shards using Hash-based shard keys. ?Queries and updates on exact urls go to randomized shards and are balanced across the cluster:

> db.webcrawler_empty.find({ url: "/wiki/OWASP" }).explain()
{
  "clusteredType" : "ParallelSort",
  "shards" : {
      "test-rs2/nuwen:31300,nuwen:31301" : [ ... ]
...
Copy after login

However, the trade-off with Hash-based shard keys is that ranged queries and multi-updates must hit all shards:

> db.webcrawler_empty.find({ url: /^\/wiki\/OWASP/ }).explain()
{
  "clusteredType" : "ParallelSort",
  "shards" : {
      "test-rs0/nuwen:31100,nuwen:31101" : [ ... ],
     "test-rs1/nuwen:31200,nuwen:31201" : [ ... ],
     "test-rs2/nuwen:31300,nuwen:31301" : [ ... ],
     "test-rs3/nuwen:31400,nuwen:31401" : [ ... ]
...
Copy after login

Manual chunk assignment and other caveats

The core benefits of the new Hash-based shard keys are:

  • Easy setup of randomized shard key

  • Automated pre-splitting of empty collections

  • Better distribution of chunks on shards for isolated document writes and reads

The standard split and moveChunk functions do work with Hash-based shard keys, so it’s still possible to balance your collection’s chunks in any way you like. ?However, the usual “find” mechanism used to select chunks can behave a bit unexpectedly since the specifier is a document which is hashed to get the containing chunk. ?To keep things simple, just use the new “bounds” parameter when manually manipulating chunks of hashed collections (or all collections, if you prefer):

> use admin
> db.runCommand({ split : "mydb.webcrawler_empty", bounds : [{ "url" : NumberLong("2305843009213693950") }, { "url" : NumberLong("4611686018427387900") }] })
> db.runCommand({ moveChunk : "mydb.webcrawler_empty", bounds : [{ "url" : NumberLong("2305843009213693950") }, { "url" : NumberLong("4611686018427387900") }], to : "test-rs3" })
Copy after login

There are a few other caveats as well - in particular with tag-aware sharding. ?Tag-aware sharding is a feature we released in MongoDB 2.2, which allows you to attach labels to a subset of shards in a cluster. This is valuable for “pinning" collection data to particular shards (which might be hosted on more powerful hardware, for example). ?You can also tag ranges of a collection differently, such that a collection sharded by { “countryCode" : 1 } would have chunks only on servers in that country.

Hash-based shard keys are compatible with tag-aware sharding. ?As in any sharded collection, you may assign chunks to specific shards, but since the chunk ranges are based on the value of the randomized hash of the shard key instead of the shard key itself, this is usually only useful for tagging the whole range to a specific set of shards:

> sh.addShardTag("test-rs2", "DC1")
sh.addShardTag("test-rs3", "DC1")
Copy after login

The above commands assign a hypothetical data center tag “DC1” to shards -rs2 and -rs3, which could indicate that -rs2 and -rs3 are in a particular location. ?Then, by running:

> sh.addTagRange("mydb.webcrawler_empty", { url : MinKey }, { url : MaxKey }, "DC1" )
Copy after login

we indicate to the cluster that the mydb.webcrawler_empty collection should only be stored on “DC1” shards. ?After letting the balancer work:

> db.printShardingStatus(true)
--- Sharding Status ---
...
      mydb.webcrawler_empty
          shard key: { "url" : "hashed" }
          chunks:
              test-rs2    4
              test-rs3    4
          { "url" : { "$minKey" : 1 } } -->> { "url" : NumberLong("-6917529027641081850") } on : test-rs2 { "t" : 5, "i" : 0 }
          { "url" : NumberLong("-6917529027641081850") } -->> { "url" : NumberLong("-4611686018427387900") } on : test-rs3 { "t" : 6, "i" : 0 }
          { "url" : NumberLong("-4611686018427387900") } -->> { "url" : NumberLong("-2305843009213693950") } on : test-rs2 { "t" : 7, "i" : 0 }
          { "url" : NumberLong("-2305843009213693950") } -->> { "url" : NumberLong(0) } on : test-rs3 { "t" : 8, "i" : 0 }
          { "url" : NumberLong(0) } -->> { "url" : NumberLong("2305843009213693950") } on : test-rs2 { "t" : 4, "i" : 6 }
          { "url" : NumberLong("2305843009213693950") } -->> { "url" : NumberLong("4611686018427387900") } on : test-rs2 { "t" : 4, "i" : 7 }
          { "url" : NumberLong("4611686018427387900") } -->> { "url" : NumberLong("6917529027641081850") } on : test-rs3 { "t" : 4, "i" : 8 }
          { "url" : NumberLong("6917529027641081850") } -->> { "url" : { "$maxKey" : 1 } } on : test-rs3 { "t" : 4, "i" : 9 }
           tag: DC1  { "url" : { "$minKey" : 1 } } -->> { "url" : { "$maxKey" : 1 } }
Copy after login

Again, it doesn’t usually make a lot of sense to tag anything other than the full hashed shard key collection to particular shards - by design, there’s no real way to know or control what data is in what range.

Finally, remember that Hash-based shard keys can (right now) only distribute documents based on the value of a single field. ?So, continuing the example above, it isn’t directly possible to use “url" + “timestamp" as a Hash-based shard key without storing the combination in a single field in your application, for example:

url_and_ts : { url : <url>, timestamp : <timestamp> }</timestamp></url>
Copy after login

The sub-document will be hashed as a unit.

If you’re interested in learning more about Hash-based sharding, register for the Hash-based sharding feature demo on May 2.

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

What is the difference between make and new in go language What is the difference between make and new in go language Jan 09, 2023 am 11:44 AM

Differences: 1. Make can only be used to allocate and initialize data of types slice, map, and chan; while new can allocate any type of data. 2. New allocation returns a pointer, which is the type "*Type"; while make returns a reference, which is Type. 3. The space allocated by new will be cleared; after make allocates the space, it will be initialized.

How to use the new keyword in java How to use the new keyword in java May 03, 2023 pm 10:16 PM

1. Concept In the Java language, the "new" expression is responsible for creating an instance, in which the constructor is called to initialize the instance; the return value type of the constructor itself is void, not "the constructor returns the newly created Object reference", but the value of the new expression is a reference to the newly created object. 2. Purpose: Create an object of a new class. 3. Working mechanism: Allocate memory space for object members, and specify default values. Explicitly initialize member variables, perform construction method calculations, and return reference values. 4. Instance new operation often means opening up new memory in the memory. The memory space is allocated in the heap area in the memory. It is controlled by jvm and automatically manages the memory. Here we use the String class as an example. Pu

Detailed explanation of data segmentation (Sharding) implemented by Redis Detailed explanation of data segmentation (Sharding) implemented by Redis Jun 20, 2023 pm 04:52 PM

Redis is a high-performance key-value storage system, which is often used in application scenarios such as caching and rankings. When the amount of data becomes larger and larger, Redis on a single machine may encounter performance bottlenecks. At this time, we can achieve horizontal expansion by segmenting the data and storing it on multiple Redis nodes. This is Redis's data segmentation (Sharding). Redis data segmentation can be completed through the following steps: To set the sharding rules, you first need to set the sharding rules. Redis sharding can be based on key value

How does the new operator work in js? How does the new operator work in js? Feb 19, 2024 am 11:17 AM

How does the new operator in js work? Specific code examples are needed. The new operator in js is a keyword used to create objects. Its function is to create a new instance object based on the specified constructor and return a reference to the object. When using the new operator, the following steps are actually performed: create a new empty object; point the prototype of the empty object to the prototype object of the constructor; assign the scope of the constructor to the new object (so this points to new object); execute the code in the constructor and give the new object

New Fujifilm fixed-lens GFX camera to debut new medium format sensor, could kick off all-new series New Fujifilm fixed-lens GFX camera to debut new medium format sensor, could kick off all-new series Sep 27, 2024 am 06:03 AM

Fujifilm has seen a lot of success in recent years, largely due to its film simulations and the popularity of its compact rangefinger-style cameras on social media. However, it doesn't seem to be resting on its laurels, according to Fujirumors. The u

How to use clone() method in Java instead of new keyword? How to use clone() method in Java instead of new keyword? Apr 22, 2023 pm 07:55 PM

Using clone() instead of new The most common way to create a new object instance in Java is to use the new keyword. JDK's support for new is very good. When using the new keyword to create lightweight objects, it is very fast. However, for heavyweight objects, the execution time of the constructor may be longer because the object may perform some complex and time-consuming operations in the constructor. As a result, the system cannot obtain a large number of instances in the short term. To solve this problem, you can use the Object.clone() method. The Object.clone() method can bypass the constructor and quickly copy an object instance. However, by default, the instance generated by the clone() method is only a shallow copy of the original object.

How to use new instantiation in java How to use new instantiation in java May 16, 2023 pm 07:04 PM

1. The concept is "create a Java object" ----- allocate memory and return a reference to that memory. 2. Notes (1) The Java keyword new is an operator. It has the same or similar precedence as +, -, *, / and other operators. (2) Creating a Java object requires three steps: declaring reference variables, instantiating, and initializing the object instance. (3) Before instantiation, the parameterless constructor of the parent class will be called by default, that is, an object of the parent class must be created. 3. Two instantiation methods (1) Object name = new class name (parameter 1, parameter 2... Parameter n); object name. method (); (2) new class name (parameter 1, parameter 2... parameter n). method; 4. Example uses a simple code to illustrate how to make the object real

See all articles