


How to use a Redis-backed Bloom filter to remove duplicates in a crawler
This article shows how to build a Bloom filter on top of Redis for deduplication. The approach combines the Bloom filter's ability to deduplicate very large data sets with Redis's persistence.
Foreword:
Deduplication is a routine task in everyday work, and it comes up even more often in crawling, where the data sets are usually fairly large. Two things need to be considered: the amount of data to deduplicate and the speed of deduplication. To keep it fast, deduplication is generally performed in memory.
When the amount of data is small, it can be deduplicated directly in memory. In Python, for example, a set() is enough.
When the deduplication data needs to be persisted, the Set data structure of Redis can be used.
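As a minimal sketch of the Redis Set approach: redis-py's `sadd` returns the number of members actually added, so a return value of 0 means the URL was already present. The helper name `seen_before`, the key `crawler:seen`, and the `InMemorySetClient` stand-in are assumptions made for illustration; the stand-in only exists so the sketch runs without a Redis server.

```python
def seen_before(client, key, url):
    """Return True if url was already in the set stored at key, adding it if not.

    SADD reports how many members were newly added, so 0 means "duplicate".
    """
    return client.sadd(key, url) == 0


class InMemorySetClient:
    """Hypothetical stand-in that mimics only sadd(), for running without a server."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *values):
        members = self._sets.setdefault(key, set())
        added = 0
        for value in values:
            if value not in members:
                members.add(value)
                added += 1
        return added


if __name__ == "__main__":
    # For real use, replace with: redis.Redis(host="localhost", port=6379, db=0)
    client = InMemorySetClient()
    for url in ["http://a.example/1", "http://a.example/2", "http://a.example/1"]:
        status = "duplicate" if seen_before(client, "crawler:seen", url) else "new"
        print(url, status)
```

Because the check and the insert happen in a single SADD call, several crawler processes sharing one Redis instance should not both treat the same URL as new.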
When the amount of data is larger, you can use a hash function (such as MD5 or SHA-1) to compress each long string into a fixed-length digest of 16/32/40 characters, and then deduplicate the digests with either of the two methods above.
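A short sketch of digest-based deduplication, assuming UTF-8 input and Python's standard `hashlib`. The function names `fingerprint` and `is_duplicate` are illustrative. MD5 yields a 32-character hex digest and SHA-1 a 40-character one, so the memory used per entry no longer depends on the length of the URL.

```python
import hashlib


def fingerprint(text, algorithm="md5"):
    """Return the hex digest of text: 32 chars for md5, 40 chars for sha1."""
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


def is_duplicate(text, seen):
    """Check the fingerprint of text against the set `seen`, recording it if new."""
    digest = fingerprint(text)
    if digest in seen:
        return True
    seen.add(digest)
    return False


if __name__ == "__main__":
    seen = set()
    long_url = "http://example.com/search?q=" + "x" * 500
    print(len(fingerprint(long_url)), len(fingerprint(long_url, "sha1")))
    print(is_duplicate(long_url, seen), is_duplicate(long_url, seen))
```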
When the amount of data reaches hundreds of millions (or even billions or tens of billions) of items, memory becomes the limit, and deduplication has to work at the level of individual bits. A Bloom filter maps each object to several bits in memory and uses the 0/1 values of those bits to decide whether the object has already been seen.
However, a plain Bloom filter lives in the memory of a single machine. That makes it hard to persist (everything is lost if the machine goes down) and hard to share among the nodes of a distributed crawler. If the Bloom filter's bits are stored in Redis instead, both problems are solved.
Code:
    # encoding=utf-8
    import redis
    from hashlib import md5


    class SimpleHash(object):
        """Seeded string hash that maps a value to a bit offset in [0, cap)."""

        def __init__(self, cap, seed):
            self.cap = cap
            self.seed = seed

        def hash(self, value):
            ret = 0
            for i in range(len(value)):
                ret += self.seed * ret + ord(value[i])
            # cap is a power of two, so this mask keeps the low bits of ret
            return (self.cap - 1) & ret


    class BloomFilter(object):
        def __init__(self, host='localhost', port=6379, db=0, blockNum=1, key='bloomfilter'):
            """
            :param host: the host of Redis
            :param port: the port of Redis
            :param db: which db in Redis
            :param blockNum: one blockNum for about 90,000,000; if you have more strings for filtering, increase it.
            :param key: the key's name in Redis
            """
            self.server = redis.Redis(host=host, port=port, db=db)
            self.bit_size = 1 << 31  # A Redis String can hold at most 512M; 256M is used here
            self.seeds = [5, 7, 11, 13, 31, 37, 61]
            self.key = key
            self.blockNum = blockNum
            self.hashfunc = []
            for seed in self.seeds:
                self.hashfunc.append(SimpleHash(self.bit_size, seed))

        def isContains(self, str_input):
            if not str_input:
                return False
            m5 = md5()
            m5.update(str_input.encode('utf-8'))  # md5 requires bytes on Python 3
            str_input = m5.hexdigest()
            ret = True
            # The first two hex characters of the digest select the block
            name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
            for f in self.hashfunc:
                loc = f.hash(str_input)
                ret = ret & self.server.getbit(name, loc)
            return ret

        def insert(self, str_input):
            m5 = md5()
            m5.update(str_input.encode('utf-8'))  # md5 requires bytes on Python 3
            str_input = m5.hexdigest()
            name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
            for f in self.hashfunc:
                loc = f.hash(str_input)
                self.server.setbit(name, loc, 1)


    if __name__ == '__main__':
        # The first run prints "not exists!"; later runs print "exists!"
        bf = BloomFilter()
        if bf.isContains('http://www.baidu.com'):  # check whether the string exists
            print('exists!')
        else:
            print('not exists!')
            bf.insert('http://www.baidu.com')
Description:
There are many explanations online (for example on Baidu) of how the Bloom filter algorithm uses bits for deduplication. Put simply: allocate a region of memory and choose several seeds. Each seed, combined with the string in a hash function, maps the string to one bit in that region. If all of the mapped bits are 1, the string is considered to already exist. Insertion works the same way: every mapped bit is set to 1.
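The mechanism just described can be sketched in memory, without Redis. This is an illustrative toy, not the article's implementation: the class name `TinyBloom` and the small default `bit_size` are assumptions, and a `bytearray` stands in for the Redis String. The seeded hash follows the same recurrence as `SimpleHash`, with the mask applied at every step (equivalent to masking once at the end, because the size is a power of two).

```python
import hashlib


class TinyBloom:
    """In-memory Bloom filter sketch: one bit position per seed."""

    def __init__(self, bit_size=1 << 20, seeds=(5, 7, 11, 13, 31, 37, 61)):
        self.bit_size = bit_size  # must be a power of two for the mask below
        self.seeds = seeds
        self.bits = bytearray(bit_size // 8)

    def _positions(self, value):
        digest = hashlib.md5(value.encode("utf-8")).hexdigest()
        for seed in self.seeds:
            h = 0
            for ch in digest:
                h = (h + seed * h + ord(ch)) & (self.bit_size - 1)
            yield h

    def add(self, value):
        # Set every mapped bit to 1
        for pos in self._positions(value):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, value):
        # Present only if every mapped bit is already 1
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(value))


if __name__ == "__main__":
    bloom = TinyBloom()
    print("http://www.baidu.com" in bloom)
    bloom.add("http://www.baidu.com")
    print("http://www.baidu.com" in bloom)
```

An empty filter can never report a match, and an inserted string always matches afterwards; only the "already exists" answer for a string that was never inserted can be wrong.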
Note that a Bloom filter has a false positive probability: with some probability, a string that was never inserted is misjudged as already existing. The size of this probability depends on the number of seeds, the amount of memory allocated, and the number of objects being deduplicated. In the reference table for this trade-off, m represents the memory size (in bits), n represents the number of deduplication objects, and k represents the number of seeds. For example, the code above allocates 256M, which is 1 << 31 bits
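The relationship between m, n, and k can be estimated with the standard Bloom filter approximation p ≈ (1 − e^(−kn/m))^k. This formula assumes ideal, independent hash functions, which the simple seeded hash in the code above only approximates, so the figures it gives should be read as rough guidance rather than guarantees. The function name `false_positive_rate` and the sample values of n are chosen for illustration.

```python
import math


def false_positive_rate(m, n, k):
    """Approximate false positive probability of a Bloom filter.

    m: number of bits, n: number of inserted objects, k: number of hash seeds.
    Assumes ideal, independent hash functions.
    """
    return (1.0 - math.exp(-k * n / m)) ** k


if __name__ == "__main__":
    m = 1 << 31  # 256M of memory, as in the article's code
    k = 7        # number of seeds in the article's code
    for n in (50_000_000, 90_000_000, 200_000_000):
        print(f"n={n:>11,d}  p~{false_positive_rate(m, n, k):.2e}")
```

The output shows how quickly the estimate degrades once n grows past the capacity a block was sized for.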
Redis-based Bloom filter deduplication actually uses the String data structure of Redis. A single Redis String can hold at most 512M, so if the volume of data to deduplicate is large, you need to allocate multiple deduplication blocks (blockNum in the code is the number of blocks).
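A back-of-the-envelope sketch for sizing the blocks. The per-block capacity of about 90,000,000 strings comes from the docstring in the article's code; the function names and the example workload of 500 million URLs are assumptions for illustration.

```python
import math

BITS_PER_BLOCK = 1 << 31        # bit_size used in the article's code
ITEMS_PER_BLOCK = 90_000_000    # rough capacity per block, from the code's docstring


def block_megabytes(bits=BITS_PER_BLOCK):
    """Memory taken by one block, in MiB (8 bits per byte)."""
    return bits / 8 / (1024 * 1024)


def blocks_needed(expected_items, items_per_block=ITEMS_PER_BLOCK):
    """Smallest blockNum that keeps each block at or under its rough capacity."""
    return max(1, math.ceil(expected_items / items_per_block))


if __name__ == "__main__":
    n = 500_000_000  # hypothetical workload
    num = blocks_needed(n)
    print(f"one block: {block_megabytes():.0f} MiB")
    print(f"blockNum for {n:,d} items: {num} ({num * block_megabytes():.0f} MiB total)")
```

Each block is a separate Redis key, so the total memory grows linearly with blockNum.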
The code uses MD5 hashing to compress each string to 32 characters (hashlib.sha1() can also be used, giving 40 characters). This serves two purposes. First, the Bloom filter's hash makes errors on very long strings and often misjudges them as already existing; this problem disappears once the strings are compressed. Second, each character of the digest is a hex digit from 0 to f, 16 possibilities in total. The code takes the first two characters and, based on blockNum, assigns the string to one of the deduplication blocks.
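The block assignment can be isolated into a small sketch. The function name `block_key` is illustrative; the logic mirrors the `name = self.key + str(int(str_input[0:2], 16) % self.blockNum)` line in the code above. Since two hex characters encode values from 0 to 255, this scheme can address at most 256 distinct blocks.

```python
import hashlib


def block_key(url, base_key="bloomfilter", block_num=1):
    """Return the Redis key of the block that url is routed to."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    block_index = int(digest[0:2], 16) % block_num  # first two hex chars: 0..255
    return base_key + str(block_index)


if __name__ == "__main__":
    for url in ["http://www.baidu.com", "http://example.com/a", "http://example.com/b"]:
        print(url, "->", block_key(url, block_num=4))
```

The same URL always maps to the same block, so a lookup and a later insert touch the same Redis key.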
Summary:
Redis-based Bloom filter deduplication combines the Bloom filter's ability to deduplicate massive data sets with Redis's persistence, and because the bits live in Redis, it also makes deduplication across distributed machines straightforward. Before using it, estimate the amount of data to be deduplicated, and adjust the number of seeds and blockNum according to the reference table (the fewer the seeds, the faster the deduplication, but the higher the false positive rate, meaning more new URLs are wrongly skipped as already seen).