How to use Redis's Bloomfilter to remove duplicates during the crawler process

坏嘻嘻

Sep 15, 2018 am 11:21 AM

This article describes how to use a Bloomfilter on Redis to remove duplicates. The approach combines Bloomfilter's ability to deduplicate massive data sets with Redis's persistence. It may be a useful reference for anyone facing the same problem.

Foreword:

Deduplication is a skill used often in everyday work. It is even more common in the crawler field, where the scale is usually quite large. Two things need to be considered: the amount of data to deduplicate and the speed of deduplication. To keep deduplication fast, it is generally done in memory.

When the amount of data is not large, it can be held directly in memory for deduplication. For example, Python can use set().
When the deduplication data needs to be persisted, the set data structure of Redis can be used.
When the amount of data is larger, a hash algorithm can reduce each long string to 16/32/40 characters (for example, MD5 yields 32 characters and SHA-1 yields 40), after which either of the two methods above can be used.
When the amount of data reaches hundreds of millions (or even billions or tens of billions), memory becomes the limit and deduplication has to work at the level of individual bits. Bloomfilter maps each object to several bits in memory and uses the 0/1 values of those bits to decide whether the object already exists.
However, a Bloomfilter lives in the memory of a single machine. That makes it awkward to persist (everything is lost if the machine goes down) and awkward to share among distributed crawlers. If the Bloomfilter's memory is allocated on Redis instead, both problems are solved.
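As a rough illustration of the in-memory and hashing approaches listed above, the sketch below deduplicates URLs with a Python set while storing a fixed-length MD5 digest instead of the full string. The names `fingerprint` and `dedup_urls` and the sample URLs are hypothetical, chosen only for this example.

```python
from hashlib import md5


def fingerprint(text):
    """Return the 32-character MD5 hex digest of a string."""
    return md5(text.encode("utf-8")).hexdigest()


def dedup_urls(urls):
    """Yield each URL the first time it is seen and skip later repeats."""
    seen = set()  # holds 32-character digests, not the full URLs
    for url in urls:
        digest = fingerprint(url)
        if digest not in seen:
            seen.add(digest)
            yield url


urls = ["http://www.baidu.com", "http://example.com/a", "http://www.baidu.com"]
unique = list(dedup_urls(urls))
print(unique)
```

Every digest still costs tens of bytes in a set, which is why this approach stops scaling once the number of strings reaches the hundreds of millions.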

Code:

# encoding=utf-8
import redis
from hashlib import md5


class SimpleHash(object):
    """One seeded hash function that maps a string to a bit offset."""

    def __init__(self, cap, seed):
        self.cap = cap
        self.seed = seed

    def hash(self, value):
        ret = 0
        for i in range(len(value)):
            ret += self.seed * ret + ord(value[i])
        # cap is a power of two, so the mask keeps only the low bits
        return (self.cap - 1) & ret


class BloomFilter(object):
    def __init__(self, host='localhost', port=6379, db=0, blockNum=1, key='bloomfilter'):
        """
        :param host: the host of Redis
        :param port: the port of Redis
        :param db: which db in Redis
        :param blockNum: one blockNum for about 90,000,000; if you have more strings for filtering, increase it.
        :param key: the key's name in Redis
        """
        self.server = redis.Redis(host=host, port=port, db=db)
        self.bit_size = 1 << 31  # a Redis String can hold at most 512M; 256M is used here
        self.seeds = [5, 7, 11, 13, 31, 37, 61]
        self.key = key
        self.blockNum = blockNum
        self.hashfunc = []
        for seed in self.seeds:
            self.hashfunc.append(SimpleHash(self.bit_size, seed))

    def isContains(self, str_input):
        if not str_input:
            return False
        # md5 needs bytes; encode text strings first
        if not isinstance(str_input, bytes):
            str_input = str_input.encode('utf-8')
        m5 = md5()
        m5.update(str_input)
        str_input = m5.hexdigest()
        ret = True
        # the first two hex characters of the digest select the block
        name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
        for f in self.hashfunc:
            loc = f.hash(str_input)
            ret = ret & self.server.getbit(name, loc)
        return ret

    def insert(self, str_input):
        if not isinstance(str_input, bytes):
            str_input = str_input.encode('utf-8')
        m5 = md5()
        m5.update(str_input)
        str_input = m5.hexdigest()
        name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
        for f in self.hashfunc:
            loc = f.hash(str_input)
            self.server.setbit(name, loc, 1)


if __name__ == '__main__':
    # The first run prints "not exists!"; later runs print "exists!"
    bf = BloomFilter()
    if bf.isContains('http://www.baidu.com'):  # check whether the string exists
        print('exists!')
    else:
        print('not exists!')
        bf.insert('http://www.baidu.com')


Description:

There are many explanations on Baidu of how the Bloomfilter algorithm uses bits for deduplication. Put simply, there are several seeds and a region of allocated memory. Each seed is combined with the string in a hash, which maps the string to one bit in that memory. If all of the mapped bits are 1, the string is considered to already exist. Insertion works the same way: every mapped bit is set to 1.
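The mechanism can be tried without Redis. The following is a minimal in-memory sketch, assuming a `bytearray` as the bit store and reusing the seeded hash from the code above; the class name `TinyBloom` and the small default `bit_size` are hypothetical and only meant for experimentation.

```python
class TinyBloom:
    """In-memory Bloom filter over a bytearray (illustrative only)."""

    def __init__(self, bit_size=1 << 16, seeds=(5, 7, 11, 13, 31, 37, 61)):
        self.bit_size = bit_size  # must be a power of two for the mask below
        self.seeds = seeds
        self.bits = bytearray(bit_size // 8)

    def _locations(self, value):
        """Yield one bit offset per seed for the given string."""
        for seed in self.seeds:
            ret = 0
            for ch in value:
                ret += seed * ret + ord(ch)
            yield ret & (self.bit_size - 1)

    def add(self, value):
        """Set every bit that the string maps to."""
        for loc in self._locations(value):
            self.bits[loc >> 3] |= 1 << (loc & 7)

    def __contains__(self, value):
        """Report True only if every mapped bit is already 1."""
        return all(self.bits[loc >> 3] & (1 << (loc & 7))
                   for loc in self._locations(value))


bloom = TinyBloom()
url = "http://www.baidu.com"
seen_before = url in bloom  # nothing inserted yet, so every bit is 0
bloom.add(url)
seen_after = url in bloom   # all mapped bits are now 1
print(seen_before, seen_after)  # prints: False True
```

A string that was inserted is always reported as present; only the opposite direction can go wrong, when other strings happen to have set the same bits.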
Note that the Bloomfilter algorithm has a false positive probability (the "miss rate"): there is some chance that a string that was never inserted is judged to already exist. This probability depends on the number of seeds, the amount of memory allocated, and the number of objects to deduplicate. In the table below, m is the memory size in bits, n is the number of deduplication objects, and k is the number of seeds. For example, the code allocates 256M, which is 1 << 31 bits, and uses 7 seeds.
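The relationship between m, n, and k can also be estimated numerically. The sketch below uses the standard textbook approximation p ≈ (1 − e^(−kn/m))^k, which assumes the k hash functions are independent and uniform; the simple seeded hash in the code may not fully satisfy that assumption, so the result is only a rough guide. The function name `false_positive_rate` is hypothetical, and n is taken from the figure of about 90,000,000 strings per block given in the code's docstring.

```python
import math


def false_positive_rate(m, n, k):
    """Approximate Bloom filter false positive rate for m bits, n items, k hashes."""
    return (1.0 - math.exp(-k * n / m)) ** k


m = 1 << 31      # bits in one block (256M), as in the code
k = 7            # number of seeds in the code
n = 90_000_000   # strings per block suggested by the docstring

p = false_positive_rate(m, n, k)
print("m/n = %.1f, estimated false positive rate = %.2e" % (m / n, p))
```

Doubling n without adding memory raises the estimate sharply, which is the reason for adding blocks as the data grows.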
Redis-based Bloomfilter deduplication actually uses the Redis String data structure, but a single Redis String can hold at most 512M. If the volume of data to deduplicate is large, multiple deduplication blocks are needed (blockNum in the code is the number of blocks).
The code uses MD5 to hash each string down to 32 characters (hashlib.sha1() can also be used, giving 40 characters). This serves two purposes. First, Bloomfilter makes errors when hashing a very long string, often misjudging it as already existing; the problem goes away once the input is reduced to a fixed length. Second, each character of the digest is one of 0 to f, 16 possibilities in total. The code takes the first two characters, reads them as a hexadecimal number, and takes the remainder modulo blockNum to assign the string to a deduplication block.
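The block selection step can be looked at in isolation. This sketch repeats the same computation as the code above outside the class; the function name `block_name` and the choice of four blocks are hypothetical.

```python
from hashlib import md5


def block_name(text, key="bloomfilter", block_num=4):
    """Return the Redis key of the block that a string is assigned to."""
    digest = md5(text.encode("utf-8")).hexdigest()
    prefix = int(digest[0:2], 16)  # first two hex characters: a value from 0 to 255
    return key + str(prefix % block_num)


for url in ("http://www.baidu.com", "http://example.com/a", "http://example.com/b"):
    print(url, "->", block_name(url))
```

Because the two-character prefix has only 256 possible values, this scheme cannot address more than 256 blocks, and the strings are spread evenly only when blockNum divides 256.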

Summary:

Redis-based Bloomfilter deduplication combines Bloomfilter's capacity for massive deduplication with Redis's persistence, and because it runs on Redis it also makes deduplication across distributed machines convenient. Before use, estimate the amount of data to deduplicate, then adjust the number of seeds and blockNum according to the table above (the fewer the seeds, the faster the deduplication, but the higher the false positive rate).
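For the budgeting step, a back-of-the-envelope helper might look like the following. It assumes the rule of thumb from the code's docstring (one block for about 90,000,000 strings) and the 256M block size used in the code; the function name `estimate_blocks` and the sample volumes are hypothetical.

```python
import math

STRINGS_PER_BLOCK = 90000000  # rule of thumb from the docstring in the code
BLOCK_MEGABYTES = 256         # each block is 1 << 31 bits


def estimate_blocks(expected_strings):
    """Return how many blocks to allocate for the expected number of strings."""
    return max(1, int(math.ceil(expected_strings / float(STRINGS_PER_BLOCK))))


for n in (50000000, 500000000, 2000000000):
    blocks = estimate_blocks(n)
    print("%d strings -> blockNum=%d, about %d MB in Redis"
          % (n, blocks, blocks * BLOCK_MEGABYTES))
```

The resulting value would be passed as blockNum when constructing the BloomFilter, provided the Redis server has that much memory available.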
