


Detailed example of how PHP finds the same records in two large files
(Recommended tutorial: PHP video tutorial)
1. Introduction
Given two a and b There are files with x and y rows of data respectively. (x and y are both greater than 1 billion). The machine memory is limited to 100M. How to find the same records?
2. Idea
- The difficulty in dealing with this problem is mainly that it is impossible to read this massive data into the memory at one time.
- If it cannot be read into the memory at one time, can it be considered multiple times? If it is possible, how can we calculate the same value after reading it multiple times?
- We can use divide and conquer thinking to reduce the big to the small. If the values of the same string are equal after hashing, then we can consider using hash modulo to disperse the records into n files. How to get this n? PHP has 100M memory, and the array can store approximately 1 million data. So considering that records a and b only have 1 billion rows, n must be at least greater than 200.
- There are 200 files at this time. The same records must be in the same file, and each file can be read into the memory. Then you can find the same records in these 200 files in sequence, and then output them to the same file. The final result is the same records in the two files a and b.
- It is very simple to find the same record in a small file. Just use each row of records as the key of the hash table and count the number of occurrences of the key >= 2.
3. Practical operation
1 billion files are too big. Practical operation is a waste of time. Just achieve the practical purpose.
The problem size is reduced to: 1M memory limit, a and b each have 100,000 rows of records. The memory limit can be limited by PHP's ini_set('memory_limit', '1M');
.
4. Generate test file
Generate random numbers to fill the file:
/** * 生成随机数填充文件 * Author: ClassmateLin * Email: classmatelin.site@gmail.com * Site: https://www.classmatelin.top * @param string $filename 输出文件名 * @param int $batch 按多少批次生成数据 * @param int $batchSize 每批数据的大小 */ function generate(string $filename, int $batch=1000, int $batchSize=10000) { for ($i=0; $i<$batch; $i++) { $str = ''; for ($j=0; $j<$batchSize; $j++) { $str .= rand($batch, $batchSize) . PHP_EOL; // 生成随机数 } file_put_contents($filename, $str, FILE_APPEND); // 追加模式写入文件 } } generate('a.txt', 10); generate('b.txt', 10);
5. Split the file
Change a.txt
, b.txt
Split into n files by hash modulus.
/** * 用hash取模方式将文件分散到n个文件中 * Author: ClassmateLin * Email: classmatelin.site@gmail.com * Site: https://www.classmatelin.top * @param string $filename 输入文件名 * @param int $mod 按mod取模 * @param string $dir 文件输出目录 */ function spiltFile(string $filename, int $mod=20, string $dir='files') { if (!is_dir($dir)){ mkdir($dir); } $fp = fopen($filename, 'r'); while (!feof($fp)){ $line = fgets($fp); $n = crc32(hash('md5', $line)) % $mod; // hash取模 $filepath = $dir . '/' . $n . '.txt'; // 文件输出路径 file_put_contents($filepath, $line, FILE_APPEND); // 追加模式写入文件 } fclose($fp); } spiltFile('a.txt'); spiltFile('b.txt');
Execute the splitFile
function and get the following picture files
20 files in the directory.
6. Find duplicate records
Now we need to find the same records in 20 files. In fact, we need to find the same records in one file and operate each 20 times.
Find the same record in a file:
/** * 查找一个文件中相同的记录输出到指定文件中 * Author: ClassmateLin * Email: classmatelin.site@gmail.com * Site: https://www.classmatelin.top * @param string $inputFilename 输入文件路径 * @param string $outputFilename 输出文件路径 */ function search(string $inputFilename, $outputFilename='output.txt') { $table = []; $fp = fopen($inputFilename, 'r'); while (!feof($fp)) { $line = fgets($fp); !isset($table[$line]) ? $table[$line] = 1 : $table[$line]++; // 未设置的值设1,否则自增 } fclose($fp); foreach ($table as $line => $count) { if ($count >= 2){ // 出现大于2次的则是相同的记录,输出到指定文件中 file_put_contents($outputFilename, $line, FILE_APPEND); } } }
Find the same record in all files:
/** * 从给定目录下文件中分别找出相同记录输出到指定文件中 * Author: ClassmateLin * Email: classmatelin.site@gmail.com * Site: https://www.classmatelin.top * @param string $dirs 指定目录 * @param string $outputFilename 输出文件路径 */ function searchAll($dirs='files', $outputFilename='output.txt') { $files = scandir($dirs); foreach ($files as $file) { $filepath = $dirs . '/' . $file; if (is_file($filepath)){ search($filepath, $outputFilename); } } }
The space problem of large file processing has been solved here, so the time problem How to deal with it? A single machine can handle it by utilizing the multi-core of the CPU. If it is not enough, it can be handled by multiple servers.
7. Complete code
Copy after login
(recommended tutorial: PHP video tutorial)
The above is the detailed content of Detailed example of how PHP finds the same records in two large files. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics

PHP 8.4 brings several new features, security improvements, and performance improvements with healthy amounts of feature deprecations and removals. This guide explains how to install PHP 8.4 or upgrade to PHP 8.4 on Ubuntu, Debian, or their derivati

If you are an experienced PHP developer, you might have the feeling that you’ve been there and done that already.You have developed a significant number of applications, debugged millions of lines of code, and tweaked a bunch of scripts to achieve op

Visual Studio Code, also known as VS Code, is a free source code editor — or integrated development environment (IDE) — available for all major operating systems. With a large collection of extensions for many programming languages, VS Code can be c

JWT is an open standard based on JSON, used to securely transmit information between parties, mainly for identity authentication and information exchange. 1. JWT consists of three parts: Header, Payload and Signature. 2. The working principle of JWT includes three steps: generating JWT, verifying JWT and parsing Payload. 3. When using JWT for authentication in PHP, JWT can be generated and verified, and user role and permission information can be included in advanced usage. 4. Common errors include signature verification failure, token expiration, and payload oversized. Debugging skills include using debugging tools and logging. 5. Performance optimization and best practices include using appropriate signature algorithms, setting validity periods reasonably,

A string is a sequence of characters, including letters, numbers, and symbols. This tutorial will learn how to calculate the number of vowels in a given string in PHP using different methods. The vowels in English are a, e, i, o, u, and they can be uppercase or lowercase. What is a vowel? Vowels are alphabetic characters that represent a specific pronunciation. There are five vowels in English, including uppercase and lowercase: a, e, i, o, u Example 1 Input: String = "Tutorialspoint" Output: 6 explain The vowels in the string "Tutorialspoint" are u, o, i, a, o, i. There are 6 yuan in total

This tutorial demonstrates how to efficiently process XML documents using PHP. XML (eXtensible Markup Language) is a versatile text-based markup language designed for both human readability and machine parsing. It's commonly used for data storage an

Static binding (static::) implements late static binding (LSB) in PHP, allowing calling classes to be referenced in static contexts rather than defining classes. 1) The parsing process is performed at runtime, 2) Look up the call class in the inheritance relationship, 3) It may bring performance overhead.

What are the magic methods of PHP? PHP's magic methods include: 1.\_\_construct, used to initialize objects; 2.\_\_destruct, used to clean up resources; 3.\_\_call, handle non-existent method calls; 4.\_\_get, implement dynamic attribute access; 5.\_\_set, implement dynamic attribute settings. These methods are automatically called in certain situations, improving code flexibility and efficiency.
