


Data consistency verification and repair in a non-master-slave MySQL environment
1. Introduction
Project address: https://github.com/seanlook/p...
The pt-table-checksum tool is often used for data consistency verification in a master-slave environment; I wrote about its principle and implementation in an earlier article, "Use pt-table-checksum to check MySQL data consistency in the production environment". In DBA work, however, there are cases where two tables need to be checked for consistency but have no master-slave relationship between them. The pt tool's approach of replaying the checking statements run on the master against the slave via the binlog no longer applies.
Such special needs do come up. For example, when migrating from an Alibaba Cloud RDS instance to a self-built MySQL instance, the data transmission service works by batch extraction per table plus a binlog subscription, and because it forces row-format binlog, pt-table-checksum does not have permission to temporarily switch the session to statement format. Another requirement is converting the character set of a whole database: the table definitions are all utf8, but the application connects with the default latin1. To unify the connection character set and the table character set, the data can only be exported as latin1 and re-imported as utf8, and the result then needs a consistency check. Besides the fact that the binlog parsing program (such as canal) does not support statement format, the stored contents of the old and new libraries differ, so the checksums calculated by pt-table-checksum will differ and the check becomes useless.
That is why I came up with the idea of writing one myself, modeled on pt-table-checksum: px-table-checksum.
2. Implementation method
The overall idea borrows from pt-table-checksum: fetch a batch of data, for example 1000 rows (a chunk), from the source library, calculate its CRC32 value, run the same statement on the target library, and store both results in another library; finally, check whether the crc values of chunks with the same number match. Knowing about an inconsistency is not enough, though; you need to be able to repair the differences quickly and conveniently. So, based on the inconsistent chunks, the tool goes back to the source and target libraries to find the inconsistent rows, determines whether they are missing, redundant, or modified, and generates repair SQL. A configuration option indicates whether to repair automatically.
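As an illustration only, the kind of per-chunk checksum statement this approach implies looks roughly like the sketch below; the table and column names are made up, and the real tool builds the statement from the table's key and column metadata.

# Hypothetical sketch of a per-chunk checksum statement (not the tool's exact SQL).
chunk_checksum_sql = """
SELECT COUNT(*) AS cnt,
       COALESCE(LOWER(CONV(BIT_XOR(
           CAST(CRC32(CONCAT_WS('#', id, col1, col2)) AS UNSIGNED)
       ), 10, 16)), 0) AS crc
FROM (SELECT id, col1, col2
        FROM db1.t_demo
       WHERE id > 12345        -- last key of the previous chunk
       ORDER BY id
       LIMIT 1000) AS chunk
"""
# The same statement is run on the source and the target library, and the
# (chunk number, cnt, crc) results are stored in a separate result library.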
Then the question is:
How to determine a batch, that is, how to get the next chunk? I did not want to do what pt-table-checksum does, dynamically adjusting the chunk size according to the load and even pausing the check when the number of active threads exceeds a threshold; that would be too much work. For now the number of rows per chunk is fixed, and can be configured to 1000, 2000, and so on.
So paging queries are needed. Based on the (auto-increment or composite) primary key or unique index, each query takes LIMIT 1000 rows in ascending order, and the last row becomes the start of the next batch. This requires analyzing the key situation on the table and combining it into the query conditions. Currently only tables with a primary key or unique key can be checked.
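For a single-column key, the boundary logic roughly corresponds to the sketch below (a simplified illustration assuming an integer primary key; composite keys need tuple comparisons in the WHERE clause):

def next_chunk_boundary(cursor, db, table, pk, last_key, chunk_size=1000):
    # Simplified sketch: fetch the keys of the next chunk in ascending order
    # and return the last one, which becomes the start of the following chunk.
    # The real tool also handles composite primary/unique keys and the first
    # chunk (which has no lower bound).
    cursor.execute(
        "SELECT {pk} FROM {db}.{tb} WHERE {pk} > %s ORDER BY {pk} LIMIT %s"
        .format(pk=pk, db=db, tb=table),
        (last_key, chunk_size))
    rows = cursor.fetchall()
    if not rows:
        return None          # table exhausted, checking of this table is done
    return rows[-1][0]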
How to ensure that the source library and the target library run the same SQL? In an earlier version, the target and source libraries each used multiple threads to calculate chunks and store the results. Later I realized this had a serious bug: for example, both fetch 1000 rows, but if the target library has less data, its next chunk starts at a different place, and the comparison results are a complete mess.
Chunks with the same number must therefore have the same starting point, so I thought of using a queue to store all the verification SQL that has been run on the source library and replaying it on the target library, imitating the pt tool. Since multiple threads need to compare multiple tables at the same time, the queue could consume too much memory, so a Redis queue is used.
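A minimal producer/consumer sketch of this idea with hotqueue; the queue name, message fields and the helper function are assumptions for illustration, not the tool's actual code:

from hotqueue import HotQueue

# One of several Redis-backed queues (name and connection values are examples).
queue = HotQueue("px_checksum_0", host="127.0.0.1", port=6379, db=0)

def replay_on_target_and_store(task):
    # Placeholder: run task["sql"] on the target library and store the
    # (chunk_no, cnt, crc) result in the checksum result library.
    print(task)

# Source side: after checksumming a chunk, push the exact statement and its
# chunk number so the target library replays the very same SQL.
queue.put({"chunk_no": 42, "db": "db1", "table": "t_demo",
           "sql": "SELECT ... WHERE id > 12345 ORDER BY id LIMIT 1000"})

# Target side: consume() blocks and yields tasks in arrival order.
for task in queue.consume():
    replay_on_target_and_store(task)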
Calculate crc32 directly in the database, or fetch the data and calculate it in memory? I checked the source code of pt-table-checksum: it calculates in the database. But as mentioned in the first section, if the target and source libraries need to use different character sets to read the correct data, they can only be compared after being queried out. So px-table-checksum supports both, controlled by a single configuration item.
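The in-memory mode works roughly like the sketch below (an illustration in the Python 2.7 style the tool targets, not its exact code):

import zlib

def chunk_crc32(rows):
    # Sketch: join each row's columns with '#' (mirroring CONCAT_WS in the
    # in-database mode) and fold the per-row CRC32 values together with XOR.
    crc = 0
    for row in rows:
        row_str = '#'.join('' if v is None else str(v) for v in row)
        crc ^= zlib.crc32(row_str) & 0xffffffff
    return crc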
When multiple tables are checked at the same time, the source-library SQL piles up in the queue, and by the time the target library takes it out and executes it, a second may have passed. If the source data has been modified again in the meantime and synchronized to the target library, the calculated results will differ even though the data is actually consistent. There is no way to handle this, and it is the biggest flaw of px-table-checksum compared with pt-table-checksum.
To reduce such problems as much as possible (master-slave delay can also cause them), multiple Redis queues are used and the target library has multiple checking threads. For example, if 8 tables are checked at the same time, there are 8 source-library checking threads; but depending on how heavily the tables are written, you can configure, say, 4 Redis queues (tasks are currently added to them randomly) and 10 target-library checking threads to reduce the inaccuracy. In my view, inconsistent data will be recorded anyway: if there is not much, check it manually; if there is a lot, run the check again; and if the same data is inconsistent twice, then something is really wrong.
3. Limitations
If the source table data changes frequently during the check, the results may be inaccurate; this is the timing problem discussed at the end of the previous section. Obviously, each transaction run by this program is separate, unlike the pt tool, which strictly guarantees the transactional order of each checking SQL. But if inconsistent data shows up, check it again and it will usually be fine. In my actual online use it has been 99.9% accurate.
The table must have a primary key or unique index; the program checks for this and exits if there is none.
Varbinary, blob and other binary fields do not support repair.
Strictly speaking they are not completely unsupported; it depends on usage. If the application converts characters to bytes before storing them in MySQL, repair is not supported. There is a way to deal with it: use the hex() function when querying from the source library, and wrap the value with unhex() in the repair SQL before writing it back, as in the sketch below.
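The idea is roughly the following (illustrative table, column names and values only):

# Read binary columns as hex when checking from the source library ...
check_sql = ("SELECT id, HEX(bin_col) FROM db1.t_demo "
             "WHERE id > 12345 ORDER BY id LIMIT 1000")
# ... and write them back through UNHEX() in the generated repair statement
# (the actual values are filled in by the repair generator).
repair_sql = "UPDATE db1.t_demo SET bin_col = UNHEX('9FB2') WHERE id = 67890"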
4. Instructions for use
This Python program is developed on 2.7 and has not been tested on 2.6 or 3.x. MySQLdb and hotqueue need to be installed before use:
$ sudo pip install MySQL-python hotqueue
The tables to compare and all options are set in configuration files rather than specified on the command line (forgive this; handling command-line arguments would add too much code).
4.1 px-table-checksum.py
The main program. Run python px-table-checksum.py to perform the consistency check, but be sure to understand the configuration file options below first.
4.2 settings_checksum.py
Configuration options
CHUNK_SIZE: The number of chunk rows extracted each time
REDIS_INFO: Specify the redis queue address to use
REDIS_QUEUE_CNT: The number of Redis queues; the consumer (target library) has one thread guarding each queue
REDIS_POOL_CNT: Size of the producer (source library) Redis client connection pool. This design alleviates problems caused by the GIL and separates the enqueuing end from the dequeuing end, because when there are many tables a large amount of SQL may be queued up in a short time, and it avoids hotqueue contention
CALC_CRC32_DB: True means the checksum value is calculated in the DB; False means the chunk data is fetched and calculated in Python. The default value given is based on the connection character set
DO_COMPARE: Operation mode
0: Only extract data and calculate, without comparing consistency. You can run only the comparison later using mode 2
1: Calculate and compare. Commonly used; the previous results of the tables being checked are deleted before each calculation. The comparison result only tells you which chunk numbers are inconsistent
2: Do not calculate, only compare using the existing t_checkum results. Commonly used, since calculation consumes database resources; you can compare only the inconsistencies in the checksum results already calculated. Similar to the --replicate-check-only option of the pt tool
GEN_DATAFIX:
Used in combination with DO_COMPARE. If True, for inconsistent chunks the specific inconsistent rows are located and repair SQL is generated; if False, nothing is done.
RUN_DATAFIX:
Used in combination with GEN_DATAFIX. If True, the generated repair SQL is executed automatically on the target library; be cautious with this and keep it False until you have reviewed the repair SQL.
DB_CHECKSUM: A dictionary that specifies where the checksum results are stored (an illustrative example of this settings file is sketched below).
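An illustrative settings_checksum.py might look like the following; the option names are the ones described above, but the concrete values and the dictionary keys are assumptions, so check the sample configuration shipped in the repository.

# settings_checksum.py (illustrative values only)
CHUNK_SIZE      = 2000
REDIS_INFO      = {'host': '127.0.0.1', 'port': 6379, 'db': 0}    # assumed keys
REDIS_QUEUE_CNT = 4
REDIS_POOL_CNT  = 4
CALC_CRC32_DB   = True       # compute the checksum inside MySQL
DO_COMPARE      = 1          # calculate and compare
GEN_DATAFIX     = True       # generate repair SQL for inconsistent chunks
RUN_DATAFIX     = False      # do not apply the repair SQL automatically
DB_CHECKSUM     = {'host': '127.0.0.1', 'port': 3306, 'user': 'checksum',
                   'passwd': '***', 'db': 'px_checksum'}           # assumed keys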
4.3 settings_cs_tables.py
The configuration file above can be thought of as controlling the program; this one specifies the source and target library connection information and which tables to verify.
TABLES_CHECK: Dictionary specifying which tables to check for consistency; the db name is the key and a list of table names is the value. Checking an entire db is not supported yet, and checking more than 8 tables at the same time is not recommended
DB_SOURCE: Dictionary, specifying the connection information of the source library
DB_TARGET: Dictionary, specifying the connection information of the target library (an illustrative example of this settings file is sketched below)
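An illustrative settings_cs_tables.py; the values and the dictionary key names are assumptions, so see the sample file in the repository.

# settings_cs_tables.py (illustrative values only)
TABLES_CHECK = {
    'db1': ['t_order', 't_order_detail'],
    'db2': ['t_user'],
}
DB_SOURCE = {'host': '10.0.0.1', 'port': 3306, 'user': 'checker', 'passwd': '***'}
DB_TARGET = {'host': '10.0.0.2', 'port': 3306, 'user': 'checker', 'passwd': '***'}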
