MySQL 5.7 supports the GB18030 Chinese Character Set

Jun 01, 2016 pm 01:16 PM

My former boss at MySQL sent out a notice that MySQL 5.7.4 now supports the GB18030 character set, thus responding to requests that have been appearing since 2005. This is a big deal because the Chinese government demands GB18030 support, and because the older simplified-Chinese character sets (gbk and gb2312) have a much smaller repertoire (that is, they have too few characters). And this is real GB18030 support -- I can define columns and variables with CHARACTER SET GB18030. That's rare -- Oracle 12c and SQL Server 2012 and PostgreSQL 9.3 can't do it. (They allow input from GB18030 clients but they convert it immediately to Unicode.) Among big-time DBMSs, until now, only DB2 has treated GB18030 as a first-class character set.

Standard Adherence

We're talking about the current version of the standard, GB18030-2005 "IT Chinese coded character set", especially its description of 70,244 Chinese characters. I couldn't puzzle out the Chinese wording in the official document; all I could do was use translate.google.com on some excerpts. But I've been told that the MySQL person who coded this feature is Chinese, so they'll have had better luck. What I could understand was which characters are difficult, what the requirements are for a claim of support, and what the encoding should look like. From the coder's comments, it's clear that part was understood. I did not check whether there was adherence for non-mandatory parts, such as Tibetan script.
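To give a feel for "what the encoding should look like": GB18030 is a variable-width encoding that uses one, two, or four bytes per character. The sketch below uses Python's built-in gb18030 codec as a stand-in, which is an assumption on my part; it is a separate implementation from MySQL's and may differ in corner cases. The sample characters and the helper name gb18030_length are illustrative and not taken from the standard or from MySQL.

```python
# Sample characters chosen for illustration only.
samples = {
    "A": "ASCII letter",
    "中": "common Chinese character (also in gb2312)",
    "ཀ": "Tibetan letter KA, outside the two-byte range",
    "\U00020000": "CJK Extension B character, outside the BMP",
}


def gb18030_length(ch: str) -> int:
    """Return how many bytes Python's gb18030 codec uses for one character."""
    return len(ch.encode("gb18030"))


for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {gb18030_length(ch)} byte(s)")
```

ASCII stays at one byte, characters inherited from the older two-byte sets stay at two, and everything else takes four.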

Conversions

The repertoire of GB18030 ought to be the same as the Unicode repertoire. So I took a list of every Unicode character, converted to GB18030, and converted back to Unicode. The result in every case was the same Unicode character that I'd started with. That's called "perfect round tripping". As I explained in an earlier blog post, "The UTF-8 World Is Not Enough", storing Chinese characters with a Chinese character set has certain advantages. Well, now the biggest disadvantage has disappeared.
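The round-trip check can be sketched outside the server as well. This is not the test that was run against MySQL; it is a minimal Python version that assumes the built-in gb18030 codec, and the function names are hypothetical. Python's codec is an independent implementation, so its result says nothing definite about MySQL's converter.

```python
import sys


def round_trip_failures(code_points):
    """Return the code points that do not survive Unicode -> GB18030 -> Unicode."""
    failures = []
    for cp in code_points:
        ch = chr(cp)
        try:
            if ch.encode("gb18030").decode("gb18030") != ch:
                failures.append(cp)
        except UnicodeError:
            # One of the two conversions refused the character.
            failures.append(cp)
    return failures


def all_scalar_values():
    """Yield every Unicode code point except the surrogates U+D800..U+DFFF."""
    for cp in range(sys.maxunicode + 1):
        if not 0xD800 <= cp <= 0xDFFF:
            yield cp


if __name__ == "__main__":
    bad = round_trip_failures(all_scalar_values())
    print(f"{len(bad)} code points failed to round-trip")
```

Surrogates are skipped because they are not characters and cannot be encoded on their own.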

Hold on -- how is perfect round tripping possible, given that MySQL frequently refers to Unicode 4.0, and some of the characters in GB18030-2005 are only in Unicode 4.1? Certainly that ought to be a problem according to the Unicode FAQ and this extract from Ken Lunde's book. But it turns out to be okay because MySQL doesn't actually disallow those characters -- it accepts encodings which are not assigned to characters. Of course I believe that MySQL should have upgraded the Unicode support first, and added GB18030 support later. But the best must not be an enemy of the good.
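The same idea, that a converter can handle a code point without caring whether a character is assigned to it, can be illustrated in Python. This is only an analogy under the assumption that Python's gb18030 codec behaves this way; it does not show how MySQL does it. The helper first_unassigned_code_point is hypothetical, and which code point it finds depends on the Unicode data shipped with the Python build.

```python
import unicodedata


def first_unassigned_code_point(start=0x80):
    """Return the first code point at or after `start` that this Python
    build's Unicode data lists as unassigned (general category Cn)."""
    for cp in range(start, 0x110000):
        if unicodedata.category(chr(cp)) == "Cn":
            return cp
    raise ValueError("no unassigned code point found")


cp = first_unassigned_code_point()
ch = chr(cp)
encoded = ch.encode("gb18030")

print(f"U+{cp:04X} category={unicodedata.category(ch)} bytes={encoded.hex()}")
print("round trip ok:", encoded.decode("gb18030") == ch)
```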

Also the conversions to and from gb2312 work fine, so I expect that eventually gb2312 will become obsolete. It's time for mainland Chinese users to consider switching over to gb18030 once MySQL 5.7 is GA.

Collations

The new character set comes with three collations: one trivial, one tremendous, one tsk, tsk.

The trivial collation is gb18030_bin. As always the bin stands for binary. I expect that as always this will be the most performant collation, and the only one that guarantees that no two characters will ever have the same weight.
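As a rough illustration of what a binary ordering means for a Chinese character set, the sketch below sorts two characters by their GB18030 bytes and by their Unicode code points. It assumes that "binary" can be modeled as a comparison of encoded byte strings; MySQL may order characters of different byte lengths differently, so treat this as a sketch rather than a description of gb18030_bin. The two sample characters are my own choice.

```python
words = ["一", "啊"]  # U+4E00 and U+554A, chosen for illustration

# Python's default string comparison is by Unicode code point.
by_code_point = sorted(words)

# Assumed model of a binary collation: compare the encoded bytes.
by_gb18030_bytes = sorted(words, key=lambda s: s.encode("gb18030"))

print("code point order:", by_code_point)
print("gb18030 byte order:", by_gb18030_bytes)
```

The two orders differ because the two-byte codes inherited from gb2312 (0xB0A1 for 啊, 0xD2BB for 一) are not arranged in Unicode order. Distinct characters always have distinct byte sequences, which is why a comparison of this kind never treats two different characters as equal.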

The tremendous collation is gb18030_unicode_520_ci. The "unicode_520" part of the name really does mean that the collation table comes from "Unicode 5.2" and this is the first time that MySQL has taken to heart the maxim: what applies to the superset can apply to the subset. In fact all MySQL character sets should have Unicode collations, because all their characters are in Unicode. So to test this, I went through all the Unicode characters and their GB18030 equivalents, and compared their weights with WEIGHT_STRING:
WEIGHT_STRING(utf32_char COLLATE utf32_unicode_520_ci) to
WEIGHT_STRING(gb18030_char COLLATE gb18030_unicode_520_ci).
Every utf32 weight was exactly the same as the gb18030 weight.

The tsk, tsk collation is gb18030_chinese_ci.

The first bad thing is the suffix chinese_ci, which will make some people think that this collation is like gb2312_chinese_ci. (Such confusion has happened before for the general_ci suffix.) In fact there are thousands of differences between gb2312_chinese_ci and gb18030_chinese_ci. Here's an example.

mysql> CREATE TABLE t5
    -> (gb2312 CHAR CHARACTER SET gb2312 COLLATE gb2312_chinese_ci,
    -> gb18030 CHAR CHARACTER SET gb18030 COLLATE gb18030_chinese_ci);
Query OK, 0 rows affected (0.22 sec)

mysql> INSERT INTO t5 VALUES ('[','['),(']',']');
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT DISTINCT gb2312 from t5 ORDER BY gb2312;
+--------+
| gb2312 |
+--------+
| ]      |
| [      |
+--------+
2 rows in set (0.00 sec)

mysql> SELECT DISTINCT gb18030 from t5 ORDER BY gb18030;
+---------+
| gb18030 |
+---------+
| [       |
| ]       |
+---------+
2 rows in set (0.00 sec)

See the difference? The gb18030 order is obviously better -- ']' should be greater than '[' -- but when two collations are wildly different they shouldn't both be called "chinese_ci".

The second bad thing is the algorithm. The new chinese_ci collation is based on pinyin for Chinese characters, and binary comparisons of the UPPER() values for non-Chinese characters. This is pretty well useless for non-Chinese. I can bet that somebody will observe "well, duh, it's a Chinese character set" -- but I can't see why one would use an algorithm for Latin/Greek/Cyrillic/etc. characters that's so poor. There's a Common Locale Data Repository for tailoring for Chinese, and there are MySQL worklog tasks that explain the brave new world; there's no need to invent an idiolect when there's a received dialect.
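To see why "binary comparison of the UPPER() values" is weak for Latin text, here is a toy model in Python. It covers only the non-Chinese branch as described above, it is not MySQL's implementation, and it assumes Python's str.upper() and gb18030 codec are close enough stand-ins for UPPER() and the server's encoding. The function name and word list are hypothetical.

```python
def non_chinese_sort_key(text: str) -> bytes:
    """Toy model of the rule described above for non-Chinese characters:
    upper-case the text, then compare the GB18030 bytes."""
    return text.upper().encode("gb18030")


words = ["zebra", "éclair", "apple", "Apple"]
ordered = sorted(words, key=non_chinese_sort_key)

print(ordered)  # ['apple', 'Apple', 'zebra', 'éclair']
```

Case is folded, so "apple" and "Apple" compare equal, but the accented "éclair" lands after "zebra" because every non-ASCII character encodes to bytes above 0x80. A Unicode-aware collation would normally place it next to the other words beginning with e.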

Documentation

The documentation isn't up to date yet -- there's no attempt to explain what the new character set and its collations are about, and no mention at all in the FAQ.

But the worklog task WL#4024: gb18030 Chinese character set gives a rough idea of what the coder had in mind before starting. It looks as if WL#4024 was partly copied from http://icu-project.org/docs/papers/unicode-gb18030-faq.html so that's also worth a look.

For developers who just need to know what's going on now, just re-read this blog post. What I've described should be enough for people who care about Chinese.

I didn't look for bugs with full-text or LIKE searches, and I didn't look at speed. But I did look hard for problems with the essentials, and found none. Congratulations are due.
