


How to convert between Unicode and UTF-8 encoding in PHP
I recently needed to convert between Unicode code points and UTF-8, so I checked PHP's library functions. I couldn't find a function that encodes and decodes Unicode strings the way I needed. Well, if you can't find one, just implement it yourself.
The difference between Unicode and Utf-8 encoding
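Unicode is a character set that assigns each character a numeric code point, and UTF-8 is one of several ways to encode those code points as bytes. When "Unicode encoding" is described as fixed-length and double-byte, what is usually meant is UCS-2 (or UTF-16 for BMP characters), which stores each character in two bytes. UTF-8 is variable-length. A Chinese character in the BMP therefore occupies two bytes in the double-byte form and three bytes in UTF-8, so the double-byte form is one byte shorter.
As originally designed, a UTF-8 encoded character can be up to 6 bytes long, although the current standard (RFC 3629) limits UTF-8 to 4 bytes, which covers code points up to U+10FFFF. Characters in the 16-bit BMP (Basic Multilingual Plane) need at most 3 bytes. Let's take a look at the UTF-8 encoding table: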
U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The x positions are filled with the binary representation of the character's code point, with the rightmost x holding the least significant bit. Only the shortest sequence that can represent a given code point is used. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence. The first row starts with 0 to stay compatible with ASCII, which uses one byte; the second row is a two-byte sequence; the third row is three bytes, which is where commonly used Chinese characters fall; and so on. (Personal view: you can simply read the number of leading 1s as the number of bytes.)
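The rule about leading 1 bits can be sketched in a few lines of PHP. This is only an illustration: the function name utf8_sequence_length is made up for this example, and the 5-byte and 6-byte rows are included only because they appear in the table above (modern UTF-8 stops at 4 bytes).

```php
<?php
// Hypothetical helper (not part of PHP): given the first byte of a UTF-8
// sequence as an integer, return how many bytes the whole sequence uses.
// Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
function utf8_sequence_length(int $leadByte): int
{
    if (($leadByte & 0x80) === 0x00) { return 1; } // 0xxxxxxx (ASCII)
    if (($leadByte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($leadByte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($leadByte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    if (($leadByte & 0xFC) === 0xF8) { return 5; } // 111110xx (original design only)
    if (($leadByte & 0xFE) === 0xFC) { return 6; } // 1111110x (original design only)
    return 0;
}

// "\xE4\xBD\xA0" is the UTF-8 byte sequence of the Chinese character "你".
echo utf8_sequence_length(ord("\xE4\xBD\xA0"[0])), "\n"; // 3
```

Each test masks off the prefix bits of one table row and compares them with that row's fixed pattern, so the first matching row gives the length.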
How to convert Unicode to Utf-8
To convert Unicode to UTF-8, you first need to know how the two differ, so let's look at how a Unicode code point is turned into UTF-8. If the code point of a character is less than 0x80 (128), it is an ASCII character: it occupies one byte and needs no conversion, because UTF-8 is compatible with ASCII. Now take the Chinese character "你", whose Unicode code point is "\u4F60". In binary that is 100111101100000, and it is then converted according to the UTF-8 format. The bits are taken from the Unicode binary value from low to high, six at a time, and placed into the x positions of the format shown below. The fixed prefix bits are filled in according to the format, and if the highest group has fewer bits than its byte has x positions, it is padded with 0 on the left.
unicode: 100111101100000 4F60
utf-8: 11100100,10111101,10100000 E4BDA0
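To make the grouping visible, here is a small sketch that splits a code point into the 4 + 6 + 6 bit groups used by the three-byte format and prints them with their prefixes. The helper name split_three_byte_groups is an assumption made for this example, and it only makes sense for code points from U+0800 to U+FFFF.

```php
<?php
// Hypothetical helper: split a code point in the three-byte range
// (U+0800 to U+FFFF) into its 4-bit, 6-bit and 6-bit groups as strings.
function split_three_byte_groups(int $codePoint): array
{
    // Pad to 16 bits so the highest group always has 4 digits.
    $bits = str_pad(decbin($codePoint), 16, '0', STR_PAD_LEFT);
    return [substr($bits, 0, 4), substr($bits, 4, 6), substr($bits, 10, 6)];
}

[$high, $mid, $low] = split_three_byte_groups(0x4F60);
// Prepend the fixed prefixes 1110, 10 and 10 from the encoding table.
echo "1110{$high},10{$mid},10{$low}\n"; // 11100100,10111101,10100000
```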
The comparison above shows directly how a Unicode code point maps to UTF-8. Once you know the UTF-8 format, you can also perform the inverse operation: take the bits out of their positions in each byte and reassemble them into the code point. Both directions can be done with bit shifts. For example, the code point of "你" is at least 0x800 and less than 0x10000, so it is stored in three bytes. For the first byte, shift the code point right by 12 bits and OR (|) the result with 11100000 (0xE0), the lead-byte prefix of the three-byte format. For the second byte, shift right by 6 bits, AND (&) with 111111 (0x3F) to discard the bits that already went into the first byte, and then OR (|) with 10000000 (0x80). The third byte needs no shift: take the lowest six bits directly (& with 111111 (0x3F)) and OR (|) with 10000000 (0x80).
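The same shift-and-mask idea extends to the other rows of the table. The following is a minimal sketch, not a hardened implementation: the function name codepoint_to_utf8 is invented for this example, and it covers the 1-byte to 4-byte forms allowed by current UTF-8.

```php
<?php
// Hypothetical helper: encode one Unicode code point (as an integer)
// into its UTF-8 byte string, using shifts and masks only.
function codepoint_to_utf8(int $cp): string
{
    // Reject values outside Unicode and the UTF-16 surrogate range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a valid Unicode code point');
    }
    if ($cp < 0x80) {            // 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepoint_to_utf8(0x4F60)), "\n"; // e4bda0
```

The output is shown as hex with bin2hex so the result does not depend on how the terminal or browser renders the bytes.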
How to convert Utf-8 back to Unicode
The conversion from UTF-8 back to Unicode also uses shifts: extract the payload bits from their positions in each byte and put them back together. In the example above, "你" takes three bytes, so each byte is processed in turn, from the high byte to the low byte. In UTF-8, "你" is 11100100,10111101,10100000. From the first byte, 11100100, take out "0100" by ANDing (&) with 1111 (0x0F); for a valid three-byte lead byte, ANDing with 11111 (0x1F) gives the same result, because the fourth prefix bit is always 0. Two more groups of six bits follow, so these four bits belong above bit 12: shift the result left by 12 bits to get 0100,000000,000000. From the second byte, take out "111101" by ANDing (&) 10111101 with 111111 (0x3F), shift the result left by 6 bits, and OR (|) it with the previous value to get 0100,111101,000000. Finally, AND (&) the last byte with 111111 (0x3F) and OR (|) it with the previous result to get 0100,111101,100000, which is 0x4F60.
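Decoding can be generalized in the same way. The sketch below is an assumption-laden illustration: the name utf8_to_codepoint is made up, it handles the 1-byte to 4-byte forms, and it does not reject overlong encodings or surrogates. Instead of shifting each group to its final position, it shifts the accumulated value left by 6 bits for every continuation byte, which produces the same result.

```php
<?php
// Hypothetical helper: decode the first UTF-8 character of $bytes
// and return its Unicode code point as an integer.
function utf8_to_codepoint(string $bytes): int
{
    if ($bytes === '') {
        throw new InvalidArgumentException('Empty input');
    }
    $lead = ord($bytes[0]);
    if ($lead < 0x80) {                      // 0xxxxxxx: plain ASCII
        return $lead;
    }
    if (($lead & 0xE0) === 0xC0) {           // 110xxxxx
        $len = 2; $cp = $lead & 0x1F;
    } elseif (($lead & 0xF0) === 0xE0) {     // 1110xxxx
        $len = 3; $cp = $lead & 0x0F;
    } elseif (($lead & 0xF8) === 0xF0) {     // 11110xxx
        $len = 4; $cp = $lead & 0x07;
    } else {
        throw new InvalidArgumentException('Invalid lead byte');
    }
    if (strlen($bytes) < $len) {
        throw new InvalidArgumentException('Truncated sequence');
    }
    for ($i = 1; $i < $len; $i++) {
        $byte = ord($bytes[$i]);
        if (($byte & 0xC0) !== 0x80) {       // must look like 10xxxxxx
            throw new InvalidArgumentException('Invalid continuation byte');
        }
        // Make room for 6 more bits, then append the payload bits.
        $cp = ($cp << 6) | ($byte & 0x3F);
    }
    return $cp;
}

// "\xE4\xBD\xA0" is the UTF-8 byte sequence of "你".
printf("%04X\n", utf8_to_codepoint("\xE4\xBD\xA0")); // 4F60
```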
PHP code implementation:
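/**
 * Convert a single three-byte UTF-8 character to its Unicode code point.
 * @param string $utf8_str One UTF-8 character in the range U+0800 to U+FFFF
 * @return string The Unicode code point as a hexadecimal string
 */
function utf8_str_to_unicode($utf8_str) {
    // The lead byte 1110xxxx carries the top 4 bits. 0x0F keeps exactly those
    // bits (0x1F gives the same result here, since the next bit is always 0).
    $unicode = (ord($utf8_str[0]) & 0x0F) << 12;
    // Each continuation byte 10xxxxxx carries 6 more bits.
    $unicode |= (ord($utf8_str[1]) & 0x3F) << 6;
    $unicode |= (ord($utf8_str[2]) & 0x3F);
    return dechex($unicode);
}

/**
 * Convert a Unicode code point to a single three-byte UTF-8 character.
 * @param string $unicode_str Code point as a hexadecimal string (0800 to FFFF)
 * @return string The UTF-8 character
 */
function unicode_to_utf8($unicode_str) {
    // Note: the code must be an integer so that the bitwise operations work correctly.
    $code = intval(hexdec($unicode_str));
    $ord_1 = 0xE0 | ($code >> 12);
    $ord_2 = 0x80 | (($code >> 6) & 0x3F);
    $ord_3 = 0x80 | ($code & 0x3F);
    return chr($ord_1) . chr($ord_2) . chr($ord_3);
}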
Testing it:
$utf8_str = '我';
// This is the Unicode code point of the Chinese character "你"
$unicode_str = '4f60';
// Outputs 6211
echo utf8_str_to_unicode($utf8_str) . "<br/>";
// Outputs the Chinese character "你"
echo unicode_to_utf8($unicode_str);
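The conversions above were tested with Chinese characters (non-ASCII), and they only support converting a single character at a time (one complete UTF-8 character or one complete Unicode code point) in each direction. I hope this helps with your learning.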
