How to convert between Unicode and UTF-8 encoding in PHP

WBOY

Jul 13, 2016, 09:45 AM

php unicode utf8

How does PHP convert between Unicode and UTF-8 encoding?

I recently needed to convert Unicode encodings, so I went through PHP's library functions and could not find one that encodes and decodes Unicode strings. Well, if you can't find one, just implement it yourself.

The difference between Unicode and UTF-8 encoding

Unicode is a character set, and UTF-8 is one of its encodings. What is loosely called "Unicode encoding" here is the fixed-length, double-byte form (UCS-2), in which every BMP character occupies two bytes, while UTF-8 is variable-length. For Chinese characters, the double-byte form takes one byte less than UTF-8: two bytes, against three bytes in UTF-8.
As originally designed, a UTF-8 encoded character can be up to 6 bytes long (the current standard, RFC 3629, limits it to 4 bytes), but characters in the 16-bit BMP (Basic Multilingual Plane) need at most 3 bytes. Let's take a look at the UTF-8 encoding table:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The x positions are filled with the binary representation of the character's code number; the further right an x is, the less significant the bit. Only the shortest multi-byte sequence that can express a given code number is used. Note that in a multi-byte sequence, the number of leading 1s in the first byte equals the number of bytes in the whole sequence. The first row starts with 0 for compatibility with ASCII and is a single byte; the second row is a two-byte sequence; the third row is three bytes, which is where Chinese characters fall; and so on. (Personal opinion: you can simply treat the number of leading 1s as the number of bytes.)
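The leading-1s rule can be sketched as a small lookup on the first byte. The function name `utf8_sequence_length` is hypothetical (it is not part of PHP or of this article's code), and the sketch assumes a non-empty string; it follows the six rows of the table, returning 0 for a continuation byte or any other byte that cannot start a sequence.

```php
<?php
// Hypothetical helper: length of a UTF-8 sequence, judged from its first byte only.
// Assumes $first_byte is a non-empty byte string.
function utf8_sequence_length(string $first_byte): int {
    $b = ord($first_byte[0]);
    if ($b < 0x80)            { return 1; } // 0xxxxxxx (ASCII)
    if (($b & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($b & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($b & 0xF8) === 0xF0) { return 4; } // 11110xxx
    if (($b & 0xFC) === 0xF8) { return 5; } // 111110xx (original design only)
    if (($b & 0xFE) === 0xFC) { return 6; } // 1111110x (original design only)
    return 0; // 10xxxxxx continuation byte, or 0xFE/0xFF
}

echo utf8_sequence_length('A'), "\n";            // 1
echo utf8_sequence_length("\xE4\xBD\xA0"), "\n"; // 3
```

Only the first byte is inspected, so the result says how many bytes the sequence claims to have, not whether the following bytes are valid.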

How to convert Unicode to Utf-8

To convert Unicode to UTF-8, you first need to know how the two differ, so let's look at how a Unicode code becomes UTF-8. If a character's code is less than 0x80 (128), it is an ASCII character: it occupies one byte and needs no conversion, because UTF-8 is compatible with ASCII. Now take the Chinese character "你" (you), whose Unicode code is U+4F60, or 100111101100000 in binary. To convert it the UTF-8 way, take bits from the Unicode binary from low to high, six at a time, and place them in the x positions of the format, as shown below. The leading bits of each byte are filled in according to the format, and if the bits left for the first byte do not fill all of its x positions, the rest are padded with 0.

unicode: 100111101100000 4F60

utf-8: 11100100,10111101,10100000 E4BDA0

From the above you can see directly how Unicode and UTF-8 correspond. Once you know the UTF-8 format you can also perform the inverse operation: take the bits out of their positions in the format and reassemble them into the Unicode value. Both directions can be done with bit shifts. In the conversion of "你" above, the value is at least 0x800 and less than 0x10000, so it is stored in three bytes. For the first byte, shift the value right by 12 bits and OR (|) it with 11100000 (0xE0), as the three-byte format requires. For the second byte, shift the value right by 6 bits; the result still contains the bits that belong to the first byte, so AND (&) it with 111111 (0x3F) to keep only the low six bits, then OR (|) it with 10000000 (0x80). The third byte needs no shift: take the last six bits directly (& with 111111 (0x3F)) and OR (|) them with 10000000 (0x80).
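As a worked trace of those three steps, the following sketch encodes U+4F60 by hand and prints each byte in binary and hex. The variable names are illustrative only; the continuation-byte prefix used is 10000000 (0x80), as the 10xxxxxx pattern in the table requires.

```php
<?php
$code = 0x4F60; // Unicode code of "你"

// Three-byte format: 1110xxxx 10xxxxxx 10xxxxxx
$byte1 = 0xE0 | ($code >> 12);         // top 4 bits:    0100   -> 11100100
$byte2 = 0x80 | (($code >> 6) & 0x3F); // middle 6 bits: 111101 -> 10111101
$byte3 = 0x80 | ($code & 0x3F);        // low 6 bits:    100000 -> 10100000

printf("%08b %08b %08b\n", $byte1, $byte2, $byte3); // 11100100 10111101 10100000
printf("%02X%02X%02X\n", $byte1, $byte2, $byte3);   // E4BDA0
```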

How to convert Utf-8 back to Unicode

The conversion from UTF-8 back to Unicode is also done with shifts: you extract the bits from their positions in the UTF-8 format. In the example above, "你" is three bytes, so each byte must be processed, from the high byte to the low byte. In UTF-8, "你" is 11100100,10111101,10100000. Start with the first byte, 11100100, and take out the "0100". This is simple: AND (&) it with 1111 (0x0F). Because there are three bytes and each of the two following bytes contributes six bits, these four bits belong above the low 12 bits, so the result is shifted left by 12 bits, giving 0100,000000,000000. From the second byte, 10111101, take out "111101" by ANDing (&) it with 111111 (0x3F). Shift that left by 6 bits and OR (|) it with the result from the first byte, giving 0100,111101,000000. Finally, AND (&) the last byte with 111111 (0x3F) and OR (|) it with the previous result to get 0100,111101,100000.
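The same trace in the other direction, as a sketch with illustrative variable names. The first byte of a three-byte sequence has the form 1110xxxx, so the mask used here is 0x0F, which keeps exactly its four payload bits.

```php
<?php
$bytes = "\xE4\xBD\xA0"; // UTF-8 bytes of "你"

$high = (ord($bytes[0]) & 0x0F) << 12; // 0100,000000,000000
$mid  = (ord($bytes[1]) & 0x3F) << 6;  //      111101,000000
$low  =  ord($bytes[2]) & 0x3F;        //             100000

$code = $high | $mid | $low;
printf("%016b\n", $code); // 0100111101100000
printf("%04X\n", $code);  // 4F60
```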

PHP code implementation:

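/**
 * Convert one UTF-8 character to its Unicode code.
 * Only handles three-byte characters (U+0800 to U+FFFF), such as Chinese characters.
 * @param string $utf8_str A single UTF-8 character
 * @return string      The Unicode code as a hexadecimal string
 */
function utf8_str_to_unicode($utf8_str) {
  // First byte is 1110xxxx: keep its low 4 bits and move them above the other 12
  $unicode = (ord($utf8_str[0]) & 0x0F) << 12;
  // Following bytes are 10xxxxxx: keep the low 6 bits of each
  $unicode |= (ord($utf8_str[1]) & 0x3F) << 6;
  $unicode |= (ord($utf8_str[2]) & 0x3F);
  return dechex($unicode);
}

/**
 * Convert one Unicode code to a UTF-8 character.
 * Only handles codes that need three bytes (U+0800 to U+FFFF).
 * @param string $unicode_str The Unicode code as a hexadecimal string
 * @return string       The UTF-8 character
 */
function unicode_str_to_utf8($unicode_str) {
  // Note: the code must be an integer so that the bitwise operations work correctly
  $code = intval(hexdec($unicode_str));
  $ord_1 = 0xE0 | ($code >> 12);
  $ord_2 = 0x80 | (($code >> 6) & 0x3F);
  $ord_3 = 0x80 | ($code & 0x3F);
  return chr($ord_1) . chr($ord_2) . chr($ord_3);
}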

Testing it:

$utf8_str = '我';

// This is the Unicode code of the Chinese character "你"
$unicode_str = '4f60';

// Outputs 6211
echo utf8_str_to_unicode($utf8_str) . "<br/>";

// Outputs the Chinese character "你"
echo unicode_str_to_utf8($unicode_str);

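The conversions above were tested on Chinese characters (non-ASCII) and only support converting a single character at a time (one complete UTF-8 character or one complete Unicode code). I hope this helps with your learning.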
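For characters outside the three-byte range, the same shift-and-mask idea can be extended to the first four rows of the encoding table. The sketch below is one possible generalisation, not the article's code: the names `codepoint_to_utf8` and `utf8_to_codepoint` are hypothetical, they work on integers rather than hex strings, and they do not check continuation bytes, overlong forms, or surrogates.

```php
<?php
// Hypothetical generalisation: encode an integer code (0 to 0x10FFFF) as UTF-8.
function codepoint_to_utf8(int $cp): string {
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException('Code out of range');
    }
    if ($cp < 0x80) {           // 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {          // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {        // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical generalisation: decode the first UTF-8 character of $char.
function utf8_to_codepoint(string $char): int {
    if ($char === '') {
        throw new InvalidArgumentException('Empty string');
    }
    $b0 = ord($char[0]);
    if ($b0 < 0x80) {
        return $b0;
    } elseif (($b0 & 0xE0) === 0xC0) {
        $cp = $b0 & 0x1F; $extra = 1;
    } elseif (($b0 & 0xF0) === 0xE0) {
        $cp = $b0 & 0x0F; $extra = 2;
    } elseif (($b0 & 0xF8) === 0xF0) {
        $cp = $b0 & 0x07; $extra = 3;
    } else {
        throw new InvalidArgumentException('Invalid first byte');
    }
    if (strlen($char) < $extra + 1) {
        throw new InvalidArgumentException('Truncated sequence');
    }
    // Each continuation byte contributes its low 6 bits
    for ($i = 1; $i <= $extra; $i++) {
        $cp = ($cp << 6) | (ord($char[$i]) & 0x3F);
    }
    return $cp;
}

echo bin2hex(codepoint_to_utf8(0x4F60)), "\n";         // e4bda0
echo dechex(utf8_to_codepoint("\xE6\x88\x91")), "\n";  // 6211
```

Where the mbstring extension is available (PHP 7.2 or later), `mb_chr()` and `mb_ord()` cover the same ground and are probably the safer choice for production code.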
