What encoding is utf-8?-Common Problem-php.cn

What encoding is utf-8?

青灯夜游

Oct 21, 2020 pm 04:25 PM

UTF-8 is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and its single-byte range is identical to ASCII, so software written to process ASCII text can continue to be used with little or no modification.

What encoding is utf-8?

UTF-8 (8-bit Universal Character Set/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and its single-byte range is identical to ASCII, so software written to process ASCII text can continue to be used with little or no modification. For this reason it has gradually become the preferred encoding for email, web pages, and other applications that store or transmit text.

Basic features

UCS characters U+0000 to U+007F (ASCII) are encoded as the bytes 0x00 to 0x7F (ASCII compatible). This means that a file containing only 7-bit ASCII characters is byte-for-byte identical in ASCII and in UTF-8.
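The ASCII compatibility described above can be checked directly. The following is a minimal sketch in Python; the sample string and variable names are chosen here purely for illustration.

```python
# A string that contains only 7-bit ASCII characters.
text = "Hello, UTF-8!"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII text the two encodings produce identical bytes,
# and no byte has its highest bit set.
same_bytes = ascii_bytes == utf8_bytes
highest_byte = max(utf8_bytes)

print(same_bytes)            # True
print(highest_byte <= 0x7F)  # True
```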

Every UCS character above U+007F is encoded as a sequence of several bytes, each of which has its highest bit set. As a result, an ASCII byte (0x00 to 0x7F) can never appear as part of another character. The first byte of a multibyte sequence is always in the range 0xC0 to 0xFD and indicates how many bytes the sequence contains (this is the range of the original design, which allowed up to 6 bytes; the current standard stops at 4 bytes and only uses 0xC2 to 0xF4). All following bytes are in the range 0x80 to 0xBF. This makes resynchronization easy and makes the encoding stateless and robust against missing bytes.
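Because the byte ranges do not overlap, a reader that lands in the middle of a character can find the next character boundary by skipping bytes in the range 0x80 to 0xBF. The sketch below illustrates this in Python; the helper names `classify` and `resync` are hypothetical and are not part of any library.

```python
def classify(byte: int) -> str:
    """Classify one byte using the ranges given in the text."""
    if byte <= 0x7F:
        return "ascii"
    if byte <= 0xBF:
        return "continuation"  # 0x80 to 0xBF
    if byte <= 0xFD:
        return "lead"          # 0xC0 to 0xFD
    return "invalid"           # 0xFE and 0xFF never occur


def resync(data: bytes, pos: int) -> int:
    """Return the index of the first character boundary at or after pos."""
    while pos < len(data) and classify(data[pos]) == "continuation":
        pos += 1
    return pos


# "a", U+00E9 and U+4E2D encode to 61 | C3 A9 | E4 B8 AD.
data = "a\u00e9\u4e2d".encode("utf-8")
kinds = [classify(b) for b in data]

# Start in the middle of the 2-byte character and recover.
start = resync(data, 2)
recovered = data[start:].decode("utf-8")

print(kinds)
print(start, recovered)
```

Starting at index 2, which is the second byte of the 2-byte character, `resync` skips one continuation byte and the rest of the data decodes cleanly.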

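In the original design, a UTF-8 encoded character could be up to 6 bytes long. Since RFC 3629, UTF-8 is restricted to the Unicode range U+0000 to U+10FFFF, so no character takes more than 4 bytes, and characters in the 16-bit BMP take at most 3 bytes. The sorting order of big-endian UCS-4 byte strings is preserved: comparing UTF-8 strings byte by byte gives the same order as comparing them by code point. The bytes 0xFE and 0xFF are never used in UTF-8.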
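Two of these properties are easy to observe: sorting by UTF-8 bytes matches sorting by code point, and the bytes 0xFE and 0xFF do not appear. The following Python sketch uses an arbitrary sample list chosen for illustration; it demonstrates the properties for these samples and is not a proof.

```python
# One character from each length class, in no particular order.
words = ["z", "\u00e9", "\u4e2d", "\U0001F600", "a"]

# Python compares str values code point by code point.
by_code_point = sorted(words)
# bytes objects compare byte by byte.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

order_preserved = by_code_point == by_utf8_bytes

all_bytes = b"".join(w.encode("utf-8") for w in words)
has_fe_or_ff = 0xFE in all_bytes or 0xFF in all_bytes

print(order_preserved)  # True
print(has_fe_or_ff)     # False
```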

Number of encoding bytes

UTF-8 uses 1 to 4 bytes to encode each character:

·A US-ASCII character requires only 1 byte (Unicode range U+0000 to U+007F).

·Latin letters with diacritical marks, as well as Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and other letters in this range, require 2 bytes (Unicode range U+0080 to U+07FF).

·The remaining characters of the Basic Multilingual Plane (Unicode range U+0800 to U+FFFF), which include most characters in common use, such as Chinese, Japanese and Korean characters, Southeast Asian characters and Middle Eastern characters, require 3 bytes.

·Other, rarely used characters in the supplementary planes (Unicode range U+10000 to U+10FFFF) require 4 bytes.
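The byte counts in the list above follow directly from the code point ranges. The sketch below is a small Python illustration; `utf8_length` is a hypothetical helper written for this example and is compared against Python's built-in encoder.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for a code point, by range."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4  # U+10000 to U+10FFFF


# One sample per length class: U+0041, U+00E9, U+4E2D, U+1F600.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]
lengths = [utf8_length(ord(ch)) for ch in samples]
encoded_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} needs {n} byte(s)")
```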

UTF-8 encoding rules:

A single-byte character has 0 as its highest bit. In a multibyte character, the number of consecutive 1 bits at the top of the first byte, counting from the highest bit, equals the total number of bytes in the encoding, and every remaining byte starts with the bits 10.
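These rules can be written out as a small encoder. The following is a minimal sketch in Python, assuming the current 4-byte limit (U+0000 to U+10FFFF, surrogates excluded); `encode_code_point` and `leading_ones` are hypothetical names used only here, and in practice `str.encode("utf-8")` does this work.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point by hand, following the bit layout rules."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def leading_ones(byte: int) -> int:
    """Count consecutive 1 bits starting from the highest bit."""
    count, mask = 0, 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


encoded = encode_code_point(0x4E2D)
bits = " ".join(format(b, "08b") for b in encoded)
print(bits)                      # 11100100 10111000 10101101
print(leading_ones(encoded[0]))  # 3
```

The first byte of U+4E2D starts with three 1 bits, matching its 3-byte length, and both remaining bytes start with 10.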
