Detailed explanation of the character sets and encodings used with PHP, and when each should be used

Jul 21, 2016, 03:24 PM

A character set is a collection of characters. There are many character sets, and each contains a different number of characters. Common ones include the ASCII, GB2312, BIG5, GB 18030, and Unicode character sets. For a computer to process text in any of these character sets accurately, the characters must be encoded so that the computer can recognize and store them.

Chinese has a very large number of characters and is written in two forms with different rules: Simplified Chinese and Traditional Chinese. Computers, however, were originally designed around single-byte English characters, so encoding Chinese characters is the technical foundation of Chinese information exchange. This article walks through several typical character sets in chronological order, including several representative Chinese character sets, and looks at the origin, characteristics, and technical features of each.

ASCII character set

1. Origin of the name

ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet.

2. Features

It is mainly used to display modern English and other Western European languages. It is the most widely used single-byte encoding system today and is equivalent to the international standard ISO/IEC 646.

3. Content included

Control characters: carriage return, backspace, line feed, etc.

Printable characters: uppercase and lowercase English letters, Arabic numerals, and Western punctuation and symbols

4. Technical features

7 bits represent one character, a total of 128 characters
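Because every ASCII character fits in 7 bits, a byte-by-byte check is enough to tell whether a string is pure ASCII. The following is a minimal sketch; the function name `is_seven_bit_ascii` is an illustrative choice, not something from the original article.

```php
<?php
// Returns true when every byte of $text fits in 7 bits (0x00-0x7F),
// i.e. the string contains only ASCII characters.
function is_seven_bit_ascii(string $text): bool
{
    for ($i = 0, $n = strlen($text); $i < $n; $i++) {
        if (ord($text[$i]) > 0x7F) {
            return false;
        }
    }
    return true;
}

var_dump(is_seven_bit_ascii("Hello, PHP!"));  // bool(true)
var_dump(is_seven_bit_ascii("caf\xC3\xA9"));  // bool(false): contains bytes above 0x7F
```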

5. Extended ASCII character set

A 7-bit character set can represent only 128 characters. To cover more of the characters commonly used in European languages, ASCII was extended: the extended ASCII character set uses 8 bits per character, for a total of 256 characters.

The added symbols include table-drawing symbols, calculation symbols, Greek letters, and special Latin symbols.

GB2312 character set

1. Origin of the name

GB2312, also known as GB2312-80, has the full name "Chinese Coded Character Set for Information Interchange: Basic Set". It was issued by the former China State Administration of Standards and took effect on May 1, 1981.

2. Features

GB2312 is China's national standard character set for Simplified Chinese. The Chinese characters it contains cover 99.75% of usage frequency, which basically meets the needs of computer processing of Chinese. It is widely used in mainland China and Singapore.

3. Content included

GB2312 contains 7,445 graphic characters in total. Of these, 6,763 are simplified Chinese characters (3,755 first-level and 3,008 second-level characters), and 682 are full-width non-Chinese characters: general symbols, serial numbers, digits, Latin letters, Japanese hiragana and katakana, Greek letters, Russian Cyrillic letters, Hanyu Pinyin symbols, and Zhuyin (Chinese phonetic) letters.

4. Technical features

(1) Partition representation

In GB2312, the characters are divided into areas (zones), and each area holds 94 positions for Chinese characters or symbols. A character's area number followed by its position number is called its area code.

The areas are assigned as follows: areas 01-09 hold special symbols; areas 16-55 hold first-level Chinese characters, sorted by pinyin; areas 56-87 hold second-level Chinese characters, sorted by radical and stroke count; areas 10-15 and 88-94 are unassigned.

(2) Double-byte representation

Each character is stored in two bytes. By convention the first byte is called the "high byte" and the second the "low byte".

The high byte ranges over 0xA1-0xF7 (the area number, 01-87, plus 0xA0), and the low byte over 0xA1-0xFE (the position number, 01-94, plus 0xA0).

5. Encoding example

Take "啊", the first Chinese character in the GB2312 character set, as an example. Its area number is 16 and its position number is 01, so its area code is 1601. In a program, 0xA0 is added to both the high byte and the low byte, giving the internal code 0xB0A1: 0xB0 = 0xA0 + 16 and 0xA1 = 0xA0 + 1.
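The calculation above can be written out in a few lines of PHP. This is only a sketch of the arithmetic; the function name `gb2312_bytes_from_area_code` is hypothetical, and the function does not check whether a given position is actually assigned a character.

```php
<?php
// Converts a GB2312 area number and position number (both 1-94)
// into the two-byte code by adding 0xA0 to each.
function gb2312_bytes_from_area_code(int $area, int $position): string
{
    if ($area < 1 || $area > 94 || $position < 1 || $position > 94) {
        throw new InvalidArgumentException('Area and position must be in 1..94');
    }
    return chr(0xA0 + $area) . chr(0xA0 + $position);
}

$bytes = gb2312_bytes_from_area_code(16, 1);  // area code 1601
echo strtoupper(bin2hex($bytes)), "\n";       // B0A1
```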

BIG5 character set

1. Origin of the name

Big5, also known as the Big Five code, was created in 1984 by Taiwan's Institute for Information Industry together with five software companies: Acer, MiTAC, JiaJia, Zero One, and FIC. That is where the name "Big Five" comes from.

Big5 was created because vendors in Taiwan at the time had each launched their own encodings, such as the ETen code, IBM PS55, and the Wang code, which were incompatible with one another. In addition, the Taiwan government had not yet released an official Chinese character encoding, and the GB2312 encoding used in mainland China does not include traditional Chinese characters.

2. Features

The Big5 character set contains a total of 13,053 Chinese characters and is used in Taiwan, China. Curiously, it encodes two characters twice: "兀" (0xA461 and 0xC94A) and "嗀" (0xDCD1 and 0xDDFC).

3. Character encoding method

Big5 uses double-byte storage: each character is encoded in two bytes. The first byte is called the "high byte" and the second the "low byte". The high byte ranges over 0xA1-0xF9, and the low byte over 0x40-0x7E and 0xA1-0xFE.

The code ranges are assigned as follows: 0xA140-0xA3BF hold punctuation marks, Greek letters, and special symbols, with 0xA259-0xA261 holding the two-syllable unit-of-measurement characters 兙兛兞兝兡兣嗧瓩糎; 0xA440-0xC67E hold frequently used Chinese characters, sorted by stroke count and then by radical; 0xC940-0xF9D5 hold less frequently used Chinese characters, sorted the same way.
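The byte ranges and blocks described in this section translate directly into range checks. The sketch below is illustrative only: `is_big5_pair` and `big5_block` are hypothetical helper names, and the code covers just the ranges listed in this article, not vendor extensions.

```php
<?php
// Checks the byte ranges given in this section: high byte 0xA1-0xF9,
// low byte 0x40-0x7E or 0xA1-0xFE.
function is_big5_pair(int $high, int $low): bool
{
    $highOk = $high >= 0xA1 && $high <= 0xF9;
    $lowOk  = ($low >= 0x40 && $low <= 0x7E) || ($low >= 0xA1 && $low <= 0xFE);
    return $highOk && $lowOk;
}

// Maps a two-byte Big5 code to the blocks listed in this section.
function big5_block(int $code): string
{
    if (!is_big5_pair($code >> 8, $code & 0xFF)) {
        return 'not a Big5 code';
    }
    if ($code >= 0xA140 && $code <= 0xA3BF) {
        return 'symbols';
    }
    if ($code >= 0xA440 && $code <= 0xC67E) {
        return 'frequently used characters';
    }
    if ($code >= 0xC940 && $code <= 0xF9D5) {
        return 'less frequently used characters';
    }
    return 'other or unassigned';
}

echo big5_block(0xA461), "\n";  // frequently used characters
echo big5_block(0xC94A), "\n";  // less frequently used characters
echo big5_block(0xA47F), "\n";  // not a Big5 code
```

The two sample codes are the duplicate encodings of "兀" mentioned earlier, which land in different blocks.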

4. Limitations of Big5

Although Big5 contains more than 10,000 characters, it leaves out many characters that are in everyday use in personal names, place names, dialects, chemistry, and biology. It also does not include Japanese hiragana and katakana.

For example, Taiwan treats "着" as a variant of "著", so "着" is not included. Some radical characters from the Kangxi Dictionary (such as "亠", "疒", "辵", and "癶") and characters common in personal names (such as "堃", "煊", "栢", and "喆") are also missing from Big5.

GB18030 character set

1. Origin of the name

The full name of GB 18030 is GB18030-2000 "Extension of the Basic Set of Chinese Coded Character Sets for Information Interchange". It is a national standard for Chinese character encoding issued by the Chinese government on March 17, 2000. Software released on the Chinese market after August 31, 2001 must comply with this standard.

2. Features

The GB 18030 character set standard was introduced after broad participation and review, and was put into practice jointly by well-known domestic and foreign companies in the information technology industry, the Ministry of Information Industry, and the former State Administration of Quality and Technical Supervision.

The GB 18030 standard solves the problem of encoding a large character set made up of Chinese characters, Japanese kana, Korean, and the scripts of China's ethnic minorities. Its total encoding space exceeds 1.5 million code positions, and it contains 27,484 Chinese characters. It meets the requirements of information exchange in East Asia, including mainland China, Hong Kong, Taiwan, Japan, and South Korea, for a multi-language, large-repertoire, multi-purpose, unified encoding format. It is compatible with Unicode 3.0 and includes the characters of the Unicode block "CJK Unified Ideographs Extension A". It is also compatible with the earlier national character encoding standards (GB2312, GB13000.1).

3. Encoding method

The GB 18030 standard encodes characters in three forms: single byte, double byte, and four byte. The single-byte part uses 0x00 to 0x7F (matching the corresponding ASCII codes). In the double-byte part, the first byte ranges from 0x81 to 0xFE, and the second byte ranges from 0x40 to 0x7E and from 0x80 to 0xFE. The four-byte part uses the values 0x30 to 0x39, which are not used in GB/T 11383, as the second byte to extend the double-byte encoding. The resulting four-byte codes range from 0x81308130 to 0xFE39FE39: the first and third bytes each range from 0x81 to 0xFE, and the second and fourth bytes each range from 0x30 to 0x39.
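Since the second byte distinguishes the double-byte form from the four-byte form, a decoder can determine the length of each character by inspecting at most four bytes. The following sketch applies only the byte ranges stated above; the function name `gb18030_char_length` is an assumption made for illustration, and the function does not check whether a code position is actually assigned.

```php
<?php
// Returns the length in bytes (1, 2, or 4) of the GB 18030 character that
// starts at $offset, or 0 if the bytes match none of the three forms.
function gb18030_char_length(string $bytes, int $offset = 0): int
{
    $n = strlen($bytes);
    if ($offset < 0 || $offset >= $n) {
        return 0;
    }
    $b1 = ord($bytes[$offset]);
    if ($b1 <= 0x7F) {
        return 1;  // single-byte form (ASCII range)
    }
    if ($b1 < 0x81 || $b1 > 0xFE || $offset + 1 >= $n) {
        return 0;
    }
    $b2 = ord($bytes[$offset + 1]);
    if (($b2 >= 0x40 && $b2 <= 0x7E) || ($b2 >= 0x80 && $b2 <= 0xFE)) {
        return 2;  // double-byte form
    }
    if ($b2 >= 0x30 && $b2 <= 0x39 && $offset + 3 < $n) {
        $b3 = ord($bytes[$offset + 2]);
        $b4 = ord($bytes[$offset + 3]);
        if ($b3 >= 0x81 && $b3 <= 0xFE && $b4 >= 0x30 && $b4 <= 0x39) {
            return 4;  // four-byte form
        }
    }
    return 0;
}

echo gb18030_char_length("A"), "\n";                 // 1
echo gb18030_char_length("\xB0\xA1"), "\n";          // 2
echo gb18030_char_length("\x81\x30\x81\x30"), "\n";  // 4
```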

4. Content included

The double-byte part mainly contains all 20,902 CJK Chinese characters in GB13000.1, related punctuation marks, 13 ideographic description characters, 80 supplementary Chinese characters and radicals/components, the double-byte encoded euro sign, and so on. The four-byte part contains all characters in GB 13000.1 other than the double-byte characters above, including CJK Unified Ideographs Extension A.

Unicode character set

1. Origin of the name

The Unicode character set, associated with the name Universal Multiple-Octet Coded Character Set, is a character encoding system developed by an organization called the Unicode Consortium to support the exchange, processing, and display of written text in the world's languages. Development began in 1990, and the encoding was officially announced in 1994. At the time of writing, the latest version was Unicode 4.1.0, released on March 31, 2005.

2. Features

Unicode is a character encoding used on computers. It sets a unified and unique binary encoding for each character in each language to meet the requirements for cross-language and cross-platform text conversion and processing.

3. Encoding method

The Unicode standard always writes code points in hexadecimal, prefixed with "U+". For example, the code point of the letter "A" is 0041 (hexadecimal) and the code point of the character "€" is 20AC (hexadecimal), so they are written as "U+0041" and "U+20AC".
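The "U+" notation is easy to produce from a numeric code point with `sprintf`. A minimal sketch, with `format_code_point` as an illustrative helper name:

```php
<?php
// Formats a Unicode code point in the conventional "U+XXXX" notation:
// uppercase hexadecimal, padded to at least four digits.
function format_code_point(int $codePoint): string
{
    if ($codePoint < 0 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Code point out of range');
    }
    return sprintf('U+%04X', $codePoint);
}

echo format_code_point(0x41), "\n";    // U+0041
echo format_code_point(0x20AC), "\n";  // U+20AC
```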

4. UTF-8 encoding

UTF-8 is one of the ways of using Unicode. UTF stands for Unicode Transformation Format, that is, a way of transforming Unicode code points into a particular byte format.

UTF-8 makes it easier to transmit text in different languages and encodings between computers over a network, because it lets Unicode text pass correctly through existing systems that were built to process single bytes.

UTF-8 stores Unicode characters in a variable number of bytes. ASCII letters are still stored in 1 byte; accented letters, Greek letters, and Cyrillic letters take 2 bytes; commonly used Chinese characters take 3 bytes; and supplementary-plane characters take 4 bytes.
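To make the variable length concrete, here is a hand-written sketch of the UTF-8 bit layout for a single code point. It is for illustration only; in real code an extension such as mbstring would normally do this, and the function name `utf8_encode_code_point` is hypothetical.

```php
<?php
// Encodes one Unicode code point as UTF-8 (1 to 4 bytes).
function utf8_encode_code_point(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a valid Unicode scalar value');
    }
    if ($cp < 0x80) {
        return chr($cp);                                   // 1 byte: ASCII
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))                      // 2 bytes
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))                     // 3 bytes
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))                         // 4 bytes
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8_encode_code_point(0x41)), "\n";     // 41       (letter A)
echo bin2hex(utf8_encode_code_point(0x3B1)), "\n";    // ceb1     (Greek alpha)
echo bin2hex(utf8_encode_code_point(0x4E2D)), "\n";   // e4b8ad   (Chinese character 中)
echo bin2hex(utf8_encode_code_point(0x1F600)), "\n";  // f09f9880 (supplementary plane)
```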

5. UTF-16 and UTF-32 encoding

UTF-32, UTF-16, and UTF-8 are all character encoding schemes for the Unicode coded character set. UTF-16 encodes each Unicode code point as a sequence of one or two 16-bit code units; UTF-32 represents each Unicode code point as a 32-bit integer with the same value.
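The "one or two 16-bit code units" rule can be sketched as follows: code points below U+10000 map to a single unit, and higher code points are split into a surrogate pair. The function name `utf16_code_units` is an illustrative assumption.

```php
<?php
// Returns the UTF-16 code units (as integers) for one Unicode code point:
// one unit below U+10000, otherwise a high and a low surrogate.
function utf16_code_units(int $cp): array
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a valid Unicode scalar value');
    }
    if ($cp < 0x10000) {
        return [$cp];
    }
    $v = $cp - 0x10000;  // 20 bits to split into two 10-bit halves
    return [0xD800 | ($v >> 10), 0xDC00 | ($v & 0x3FF)];
}

printf("%04X\n", ...utf16_code_units(0x4E2D));        // 4E2D
printf("%04X %04X\n", ...utf16_code_units(0x1F600));  // D83D DE00
```

In UTF-32 no such splitting is needed, because the code unit is simply the code point value itself.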

Solutions to garbled code problems in various PHP applications

1) Use the <meta http-equiv="content-type" content="text/html; charset=xxx"> tag to set the page encoding

This tag declares which character set the client's browser should use to display the page. Here xxx can be GB2312, GBK, UTF-8 (note that MySQL spells it UTF8), and so on. Most pages can use this method to tell the browser which encoding to use, which avoids garbled characters caused by encoding errors. Sometimes, however, the tag seems to have no effect: whatever xxx is, the browser always uses the same encoding. The reason is explained later.

Note that the <meta> tag is part of the HTML content. It is only a declaration, and it can take effect only once the server has delivered the HTML to the browser.

2) header("content-type:text/html; charset=xxx");

The header() function sends the string in the parentheses as an HTTP header. With the content shown here, its effect is essentially the same as the <meta> tag, and the two even look alike. The difference is that with header() the browser always uses the xxx encoding you asked for, which makes this function very useful. Why? That comes down to the difference between HTTP headers and HTML content:

HTTP headers are strings that the server sends, over the HTTP protocol, before it sends the HTML content to the browser. The <meta> tag is part of the HTML content, so what header() sends reaches the browser first. Put simply, header() has higher priority than <meta>. If a PHP page contains both header("content-type:text/html; charset=xxx") and <meta http-equiv="content-type" content="text/html; charset=xxx">, the browser honors only the HTTP header and ignores the meta tag. Note that header() can only be used in PHP pages.
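As a rough sketch of how this is usually wired up, the snippet below builds the header string in a small helper and sends it before any output. The helper name `html_content_type` and the choice of UTF-8 are assumptions for illustration; the guard skips `header()` on the command line or if output has already started.

```php
<?php
// Builds the Content-Type header line for an HTML page in the given charset.
function html_content_type(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() must be called before any output is sent to the browser.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header(html_content_type('UTF-8'));
}
```

A page that also carries a meta tag should declare the same charset there, so that the two declarations never disagree.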

One question remains: why does the former always work while the latter sometimes does not? To answer it, we need to look at Apache.

3) AddDefaultCharset

The conf folder in the Apache root directory contains httpd.conf, the main Apache configuration file.

Open httpd.conf in a text editor. Around line 708 (the exact line varies by version) there is a line AddDefaultCharset xxx, where xxx is an encoding name. This line sets the character set in the HTTP header of every page served by the server to xxx by default, which is equivalent to adding header("content-type:text/html; charset=xxx") to every file. This explains why a browser may keep using gb2312 even though the page's meta tag says utf-8.

If a page calls header("content-type:text/html; charset=xxx"), the server default is overridden by the character set you specify, which is why this function always works. If you comment out AddDefaultCharset xxx by putting a "#" in front of it, and the page does not call header("content-type..."), then the meta tag takes effect.
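For reference, the relevant part of httpd.conf might look like the fragment below. The UTF-8 value is only an example; AddDefaultCharset also accepts Off, which disables the server-wide default in the same way as commenting the line out.

```apache
# httpd.conf

# Add "charset=UTF-8" to the Content-Type header of text/html and
# text/plain responses that do not set a charset themselves
AddDefaultCharset UTF-8

# Alternative: disable the server-wide default so that the page's own
# declaration (header() or the meta tag) decides the encoding
# AddDefaultCharset Off
```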

The priority order of the methods above, from highest to lowest, is:

header("content-type:text/html; charset=xxx")

AddDefaultCharset xxx

<meta http-equiv="content-type" content="text/html; charset=xxx">

If you are a web programmer, it is recommended to add header("content-type:text/html; charset=xxx") to every page. That way the page displays correctly on any server, which also makes it more portable.

4) default_charset configuration in php.ini:

The line default_charset = "gb2312" in php.ini defines the default character set for PHP. It is generally recommended to comment this line out and let the browser choose the encoding from the charset declared in the page header instead of forcing one. That way the same server can serve pages in multiple languages.
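The setting in question looks like the fragment below. Note one caveat that postdates the advice above: since PHP 5.6 the built-in default for default_charset is "UTF-8", so commenting the line out no longer leaves the charset unspecified; the values shown are examples only.

```ini
; php.ini

; Forces one charset into the Content-Type header PHP sends by default
;default_charset = "gb2312"

; With the line commented out, PHP falls back to its built-in default
; (empty before PHP 5.6, "UTF-8" from PHP 5.6 onward). To state it explicitly:
default_charset = "UTF-8"
```

A header("content-type:text/html; charset=xxx") call in a page still overrides this default for that page.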
