Bringing Unicode to PHP with Portable UTF-8

Feb 23, 2025 am 09:29 AM

Key takeaways

  • Although PHP allows multi-byte variable names and can carry Unicode strings, the language lacks comprehensive Unicode support because its standard string functions treat strings as sequences of single-byte characters. This limitation affects nearly every string operation, including substring extraction, length calculation, and splitting.
  • Portable UTF-8 is a user-space library that brings Unicode support to PHP applications. Built on top of mbstring and iconv, it provides about 60 Unicode-aware string manipulation, testing, and validation functions and uses UTF-8 as its primary character encoding. The library is fully portable and works with any installation of PHP 4.2 or later.
  • Portable UTF-8 provides functions for validating UTF-8 input, removing invalid bytes, encoding text as HTML entities to help prevent XSS attacks, trimming whitespace, removing duplicate whitespace, creating URL slugs that contain UTF-8 characters, and enforcing limits on input length in characters. In a Unicode-enabled application, the focus shifts from bytes and byte lengths to characters and character lengths.

PHP allows multi-byte variable names (e.g. $a∩b, $Ʃxy and $Δx), extensions such as mbstring can process Unicode strings, and the utf8_encode() and utf8_decode() functions convert strings between UTF-8 and ISO-8859-1. Nevertheless, PHP is widely regarded as lacking Unicode support. This article explains what that lack of support means and demonstrates Portable UTF-8, a library that brings Unicode support to PHP applications.

Unicode support in PHP

PHP's lack of Unicode (multi-byte) support means that the standard string functions treat strings as sequences of single-byte characters. In fact, the official PHP manual defines a string as a "series of characters, where a character is the same as a byte". PHP supports only 8-bit characters, while Unicode (and many other character sets) may need several bytes to represent one character. This limitation affects almost every aspect of string handling, including (but not limited to) substring extraction, determining string length, splitting, and shuffling.

Efforts to solve this problem began in early 2005, but in 2010 the work to bring native Unicode support to PHP was stopped and put on hold for a variety of reasons. Since native Unicode support may take years to arrive (if it arrives at all), developers must rely on available extensions such as mbstring and iconv to fill the gap. These extensions are not Unicode-centric, since they also convert between non-Unicode encodings, but they do make Unicode string processing simpler. They also have drawbacks: they offer only limited Unicode string processing capabilities, and none of them is enabled by default. Server administrators must explicitly enable each extension before PHP applications can use it. Shared hosting providers often make matters worse by installing only one or two of the extensions, which makes it hard for developers to rely on an always-available API for their Unicode needs.

Still, the good news is that PHP can output Unicode text, because PHP does not care whether we send ASCII-encoded English text or text in a language whose characters are encoded in multiple bytes. Knowing this, all PHP developers need is an API that provides convenient Unicode-aware string manipulation.
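The byte-versus-character mismatch can be seen with nothing more than core PHP functions. The following is a minimal sketch; the sample word and variable names are illustrative, and the word is written with byte escapes so that the result does not depend on how the file is saved.

```php
<?php
// "naïve" in UTF-8: the "ï" occupies two bytes (0xC3 0xAF).
$word = "na\xC3\xAFve";

// strlen() counts bytes, not characters: 6 bytes for 5 characters.
$byteLength = strlen($word);

// substr() cuts at a byte offset, so asking for "3 characters" splits
// the "ï" in half and leaves a lone 0xC3 byte at the end.
$cut = substr($word, 0, 3);

echo $byteLength, "\n";   // 6
echo bin2hex($cut), "\n"; // 6e61c3
```

The truncated value is no longer valid UTF-8, which is exactly the kind of damage a character-aware API is meant to prevent.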

Portable UTF-8

A more recent approach is to write user-space libraries in PHP. Even when the server or the language itself lacks support, such a library can easily be bundled with the application to guarantee that Unicode support is present. Many open source applications already ship their own libraries of this kind, and many more use free third-party libraries; Portable UTF-8 is one of them. Portable UTF-8 is a free, lightweight library built on top of mbstring and iconv. It extends the functionality of these two extensions with about 60 Unicode-aware string manipulation, testing, and validation functions, and it provides UTF-8-aware counterparts for nearly all of PHP's common string handling functions. As the name implies, Portable UTF-8 uses UTF-8 as its primary character encoding. For speed, the library uses the available extensions (mbstring and iconv) and smooths over some inconsistencies that appear when they are used directly; if the extensions are not present on the server, it falls back to UTF-8 routines written in pure PHP. Portable UTF-8 is fully portable and works with any installation of PHP 4.2 or later.
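The "use the extension when present, otherwise fall back to pure PHP" strategy might look roughly like the sketch below. The function names are hypothetical and this is not Portable UTF-8's actual code; it only illustrates the idea for a character-count function.

```php
<?php
// Hypothetical helper, not part of Portable UTF-8: count characters with
// mbstring when it is loaded, otherwise fall back to plain PHP.
function example_strlen(string $text): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($text, 'UTF-8');
    }
    return example_strlen_fallback($text);
}

// Pure-PHP fallback: in well-formed UTF-8 every character has exactly one
// byte that is NOT a continuation byte (0x80-0xBF), so counting those
// bytes counts characters.
function example_strlen_fallback(string $text): int
{
    return (int) preg_match_all('/[^\x80-\xBF]/', $text);
}

echo example_strlen("na\xC3\xAFve"), "\n"; // 5
```

The fallback assumes its input is already valid UTF-8; malformed input would need to be validated or cleaned first.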

String handling with Portable UTF-8

Text editors with poor Unicode support can corrupt text when they read it, and text copied from such an editor and pasted into a web form can be a source of invalid UTF-8 for the application. When processing user-submitted input, always make sure that the input matches what the application expects. To check whether text is valid UTF-8, use the library's is_utf8() function.

if (is_utf8($_POST['title'])) {
    // Do something...
}
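For readers without the library at hand, a comparable check can be sketched with PCRE alone. The helper name below is hypothetical and this is not how is_utf8() is implemented; it relies on the fact that preg_match() with the /u modifier does not return 1 when the subject is malformed UTF-8.

```php
<?php
// Hypothetical stand-in for a validity check, built on PCRE: with the /u
// modifier preg_match() fails on malformed UTF-8 instead of returning 1.
function example_is_valid_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

$good = "Caf\xC3\xA9"; // "Café", well-formed UTF-8
$bad  = "Caf\xE9";     // "é" as a single ISO-8859-1 byte, not UTF-8

var_dump(example_is_valid_utf8($good)); // bool(true)
var_dump(example_is_valid_utf8($bad));  // bool(false)
```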

Recovering characters from invalid bytes is impossible, so removing bytes that are not recognized as valid UTF-8 characters may be your only choice. The utf8_clean() function can be used to remove invalid bytes.

$title = utf8_clean($_POST['title']);
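To show what "removing invalid bytes" involves, here is a sketch that keeps only well-formed UTF-8 sequences and drops every other byte. The constant and function names are hypothetical, and the byte ranges follow the standard definition of well-formed UTF-8 rather than the library's own source.

```php
<?php
// One well-formed UTF-8 sequence, matched in byte mode (no /u modifier).
const EXAMPLE_UTF8_SEQUENCE = '/
      [\x00-\x7F]                        # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, excluding overlongs
    | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
/x';

// Hypothetical helper: collect every valid sequence and glue them back
// together; bytes that belong to no valid sequence are silently dropped.
function example_clean_utf8(string $text): string
{
    preg_match_all(EXAMPLE_UTF8_SEQUENCE, $text, $matches);
    return implode('', $matches[0]);
}

echo example_clean_utf8("ab\xFFcd"), "\n"; // abcd
```

Dropping bytes silently loses information, so an application may prefer to reject invalid input outright when that is an option.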

Each Unicode character can be encoded as the corresponding HTML entity, and you may want to encode the text in this way to help prevent XSS attacks before outputting it to the browser.

echo utf8_html_encode($title);
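PHP's built-in htmlspecialchars() offers a narrower form of output escaping, shown here for comparison. Unlike the entity encoding described above, it escapes only the characters that are special in HTML and leaves other Unicode characters untouched; the sample title is made up for the example.

```php
<?php
$title = '<script>alert("x")</script> Caf' . "\xC3\xA9";

// ENT_QUOTES escapes both quote styles; ENT_SUBSTITUTE replaces malformed
// byte sequences with U+FFFD instead of returning an empty string.
$safe = htmlspecialchars($title, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo $safe, "\n";
```

Passing the charset explicitly keeps the behaviour the same regardless of the server's default_charset setting.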

Whitespace is usually trimmed from the beginning and end of a string. Unicode defines about 20 whitespace characters, and some ASCII control characters should also be treated as characters to trim.

$title = utf8_trim($title);
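The built-in trim() knows only a handful of ASCII characters, which is why a Unicode-aware variant is needed. The sketch below uses PCRE Unicode properties; the helper name is hypothetical and the exact set of characters that utf8_trim() removes may differ.

```php
<?php
// Hypothetical helper: strip Unicode separators (\p{Z}), control
// characters (\p{Cc}) and ASCII whitespace from both ends of a string.
function example_trim(string $text): string
{
    return preg_replace('/^[\s\p{Z}\p{Cc}]+|[\s\p{Z}\p{Cc}]+$/u', '', $text) ?? '';
}

// NO-BREAK SPACE, IDEOGRAPHIC SPACE, "Title", TAB, EM SPACE
$raw = "\u{00A0}\u{3000}Title\t\u{2003}";

var_dump(trim($raw) === $raw);     // bool(true): trim() removed nothing
var_dump(example_trim($raw));      // string(5) "Title"
```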

Duplicates of such whitespace characters may also occur in the middle of a string and should be removed. The following shows how to combine utf8_remove_duplicates() and utf8_ws():

$title = utf8_remove_duplicates($title, utf8_ws());
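As an approximation of the same idea with PCRE only, a backreference can collapse consecutive copies of the same whitespace character. The helper name is hypothetical, and the library's function may treat mixed runs of different whitespace characters differently.

```php
<?php
// Hypothetical helper: ([\s\p{Z}]) captures one whitespace character and
// \1+ matches further copies of that same character; the whole run is
// replaced by a single copy.
function example_collapse_spaces(string $text): string
{
    return preg_replace('/([\s\p{Z}])\1+/u', '$1', $text) ?? '';
}

echo example_collapse_spaces("Hello    world"), "\n"; // Hello world
```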

The traditional way to create URL slugs for SEO purposes relies on transliteration and removes all non-ASCII characters from the slug, which makes the URL less valuable than it could be. Because URLs can carry UTF-8 encoded characters, we can skip the removal and transliteration and create rich slugs that contain characters in any language:

$slug = utf8_url_slug($title, 30); // limit of 30 characters
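A rough sketch of a slug function that keeps non-ASCII letters is shown below. The helper name and its exact rules are assumptions for illustration (for instance, it does not change case), not the behaviour of utf8_url_slug().

```php
<?php
// Hypothetical helper: build a slug that keeps letters, combining marks
// and digits from any script, limited to $maxChars characters.
function example_slug(string $title, int $maxChars): string
{
    // Replace each run of other characters with a single hyphen.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $title) ?? '';
    $slug = trim($slug, '-');

    // ".{0,N}" with /u counts characters, not bytes; /s lets "." match
    // newlines as well.
    if (preg_match('/^.{0,' . $maxChars . '}/us', $slug, $m) !== 1) {
        return '';
    }

    // Truncation may leave a trailing hyphen, so trim once more.
    return trim($m[0], '-');
}

echo example_slug('Привет, мир!', 30), "\n"; // Привет-мир
```

When such a slug is placed in a link, it would normally be passed through rawurlencode() so that the URL itself stays ASCII while browsers display the readable form.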

From input validation all the way to saving data in a database, Unicode-enabled applications focus on characters and character lengths, not bytes and byte lengths. This shift in focus requires an interface that understands the difference. It is often necessary to limit the length of input in characters; for example, if the input is longer than 60 characters, we truncate it to a 60-character substring.

if (utf8_strlen($title) > 60) {
    $title = utf8_substr($title, 0, 60);
}

Or:

if (!utf8_fits_inside($title, 60)) {
    $title = utf8_substr($title, 0, 60);
}

There are three ways to access individual characters with Portable UTF-8. We can use utf8_access() to access a single character by its position.

echo 'The sixth character is: ' . utf8_access($string, 5);

utf8_chr_map() allows iterating over individual characters with a callback function.

utf8_chr_map('some_callback', $string);

We can split the string into a character array using utf8_split() and process the array elements as a single character.

array_map('some_callback', utf8_split($string));
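The split-then-process pattern can be tried without the library by splitting on an empty pattern with the /u modifier. This is only a sketch of the idea using core PCRE functions; the sample string is arbitrary and bin2hex stands in for a real callback.

```php
<?php
$string = "\u{6771}\u{4EAC}\u{30BF}\u{30EF}\u{30FC}"; // 東京タワー

// An empty pattern with /u splits between characters rather than bytes.
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

echo count($chars), "\n"; // 5 characters (stored in 15 bytes)
echo $chars[2], "\n";     // the third character, タ

// Byte indexing, by contrast, returns one byte of a multi-byte character.
echo bin2hex($string[2]), "\n"; // b1

// Each array element is a whole character, so a callback sees one
// character at a time.
$hex = array_map('bin2hex', $chars);
```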

Working with Unicode may also require finding the minimum or maximum code point in a string, splitting strings, handling byte order marks, converting case, randomizing or shuffling, replacing, and so on. Portable UTF-8 supports all of these.

Conclusion

PHP 6 development has been stopped, which delays the long-awaited native Unicode support that is crucial for developing multilingual applications. In the meantime, server-side extensions and user-space libraries such as Portable UTF-8 play an important role in helping developers build a better-standardized web that meets local needs.