


Introduction to the principles of implementing Chinese full-text search in PHP
In typical development, articles and related content are searched by keyword tags and titles, and such searches usually rely on inefficient LIKE statements. Because of this inefficiency, slightly larger projects cannot afford to search the detailed fields of articles or related content: the load on the server is too high and the queries are extremely slow.
Common solutions
1. Sphinx / Coreseek
Advantages: Mature and stable technology
Disadvantages: Sphinx does not support Chinese, and Coreseek is no longer maintained (in a purely English environment, Sphinx is excellent)
2. Xunsearch
Advantages: Mature and stable technology
Disadvantages: The installation process is complicated and the configuration is not flexible enough
3. MySQL full-text search
Advantages: Easy installation and high efficiency
Disadvantages: Chinese support is not good enough
Solution from hcoder (self-configured word segmentation)
Advantages: Simple installation (a PHP component); the underlying layer is written by the developer, so it is easier to understand and to optimize
Disadvantages: Developers need a foundation in PHP and MySQL and must write the code for the entire process themselves
Principle
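1. Word extraction stage: article data table -> read the article records row by row -> combine all text content -> segment into words and deduplicate -> record in a new data table
2. Search stage: search the keyword record table -> merge the article data -> deduplicate -> display the data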
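The two stages above can be sketched in plain PHP. This is only an illustration under stated assumptions: the "data tables" are ordinary arrays rather than MySQL tables, the function names `segment`, `buildIndex`, and `search` are hypothetical, and `segment` is a naive stand-in that splits on punctuation and whitespace, so it does not segment real Chinese text the way a dictionary-based segmenter would.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for a real word segmenter: splits on anything that
 * is not a letter or digit and removes duplicates. No case folding is done,
 * to keep the sketch short.
 */
function segment(string $text): array
{
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    return array_values(array_unique($words));
}

/**
 * Stage 1: read each article, combine its text fields, segment and
 * deduplicate the words, and record word => list of article IDs.
 */
function buildIndex(array $articles): array
{
    $index = [];
    foreach ($articles as $id => $article) {
        $text = $article['title'] . ' ' . $article['content'];
        foreach (segment($text) as $word) {
            $index[$word][] = $id;
        }
    }
    return $index;
}

/**
 * Stage 2: look up every keyword of the query, merge the matching
 * article IDs, and deduplicate them.
 */
function search(array $index, string $query): array
{
    $ids = [];
    foreach (segment($query) as $word) {
        $ids = array_merge($ids, $index[$word] ?? []);
    }
    $ids = array_values(array_unique($ids));
    sort($ids);
    return $ids;
}

// Sample data standing in for the article table (assumed columns).
$articles = [
    1 => ['title' => 'PHP search', 'content' => 'Full-text search with PHP and MySQL'],
    2 => ['title' => 'MySQL tips', 'content' => 'Indexing columns in MySQL'],
    3 => ['title' => 'Cooking', 'content' => 'How to make noodles'],
];

$index = buildIndex($articles);
print_r(search($index, 'MySQL PHP'));
```

With the sample data, the query matches articles 1 and 2. In a real project the index would live in its own table keyed by word, so a search becomes an indexed equality lookup instead of a LIKE scan over the article bodies.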
Third-party component used (SCWS)
http://www.xunsearch.com/scws/
SCWS stands for Simple Chinese Word Segmentation, that is, a simple Chinese word segmentation system.
It is a mechanical Chinese word segmentation engine based on a word-frequency dictionary, and it can split a passage of Chinese text into words with reasonable accuracy. Words are the smallest morpheme units in Chinese, but written Chinese does not separate them with spaces the way English does, so segmenting text accurately and quickly has always been a hard problem.
SCWS is written in pure C and does not depend on any external libraries. It can be embedded in applications directly as a dynamic link library. Supported Chinese encodings include GBK and UTF-8. A PHP extension module is also provided, which makes the word segmentation function quick and easy to use from PHP.
The segmentation algorithm itself contains little innovation. It relies on a self-collected word-frequency dictionary, supplemented by rule-based recognition of proper nouns, personal names, place names, numbers, and dates, to achieve basic word segmentation. In small-scale tests the accuracy was between 90% and 95%, which is generally enough for small search engines, keyword extraction, and similar uses. The first prototype version was released in late 2005.
SCWS was developed by hightman and released as open source under the BSD license. The source code is hosted on GitHub.
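As a rough sketch of how the PHP extension might be called for the "segment and deduplicate" step, the function below uses the `scws_new()` entry point and the `set_charset`, `set_ignore`, `send_text`, `get_result`, and `close` methods of the SCWS extension. The function name `segmentWords` is hypothetical. It is assumed that the dictionary and rule files are already configured (for example in php.ini); the fallback branch is not part of SCWS and exists only so the sketch still runs where the extension is not installed.

```php
<?php
declare(strict_types=1);

/**
 * Segments UTF-8 text into a deduplicated list of words.
 * Uses the SCWS extension when it is loaded; otherwise falls back to a
 * naive splitter (an assumption of this sketch, not SCWS behavior).
 */
function segmentWords(string $text): array
{
    $words = [];

    if (function_exists('scws_new')) {
        $scws = scws_new();
        $scws->set_charset('utf8');   // the text is assumed to be UTF-8
        $scws->set_ignore(true);      // skip punctuation in the results
        $scws->send_text($text);

        // get_result() returns batches of words until the text is exhausted.
        while ($batch = $scws->get_result()) {
            foreach ($batch as $item) {
                $words[] = $item['word'];
            }
        }
        $scws->close();
    } else {
        // Fallback: split on characters that are neither letters nor digits.
        $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    }

    return array_values(array_unique($words));
}

print_r(segmentWords('full text search, full text index'));
```

The returned word list is what would be written to the keyword table during the word extraction stage. Without the extension, the fallback cannot split continuous Chinese text into words, so it is only useful for trying out the surrounding code.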
