
Build a high-speed retrieval engine using Python and Xapian

Oct 18, 2016 am 10:03 AM

First, a few concepts.

Documents, terms and postings

In information retrieval (IR), the items we are trying to retrieve are called "documents", and each document is described by a set of terms. "Document" and "term" are IR jargon inherited from library science. A document is usually thought of as a piece of text, most likely in machine-readable form, and a term is a word or phrase that describes the document and usually appears in it. Most documents have multiple terms. For example, a document about oral hygiene might have the terms "tooth", "teeth", "toothbrush", "decay", "cavity", "plaque" or "diet".

Suppose an IR system contains a document D that is described by a term t. Then t is said to index D, written t -> D. In practice, an IR system holds a collection of documents D1, D2, D3, ... and a collection of terms t1, t2, t3, ..., which gives relationships of the form ti -> Dj.

When a specific term indexes a specific document, that pairing is called a posting. Put simply, a posting can carry position information (where the term occurs in the document), which can be useful in relevance retrieval.

Given a document D, the list of terms that index it is called D's term list.

Given a term t, the list of documents it indexes is called t's posting list ("document list" might be a more consistent name, but it sounds too vague).
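The relationship between term lists and posting lists can be sketched in a few lines of plain Python. This is only an in-memory illustration, not how Xapian stores its index; the sample sentences, the names `term_lists` and `posting_lists`, and the choice to count positions from 1 are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical sample documents, keyed by document id.
documents = {
    1: "brush each tooth to prevent decay",
    2: "plaque causes tooth decay",
}

term_lists = {}                    # document id -> sorted list of its terms
posting_lists = defaultdict(dict)  # term -> {document id: [positions]}

for doc_id, text in documents.items():
    words = text.lower().split()
    term_lists[doc_id] = sorted(set(words))
    # Record each occurrence as a posting with its word position.
    for position, term in enumerate(words, start=1):
        posting_lists[term].setdefault(doc_id, []).append(position)

print(term_lists[2])                   # ['causes', 'decay', 'plaque', 'tooth']
print(sorted(posting_lists["tooth"]))  # [1, 2]
print(posting_lists["decay"])          # {1: [6], 2: [4]}
```

Looking up `posting_lists["tooth"]` answers "which documents does this term index?", while `term_lists[2]` answers "which terms index this document?".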

In a computer-based IR system, terms are stored in index files, and a term can be used to look up its posting list efficiently. In a posting list, each document is represented by a short identifier, the document id. Simply put, a posting list can be thought of as a set of document ids, while a term list is a set of strings. Some IR systems represent terms internally as numbers, so in those systems a term list is a set of numbers. Xapian does not: it stores the terms themselves and uses prefix compression to save space.
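To give a feel for why prefix compression saves space on a sorted list of terms, here is a minimal sketch of the general idea. It is not Xapian's actual on-disk format; the function names and the (shared length, suffix) encoding are assumptions chosen for illustration.

```python
import os


def prefix_compress(sorted_terms):
    """Encode each term as (length shared with previous term, remaining suffix)."""
    encoded = []
    previous = ""
    for term in sorted_terms:
        shared = len(os.path.commonprefix([previous, term]))
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


def prefix_decompress(encoded):
    """Rebuild the original terms from the (shared length, suffix) pairs."""
    terms = []
    previous = ""
    for shared, suffix in encoded:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms


terms = sorted(["tooth", "toothbrush", "toothpaste", "teeth", "plaque"])
encoded = prefix_compress(terms)
print(encoded)
# [(0, 'plaque'), (0, 'teeth'), (1, 'ooth'), (5, 'brush'), (5, 'paste')]
```

Because neighbouring terms in sorted order tend to share long prefixes, only the differing tail of each term needs to be stored.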

Terms do not have to be words that appear in the document. They are usually converted to lowercase and often run through a stemming algorithm, so a single term such as "connect" can match a family of words: "connect", "connects", "connection", "connected" and so on. One word may also produce multiple terms; for example, you can index both the stemmed and the unstemmed form. Of course, this mainly applies to European languages such as English, French or Latin; Chinese word segmentation is very different. Broadly, tokenizing European languages differs from segmenting Chinese in the following ways:

1. Take English as an example. English words are normally separated by spaces, but Chinese words are not; in extreme cases an entire article may contain no spaces or punctuation at all.

2. As mentioned above, "connect", "connects", "connection" and "connected" are, respectively, the verb, its third-person singular form, the noun, and the past tense. In Chinese, a single word for "connect" covers all of these, so stemming is almost never needed. In other words, English word forms mostly follow rules, while Chinese has no such inflectional rules to rely on.

3. The second point is only a small part of what makes Chinese word segmentation hard. Correctly and completely identifying the meaning of a sentence is very difficult. For example, the sentence "The People's Republic of China was established" can be split into words such as "China", "Chinese", "people", "republic" and "founded", yet "Chinese" actually has little to do with the sentence. It looks simple at first glance, but a machine cannot easily grasp such subtleties.
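The first and third points can be illustrated with a short sketch. The Chinese string below is an assumption: it is a plausible original of the example sentence, not text taken from this article. Overlapping character bigrams are used here as a simple stand-in for a real word segmenter, which they are not; the function names are hypothetical.

```python
def whitespace_terms(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def character_bigrams(text):
    """Overlapping two-character terms; a crude stand-in for segmentation."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


english = "The People's Republic of China was established"
chinese = "中华人民共和国成立了"  # assumed original of the sentence above

print(whitespace_terms(english))
# ['the', "people's", 'republic', 'of', 'china', 'was', 'established']

# Without spaces, whitespace splitting returns the whole sentence as one term.
print(whitespace_terms(chinese))
# ['中华人民共和国成立了']

# Bigrams produce usable terms, including spurious ones such as '华人'.
print(character_bigrams(chinese))
```

Splitting on whitespace is enough to get started with English, but it yields a single useless term for the Chinese sentence. The bigram fallback produces searchable terms at the cost of spurious ones, which is the kind of ambiguity described in the third point.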

Values

Values are a kind of metadata attached to a document. Each document can have multiple values, each identified by a different number. Values are designed to be quick to access during the matching process, and can be used for sorting, collapsing redundant duplicate documents, and range retrieval. Although values have no length limit, it is best to keep them as short as possible. If you only want to store a field so that it can be displayed in results, store it in the document's data instead.
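The three uses mentioned above (sorting, collapsing duplicates, range retrieval) can be modelled in plain Python. This is a conceptual sketch only, not Xapian code: the slot numbers, the sample records and the helper names are all hypothetical. In Xapian itself a value is attached with `Document.add_value(slot, value)`.

```python
# Hypothetical slot numbers chosen for this sketch.
SLOT_YEAR = 0
SLOT_CHECKSUM = 1

# Each document carries short string values keyed by slot number.
documents = [
    {"data": "abc test python1", "values": {SLOT_YEAR: "2016", SLOT_CHECKSUM: "a1"}},
    {"data": "abcd testing python2", "values": {SLOT_YEAR: "2014", SLOT_CHECKSUM: "b2"}},
    {"data": "abc test python1 (copy)", "values": {SLOT_YEAR: "2015", SLOT_CHECKSUM: "a1"}},
]


def sort_by_value(docs, slot):
    """Order documents by the string stored in the given slot."""
    return sorted(docs, key=lambda doc: doc["values"].get(slot, ""))


def collapse_on_value(docs, slot):
    """Keep only the first document for each distinct value in the slot."""
    seen = set()
    kept = []
    for doc in docs:
        key = doc["values"].get(slot, "")
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def value_range(docs, slot, low, high):
    """Keep documents whose value in the slot lies between low and high."""
    return [doc for doc in docs if low <= doc["values"].get(slot, "") <= high]


print([doc["values"][SLOT_YEAR] for doc in sort_by_value(documents, SLOT_YEAR)])
# ['2014', '2015', '2016']
print(len(collapse_on_value(documents, SLOT_CHECKSUM)))           # 2
print(len(value_range(documents, SLOT_YEAR, "2015", "2016")))     # 2
```

The year strings compare correctly here only because they all have the same width. Values are compared as strings, so numbers generally need an encoding whose string order matches their numeric order.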

Document data

Each document has exactly one data item, which can be in any format, as long as you convert it to a string before storing it. This may sound odd, but in practice it works like this: if the data is text, store it directly; if it is an arbitrary object, serialize it to a byte stream before storing it and deserialize it when reading it back.
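A minimal sketch of that round trip, using the standard `pickle` module as one possible serializer. The `record` dictionary is a hypothetical example; the byte string it produces is what you would hand to the document's data (for instance via `doc.set_data`, as in the script later in this article).

```python
import pickle

# Text can be stored as the document data directly.
text_data = u"abc test python1"

# An arbitrary object (hypothetical example) is serialized first.
record = {"title": "Oral hygiene", "tags": ["tooth", "plaque"], "year": 2016}
serialized = pickle.dumps(record)    # byte string suitable for storing

# Later, after reading the stored data back, deserialize it.
restored = pickle.loads(serialized)
print(restored == record)            # True
```

If the stored data has to be readable from other languages or tools, a text format such as JSON is a reasonable alternative to `pickle`.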

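Posting

A posting is an occurrence of a term in a document, optionally with its position. The script below builds a small index; it uses add_term, which records that a term indexes a document without storing any positions.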

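# -*- coding: gb18030 -*-
import xapian

# Two tiny sample documents to index.
testdatas = [u'abc test python1', u'abcd testing python2']


def buildtest():
    # Create the database in ./indexes/, or open it if it already exists.
    database = xapian.WritableDatabase('indexes/', xapian.DB_CREATE_OR_OPEN)
    for data in testdatas:
        doc = xapian.Document()
        # Store the original text as the document data.
        doc.set_data(data)
        # Index every whitespace-separated word as a term (no positions).
        for term in data.split():
            doc.add_term(term)
        database.add_document(doc)
    # Write the pending changes to disk.
    database.commit()


if __name__ == '__main__':
    buildtest()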

After the script runs, an index database is created in the indexes/ directory under the current directory.

[ec2-user@ip-10-167-6-221 indexes]$ ll
total 52
-rw-rw-r-- 1 ec2-user ec2-user    0 Jul 28 16:06 flintlock
-rw-rw-r-- 1 ec2-user ec2-user   28 Jul 28 16:06 iamchert
-rw-rw-r-- 1 ec2-user ec2-user   13 Jul 28 16:06 postlist.baseA
-rw-rw-r-- 1 ec2-user ec2-user   14 Jul 28 16:06 postlist.baseB
-rw-rw-r-- 1 ec2-user ec2-user 8192 Jul 28 16:06 postlist.DB
-rw-rw-r-- 1 ec2-user ec2-user   13 Jul 28 16:06 record.baseA
-rw-rw-r-- 1 ec2-user ec2-user   14 Jul 28 16:06 record.baseB
-rw-rw-r-- 1 ec2-user ec2-user 8192 Jul 28 16:06 record.DB
-rw-rw-r-- 1 ec2-user ec2-user   13 Jul 28 16:06 termlist.baseA
-rw-rw-r-- 1 ec2-user ec2-user   14 Jul 28 16:06 termlist.baseB
-rw-rw-r-- 1 ec2-user ec2-user 8192 Jul 28 16:06 termlist.DB

The next article will show how to query the index.

