Table of Contents
Whoosh Introduction
Index & query
Sample code
Data
Fields
Create index file
Query

Whoosh: A lightweight search tool for Python

Apr 14, 2023 pm 09:07 PM
python tool whoosh

Whoosh Introduction

Whoosh was created by Matt Chaput. It began as a simple, fast search tool for the online documentation of the Houdini 3D animation software package, and has since grown into a mature, open-source search solution.

Whoosh is written in pure Python. It is a flexible, convenient, and lightweight search engine library that supports both Python 2 and Python 3. Its advantages are as follows:

  • Whoosh is written in pure Python, yet it is fast. It needs only a Python environment and no compiler;
  • It uses the Okapi BM25F ranking algorithm by default, and other ranking algorithms are also supported;
  • Whoosh creates fairly small index files compared with many other search libraries;
  • All text indexed in Whoosh must be unicode;
  • Whoosh can store arbitrary Python objects alongside indexed documents.

The official introduction to Whoosh is at https://whoosh.readthedocs.io/en/latest/intro.html. Compared with mature search engines such as Elasticsearch or Solr, Whoosh is lighter and simpler to operate, and is worth considering for small search projects.

Index & query

For readers familiar with Elasticsearch (ES), the two key aspects of search are mapping and query, that is, building the index and querying it. Behind these sit complex index storage, query parsing, ranking algorithms, and so on. If you have experience with ES, Whoosh is very easy to pick up.

According to the author's understanding and Whoosh's official documentation, getting started with Whoosh comes down to two things: index and query. A key strength of a search engine is full-text search, which depends both on the ranking algorithm, such as BM25, and on how the fields are stored. As a noun, index refers to the index built over the fields; as a verb, it means building that index. A query then uses the ranking algorithm to return reasonable search results for the terms we search for.

The official documentation explains how to use Whoosh in detail. The author gives only a simple example here to show how easily Whoosh can improve the search experience.
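
To make the two senses of index and the role of the ranking algorithm concrete, here is a minimal pure-Python sketch. It is not Whoosh's implementation: the toy corpus, the `bm25` helper, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration, and it uses a textbook BM25 formula rather than the fielded BM25F variant.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical data): document id -> already-tokenized terms.
docs = {
    "d1": ["bright", "moon", "before", "bed"],
    "d2": ["bright", "moon", "bright", "moon", "frontier", "pass"],
    "d3": ["spring", "wind", "river", "bank"],
}

# Indexing (verb) builds the index (noun): term -> {doc_id: term frequency}.
index = defaultdict(dict)
for doc_id, terms in docs.items():
    for term, tf in Counter(terms).items():
        index[term][doc_id] = tf

avg_len = sum(len(terms) for terms in docs.values()) / len(docs)


def bm25(query_terms, k1=1.2, b=0.75):
    """Rank the documents that match any query term with textbook BM25."""
    scores = defaultdict(float)
    n_docs = len(docs)
    for term in query_terms:
        postings = index.get(term, {})
        # Smoothed inverse document frequency: rarer terms weigh more.
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            # Longer-than-average documents are penalized via the length norm.
            norm = 1 - b + b * len(docs[doc_id]) / avg_len
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranked = bm25(["moon"])
print([doc_id for doc_id, _ in ranked])
```

Documents that never mention a query term do not appear in the ranking at all, which is why the lookup goes through the index rather than scanning every document.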

Sample code

Data

The sample data for this project is poem.csv. The following picture shows the first ten rows of the data set:

[Figure: the first ten rows of poem.csv]

Fields

Based on the characteristics of the data set, we create four fields: title, dynasty, poet, and content. The code is as follows:

# -*- coding: utf-8 -*-
import json
import os

from jieba.analyse import ChineseAnalyzer
from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in

# Create the schema. stored=True keeps the field's original value in the index
# so that it can be returned with search results.
schema = Schema(
    title=TEXT(stored=True, analyzer=ChineseAnalyzer()),
    dynasty=ID(stored=True),
    poet=ID(stored=True),
    content=TEXT(stored=True, analyzer=ChineseAnalyzer()),
)

Here, an ID field indexes its whole value as a single unit and does not split it into words. It is typically used for file paths, URLs, dates, and categories.

A TEXT field holds body text: the text is indexed (and stored, when stored=True) and can be searched by individual words. The analyzer chosen here is the jieba Chinese word segmenter.

Create index file

Next, we need to create the index files. The program first parses poem.csv, converts its rows into index documents, and writes the index to the indexdir directory. The Python code is as follows:

# Parse poem.csv, keeping only lines that split into exactly four columns
with open('poem.csv', 'r', encoding='utf-8') as f:
    rows = [line.strip().split(',') for line in f]
texts = [row for row in rows if len(row) == 4]

# Create the indexdir directory if needed and store the schema there
indexdir = 'indexdir/'
os.makedirs(indexdir, exist_ok=True)
ix = create_in(indexdir, schema)

# Add the documents to be indexed, following the schema definition
writer = ix.writer()
for title, dynasty, poet, content in texts[1:]:  # skip the first row, as the original loop does
    writer.add_document(title=title, dynasty=dynasty, poet=poet, content=content)
writer.commit()

Once the index has been created, the indexdir directory contains the index files for the fields of the poem.csv data above.

Query

After the index is successfully created, we will use it to query.

For example, to find the poems whose content contains 明月 (bright moon), we can run the following code:

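# Create a searcher
searcher = ix.searcher()
# Find the documents whose content field contains '明月' (bright moon)
results = searcher.find("content", "明月")
print('一共发现%d份文档。' % len(results))  # "Found %d documents in total."
print('前10份文档如下:')  # "The first 10 documents are:"
for i in range(min(10, len(results))):
    print(json.dumps(results[i].fields(), ensure_ascii=False))
searcher.close()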

The output is as follows:

一共发现44份文档。
前10份文档如下:
{"content": "床前明月光,疑是地上霜。举头望明月,低头思故乡。", "dynasty": "唐代", "poet": "李白 ", "title": "静夜思"}
{"content": "边草,边草,边草尽来兵老。山南山北雪晴,千里万里月明。明月,明月,胡笳一声愁绝。", "dynasty": "唐代", "poet": "戴叔伦 ", "title": "调笑令·边草"}
{"content": "独坐幽篁里,弹琴复长啸。深林人不知,明月来相照。", "dynasty": "唐代", "poet": "王维 ", "title": "竹里馆"}
{"content": "汉江明月照归人,万里秋风一叶身。休把客衣轻浣濯,此中犹有帝京尘。", "dynasty": "明代", "poet": "边贡 ", "title": "重赠吴国宾"}
{"content": "秦时明月汉时关,万里长征人未还。但使龙城飞将在,不教胡马度阴山。", "dynasty": "唐代", "poet": "王昌龄 ", "title": "出塞二首·其一"}
{"content": "京口瓜洲一水间,钟山只隔数重山。春风又绿江南岸,明月何时照我还?", "dynasty": "宋代", "poet": "王安石 ", "title": "泊船瓜洲"}
{"content": "四顾山光接水光,凭栏十里芰荷香。清风明月无人管,并作南楼一味凉。", "dynasty": "宋代", "poet": "黄庭坚 ", "title": "鄂州南楼书事"}
{"content": "青山隐隐水迢迢,秋尽江南草未凋。二十四桥明月夜,玉人何处教吹箫?", "dynasty": "唐代", "poet": "杜牧 ", "title": "寄扬州韩绰判官"}
{"content": "露气寒光集,微阳下楚丘。猿啼洞庭树,人在木兰舟。广泽生明月,苍山夹乱流。云中君不见,竟夕自悲秋。", "dynasty": "唐代", "poet": "马戴 ", "title": "楚江怀古三首·其一"}
{"content": "海上生明月,天涯共此时。情人怨遥夜,竟夕起相思。灭烛怜光满,披衣觉露滋。不堪盈手赠,
