
Detailed explanation of Python text feature extraction and vectorization algorithm learning examples

Dec 23, 2017 pm 05:05 PM

Suppose we have just watched Nolan's blockbuster "Interstellar". How can we let a machine automatically analyze whether the audience's evaluation of the movie is "positive" or "negative"? This is a sentiment analysis problem, and the first step in handling it is to convert text into features. This article introduces text feature extraction and vectorization in Python in detail.

Here we cover only that first step: how to extract features from text and vectorize them.

Since the processing of Chinese involves word segmentation, this article uses a simple example to illustrate how to use Python's machine learning library to extract features from English.

1. Data preparation

Python's sklearn.datasets can read all classified texts from a directory, provided the directory follows the rule of one folder per label name. For example, the data set used in this article has 2 labels, "neg" and "pos", and there are 6 text files under each directory. The directory is laid out as follows:

neg
    1.txt
    2.txt
    ......
pos
    1.txt
    2.txt
    ......

The contents of the 12 files are summarized as follows:



neg: 
  shit. 
  waste my money. 
  waste of money. 
  sb movie. 
  waste of time. 
  a shit movie. 
pos: 
  nb! nb movie! 
  nb! 
  worth my money. 
  I love this movie! 
  a nb movie. 
  worth it!
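To try the layout described above without creating the files by hand, the following minimal sketch writes the 12 reviews into one folder per label. The helper name write_dataset and the use of a temporary directory are assumptions made for illustration; the folder name endata is the one the vectorization code later in the article passes to load_files.

```python
import tempfile
from pathlib import Path

# The 12 sample reviews from the listing above, grouped by label.
REVIEWS = {
    "neg": ["shit.", "waste my money.", "waste of money.",
            "sb movie.", "waste of time.", "a shit movie."],
    "pos": ["nb! nb movie!", "nb!", "worth my money.",
            "I love this movie!", "a nb movie.", "worth it!"],
}


def write_dataset(root):
    """Write one folder per label and one numbered .txt file per review."""
    root = Path(root)
    for label, texts in REVIEWS.items():
        label_dir = root / label
        label_dir.mkdir(parents=True, exist_ok=True)
        for i, text in enumerate(texts, start=1):
            (label_dir / f"{i}.txt").write_text(text, encoding="utf-8")
    return root


# A temporary directory stands in for the real data folder (assumption).
dataset_root = write_dataset(Path(tempfile.mkdtemp()) / "endata")
labels = sorted(p.name for p in dataset_root.iterdir())
file_count = sum(1 for _ in dataset_root.glob("*/*.txt"))
print(labels, file_count)
```

Pointing sklearn.datasets.load_files at the resulting folder should then yield one target label per sub-folder name.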

2. Text features

How to extract emotional attitudes from these English words and classify them?


The most intuitive way is to extract words. It is generally believed that many keywords can reflect the speaker's attitude. For example, in the simple data set above, it is easy to find that anything that says "shit" must belong to the neg category.

Of course, the data set above is deliberately simple for ease of description. In reality, the attitude a word expresses is often ambiguous. But there is still reason to believe that the more often a word appears in the neg category, the greater the probability that it expresses a negative attitude.

We also notice that some words carry no meaning for sentiment classification, such as "of" and "I" in the data above. Words of this kind are called stop words. They can be ignored entirely and left out of the counts; doing so reduces the storage needed for word frequency records and speeds up construction. Using the raw frequency of each word as a feature has another problem. For example, "movie" appears 5 times in the 12 samples, but it occurs almost equally often in positive and negative reviews, so it does not discriminate between them. "worth" appears only twice, but only in the pos category, so it clearly carries a strong positive color and discriminates very well.
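The observation about "movie" and "worth" can be checked with a short count. This is only a sketch: the STOP_WORDS set is a tiny hand-picked list for these 12 sentences (an assumption, not sklearn's built-in list), and the regular-expression tokenizer is likewise a simplification.

```python
import re
from collections import Counter

NEG = ["shit.", "waste my money.", "waste of money.",
       "sb movie.", "waste of time.", "a shit movie."]
POS = ["nb! nb movie!", "nb!", "worth my money.",
       "I love this movie!", "a nb movie.", "worth it!"]

# Tiny hand-picked stop-word list for this example only (assumption).
STOP_WORDS = {"a", "i", "it", "my", "of", "this"}


def tokenize(text):
    """Lowercase the text, keep alphabetic runs, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]


def document_counts(docs):
    """Count, for each word, how many documents contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return counts


neg_counts = document_counts(NEG)
pos_counts = document_counts(POS)
for word in ("movie", "worth", "shit"):
    print(word, "neg:", neg_counts[word], "pos:", pos_counts[word])
```

"movie" shows up on both sides, while "worth" and "shit" each appear in one class only, which is the kind of difference a plain frequency count cannot express.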

Therefore, we need to introduce TF-IDF (Term Frequency-Inverse Document Frequency) to weigh each word further.

TF (term frequency) is simple to calculate: for a document d, it is the number of times a given word t appears in d divided by the total number of words in d. For example, in the document "I love this movie", the TF of the word "love" is 1/4. If you remove the stop words "I" and "this", it is 1/2.
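The TF arithmetic can be reproduced with a few lines. This is a minimal sketch; the function name term_frequency and whitespace tokenization are assumptions for illustration, and the stop words removed from this sentence are taken to be "I" and "this".

```python
from fractions import Fraction


def term_frequency(word, document, stop_words=frozenset()):
    """Occurrences of `word` divided by the number of remaining words."""
    tokens = [w for w in document.lower().split() if w not in stop_words]
    return Fraction(tokens.count(word), len(tokens))


doc = "I love this movie"
tf_all = term_frequency("love", doc)
tf_filtered = term_frequency("love", doc, stop_words={"i", "this"})
print(tf_all, tf_filtered)  # prints: 1/4 1/2
```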

IDF (inverse document frequency) is computed for a word t from the total number of documents D and the number of documents Dt in which the word appears: divide D by Dt, then take the natural logarithm. For example, the word "movie" appears in 5 documents and the total number of documents is 12, so its IDF is ln(12/5).
The purpose of IDF is to highlight words that appear rarely but carry strong emotional color. For example, the IDF of a word like "movie" is ln(12/5)=0.88, much smaller than the IDF of "love", ln(12/1)=2.48.
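The two IDF values can be verified against the 12 sample documents. The sketch below uses the plain textbook form ln(D/Dt); the function names are assumptions, and a word that appears in no document would cause a division by zero here.

```python
import math
import re

DOCS = ["shit.", "waste my money.", "waste of money.",
        "sb movie.", "waste of time.", "a shit movie.",
        "nb! nb movie!", "nb!", "worth my money.",
        "I love this movie!", "a nb movie.", "worth it!"]


def words(text):
    """Set of lowercase words in a document."""
    return set(re.findall(r"[a-z]+", text.lower()))


def inverse_document_frequency(word, docs):
    """ln(D / Dt): D documents in total, Dt of them contain `word`."""
    containing = sum(1 for doc in docs if word in words(doc))
    return math.log(len(docs) / containing)


idf_movie = inverse_document_frequency("movie", DOCS)
idf_love = inverse_document_frequency("love", DOCS)
print(round(idf_movie, 2), round(idf_love, 2))  # prints: 0.88 2.48
```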

TF-IDF is simply the product of the two. The TF-IDF of each word in each document is the text feature value we extract.
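As a worked example of that product, take the document "I love this movie" with the stop words removed, so that only "love" and "movie" remain. The symbols below (t for a word, d for a document, D for the total number of documents, D_t for the number of documents containing t) are notation chosen for this sketch.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \ln\frac{D}{D_t}

\mathrm{tfidf}(\text{love}, d)  = \tfrac{1}{2} \times \ln\tfrac{12}{1} \approx 0.5 \times 2.48 \approx 1.24

\mathrm{tfidf}(\text{movie}, d) = \tfrac{1}{2} \times \ln\tfrac{12}{5} \approx 0.5 \times 0.88 \approx 0.44
```

Both words occur once in the document, yet "love" receives almost three times the weight of "movie" because it is much rarer across the collection.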

3. Vectorization

With the above foundation, the document can be vectorized. Let’s look at the code first, and then analyze the meaning of vectorization:



# -*- coding: utf-8 -*-
from sklearn.datasets import load_files
# sklearn.cross_validation was removed; train_test_split now lives in model_selection.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the data set: one sub-folder of 'endata' per label (neg, pos).
movie_reviews = load_files('endata')

# Split the data set: 70% for training, 30% for testing.
doc_terms_train, doc_terms_test, y_train, y_test = train_test_split(
    movie_reviews.data, movie_reviews.target, test_size=0.3)

# Vector space model with TF-IDF weights (binary=False keeps real term counts).
# Note that fit_transform is called on the training samples only;
# the test samples go through transform.
count_vec = TfidfVectorizer(binary=False, decode_error='ignore',
                            stop_words='english')
x_train = count_vec.fit_transform(doc_terms_train)
x_test = count_vec.transform(doc_terms_test)
x = count_vec.transform(movie_reviews.data)
y = movie_reviews.target
print(doc_terms_train)
# Older scikit-learn releases called this method get_feature_names().
print(count_vec.get_feature_names_out().tolist())
print(x_train.toarray())
print(movie_reviews.target)

The output is as follows:
[b'waste of time.', b'a shit movie.', b'a nb movie.', b'I love this movie!', b'shit.', b'worth my money.', b'sb movie.', b'worth it!']
['love', 'money', 'movie', 'nb', 'sb', 'shit', 'time', 'waste', 'worth']
[[ 0.          0.          0.          0.          0.          0.   0.70710678  0.70710678  0.        ]
 [ 0.          0.          0.60335753  0.          0.          0.79747081   0.          0.          0.        ]
 [ 0.          0.          0.53550237  0.84453372  0.          0.          0.   0.          0.        ]
 [ 0.84453372  0.          0.53550237  0.          0.          0.          0.   0.          0.        ]
 [ 0.          0.          0.          0.          0.          1.          0.   0.          0.        ]
 [ 0.          0.76642984  0.          0.          0.          0.          0.   0.          0.64232803]
 [ 0.          0.          0.53550237  0.          0.84453372  0.          0.   0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.   0.          1.        ]]
[1 1 0 1 0 1 0 1 1 0 0 0]

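Python's raw output is hard to read, so here are the same values arranged as a table (rounded to three decimals):

document             love   money  movie  nb     sb     shit   time   waste  worth
waste of time.       0      0      0      0      0      0      0.707  0.707  0
a shit movie.        0      0      0.603  0      0      0.797  0      0      0
a nb movie.          0      0      0.536  0.845  0      0      0      0      0
I love this movie!   0.845  0      0.536  0      0      0      0      0      0
shit.                0      0      0      0      0      1      0      0      0
worth my money.      0      0.766  0      0      0      0      0      0      0.642
sb movie.            0      0      0.536  0      0.845  0      0      0      0
worth it!            0      0      0      0      0      0      0      0      1

From the table above we can observe the following: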

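1. Stop word filtering.

When constructing count_vec we passed stop_words = 'english', which selects the default English stop word list. You can call count_vec.get_stop_words() to see all the stop words built into TfidfVectorizer. You can also pass in your own stop word list (one that includes "movie", for example).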
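A custom stop word list can be sketched as follows. Extending sklearn's ENGLISH_STOP_WORDS with "movie" is one possible way to build the list, chosen here for illustration; the three sample documents are taken from the data set above. scikit-learn may emit a warning that the stop words are inconsistent with its tokenizer, which does not affect the result here.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

docs = ["a shit movie.", "I love this movie!", "worth my money."]

# Built-in English list plus one extra, domain-specific stop word.
custom_stop_words = sorted(ENGLISH_STOP_WORDS | {"movie"})

vec = TfidfVectorizer(stop_words=custom_stop_words)
vec.fit(docs)

# vocabulary_ maps each kept word to its column index.
vocabulary = sorted(vec.vocabulary_)
print(vocabulary)
print("movie" in vec.get_stop_words())
```

With "movie" filtered out, the feature columns contain only words that can separate the two classes.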

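2. TF-IDF calculation.

The weights here are computed with sklearn's TfidfVectorizer. This class inherits from CountVectorizer and adds TF-IDF weighting on top of the latter's basic word counting.
The results differ from our earlier hand calculation. By default TfidfVectorizer uses the raw count of a word as its TF, uses a smoothed IDF of ln((1+D)/(1+Dt))+1, and then L2-normalizes each document vector (norm='l2'), so every value falls within [0,1]. The max_df parameter, which defaults to 1.0, only drops words that appear in too many documents and plays no part in this normalization.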
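To see where numbers such as 0.60335753 and 0.79747081 for "a shit movie." come from, the default weighting can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings (smooth_idf=True, norm='l2', raw counts as TF) and reads the document frequencies off the 8 training documents printed above: "movie" occurs in 4 of them and "shit" in 2.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF as used by default: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_train = 8                            # training documents in the output above
doc_freq = {"movie": 4, "shit": 2}     # documents containing each word
counts = {"movie": 1, "shit": 1}       # raw counts in "a shit movie."

# Unnormalized weight: raw count times smoothed IDF.
raw = {w: counts[w] * smoothed_idf(n_train, doc_freq[w]) for w in counts}

# L2 normalization: divide by the Euclidean length of the document vector.
length = math.sqrt(sum(v * v for v in raw.values()))
weights = {w: v / length for w, v in raw.items()}
print(weights)
```

The two values match the second row of the printed matrix, which suggests the difference from the hand calculation lies in the smoothing and normalization rather than in the idea of TF-IDF itself.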

3. The result of count_vec.fit_transform is a large matrix. The table above contains many zeros, which is why sklearn stores the result internally as a sparse matrix. The data in this example is small. Interested readers can try real data used by machine learning researchers, from Cornell University: http://www.cs.cornell.edu/people/pabo/movie-review-data/. The site provides many data sets, including several of about 2M, with about 700 positive and 700 negative examples. Data of this scale is not large and can still be processed within 1 minute, so it is worth a try. Be aware, however, that these data sets may contain illegal characters. That is why decode_error = 'ignore' is passed when constructing count_vec: it skips such characters.
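The sparse storage can be inspected directly. This sketch fits the vectorizer on all 12 sample sentences in memory rather than on files, which is an assumption made to keep the example self-contained.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["shit.", "waste my money.", "waste of money.",
        "sb movie.", "waste of time.", "a shit movie.",
        "nb! nb movie!", "nb!", "worth my money.",
        "I love this movie!", "a nb movie.", "worth it!"]

vec = TfidfVectorizer(stop_words="english")
x = vec.fit_transform(docs)

# fit_transform returns a SciPy sparse matrix: only non-zero cells are stored.
total_cells = x.shape[0] * x.shape[1]
print(type(x).__name__, x.shape)
print("stored non-zero entries:", x.nnz, "of", total_cells)

# toarray() expands it to a dense NumPy array, which is fine for tiny data only.
dense = x.toarray()
```

On a real corpus with tens of thousands of distinct words, the dense form would be mostly zeros, so keeping the sparse matrix is usually the practical choice.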

The values in the table above are the result of vectorizing the 8 training samples into 9 features. These feature vectors can then be fed to various classification algorithms.
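As one illustration of that last step, the sketch below trains a classifier on the TF-IDF vectors. The choice of MultinomialNB is an assumption made here (the article does not name a specific algorithm), all 12 samples are used for training, and the two test phrases are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

neg = ["shit.", "waste my money.", "waste of money.",
       "sb movie.", "waste of time.", "a shit movie."]
pos = ["nb! nb movie!", "nb!", "worth my money.",
       "I love this movie!", "a nb movie.", "worth it!"]
docs = neg + pos
labels = ["neg"] * len(neg) + ["pos"] * len(pos)

# Fit the vocabulary and IDF weights on the training texts.
vec = TfidfVectorizer(stop_words="english")
x = vec.fit_transform(docs)

# Naive Bayes is one of many classifiers that accept these sparse features.
clf = MultinomialNB().fit(x, labels)

# New texts must go through transform, not fit_transform.
new_docs = ["a complete waste of time", "totally worth it"]
predictions = list(clf.predict(vec.transform(new_docs)))
print(predictions)
```

Words the vectorizer never saw during fitting, such as "complete" and "totally", are simply ignored by transform.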
