
I used Python to crawl data on more than 4,000 Taobao products and discovered these patterns!

Mar 07, 2018 pm 04:07 PM
python merchandise

This article walks through the whole process of crawling Taobao product data with Python, mining and analyzing that data, and finally drawing conclusions from it.


Project content


The product category chosen for this case is sofas.

Quantity: 100 pages, 4,400 products in total.

Filter conditions: Tmall only, sorted by sales volume from high to low, price above 500 yuan.

Project Purpose

Text analysis of product titles, with word cloud visualization

Statistical analysis of sales for different keywords

Analysis of the price distribution of products

Analysis of the sales distribution of products

Average sales of products in different price ranges

Analysis of the impact of product price on sales volume

Analysis of the impact of product price on sales revenue

Distribution of product counts across provinces or cities

Average sales of products across provinces

Note: the analyses listed above are only examples; the project is not limited to them.

Project steps

Data collection: Python crawls Taobao product data

Clean and process the data

Text analysis: jieba word segmentation, wordcloud visualization

Bar chart visualization: barh

Histogram visualization: hist

Scatter plot visualization: scatter

Regression analysis visualization: regplot

Tools & Modules

Tool: Spyder, the code editor bundled with Anaconda, is used in this case.

Modules: requests, retrying, missingno, jieba, matplotlib, wordcloud, imread, seaborn, etc.

Crawling data

Taobao has anti-crawler measures, so even with multi-threading and modified headers parameters there is no guarantee that every page is fetched successfully on each run. I therefore added a retry loop: each pass re-crawls the pages that failed, until all pages have been crawled successfully.

Note: the product data in the Taobao page is in JSON format, and regular expressions are used to parse it here.

The code is as follows:

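The author's listing is not preserved in this text. As a rough sketch of the retry loop and the regex parsing described above (not the original code), the following assumes each field appears in the page as a `"name":"value"` pair. The `fetch` callable, the `max_rounds` cap, and both function names are hypothetical and introduced only for illustration.

```python
import re

# Field names taken from the article; the "name":"value" layout is an assumption.
FIELDS = ("raw_title", "view_price", "view_sales", "item_loc")


def parse_page(text):
    """Pull the four fields out of one page of text with regular expressions."""
    columns = [re.findall(r'"%s":"(.*?)"' % name, text) for name in FIELDS]
    return [dict(zip(FIELDS, row)) for row in zip(*columns)]


def crawl_all(pages, fetch, max_rounds=10):
    """Fetch every page, re-trying only the failed pages on each pass.

    `fetch` is a hypothetical callable: page number -> page text (may raise).
    `max_rounds` is a safety cap added here so the loop cannot run forever.
    """
    results = {}
    pending = list(pages)
    for _ in range(max_rounds):
        failed = []
        for page in pending:
            try:
                items = parse_page(fetch(page))
            except Exception:  # broad on purpose: any failure means "retry later"
                items = []
            if items:
                results[page] = items
            else:
                failed.append(page)  # blocked or empty page: try again next pass
        pending = failed
        if not pending:
            break
    return results, pending
```

In practice `fetch` would wrap an HTTP request with custom headers; whatever is still in `pending` after the last pass was never fetched successfully.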

Data cleaning and processing

The data cleaning and processing steps can also be done in Excel, with the cleaned data then read back in.

The code is as follows:


Note: as required for this case, only the four columns item_loc, raw_title, view_price, and view_sales are used, mainly to analyze region, title, price, and sales volume.

The code is as follows:

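A minimal sketch of what cleaning the four columns might look like, written in plain Python rather than the author's code. It assumes that view_sales is a string containing the number of buyers (for example "356人付款"), that view_price is a numeric string, and that item_loc starts with the province followed by a space and the city. These formats, the `clean_row` helper, and the sample record are assumptions for illustration.

```python
import re


def clean_row(row):
    """Convert one raw record (all strings) into typed fields."""
    digits = re.search(r"\d+", row["view_sales"])
    return {
        # Assumed layout "province city"; a municipality such as "上海" has one part.
        "province": row["item_loc"].split()[0],
        "title": row["raw_title"],
        "price": float(row["view_price"]),
        # No digits found is treated as zero sales (an assumption).
        "sales": int(digits.group()) if digits else 0,
    }


# Made-up sample record, only to show the shape of the data.
raw_rows = [
    {"item_loc": "广东 佛山", "raw_title": "北欧布艺沙发",
     "view_price": "1299.00", "view_sales": "356人付款"},
]
data = [clean_row(r) for r in raw_rows]
```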

Data Mining and Analysis

Text analysis of the titles in the raw_title column

Use the jieba ("stutter") word segmentation tool; install the module with pip install jieba:


Filter the elements (str) of each list in title_s (a list of lists) and remove unneeded words; that is, remove every word that appears in the stopwords list:


Because the occurrences of each word are counted below, for accuracy the words in each list element of the filtered data title_clean are deduplicated here, so that each title is reduced to its unique words.

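The filtering, deduplication, and counting steps above can be sketched as follows. This is an illustration in plain Python, not the author's listing. It starts from titles that are already segmented, so jieba itself is not called here; the names title_s, title_clean, and word_count follow the article, while the helper functions and the sample titles are made up.

```python
from collections import Counter


def clean_titles(title_s, stopwords):
    """Drop stopwords, then deduplicate words within each title (order kept)."""
    stop = set(stopwords)
    title_clean = []
    for words in title_s:
        kept = [w for w in words if w not in stop]
        # dict.fromkeys keeps the first occurrence of each word, in order.
        title_clean.append(list(dict.fromkeys(kept)))
    return title_clean


def count_words(title_clean):
    """Count how many titles each word appears in."""
    return Counter(w for words in title_clean for w in words)


# Made-up, already-segmented titles for illustration.
title_s = [["简约", "布艺", "沙发", "沙发"], ["北欧", "布艺", "沙发", "的"]]
stopwords = ["的", "沙发"]

title_clean = clean_titles(title_s, stopwords)
word_count = count_words(title_clean)
```

Because each title is deduplicated first, a word repeated inside one title is counted once, so the count is the number of titles containing the word.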

Looking at the words in the word_count table shows that jieba's default dictionary does not meet our needs.

Some words (such as "removable" and "non-removable") are split apart. New words are therefore added to the dictionary as needed (you can also add or delete entries directly in the dictionary file dict.txt and then load the modified dict.txt).


Word cloud visualization requires the wordcloud module to be installed.

There are two ways to install the module:

pip install wordcloud

Download the package and install it: pip install followed by the package file name

Note: place the downloaded package in the Python installation path.

The code is as follows:


Analytical conclusion:

Combination and complete-set products account for a high proportion.

By sofa material: fabric sofas account for a high proportion, higher than leather sofas.

By sofa style: the simple style is the most popular, followed by the Nordic style; the other styles rank in this order: American, Chinese, Japanese, French, and so on.

By apartment size: small apartments account for the highest proportion, followed by "large and small apartments", with large apartments the least.

Statistical analysis of total sales for different keywords

Explanation: for the word "simple", for example, the sales of all products whose title contains "simple" are summed. That is, we find the total sales of products in the "simple" style.

The code is as follows:

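One way to compute the per-word sales totals described above is sketched below. It is a hedged illustration, not the author's listing: the function name `sales_by_word` and the sample titles and sales figures are made up, and a word is treated as "contained" when it occurs as a substring of the raw title.

```python
def sales_by_word(words, titles, sales):
    """For each word, sum the sales of every product whose title contains it."""
    return {
        word: sum(s for title, s in zip(titles, sales) if word in title)
        for word in words
    }


# Made-up data for illustration.
titles = ["简约布艺沙发", "北欧简约沙发", "北欧皮沙发"]
sales = [100, 50, 30]

w_s_sum = sales_by_word(["北欧", "简约"], titles, sales)

# Keep the 30 words with the highest totals, largest first.
top_words = sorted(w_s_sum.items(), key=lambda kv: kv[1], reverse=True)[:30]
```

A product whose title contains several keywords contributes to each of them, so the totals across words add up to more than the overall sales.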

Visualize the word and w_s_sum columns of the table df_word_sum. (In this example, the 30 words with the highest sales are plotted.)


It can be seen from the chart:

Combination products have the highest sales volume.

By category: fabric sofa sales are very high, far exceeding leather sofas.

By apartment size: sofas for small apartments sell the most, followed by "large and small apartments", with large apartments selling the least.

In terms of style: simple style has the highest sales volume, followed by Nordic style, followed by Chinese style, American style, Japanese style, etc.

Removable and washable and corner sofas have considerable sales volume and are also very popular among consumers.

Analysis of price distribution of commodities

The analysis found that some values are far too large. To make the visualization more intuitive, we select products priced below 20,000, taking our own product conditions into account.

The code is as follows:

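The counting behind such a histogram can be sketched without a plotting library. This is an illustration only: the 20,000 cutoff comes from the text, while the bin width of 1,500, the function name, and the sample prices are assumptions.

```python
from collections import Counter


def histogram(values, upper, width):
    """Count values below `upper` in fixed-width bins keyed by the bin's lower edge."""
    counts = Counter(int(v // width) * width for v in values if v < upper)
    return dict(sorted(counts.items()))


# Made-up prices; the last one is dropped by the cutoff.
prices = [599.0, 1200.0, 1450.0, 2800.0, 35000.0]
price_hist = histogram(prices, upper=20000, width=1500)
```

The same helper applies to the sales distribution if the filter is changed to keep values above a threshold instead of below one.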

It can be seen from the chart:

The number of products generally steps down as price rises: the higher the price, the fewer products are on sale.

Low-priced products dominate: the 500-1,500 range has the most products, followed by 1,500-3,000, and few products are priced above 10,000.

Above 10,000 yuan, the number of products on sale differs little from one price range to the next.

Sales distribution analysis of goods


Similarly, to make the visualization more intuitive, we select products with a sales volume greater than 100.

The code is as follows:

It can be seen from the chart and data:

Only 3.4% of the products have a sales volume above 100. Among them, products with sales of 100-200 are the most numerous, followed by those with sales of 200-300.

For sales between 100 and 500, the number of products falls steeply as sales rise; most products are low sellers.

There are very few products with sales of more than 500.

The average sales volume distribution of goods in different price ranges

The code is as follows:

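The text does not say how the price ranges were produced. As one possible approach (an assumption, not the author's method), the sketch below sorts products by price, cuts them into groups of equal size, and averages sales within each group. The function name and sample numbers are made up.

```python
def mean_sales_by_price_bin(prices, sales, n_bins):
    """Sort by price, split into n_bins equal-sized groups, average sales per group.

    Returns a list of ((lowest_price, highest_price), mean_sales) tuples.
    """
    pairs = sorted(zip(prices, sales))
    size, extra = divmod(len(pairs), n_bins)
    out, start = [], 0
    for i in range(n_bins):
        # The first `extra` groups take one more item so every item is used.
        end = start + size + (1 if i < extra else 0)
        group = pairs[start:end]
        if group:
            mean = sum(s for _, s in group) / len(group)
            out.append(((group[0][0], group[-1][0]), mean))
        start = end
    return out


# Made-up data for illustration.
prices = [500, 900, 1300, 1600, 5000, 9000]
sales = [10, 30, 60, 80, 5, 1]
bins = mean_sales_by_price_bin(prices, sales, n_bins=3)
```

Equal-sized groups give narrow ranges where products are dense and wide ranges where they are sparse, which keeps each average based on the same number of products.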

It can be seen from the chart:

The average sales volume of products priced between 1,331 and 1,680 yuan is the highest, followed by those priced between 951 and 1,331 yuan; products priced above 9,684 yuan have the lowest.

Overall, average sales first rise and then fall, with the peak in a relatively low price range.

This shows that consumer demand for sofas is concentrated in the lower price ranges. Above 1,680 yuan, the higher the price, the lower the average sales volume.

Analysis of the impact of commodity prices on sales volume

As above, to make the visualization more intuitive, we select products priced below 20,000, taking our own product conditions into account.

The code is as follows:


It can be seen from the chart:

Overall trend: as the price of a product increases, its sales volume decreases; price has a large effect on sales volume.

A few products priced between 500 and 2,500 have very high sales volume. Most products priced between 2,500 and 5,000 sell poorly, though a few sell relatively well. Products priced above 5,000 all have very low sales volume, with no standout sellers.

Analysis of the impact of commodity prices on sales revenue

The code is as follows:

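seaborn's regplot draws a scatter plot together with a fitted linear regression line. The sketch below computes such a line by ordinary least squares in plain Python, to show how the fit for sales volume can slope down while the fit for revenue slopes up. Treating revenue as price times sales volume is an assumption here, and the function name and numbers are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data: sales volume falls as price rises.
prices = [1000, 2000, 3000]
volumes = [30, 20, 15]
# Assumption: revenue is price multiplied by sales volume.
revenues = [p * v for p, v in zip(prices, volumes)]

volume_slope, _ = fit_line(prices, volumes)    # negative slope
revenue_slope, _ = fit_line(prices, revenues)  # positive slope
```

With these numbers volume drops more slowly than price rises, so the product of the two still grows with price.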

It can be seen from the chart:

Overall trend: the linear regression fit line shows that product sales revenue trends upward as price rises.

Most products have low prices and low sales revenue as well.

Only a few products priced between 0 and 20,000 have high sales revenue. Only 3 products priced between 20,000 and 60,000 have high sales revenue. One product priced between 60,000 and 100,000 has high sales revenue, and it is the maximum value.

Distribution of product counts across provinces

The code is as follows:


It can be seen from the chart:

Guangdong has the most products, followed by Shanghai, with Jiangsu third. Guangdong's count in particular far exceeds those of Jiangsu, Zhejiang, Shanghai, and other places, which shows that Guangdong stores dominate the sofa sub-category.

The counts for Jiangsu, Zhejiang, and Shanghai differ little and are roughly the same.

Average sales distribution of goods in different provinces

The code is as follows:

[Figure: heat map]
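Both province-level analyses, product count and average sales, come down to grouping records by province. A minimal sketch in plain Python follows; it is not the author's listing, and the function name and sample records are made up for illustration.

```python
from collections import defaultdict


def province_stats(rows):
    """Group (province, sales) pairs; return {province: (count, mean_sales)}."""
    groups = defaultdict(list)
    for province, sales in rows:
        groups[province].append(sales)
    return {p: (len(v), sum(v) / len(v)) for p, v in groups.items()}


# Made-up records for illustration.
rows = [("广东", 120), ("广东", 80), ("上海", 40), ("江苏", 10)]
stats = province_stats(rows)

# Provinces ordered by product count, then by average sales, largest first.
by_count = sorted(stats, key=lambda p: stats[p][0], reverse=True)
by_mean = sorted(stats, key=lambda p: stats[p][1], reverse=True)
```

A province with many products does not necessarily have high average sales, which is why the two rankings are computed separately.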
