
Four steps of data preprocessing

Mar 05, 2021, 10:36 AM

The four steps of data preprocessing are data cleaning, data integration, data transformation, and data reduction. Data preprocessing refers to the review, screening, sorting, and other necessary processing applied to collected data before it is classified or grouped. It serves two purposes: improving the quality of the data and adapting the data to the software or methods used for analysis.


The operating environment of this article: Windows 7 system, Dell G3 computer.

Data preprocessing refers to the review, screening, sorting, and other necessary processing applied to collected data before it is classified or grouped.

On the one hand, data preprocessing improves the quality of the data; on the other, it adapts the data to the software or method used for analysis. In general, the steps of data preprocessing are data cleaning, data integration, data transformation, and data reduction, and each major step contains several subdivisions. Of course, not all four steps are necessarily required in every preprocessing task.

1. Data Cleaning

Data cleaning, as the name suggests, turns "dirty" data into "clean" data. Data can be dirty in form or in content.

Dirty in form, such as missing values and special symbols;

Dirty in content, such as outliers.

1. Missing values

Dealing with missing values involves two tasks: identifying them and processing them.

In R, the is.na function identifies missing values, and the complete.cases function identifies whether each row of the sample data is complete.
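The R functions mentioned here have direct pandas analogues; a minimal illustrative sketch in Python (the column names and values are invented):

```python
import numpy as np
import pandas as pd

# Invented sample data with some missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [3000, 5200, np.nan, 4100],
})

missing_mask = df.isna()                 # analogue of R's is.na: True where a value is missing
complete_rows = df.notna().all(axis=1)   # analogue of R's complete.cases: True for fully observed rows
```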

Commonly used methods for dealing with missing values are deletion, replacement, and interpolation.

  • Deletion method: depending on what is deleted, this splits into deleting observations and deleting variables. Deleting observations (listwise deletion) can be done with the na.omit function in R, which removes rows containing missing values.

    This trades sample size for completeness of information. When a variable has many missing values and matters little to the research objective, the variable itself can be deleted with the R statement mydata[,-p], where mydata is the data set, p is the column number of the variable to delete, and the minus sign indicates deletion.

  • Replacement method: as the name suggests, this replaces missing values, with different rules for different variable types. If the variable is numeric, missing values are replaced with the mean of the other observed values of that variable; if it is non-numeric, the median or mode of the other observed values is used instead.

  • Interpolation method: this divides into regression imputation and multiple imputation.

    Regression imputation treats the variable to be imputed as the dependent variable y and the other variables as independent variables, fits a regression model (the lm function in R), and predicts the missing values from the fitted model;

    Multiple imputation generates complete data sets from a data set containing missing values by repeatedly drawing random samples of plausible values for the missing entries. The mice package in R implements multiple imputation.
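A sketch of the deletion and replacement methods in pandas (regression and multiple imputation, the lm/mice route in R, are left out here; the data frame is invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0],
    "city": ["Beijing", "Shanghai", None, "Shanghai"],
})

# Deletion (like na.omit in R): drop every row that contains a missing value
dropped = df.dropna()

# Replacement: mean for the numeric column, mode for the non-numeric one
filled = df.copy()
filled["height"] = filled["height"].fillna(filled["height"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
```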

2. Outliers

As with missing values, handling outliers involves identification and processing.

  • Outliers are usually identified with a univariate scatter plot or a box plot. In R, the dotchart function draws a univariate scatter plot and the boxplot function draws a box plot; points far outside the normal range in these plots are treated as outliers.

  • Processing options include deleting observations containing outliers (direct deletion; when the sample is small, this reduces the sample size and may change the distribution of variables), treating them as missing values (and filling them in using the available information), mean correction (replacing the outlier with the average of the two adjacent observations), or leaving them unchanged. Before handling an outlier, first review the possible reasons for its occurrence and then decide whether it should be discarded.
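The box-plot identification described above corresponds to the common 1.5 × IQR rule, which is easy to reproduce numerically; a sketch with made-up values:

```python
import numpy as np

data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])  # 95 is a planted outlier

# Box-plot rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
```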

2. Data integration

Data integration merges multiple data sources into a single data store. Of course, if the data being analyzed already resides in one data store, no integration is needed.

In R, integration is implemented by joining two data frames on a key with the merge function: merge(dataframe1, dataframe2, by="keyword"). By default the result is sorted in ascending order of the key.
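The same key-based join can be sketched in Python with pandas.merge (the data frames here are invented; sort_values mimics R's default ascending ordering):

```python
import pandas as pd

customers = pd.DataFrame({"id": [2, 1], "name": ["Bob", "Ann"]})
orders = pd.DataFrame({"id": [1, 2], "amount": [99, 45]})

# Inner join on the key column, then sort ascending by the key
merged = pd.merge(customers, orders, on="id").sort_values("id")
```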

The following problems may occur when performing data integration:

  1. Homonyms: an attribute in data source A and an attribute in data source B have the same name but represent different entities, so the attribute cannot be used as a key;

  2. Synonyms: an attribute is named differently in the two data sources but represents the same entity, so it can be used as a key;

  3. Data integration often produces data redundancy: the same attribute may appear multiple times, or duplication may arise from inconsistent attribute names. Duplicate attributes should first be detected through correlation analysis and then deleted if confirmed.

3. Data transformation

Data transformation converts the data into a form appropriate for the software or the analysis theory.

1. Simple function transformation

Simple function transformations are used to turn data that does not follow a normal distribution into data that does. Common transformations include the square, square root, logarithm, and difference. For example, in time-series analysis, a logarithm or difference transformation is often applied to convert a non-stationary series into a stationary one.
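As a quick illustration of the log-and-difference idea, consider a made-up series growing 10% per step; after a log transform, the first differences become constant:

```python
import numpy as np

# A geometric (10% per step) series is non-stationary in level
series = np.array([100.0, 110.0, 121.0, 133.1, 146.41])

log_series = np.log(series)
log_diffs = np.diff(log_series)  # equals log(1.1) at every step for exact geometric growth
```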

2. Standardization

Standardization removes the influence of a variable's scale. For example, height and weight cannot be compared directly because their units and value ranges differ.

  • Min-max normalization: also called dispersion standardization; linearly transforms the data so that it falls in the range [0,1];

  • Zero-mean normalization: also called standard deviation standardization; the processed data has mean 0 and standard deviation 1;

  • Decimal scaling normalization: moves the decimal point of attribute values so that they map into [-1,1].
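Each of the three normalization schemes takes only a line or two of NumPy; an illustrative sketch with toy values:

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0])  # toy values

# Min-max (dispersion) normalization: rescale linearly into [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Zero-mean (z-score) normalization: mean 0, standard deviation 1
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^k so absolute values fall in [-1, 1]
k = int(np.ceil(np.log10(np.abs(x).max())))
decimal_scaled = x / 10 ** k
```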

3. Continuous attribute discretization

Converting continuous attribute variables into categorical attributes is called discretization. Some classification algorithms, such as ID3, require the data to be categorical.

Commonly used discretization methods include the following:

  1. Equal-width method: Divide the value range of the attribute into intervals with the same width, similar to making a frequency distribution table;

  2. Equal-frequency method: place the same number of records into each interval;

  3. One-dimensional clustering: in two steps, first cluster the values of the continuous attribute with a clustering algorithm, then merge each resulting cluster into one interval and mark it with a single label.
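The equal-width and equal-frequency methods map directly onto pandas' cut and qcut (one-dimensional clustering would instead call a clustering routine such as k-means, not shown); a sketch with invented ages:

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 27, 31, 35, 41, 58])  # invented values

# Equal-width: intervals of identical width, like a frequency-distribution table
equal_width = pd.cut(ages, bins=4)

# Equal-frequency: each interval receives (roughly) the same number of records
equal_freq = pd.qcut(ages, q=4)

width_counts = equal_width.value_counts().tolist()
freq_counts = equal_freq.value_counts().tolist()
```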

4. Data reduction

Data reduction means, on the basis of understanding the mining task and the data itself, finding the useful features that the discovery target depends on, so as to shrink the data set while preserving its original character as much as possible.

Data reduction lessens the impact of invalid and erroneous data on modeling, shortens processing time, and reduces the space needed to store the data.

1. Attribute reduction

Attribute reduction seeks the smallest attribute subset whose probability distribution is as close as possible to that of the original data.

  1. Merge attributes: merge some old attributes into a new attribute;

  2. Stepwise forward selection: start from an empty attribute set; at each step, select the current best attribute from the original attribute set and add it to the subset, until no better attribute can be selected or a constraint value is satisfied;

  3. Stepwise backward elimination: start from the full attribute set; at each step, remove the current worst attribute from the subset, until no worst attribute can be selected or a constraint value is satisfied;

  4. Decision tree induction: build a decision tree on the data; attributes that do not appear in the tree are deleted from the initial set, leaving a better attribute subset;

  5. Principal component analysis: use fewer variables to explain most of the variance in the original data (transform highly correlated variables into independent or uncorrelated components).
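Of the methods listed, principal component analysis is the easiest to demonstrate numerically; a minimal sketch with synthetic, highly correlated data (using only NumPy's eigendecomposition rather than a dedicated PCA library):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)  # nearly a linear function of x1
data = np.column_stack([x1, x2])

# PCA by eigendecomposition of the covariance matrix
centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
explained = eigvals[::-1] / eigvals.sum()  # variance ratios, largest first

# Because x2 is almost redundant, the first component carries nearly all the variance
```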

2. Numerical reduction

Numerical reduction reduces the amount of data using parametric or non-parametric methods. Parametric methods include linear regression and multiple regression; non-parametric methods include histograms, sampling, and so on.
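Both non-parametric ideas, histograms and sampling, can be sketched briefly (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=10_000)

# Histogram: summarizes 10,000 values with just 20 bin counts
counts, edges = np.histogram(values, bins=20)

# Sampling: a simple random sample without replacement
sample = rng.choice(values, size=500, replace=False)
```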

