Table of Contents
Univariate Feature Selection
Recursive Feature Elimination (RFE)
Principal Component Analysis (PCA)
Feature importance
Home Backend Development Python Tutorial An introduction to four methods to implement machine learning functions in Python

An introduction to four methods to implement machine learning functions in Python

Apr 13, 2019 am 11:41 AM
python develop machine learning programming language

This article brings you an introduction to the four methods of implementing machine learning functions in Python. It has certain reference value. Friends in need can refer to it. I hope it will be helpful to you.

In this article, we will introduce different methods of selecting features from a dataset; and discuss types of feature selection algorithms and their implementation in Python using the Scikit-learn (sklearn) library:

  1. Univariate feature selection
  2. Recursive feature elimination (RFE)
  3. Principal component analysis (PCA)
  4. Feature selection (feature importance)

Univariate Feature Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class that can be used with a different set of statistical tests to select a specific number of features.

The following example uses the chi squared (chi ^ 2) statistic to test non-negative features to select the four best features in the Pima Indians Diabetes dataset:

#Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)

#Import the required packages

#Import pandas to read csv import pandas

#Import numpy for array related operations import numpy

#Import sklearn's feature selection algorithm

from sklearn.feature_selection import SelectKBest

#Import chi2 for performing chi square test from sklearn.feature_selection import chi2

#URL for loading the dataset

url ="https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians diabetes/pima-indians-diabetes.data"

#Define the attribute names

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

#Create pandas data frame by loading the data from URL

dataframe = pandas.read_csv(url, names=names)

#Create array from data values

array = dataframe.values

#Split the data into input and target

X = array[:,0:8]

Y = array[:,8]

#We will select the features using chi square

test = SelectKBest(score_func=chi2, k=4)

#Fit the function for ranking the features by score

fit = test.fit(X, Y)

#Summarize scores numpy.set_printoptions(precision=3) print(fit.scores_)

#Apply the transformation on to dataset

features = fit.transform(X)

#Summarize selected features print(features[0:5,:])
Copy after login

Each The score of the attribute and the four selected attributes (the ones with the highest scores): plas, test, mass and age.

Score per feature:

[111.52   1411.887 17.605 53.108  2175.565   127.669 5.393

181.304]
Copy after login

Features:

[[148. 0. 33.6 50. ]

[85. 0. 26.6 31. ]

[183. 0. 23.3 32. ]

[89. 94. 28.1 21. ]

[137. 168. 43.1 33. ]]
Copy after login

Recursive Feature Elimination (RFE)

RFE via recursive deletion properties and build models on the remaining properties to work on. It uses model accuracy to identify which attributes (and attribute combinations) contribute most to predicting the target attribute. The following example uses RFE and logistic regression algorithms to select the top three features. The choice of algorithm does not matter as long as it is skillful and consistent:

#Import the required packages

#Import pandas to read csv import pandas

#Import numpy for array related operations import numpy

#Import sklearn's feature selection algorithm from sklearn.feature_selection import RFE

#Import LogisticRegression for performing chi square test from sklearn.linear_model import LogisticRegression

#URL for loading the dataset

url =

"https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-dia betes/pima-indians-diabetes.data"

#Define the attribute names

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

#Create pandas data frame by loading the data from URL

dataframe = pandas.read_csv(url, names=names)

#Create array from data values

array = dataframe.values

#Split the data into input and target

X = array[:,0:8]

Y = array[:,8]

#Feature extraction

model = LogisticRegression() rfe = RFE(model, 3)

fit = rfe.fit(X, Y)

print("Num Features: %d"% fit.n_features_) print("Selected Features: %s"% fit.support_) print("Feature Ranking: %s"% fit.ranking_)
Copy after login

After execution, we will obtain:

Num Features: 3

Selected Features: [ True False False False False   True  True False]

Feature Ranking: [1 2 3 5 6 1 1 4]
Copy after login

You can see that RFE selects the first three features as preg , mass and pedi. These are marked True in the support_array and Option 1 in the ranking_array.

Principal Component Analysis (PCA)

PCA uses linear algebra to transform a data set into a compressed form. Typically, it is considered a data reduction technique. One property of PCA is that you can choose to transform the number of dimensions or principal components in the result.

In the following example, we use PCA and select three principal components:

#Import the required packages

#Import pandas to read csv import pandas

#Import numpy for array related operations import numpy

#Import sklearn's PCA algorithm

from sklearn.decomposition import PCA

#URL for loading the dataset

url =

"https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians diabetes/pima-indians-diabetes.data"

#Define the attribute names

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

dataframe = pandas.read_csv(url, names=names)

#Create array from data values

array = dataframe.values

#Split the data into input and target

X = array[:,0:8]

Y = array[:,8]

#Feature extraction

pca = PCA(n_components=3) fit = pca.fit(X)

#Summarize components

print("Explained Variance: %s") % fit.explained_variance_ratio_

print(fit.components_)
Copy after login

You can see that the transformed data set (three principal components) has little similarity to the source data At:

Explained Variance: [ 0.88854663   0.06159078  0.02579012]

[[ -2.02176587e-03    9.78115765e-02 1.60930503e-02    6.07566861e-02

9.93110844e-01          1.40108085e-02 5.37167919e-04   -3.56474430e-03]

[ -2.26488861e-02   -9.72210040e-01              -1.41909330e-01  5.78614699e-02 9.46266913e-02   -4.69729766e-02               -8.16804621e-04  -1.40168181e-01

[ -2.24649003e-02 1.43428710e-01                 -9.22467192e-01  -3.07013055e-01 2.09773019e-02   -1.32444542e-01                -6.39983017e-04  -1.25454310e-01]]
Copy after login

Feature importance

Feature importance is a technique used to select features using a trained supervised classifier. When we train a classifier (such as a decision tree), we evaluate each attribute to create a split; we can use this measure as a feature selector. Let us know it in detail.

Random forests are one of the most popular machine learning methods because of their relatively good accuracy, robustness, and ease of use. They also provide two straightforward feature selection methods - Average Reduction in Impurity and Average Reduction in Precision.

Random forest consists of many decision trees. Each node in the decision tree is a condition on a single feature designed to split the data set into two so that similar response values ​​end up in the same set. The metric that selects (locally) optimal conditions is called Impurity. For classification it is usually the Gini coefficient

impurity or information gain/entropy, and for regression trees it is the variance. Therefore, when training a tree, it can be calculated by how much each feature reduces weighted impurity in the tree. For forests, the impurity reduction for each feature can be averaged and the features ranked according to this measure.

Let us see how to use the Random Forest classifier for feature selection and evaluate the accuracy of the classifier before and after feature selection. We will use the Otto dataset.

This dataset describes 93 fuzzy details for over 61,000 products grouped into 10 product categories (e.g., fashion, electronics, etc.) . The input attribute is some kind of count of distinct events.

The goal is to get predictions for new products as an array of probabilities for each of the 10 classes and evaluate the model using a multi-class log loss (also known as cross-entropy).

We will start by importing all the libraries:

#Import the supporting libraries

#Import pandas to load the dataset from csv file

from pandas import read_csv

#Import numpy for array based operations and calculations

import numpy as np

#Import Random Forest classifier class from sklearn

from sklearn.ensemble import RandomForestClassifier

#Import feature selector class select model of sklearn

        from sklearn.feature_selection

        import SelectFromModel

         np.random.seed(1)
Copy after login

Let us define a way to split the dataset into training and test data; we will train our dataset in the training part, test Part will be used to evaluate the trained model:

#Function to create Train and Test set from the original dataset def getTrainTestData(dataset,split):

np.random.seed(0) training = [] testing = []

np.random.shuffle(dataset) shape = np.shape(dataset)

trainlength = np.uint16(np.floor(split*shape[0]))

for i in range(trainlength): training.append(dataset[i])

for i in range(trainlength,shape[0]): testing.append(dataset[i])

training = np.array(training) testing = np.array(testing)

return training,testing
Copy after login

We also need to add a function to evaluate the accuracy of the model; it will take the predicted and actual output as input to calculate the percent accuracy:

#Function to evaluate model performance

def getAccuracy(pre,ytest): count = 0

for i in range(len(ytest)):

if ytest[i]==pre[i]: count+=1

acc = float(count)/len(ytest)

return acc
Copy after login

This is the time to load the dataset. We will load the train.csv file; this file contains over 61,000 training instances. We will use 50000 instances in our example, of which we will use 35,000 instances to train the classifier and 15,000 instances to test the performance of the classifier:

#Load dataset as pandas data frame

data = read_csv('train.csv')

#Extract attribute names from the data frame

feat = data.keys()

feat_labels = feat.get_values()

#Extract data values from the data frame

dataset = data.values

#Shuffle the dataset

np.random.shuffle(dataset)

#We will select 50000 instances to train the classifier

inst = 50000

#Extract 50000 instances from the dataset

dataset = dataset[0:inst,:]

#Create Training and Testing data for performance evaluation

train,test = getTrainTestData(dataset, 0.7)

#Split data into input and output variable with selected features

Xtrain = train[:,0:94] ytrain = train[:,94] shape = np.shape(Xtrain)

print("Shape of the dataset ",shape)

#Print the size of Data in MBs

print("Size of Data set before feature selection: %.2f MB"%(Xtrain.nbytes/1e6))
Copy after login

We pay attention to the data size here; because our dataset contains about 35000 training instances with 94 attributes; the size of our dataset is very large. Let's take a look:

Shape of the dataset (35000, 94)

Size of Data set before feature selection: 26.32 MB
Copy after login

As you can see, our dataset has 35000 rows and 94 columns, which is over 26 MB of data.

In the next code block, we will configure the random forest classifier; we will use 250 trees, the maximum depth is 30, and the number of random features is 7. The other hyperparameters will be the default values ​​​​of sklearn:

#Lets select the test data for model evaluation purpose

Xtest = test[:,0:94] ytest = test[:,94]

#Create a random forest classifier with the following Parameters

trees            = 250

max_feat     = 7

max_depth = 30

min_sample = 2

clf = RandomForestClassifier(n_estimators=trees,

max_features=max_feat,

max_depth=max_depth,

min_samples_split= min_sample, random_state=0,

n_jobs=-1)

#Train the classifier and calculate the training time

import time

start = time.time() clf.fit(Xtrain, ytrain) end = time.time()

#Lets Note down the model training time

print("Execution time for building the Tree is: %f"%(float(end)- float(start)))

pre = clf.predict(Xtest)

Let's see how much time is required to train the model on the training dataset:

Execution time for building the Tree is: 2.913641

#Evaluate the model performance for the test data

acc = getAccuracy(pre, ytest)

print("Accuracy of model before feature selection is %.2f"%(100*acc))
Copy after login

我们模型的准确性是:

特征选择前的模型精度为98.82

正如您所看到的,我们正在获得非常好的准确性,因为我们将近99%的测试数据分类到正确的类别中。这意味着我们正在对15,000个正确类中的14,823个实例进行分类。

那么,现在我的问题是:我们是否应该进一步改进?好吧,为什么不呢?如果可以的话,我们肯定会寻求更多的改进; 在这里,我们将使用功能重要性来选择功能。如您所知,在树木构建过程中,我们使用杂质测量来选择节点。选择具有最低杂质的属性值作为树中的节点。我们可以使用类似的标准进行特征选择。我们可以更加重视杂质较少的功能,这可以使用sklearn库的feature_importances_函数来完成。让我们找出每个功能的重要性:

#Once我们培养的模型中,我们的排名将所有功能的功能拉链(feat_labels,clf.feature_importances_):

print(feature)

('id', 0.33346650420175183)

('feat_1', 0.0036186958628801214)

('feat_2', 0.0037243050888530957)

('feat_3', 0.011579217472062748)

('feat_4', 0.010297382675187445)

('feat_5', 0.0010359139416194116)

('feat_6', 0.00038171336038056165)

('feat_7', 0.0024867672489765021)

('feat_8', 0.0096689721610546085)

('feat_9', 0.007906150362995093)

('feat_10', 0.0022342480802130366)
Copy after login

正如您在此处所看到的,每个要素都基于其对最终预测的贡献而具有不同的重要性。

我们将使用这些重要性分数来排列我们的功能; 在下面的部分中,我们将选择功能重要性大于0.01的模型训练功能:

#Select features which have higher contribution in the final prediction

sfm = SelectFromModel(clf, threshold=0.01) sfm.fit(Xtrain,ytrain)
Copy after login

在这里,我们将根据所选的特征属性转换输入数据集。在下一个代码块中,我们将转换数据集。然后,我们将检查新数据集的大小和形状:

#Transform input dataset

Xtrain_1 = sfm.transform(Xtrain) Xtest_1      = sfm.transform(Xtest)

#Let's see the size and shape of new dataset print("Size of Data set before feature selection: %.2f MB"%(Xtrain_1.nbytes/1e6))

shape = np.shape(Xtrain_1)

print("Shape of the dataset ",shape)

Size of Data set before feature selection: 5.60 MB Shape of the dataset (35000, 20)
Copy after login

你看到数据集的形状了吗?在功能选择过程之后,我们只剩下20个功能,这将数据库的大小从26 MB减少到5.60 MB。这比原始数据集减少了约80%

在下一个代码块中,我们将训练一个新的随机森林分类器,它具有与之前相同的超参数,并在测试数据集上进行测试。让我们看看修改训练集后得到的准确度:

#Model training time

start = time.time() clf.fit(Xtrain_1, ytrain) end = time.time()

print("Execution time for building the Tree is: %f"%(float(end)- float(start)))

#Let's evaluate the model on test data

pre = clf.predict(Xtest_1) count = 0

acc2 = getAccuracy(pre, ytest)

print("Accuracy after feature selection %.2f"%(100*acc2))

Execution time for building the Tree is: 1.711518 Accuracy after feature selection 99.97
Copy after login

你能看到!! 我们使用修改后的数据集获得了99.97%的准确率,这意味着我们在正确的类中对14,996个实例进行了分类,而之前我们只正确地对14,823个实例进行了分类。

这是我们在功能选择过程中取得的巨大进步; 我们可以总结下表中的所有结果:

评估标准 在选择特征之前 选择功能后
功能数量 94 20
数据集的大小 26.32 MB 5.60 MB
训练时间 2.91秒 1.71秒
准确性 98.82% 99.97%

上表显示了特征选择的实际优点。您可以看到我们显着减少了要素数量,从而降低了数据集的模型复杂性和维度。尺寸减小后我们的训练时间缩短,最后,我们克服了过度拟合问题,获得了比以前更高的精度。

The above is the detailed content of An introduction to four methods to implement machine learning functions in Python. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

PHP and Python: Different Paradigms Explained PHP and Python: Different Paradigms Explained Apr 18, 2025 am 12:26 AM

PHP is mainly procedural programming, but also supports object-oriented programming (OOP); Python supports a variety of paradigms, including OOP, functional and procedural programming. PHP is suitable for web development, and Python is suitable for a variety of applications such as data analysis and machine learning.

Choosing Between PHP and Python: A Guide Choosing Between PHP and Python: A Guide Apr 18, 2025 am 12:24 AM

PHP is suitable for web development and rapid prototyping, and Python is suitable for data science and machine learning. 1.PHP is used for dynamic web development, with simple syntax and suitable for rapid development. 2. Python has concise syntax, is suitable for multiple fields, and has a strong library ecosystem.

Why Use PHP? Advantages and Benefits Explained Why Use PHP? Advantages and Benefits Explained Apr 16, 2025 am 12:16 AM

The core benefits of PHP include ease of learning, strong web development support, rich libraries and frameworks, high performance and scalability, cross-platform compatibility, and cost-effectiveness. 1) Easy to learn and use, suitable for beginners; 2) Good integration with web servers and supports multiple databases; 3) Have powerful frameworks such as Laravel; 4) High performance can be achieved through optimization; 5) Support multiple operating systems; 6) Open source to reduce development costs.

PHP and Python: A Deep Dive into Their History PHP and Python: A Deep Dive into Their History Apr 18, 2025 am 12:25 AM

PHP originated in 1994 and was developed by RasmusLerdorf. It was originally used to track website visitors and gradually evolved into a server-side scripting language and was widely used in web development. Python was developed by Guidovan Rossum in the late 1980s and was first released in 1991. It emphasizes code readability and simplicity, and is suitable for scientific computing, data analysis and other fields.

Python vs. JavaScript: The Learning Curve and Ease of Use Python vs. JavaScript: The Learning Curve and Ease of Use Apr 16, 2025 am 12:12 AM

Python is more suitable for beginners, with a smooth learning curve and concise syntax; JavaScript is suitable for front-end development, with a steep learning curve and flexible syntax. 1. Python syntax is intuitive and suitable for data science and back-end development. 2. JavaScript is flexible and widely used in front-end and server-side programming.

PHP: An Introduction to the Server-Side Scripting Language PHP: An Introduction to the Server-Side Scripting Language Apr 16, 2025 am 12:18 AM

PHP is a server-side scripting language used for dynamic web development and server-side applications. 1.PHP is an interpreted language that does not require compilation and is suitable for rapid development. 2. PHP code is embedded in HTML, making it easy to develop web pages. 3. PHP processes server-side logic, generates HTML output, and supports user interaction and data processing. 4. PHP can interact with the database, process form submission, and execute server-side tasks.

How to run sublime code python How to run sublime code python Apr 16, 2025 am 08:48 AM

To run Python code in Sublime Text, you need to install the Python plug-in first, then create a .py file and write the code, and finally press Ctrl B to run the code, and the output will be displayed in the console.

The Continued Use of PHP: Reasons for Its Endurance The Continued Use of PHP: Reasons for Its Endurance Apr 19, 2025 am 12:23 AM

What’s still popular is the ease of use, flexibility and a strong ecosystem. 1) Ease of use and simple syntax make it the first choice for beginners. 2) Closely integrated with web development, excellent interaction with HTTP requests and database. 3) The huge ecosystem provides a wealth of tools and libraries. 4) Active community and open source nature adapts them to new needs and technology trends.

See all articles