Python data analysis | Python data preprocessing (based on the scikit-learn module)

Published: 2025/1/21 python 31 豆豆

  • dataset transformations
  • combining estimators
  • feature extraction
  • preprocessing data

1 dataset transformations

scikit-learn provides a library of transformers, which may clean (see preprocessing data), reduce (see unsupervised dimensionality reduction), expand (see kernel approximation) or generate (see feature extraction) feature representations.

In short, scikit-learn provides transformer modules for data cleaning, dimensionality reduction, feature expansion and feature extraction.

Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method, which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.
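As a minimal sketch of the fit / transform split described above, the following uses StandardScaler (covered in section 1.3.1) on small made-up arrays. The variable names and the data are illustrative assumptions, not part of the original article.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data and one unseen sample (illustrative values only).
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                      # learns per-feature mean and standard deviation
X_new_scaled = scaler.transform(X_new)   # reuses the statistics learned from X_train

# fit_transform fits and transforms the training set in a single call.
X_train_scaled = StandardScaler().fit_transform(X_train)

print(scaler.mean_)      # per-feature means learned during fit
print(X_new_scaled)      # unseen sample expressed in training-set units
```

The point of the split is that unseen data is always transformed with parameters learned from the training set only.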



1.1 combining estimators


    1.1.1 Pipeline: chaining estimators

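    Pipeline chains a series of estimators into one. It is very convenient for a fixed sequence of operations, for example combining feature selection, normalization and classification.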
    • usage
      Code:

          from sklearn.pipeline import Pipeline, make_pipeline
          from sklearn.svm import SVC
          from sklearn.decomposition import PCA

          # Define the steps as a list of (key, value) pairs, where the key
          # is the name you give the step and the value is an estimator object.
          estimators = [('reduce_dim', PCA()), ('svm', SVC())]

          # Combine the estimators.
          clf1 = Pipeline(estimators)
          # make_pipeline() does the same thing and names the steps itself.
          clf2 = make_pipeline(PCA(), SVC())
          print(clf1, '\n', clf2)

      Output (from the author's run; newer scikit-learn releases print a shorter repr):

          Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None, whiten=False)),
            ('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
            decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
            max_iter=-1, probability=False, random_state=None, shrinking=True,
            tol=0.001, verbose=False))])
           Pipeline(steps=[('pca', PCA(copy=True, n_components=None, whiten=False)),
            ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
            decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
            max_iter=-1, probability=False, random_state=None, shrinking=True,
            tol=0.001, verbose=False))])

      Parameters of a step can be set with set_params(), using names of the form <step name>__<parameter name>:

          clf1.set_params(svm__C=10)

      This naming is important for grid search:

          from sklearn.model_selection import GridSearchCV

          params = dict(reduce_dim__n_components=[2, 5, 10],
                        svm__C=[0.1, 10, 100])
          grid_search = GridSearchCV(clf1, param_grid=params)

      In this example the estimator built by Pipeline is treated like any ordinary estimator, with its parameters again named <step name>__<parameter name>.
    • note
      The steps of a pipeline are stored as a list in its steps attribute:

          >>> clf1.steps[0]
          ('reduce_dim', PCA(copy=True, n_components=None, whiten=False))
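To see the <step name>__<parameter name> convention at work end to end, here is a minimal runnable sketch. It assumes a recent scikit-learn (where GridSearchCV lives in sklearn.model_selection) and uses the bundled iris dataset purely for illustration; the dataset, the grid values and cv=5 are assumptions, not the author's setup.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Illustrative data: iris has 150 samples and 4 features.
X, y = load_iris(return_X_y=True)

pipe = Pipeline([("reduce_dim", PCA()), ("svm", SVC())])

# Set a parameter of one step directly: <step name>__<parameter name>.
pipe.set_params(svm__C=10)

# The same naming scheme addresses step parameters in a grid search.
param_grid = {
    "reduce_dim__n_components": [2, 3],
    "svm__C": [0.1, 10, 100],
}
search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # best combination found by cross-validation
```

Because the whole pipeline is refit inside each cross-validation fold, the PCA step never sees the held-out fold during fitting.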

    1.1.2 FeatureUnion: composite feature spaces


    FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.

    • usage

      Code:

          from sklearn.pipeline import FeatureUnion, make_union
          from sklearn.decomposition import PCA, KernelPCA

          # Define the transformers as a list of (key, value) pairs, where the
          # key is the name you give the step and the value is a transformer object.
          estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]

          # Combine the transformers.
          clf1 = FeatureUnion(estimators)
          clf2 = make_union(PCA(), KernelPCA())
          print(clf1, '\n', clf2)
          # List the attributes and methods of the union (output not shown).
          print(dir(clf1))

      Output (from the author's run; newer scikit-learn releases print a shorter repr):

          FeatureUnion(n_jobs=1,
            transformer_list=[('linear_pca', PCA(copy=True, n_components=None, whiten=False)),
            ('kernel_pca', KernelPCA(alpha=1.0, coef0=1, degree=3, eigen_solver='auto',
            fit_inverse_transform=False, gamma=None, kernel='linear', kernel_params=None,
            max_iter=None, n_components=None, remove_zero_eig=False, tol=0))],
            transformer_weights=None)
           FeatureUnion(n_jobs=1,
            transformer_list=[('pca', PCA(copy=True, n_components=None, whiten=False)),
            ('kernelpca', KernelPCA(alpha=1.0, coef0=1, degree=3, eigen_solver='auto',
            fit_inverse_transform=False, gamma=None, kernel='linear', kernel_params=None,
            max_iter=None, n_components=None, remove_zero_eig=False, tol=0))],
            transformer_weights=None)


    • note

      (A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller's responsibility.)

      For an example, see the Python source code: feature_stacker.py
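The end-to-end concatenation is easiest to see from the output shape. The sketch below is an illustration under assumed settings (iris data, 2 linear PCA components, 3 RBF kernel PCA components); none of these choices come from the original article.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

# Illustrative data: 150 samples, 4 features.
X, _ = load_iris(return_X_y=True)

union = FeatureUnion([
    ("linear_pca", PCA(n_components=2)),
    ("kernel_pca", KernelPCA(n_components=3, kernel="rbf")),
])

# Each transformer is fit independently on X; their outputs are then
# stacked side by side, so the column counts add up (2 + 3 = 5).
X_combined = union.fit_transform(X)
print(X_combined.shape)
```

The first two columns of X_combined come from the linear PCA and the remaining three from the kernel PCA, in the order the transformers were listed.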


1.2 feature extraction

The sklearn.feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and image.

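Feature extraction is different from feature selection: the former turns non-numeric data (such as text or images) into numeric features, while the latter is a machine learning technique applied to those features to keep the most useful ones. Note that PCA is dimensionality reduction rather than feature selection: it builds new features instead of choosing among existing ones.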


    1.2.1 loading features from dicts

    The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators.


    Code:

        from sklearn.feature_extraction import DictVectorizer

        measurements = [
            {'city': 'dubai', 'temperature': 33.},
            {'city': 'london', 'temperature': 12.},
            {'city': 'san fransisco', 'temperature': 18.},
        ]

        vec = DictVectorizer()
        # fit_transform returns a sparse matrix; toarray() makes it dense.
        x = vec.fit_transform(measurements).toarray()
        print(x)
        # Older releases spell this vec.get_feature_names().
        print(vec.get_feature_names_out().tolist())

    Output:

        [[  1.   0.   0.  33.]
         [  0.   1.   0.  12.]
         [  0.   0.   1.  18.]]
        ['city=dubai', 'city=london', 'city=san fransisco', 'temperature']
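One behaviour worth knowing when the fitted vectorizer meets new records: categories that were not seen during fit are silently ignored at transform time. The sketch below reuses the measurements from the example above; the 'paris' record is a made-up addition for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {"city": "dubai", "temperature": 33.0},
    {"city": "london", "temperature": 12.0},
    {"city": "san fransisco", "temperature": 18.0},
]

vec = DictVectorizer()
vec.fit(measurements)

# 'paris' was not present during fit, so no 'city=paris' column exists.
# The three city columns stay at zero and only the temperature is kept.
unseen = vec.transform([{"city": "paris", "temperature": 20.0}]).toarray()
print(unseen)
```

String values are one-hot encoded as "key=value" columns, while numeric values are passed through unchanged in a column named after the key.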

    1.2.2 feature hashing


    1.2.3 text feature extraction


    1.2.4 image feature extraction



1.3 preprocessing data

The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.



    1.3.1 standardization, or mean removal and variance scaling

    Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
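Concretely, standardizing a feature means subtracting its mean and dividing by its standard deviation. The following sketch does this by hand with NumPy and compares the result with preprocessing.scale; the variable names are illustrative, and the data is the same small matrix used in the usage example of this section.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Per-feature statistics (axis=0 runs down each column).
mean = X.mean(axis=0)
std = X.std(axis=0)          # population standard deviation (ddof=0)

# Standardize by hand: zero mean and unit variance for every column.
X_manual = (X - mean) / std

# scale() applies the same transformation.
X_scaled = preprocessing.scale(X)

print(np.allclose(X_manual, X_scaled))
```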


    • usage

      Code:

          from sklearn import preprocessing
          import numpy as np

          x = np.array([[1., -1., 2.],
                        [2., 0., 0.],
                        [0., 1., -1.]])
          y = x
          # scale() standardizes each feature independently (axis=0, the
          # default); with axis=1 it would standardize each sample instead.
          y_scaled = preprocessing.scale(y)
          # axis=0 gives the mean of each feature, axis=1 the mean of each sample.
          y_mean = y_scaled.mean(axis=0)
          y_std = y_scaled.std(axis=0)
          print(y_scaled)
          # The StandardScaler class does the same job.
          scaler = preprocessing.StandardScaler().fit(y)
          print(scaler.transform(y))

      Output:

          [[ 0.         -1.22474487  1.33630621]
           [ 1.22474487  0.         -0.26726124]
           [-1.22474487  1.22474487 -1.06904497]]
          [[ 0.         -1.22474487  1.33630621]
           [ 1.22474487  0.         -0.26726124]
           [-1.22474487  1.22474487 -1.06904497]]
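    • note
      1. The function scale.
      2. The class StandardScaler.
      3. StandardScaler is a transformer, so it can be used as a step in a Pipeline.
      MinMaxScaler (min-max scaling to [0, 1]) and MaxAbsScaler (scaling to [-1, 1]) are two other scaling classes; they are used in the same way as StandardScaler.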

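      When the data contains many outliers, scaling with the mean and variance tends to work poorly; the median and the interquartile range often give better results, and RobustScaler scales with these statistics.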
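To compare the scalers mentioned in this note, the sketch below applies MinMaxScaler, MaxAbsScaler and RobustScaler (which centers on the median and divides by the interquartile range) to one made-up feature containing an outlier. The data and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler

# One feature with an obvious outlier (made-up values).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling maps the observed range onto [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Max-abs scaling divides by the largest absolute value, giving [-1, 1].
X_maxabs = MaxAbsScaler().fit_transform(X)

# Robust scaling uses (x - median) / IQR, so the outlier does not
# compress the ordinary values toward zero.
X_robust = RobustScaler().fit_transform(X)

print(X_minmax.ravel())
print(X_maxabs.ravel())
print(X_robust.ravel())
```

With min-max and max-abs scaling the four ordinary values end up squeezed close to zero, whereas the robust scaler keeps them spread around zero and leaves the outlier far away.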



    1.3.2 imputation of missing values

    • usage
      Code:

          import scipy.sparse as sp
          # Imputer is the legacy API used here; it was removed in
          # scikit-learn 0.22 in favour of sklearn.impute.SimpleImputer.
          from sklearn.preprocessing import Imputer

          x = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
          # Treat 0 as the missing marker and replace it with the column mean.
          imp = Imputer(missing_values=0, strategy='mean', axis=0)
          imp.fit(x)
          x_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
          print(x_test)
          print(imp.transform(x_test))

      Output:

          (1, 0)    6
          (2, 0)    7
          (0, 1)    2
          (2, 1)    6
          [[ 4.          2.        ]
           [ 6.          3.66666675]
           [ 7.          6.        ]]
    • note
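For readers on a current scikit-learn release, a comparable mean imputation can be sketched with sklearn.impute.SimpleImputer. This is an adaptation rather than the author's code: it assumes dense arrays with np.nan as the missing marker (instead of zeros in a sparse matrix), with values chosen so the learned column means match the example above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same numbers as the example above, with np.nan marking missing entries.
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
X_test = np.array([[np.nan, 2.0], [6.0, np.nan], [7.0, 6.0]])

# Learn the mean of each column from the training data, ignoring NaNs.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(X_train)

# Missing entries in the test data are filled with the training means.
X_filled = imp.transform(X_test)

print(imp.statistics_)   # column means learned during fit
print(X_filled)
```

Other values of strategy, such as "median" and "most_frequent", follow the same fit / transform pattern.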

    1.3.3 generating polynomial features

    • usage

      Code:

          import numpy as np
          from sklearn.preprocessing import PolynomialFeatures

          x = np.arange(6).reshape(3, 2)
          print(x)
          # Degree-2 polynomial features of the two input columns.
          poly = PolynomialFeatures(2)
          print(poly.fit_transform(x))

      Output:

          [[0 1]
           [2 3]
           [4 5]]
          [[  1.   0.   1.   0.   0.   1.]
           [  1.   2.   3.   4.   6.   9.]
           [  1.   4.   5.  16.  20.  25.]]
    • note
      Polynomial features are used in polynomial regression and in polynomial kernel methods.


    1.3.4 custom transformers


    • usage:
      Code:

          import numpy as np
          from sklearn.preprocessing import FunctionTransformer

          # Wrap an ordinary function (here np.log1p) as a transformer.
          transformer = FunctionTransformer(np.log1p)
          x = np.array([[0, 1], [2, 3]])
          print(transformer.transform(x))

      Output:

          [[ 0.          0.69314718]
           [ 1.09861229  1.38629436]]
    • note

      For a full code example that demonstrates using a FunctionTransformer to do custom feature selection, see "Using FunctionTransformer to select columns".
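A FunctionTransformer can also carry an inverse function, which is handy when a transformation has to be undone later. The sketch below pairs np.log1p with np.expm1 and checks the round trip; the pairing and the data are illustrative assumptions rather than something taken from the original article.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[0.0, 1.0], [2.0, 3.0]])

# func is applied by transform, inverse_func by inverse_transform.
transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X_log = transformer.fit_transform(X)           # log(1 + x), element-wise
X_back = transformer.inverse_transform(X_log)  # exp(x) - 1 undoes it

print(X_log)
print(np.allclose(X_back, X))
```

Because it exposes fit and transform like any other transformer, the wrapped function can be used as a step in a Pipeline.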





