scikit-learn: 4.1. Pipeline and FeatureUnion: combining estimators (chaining features with predictors; combining features with features)


Writing this while sick at an internet café... a little encouragement would be appreciated.

http://scikit-learn.org/stable/modules/pipeline.html

1. What Pipeline and FeatureUnion are for:

Pipeline was introduced earlier: it chains transformers together with a final estimator.

FeatureUnion, as its name suggests, concatenates the output vectors of several transformers into one larger vector.



2. The difference between the two:

The former is serial feature processing: each transformer operates on the feature output of the previous one;

the latter is parallel feature processing: all transformers run on the same input, and their outputs are concatenated into one large feature vector.
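A minimal sketch of the serial-vs-parallel contrast (StandardScaler and PCA are just stand-in transformers here; the data and component counts are illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(10, 4)

# Serial: PCA runs on the scaler's output; the final shape is (10, 2).
serial = Pipeline([('scale', StandardScaler()), ('pca', PCA(n_components=2))])
print(serial.fit_transform(X).shape)    # (10, 2)

# Parallel: both transformers run on X; their outputs are concatenated
# side by side, giving 4 + 2 = 6 columns.
parallel = FeatureUnion([('scale', StandardScaler()), ('pca', PCA(n_components=2))])
print(parallel.fit_transform(X).shape)  # (10, 6)
```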



3. Pipeline: chaining estimators

Pipeline can be used to chain multiple estimators into one. This is useful because data processing usually follows a fixed sequence of steps, e.g. feature selection, normalization, classification. Pipeline therefore serves two main purposes:

Convenience: a single fit/predict call runs the whole chain of estimators.

Joint parameter selection: one grid search can cover the parameters of all estimators at once.


All estimators in a pipeline except the last must be transformers (i.e. have a transform method); the last estimator can be of any type (transformer, classifier, ...).
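A quick illustration of that rule (a sketch; the exact error message varies by scikit-learn version): putting a classifier in an intermediate step fails, because the pipeline has no transform method to call on it.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Valid: the intermediate step (PCA) is a transformer; the last step may be anything.
Pipeline([('reduce_dim', PCA()), ('svm', SVC())]).fit(X, y)

# Invalid: SVC has no transform method, so it cannot be an intermediate step.
try:
    Pipeline([('svm', SVC()), ('reduce_dim', PCA())]).fit(X, y)
except TypeError as err:
    print('fit failed:', err)
```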


Usage: the estimators are chained via a list of (key, value) pairs, where key is an arbitrary name you give the step and value is an estimator object, for example:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('svm', SVC())]
>>> clf = Pipeline(estimators)
>>> clf
Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None,
    whiten=False)), ('svm', SVC(C=1.0, cache_size=200, class_weight=None,
    coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False))])
The estimator of each step is stored in the steps attribute; an individual estimator can be retrieved by index:

>>> clf.steps[0]
('reduce_dim', PCA(copy=True, n_components=None, whiten=False))

It can also be retrieved by name (as a dict in named_steps):

>>> clf.named_steps['reduce_dim']
PCA(copy=True, n_components=None, whiten=False)
Want to change an estimator's parameter value? Use the <estimator>__<parameter> syntax, for example:

>>> clf.set_params(svm__C=10)
Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None,
    whiten=False)), ('svm', SVC(C=10, cache_size=200, class_weight=None,
    coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False))])

The ultimate payoff, grid searches:

>>> from sklearn.grid_search import GridSearchCV
>>> params = dict(reduce_dim__n_components=[2, 5, 10],
...               svm__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(clf, param_grid=params)

Here comes the classic text-classification example:

# define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__n_iter': (10, 50, 80),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

@H_403_42@

Notes (the important points, kept in the original English):

Calling fit on the pipeline is the same as calling fit on each estimator in turn, transform the input and pass it on to the next step.

The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
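Conceptually, fitting a pipeline looks like the sketch below (a simplified illustration, not scikit-learn's actual implementation; `pipeline_fit` is a made-up helper name):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def pipeline_fit(steps, X, y):
    """Fit each intermediate transformer in turn; each one transforms X
    before it is handed to the next step. The last estimator is only
    fit, never asked to transform."""
    for name, transformer in steps[:-1]:
        X = transformer.fit(X, y).transform(X)
    steps[-1][1].fit(X, y)
    return X  # the input as seen by the last estimator

X, y = load_iris(return_X_y=True)
steps = [('reduce_dim', PCA(n_components=2)), ('svm', SVC())]
X_last = pipeline_fit(steps, X, y)
print(X_last.shape)  # (150, 2): the SVC was fit on the PCA-reduced data
```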




4. FeatureUnion: composite feature spaces


FeatureUnion description (the important part, kept in the original English):

FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.


Like Pipeline, FeatureUnion exists for convenience and joint parameter estimation; the two can also be combined into more complex models.


(FeatureUnion does not check whether two transformers produce identical features; it simply concatenates everything, so de-duplication is up to you...)
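To see that lack of de-duplication in action (a toy sketch: two identical pass-through transformers, so every column comes out twice):

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0, 2.0], [3.0, 4.0]])

# FunctionTransformer with no function is the identity transform;
# the union simply stacks both copies side by side.
union = FeatureUnion([('a', FunctionTransformer()),
                      ('b', FunctionTransformer())])
print(union.fit_transform(X))
# [[1. 2. 1. 2.]
#  [3. 4. 3. 4.]]
```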



Usage: a FeatureUnion is built from a list of (key, value) pairs, where key is an arbitrary name you give the step and value is an estimator object, for example:

>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined
FeatureUnion(n_jobs=1, transformer_list=[('linear_pca', PCA(copy=True,
    n_components=None, whiten=False)), ('kernel_pca', KernelPCA(alpha=1.0,
    coef0=1, degree=3, eigen_solver='auto', fit_inverse_transform=False,
    gamma=None, kernel='linear', kernel_params=None, max_iter=None,
    n_components=None, remove_zero_eig=False, tol=0))],
    transformer_weights=None)

Finally, an example:

http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py

Thanks to:

# Author: Andreas Mueller <amueller@ais.uni-bonn.de>
#
# License: BSD 3 clause

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris()

X, y = iris.data, iris.target

# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)

# Maybe some original features were good, too?
selection = SelectKBest(k=1)

# Build estimator from PCA and Univariate selection:

combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

svm = SVC(kernel="linear")

# Do grid search over k, n_components and C:

pipeline = Pipeline([("features", combined_features), ("svm", svm)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)


Done. Looks like feature extraction is going to be a lot less work from now on...
