Writing this from an internet café while sick... a little encouragement would be appreciated...
http://scikit-learn.org/stable/modules/pipeline.html
1. What are Pipeline and FeatureUnion for?
Pipeline was introduced earlier: it chains transformers together with an estimator.
FeatureUnion does what the name suggests: it concatenates the output vectors of several transformers into one big vector.
2. The difference between the two:
The former is serial feature processing: each transformer works on the feature output of the transformer before it.
The latter is parallel feature processing: all the transformers' outputs are concatenated into one big feature vector.
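A minimal sketch of the contrast (my own toy example on iris, not from the original post); the shapes show one chained result versus concatenated columns:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

# serial: PCA sees StandardScaler's output, not the raw X
serial = Pipeline([('scale', StandardScaler()), ('pca', PCA(n_components=2))])
print(serial.fit_transform(X, y).shape)       # (150, 2): one chained result

# parallel: PCA and SelectKBest each see the raw X; outputs are concatenated
parallel = FeatureUnion([('pca', PCA(n_components=2)),
                         ('kbest', SelectKBest(k=1))])
print(parallel.fit(X, y).transform(X).shape)  # (150, 3): 2 + 1 columns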
3. Pipeline: chaining estimators
Pipeline can be used to chain multiple estimators into one. Data-processing workflows are usually fairly fixed, e.g. feature selection, normalization, classification, so Pipeline serves two main purposes:
Convenience: a single fit/predict call runs every estimator in the chain (see the sketch below).
Joint parameter selection: one grid search can tune the parameters of all the estimators at once.
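To make the convenience point concrete, a small sketch (my own, on iris): without a pipeline you thread the data through each step by hand; with one, fit/predict do it for you:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# by hand: fit/transform each step and pass the result along yourself
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
svc = SVC().fit(X_reduced, y)
preds_manual = svc.predict(pca.transform(X))

# with a pipeline: one fit, one predict
pipe = Pipeline([('reduce_dim', PCA(n_components=2)), ('svm', SVC())])
preds_pipe = pipe.fit(X, y).predict(X)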
All estimators in a pipeline, except the last one, must be transformers (i.e. have a transform method); the last estimator can be of any type (transformer, classifier, etc.).
Usage: the estimators are chained via a list of (key, value) pairs, where key is an arbitrary name you pick for the step and value is an estimator object.
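For example, the PCA + SVC pipeline from the scikit-learn docs (the same clf whose steps / named_steps output appears below):

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('svm', SVC())]
>>> clf = Pipeline(estimators)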
The estimators of each step are stored in the steps attribute and can be fetched by index:

>>> clf.steps[0]
('reduce_dim', PCA(copy=True, n_components=None, whiten=False))

or fetched by name (they are exposed as a dict in named_steps):

>>> clf.named_steps['reduce_dim']
PCA(copy=True, n_components=None, whiten=False)

Want to change an estimator's parameter value? Use the <estimator>__<parameter> syntax, for example:
>>> clf.set_params(svm__C=10)
Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None,
    whiten=False)), ('svm', SVC(C=10, cache_size=200, class_weight=None,
    coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False))])
The ultimate payoff, grid searches:
>>> from sklearn.grid_search import GridSearchCV
>>> params = dict(reduce_dim__n_components=[2, 5, 10],
...               svm__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(clf, param_grid=params)
And now the classic text-classification example:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV

# define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__n_iter': (10, 50, 80),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
Notes: the important bits, quoted verbatim from the docs:
Calling fit on the pipeline is the same as calling fit on each estimator in turn, transform the input and pass it on to the next step.
The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
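A quick sketch of that last point (my own, on iris): with a classifier as the final step, predict and score on the pipeline are simply delegated to it:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = load_iris()
clf = Pipeline([('reduce_dim', PCA(n_components=2)), ('svm', SVC())])
clf.fit(iris.data, iris.target)           # each step fits and transforms in turn
print(clf.predict(iris.data[:3]))         # predict is delegated to the final SVC
print(clf.score(iris.data, iris.target))  # and so is score, because SVC has one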
4. FeatureUnion: composite feature spaces
FeatureUnion's description, again quoted verbatim from the docs:
FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.
Like Pipeline, FeatureUnion exists for convenience and joint parameter estimation, and the two can be combined into more complex models.
(FeatureUnion does not care whether two transformers produce the same features; it simply concatenates everything, so checking for duplicates is still your job...)
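A tiny sketch of that caveat (mine, not from the docs): register the same PCA twice and the columns are simply duplicated:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion

iris = load_iris()
dup = FeatureUnion([('pca_a', PCA(n_components=2)),
                    ('pca_b', PCA(n_components=2))])   # two identical transformers
print(dup.fit(iris.data).transform(iris.data).shape)   # (150, 4): nothing de-duplicated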
Usage: again via a list of (key, value) pairs, where key is an arbitrary name you pick for the step and value is an estimator object, for example:
>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined
FeatureUnion(n_jobs=1, transformer_list=[('linear_pca', PCA(copy=True,
    n_components=None, whiten=False)), ('kernel_pca', KernelPCA(alpha=1.0,
    coef0=1, degree=3, eigen_solver='auto', fit_inverse_transform=False,
    gamma=None, kernel='linear', kernel_params=None, max_iter=None,
    n_components=None, remove_zero_eig=False, tol=0))],
    transformer_weights=None)
Finally, a full example:
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py
Thanks to the author of that example:
Author: Andreas Mueller <amueller@ais.uni-bonn.de>
# Author: Andreas Mueller <amueller@ais.uni-bonn.de>
#
# License: BSD 3 clause

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris()

X, y = iris.data, iris.target

# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)

# Maybe some original features were good, too?
selection = SelectKBest(k=1)

# Build estimator from PCA and univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

svm = SVC(kernel="linear")

# Do grid search over k, n_components and C:
pipeline = Pipeline([("features", combined_features), ("svm", svm)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
Done. Looks like feature extraction is going to be a lot less work from now on...