Writing this at an internet café while sick... a little encouragement would be appreciated.
http://scikit-learn.org/stable/modules/pipeline.html
1、What Pipeline and FeatureUnion are for:
Pipeline was introduced earlier: it chains transformers together with an estimator.
FeatureUnion, as the name suggests, concatenates the output vectors of multiple transformers into one larger vector.
2、The difference between the two:
The former is serial feature processing: each transformer works on the feature output of the previous transformer.
The latter is parallel feature processing: the outputs of all transformers are concatenated into one large feature vector.
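To make the serial/parallel contrast concrete, here is a small sketch of my own (not from the docs) comparing the output shapes on the iris data:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline

iris = load_iris()
X, y = iris.data, iris.target  # (150, 4)

# serial: PCA only ever sees the 3 columns SelectKBest passes on
serial = Pipeline([('select', SelectKBest(k=3)), ('pca', PCA(n_components=2))])
print(serial.fit_transform(X, y).shape)    # (150, 2)

# parallel: both transformers see all 4 original columns; outputs are concatenated
parallel = FeatureUnion([('select', SelectKBest(k=3)), ('pca', PCA(n_components=2))])
print(parallel.fit_transform(X, y).shape)  # (150, 5) = 3 + 2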
3、Pipeline: chaining estimators
Pipeline can be used to chain multiple estimators into one. Since a data-processing workflow is usually fairly fixed, e.g. feature selection, normalization, then classification, a Pipeline serves two main purposes:
Convenience: a single call to fit or predict runs through all the estimators.
Joint parameter selection: a single grid search can tune the parameters of all the estimators at once.
All estimators in a pipeline, except the last, must be transformers (i.e. have a transform method); the last estimator can be of any type (transformer, classifier, etc.).
Usage: a Pipeline is built from a list of (key, value) pairs, where key is an arbitrary name you give each step and value is an estimator object, for example:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('svm', SVC())]
>>> clf = Pipeline(estimators)

The estimators of a pipeline are stored in the steps attribute, so each one can be pulled out by index:

>>> clf.steps[0]
('reduce_dim', PCA(copy=True, n_components=None, whiten=False))

or by name, since they are also exposed as a dict in named_steps:

>>> clf.named_steps['reduce_dim']
PCA(copy=True, n_components=None, whiten=False)

Want to change an estimator's parameter? Use the <estimator>__<parameter> syntax, for example:

>>> clf.set_params(svm__C=10)
The ultimate purpose, grid searches:

>>> from sklearn.grid_search import GridSearchCV
>>> params = dict(reduce_dim__n_components=[2, 5, 10],
...               svm__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(clf, param_grid=params)
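That snippet builds the search but never fits it; as a sketch of my own (not from the docs), you could finish it like this, using the digits dataset because its 64 features accommodate the n_components grid:

from sklearn.datasets import load_digits

digits = load_digits()
# each grid candidate re-runs the whole pipeline (PCA, then SVC)
grid_search.fit(digits.data, digits.target)
print(grid_search.best_params_)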
Now for the classic text-classification example:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV

# define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__n_iter': (10, 50, 80),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
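The example stops right before fitting; as a hedged completion of my own (the upstream example uses the 20 newsgroups corpus, which fetch_20newsgroups downloads on first use), the __main__ block could continue:

    from sklearn.datasets import fetch_20newsgroups

    # restrict to two categories to keep the search fast
    data = fetch_20newsgroups(subset='train',
                              categories=['alt.atheism', 'talk.religion.misc'])
    grid_search.fit(data.data, data.target)
    print(grid_search.best_score_)
    print(grid_search.best_params_)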
Notes (kept in the docs' English because they matter):
Calling fit on the pipeline is the same as calling fit on each estimator in turn: transform the input and pass it on to the next step.
The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
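A quick sanity check of the first note, with a hand-rolled equivalent of what fit and predict do internally (my own illustration on the iris data, not from the docs):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

pipe = Pipeline([('reduce_dim', PCA(n_components=2)), ('svm', SVC())])
pipe.fit(X, y)  # fit_transform each intermediate step, then fit the last one

# manual equivalent of the pipeline
pca = PCA(n_components=2)
Xt = pca.fit_transform(X)
svm = SVC().fit(Xt, y)

# predict on the pipeline == transform, then predict with the last step
print((pipe.predict(X) == svm.predict(pca.transform(X))).all())  # True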
4、FeatureUnion: composite feature spaces
FeatureUnion description (kept in the docs' English):
FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.
Like Pipeline, FeatureUnion exists for convenience and joint parameter selection, and the two can be combined to build more complex models.
(FeatureUnion does not check whether two transformers produce the same features; it simply concatenates them all, so deduplication is up to you... see the small demo below.)
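A tiny demonstration of that caveat, as a toy sketch of my own: a union of two identical transformers just doubles the columns:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion

X = load_iris().data  # (150, 4)

# two identical PCA steps: FeatureUnion happily emits the same 2 columns twice
union = FeatureUnion([('pca_a', PCA(n_components=2)),
                      ('pca_b', PCA(n_components=2))])
print(union.fit_transform(X).shape)  # (150, 4): duplicated features are kept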
Usage: just like a Pipeline, a FeatureUnion is built from a list of (key, value) pairs, where key is an arbitrary name for each step and value is an estimator object, for example:

>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined
FeatureUnion(n_jobs=1,
       transformer_list=[('linear_pca', PCA(copy=True, n_components=None,
       whiten=False)), ('kernel_pca', KernelPCA(alpha=1.0, coef0=1,
       degree=3, eigen_solver='auto', fit_inverse_transform=False,
       gamma=None, kernel='linear', kernel_params=None, max_iter=None,
       n_components=None, remove_zero_eig=False, tol=0))],
       transformer_weights=None)
Finally, a complete example:
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py
Thanks to the author, Andreas Mueller:

# Author: Andreas Mueller <amueller@ais.uni-bonn.de>
#
# License: BSD 3 clause

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris()

X, y = iris.data, iris.target

# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)

# Maybe some original features were good, too?
selection = SelectKBest(k=1)

# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

svm = SVC(kernel="linear")

# Do grid search over k, n_components and C:
pipeline = Pipeline([("features", combined_features), ("svm", svm)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
Done. Looks like feature extraction is going to save me a lot of work from now on...