Chinese documentation: http://sklearn.apachecn.org/cn/stable/modules/pipeline.html

English documentation: http://sklearn.apachecn.org/en/stable/modules/pipeline.html

GitHub: https://github.com/apachecn/scikit-learn-doc-zh

Contributors: https://github.com/apachecn/scikit-learn-doc-zh#贡献者

About us: http://www.apachecn.org/organization/209.html

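4.1. Pipeline and FeatureUnion: combining estimators

4.1.1. Pipeline: chaining estimators

Pipeline can be used to chain multiple estimators into one. This is useful because data processing usually follows a fixed sequence of steps, for example feature selection, normalization and classification. Pipeline serves several purposes here:
Convenience and encapsulation: you only have to call fit and predict once on your data to fit a whole sequence of estimators.
Joint parameter selection: you can grid search over the parameters of all estimators in the pipeline at once.
Safety: pipelines help avoid leaking statistics from your test data into the trained model during cross-validation, by ensuring that the same samples are used to train the transformers and the predictor.

All estimators in a pipeline, except the last one, must be transformers (that is, they must have a transform method). The last estimator may be of any type (transformer, classifier, etc.).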
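As an illustration of these points, the following is a minimal sketch rather than part of the original guide. The iris dataset, the step names and the choice of StandardScaler, SelectKBest and LogisticRegression are assumptions made only for this example. Every step but the last is a transformer, and a single fit and predict call drives the whole chain.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data set; any (X, y) classification data would do.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step names are arbitrary labels chosen for this sketch.
pipe = Pipeline([
    ("scale", StandardScaler()),                 # transformer
    ("select", SelectKBest(f_classif, k=2)),     # transformer
    ("clf", LogisticRegression(max_iter=1000)),  # final estimator
])

# One fit call fits every step on the training data only,
# so no statistics from X_test reach the scaler or the selector.
pipe.fit(X_train, y_train)

# One predict call applies the fitted transformers, then the classifier.
predictions = pipe.predict(X_test)
accuracy = pipe.score(X_test, y_test)
```

Because the scaler and selector are fitted inside the pipeline, the same object can be passed to cross-validation helpers without refitting the preprocessing by hand.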

4.1.1.1. Usage

A Pipeline is built from a list of (key, value) pairs, where the key is a string naming the step and the value is an estimator object:

    >>> from sklearn.pipeline import Pipeline
    >>> from sklearn.svm import SVC
    >>> from sklearn.decomposition import PCA
    >>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
    >>> pipe = Pipeline(estimators)
    >>> pipe
    Pipeline(memory=None,
             steps=[('reduce_dim', PCA(copy=True,...)),
                    ('clf', SVC(C=1.0,...))])
 
    

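The utility function make_pipeline is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the step names automatically:

    >>> from sklearn.pipeline import make_pipeline
    >>> from sklearn.naive_bayes import MultinomialNB
    >>> from sklearn.preprocessing import Binarizer
    >>> make_pipeline(Binarizer(), MultinomialNB())
    Pipeline(memory=None,
             steps=[('binarizer', Binarizer(copy=True, threshold=0.0)),
                    ('multinomialnb', MultinomialNB(alpha=1.0,
                                                    class_prior=None,
                                                    fit_prior=True))])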
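To make the automatic naming concrete, here is a small sketch (not from the original guide) that reads the generated names back from the steps attribute. The second pipeline, which repeats StandardScaler, is an assumption added to show how repeated estimator types are told apart.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer, StandardScaler

# Names are derived from the lowercased class names.
pipe = make_pipeline(Binarizer(), MultinomialNB())
names = [name for name, _ in pipe.steps]

# When the same estimator type appears twice, a numeric suffix is appended.
dup = make_pipeline(StandardScaler(), StandardScaler())
dup_names = [name for name, _ in dup.steps]

print(names)
print(dup_names)
```

Since the names are generated, they are the ones to use in the <estimator>__<parameter> syntax when the pipeline was built with make_pipeline.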
 
   

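The estimators of a pipeline are stored as a list in the steps attribute:

    >>> pipe.steps[0]
    ('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
      svd_solver='auto', tol=0.0, whiten=False))

and as a dict in named_steps:

    >>> pipe.named_steps['reduce_dim']
    PCA(copy=True, ..., whiten=False)

Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax:

    >>> pipe.set_params(clf__C=10)
    Pipeline(...)

Attributes of named_steps map to the step names (keys), which enables tab completion in interactive environments:

    >>> pipe.named_steps.reduce_dim is pipe.named_steps['reduce_dim']
    True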
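The following sketch, added for illustration, shows the <estimator>__<parameter> syntax working in both directions: set_params writes a nested parameter and get_params reads it back. The parameter values are arbitrary.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

# Set parameters of two different steps in one call.
pipe.set_params(clf__C=10, reduce_dim__n_components=2)

# get_params exposes the same nested keys.
params = pipe.get_params()
c_value = params["clf__C"]

# The value is stored on the step's estimator itself.
step_c = pipe.named_steps["clf"].C

# Attribute access and key access return the same object.
same_object = pipe.named_steps.reduce_dim is pipe.named_steps["reduce_dim"]
```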

This is particularly important for grid searches:

    >>> from sklearn.model_selection import GridSearchCV
    >>> param_grid = dict(reduce_dim__n_components=[2, 5, 10],
    ...                   clf__C=[0.1, 10, 100])
    >>> grid_search = GridSearchCV(pipe, param_grid=param_grid)
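The snippet above only constructs the search. As a hedged sketch of actually running it, the example below fits the grid on the iris dataset; the dataset, the reduced n_components grid (iris has only four features) and cv=3 are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

# Keys follow the <estimator>__<parameter> syntax.
param_grid = {
    "reduce_dim__n_components": [2, 3],
    "clf__C": [0.1, 10, 100],
}

# Every combination refits the whole pipeline on each training fold.
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)

best = grid_search.best_params_
n_candidates = len(grid_search.cv_results_["params"])
print(best)
```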

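Individual steps may also be replaced as parameters, and any step except the last may be skipped by setting it to None:

    >>> from sklearn.linear_model import LogisticRegression
    >>> param_grid = dict(reduce_dim=[None, PCA(5), PCA(10)],
    ...                   clf=[SVC(), LogisticRegression()],
    ...                   clf__C=[0.1, 10, 100])
    >>> grid_search = GridSearchCV(pipe, param_grid=param_grid)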

See also:

Tuning the hyper-parameters of an estimator

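4.1.1.2. Notes

Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it on to the next step. The pipeline has all the methods that its last estimator has; for example, if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, so is the pipeline.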
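The equivalence described above can be checked with a short sketch. This is an illustration under assumed choices (iris data, StandardScaler followed by LogisticRegression), not the library's internal implementation: it fits a pipeline, performs the same steps by hand, and compares the predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Pipeline version: one fit call.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Manual version: fit and transform with the first step,
# then fit the final estimator on the transformed data.
scaler = StandardScaler()
clf = LogisticRegression(max_iter=1000)
X_scaled = scaler.fit_transform(X)
clf.fit(X_scaled, y)
manual_pred = clf.predict(scaler.transform(X))

# The pipeline exposes predict because its last step is a classifier.
same = np.array_equal(pipe.predict(X), manual_pred)
```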

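4.1.1.3. Caching transformers: avoid repeated computation

Fitting transformers can be computationally expensive. When its memory parameter is set, Pipeline caches each transformer after calling fit. This avoids refitting a transformer within a pipeline when its parameters and input data are identical. A typical example is a grid search, in which the transformers can be fitted only once and reused for each configuration.

The memory parameter is required to cache the transformers.

memory can be either a string containing the directory in which to cache the transformers, or a joblib.Memory object:

    >>> from tempfile import mkdtemp
    >>> from shutil import rmtree
    >>> cachedir = mkdtemp()
    >>> pipe = Pipeline(estimators, memory=cachedir)
    >>> pipe
    Pipeline(...,
             steps=[('reduce_dim', PCA(copy=True,...)),
                    ('clf', SVC(C=1.0,...))])
    >>> # Clear the cache directory when you don't need it anymore
    >>> rmtree(cachedir)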
 
   

Warning

Side effect of caching transformers

When a Pipeline is used without caching enabled, the original estimator instances can be inspected directly, for example:

    >>> from sklearn.datasets import load_digits
    >>> digits = load_digits()
    >>> pca1 = PCA()
    >>> svm1 = SVC()
    >>> pipe = Pipeline([('reduce_dim', pca1), ('clf', svm1)])
    >>> pipe.fit(digits.data, digits.target)
    Pipeline(...)
    >>> # The pca instance can be inspected directly
    >>> print(pca1.components_)
        [[ -1.77484909e-19  ... 4.07058917e-18]]
 
    

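Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance passed to the pipeline cannot be inspected directly. In the following example, accessing the fitted attributes of the PCA instance pca2 raises an AttributeError, because pca2 remains an unfitted transformer. Instead, use the named_steps attribute to inspect the estimators within the pipeline:

    >>> cachedir = mkdtemp()
    >>> pca2 = PCA()
    >>> svm2 = SVC()
    >>> cached_pipe = Pipeline([('reduce_dim', pca2), ('clf', svm2)],
    ...                        memory=cachedir)
    >>> cached_pipe.fit(digits.data, digits.target)
    Pipeline(memory=...,
             steps=[('reduce_dim', PCA(...)), ('clf', SVC(...))])
    >>> print(cached_pipe.named_steps['reduce_dim'].components_)
        [[ -1.77484909e-19  ... 4.07058917e-18]]
    >>> # Remove the cache directory
    >>> rmtree(cachedir)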
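The same side effect can be observed when memory is given as a joblib.Memory object rather than a directory string. The sketch below is an illustration only; n_components=10, verbose=0 and the variable names are assumptions chosen for this example.

```python
from shutil import rmtree
from tempfile import mkdtemp

from joblib import Memory
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# A temporary cache directory wrapped in a joblib.Memory object.
cachedir = mkdtemp()
memory = Memory(location=cachedir, verbose=0)

pca = PCA(n_components=10)
cached_pipe = Pipeline([("reduce_dim", pca), ("clf", SVC())], memory=memory)

try:
    cached_pipe.fit(X, y)
    # The instance passed in was cloned before fitting, so it stays unfitted.
    original_is_fitted = hasattr(pca, "components_")
    # The fitted clone is reachable through named_steps.
    fitted_shape = cached_pipe.named_steps["reduce_dim"].components_.shape
finally:
    # Remove the cache directory when it is no longer needed.
    rmtree(cachedir)
```

Checking with hasattr avoids the AttributeError that a direct access to pca.components_ would raise here.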

Examples:

Selecting dimensionality reduction with Pipeline and GridSearchCV

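4.1.2. FeatureUnion: composite feature spaces

FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. When transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.

FeatureUnion serves the same purposes as Pipeline: convenience, and joint parameter estimation and validation.

FeatureUnion and Pipeline can be combined to create complex models.

(A FeatureUnion has no way of checking whether two transformers might produce identical features. The result is a true union only when the feature sets are disjoint, and making sure that they are is the caller's responsibility.)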
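As a sketch of the concatenation and of nesting a FeatureUnion inside a Pipeline, the example below combines two transformers on the iris dataset. The choice of PCA and SelectKBest, the step names and the linear SVC are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Two transformers fitted independently on the same input.
union = FeatureUnion([
    ("pca", PCA(n_components=2)),   # contributes 2 columns
    ("kbest", SelectKBest(k=1)),    # contributes 1 column
])

# Outputs are concatenated column-wise: 2 + 1 = 3 features per sample.
X_combined = union.fit(X, y).transform(X)

# The union is itself a transformer, so it can be a pipeline step.
model = Pipeline([("features", union), ("clf", SVC(kernel="linear"))])
model.fit(X, y)

# Nested parameters chain the names with double underscores.
nested_keys = [k for k in model.get_params() if k.startswith("features__pca__")]
```

Nothing here checks whether the selected column duplicates information already captured by PCA; as noted above, that is left to the caller.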

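4.1.2.1. Usage

A FeatureUnion is built from a list of (key, value) pairs, where the key is the name you want to give to a transformation (an arbitrary string; it only serves as an identifier) and the value is an estimator object:

    >>> from sklearn.pipeline import FeatureUnion
    >>> from sklearn.decomposition import PCA
    >>> from sklearn.decomposition import KernelPCA
    >>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
    >>> combined = FeatureUnion(estimators)
    >>> combined
    FeatureUnion(n_jobs=1,
                 transformer_list=[('linear_pca', PCA(copy=True,...)),
                                   ('kernel_pca', KernelPCA(alpha=1.0,...))],
                 transformer_weights=None)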
 
   

Like pipelines, feature unions have a shorthand constructor called make_union that does not require explicit naming of the components.
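A brief sketch of make_union, added for illustration: the transformers chosen here (PCA and TruncatedSVD) are an assumption, and the generated names are read back from transformer_list.

```python
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.pipeline import make_union

# Names are generated from the lowercased class names.
union = make_union(PCA(), TruncatedSVD())
names = [name for name, _ in union.transformer_list]
print(names)
```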

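As with Pipeline, individual steps may be replaced using set_params, and skipped by setting them to None:

    >>> combined.set_params(kernel_pca=None)
    FeatureUnion(n_jobs=1,
                 transformer_list=[('linear_pca', PCA(copy=True,...)),
                                   ('kernel_pca', None)],
                 transformer_weights=None)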
