scikit-learn: 0.4 Using a "Pipeline" to chain vectorizer => transformer => classifier, and tuning with grid search



http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html


<strong>1. Using a "Pipeline" to chain vectorizer => transformer => classifier</strong>
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

text_clf = text_clf.fit(rawData.data, rawData.target)
predicted = text_clf.predict(docs_new)
<strong># Note: pass the raw, unprocessed documents here, not X_new_tfidf; otherwise the error below occurs.</strong>

np.mean(predicted == y_new_target)
Out[51]: 0.5

predicted = text_clf.predict(X_new_tfidf)
Traceback (most recent call last):

  File "<ipython-input-52-20002e79f960>", line 1, in <module>
    predicted = text_clf.predict(X_new_tfidf)

  File "D:\Anaconda\lib\site-packages\sklearn\pipeline.py", line 149, in predict
    Xt = transform.transform(Xt)

  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 867, in transform
    _, X = self._count_vocab(raw_documents, fixed_vocab=True)

  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 748, in _count_vocab
    for feature in analyze(doc):

  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 234, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)

  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 200, in <lambda>
    return lambda x: strip_accents(x.lower())

  File "D:\Anaconda\lib\site-packages\scipy\sparse\base.py", line 499, in __getattr__
    raise AttributeError(attr + " not found")

AttributeError: lower not found
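The AttributeError arises because the pipeline's first step, CountVectorizer, expects raw strings and calls `.lower()` on each document; a tf-idf sparse matrix has no such method. A minimal self-contained sketch of the correct usage (the toy corpus and labels here are invented for illustration, standing in for `rawData.data` / `rawData.target`):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Invented toy corpus: 1 = positive review, 0 = negative review
docs = ["i love this book", "great story, loved it",
        "terrible plot, hated it", "awful book, do not buy"]
labels = [1, 1, 0, 0]

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(docs, labels)

# predict() takes raw strings: the pipeline runs every step itself,
# so there is no need to vectorize or tf-idf-transform new documents by hand.
print(text_clf.predict(["loved this great book"]))
```

Because the fitted vocabulary and idf weights live inside the pipeline, the same object applies them consistently at predict time, which is exactly the bookkeeping that went wrong when `X_new_tfidf` was passed in directly.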


<strong>2. Tuning parameters with grid search</strong>

from sklearn.grid_search import GridSearchCV

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
# n_jobs=-1 tells grid search to detect how many cores the machine has
# and use all of them to run the search in parallel.

gs_clf = gs_clf.fit(rawData.data, rawData.target)
rawData.target_names[gs_clf.predict(['i love this book'])[0]]
# predict() returns an array, so take its first element before indexing target_names
'positive folder'

Print the best-performing parameters:
>>> best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
>>> for param_name in sorted(parameters.keys()):
...     print("%s: %r" % (param_name, best_parameters[param_name]))
...
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)

>>> score                                              
1.000
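In current scikit-learn, `GridSearchCV` has moved to `sklearn.model_selection`, and `best_params_` / `best_score_` replace the old `grid_scores_` idiom used above. A minimal sketch on an invented toy corpus (all data below is made up for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV  # moved here in newer versions

# Invented toy corpus standing in for rawData.data / rawData.target
docs = ["i love this book", "great story, loved it", "wonderful read",
        "terrible plot, hated it", "awful book, do not buy", "boring and bad"]
labels = [1, 1, 1, 0, 0, 0]

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

# '<step name>__<param name>' addresses a parameter of a step inside the pipeline
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}

# cv=2 only because the toy set is tiny; add n_jobs=-1 to use every core, as above
gs_clf = GridSearchCV(text_clf, parameters, cv=2)
gs_clf.fit(docs, labels)

print(gs_clf.best_params_)  # replaces the max(gs_clf.grid_scores_, ...) idiom
print(gs_clf.best_score_)
```

The grid here has 2 × 2 × 2 = 8 candidate settings, each cross-validated, so the search refits the whole pipeline 16 times plus one final refit on the full data.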
