http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
<strong>1. Chaining vectorizer => transformer => classifier with a Pipeline</strong>

from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                    ])
text_clf = text_clf.fit(rawData.data, rawData.target)
predicted = text_clf.predict(docs_new)

<strong>Note: predict() must receive the raw, unprocessed documents here, not X_new_tfidf; otherwise the error below occurs.</strong>

np.mean(predicted == y_new_target)
Out[51]: 0.5

predicted = text_clf.predict(X_new_tfidf)
Traceback (most recent call last):
  File "<ipython-input-52-20002e79f960>", line 1, in <module>
    predicted = text_clf.predict(X_new_tfidf)
  File "D:\Anaconda\lib\site-packages\sklearn\pipeline.py", line 149, in predict
    Xt = transform.transform(Xt)
  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 867, in transform
    _, X = self._count_vocab(raw_documents, fixed_vocab=True)
  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 748, in _count_vocab
    for feature in analyze(doc):
  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 234, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 200, in <lambda>
    return lambda x: strip_accents(x.lower())
  File "D:\Anaconda\lib\site-packages\scipy\sparse\base.py", line 499, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: lower not found

<strong>2. Tuning parameters with grid search</strong>

from sklearn.grid_search import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
             }
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)  # n_jobs=-1 tells grid search to detect how many cores the machine has and use all of them to run in parallel
gs_clf = gs_clf.fit(rawData.data, rawData.target)
rawData.target_names[gs_clf.predict(['i love this book'])[0]]
'positive folder'
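The error above comes from the first pipeline step: CountVectorizer expects raw strings and tries to call .lower() on whatever it receives, so a pre-computed sparse matrix blows up. A minimal, self-contained sketch of both the working call and the failing one (using a toy corpus and labels of my own, not the post's rawData):

```python
# Sketch: the Pipeline runs CountVectorizer first, so predict() must
# receive raw strings. Toy data below is assumed for illustration.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

docs = ["i love this book", "great read, loved it",
        "terrible book", "i hate this story"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(docs, labels)

# Raw text goes straight in: the pipeline vectorizes it internally.
preds = text_clf.predict(["what a lovely book"])
print(preds)

# Feeding already-transformed features instead makes CountVectorizer
# try .lower() on a sparse matrix, reproducing the AttributeError above.
X_tfidf = text_clf.named_steps['tfidf'].transform(
    text_clf.named_steps['vect'].transform(docs))
try:
    text_clf.predict(X_tfidf)
except AttributeError as e:
    print("AttributeError:", e)
```

The same rule applies to any Pipeline: pass the input that the *first* step expects, and let the chain handle the rest.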
Print the best-performing parameters:

>>> best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
>>> for param_name in sorted(parameters.keys()):
...     print("%s: %r" % (param_name, best_parameters[param_name]))
...
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)
>>> score
1.000
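Note that `sklearn.grid_search` and the `grid_scores_` attribute belong to old scikit-learn releases and have since been removed; current versions expose the same information through `best_params_` and `best_score_` in `sklearn.model_selection`. A sketch of the equivalent workflow on that API (toy corpus and labels assumed, not the post's rawData; `cv=2` because the toy set is tiny):

```python
# Sketch using the current scikit-learn API: GridSearchCV lives in
# sklearn.model_selection, and best_params_/best_score_ replace
# the removed grid_scores_ attribute. Toy data for illustration.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

docs = ["i love this book", "great read", "loved every page",
        "terrible book", "waste of time", "i hate this story"]
labels = [1, 1, 1, 0, 0, 0]

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}

# n_jobs=-1 still means "use all available cores"; cv=2 keeps each
# stratified fold large enough for this 6-document toy set.
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, cv=2)
gs_clf.fit(docs, labels)

for param_name in sorted(parameters):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
print(gs_clf.best_score_)
```

Full per-combination results, if needed, are in the `cv_results_` dict.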