如何添加另一个功能(文本的长度)到当前的单词分类?Scikit学习

aar*_*vam 16 python classification machine-learning scikit-learn text-classification

我正在用文字袋来分类文字.它运作良好,但我想知道如何添加一个不是一个单词的功能.

这是我的示例代码.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.",
                    "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.",
                    "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.",
                    "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",])
y_train = [[0],[0],[0],[0],[1],[1],[1],[1]]

X_test = np.array(["it's a nice day in nyc",
                   'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.'
                   ])   
target_names = ['Class 1', 'Class 2']

classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_df=1,max_df=2)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
for item, labels in zip(X_test, predicted):
    print '%s => %s' % (item, ', '.join(target_names[x] for x in labels))
Run Code Online (Sandbox Code Playgroud)

现在很明显,关于伦敦的文本往往比关于纽约的文本长得多.如何将文本的长度添加为功能?我是否必须使用另一种分类方法,然后结合两种预测?有什么方法可以和一袋字一起做吗?一些示例代码会很棒 - 我对机器学习和scikit学习都很陌生.

Ken*_*yme 10

如评论中所示,这是a FunctionTransformer,a FeaturePipeline和a 的组合FeatureUnion.

import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import FunctionTransformer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.",
                    "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.",
                    "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.",
                    "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",])
y_train = np.array([[0],[0],[0],[0],[1],[1],[1],[1]])

X_test = np.array(["it's a nice day in nyc",
                   'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.'
                   ])   
target_names = ['Class 1', 'Class 2']


def get_text_length(x):
    return np.array([len(t) for t in x]).reshape(-1, 1)

classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('vectorizer', CountVectorizer(min_df=1,max_df=2)),
            ('tfidf', TfidfTransformer()),
        ])),
        ('length', Pipeline([
            ('count', FunctionTransformer(get_text_length, validate=False)),
        ]))
    ])),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
predicted
Run Code Online (Sandbox Code Playgroud)

这会将文本的长度添加到分类器使用的功能中.

  • @user1725306 我知道的三个选项。**1**。确保新数据与文本的顺序相同(在训练前拆分列),然后使用 FeatureUnion 将它们连接在一起。**2**。使用整个数据框作为输入,但使用 [mlxtend](https://github.com/rasbt/mlxtend/) 中的 ColumnSelector 来选择 FeatureUnion 的两个分支中的文本和附加信息。**3**。看看 [sklearn-pandas](https://github.com/scikit-learn-contrib/sklearn-pandas),它使 sklearn 数据帧感知。 (2认同)