Tag: text-classification

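How do I display the topics of the Reuters dataset in Keras?

I am using the Reuters dataset in Keras.

I want to know the names of the 46 topics.

How can I display the topics of the Reuters dataset in Keras?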

https://keras.io/datasets/#reuters-newswire-topics-classification

text-classification deep-learning keras

hye*_*eon

2017 07-17

9
votes

1
answer

588
views

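Sklearn: ROC for multiclass classification

I am running several text classification experiments. Now I need to compute the AUC-ROC for each task. For binary classification, I already have this working with the following code: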

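import matplotlib.pyplot as plt
from sklearn import linear_model, model_selection
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import auc, roc_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# `instances` (a list of feature dictionaries) and `labels` are defined elsewhere
enc = LabelEncoder()
y = enc.fit_transform(labels)

feat_sel = SelectKBest(mutual_info_classif, k=200)

clf = linear_model.LogisticRegression()

pipe = Pipeline([('vectorizer', DictVectorizer()),
                 ('scaler', StandardScaler(with_mean=False)),
                 ('mutual_info', feat_sel),
                 ('logistregress', clf)])
# roc_curve needs continuous scores, so request the positive-class probability
# instead of the hard labels that cross_val_predict returns by default
y_score = model_selection.cross_val_predict(pipe, instances, y, cv=10,
                                            method='predict_proba')[:, 1]

# ROC-AUC visualisation
fpr, tpr, thresholds = roc_curve(y, y_score)
roc_auc = auc(fpr, tpr)  # separate name so the auc() function is not shadowed
print('auc =', roc_auc)

plt.figure()
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([-0.1, 1.2])
plt.ylim([-0.1, 1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()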

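But now I need to do this for a multiclass classification task. I read somewhere that I need to binarize the labels, but I really don't understand how to compute ROC for multiclass classification. Any tips?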
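
One common way to approach this is a one-vs-rest ROC: binarize the labels into one indicator column per class, then compute a curve per class from the predicted class probabilities. The sketch below is only an illustration under assumptions: the synthetic data from `make_classification` stands in for the real text features, and the plain `LogisticRegression` stands in for the full pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import label_binarize

# Synthetic 3-class data standing in for the real features (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
classes = np.unique(y)

# Out-of-fold class probabilities, shape (n_samples, n_classes).
y_score = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                            cv=5, method='predict_proba')

# One indicator column per class: column i is 1 where y == classes[i].
y_bin = label_binarize(y, classes=classes)

# One-vs-rest ROC curve and AUC for each class.
roc_auc = {}
for i, cls in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])
    roc_auc[cls] = auc(fpr, tpr)

# Micro-average: pool every (sample, class) decision into one curve.
fpr_micro, tpr_micro, _ = roc_curve(y_bin.ravel(), y_score.ravel())
roc_auc_micro = auc(fpr_micro, tpr_micro)

print(roc_auc, roc_auc_micro)
```

Each per-class `fpr`/`tpr` pair can be plotted exactly as in the binary case. Whether a per-class, micro-averaged, or macro-averaged number is the right summary depends on how balanced the classes are.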

python roc scikit-learn text-classification multiclass-classification

Bam*_*mbi

9
votes

2
answers

20k
views

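Testing an NLTK classifier on a specific file

The following code runs a Naive Bayes movie-review classifier. The code produces a list of the most informative features.

Note: the **movie review** folder is in nltk.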

import string
from itertools import chain
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
stop = stopwords.words('english')

# Each document is (filtered tokens, label); the label is the folder name in the file id
documents = [([w for w in movie_reviews.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in movie_reviews.fileids()]


word_features = FreqDist(chain(*[i for i,j in documents]))
# list() keeps the slice valid on Python 3, where keys() returns a view
word_features = list(word_features.keys())[:100]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in …
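
As a rough illustration of classifying one specific text with a trained NLTK Naive Bayes classifier, here is a minimal sketch. The tiny in-memory training set, the `vocabulary` list, and the `document_features` helper are all hypothetical stand-ins for the movie-review data above; for a real file, the string in `new_text` would be replaced by the file's contents.

```python
from nltk.classify import NaiveBayesClassifier

# Hypothetical feature vocabulary (the question uses the 100 words in word_features).
vocabulary = ['great', 'awful', 'plot']


def document_features(tokens):
    """Map a token list to the same {word: bool} features used for training."""
    token_set = set(tokens)
    return {word: (word in token_set) for word in vocabulary}


# Toy training documents standing in for the movie_reviews corpus (assumption).
train_docs = [
    (['great', 'plot'], 'pos'),
    (['great', 'acting'], 'pos'),
    (['awful', 'plot'], 'neg'),
    (['awful', 'acting'], 'neg'),
]
train_set = [(document_features(tokens), label) for tokens, label in train_docs]
classifier = NaiveBayesClassifier.train(train_set)

# Text to test; for a file, read its contents into this string first.
new_text = "a great film with a great cast"
predicted = classifier.classify(document_features(new_text.lower().split()))
print(predicted)
```

The key point is that the new text must go through exactly the same feature extraction as the training documents before `classify` is called.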

nlp classification nltk python-2.7 text-classification

ZaM*_*ZaM

2017 05-23

8
votes

1
answer

2396
views

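SkLearn Multinomial NB: most informative features

Since my classifier reaches about 99% accuracy on the test data, I am a bit suspicious and want to look at the most informative features of my NB classifier to see what kind of features it is learning. The following thread was very useful: How to get most informative features for scikit-learn classifiers?

As for my feature input, I am still experimenting; at the moment I am testing a simple unigram model using CountVectorizer: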

 vectorizer = CountVectorizer(ngram_range=(1, 1), min_df=2, stop_words='english')

In the thread mentioned above, I found the following function:

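def show_most_informative_features(vectorizer, clf, n=20):
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    feature_names = vectorizer.get_feature_names_out()
    # Pair each coefficient in the first coef_ row with its feature, sorted ascending
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    # Walk the n lowest and the n highest coefficients side by side
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))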

This gives the following result:

    -16.2420        114th                   -4.0020 said           
    -16.2420        115                     -4.6937 obama          
    -16.2420        136                     -4.8614 house          
    -16.2420        14th                    -5.0194 president      
    -16.2420        15th                    -5.1236 state          
    -16.2420        1600                    -5.1370 senate         
    -16.2420        16th                    -5.3868 new            
    -16.2420        1920                    -5.4004 republicans    
    -16.2420        1961                    -5.4262 republican     
    -16.2420        1981 …

python classification machine-learning scikit-learn text-classification

Ali*_*ice

2017 05-23

8
votes

1
answer

5299
views

TensorFlow - text classification using neural networks

Are there any examples of how to use neural networks for text classification in TensorFlow?

text-classification tensorflow

Sum*_*wla

8
votes

1
answer

10k
views

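How to handle varying text lengths for text classification with a CNN (Keras)

CNNs (convolutional neural networks) have been shown to be very useful for text/document classification. I would like to know how to deal with differences in length, since in most cases the articles have different lengths. Are there any examples in Keras? Thanks!!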
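
A common way to handle this is to pad short sequences and truncate long ones so that every document has the same length before it reaches the network. The NumPy sketch below only illustrates the idea; `pad_to_length` is a hypothetical helper, not a Keras function. Keras ships its own `pad_sequences` utility for this, which by default pads and truncates at the start of the sequence rather than at the end as done here.

```python
import numpy as np


def pad_to_length(sequences, maxlen, pad_value=0):
    """Right-pad or right-truncate lists of word ids to exactly maxlen."""
    padded = np.full((len(sequences), maxlen), pad_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]                # drop ids beyond maxlen
        padded[row, :len(trimmed)] = trimmed  # remaining slots keep pad_value
    return padded


# Three documents of different lengths, already mapped to word ids.
docs = [[4, 12, 7], [9, 3, 8, 15, 2, 6], [5]]
batch = pad_to_length(docs, maxlen=4)
print(batch)
```

The id used for padding (0 here) should be reserved so that it never collides with a real word.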

nlp text-classification deep-learning keras

Fio*_*ong

8
votes

1
answer

4430
views

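How do I properly combine numerical features with text (bag of words) in scikit-learn?

I am writing a classifier for web pages, so I have a mixture of numerical features, and I also want to classify the text. I am using the bag-of-words approach to transform the text into a (large) numerical vector. The code ends up looking like this: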

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

numerical_features = [
  [1, 0],
  [1, 1],
  [0, 0],
  [0, 1]
]
corpus = [
  'This is the first document.',
  'This is the second second document.',
  'And the third one',
  'Is this the first document?',
]
bag_of_words_vectorizer = CountVectorizer(min_df=1)
X = bag_of_words_vectorizer.fit_transform(corpus)  # sparse word-count matrix
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(X)  # accepts the sparse counts directly

# scikit-learn >= 1.0; older releases call this get_feature_names()
bag_of_words_vectorizer.get_feature_names_out()
# numerical columns first, then the dense tf-idf columns
combinedFeatures = np.hstack([numerical_features, tfidf.toarray()])

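This works, but I am concerned about accuracy. Note that there are 4 objects and only two numerical features. Even the simplest text results in a vector with nine features (because there are nine distinct words in the corpus). Obviously, with real text there will be hundreds or thousands of distinct words, so the final feature vector will have fewer than 10 numerical features but more than 1000 word-based ones.

Because of this, won't the classifier (SVM) weight the words over the numerical features by a factor of 100 to 1? If so, how can I compensate so that the bag of words is weighted equally with the numerical features?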
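
One possible way to control the balance, sketched here under assumptions, is to put each block on a known scale and then multiply it by an explicit weight before stacking. `TfidfVectorizer` L2-normalizes each row by default, so the whole text block has norm 1 per document regardless of vocabulary size. The two weights below are hypothetical knobs to tune by cross-validation, not recommended values.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

numerical_features = np.array([[1, 0], [1, 1], [0, 0], [0, 1]], dtype=float)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]

# Text block: each row is L2-normalized, so its norm is 1 however many words exist.
tfidf = TfidfVectorizer().fit_transform(corpus)

# Numerical block: zero mean and unit variance per column.
numeric_scaled = StandardScaler().fit_transform(numerical_features)

# Hypothetical block weights (assumption); tune them like any other hyperparameter.
text_weight, numeric_weight = 1.0, 1.0

# Keep everything sparse instead of calling toarray() on a large vocabulary.
combined = hstack([csr_matrix(numeric_scaled * numeric_weight),
                   tfidf * text_weight]).tocsr()
print(combined.shape)
```

Whether the two blocks should really contribute equally is a modeling choice; the weights simply make that choice explicit instead of leaving it to the number of columns.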

python classification scikit-learn text-classification

Phe*_*Kai

8
votes

1
answer

2341
views

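VocabularyProcessor function

I am studying embedding inputs for convolutional neural networks, and I understand Word2vec. However, in his CNN text classification code, dennybritz uses the function learn.preprocessing.VocabularyProcessor. The documentation says that it maps documents to sequences of word ids. I am not quite clear on how this function works. Does it create a list of ids and then map the ids to words, or does it hold a dictionary of words and their ids and simply return the ids when the function is run?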
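
To make the "dictionary of words and their ids" idea concrete, here is a small pure-Python sketch. `SimpleVocabulary` is a hypothetical class written for illustration only; it is not the library's implementation, and details such as tokenization or the id reserved for unknown words are assumptions.

```python
class SimpleVocabulary:
    """Toy word-to-id mapper: id 0 is reserved for padding and unknown words."""

    def __init__(self, max_document_length):
        self.max_document_length = max_document_length
        self.word_to_id = {}

    def fit(self, documents):
        # Assign the next free id to each word the first time it is seen.
        for doc in documents:
            for word in doc.split():
                if word not in self.word_to_id:
                    self.word_to_id[word] = len(self.word_to_id) + 1
        return self

    def transform(self, documents):
        # Look up ids, truncate to the maximum length, then pad with zeros.
        rows = []
        for doc in documents:
            ids = [self.word_to_id.get(word, 0) for word in doc.split()]
            ids = ids[:self.max_document_length]
            ids += [0] * (self.max_document_length - len(ids))
            rows.append(ids)
        return rows


vocab = SimpleVocabulary(max_document_length=4).fit(["the cat sat", "the dog"])
ids = vocab.transform(["the dog sat down", "cat"])
print(vocab.word_to_id)
print(ids)
```

In this sketch the dictionary is built once during `fit`, and `transform` only performs lookups, which is why the same vocabulary object has to be reused at prediction time.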

python text-classification tensorflow

ngo*_*yvu

8
votes

1
answer

6374
views

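Text classification with FastText using pre-trained word vectors

I am working on a text classification problem: given some text, I need to assign certain given labels to it.

I have tried Facebook's fastText library, which has two utilities of interest to me:

A) Word vectors with pre-trained models

B) Text classification utilities

However, these seem to be completely independent tools, as I have not been able to find any tutorial that combines the two.

What I want is to classify some text by taking advantage of the pre-trained word-vector models. Is there any way to do this?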
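
Independently of fastText's own tooling, one generic way to reuse pre-trained word vectors for classification is to average the vectors of the words in each text and feed that average to an ordinary classifier. The sketch below shows only this swapped-in technique, not fastText's built-in classifier: the two-dimensional `word_vectors` dictionary is a made-up stand-in for real pre-trained vectors, and `embed` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 2-d vectors standing in for real pre-trained word vectors (assumption).
word_vectors = {
    "goal": np.array([1.0, 0.1]), "match": np.array([0.9, 0.2]),
    "vote": np.array([0.1, 1.0]), "senate": np.array([0.2, 0.9]),
}
dim = 2


def embed(text):
    """Average the vectors of the known words; all zeros if none are known."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)


train_texts = ["goal match", "match goal goal", "vote senate", "senate vote vote"]
train_labels = ["sports", "sports", "politics", "politics"]

# Every text becomes one fixed-size vector, whatever its length.
X_train = np.vstack([embed(text) for text in train_texts])
clf = LogisticRegression().fit(X_train, train_labels)

prediction = clf.predict([embed("the senate will vote")])[0]
print(prediction)
```

Averaging discards word order, so this is a baseline rather than a replacement for a classifier that is trained end to end on the text.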

nlp text-classification word2vec fasttext

Jar*_*sIA

2017 12-08

8
votes

2
answers

4798
views

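Using NLTK to generate dictionaries for classifying tweets into predefined categories

I have a list of Twitter users (screen_names), and I need to classify them into 7 predefined categories according to their interests: Education, Art, Sports, Business, Politics, Automobiles, Technology. I extracted the last 100 tweets of each user in Python and, after cleaning the tweets, created a corpus for each user.

As described here, to classify tweets into multiple categories (unsupervised data/tweets):
I am trying to generate dictionaries of common words under each category so that I can use them for classification.

Is there a way to generate these dictionaries automatically for a custom set of words?

I could then use them to classify the Twitter data with a tf-idf classifier and get the degree to which a tweet corresponds to each category. The highest value would give the most probable category of the tweet.

But since the classification is based on these pre-generated dictionaries, I am looking for a way to generate them automatically for a custom list of categories.

Example dictionaries: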

Education - ['book','teacher','student'....]

Automobiles - ['car','auto','expo',....]
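
Assuming the dictionaries already exist, the scoring step described above could look roughly like the sketch below: each dictionary is treated as one short document, and the user's corpus is compared against it with tf-idf cosine similarity. The words added beyond the ones listed above, and the shortened user corpus, are hypothetical examples; generating the dictionaries automatically is not covered here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Category dictionaries; entries beyond the question's examples are made up.
category_words = {
    "Education": ["book", "teacher", "student", "students", "teachers", "learning"],
    "Automobiles": ["car", "auto", "expo", "engine", "driving"],
}
categories = list(category_words)

# Treat each dictionary as one document and learn the tf-idf vocabulary from them.
vectorizer = TfidfVectorizer()
category_matrix = vectorizer.fit_transform(
    [" ".join(category_words[name]) for name in categories])

# Shortened, cleaned corpus for one user (words outside the dictionaries are ignored).
user_corpus = ("students visited share learning experience students teachers "
               "others know coding vision driving codeindia")
user_vector = vectorizer.transform([user_corpus])

# One similarity score per category; the highest score is the predicted category.
scores = cosine_similarity(user_vector, category_matrix)[0]
best_category = categories[scores.argmax()]
print(dict(zip(categories, scores.round(3))), best_category)
```

Without stemming or lemmatization, "student" and "students" count as different words, which is why both forms appear in the illustrative dictionary.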

Example input/output:

**Input :** 
UserA - "students visited share learning experience eye opening 
article important preserve linaugural workshop students teachers 
others know coding like know alphabets vision driving codeindia office 
initiative get students tagging wrong people apologies apologies real 
people work..."
.
.
UserN - <another corpus of cleaned tweets>


**Expected output** : …

python nlp machine-learning nltk text-classification

Nis*_*wal

2020 06-09

8
votes

1
answer

804
views