Posts by Seu*_*JAO

Output scikit-learn's metrics.classification_report in CSV / tab-delimited format

I am doing multi-class text classification in scikit-learn, training the dataset with a Multinomial Naive Bayes classifier over several hundred labels. Below is an excerpt of the scikit-learn script used to fit the MNB model:

from __future__ import print_function

# Read file.csv into a pandas DataFrame

import pandas as pd
path = 'data/file.csv'
merged = pd.read_csv(path, error_bad_lines=False, low_memory=False)

# define X and y using the original DataFrame
X = merged.text
y = merged.grid

# split X and y into training and testing sets
# (sklearn.cross_validation has been removed in newer scikit-learn; use model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# create document-term matrices …
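A minimal sketch of one way to finish this pipeline and write classification_report out as tab-separated values, assuming scikit-learn >= 0.20 (for output_dict=True); the names X_train_dtm, X_test_dtm, nb and y_pred_class are illustrative and not part of the truncated excerpt:

# Sketch, not from the original question: fit MNB on the document-term matrices
# and export the per-label report via pandas.
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

X_train_dtm = vect.fit_transform(X_train)  # learn vocabulary on training text
X_test_dtm = vect.transform(X_test)        # reuse the same vocabulary for test text

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

# output_dict=True returns the report as nested dicts, which pandas can
# turn into a DataFrame and dump as a tab-separated file.
report = classification_report(y_test, y_pred_class, output_dict=True)
pd.DataFrame(report).transpose().to_csv('classification_report.tsv', sep='\t')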

python text classification machine-learning scikit-learn

20 votes · 7 answers · 20k views

Python text classification error - expected string or bytes-like object

I am trying to do text classification in Python on a large corpus (732,066 tweets):

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
#dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

# Importing the dataset
cols = ["text","geocoordinates0","geocoordinates1","grid"]
dataset = pd.read_csv('tweets.tsv', delimiter = '\t', usecols=cols, quoting = 3, error_bad_lines=False, low_memory=False)

# Removing non-ASCII characters (operates on a single string, character by character)
def remove_non_ascii_1(text):
    return ''.join([i if ord(i) < 128 else ' ' for i in text])

# Cleaning the texts
import re
import nltk
nltk.download('stopwords') …
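The "expected string or bytes-like object" error usually means a non-string value (for example a NaN from a malformed or dropped row) reached re.sub during cleaning. A minimal guard, sketched here with an illustrative clean_tweet helper rather than the original cleaning loop:

# Sketch, not from the original question: coerce non-string values before regex cleaning.
import re

def clean_tweet(text):
    # NaN, None or numeric cells become empty strings instead of crashing re.sub
    if not isinstance(text, str):
        return ''
    text = re.sub('[^a-zA-Z]', ' ', text)  # keep letters only
    return text.lower().strip()

corpus = [clean_tweet(t) for t in dataset['text']]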

python twitter text nlp classification

2 votes · 1 answer · 8,190 views