I am doing multi-class text classification in Scikit-Learn, training the dataset with a Multinomial Naive Bayes classifier over several hundred labels. Below is an excerpt from the Scikit-Learn script used to fit the MNB model:
from __future__ import print_function
# read file.csv into a pandas DataFrame
import pandas as pd
path = 'data/file.csv'
merged = pd.read_csv(path, error_bad_lines=False, low_memory=False)
# define X and y using the original DataFrame
X = merged.text
y = merged.grid
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# create document-term matrices …

I am trying to do text classification on a large corpus (732,066 tweets) in Python.
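For reference, the first excerpt cuts off before the actual fitting step. A minimal sketch of how it presumably continues, assuming the standard CountVectorizer + MultinomialNB workflow (X_train_dtm, X_test_dtm, nb, and y_pred are my own names, not from the excerpt):

X_train_dtm = vect.fit_transform(X_train)  # learn the vocabulary and build the training matrix
X_test_dtm = vect.transform(X_test)  # reuse the same vocabulary for the test set
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)  # fit MNB on the document-term matrix
y_pred = nb.predict(X_test_dtm)  # predicted grid labels for the test set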
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset (the commented-out line is a leftover template example)
#dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
cols = ["text","geocoordinates0","geocoordinates1","grid"]
dataset = pd.read_csv('tweets.tsv', delimiter = '\t', usecols=cols, quoting = 3, error_bad_lines=False, low_memory=False)
# Removing non-ASCII characters (operates on one string at a time,
# replacing each non-ASCII character with a space)
def remove_non_ascii_1(text):
    return ''.join([i if ord(i) < 128 else ' ' for i in text])
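Applied column-wise it would look something like this (assuming the text column from usecols; astype(str) guards against NaN rows):

dataset['text'] = dataset['text'].astype(str).apply(remove_non_ascii_1)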
# Cleaning the texts
import re
import nltk
nltk.download('stopwords') …
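The cleaning step is truncated here; a minimal sketch of the re + NLTK stopword/stemming loop this setup usually leads into (the corpus list, the PorterStemmer choice, and the loop body are assumptions, not from the excerpt):

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
corpus = []
for tweet in dataset['text']:
    review = re.sub('[^a-zA-Z]', ' ', remove_non_ascii_1(str(tweet)))  # keep letters only
    review = review.lower().split()
    review = [ps.stem(w) for w in review if w not in stop_words]  # drop stopwords, stem the rest
    corpus.append(' '.join(review))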