I'm trying to add and remove words from the NLTK stopwords list:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('french'))
#add words that aren't in the NLTK stopwords list
new_stopwords = ['cette', 'les', 'cet']
new_stopwords_list = set(stop_words.extend(new_stopwords))
#remove words that are in NLTK stopwords list
not_stopwords = {'n', 'pas', 'ne'}
final_stop_words = set([word for word in new_stopwords_list if word not in not_stopwords])
print(final_stop_words)
Output:
Traceback (most recent call last):
File "test_stop.py", line 10, in <module>
new_stopwords_list = set(stop_words.extend(new_stopwords))
AttributeError: 'set' object has no attribute 'extend'
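A `set` has no `extend` method (that belongs to `list`), which is exactly what the traceback says. A minimal sketch of a fix using set union and difference instead:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('french'))

# sets have no extend(); union (|) adds the new words
new_stopwords = ['cette', 'les', 'cet']
new_stopwords_list = stop_words | set(new_stopwords)

# set difference (-) removes the unwanted words
not_stopwords = {'n', 'pas', 'ne'}
final_stop_words = new_stopwords_list - not_stopwords
print(final_stop_words)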
I'm trying to remove all phrases that are not in French. I tried using the langdetect library (unfortunately without pandas).
CSV file:
message
Je suis fatiguée
The book is on the table
Il fait chaud aujourd'hui!
They are sicks
La vie est belle
Script:
import csv
from langdetect import detect
with open('ddd.csv', 'r') as file:
    fichier = csv.reader(file)
    for line in fichier:
        if line[0] != '':
            message = line[0]

def detecteur_FR(message):
    #We need to turn the column into a list of lists.
    message_list = [comments for comments in message.split('\n')]
    for text in message_list:
        if detect(text) == 'fr':
            message_FR = text …
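For reference, a self-contained sketch of the filtering step, assuming the file has the `message` header shown above. `detect()` raises `LangDetectException` on empty or non-alphabetic text, so that case is caught:

import csv
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

messages_FR = []
with open('ddd.csv', 'r') as file:
    fichier = csv.reader(file)
    next(fichier)  # skip the 'message' header row
    for line in fichier:
        if line and line[0] != '':
            try:
                if detect(line[0]) == 'fr':
                    messages_FR.append(line[0])
            except LangDetectException:
                pass  # detect() fails on empty/non-alphabetic text
print(messages_FR)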
I'm trying to find the month (the 'Month' column) with the largest number of values in the DepDelay column.
Data:
flightID Month ArrTime ActualElapsedTime DepDelay ArrDelay
BBYYEUVY67527 1 1514.0 58.0 NA 64.0
MUPXAQFN40227 1 37.0 120.0 13 52.0
LQLYUIMN79169 1 916.0 166.0 NA -25.0
KTAMHIFO10843 1 NaN NaN 5 NaN
BOOXJTEY23623 1 NaN NaN 4 NaN
BBYYEUVY67527 2 1514.0 58.0 NA 64.0
MUPXAQFN40227 2 37.0 120.0 NA 52.0
LQLYUIMN79169 2 916.0 166.0 NA -25.0
KTAMHIFO10843 2 NaN NaN 15 NaN
BOOXJTEY23623 2 NaN NaN 4 NaN
I tried:
data = pd.read_csv('data.csv', sep='\t')
dep_delay = all_data.groupby(["Month"].DepDelay.count().max())
print(dep_delay)
Error:
AttributeError Traceback …
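A sketch of a fix, assuming the goal is the month with the most non-null `DepDelay` values (which is what the `count()` in the attempt suggests). Note the attempt also mixes up the names `data` and `all_data`, and the chained call is malformed:

import pandas as pd

data = pd.read_csv('data.csv', sep='\t')  # 'NA' is parsed as NaN by default

# count() ignores NaN, so this is the number of recorded delays per month
dep_delay = data.groupby('Month')['DepDelay'].count()
print(dep_delay.idxmax())  # the month with the most DepDelay values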
I'm trying to fill an empty column based on information from another column.
My dataframe:
A B C
0 F House Are you at home?
1 E House description: to deliver tomorrow
2 F Apt Here is some exemples
3 F House description: a brown table
4 E Apt description: in the bus
5 F House Hello, how are you?
6 E Apt description: keys
So I create a column D, which I fill with 'fuzzy' if column C starts with 'description', and with 'buzzy' if not.
new_column['D'] = ''
I tried to fill them in:
def fill_column(delete_column):
    if new_column['D'].loc[new_column['D'].str.startswith('description:'):
        new_column['D'] == 'fuzzy'
    else:
        new_column['D'] == 'buzzy'
    return new_column
My output:
File "<ipython-input-41-ec3c1407168c>", line 6
else:
^
SyntaxError: invalid syntax …
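The `if` line is missing a closing `]`, which is what triggers the `SyntaxError` at the `else`. Beyond that, row-wise logic like this is usually done vectorized; a sketch with `numpy.where`, assuming the dataframe really is named `new_column` as above:

import numpy as np

# 'fuzzy' where C starts with 'description:', otherwise 'buzzy'
new_column['D'] = np.where(
    new_column['C'].str.startswith('description:'), 'fuzzy', 'buzzy')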
I'm trying to cluster using word2vec and KMeans, but it isn't working.
Here is part of my data:
demain fera chaud à paris pas marseille
mauvais exemple ce n est pas un cliché mais il faut comprendre pourquoi aussi
il y a plus de travail à Paris c est d ailleurs pour cette raison qu autant de gens",
mais s il y a plus de travail, il y a aussi plus de concurrence
s agglutinent autour de la capitale
Script:
import nltk
import pandas
import pprint
import numpy as np
import pandas …
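Since the script is cut off, here is a minimal end-to-end sketch of the approach, assuming gensim's `Word2Vec` (4.x API) and one sentence per line in `text.csv`; the cluster count is arbitrary here:

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# read one sentence per line (the commas in the text would confuse read_csv)
with open('text.csv', encoding='utf-8') as f:
    sentences = [line.strip() for line in f if line.strip()]
tokenized = [s.split() for s in sentences]

# train word vectors on the corpus itself (gensim 4.x uses vector_size, not size)
w2v = Word2Vec(tokenized, vector_size=100, min_count=1)

# represent each sentence as the mean of its word vectors
X = np.array([np.mean([w2v.wv[w] for w in toks], axis=0) for toks in tokenized])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for sent, label in zip(sentences, kmeans.labels_):
    print(label, sent)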
I'm trying to do classification where one file is entirely the training set and another is entirely the test set. Is that possible? I tried:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer, TfidfTransformer
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score
#csv file from train
df = pd.read_csv('data_train.csv', sep = ',')
#csv file from test
df_test = pd.read_csv('data_test.csv', sep = …
python machine-learning python-3.x scikit-learn text-classification
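This is possible: skip `train_test_split` entirely, fit everything on one file, and evaluate on the other. A sketch, assuming both CSVs have `text` and `label` columns (those names are a guess):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('data_train.csv', sep=',')
df_test = pd.read_csv('data_test.csv', sep=',')

# fit the vectorizer on the training text only, then reuse it on the test text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(df['text'])
X_test = vectorizer.transform(df_test['text'])

clf = LogisticRegression(max_iter=1000).fit(X_train, df['label'])
print(classification_report(df_test['label'], clf.predict(X_test)))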
I'm trying to do clustering. I'm using pandas and sklearn.
import pandas
import pprint
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.feature_extraction.text import TfidfVectorizer
dataset = pandas.read_csv('text.csv', encoding='utf-8')
dataset_list = dataset.values.tolist()
vectors = TfidfVectorizer()
X = vectors.fit_transform(dataset_list)
clusters_number = 20
model = KMeans(n_clusters = clusters_number, init = 'k-means++', max_iter = 300, n_init = 1)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
clusters = {}
for comment, label in zip(dataset_list, labels):
    print('Comment:', comment)
    print('Label:', label)
    try:
        clusters[str(label)].append(comment)
    except:
        clusters[str(label)] = [comment] …
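The usual failure here is that `dataset.values.tolist()` produces a list of rows (a list of lists), while `TfidfVectorizer` expects an iterable of strings. A sketch of the fix, assuming the text sits in the first column:

# take one column of strings instead of whole rows
dataset_list = dataset.iloc[:, 0].astype(str).tolist()

vectors = TfidfVectorizer()
X = vectors.fit_transform(dataset_list)  # now each document is a single string

A `collections.defaultdict(list)` would also replace the bare `try/except` more cleanly.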
I'm trying to do text classification:
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
dataset = pd.read_csv('data.csv', encoding = 'utf-8')
data = dataset['text']
labels = dataset['label']
X_train, X_test, y_train, y_test = train_test_split (data, labels, test_size = 0.2, random_state = 0)
count_vector = CountVectorizer()
tfidf = TfidfTransformer() …
python classification python-3.x scikit-learn text-classification
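The snippet stops mid-pipeline. One common way to finish it, fitting the transforms on the training split only and reusing them on the test split (a sketch, not necessarily the original's intent):

X_train_counts = count_vector.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_counts)

# apply the same (already-fitted) transforms to the test data
X_test_counts = count_vector.transform(X_test)
X_test_tfidf = tfidf.transform(X_test_counts)

clf = OneVsOneClassifier(SVC(kernel='linear')).fit(X_train_tfidf, y_train)
predictions = clf.predict(X_test_tfidf)
print(confusion_matrix(y_test, predictions))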