Posts by mar*_*rin

Adding and removing words in the NLTK stopwords list

I am trying to add words to, and remove words from, the NLTK stopwords list:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('french'))

#add words that aren't in the NLTK stopwords list
new_stopwords = ['cette', 'les', 'cet']
new_stopwords_list = set(stop_words.extend(new_stopwords))

#remove words that are in NLTK stopwords list
not_stopwords = {'n', 'pas', 'ne'} 
final_stop_words = set([word for word in new_stopwords_list if word not in not_stopwords])

print(final_stop_words)

Output:

Traceback (most recent call last):
  File "test_stop.py", line 10, in <module>
    new_stopwords_list = set(stop_words.extend(new_stopwords))
AttributeError: 'set' object has no attribute 'extend'
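
A possible fix, as a minimal sketch: the traceback arises because `stop_words` is a `set`, and sets provide `update` and `union` rather than the list method `extend`. Set operators can build the final set in one expression. Here `customize_stopwords` is a hypothetical helper name, and the short `base` list is only a stand-in for `stopwords.words('french')` so that the snippet runs without the NLTK corpus.

```python
def customize_stopwords(base_words, to_add=(), to_remove=()):
    """Return a new set: base_words plus to_add, minus to_remove."""
    return (set(base_words) | set(to_add)) - set(to_remove)


# Stand-in for stopwords.words('french') (assumption, not the real list)
base = ['le', 'la', 'ne', 'pas', 'n']

final_stop_words = customize_stopwords(
    base,
    to_add=['cette', 'les', 'cet'],   # words missing from the base list
    to_remove={'n', 'pas', 'ne'},     # words that should not be stopwords
)

print(sorted(final_stop_words))
```

With NLTK installed and the corpus downloaded, `stopwords.words('french')` can be passed as `base_words`. Note that `set.update` modifies the set in place and returns `None`, so wrapping its result in `set(...)` would fail as well.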

python list set nltk python-3.x

3 votes · 1 answer · 5572 views

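Removing "None" from the output

I am trying to remove every sentence that is not in French. I tried using the langdetect library (without pandas, unfortunately).

CSV file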

message
Je suis fatiguée
The book is on the table
Il fait chaud aujourd'hui!
They are sicks
La vie est belle

Script:

import csv
from langdetect import detect

with open('ddd.csv', 'r') as file:
    fichier = csv.reader(file)

    for line in fichier:
        if line[0] != '':
            message = line[0]

            def detecteur_FR(message):
                #We need to turn the column into a list of lists.
                message_list = [comments for comments in message.split('\n')]
                for text in message_list:
                    if detect(text) == 'fr':
                        message_FR = text …
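
The rest of the script is cut off, so the cause of the "None" output cannot be confirmed from what is shown. One common cause, stated here as an assumption, is printing the result of a function that returns nothing for non-French rows. A minimal sketch that only ever produces French rows is given below. `french_messages` and `fake_detect` are hypothetical names, and `fake_detect` is a crude stand-in for `langdetect.detect` so that the snippet runs offline.

```python
import csv
import io


def french_messages(rows, detect):
    """Yield the first cell of each row that `detect` labels as French."""
    for row in rows:
        # Skip empty rows and empty first cells
        if not row or not row[0].strip():
            continue
        message = row[0]
        if detect(message) == 'fr':
            yield message


def fake_detect(text):
    # Hypothetical stand-in for langdetect.detect (prefix check only)
    french_markers = ('je ', 'il ', 'la ')
    return 'fr' if text.lower().startswith(french_markers) else 'en'


# In-memory copy of the CSV file from the question
sample = io.StringIO(
    "message\n"
    "Je suis fatiguée\n"
    "The book is on the table\n"
    "Il fait chaud aujourd'hui!\n"
    "They are sicks\n"
    "La vie est belle\n"
)

reader = csv.reader(sample)
next(reader)  # skip the header row
kept = list(french_messages(reader, fake_detect))

for message in kept:
    print(message)
```

Because the generator yields only matching rows, nothing is emitted for the English sentences. With langdetect installed, `detect` from `langdetect` can be passed in place of `fake_detect`.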

python python-3.x

2 votes · 1 answer · 149 views

Finding the largest number in a column

I am trying to find the month (column 'Month') that has the largest number in the DepDelay column.

Data

flightID         Month  ArrTime ActualElapsedTime  DepDelay   ArrDelay
BBYYEUVY67527        1   1514.0               58.0       NA      64.0   
MUPXAQFN40227        1     37.0              120.0       13      52.0   
LQLYUIMN79169        1    916.0              166.0       NA     -25.0   
KTAMHIFO10843        1      NaN                NaN        5       NaN   
BOOXJTEY23623        1      NaN                NaN        4       NaN  
BBYYEUVY67527        2   1514.0               58.0       NA      64.0   
MUPXAQFN40227        2     37.0              120.0       NA      52.0   
LQLYUIMN79169        2    916.0              166.0       NA     -25.0   
KTAMHIFO10843        2      NaN                NaN       15       NaN   
BOOXJTEY23623        2      NaN                NaN        4       NaN  

I tried:

data = pd.read_csv('data.csv', sep='\t')

dep_delay = all_data.groupby(["Month"].DepDelay.count().max())

print(dep_delay)

Error:

AttributeError                            Traceback …
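
The traceback is truncated, but two things in the attempt stand out: the frame is read into `data` while the call uses `all_data`, and the closing parenthesis of `groupby` is misplaced, so `.DepDelay` is looked up on the plain list `["Month"]`. A minimal sketch of one way to get the month, assuming "largest number" means the largest delay value, follows. The inline sample is a reduced three-column copy of the data above, and the variable names are illustrative.

```python
import io

import pandas as pd

# Reduced copy of the question's data (only the columns used here)
raw = io.StringIO(
    "flightID\tMonth\tDepDelay\n"
    "BBYYEUVY67527\t1\tNA\n"
    "MUPXAQFN40227\t1\t13\n"
    "LQLYUIMN79169\t1\tNA\n"
    "KTAMHIFO10843\t1\t5\n"
    "BOOXJTEY23623\t1\t4\n"
    "BBYYEUVY67527\t2\tNA\n"
    "MUPXAQFN40227\t2\tNA\n"
    "LQLYUIMN79169\t2\tNA\n"
    "KTAMHIFO10843\t2\t15\n"
    "BOOXJTEY23623\t2\t4\n"
)
data = pd.read_csv(raw, sep='\t')  # "NA" is parsed as missing by default

# Largest departure delay per month, then the month holding the overall maximum
max_delay_per_month = data.groupby('Month')['DepDelay'].max()
worst_month = max_delay_per_month.idxmax()

# Alternative reading: the month with the most recorded (non-missing) delays
delays_recorded = data.groupby('Month')['DepDelay'].count()
busiest_month = delays_recorded.idxmax()

print(worst_month, busiest_month)
```

`idxmax` returns the index label, which here is the month, whereas `max` alone would return only the value.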

python dataframe python-3.x pandas pandas-groupby

2 votes · 1 answer · 79 views

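Filling an empty column based on information from another column with pandas

I am trying to fill an empty column based on the contents of another column.

My dataframe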

   A        B                                    C
0  F    House                     Are you at home?
1  E    House    description: to deliver tomorrow
2  F    Apt                 Here is some exemples 
3  F    House          description: a brown table
4  E    Apt               description: in the bus
5  F    House                 Hello, how are you?
6  E    Apt                     description: keys

So I create a column D: if column C starts with 'description', I fill in 'fuzzy'; otherwise I fill in 'buzzy'.

new_column['D'] = ''

I tried to fill it:

def fill_column(delete_column):
    if new_column['D'].loc[new_column['D'].str.startswith('description:'):
        new_column['D'] == 'fuzzy'
    else:
        new_column['D'] == 'buzzy'

    return new_column

My output:

  File "<ipython-input-41-ec3c1407168c>", line 6
    else:
       ^
SyntaxError: invalid syntax …
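
The SyntaxError is consistent with the `.loc[` bracket in the `if` line never being closed. Beyond that, `==` compares instead of assigning, and the test is applied to the empty column D rather than to C. A vectorized sketch that avoids the `if`/`else` altogether is shown below. It rebuilds the question's dataframe inline, and `is_description` is an illustrative name.

```python
import pandas as pd

# The dataframe from the question, rebuilt inline
new_column = pd.DataFrame({
    'A': ['F', 'E', 'F', 'F', 'E', 'F', 'E'],
    'B': ['House', 'House', 'Apt', 'House', 'Apt', 'House', 'Apt'],
    'C': [
        'Are you at home?',
        'description: to deliver tomorrow',
        'Here is some exemples',
        'description: a brown table',
        'description: in the bus',
        'Hello, how are you?',
        'description: keys',
    ],
})

# Boolean mask: True where column C starts with the prefix
is_description = new_column['C'].str.startswith('description:')

# Map the mask to the two labels and assign the result to column D
new_column['D'] = is_description.map({True: 'fuzzy', False: 'buzzy'})

print(new_column[['C', 'D']])
```

The mask is computed for the whole column at once, so no row-by-row branching is needed.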

python pandas

2 votes · 1 answer · 38 views

Clustering with word2vec and Kmeans

I am trying to do clustering with word2vec and Kmeans, but it does not work.

Here is part of my data:

demain fera chaud à paris pas marseille
mauvais exemple ce n est pas un cliché mais il faut comprendre pourquoi aussi
il y a plus de travail à Paris c est d ailleurs pour cette raison qu autant de gens",
mais s il y a plus de travail, il y a aussi plus de concurrence
s agglutinent autour de la capitale

Script:

import nltk
import pandas
import pprint
import numpy as np
import pandas …

python cluster-analysis k-means python-3.x word2vec

1 vote · 1 answer · 4405 views

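Classification using one file entirely for training and another entirely for testing

I am trying to do a classification where one file is used entirely for training and another entirely for testing. Is this possible? I tried: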

import pandas
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import cross_validation
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer, TfidfTransformer
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score

#csv file from train
df = pd.read_csv('data_train.csv', sep = ',')

#csv file from test
df_test = pd.read_csv('data_test.csv', sep = …
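
The remainder of the script is not shown, so the following is only a sketch of the general pattern, not the asker's code: fit the vectorizer and the classifier on the training data alone, then apply the fitted pipeline unchanged to the test data. The tiny in-memory lists are invented stand-ins for the text and label columns of `data_train.csv` and `data_test.csv`, whose real column names are not given.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Hypothetical stand-ins for the training file's text and label columns
train_texts = ['good product', 'great service', 'bad product', 'terrible service']
train_labels = ['pos', 'pos', 'neg', 'neg']

# Hypothetical stand-ins for the test file's text and label columns
test_texts = ['great product', 'terrible product']
test_labels = ['pos', 'neg']

model = Pipeline([
    ('tfidf', TfidfVectorizer()),    # vocabulary is learned from training text only
    ('clf', LogisticRegression()),
])

# Fit on the training file only
model.fit(train_texts, train_labels)

# The fitted pipeline transforms and classifies the test file
predictions = model.predict(test_texts)

print(accuracy_score(test_labels, predictions))
```

With two separate files there is no need for `train_test_split`: the split already exists, and wrapping the steps in a `Pipeline` keeps the test data out of the vectorizer's fitting step.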

python machine-learning python-3.x scikit-learn text-classification

0 votes · 1 answer · 282 views

AttributeError: 'list' object has no attribute 'lower': clustering

I am trying to do clustering. I am using pandas and sklearn.

import pandas
import pprint
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pandas.read_csv('text.csv', encoding='utf-8')

dataset_list = dataset.values.tolist()


vectors = TfidfVectorizer()
X = vectors.fit_transform(dataset_list)

clusters_number = 20

model = KMeans(n_clusters = clusters_number, init = 'k-means++', max_iter = 300, n_init = 1)

model.fit(X)

centers = model.cluster_centers_
labels = model.labels_

clusters = {}
for comment, label in zip(dataset_list, labels):
    print ('Comment:', comment)
    print ('Label:', label)

try:
    clusters[str(label)].append(comment)
except:
    clusters[str(label)] = [comment] …
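
A likely cause of the error in the title, sketched in plain Python: `dataset.values.tolist()` returns one inner list per CSV row, while `TfidfVectorizer.fit_transform` expects an iterable of strings and calls `.lower()` on each item. Flattening each row to its text cell gives it strings. The sample rows and the `labels` list are invented stand-ins for the real CSV contents and for `model.labels_`.

```python
from collections import defaultdict

# Stand-in for dataset.values.tolist(): one inner list per CSV row
dataset_list = [['il fait chaud'], ['la vie est belle'], ['il pleut']]

# One string per document, which is the shape fit_transform expects
documents = [row[0] for row in dataset_list]

# Hypothetical labels standing in for model.labels_
labels = [0, 1, 0]

# Group comments by cluster label; the append happens inside the loop,
# and defaultdict removes the need for try/except
clusters = defaultdict(list)
for comment, label in zip(documents, labels):
    clusters[str(label)].append(comment)

print(dict(clusters))
```

In the original script the `try`/`except` block sits outside the `for` loop, so only the last comment would be stored. Selecting the text column directly, for example with `dataset[column_name].tolist()`, would also yield a flat list of strings.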

python python-3.x pandas scikit-learn

-1 votes · 1 answer · 2157 views

NameError: name 'fit_classifier' is not defined

I am trying to do text classification:

import pandas as pd
import pandas
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn import cross_validation
from sklearn.metrics import confusion_matrix

dataset = pd.read_csv('data.csv', encoding = 'utf-8')
data = dataset['text']
labels = dataset['label']

X_train, X_test, y_train, y_test = train_test_split (data, labels, test_size = 0.2, random_state = 0)

count_vector = CountVectorizer()
tfidf = TfidfTransformer() …

python classification python-3.x scikit-learn text-classification

-2 votes · 1 answer · 1403 views