Posts by Vic*_*c13

How do I apply NLTK's word_tokenize to Twitter data in a Pandas DataFrame?

This is the code I am using for semantic analysis of Twitter data:

import pandas as pd
import datetime
import numpy as np
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

df = pd.read_csv('twitDB.csv', header=None,
                 sep=',', error_bad_lines=False, encoding='utf-8')

hula = df[[0, 1, 2, 3]]
hula = hula.fillna(0)
# concatenate the four text columns into one tweet string
hula['tweet'] = (hula[0].astype(str) + hula[1].astype(str)
                 + hula[2].astype(str) + hula[3].astype(str))
hula['tweet'] = hula.tweet.str.lower()

ho = hula['tweet']
ho = ho.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace
ho = ho.replace(r'\.+', '.', regex=True)  # collapse repeated periods
special_char_list = [':', ';', '?', '}', ')', '{', '(']
for special_char in special_char_list:
    # str.replace with regex=False strips the literal character from every tweet
    ho = ho.str.replace(special_char, '', regex=False)
print(ho)

ho = ho.replace(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', regex=True)  # mask URLs
ho = ho.replace(r'#([^\s]+)', r'\1', regex=True)  # keep hashtag text, drop the '#'
ho = ho.replace('\'"', '', regex=True)  # strip quote characters

lem = WordNetLemmatizer()
stem = PorterStemmer() …
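The excerpt imports word_tokenize but never calls it. A minimal sketch of how it could be applied to the cleaned Series, assuming the ho Series built above and that the required NLTK data packages are available:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('punkt', quiet=True)      # tokenizer model
nltk.download('stopwords', quiet=True)  # stopword list
nltk.download('wordnet', quiet=True)    # lemmatizer data

stop_words = set(stopwords.words('english'))
lem = WordNetLemmatizer()

def preprocess(text):
    # tokenize, keep alphabetic non-stopword tokens, lemmatize the rest
    tokens = word_tokenize(text)
    return [lem.lemmatize(t) for t in tokens if t.isalpha() and t not in stop_words]

tokens = ho.apply(preprocess)  # each cell becomes a list of tokens
print(tokens.head())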

python twitter tokenize nltk pandas

7 votes · 1 answer · 10k views

Error when using an SVC model with OneVsRestClassifier

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

classifier = SVC(C=100, # penalty parameter, setting it to a larger value 
             kernel='rbf', # kernel type, rbf working fine here
             degree=3, # default value, not tuned yet
             gamma=1, # kernel coefficient, not tuned yet
             coef0=1, # change to 1 from default value of 0.0
             shrinking=True, # using shrinking heuristics
             tol=0.001, # stopping criterion tolerance 
             probability=False, # no need to enable probability estimates
             cache_size=200, # 200 MB cache size
             class_weight=None, # all classes …
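The traceback is cut off above, but a common cause of errors here is passing the SVC class itself rather than a configured instance. A minimal sketch, on synthetic stand-in data, of the usual pattern for wrapping an SVC instance in OneVsRestClassifier:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# synthetic stand-in data; the asker's real features and labels go here
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC(C=100, kernel='rbf', gamma=1, coef0=1)  # an SVC *instance*, not the class
clf = OneVsRestClassifier(svc)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))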

python-3.x scikit-learn

6 votes · 1 answer · 2,089 views

ValueError: X has 1709 features per sample; expecting 2444

I am using this code:

import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
import re

Vectorizing with TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')

Loading the data files:

df = pd.read_json('train.json', orient='columns')
test_df = pd.read_json('test.json', orient='columns')

df['seperated_ingredients'] = df['ingredients'].apply(','.join)
test_df['seperated_ingredients'] = test_df['ingredients'].apply(','.join)

df['seperated_ingredients'] = df['seperated_ingredients'].str.lower()
test_df['seperated_ingredients'] = test_df['seperated_ingredients'].str.lower()

cuisines = {'thai': 0, 'vietnamese': 1, 'spanish': 2, 'southern_us': 3, 'russian': 4, 'moroccan': 5, 'mexican': 6, 'korean': 7, 'japanese': 8, 'jamaican': 9, 'italian': 10, 'irish': 11, 'indian': 12, 'greek': 13, 'french': 14, 'filipino': 15, 'chinese': 16, 'cajun_creole': 17, 'british': 18, 'brazilian': 19}
df.cuisine = [cuisines[item] for item in df.cuisine]
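As an aside, the same label encoding can be written with Series.map, which turns any cuisine missing from the dict into NaN instead of raising a KeyError inside the list comprehension; a small equivalent sketch:

# equivalent to the list comprehension above, but easier to debug
df['cuisine'] = df['cuisine'].map(cuisines)
assert df['cuisine'].notna().all(), 'a cuisine label is missing from the mapping'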

Preprocessing:

ho = df['seperated_ingredients']
ho = ho.replace(r'#([^\s]+)', r'\1', regex=True)
ho = ho.replace('\'"', '', regex=True)  # strip quote characters

ho = tv.fit_transform(ho)  # fit the vocabulary on the training text

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(ho, df['cuisine'], random_state=0)


from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='l1')
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# refit on the full training matrix before predicting on the test file
clf1 = LogisticRegression(penalty='l1')
clf1.fit(ho, df['cuisine'])

hs = test_df['seperated_ingredients']

hs = hs.replace(r'#([^\s]+)', r'\1', regex=True)
hs = hs.replace('\'"', '', regex=True) …
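The error in the title ("X has 1709 features per sample; expecting 2444") is the classic symptom of vectorizing the test text with a second fit: calling fit_transform on hs builds a new, differently sized vocabulary. A minimal sketch of the usual fix, reusing the tv, clf1, and hs names from above:

# transform (not fit_transform) the test text with the vectorizer already
# fitted on the training ingredients, so both matrices share one vocabulary
hs_vec = tv.transform(hs)
print(hs_vec.shape)  # the feature count now matches the training matrix
predictions = clf1.predict(hs_vec)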

tf-idf python-3.x

3 votes · 1 answer · 8,244 views

Tag statistics

python-3.x ×2

nltk ×1

pandas ×1

python ×1

scikit-learn ×1

tf-idf ×1

tokenize ×1

twitter ×1