Removing punctuation and digits and lowercasing does not work when using nltk.
My code
import string
import nltk
from nltk.tokenize import word_tokenize
stopwords = nltk.corpus.stopwords.words('english') + list(string.punctuation)
user_defined_stop_words = ['st', 'rd', 'hong', 'kong']
new_stop_words = stopwords + user_defined_stop_words
def preprocess(text):
    # Keep tokens that are neither stop words nor purely numeric.
    return [word for word in word_tokenize(text) if word.lower() not in new_stop_words and not word.isdigit()]
miss_data['Clean_addr'] = miss_data['Adj_Addr'].apply(preprocess)
Sample input
23FLOOR 9 DES VOEUX RD WEST HONG KONG
PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL
C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG
Expected output
floor des voeux west
pag consulting flat aia central connaught central
co city lost studios flat f hillier sheung
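
One possible reading of why the code above misses the expected output: `word.lower()` is applied only inside the filter, so the returned tokens keep their original case; `isdigit()` is true only for tokens made up entirely of digits, so tokens like `23FLOOR` and `4F` pass through; and punctuation inside a token such as `C/O` or `13-15` is likely never removed. The sketch below is one way to reach the expected output, not the only one. It lowercases first, strips every character that is not a letter or whitespace, then filters stop words. `STOP_WORDS` is a small hand-written stand-in (an assumption, so the sketch runs without the NLTK corpus); in practice it would be `new_stop_words` from the question.

```python
import re

# Illustrative stand-in for the question's new_stop_words (assumption):
# a few English stop words plus the user-defined ones.
STOP_WORDS = {"and", "the", "of", "st", "rd", "hong", "kong"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, drop non-letters, then remove stop words."""
    # Lowercase first so the output itself is lowercase, not just the filter.
    lowered = text.lower()
    # Remove digits and punctuation anywhere, including inside tokens.
    letters_only = re.sub(r"[^a-z\s]", "", lowered)
    # Split on whitespace and drop stop words.
    return " ".join(w for w in letters_only.split() if w not in stop_words)


samples = [
    "23FLOOR 9 DES VOEUX RD WEST HONG KONG",
    "PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL",
    "C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG",
]
for s in samples:
    print(preprocess(s))
```

With a DataFrame like the one in the question, this could be applied as `miss_data['Adj_Addr'].apply(preprocess)`. Note that the regular expression keeps only ASCII letters, which suits these addresses but would drop accented characters.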
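
I ran into the following problem while writing Python: I have a Pandas DataFrame containing words that need to be stemmed (using SnowballStemmer). I want to stem the words so that I can compare results on stemmed versus unstemmed text, for which I will use a classifier. I use the following code for the stemmer: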
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
# Use English stemmer.
stemmer = SnowballStemmer("english")
# Sentences to be stemmed.
data = ["programers program with programing languages", "my code is working so there must be a bug in the optimizer"]
# Create the Pandas dataFrame.
df = pd.DataFrame(data, columns = ['unstemmed'])
# Split the sentences to lists of words.
df['unstemmed'] = df['unstemmed'].str.split()
# Make sure we see the full column.
pd.set_option('display.max_colwidth', None)  # None means no limit; -1 is deprecated since pandas 1.0
# Print dataframe.
df
+----+--------------------------------------------------------------+
| | …

I am trying to run my function over the million rows in my dataset.
Code:
import string
from nltk.corpus import stopwords
def nlkt(val):
    val = repr(val)
    # Drop stop words, then strip punctuation and digits character by character.
    clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')]
    nopunc = [char for char in str(clean_txt) if char not in string.punctuation]
    nonum = [char for char in nopunc if not char.isdigit()]
    words_string = ''.join(nonum)
    return words_string
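Now I call the above function in a for loop to process a million records. Even though I am on a heavyweight server with a 24-core CPU and 88 GB of RAM, the loop takes far too long and does not use the computing power available there.
I call the above function like this: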
data = pd.read_excel(scrPath + "UserData_Full.xlsx", encoding='utf-8')
droplist = ['Submitter', 'Environment']
data.drop(droplist,axis=1,inplace=True)
#Merging the columns company and detailed description
data['Anylize_Text']= data['Company'].astype(str) + ' ' + data['Detailed_Description'].astype(str)
finallist =[]
for …
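
A hedged sketch of one likely speed-up for the function in this question. In the original, `stopwords.words('english')` is evaluated again for every word, and membership in a list is a linear scan; building a set once avoids both costs. Punctuation and digits can be deleted in a single pass with `str.translate`, and a plain for loop runs on one core, so a process pool is one way to spread the rows across cores. The names `clean_text`, `clean_many` and `STOP_WORDS` are hypothetical, and `STOP_WORDS` is a small stand-in (an assumption) for `frozenset(stopwords.words('english'))`. The sketch is not behavior-identical to `nlkt`: it strips punctuation before filtering stop words and does not go through `repr`.

```python
import string
from concurrent.futures import ProcessPoolExecutor

# Built once at import time. Illustrative stand-in (assumption); in practice:
# STOP_WORDS = frozenset(stopwords.words('english'))
STOP_WORDS = frozenset({"the", "is", "a", "an", "and", "of", "in", "to"})

# Translation table that deletes all punctuation and digits in one pass.
_DROP = str.maketrans("", "", string.punctuation + string.digits)


def clean_text(val, stop_words=STOP_WORDS):
    """Strip punctuation and digits, then remove stop words."""
    words = str(val).translate(_DROP).split()
    return " ".join(w for w in words if w.lower() not in stop_words)


def clean_many(values, workers=None, chunksize=10_000):
    """Clean an iterable of values in parallel, preserving order."""
    # workers=None lets the pool pick a default based on the CPU count.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean_text, values, chunksize=chunksize))


# Serial usage on a single value.
print(clean_text("The server is down, ticket 4521!"))
```

For the DataFrame in the question, something like `data['Anylize_Text'] = clean_many(data['Anylize_Text'].tolist())` could replace the for loop. When run as a script, that call should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.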