Punctuation and digit removal and lowercasing not working when using nltk
My code
import string
import nltk
from nltk.tokenize import word_tokenize

stopwords = nltk.corpus.stopwords.words('english') + list(string.punctuation)
user_defined_stop_words = ['st', 'rd', 'hong', 'kong']
new_stop_words = stopwords + user_defined_stop_words

def preprocess(text):
    return [word for word in word_tokenize(text) if word.lower() not in new_stop_words and not word.isdigit()]

miss_data['Clean_addr'] = miss_data['Adj_Addr'].apply(preprocess)
Sample input
23FLOOR 9 DES VOEUX RD WEST HONG KONG
PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL
C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG
Expected output
floor des voeux west
pag consulting flat aia central connaught central
co city lost studios flat f hillier sheung
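A likely reason the code above misses the expected output: `word.lower()` is used only for the stop-word comparison, so the returned tokens keep their original case; `word.isdigit()` drops only tokens made up entirely of digits, so `23FLOOR` and `4F` survive; and the punctuation entries in the stop list match only standalone punctuation tokens, not characters inside a token such as `C/O` or `13-15`. Below is a minimal sketch of one way to get the expected output, by lowercasing and deleting digits and punctuation before filtering. It swaps `word_tokenize` for a plain whitespace split, and `ENGLISH_STOP_WORDS_SAMPLE` is a small hypothetical stand-in for `nltk.corpus.stopwords.words('english')` so the example runs without the NLTK corpus.

```python
import string

# Hypothetical stand-in for nltk.corpus.stopwords.words('english');
# only a tiny sample so the sketch runs without downloading NLTK data.
ENGLISH_STOP_WORDS_SAMPLE = {"and", "the", "of", "at", "in", "on"}
USER_DEFINED_STOP_WORDS = {"st", "rd", "hong", "kong"}
STOP_WORDS = ENGLISH_STOP_WORDS_SAMPLE | USER_DEFINED_STOP_WORDS

# Translation table that deletes every ASCII digit and punctuation character.
STRIP_TABLE = str.maketrans("", "", string.digits + string.punctuation)

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip digits and punctuation, then drop stop words."""
    cleaned = text.lower().translate(STRIP_TABLE)
    # split() with no argument also discards the empty gaps left behind
    # by tokens that consisted only of digits or punctuation (e.g. "13-15").
    return " ".join(word for word in cleaned.split() if word not in stop_words)

samples = [
    "23FLOOR 9 DES VOEUX RD WEST HONG KONG",
    "PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL",
    "C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG",
]
for sample in samples:
    print(preprocess(sample))
```

With NLTK available, the stand-in set could be replaced by `set(nltk.corpus.stopwords.words('english')) | USER_DEFINED_STOP_WORDS`, and the function can be applied to the DataFrame column exactly as in the question, via `miss_data['Adj_Addr'].apply(preprocess)`. Note that this sketch returns a string per row rather than a list of tokens, to match the expected output shown.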