
NLTK-based text processing with pandas

When I use NLTK, lowercasing and the removal of punctuation and digits do not work.

My code

import string

import nltk
from nltk.tokenize import word_tokenize

stopwords = nltk.corpus.stopwords.words('english') + list(string.punctuation)
user_defined_stop_words = ['st', 'rd', 'hong', 'kong']
new_stop_words = stopwords + user_defined_stop_words

def preprocess(text):
    return [word for word in word_tokenize(text) if word.lower() not in new_stop_words and not word.isdigit()]

miss_data['Clean_addr'] = miss_data['Adj_Addr'].apply(preprocess)

Sample input

23FLOOR 9 DES VOEUX RD WEST     HONG KONG
PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL
C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG

Expected output

 floor des voeux west
 pag consulting flat aia central connaught central
 co city lost studios flat f hillier sheung
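
One likely reason the code above misses the expected output: `word.lower()` is used only for the stop-word comparison, so the returned tokens keep their original case. `word.isdigit()` and the punctuation entries only drop tokens that consist entirely of digits or of a single punctuation character, so tokens such as `23FLOOR`, `4F`, `C/O` and `13-15` pass through unchanged. The sketch below is one possible approach, not a definitive fix: it lowercases the text first, strips digits and punctuation at the character level, and only then filters stop words. `STOP_WORDS` is a small stand-in set assumed here for illustration; in practice it would presumably be replaced by the NLTK English list plus the user-defined words. Plain `str.split()` is swapped in for `word_tokenize` to keep the example self-contained.

```python
import re
import string

# Stand-in stop list for illustration only (an assumption, not NLTK's list).
# In the question this would be nltk.corpus.stopwords.words('english')
# combined with the user-defined words.
STOP_WORDS = {"and", "st", "rd", "hong", "kong"}

# Matches any run of digits or ASCII punctuation characters.
_DIGITS_AND_PUNCT = re.compile(r"[\d" + re.escape(string.punctuation) + r"]+")


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip digits/punctuation per character, drop stop words."""
    lowered = text.lower()
    stripped = _DIGITS_AND_PUNCT.sub("", lowered)
    # str.split() with no argument also collapses repeated whitespace.
    tokens = stripped.split()
    return " ".join(tok for tok in tokens if tok not in stop_words)


print(preprocess("23FLOOR 9 DES VOEUX RD WEST     HONG KONG"))
# prints: floor des voeux west
```

With the DataFrame from the question, the function would be applied the same way as before: `miss_data['Clean_addr'] = miss_data['Adj_Addr'].apply(preprocess)`. Note that this version returns a single space-joined string per row rather than a list of tokens, which matches the shape of the expected output.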

python string nltk dataframe pandas
