Related questions (0)

NLTK-based text processing with pandas

When using nltk, lowercasing and removing punctuation and digits don't work.

My code:

import string
import nltk
from nltk.tokenize import word_tokenize

stopwords = nltk.corpus.stopwords.words('english') + list(string.punctuation)
user_defined_stop_words = ['st', 'rd', 'hong', 'kong']
new_stop_words = stopwords + user_defined_stop_words

def preprocess(text):
    # Keep tokens that are not stop words, punctuation, or pure digits.
    return [word for word in word_tokenize(text) if word.lower() not in new_stop_words and not word.isdigit()]

miss_data['Clean_addr'] = miss_data['Adj_Addr'].apply(preprocess)

Sample input

23FLOOR 9 DES VOEUX RD WEST     HONG KONG
PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL
C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG

Expected output

 floor des voeux west
 pag consulting flat aia central connaught central
 co city lost studios flat f hillier sheung
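A minimal, self-contained sketch of a pipeline that produces the expected output above. It uses a tiny inline stop-word set as a stand-in for NLTK's English list plus the user-defined words (the real list is far longer), and a regex instead of `word_tokenize`, so it runs without corpus downloads:

```python
import re
import string

# Inline stand-in for nltk.corpus.stopwords.words('english') plus the
# user-defined words; the real combined list is much longer.
stop_words = {'st', 'rd', 'hong', 'kong'}

# Matches any digit or punctuation character.
drop_pattern = re.compile('[%s%s]' % (string.digits, re.escape(string.punctuation)))

def preprocess(text):
    # Lowercase, blank out digits/punctuation, then split and drop stop words.
    text = drop_pattern.sub(' ', text.lower())
    return [w for w in text.split() if w not in stop_words]

print(' '.join(preprocess("23FLOOR 9 DES VOEUX RD WEST     HONG KONG")))
# floor des voeux west
```

Blanking digits inside tokens (rather than dropping whole tokens with `isdigit()`) is what turns `23FLOOR` into `floor`, matching the expected output.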

python string nltk dataframe pandas

7
votes
1
answer
3386
views

Python stemming (with a pandas DataFrame)

I've run into the following problem in Python: I have a pandas DataFrame containing words that must be stemmed (using SnowballStemmer). I want to compare classification results on stemmed versus unstemmed text, for which I will use a classifier. I use the following code for stemming:

import pandas as pd
from nltk.stem.snowball import SnowballStemmer

# Use English stemmer.
stemmer = SnowballStemmer("english")

# Sentences to be stemmed.
data = ["programers program with programing languages", "my code is working so there must be a bug in the optimizer"] 

# Create the Pandas dataFrame.
df = pd.DataFrame(data, columns = ['unstemmed']) 

# Split the sentences to lists of words.
df['unstemmed'] = df['unstemmed'].str.split()

# Make sure we see the full column (None; -1 is deprecated in recent pandas).
pd.set_option('display.max_colwidth', None)

# Print dataframe.
df 

+----+--------------------------------------------------------------+
|    | …
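A sketch of one way to finish the task above: since each row of `unstemmed` is now a list of words, map the stemmer over each list with `apply`. The `stemmed` column name is an assumption for illustration:

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

data = ["programers program with programing languages",
        "my code is working so there must be a bug in the optimizer"]
df = pd.DataFrame(data, columns=['unstemmed'])

# Split the sentences into lists of words, as in the question.
df['unstemmed'] = df['unstemmed'].str.split()

# Stem each row's word list; the lambda maps the stemmer over one list.
df['stemmed'] = df['unstemmed'].apply(lambda words: [stemmer.stem(w) for w in words])
print(df['stemmed'][0])
```

Keeping both columns side by side makes it easy to feed either version to the classifier later.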

python nlp stemming pandas

5
votes
1
answer
10k
views

Why is my NLTK function slow when processing a DataFrame?

I'm trying to run my function over a dataset with a million rows.

  1. I read the data from a CSV into a DataFrame
  2. I use a drop list to remove columns I don't need
  3. I pass it through an NLTK function in a for loop

Code:

import string
from nltk.corpus import stopwords

def nlkt(val):
    val = repr(val)
    clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')]
    nopunc = [char for char in str(clean_txt) if char not in string.punctuation]
    nonum = [char for char in nopunc if not char.isdigit()]
    words_string = ''.join(nonum)
    return words_string

Now I call the function above in a for loop over a million records. Even though I'm on a heavyweight server with a 24-core CPU and 88 GB of RAM, the loop takes far too long and doesn't use the compute power available.

I call the function like this:

data = pd.read_excel(scrPath + "UserData_Full.xlsx", encoding='utf-8')
droplist = ['Submitter', 'Environment']
data.drop(droplist,axis=1,inplace=True)

#Merging the columns company and detailed description

data['Anylize_Text']= data['Company'].astype(str) + ' ' + data['Detailed_Description'].astype(str)

finallist =[]

for …
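The usual culprit in code like this is that `stopwords.words('english')` is called once per word, re-reading the corpus list each time, and membership tests run against a list. A sketch of the fix is to hoist the lookups out of the function and use sets for O(1) membership. The tiny inline stop-word set here is a stand-in so the example runs without corpus downloads; in practice you would use `set(stopwords.words('english'))`:

```python
import string

# Key fix: build the lookup structures ONCE, outside the hot path.
# Stand-in set for demo purposes; in practice:
#   stop_words = set(nltk.corpus.stopwords.words('english'))
stop_words = {'is', 'a', 'in', 'the', 'so', 'there', 'be', 'must', 'my'}
drop_chars = set(string.punctuation) | set(string.digits)

def nlkt_fast(val):
    # Same cleaning as nlkt(), but with O(1) set lookups per word/char.
    words = [w for w in str(val).split() if w.lower() not in stop_words]
    return ''.join(ch for ch in ' '.join(words) if ch not in drop_chars)

print(nlkt_fast("my code is working so there must be a bug in the optimizer"))
# code working bug optimizer
```

With the function cheap per call, `data['Anylize_Text'].apply(nlkt_fast)` can replace the explicit for loop and build the result column in one pass.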

python optimization nltk

0
votes
1
answer
1922
views

Tag statistics

python ×3

nltk ×2

pandas ×2

dataframe ×1

nlp ×1

optimization ×1

stemming ×1

string ×1