How to get unique words from a DataFrame column of strings?

Question

Pan*_*.V5 2 python numpy bayesian-networks dataframe pandas

I'm looking for a way to get a list of the unique words in a column of strings in a DataFrame.

import pandas as pd
import numpy as np

df = pd.read_csv(
    'FinalStemmedSentimentAnalysisDataset.csv',
    sep=';',
    dtype={'tweetId': int, 'tweetText': str, 'tweetDate': str, 'sentimentLabel': int},
)

tweets = {}
tweets[0] = df[df['sentimentLabel'] == 0]
tweets[1] = df[df['sentimentLabel'] == 1]

The dataset I'm using comes from this link: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

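I have this column of variable-length strings, and I want a list of every unique word in the column together with its count. How can I get that? I'm using Pandas in Python, and the original dataset has more than 1 million rows, so I also need an approach efficient enough to process it quickly without making the code run for too long.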

An example of the column could be:

is so sad for my apl friend.
omg this is terrible.
what is this new song?

And the list could be something like:

[is,so,sad,for,my,apl,friend,omg,this,terrible,what,new,song]

Answer 1

fur*_*ras 5

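If the column holds strings, you have to split each sentence into a list of words and then put all of the lists into one list. You can use sum() for that, and it should give you all of the words. To get the unique words, you can convert the result to a set(), and later convert it back to a list().

But first you have to clean the sentences by removing characters such as ., ? and so on. I use a regex to keep only some characters and spaces. Finally, you have to convert all words to lowercase or uppercase.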

import pandas as pd

df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
        'what is this new song?',
    ]
})

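# regex=True is required in pandas 2.0+, where str.replace() treats the pattern as a literal string by default
unique = set(df['sentences'].str.replace('[^a-zA-Z ]', '', regex=True).str.lower().str.split(' ').sum())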

print(list(sorted(unique)))

Result

['apl', 'for', 'friend', 'is', 'my', 'new', 'omg', 'sad', 'so', 'song', 'terrible', 'this', 'what']

EDIT: as @HenryYik mentioned in a comment, findall('\w+') can be used instead of split(), and it also makes replace() unnecessary.

unique = set(df['sentences'].str.lower().str.findall(r"\w+").sum())
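
The two cleaning approaches are not quite interchangeable, which may matter when choosing between them. The sketch below uses a made-up two-row Series (the second sentence is invented for illustration) to show that \w+ keeps digits and underscores, and that split(' ') can leave an empty string behind when a removed character sat between two spaces.

```python
import pandas as pd

# Hypothetical sample data; the second row is invented to expose the differences.
s = pd.Series(['omg this is terrible.', 'room_42 opens at 10pm - ok?'])

# Approach 1: drop everything except ASCII letters and spaces, then split on single spaces.
cleaned = s.str.replace('[^a-zA-Z ]', '', regex=True).str.lower().str.split(' ')
letters_only = {word for row in cleaned for word in row}

# Approach 2: collect runs of word characters (letters, digits, underscore).
tokens = s.str.lower().str.findall(r'\w+')
word_chars = {word for row in tokens for word in row}

print(sorted(letters_only))
print(sorted(word_chars))
```

Here letters_only contains 'room', 'pm' and an empty string, while word_chars contains 'room_42' and '10pm' instead. Calling split() without an argument would avoid the empty string in the first approach.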

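EDIT: I tested it with the data from

http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

Everything runs fast except column.sum() or sum(column): I measured the time for 1000 rows and extrapolated to 1 500 000 rows, which would take 35 minutes.

Using itertools.chain() is much faster: it takes about 8 seconds.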

import itertools

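words = df['sentences'].str.lower().str.findall(r"\w+")
# from_iterable() flattens the per-row lists; chain(words) alone would yield the lists themselves
words = list(itertools.chain.from_iterable(words))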
unique = set(words)
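
The gap between the two timings is presumably down to how sum() works on lists: every addition builds a new, longer list, so the total work grows roughly quadratically with the number of rows, whereas chaining walks the data once. A minimal sketch on synthetic data (the rows list is a hypothetical stand-in for the tokenized column, not the real dataset):

```python
import itertools
import time

# Hypothetical stand-in for the tokenized column: 5000 short word lists.
rows = [['omg', 'this', 'is', 'terrible']] * 5000

start = time.perf_counter()
flat_sum = sum(rows, [])  # copies the accumulated list on every addition
sum_seconds = time.perf_counter() - start

start = time.perf_counter()
flat_chain = list(itertools.chain.from_iterable(rows))  # single pass over all rows
chain_seconds = time.perf_counter() - start

# Both approaches produce the same flat list of words.
print(flat_sum == flat_chain)
print(f'sum: {sum_seconds:.4f} s, chain: {chain_seconds:.4f} s')
```

The absolute numbers depend on the machine, but the sum() time should climb much faster than the chain time as the row count increases.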

But the words can also be put directly into a set().

words = df['sentences'].str.lower().str.findall(r"\w+")

unique = set()

for x in words:
    unique.update(x)

This takes about 5 seconds.
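
The question also asks for a count per word, which a set() cannot hold. One possible variation, sketched here on the three sample sentences rather than the full dataset, is to swap the set for a collections.Counter, whose update() method adds one per occurrence. This assumes the column has no missing values; rows with NaN would need to be dropped first.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
        'what is this new song?',
    ]
})

words = df['sentences'].str.lower().str.findall(r"\w+")

counts = Counter()
for x in words:
    counts.update(x)  # increments the count of every word in this row

unique = list(counts)         # the keys are the unique words
print(counts.most_common(2))  # [('is', 3), ('this', 2)]
```

The loop has the same shape as the set() version, so it should scale in a similar way, although it was not timed on the full dataset.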

Full code:

import pandas as pd
import time 

print(time.strftime('%H:%M:%S'), 'start')

print('-----')
#------------------------------------------------------------------------------

start = time.time()

# `read_csv()` can also read directly from a URL and from a zip-compressed file
#url = 'http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip'
url = 'SentimentAnalysisDataset.csv'

# need to skip two rows which are incorrect
df = pd.read_csv(url, sep=',', dtype={'ItemID':int, 'Sentiment':int, 'SentimentSource':str, 'SentimentText':str}, skiprows=[8835, 535881])

end = time.time()
print(time.strftime('%H:%M:%S'), 'load:', end-start, 's')

print('-----')
#------------------------------------------------------------------------------

start = end

words = df['SentimentText'].str.lower().str.findall(r"\w+")
#df['words'] = words

end = time.time()
print(time.strftime('%H:%M:%S'), 'words:', end-start, 's')

print('-----')
#------------------------------------------------------------------------------

start = end

unique = set()
for x in words:
    unique.update(x)

end = time.time()
print(time.strftime('%H:%M:%S'), 'set:', end-start, 's')

print('-----')
#------------------------------------------------------------------------------

print(list(sorted(unique))[:10])

Result

00:27:04 start
-----
00:27:08 load: 4.10780930519104 s
-----
00:27:23 words: 14.803470849990845 s
-----
00:27:27 set: 4.338541269302368 s
-----
['0', '00', '000', '0000', '00000', '000000000000', '0000001', '000001', '000014', '00004873337e0033fea60']
