Counting the frequency of words in a pandas DataFrame

J R*_*za 16 python nltk pandas

I have a table that looks like this:

      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd

I want to count the frequency of all the words in the Firm_Name column and get output like this:

[image: expected output, a table listing each word and its frequency]

I tried the following code:

import pandas as pd
import nltk

data = pd.read_csv(r"X:\Firm_Data.csv")  # raw string so the backslash is not treated as an escape
top_N = 20
word_dist = nltk.FreqDist(data['Firm_Name'])
print('All frequencies')
print('=' * 60)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])

print(rslt)
print('=' * 60)

However, this code does not count individual words; it treats each full firm name as a single item.
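
For context, nltk.FreqDist counts whichever elements it is handed, and iterating over the Series yields whole firm names rather than words, so every count comes out as 1. A minimal sketch of the behaviour (the two-row frame is made up for illustration):

import pandas as pd
import nltk

data = pd.DataFrame({'Firm_Name': ['R.X. Yah & Co', 'Big Building Society']})

# Each element of the Series is treated as one token, so every name gets count 1
print(nltk.FreqDist(data['Firm_Name']).most_common())
# [('R.X. Yah & Co', 1), ('Big Building Society', 1)]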

Zer*_*ero 39

IIUIC, use value_counts():

In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
Out[3361]:
Society       3
Ltd           2
James's       1
R.X.          1
Yah           1
Associates    1
St            1
Kensington    1
MMV           1
Big           1
&             1
The           1
Co            1
Oil           1
Building      1
dtype: int64
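To see what that chain is doing, here is an illustrative breakdown of the intermediate objects, using the same df:

# str.split(expand=True) returns a DataFrame: one word per column,
# shorter rows padded with None
wide = df.Firm_Name.str.split(expand=True)

# stack() drops the None padding and flattens everything into one Series,
# so value_counts() then counts individual words
words = wide.stack()
print(words.value_counts())   # same result as the chained one-liner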

Or,

import numpy as np  # needed for np.concatenate

pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()

Or,

pd.Series(' '.join(df.Firm_Name).split()).value_counts()
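One caveat worth noting: ' '.join(df.Firm_Name) raises a TypeError if the column contains NaN, since NaN is a float. Dropping missing values first avoids that (a defensive sketch, not part of the original answer):

pd.Series(' '.join(df.Firm_Name.dropna()).split()).value_counts()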

For the top N, say 3:

In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
Out[3379]:
Society    3
Ltd        2
James's    1
dtype: int64

Details

In [3380]: df
Out[3380]:
      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd

  • I timed the three methods here, plus the fourth option (`.explode()`) from @WilliamGerecke's comment. My df had 500k rows with 0 to 400 words per row. Rounded to the nearest second: `df.Firm_Name.str.split(expand=True).stack().value_counts()` took 77 s; `pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()` took 24 s; `pd.Series(' '.join(df.Firm_Name).split()).value_counts()` took 4 s; `df.Firm_Name.str.split().explode().value_counts()` took 6 s. (5 upvotes)
  • As noted above, `.str.split(expand=True).stack().value_counts()` uses a lot of extra memory when the strings vary in length. Try `.str.split().explode().value_counts()` instead. It does exactly the same thing without allocating the extra memory. (2 upvotes)
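
For completeness, the .explode() variant from the comments as a self-contained sketch (Series.explode requires pandas 0.25 or later):

import pandas as pd

df = pd.DataFrame({'Firm_Name': ['R.X. Yah & Co', 'Big Building Society',
                                 "St James's Society"]})

# str.split() gives a Series of word lists; explode() gives one row per word
print(df.Firm_Name.str.split().explode().value_counts())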

jez*_*ael 5

You need str.cat to concatenate all the values into one string (with lower first, if you want case-insensitive counts), then word_tokenize, and finally your original solution:

top_N = 4
# lowercase first, if case-insensitive counts are wanted
a = data['Firm_Name'].str.lower().str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(a)
word_dist = nltk.FreqDist(words)
print(word_dist)
<FreqDist with 17 samples and 20 outcomes>

rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
      Word  Frequency
0  society          3
1      ltd          2
2      the          1
3       co          1
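Wrapped into a reusable helper (the name top_words is my own, and this assumes NLTK's punkt tokenizer data has been downloaded, e.g. via nltk.download('punkt')):

import nltk
import pandas as pd

def top_words(s, n=20, lowercase=True):
    # Concatenate the column into one string, tokenize, and count.
    # word_tokenize assumes the 'punkt' tokenizer data is installed.
    text = (s.str.lower() if lowercase else s).str.cat(sep=' ')
    words = nltk.tokenize.word_tokenize(text)
    return pd.DataFrame(nltk.FreqDist(words).most_common(n),
                        columns=['Word', 'Frequency'])

print(top_words(data['Firm_Name'], n=4))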

The lower can also be dropped if you want to keep the original case:

top_N = 4
a = data['Firm_Name'].str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(a)
word_dist = nltk.FreqDist(words)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
         Word  Frequency
0     Society          3
1         Ltd          2
2         MMV          1
3  Kensington          1