Sam*_*mie 2 python text dataframe pandas
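I have a DataFrame with the following columns.
I am trying to count the number of words in df['Lyrics'] and return the result as a new column called df['wordcount'], and also to count the number of unique words in df['Lyrics'] and return that as a new column called df['uniquewordcount'].
I have been able to get df['wordcount'] by counting every string in df['Lyrics'] minus the whitespace.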
totalscore = df.Lyrics.str.count(r'\S+')  # count every word (run of non-whitespace) in a track
df['wordcount'] = totalscore
df
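As a quick check on the pattern, here is a minimal sketch. The two sample lyrics are copied from the expected output in this question, and the variable names are made up for illustration. It contrasts counting single non-whitespace characters with counting runs of them, which is what a word count needs.

```python
import pandas as pd

# Sample lyrics taken from the question's expected output (illustrative only).
lyrics = pd.Series([
    "Ball is life and Ball is key",
    "Pass me the hookah Pass me the",
])

# '[^\s]' matches one non-whitespace character at a time, so this counts characters.
char_counts = lyrics.str.count(r"[^\s]")

# '\S+' matches a whole run of non-whitespace characters, so this counts words.
word_counts = lyrics.str.count(r"\S+")

print(char_counts.tolist())
print(word_counts.tolist())
```

Only the second pattern yields 7 for each of these tracks.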
I have also been able to count the unique words in df['Lyrics']:
from collections import Counter
results = Counter()  # a single Counter shared by every row
count_unique = df.Lyrics.str.lower().str.split().apply(results.update)
unique_counts = sum(results.values())
df['uniquewordcount'] = unique_counts  # the same scalar is written to every row
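This gives me the number of unique words across all of df['Lyrics'], which is what the code is written to do, but I want the number of unique words in each track's lyrics. My Python is not very good at the moment, so the solution may be obvious to everyone else, but it is not to me. I am hoping someone can point me in the right direction on how to get the unique word count for each track.
Expected output: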
Album  Tracks  Lyrics                          wordcount  uniquewordcount
A      Ball    Ball is life and Ball is key    7          5
       Pass    Pass me the hookah Pass me the  7          4
What I got:
Album  Tracks  Lyrics                          wordcount  uniquewordcount
A      Ball    Ball is life and Ball is key    7          9
       Pass    Pass me the hookah Pass me the  7          9
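A likely reason every row shows the same number: a single Counter is updated with the words of every track, so anything computed from it describes the whole column, and assigning that one value to a column repeats it on each row. The sketch below assumes only the two sample lyrics shown in the tables above, and contrasts the shared counter with one counter per track. The names `shared`, `distinct_overall` and `words_per_track` are illustrative.

```python
from collections import Counter

import pandas as pd

# Sample lyrics copied from the tables above (illustrative data only).
df = pd.DataFrame({
    "Lyrics": [
        "Ball is life and Ball is key",
        "Pass me the hookah Pass me the",
    ]
})

words_per_track = df["Lyrics"].str.lower().str.split()

# One Counter shared by all rows: it accumulates words from every track.
shared = Counter()
for words in words_per_track:
    shared.update(words)
distinct_overall = len(shared)  # distinct words across the whole column

# One Counter per row: each track is counted on its own.
df["uniquewordcount"] = words_per_track.apply(lambda words: len(Counter(words)))

print(distinct_overall)
print(df)
```

With these two tracks, `distinct_overall` is 9, the value repeated in the output above, while the per-track column holds 5 and 4.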
Here is an alternative solution:
import pandas as pd
df = pd.DataFrame({'Lyrics': ['This is some life some collection of words',
                              'Lyrics abound lyrics here there eveywhere',
                              'Come fly come fly away']})
# Lowercase each lyric and split it into a list of words (one list per row)
lyrics = df['Lyrics'].str.lower().str.split()
# Number of unique words per row
df['LyricsCounter'] = lyrics.apply(set).apply(len)
# Total number of words per row
df['LyricsWords'] = lyrics.apply(len)
print(df)
Returns:
                                       Lyrics  LyricsCounter  LyricsWords
0  This is some life some collection of words              7            8
1   Lyrics abound lyrics here there eveywhere              5            6
2                      Come fly come fly away              3            5
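The same idea can be applied to the column names used in the question. The following is a sketch, not a definitive implementation: the helper `add_word_counts` and its `text_col` parameter are hypothetical names, the sample rows are the two tracks from the question, and it assumes every row contains a string (missing lyrics would need handling first).

```python
import pandas as pd


def add_word_counts(frame, text_col="Lyrics"):
    """Return a copy of frame with per-row total and unique word counts."""
    out = frame.copy()
    # Lowercase so "Ball" and "ball" count as one word, then split on whitespace.
    tokens = out[text_col].str.lower().str.split()
    out["wordcount"] = tokens.map(len)
    # A set keeps each word once, so its length is the per-row unique count.
    out["uniquewordcount"] = tokens.map(lambda words: len(set(words)))
    return out


# Two tracks taken from the question (illustrative data only).
df = pd.DataFrame({
    "Tracks": ["Ball", "Pass"],
    "Lyrics": [
        "Ball is life and Ball is key",
        "Pass me the hookah Pass me the",
    ],
})

result = add_word_counts(df)
print(result)
```

Because the set is built per row inside `map`, nothing is shared between tracks, which is the difference from the single Counter in the question. Punctuation is not stripped here, so "key" and "key," would count as different words.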