Python: extract keywords from a CSV row by row

Question

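I am trying to extract keywords from a CSV file row by row and store them in a new keyword field. Right now I can only extract keywords from the whole text at once. How can I get the keywords for each row/field?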

Data:

id,some_text
1,"What is the meaning of the word Himalaya?"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward"

Code: this searches the entire text instead of going row by row. Also, do I need to put anything else in replace(r'\|', ' ')?

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

df = pd.read_csv('test-data.csv')
# print(df.head(5))

text_context = df['some_text'].str.lower().str.replace(r'\|', ' ').str.cat(sep=' ') # not put lower case?
print(text_context)
print('')
tokens=nltk.tokenize.word_tokenize(text_context)
word_dist = nltk.FreqDist(tokens)
stop_words = stopwords.words('english')
punctuations = ['(',')',';',':','[',']',',','!','?']
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]
print(keywords)

Desired final output:

id,some_text,new_keyword_field
1,What is the meaning of the word Himalaya?,"meaning,word,himalaya"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward","palindrome,word,phrase,sequence,reads,backward,forward"

Answer 1

L.P*_*ley 7

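Here is a clean way to add a new keywords column to your dataframe using Pandas apply. Apply works by first defining a function (get_keywords in our case) that we can then apply to each row or each column.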

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

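# I define stop_words once here so it is not rebuilt on every call to the function below.
# A set keeps the membership test in get_keywords fast.
# This needs the NLTK stopwords corpus and tokenizer data (see nltk.download()).
stop_words = set(stopwords.words('english'))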
# I've added the index_col='id' here to set your 'id' column as the index. This assumes that the 'id' is unique.
df = pd.read_csv('test-data.csv', index_col='id')

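Here we define the function that will be applied to each row with df.apply in the next cell. You can see that this function, get_keywords, takes a row as its argument and returns a string of comma-separated keywords, just like in your desired output above ("meaning,word,himalaya"). Inside the function we lowercase the text, tokenize it, filter out punctuation with isalpha(), filter out our stop words, and join the remaining keywords together to form the desired output.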

# This function will be applied to each row in our Pandas Dataframe
# See the docs for df.apply at: 
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
def get_keywords(row):
    some_text = row['some_text']
    lowered = some_text.lower()
    tokens = nltk.tokenize.word_tokenize(lowered)
    keywords = [keyword for keyword in tokens if keyword.isalpha() and keyword not in stop_words]
    keywords_string = ','.join(keywords)
    return keywords_string
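
If NLTK or its data files are not available, the same per-row steps can be sketched with only the standard library. This is a rough stand-in, not the function above: extract_keywords, STOP_WORDS, and SAMPLE_CSV are hypothetical names, the stop-word set is a hand-picked subset rather than NLTK's English list, and a regular expression replaces word_tokenize plus isalpha(), so results may differ on contractions and non-ASCII words.

```python
import csv
import io
import re

# Hypothetical, abbreviated stop-word set; the answer uses NLTK's full English list.
STOP_WORDS = {"what", "is", "the", "of", "a", "or", "that", "same", "as"}


def extract_keywords(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize, drop stop words, and join the rest with commas."""
    # Regex stand-in for word_tokenize + isalpha(): keep runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return ",".join(token for token in tokens if token not in stop_words)


# The question's sample data, held in memory instead of test-data.csv.
SAMPLE_CSV = '''id,some_text
1,"What is the meaning of the word Himalaya?"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward"
'''

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    # Each row is processed on its own, so keywords never mix across rows.
    row["new_keyword_field"] = extract_keywords(row["some_text"])
    print(row["id"], row["new_keyword_field"])
```

The point of the sketch is only that the keyword function takes one piece of text at a time; whether it is called from a csv loop or from df.apply does not change that.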

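Now that we have defined the function to apply, we call df.apply(get_keywords, axis=1). This returns a Pandas Series (similar to a list). Because we want this Series to become part of our dataframe, we assign it to a new column with df['keywords'] = df.apply(get_keywords, axis=1)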

# applying the get_keywords function to our dataframe and saving the results
# as a new column in our dataframe called 'keywords'
# axis=1 means that we will apply get_keywords to each row and not each column
df['keywords'] = df.apply(get_keywords, axis=1)
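
To see what the axis argument changes, here is a small sketch on a made-up dataframe. The column named other and both lambdas are illustrative only and are not part of the question's data. With axis=1 the function receives one row at a time; with the default axis=0 it receives one column at a time.

```python
import pandas as pd

# Made-up data for illustration; 'other' is a hypothetical second column.
df = pd.DataFrame(
    {"some_text": ["first row", "second row"], "other": ["a", "b"]},
    index=pd.Index([1, 2], name="id"),
)

# axis=1: the function receives one row at a time (a Series indexed by column name).
per_row = df.apply(lambda row: len(row["some_text"]), axis=1)

# axis=0 (the default): the function receives one column at a time.
per_column = df.apply(lambda column: column.str.len().max(), axis=0)

print(per_row.tolist())      # one value per row
print(per_column.to_dict())  # one value per column
```

This is why get_keywords can index row['some_text']: with axis=1, each call sees a single row whose labels are the column names.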

Output: the dataframe after adding the 'keywords' column
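
To get from this dataframe to the CSV layout asked for in the question, one option is to rename the column and call to_csv. The sketch below assumes the keywords column has already been filled in: its values are copied from the desired output rather than computed, and SAMPLE_CSV is a hypothetical in-memory stand-in for test-data.csv.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the question's test-data.csv.
SAMPLE_CSV = '''id,some_text
1,"What is the meaning of the word Himalaya?"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward"
'''

df = pd.read_csv(io.StringIO(SAMPLE_CSV), index_col="id")

# Stand-in for df.apply(get_keywords, axis=1); values copied from the desired output.
df["keywords"] = [
    "meaning,word,himalaya",
    "palindrome,word,phrase,sequence,reads,backward,forward",
]

# Rename to the column name used in the question, then serialize.
# With no path given, to_csv returns the CSV text as a string.
csv_text = df.rename(columns={"keywords": "new_keyword_field"}).to_csv()
print(csv_text)
```

Passing a file name to to_csv instead would write the same text to disk. By default pandas only quotes fields that need it, such as those containing commas.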

Archived:	7 years, 7 months ago
Views:	2002
Last activity:	7 years, 7 months ago