如何有效地将pos_tag_sents()应用于pandas数据帧

mob*_*cdi 11 python nltk pos-tagger python-3.x pandas

如果您希望POS标记存储在pandas数据帧中的文本列,每行1个句子,则SO上的大多数实现都使用apply方法

dfData['POSTags']= dfData['SourceText'].apply(
                 lamda row: [pos_tag(word_tokenize(row) for item in row])
Run Code Online (Sandbox Code Playgroud)

NLTK文档建议使用pos_tag_sents()来有效标记多个句子.

这是否适用于这个例子中,如果是将代码那样改变简单pso_tagpos_tag_sents或不NLTK意味着段落的文本来源

正如评论中所提到的那样,pos_tag_sents()目的是每次都减少负载的负载,但问题是如何做到这一点并仍然在pandas数据帧中产生一个列?

链接到示例数据集20kRows

alv*_*vas 10

输入

$ cat test.csv 
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat
Run Code Online (Sandbox Code Playgroud)

TL; DR

>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0    cozily married practical athletics Mr. Brown flat
1       active married expensive soccer Mr. Chang flat
2    healthy single expensive badminton Mrs. Green ...
3    cozily married practical soccer Mr. Brown hier...
4     cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]

>>> df['POS'] = tagged_texts
>>> df
   ID                 Task        label  \
0   1  Collect Information  no response   
1   2           New Credit  no response   
2   3  Collect Information     response   
3   4  Collect Information     response   
4   5  Collect Information     response   

                                                Text  \
0  cozily married practical athletics Mr. Brown flat   
1     active married expensive soccer Mr. Chang flat   
2  healthy single expensive badminton Mrs. Green ...   
3  cozily married practical soccer Mr. Brown hier...   
4   cozily single practical badminton Mr. Brown flat   

                                                 POS  
0  [(cozily, RB), (married, JJ), (practical, JJ),...  
1  [(active, JJ), (married, VBD), (expensive, JJ)...  
2  [(healthy, JJ), (single, JJ), (expensive, JJ),...  
3  [(cozily, RB), (married, JJ), (practical, JJ),...  
4  [(cozily, RB), (single, JJ), (practical, JJ), ... 
Run Code Online (Sandbox Code Playgroud)

在龙:

首先,您可以将Text列提取到字符串列表:

texts = df['Text'].tolist()
Run Code Online (Sandbox Code Playgroud)

然后你可以应用这个word_tokenize功能:

map(word_tokenize, texts)
Run Code Online (Sandbox Code Playgroud)

请注意,@ Boud的建议几乎相同,使用df.apply:

df['Text'].apply(word_tokenize)
Run Code Online (Sandbox Code Playgroud)

然后将标记化的文本转储到字符串列表的列表中:

df['Text'].apply(word_tokenize).tolist()
Run Code Online (Sandbox Code Playgroud)

然后你可以使用pos_tag_sents:

pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )
Run Code Online (Sandbox Code Playgroud)

然后将列添加回DataFrame:

df['POS'] = pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )
Run Code Online (Sandbox Code Playgroud)

  • 您的“ TL; DR”比“ In Long”版本长:) (3认同)