Python stemming (with a pandas DataFrame)

Chi*_*iel 5 python nlp stemming pandas

I ran into the following problem while coding in Python: I have a Pandas DataFrame containing words that have to be stemmed (using the SnowballStemmer). I want to use these words to compare the results of a classifier on stemmed versus unstemmed text. I use the following code for the stemmer:

import pandas as pd
from nltk.stem.snowball import SnowballStemmer

# Use English stemmer.
stemmer = SnowballStemmer("english")

# Sentences to be stemmed.
data = ["programers program with programing languages", "my code is working so there must be a bug in the optimizer"] 

# Create the Pandas dataFrame.
df = pd.DataFrame(data, columns = ['unstemmed']) 

# Split the sentences to lists of words.
df['unstemmed'] = df['unstemmed'].str.split()

# Make sure we see the full column.
pd.set_option('display.max_colwidth', None)

# Print dataframe.
df 

+----+--------------------------------------------------------------+
|    | unstemmed                                                    |
|----+--------------------------------------------------------------|
|  0 | ['programers', 'program', 'with', 'programing', 'languages'] |
|  1 | ['my', 'code', 'is', 'working', 'so', 'there', 'must',       |   
|    |  'be', 'a', 'bug', 'in', 'the', 'interpreter']               |
+----+--------------------------------------------------------------+
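
For reference, here is a quick sketch of what the English Snowball stemmer returns for single tokens from these sentences, calling stem() on one word at a time:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Single-token behaviour on words taken from the sample sentences.
print(stemmer.stem("programing"))  # program
print(stemmer.stem("languages"))   # languag
print(stemmer.stem("working"))     # work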

I want to stem all the individual words in these lists while preserving their order, so that every row keeps every one of its (now stemmed) words. This concerns a single column of the Pandas DataFrame, and I want each individual word in it to be stemmed, as shown here:

[image: contents of the pandas DataFrame]

I had already come up with an approach along those lines myself and tried it on my data.


However, after running it, not every individual word was stemmed: in row 7, for example, you can still see the word "amsterdamse" where it should actually be "amsterdam":

[image: the data after running that code]

The data is split up into lists of words exactly as in the code shown above.
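
Since "amsterdamse" is a Dutch word, one thing worth checking is that the Snowball stemmer matches the language of the data; the sketch below (assuming the real data is Dutch, which the simplified English example above is not) only illustrates how much the language choice matters:

from nltk.stem.snowball import SnowballStemmer

# The English rules cannot be expected to handle a Dutch adjective,
# while the Dutch Snowball stemmer is a separate, language-specific stemmer.
english = SnowballStemmer("english")
dutch = SnowballStemmer("dutch")

print(english.stem("amsterdamse"))  # not reduced to "amsterdam"
print(dutch.stem("amsterdamse"))    # "amsterdam"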


art*_*hur 9

You have to apply the stemmer to each word and store the result in a "stemmed" column.

EDIT

For example:

df['stemmed'] = df['unstemmed'].apply(lambda x: [stemmer.stem(y) for y in x]) # Stem every word.
df = df.drop(columns=['unstemmed']) # Get rid of the unstemmed column.
df # Print dataframe.

+----+--------------------------------------------------------------+
|    | stemmed                                                      |
|----+--------------------------------------------------------------|
|  0 | ['program', 'program', 'with', 'program', 'languag']         |
|  1 | ['my', 'code', 'is', 'work', 'so', 'there', 'must',          |   
|    |  'be', 'a', 'bug', 'in', 'the', 'interpret']                 |
+----+--------------------------------------------------------------+

This should then work on your data.

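Since the end goal is to feed the texts to a classifier, a small follow-up sketch may help (assumptions: both versions are kept and joined back into plain strings for a bag-of-words style vectorizer; the column names stemmed_text and unstemmed_text are made up here for illustration):

import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
data = ["programers program with programing languages",
        "my code is working so there must be a bug in the interpreter"]
df = pd.DataFrame(data, columns=['unstemmed'])
df['unstemmed'] = df['unstemmed'].str.split()

# Stem every word, but keep the original tokens as well.
df['stemmed'] = df['unstemmed'].apply(lambda words: [stemmer.stem(w) for w in words])

# Join the token lists back into strings so both versions can go into a vectorizer.
df['stemmed_text'] = df['stemmed'].str.join(' ')      # e.g. "program program with program languag"
df['unstemmed_text'] = df['unstemmed'].str.join(' ')  # the original sentence, rebuilt from its tokens

This keeps the stemmed and unstemmed variants side by side, so both can be run through the same classification pipeline and compared directly.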