I'm using NLTK together with scikit-learn's CountVectorizer to stem and tokenize words.
Here is a simple example of using CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer().fit(vocab)
sentence1 = vec.transform(['The swimmer likes swimming.'])
sentence2 = vec.transform(['The swimmer swims.'])
print('Vocabulary: %s' %vec.get_feature_names())
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())
which prints:
Vocabulary: ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
Sentence 1: [[0 1 0 1 1 0 1]]
Sentence 2: [[0 0 0 1 0 1 1]]
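Building on the example above, CountVectorizer accepts a custom `tokenizer` callable, so NLTK's PorterStemmer can be plugged in directly. A sketch (using a plain whitespace split plus punctuation stripping instead of `word_tokenize`, so no NLTK data download is needed):

```python
import string
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_tokenize(doc):
    # Split on whitespace, strip surrounding punctuation, stem each token.
    tokens = (t.strip(string.punctuation) for t in doc.split())
    return [stemmer.stem(t) for t in tokens if t]

vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer(tokenizer=stem_tokenize).fit(vocab)
print(sorted(vec.vocabulary_))
# ['he', 'like', 'so', 'swim', 'swimmer', 'the']
```

Note that 'swimming' and 'swims' now collapse into the single feature 'swim', which is the point of stemming here.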
Now, say I want to remove stop words and stem the words. One option would be:
from nltk import word_tokenize
from nltk.stem.porter …

If you want to POS tag a column of text stored in a pandas dataframe, one sentence per row, most implementations on SO use the apply method:
dfData['POSTags'] = dfData['SourceText'].apply(
    lambda row: pos_tag(word_tokenize(row)))
The NLTK documentation recommends pos_tag_sents() for tagging multiple sentences efficiently.
Does that apply to this example, and if so, is the change as simple as swapping pos_tag for pos_tag_sents, or does NLTK mean paragraph-length text sources?
As mentioned in the comments, pos_tag_sents() aims to avoid reloading the tagger for every row, but the question is: how do you do that and still produce a column in a pandas dataframe?
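One way to reconcile the two: tokenize the whole column first, make a single batched tagging call, and assign the resulting list back as a new column. The sketch below uses a hypothetical toy tagger in place of nltk.pos_tag_sents (which would need the tagger model downloaded); the pandas pattern is the point:

```python
import pandas as pd

def tag_sents(sentences):
    # Stand-in for nltk.pos_tag_sents: tags a whole BATCH of tokenized
    # sentences in one call, so the model would be loaded only once.
    return [[(tok, 'NN') for tok in sent] for sent in sentences]

dfData = pd.DataFrame({'SourceText': ['NLTK tags text', 'pandas stores columns']})
tokenized = dfData['SourceText'].str.split().tolist()  # word_tokenize in the real pipeline
dfData['POSTags'] = tag_sents(tokenized)  # one batched call, one new column
```

Compared with the per-row `apply(lambda row: pos_tag(word_tokenize(row)))`, this calls the (stand-in) tagger once over all rows rather than once per row.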
Lowercasing and removing punctuation and digits don't work when using nltk.
My code:
import string
import nltk
from nltk.tokenize import word_tokenize

stopwords = nltk.corpus.stopwords.words('english') + list(string.punctuation)
user_defined_stop_words=['st','rd','hong','kong']
new_stop_words=stopwords+user_defined_stop_words
def preprocess(text):
    return [word for word in word_tokenize(text)
            if word.lower() not in new_stop_words and not word.isdigit()]
miss_data['Clean_addr'] = miss_data['Adj_Addr'].apply(preprocess)
Sample input:
23FLOOR 9 DES VOEUX RD WEST HONG KONG
PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL
C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG
Expected output:
floor des voeux west
pag consulting flat aia central connaught central
co city lost studios flat f hillier sheung
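A sketch of a preprocess that reproduces the expected output: lowercase the whole string first, and strip digits and punctuation *inside* tokens (so '4F' becomes 'f' and 'C/O' becomes 'co') before filtering, rather than dropping only tokens that are pure digits. The stopword set here is a small hard-coded stand-in for the NLTK list plus the user-defined words:

```python
import re
import string

# Stand-in for nltk stopwords + user_defined_stop_words.
stop_words = {'and', 'rd', 'st', 'hong', 'kong'}
strip_chars = re.compile('[\\d%s]+' % re.escape(string.punctuation))

def preprocess(text):
    # Lowercase once, split, then remove digits/punctuation inside tokens.
    tokens = (strip_chars.sub('', t) for t in text.lower().split())
    return ' '.join(t for t in tokens if t and t not in stop_words)

print(preprocess('23FLOOR 9 DES VOEUX RD WEST HONG KONG'))
# floor des voeux west
```

Tokens that become empty after stripping (e.g. '9' or '13-15') are discarded by the `if t` guard.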
I have a machine learning task involving a large amount of text data. I want to identify and extract noun phrases from the training text so I can use them later in the pipeline for feature construction. I have extracted the kind of noun phrases I want, but I'm fairly new to NLTK, so I approached the problem in a way that breaks out each step of the list comprehension, as shown below.
But my real question is: am I reinventing the wheel here? Is there a faster way that I'm not seeing?
import nltk
import pandas as pd
myData = pd.read_excel("\User\train_.xlsx")
texts = myData['message']
# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)
tokens = [nltk.word_tokenize(i) for i in texts]
tag_list = [nltk.pos_tag(w) for w in tokens]
phrases = [chunkr.parse(sublist) for sublist in tag_list]
leaves = [[subtree.leaves() for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]
          for tree in phrases]
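One pitfall in the last step: `Tree.label` is a method, so the filter must compare `t.label() == 'NP'`; comparing the bound method itself to a string is always False and silently filters out every subtree. A toy sketch with a hand-tagged sentence (so no tagger download is needed):

```python
from nltk import RegexpParser

# Same grammar as above; tagged input is a hand-made stand-in for
# one element of tag_list.
NP = "NP: {(<V\\w+>|<NN\\w?>)+.*<NN\\w?>}"
chunkr = RegexpParser(NP)
tagged = [('the', 'DT'), ('quick', 'JJ'), ('noun', 'NN'), ('phrase', 'NN')]
tree = chunkr.parse(tagged)
# label() with parentheses -- t.label == 'NP' would match nothing.
phrases = [' '.join(tok for tok, tag in st.leaves())
           for st in tree.subtrees(filter=lambda t: t.label() == 'NP')]
print(phrases)
```

Joining each subtree's leaves directly like this also skips the separate flatten-and-join passes below.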
Flatten the list of lists of tuples that we end up with into just a list of lists of tuples:
leaves = [tupls for sublists in leaves for tupls in sublists]
Join the extracted terms into bigrams:
nounphrases = …

I have a dataset of user reviews. I have loaded the dataset and now I want to preprocess the reviews (i.e. remove stopwords and punctuation, convert to lowercase, strip salutations, etc.) before fitting them to a classifier, but I'm running into an error. Here is my code:
import pandas as pd
import numpy as np
df=pd.read_json("C:/Users/ABC/Downloads/Compressed/reviews_Musical_Instruments_5.json/Musical_Instruments_5.json",lines=True)
dataset=df.filter(['overall','reviewText'],axis=1)
def cleanText(text):
    """
    removes punctuation, stopwords and returns lowercase text in a list
    of single words
    """
    text = (text.lower() for text in text)  # BUG: rebinds text to a generator over single characters
    from bs4 import BeautifulSoup
    text = BeautifulSoup(text).get_text()
    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)
    from nltk.corpus import stopwords
    clean = [word for word in text if word not in
             stopwords.words('english')]
    return clean
dataset['reviewText']=dataset['reviewText'].apply(cleanText)
dataset['reviewText']
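Aside: `(text.lower() for text in text)` iterates over the string character by character and rebinds `text` to a generator, which is what BeautifulSoup then chokes on in the traceback below. A sketch of the tokenize-and-filter steps with the string lowercased as a whole (HTML stripping omitted, and a tiny hard-coded stand-in for the NLTK stopword list):

```python
import re

# Tiny stand-in for stopwords.words('english').
stop_words = {'a', 'an', 'and', 'the', 'is', 'it', 'this'}

def clean_text(text):
    # Lowercase the whole string at once instead of building a
    # per-character generator, then tokenize on word characters
    # (same effect as RegexpTokenizer(r'\w+')).
    tokens = re.findall(r'\w+', text.lower())
    return [w for w in tokens if w not in stop_words]

print(clean_text('This guitar is GREAT and it works'))
# ['guitar', 'great', 'works']
```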
I'm getting these errors:
TypeError                                 Traceback (most recent call last)
<ipython-input-68-f42f70ec46e5> …

I'm trying to tokenize every sentence in my pandas Series. I tried using apply as described in the documentation, but it didn't work:
x.apply(nltk.word_tokenize)
Simply calling nltk.word_tokenize(x) doesn't work either, because x is not a string. Does anyone have any ideas?
Edit: x is a pandas Series of sentences:
0 A very, very, very slow-moving, aimless movie ...
1 Not sure who was more lost - the flat characte...
2 Attempting artiness with black & white and cle...
With x.apply(nltk.word_tokenize) it returns exactly the same thing:
0 A very, very, very slow-moving, aimless movie ...
1 Not sure who was more lost - the flat characte...
2 Attempting artiness with black & white and cle...
The error from nltk.word_tokenize(x) is: …
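For the record, x.apply(nltk.word_tokenize) does return a tokenized Series; apply does not modify x in place, so the result has to be assigned to something, and the truncated repr of a Series of token lists can look deceptively similar to the original. nltk.word_tokenize(x) fails because it expects a single string, not a Series. A sketch with str.split standing in for word_tokenize:

```python
import pandas as pd

x = pd.Series(['A very slow-moving aimless movie',
               'Not sure who was more lost'])
# apply() returns a NEW Series of token lists; assign it, don't discard it.
tokens = x.apply(str.split)  # str.split stands in for nltk.word_tokenize
print(tokens[0])
# ['A', 'very', 'slow-moving', 'aimless', 'movie']
```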