I'm using NLTK together with scikit-learn's CountVectorizer to stem and tokenize words.
Here is a simple example of using CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer().fit(vocab)
sentence1 = vec.transform(['The swimmer likes swimming.'])
sentence2 = vec.transform(['The swimmer swims.'])
print('Vocabulary: %s' %vec.get_feature_names())
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())
which prints:
Vocabulary: ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
Sentence 1: [[0 1 0 1 1 0 1]]
Sentence 2: [[0 0 0 1 0 1 1]]
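Building on the example above, CountVectorizer accepts a custom `tokenizer` callable, so NLTK's PorterStemmer can be plugged in directly. A sketch (using a plain whitespace split plus punctuation stripping instead of `word_tokenize`, so no NLTK data download is needed):

```python
import string
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_tokenize(doc):
    # Split on whitespace, strip surrounding punctuation, stem each token.
    tokens = (t.strip(string.punctuation) for t in doc.split())
    return [stemmer.stem(t) for t in tokens if t]

vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer(tokenizer=stem_tokenize).fit(vocab)
print(sorted(vec.vocabulary_))
# ['he', 'like', 'so', 'swim', 'swimmer', 'the']
```

Note that 'swimming' and 'swims' now collapse into the single feature 'swim', which is the point of stemming here.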
Now, say I want to remove stop words and stem the words. One option would be:
from nltk import word_tokenize
from nltk.stem.porter …

If you want to POS tag a column of text stored in a pandas dataframe, one sentence per row, most implementations on SO use the apply method:
dfData['POSTags'] = dfData['SourceText'].apply(
    lambda row: pos_tag(word_tokenize(row)))
The NLTK documentation recommends pos_tag_sents() for tagging multiple sentences efficiently.
Does that apply to this example, and if so, is the change as simple as swapping pos_tag for pos_tag_sents, or does NLTK mean paragraph-length text sources?
As mentioned in the comments, pos_tag_sents() aims to avoid reloading the tagger for every row, but the question is: how do you do that and still produce a column in a pandas dataframe?
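One way to reconcile the two: tokenize the whole column first, make a single batched tagging call, and assign the resulting list back as a new column. The sketch below uses a hypothetical toy tagger in place of nltk.pos_tag_sents (which would need the tagger model downloaded); the pandas pattern is the point:

```python
import pandas as pd

def tag_sents(sentences):
    # Stand-in for nltk.pos_tag_sents: tags a whole BATCH of tokenized
    # sentences in one call, so the model would be loaded only once.
    return [[(tok, 'NN') for tok in sent] for sent in sentences]

dfData = pd.DataFrame({'SourceText': ['NLTK tags text', 'pandas stores columns']})
tokenized = dfData['SourceText'].str.split().tolist()  # word_tokenize in the real pipeline
dfData['POSTags'] = tag_sents(tokenized)  # one batched call, one new column
```

Compared with the per-row `apply(lambda row: pos_tag(word_tokenize(row)))`, this calls the (stand-in) tagger once over all rows rather than once per row.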
Lowercasing and removing punctuation and digits don't work when using nltk.
My code:
import string
import nltk
from nltk.tokenize import word_tokenize

stopwords = nltk.corpus.stopwords.words('english') + list(string.punctuation)
user_defined_stop_words=['st','rd','hong','kong']
new_stop_words=stopwords+user_defined_stop_words
def preprocess(text):
    return [word for word in word_tokenize(text)
            if word.lower() not in new_stop_words and not word.isdigit()]
miss_data['Clean_addr'] = miss_data['Adj_Addr'].apply(preprocess)
Sample input:
23FLOOR 9 DES VOEUX RD WEST HONG KONG
PAG CONSULTING FLAT 15 AIA CENTRAL 1 CONNAUGHT RD CENTRAL
C/O CITY LOST STUDIOS AND FLAT 4F 13-15 HILLIER ST SHEUNG HONG KONG
Expected output:
floor des voeux west
pag consulting flat aia central connaught central
co city lost studios flat f hillier sheung
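A sketch of a preprocess that reproduces the expected output: lowercase the whole string first, and strip digits and punctuation *inside* tokens (so '4F' becomes 'f' and 'C/O' becomes 'co') before filtering, rather than dropping only tokens that are pure digits. The stopword set here is a small hard-coded stand-in for the NLTK list plus the user-defined words:

```python
import re
import string

# Stand-in for nltk stopwords + user_defined_stop_words.
stop_words = {'and', 'rd', 'st', 'hong', 'kong'}
strip_chars = re.compile('[\\d%s]+' % re.escape(string.punctuation))

def preprocess(text):
    # Lowercase once, split, then remove digits/punctuation inside tokens.
    tokens = (strip_chars.sub('', t) for t in text.lower().split())
    return ' '.join(t for t in tokens if t and t not in stop_words)

print(preprocess('23FLOOR 9 DES VOEUX RD WEST HONG KONG'))
# floor des voeux west
```

Tokens that become empty after stripping (e.g. '9' or '13-15') are discarded by the `if t` guard.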
I have a machine learning task involving a large amount of text data. I want to identify and extract noun phrases from the training text so I can use them later in the pipeline for feature construction. I have extracted the kind of noun phrases I want, but I'm fairly new to NLTK, so I approached the problem in a way that breaks out each step of the list comprehension, as shown below.
But my real question is: am I reinventing the wheel here? Is there a faster way that I'm not seeing?
import nltk
import pandas as pd
myData = pd.read_excel("\User\train_.xlsx")
texts = myData['message']
# Defining a grammar & Parser
NP = "NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)
tokens = [nltk.word_tokenize(i) for i in texts]
tag_list = [nltk.pos_tag(w) for w in tokens]
phrases = [chunkr.parse(sublist) for sublist in tag_list]
leaves = [[subtree.leaves() for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]
          for tree in phrases]
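One pitfall in the last step: `Tree.label` is a method, so the filter must compare `t.label() == 'NP'`; comparing the bound method itself to a string is always False and silently filters out every subtree. A toy sketch with a hand-tagged sentence (so no tagger download is needed):

```python
from nltk import RegexpParser

# Same grammar as above; tagged input is a hand-made stand-in for
# one element of tag_list.
NP = "NP: {(<V\\w+>|<NN\\w?>)+.*<NN\\w?>}"
chunkr = RegexpParser(NP)
tagged = [('the', 'DT'), ('quick', 'JJ'), ('noun', 'NN'), ('phrase', 'NN')]
tree = chunkr.parse(tagged)
# label() with parentheses -- t.label == 'NP' would match nothing.
phrases = [' '.join(tok for tok, tag in st.leaves())
           for st in tree.subtrees(filter=lambda t: t.label() == 'NP')]
print(phrases)
```

Joining each subtree's leaves directly like this also skips the separate flatten-and-join passes below.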
Flatten the list of lists of tuples that we end up with into just a list of lists of tuples:
leaves = [tupls for sublists in leaves for tupls in sublists]
Join the extracted terms into bigrams:
nounphrases = …

I have a dataset of user reviews. I have loaded the dataset and now I want to preprocess the reviews (i.e. remove stopwords and punctuation, convert to lowercase, strip salutations, etc.) before fitting them to a classifier, but I'm running into an error. Here is my code:
import pandas as pd
import numpy as np
df=pd.read_json("C:/Users/ABC/Downloads/Compressed/reviews_Musical_Instruments_5.json/Musical_Instruments_5.json",lines=True)
dataset=df.filter(['overall','reviewText'],axis=1)
def cleanText(text):
    """
    removes punctuation, stopwords and returns lowercase text in a list
    of single words
    """
    text = (text.lower() for text in text)  # BUG: rebinds text to a generator over single characters
    from bs4 import BeautifulSoup
    text = BeautifulSoup(text).get_text()
    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)
    from nltk.corpus import stopwords
    clean = [word for word in text if word not in
             stopwords.words('english')]
    return clean
dataset['reviewText']=dataset['reviewText'].apply(cleanText)
dataset['reviewText']
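Aside: `(text.lower() for text in text)` iterates over the string character by character and rebinds `text` to a generator, which is what BeautifulSoup then chokes on in the traceback below. A sketch of the tokenize-and-filter steps with the string lowercased as a whole (HTML stripping omitted, and a tiny hard-coded stand-in for the NLTK stopword list):

```python
import re

# Tiny stand-in for stopwords.words('english').
stop_words = {'a', 'an', 'and', 'the', 'is', 'it', 'this'}

def clean_text(text):
    # Lowercase the whole string at once instead of building a
    # per-character generator, then tokenize on word characters
    # (same effect as RegexpTokenizer(r'\w+')).
    tokens = re.findall(r'\w+', text.lower())
    return [w for w in tokens if w not in stop_words]

print(clean_text('This guitar is GREAT and it works'))
# ['guitar', 'great', 'works']
```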
I'm getting these errors:
TypeError                                 Traceback (most recent call last)
<ipython-input-68-f42f70ec46e5> …

I'm trying to tokenize every sentence in my pandas Series. I tried using apply as described in the documentation, but it didn't work:
x.apply(nltk.word_tokenize)
Simply calling nltk.word_tokenize(x) doesn't work either, because x is not a string. Does anyone have any ideas?
Edit: x is a pandas Series of sentences:
0 A very, very, very slow-moving, aimless movie ...
1 Not sure who was more lost - the flat characte...
2 Attempting artiness with black & white and cle...
With x.apply(nltk.word_tokenize) it returns exactly the same thing:
0 A very, very, very slow-moving, aimless movie ...
1 Not sure who was more lost - the flat characte...
2 Attempting artiness with black & white and cle...
The error from nltk.word_tokenize(x) is: …
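For the record, x.apply(nltk.word_tokenize) does return a tokenized Series; apply does not modify x in place, so the result has to be assigned to something, and the truncated repr of a Series of token lists can look deceptively similar to the original. nltk.word_tokenize(x) fails because it expects a single string, not a Series. A sketch with str.split standing in for word_tokenize:

```python
import pandas as pd

x = pd.Series(['A very slow-moving aimless movie',
               'Not sure who was more lost'])
# apply() returns a NEW Series of token lists; assign it, don't discard it.
tokens = x.apply(str.split)  # str.split stands in for nltk.word_tokenize
print(tokens[0])
# ['A', 'very', 'slow-moving', 'aimless', 'movie']
```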