Posts by Ale*_*lex

How to remove stop words using nltk or python

So I have a dataset from which I would like to remove stop words, using

stopwords.words('english')


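I'm struggling with how to use this in my code to simply take out these words. I already have a list of the words from this dataset; the part I'm struggling with is comparing against this list and removing the stop words. Any help is appreciated.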
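The comparison described above can be sketched as a list comprehension over a set of stop words. This is a minimal Python 3 sketch, not the accepted answer: `remove_stop_words` and `sample_stop_words` are hypothetical names, and the small hard-coded list stands in for `stopwords.words('english')` so the example runs without the NLTK corpus installed.

```python
def remove_stop_words(words, stop_words):
    """Return the words that are not stop words, preserving order."""
    # Build the set once: membership tests on a set are much faster
    # than scanning a list for every word.
    stop_set = set(stop_words)
    return [w for w in words if w.lower() not in stop_set]


# Hypothetical stand-in for stopwords.words('english').
sample_stop_words = ["the", "is", "a", "of", "and"]

words = ["the", "cat", "is", "a", "friend", "of", "the", "dog"]
filtered = remove_stop_words(words, sample_stop_words)
print(filtered)
```

With NLTK available, passing `stopwords.words('english')` as the second argument should give the same behaviour against the full English list.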

python nltk stop-words

Ale*_*lex

2013-03-06

98
votes

7
answers

160k
views

Adding words to the nltk stop list

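I have some code that removes stop words from my dataset. Because the stop list doesn't seem to remove most of the words I would like it to, I want to add words to this stop list so that it removes them in this case. The code I'm using to remove stop words is: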

word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]


I'm unsure of the correct syntax for adding words and can't seem to find it anywhere. Any help is appreciated. Thanks.
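Since the stop list is an ordinary Python collection of strings, one way to add words is to combine it with your own words before filtering. The sketch below is Python 3 and rests on assumptions: `extend_stop_words` is a hypothetical helper, and `base` is a tiny stand-in for `nltk.corpus.stopwords.words('english')` so the example runs on its own.

```python
def extend_stop_words(base_stop_words, extra_words):
    """Return a set containing the base stop words plus the extra ones."""
    # Extra words are lowercased so they match lowercased input text.
    return set(base_stop_words) | {w.lower() for w in extra_words}


# Hypothetical stand-in for nltk.corpus.stopwords.words('english').
base = ["the", "is", "a"]
custom = extend_stop_words(base, ["said", "Also"])

word_list = ["the ", "king", "said", "also", " is", "here"]
# Same filter as in the question, but against the extended set.
word_list2 = [w.strip() for w in word_list if w.strip() not in custom]
print(word_list2)
```

Building the set once outside the comprehension also avoids reloading the stop list for every word.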

python nltk stop-words

Ale*_*lex

lucky-day

10
votes

3
answers

20k
views

Stop words nltk/python problem

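I have some code that processes a dataset for later use. The code I use for the stop words seems to be OK, but I think the problem lies in the rest of my code, since it only seems to remove some of the stop words.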

import re
import nltk

# Quran subset
filename = 'subsetQuran.txt'

# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)

word_list2 = [w for w in word_list if not w in nltk.corpus.stopwords.words('english')]



# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]') 
for word in word_list2:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # form dictionary
    try: 
        freq_dic[word] += 1
    except: 
        freq_dic[word] = …

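In the code as shown, the stop-word filter runs before punctuation is stripped, so a token such as `the,` does not match `the` and survives the filter. Whether that is the whole problem cannot be confirmed from the truncated listing. A minimal Python 3 sketch that strips punctuation first is given below; `clean_and_filter` is a hypothetical name and the short stop list stands in for `nltk.corpus.stopwords.words('english')`.

```python
import re

# Same character class as in the question.
PUNCTUATION = re.compile(r'[-.?!,":;()|0-9]')


def clean_and_filter(text, stop_words):
    """Lowercase, split on whitespace, strip punctuation, then drop stop words."""
    stop_set = set(stop_words)
    tokens = (PUNCTUATION.sub("", t) for t in text.lower().split())
    # Skip tokens that became empty after stripping (for example "12").
    return [t for t in tokens if t and t not in stop_set]


sample = 'The king, and the "queen"; 12 of the guards.'
kept = clean_and_filter(sample, ["the", "and", "of"])
print(kept)
```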

python nltk

Ale*_*lex

lucky-day

5
votes

1
answers

8313
views

Creating ARFF from word frequencies

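I have some code that gives me a list of words with the frequencies at which they occur in the text. I'd like the code to automatically convert the top 10 words into an ARFF with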

@RELATION wordfrequencies

@ATTRIBUTE word string
@ATTRIBUTE frequency numeric

and the top 10 as the data with their frequencies.

I'm struggling to work out how to do this with my current code:

import re
import nltk

# Quran subset
filename = 'subsetQuran.txt'

# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)

word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]



# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]') 
for word in word_list2:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # form dictionary
    try: 
        freq_dic[word] …

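One way to produce the ARFF text is to count the words with `collections.Counter` and write the ten most common as data rows. This is a Python 3 sketch under stated assumptions: `top_words_to_arff` is a hypothetical helper, the attribute name `word` is assumed, and every word is written as a single-quoted ARFF string.

```python
from collections import Counter


def top_words_to_arff(words, n=10, relation="wordfrequencies"):
    """Build ARFF text listing the n most frequent words and their counts."""
    counts = Counter(w for w in words if w)  # ignore empty strings
    lines = [
        f"@RELATION {relation}",
        "",
        "@ATTRIBUTE word string",  # attribute name 'word' is an assumption
        "@ATTRIBUTE frequency numeric",
        "",
        "@DATA",
    ]
    for word, freq in counts.most_common(n):
        # Escape backslashes and single quotes inside the quoted value.
        escaped = word.replace("\\", "\\\\").replace("'", "\\'")
        lines.append(f"'{escaped}',{freq}")
    return "\n".join(lines) + "\n"


words = ["god", "god", "mercy", "god", "mercy", "day"]
arff = top_words_to_arff(words, n=2)
print(arff)
```

The returned string can then be written to a `.arff` file with an ordinary `open(..., "w")` call.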

python nltk weka arff word-frequency

Ale*_*lex

2013-03-18

5
votes

1
answers

1185
views

Removing punctuation/numbers from text problem

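I had some code that worked fine for removing punctuation/numbers using regular expressions in Python. I had to change the code a bit so that a stop list worked, which is not particularly important. Anyway, now the punctuation isn't being removed, and quite frankly I'm stumped as to why.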

import re
import nltk

# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')

# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
for word in word_list:
    word = punctuation.sub("", word)
print word_list


Any pointers on why it isn't working would be great. I'm no expert in Python, so it's probably something really stupid. Thanks.
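The loop in the code above rebinds the name `word` on each pass. Python strings are immutable and `sub` returns a new string, so `word_list` itself is never changed. A small Python 3 sketch of the difference follows; the sample words are made up for illustration.

```python
import re

punctuation = re.compile(r'[-.?!,":;()|0-9]')
word_list = ["hello,", "world!", "abc123", "(test)"]

# Rebinding the loop variable only changes the local name, not the list.
for word in word_list:
    word = punctuation.sub("", word)
unchanged = list(word_list)

# Building a new list keeps the cleaned strings.
cleaned = [punctuation.sub("", word) for word in word_list]

print(unchanged)
print(cleaned)
```

Assigning the comprehension back to `word_list` replaces the original list with the cleaned one.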

python nltk

Ale*_*lex

lucky-day

5
votes

1
answers

10k
views

Counting bigram frequencies

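I've written a piece of code that essentially counts word frequencies and inserts them into an ARFF file for use with weka. I'd like to alter it so that it can count bigram frequencies, i.e. pairs of words instead of single words, although my attempts have proved unsuccessful at best.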

I realise there's a lot to look at, but any help with this is greatly appreciated. Here's my code:

    import re
    import nltk

    # Quran subset
    filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')

    # create list of lower case words
    word_list = re.split('\s+', file(filename).read().lower())
    print 'Words in text:', len(word_list)
    # punctuation and numbers to be removed
    punctuation = re.compile(r'[-.?!,":;()|0-9]')
    word_list = [punctuation.sub("", word) for word in word_list]

    word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]



    # create dictionary of word:frequency pairs
    freq_dic = {}


    for …

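Pairs of adjacent words can be formed by zipping the word list with itself shifted by one position, and then counted in the same way as single words. The following is a minimal Python 3 sketch rather than a drop-in change to the code above; `bigram_frequencies` is a hypothetical name and the sample words are made up.

```python
from collections import Counter


def bigram_frequencies(words):
    """Count each pair of adjacent words in the list."""
    # zip stops at the shorter input, so the last word has no partner.
    return Counter(zip(words, words[1:]))


words = ["in", "the", "name", "of", "the", "name"]
freq = bigram_frequencies(words)

for (first, second), count in freq.most_common(3):
    print(first, second, count)
```

NLTK also provides `nltk.bigrams`, which yields the same adjacent pairs from a sequence. When writing the pairs to ARFF, the two words would presumably be joined into one string value per row.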

python nlp arff

Ale*_*lex

2011-05-04

3
votes

1
answers

20k
views