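I know how to get bigram and trigram collocations with NLTK and apply them to my own corpus. The code is below.
However, I am not sure (1) how to get the collocations for a specific word, and (2) whether NLTK has a collocation measure based on the log-likelihood ratio.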
import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize
text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(word_tokenize(text))
for i in finder.score_ngrams(trigram_measures.pmi):
    print(i)
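
For both points, here is a minimal sketch rather than a definitive answer. It assumes `apply_ngram_filter` (which drops every candidate the function returns True for) and the `likelihood_ratio` measure behave as I understand them from NLTK. The `target` word is a hypothetical choice, and I split on whitespace instead of calling `word_tokenize` so that no tokenizer data is needed.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
tokens = text.split()  # plain whitespace split (assumption: good enough here)

target = "foo"  # hypothetical word of interest

trigram_measures = TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(tokens)

# The filter removes every trigram for which the function returns True,
# so only trigrams containing the target word remain.
finder.apply_ngram_filter(lambda *words: target not in words)

# Score the remaining trigrams with the log-likelihood ratio measure.
scored = finder.score_ngrams(trigram_measures.likelihood_ratio)
for trigram, score in scored:
    print(trigram, round(score, 3))
```

Swapping `trigram_measures.likelihood_ratio` for `trigram_measures.pmi` should rank the same filtered trigrams by PMI instead.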
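
I always work with Arabic text files, and to avoid encoding problems I transliterate the Arabic characters into Latin characters according to Buckwalter's scheme (http://www.qamus.org/transliteration.htm).
Here is my code, but it is very slow even for small files of around 400 KB. How can I make it faster?
Thanks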
def transliterate(file):
    # Read the whole file into memory.
    with open(file) as f:
        data = f.read()
    buckArab = {"'":"?", "|":"?", "?":"?", "&":"?", "<":"?", "}":"?", "A":"?", "b":"?", "p":"?", "t":"?", "v":"?", "g":"?", "H":"?", "x":"?", "d":"?", "*":"?", "r":"?", "z":"?", "s":"?", "$":"?", "S":"?", "D":"?", "T":"?", "Z":"?", "E":"?", "G":"?", "_":"?", "f":"?", "q":"?", "k":"?", "l":"?", "m":"?", "n":"?", "h":"?", "w":"?", "Y":"?", "y":"?", "F":"?", "N":"?", "K":"?", "~":"?", "o":"?", "u":"?", "a":"?", "i":"?"}
    # For every character in the file, run a replace pass for every mapping entry.
    for char in data:
        for k, v in buckArab.items():
            data = data.replace(k, v)
    return data
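
One likely reason the code above is slow: for every character in the file it runs a full `replace` pass over the text for every dictionary entry. Since each mapping is one character to one character, `str.translate` can do the whole job in a single pass. The sketch below is only an illustration: `ARAB_TO_BUCK` is a tiny hypothetical subset (three letters), the function names are mine, and the UTF-8 file encoding is an assumption.

```python
# Tiny illustrative subset of an Arabic -> Buckwalter table (assumption: the
# real table would list every pair from the full transliteration scheme).
ARAB_TO_BUCK = {
    "\u0627": "A",  # ARABIC LETTER ALEF
    "\u0628": "b",  # ARABIC LETTER BEH
    "\u062A": "t",  # ARABIC LETTER TEH
}

# Build the lookup table once, outside the function.
TABLE = str.maketrans(ARAB_TO_BUCK)


def transliterate_text(data):
    """Map every character in a single pass; unmapped characters are kept."""
    return data.translate(TABLE)


def transliterate_file(path, encoding="utf-8"):
    """Read a text file (the encoding is an assumption) and transliterate it."""
    with open(path, encoding=encoding) as handle:
        return transliterate_text(handle.read())


print(transliterate_text("\u0628\u0627\u062A \u0628\u0627\u0628"))
```

The table is built once and the text is scanned once, so the running time should grow with the file size instead of with file size times the number of replace passes.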
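
I have a question about PageRank that probably shows I don't understand it very well. If I have a graph with two nodes "A" and "B", and links A -> B with weight 1.0 and B -> A with weight 2.0, shouldn't A have a higher rank, since its weighted in-degree is higher?
That doesn't seem to be the case when I try PageRank from networkx, but I don't know why.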
>>> import networkx as nx
>>> DG = nx.DiGraph()
>>> DG.add_weighted_edges_from([("A", "B", 1.0),("B", "A",2.0)])
>>> pr = nx.pagerank(DG)
>>> pr
{'A': 0.5, 'B': 0.5}
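
A possible explanation, as far as I understand PageRank: a node splits its rank among its outgoing edges in proportion to each edge's share of that node's total outgoing weight. With a single outgoing edge per node, that share is always 1, whatever the weight. The sketch below is a plain power iteration, not networkx's implementation. The function name, the damping factor of 0.85, and the three-node graph are my own assumptions for illustration.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by plain power iteration.

    edges: list of (source, target, weight). Assumes every node has at
    least one outgoing edge, so dangling nodes are not handled.
    """
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    out_weight = {n: 0.0 for n in nodes}
    for s, _, w in edges:
        out_weight[s] += w

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for s, t, w in edges:
            # Each edge carries the fraction w / out_weight[s] of s's rank.
            new_rank[t] += damping * rank[s] * w / out_weight[s]
        rank = new_rank
    return rank


# The graph from the question: one outgoing edge per node.
two_nodes = pagerank([("A", "B", 1.0), ("B", "A", 2.0)])
print(two_nodes)

# Hypothetical graph where A has two outgoing edges, so weights matter.
three_nodes = pagerank([("A", "B", 1.0), ("A", "C", 3.0),
                        ("B", "A", 1.0), ("C", "A", 1.0)])
print(three_nodes)
```

In the two-node graph both ranks stay at about 0.5; in the three-node graph C ends up ranked above B because it receives three quarters of what A hands out.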
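
I know how to get bigram and trigram collocations using NLTK, and I apply them to my own corpus. The code is below.
My only problem is how to print out the bigrams with their PMI values. I have searched the NLTK documentation several times. Either I am missing something or it is not there.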
import nltk
from nltk.collocations import *
myFile = open("large.txt", 'r').read()
myList = myFile.split()
myCorpus = nltk.Text(myList)
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(myCorpus)
# Ignore trigrams that occur fewer than 3 times.
finder.apply_freq_filter(3)
print(finder.nbest(trigram_measures.pmi, 500000))
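
Note that `nbest` returns only the n-grams, without scores. To make explicit what a PMI value attached to a bigram looks like, here is a small hand-rolled sketch in plain Python. It assumes the usual definition PMI(w1, w2) = log2(count(w1, w2) * N / (count(w1) * count(w2))) with N the number of tokens. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_freq=1):
    """Return [(bigram, pmi)] sorted by descending PMI."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    scored = []
    for (w1, w2), joint in bigrams.items():
        if joint < min_freq:
            continue  # plays the same role as apply_freq_filter above
        pmi = math.log2(joint * total / (unigrams[w1] * unigrams[w2]))
        scored.append(((w1, w2), pmi))
    return sorted(scored, key=lambda item: item[1], reverse=True)


tokens = "a b a b a b c d".split()  # toy corpus for illustration
for bigram, score in bigram_pmi(tokens):
    print(bigram, round(score, 3))
```

Within NLTK itself, I believe calling `score_ngrams(bigram_measures.pmi)` on a `BigramCollocationFinder` (instead of `nbest`) returns the same kind of (bigram, score) pairs, though I have not checked that its numbers match this sketch exactly.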