I have a huge file of 3,000,000 lines, with 20-40 words per line. I have to extract 1 to 5 ngrams from the corpus. My input files are tokenized plain text, e.g.:
This is a foo bar sentence .
There is a comma , in this sentence .
Such is an example text .
Currently, I am doing the following, but this does not seem to be an efficient way to extract the 1-5grams:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io, os
from collections import Counter
import sys; reload(sys); sys.setdefaultencoding('utf-8')
with io.open('train-1.tok.en', 'r', encoding='utf8') as srcfin, \
     io.open('train-1.tok.jp', 'r', encoding='utf8') as trgfin:
    # Extract words from file.
    src_words = ['<s>'] + srcfin.read().replace('\n', ' </s> <s> ').split()
    del src_words[-1]  # Removes the final '<s>'
    trg_words = ['<s>'] + trgfin.read().replace('\n', ' </s> <s> ').split()
    del trg_words[-1]  # Removes the final '<s>'
    # Unigrams count.
    src_unigrams = Counter(src_words)
    trg_unigrams = Counter(trg_words)
    # Sum of unigram counts.
    src_sum_unigrams = sum(src_unigrams.values())
    trg_sum_unigrams = sum(trg_unigrams.values())
    # Bigrams count.
    src_bigrams = Counter(zip(src_words, src_words[1:]))
    trg_bigrams = Counter(zip(trg_words, trg_words[1:]))
    # Sum of bigram counts.
    src_sum_bigrams = sum(src_bigrams.values())
    trg_sum_bigrams = sum(trg_bigrams.values())
    # Trigrams count.
    src_trigrams = Counter(zip(src_words, src_words[1:], src_words[2:]))
    trg_trigrams = Counter(zip(trg_words, trg_words[1:], trg_words[2:]))
    # Sum of trigram counts.
    src_sum_trigrams = sum(src_trigrams.values())
    trg_sum_trigrams = sum(trg_trigrams.values())
Is there any other way to do this more efficiently?
How can I extract the different N ngrams optimally and simultaneously?
From Fast/Optimize N-gram implementations in python, it is essentially this:
zip(*[words[i:] for i in range(n)])
which, when hardcoded for bigrams, n=2, is:
zip(src_words,src_words[1:])
And this is for trigrams, n=3:
zip(src_words,src_words[1:],src_words[2:])
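Wrapped into a helper (a small illustrative sketch, not part of the linked answer; the ngrams name is made up), the same idiom covers every order from 1 to 5 in one loop:

import sys

def ngrams(tokens, n):
    # Generalizes the hard-coded bigram/trigram calls to any n.
    return zip(*[tokens[i:] for i in range(n)])

# All 1-grams through 5-grams of one tokenized sentence:
sent = ['<s>'] + 'This is a foo bar sentence .'.split() + ['</s>']
for n in range(1, 6):
    print(list(ngrams(sent, n)))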
If you are only interested in the most common (frequent) n-grams (which I suppose is your case), you can reuse the central idea of the Apriori algorithm. Given s_min, a minimal support which can be thought of as the number of lines that a given n-gram is contained in, it efficiently searches for all such n-grams.
The idea is as follows: write a query function which takes an n-gram and tests how many times it is contained in the corpus. After you have such a function prepared (it can be optimized as discussed later), scan the whole corpus, get all the 1-grams, i.e. bare tokens, and select those which are contained at least s_min times. This gives you the subset F1 of frequent 1-grams. Then test all the possible 2-grams formed by combining the 1-grams from F1. Again, select those which meet the s_min criterion and you get F2. By combining all the 2-grams from F2 and selecting the frequent 3-grams, you get F3. Repeat for as long as Fn is non-empty.
Many optimizations can be done here. When combining n-grams from Fn, you can exploit the fact that two n-grams x and y may only be combined into an (n+1)-gram iff x[1:] == y[:-1] (this can be checked in constant time for any n if proper hashing is used). Moreover, if you have enough RAM (for your corpus, many GB), you can speed up the query function enormously: for each 1-gram, store a hash-set of the line indices that contain it. When combining two n-grams into an (n+1)-gram, take the intersection of the two corresponding sets to obtain the set of lines where the (n+1)-gram may occur.
The time complexity grows as s_min decreases. The beauty is that infrequent (and hence uninteresting) n-grams are completely filtered out as the algorithm runs, saving computational time for the frequent ones only.
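A minimal sketch of this Apriori-style, level-wise search, assuming the tokenized lines fit in RAM and using the line-index-set optimization described above (the names frequent_ngrams, postings and support are invented for illustration):

import io
from collections import defaultdict

def frequent_ngrams(path, s_min=5, max_n=5):
    # Index step: map every 1-gram to the set of line indices containing it.
    lines = []
    postings = defaultdict(set)
    with io.open(path, 'r', encoding='utf8') as fin:
        for idx, line in enumerate(fin):
            tokens = tuple(line.split())
            lines.append(tokens)
            for tok in tokens:
                postings[(tok,)].add(idx)

    def support(gram, candidate_lines):
        # Lines (restricted to the candidate set) that really contain the n-gram.
        n = len(gram)
        return set(idx for idx in candidate_lines
                   if any(lines[idx][i:i + n] == gram
                          for i in range(len(lines[idx]) - n + 1)))

    # F1: frequent 1-grams, each mapped to its set of supporting lines.
    frequent = dict((g, s) for g, s in postings.items() if len(s) >= s_min)
    all_frequent = dict(frequent)
    for n in range(2, max_n + 1):
        next_level = {}
        grams = list(frequent)
        for x in grams:
            for y in grams:
                if x[1:] == y[:-1]:                        # overlap on n-1 tokens
                    candidate = frequent[x] & frequent[y]  # lines that may contain x + y[-1:]
                    if len(candidate) >= s_min:
                        hits = support(x + y[-1:], candidate)
                        if len(hits) >= s_min:
                            next_level[x + y[-1:]] = hits
        if not next_level:
            break
        all_frequent.update(next_level)
        frequent = next_level
    return all_frequent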
Here are a few pointers on the general problems you are trying to solve. One or more of them should be useful to you and help you figure this out.
For what you are doing (I am guessing some sort of machine translation experiment), you don't really need to load the two files srcfin and trgfin into memory at the same time (at least not for the code sample you have provided). Processing them separately will be cheaper in terms of the amount of data you need to hold in memory at any given time.
You are reading a lot of data into memory, processing it (which takes even more memory), and then holding the results in in-memory data structures. Instead of doing that, you should strive to be lazier. Learn about python generators and write a generator which streams out all the ngrams from a given text without needing to hold the entire text in memory at any point in time. The itertools python package will probably come in handy while writing this.
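For instance, a lazy version of the extraction might look like this sketch (stream_ngrams is an invented name; the <s>/</s> padding mirrors the question's code):

import io
from collections import Counter

def stream_ngrams(path, min_n=1, max_n=5):
    # Lazily yield every min_n..max_n-gram, holding only one line in memory at a time.
    with io.open(path, 'r', encoding='utf8') as fin:
        for line in fin:
            tokens = ['<s>'] + line.split() + ['</s>']
            for n in range(min_n, max_n + 1):
                for gram in zip(*[tokens[i:] for i in range(n)]):
                    yield gram

# The Counter consumes the generator; the corpus itself is never fully in memory.
counts = Counter(stream_ngrams('train-1.tok.en'))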
Beyond a certain point it will no longer be feasible to hold all this data in memory, and you should consider looking at map-reduce to help you break the job down. Check out the mrjob python package, which lets you write map-reduce jobs in python. In the mapper step you break the text down into its ngrams, and in the reducer stage you count the number of times you see each ngram to get its overall count. mrjob can also be run locally, which obviously won't give you any parallelization benefits, but is nice because mrjob still does a lot of the heavy lifting for you.
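A bare-bones mrjob job along those lines might look like this sketch (class and file names are illustrative; space-joining the n-gram keeps the keys simple strings):

from mrjob.job import MRJob

class MRNgramCount(MRJob):

    def mapper(self, _, line):
        # Emit every 1- to 5-gram of the tokenized line with a partial count of 1.
        tokens = ['<s>'] + line.split() + ['</s>']
        for n in range(1, 6):
            for gram in zip(*[tokens[i:] for i in range(n)]):
                yield ' '.join(gram), 1

    def reducer(self, ngram, counts):
        # Sum the partial counts for each n-gram.
        yield ngram, sum(counts)

if __name__ == '__main__':
    MRNgramCount.run()

Saved as, say, ngram_counts.py, running python ngram_counts.py train-1.tok.en uses mrjob's local runner by default.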
If you are compelled to hold all the counts in memory at the same time (for a massive amount of text), then either implement a pruning strategy to drop very rare ngrams, or consider using a file-based persistent lookup table such as sqlite to hold all the data for you.
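If you go the sqlite route, a sketch of such a persistent count table could look like this (table and column names are invented; the upsert syntax needs SQLite 3.24+):

import sqlite3

conn = sqlite3.connect('ngram_counts.db')
conn.execute('CREATE TABLE IF NOT EXISTS counts (ngram TEXT PRIMARY KEY, freq INTEGER)')

def flush(batch):
    # Merge an in-memory chunk (a dict/Counter of space-joined n-grams) into the table.
    conn.executemany(
        'INSERT INTO counts (ngram, freq) VALUES (?, ?) '
        'ON CONFLICT(ngram) DO UPDATE SET freq = freq + excluded.freq',
        batch.items())
    conn.commit()

def prune(min_freq):
    # Pruning strategy: drop n-grams that stayed very rare.
    conn.execute('DELETE FROM counts WHERE freq < ?', (min_freq,))
    conn.commit()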