alv*_*vas 5 nlp n-gram machine-translation moses language-model
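Moses is software for building machine translation models, and KenLM is the de facto language model software that Moses uses.
I have a text file with 16 GB of text, and I used it to build a language model: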
bin/lmplz -o 5 <text > text.arpa
The resulting file (text.arpa) is 38 GB. I then binarized the language model:
bin/build_binary text.arpa text.binary
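The binarized language model (text.binary) grew to 71 GB.
After training the translation model in Moses, you should tune the model weights with the MERT algorithm. This can be done with https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/mert-moses.pl.
MERT works fine with a small language model, but with a large language model it takes quite a long time to finish.
I searched on Google and found KenLM's filter, which promises to filter the language model down to a smaller size: https://kheafield.com/code/kenlm/filter/
But I have no idea how to make it work. The command's help output gives: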
$ ~/moses/bin/filter
Usage: /home/alvas/moses/bin/filter mode [context] [phrase] [raw|arpa] [threads:m] [batch_size:m] (vocab|model):input_file output_file
copy mode just copies, but makes the format nicer for e.g. irstlm's broken
parser.
single mode treats the entire input as a single sentence.
multiple mode filters to multiple sentences in parallel. Each sentence is on
a separate line. A separate file is created for each sentence by appending
the 0-indexed line number to the output file name.
union mode produces one filtered model that is the union of models created by
multiple mode.
context means only the context (all but last word) has to pass the filter, but
the entire n-gram is output.
phrase means that the vocabulary is actually tab-delimited phrases and that the
phrases can generate the n-gram when assembled in arbitrary order and
clipped. Currently works with multiple or union mode.
The file format is set by [raw|arpa] with default arpa:
raw means space-separated tokens, optionally followed by a tab and arbitrary
text. This is useful for ngram count files.
arpa means the ARPA file format for n-gram language models.
threads:m sets m threads (default: conccurrency detected by boost)
batch_size:m sets the batch size for threading. Expect memory usage from this
of 2*threads*batch_size n-grams.
There are two inputs: vocabulary and model. Either may be given as a file
while the other is on stdin. Specify the type given as a file using
vocab: or model: before the file name.
For ARPA format, the output must be seekable. For raw format, it can be a
stream i.e. /dev/stdout
However, when I try the following, it hangs and does nothing:
$ ~/moses/bin/filter union lm.en.binary lm.filter.binary
Assuming that lm.en.binary is a model file
Reading lm.en.binary
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
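What should be done with the language model after binarization? Are there any other steps for manipulating a large language model that reduce the computational load during tuning?
What is the usual way to tune with a large LM file?
How do I use KenLM's filter?
(More details at https://www.mail-archive.com/moses-support@mit.edu/msg12089.html)
To answer how to use KenLM's filter command: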
cat small_vocabulary_one_word_per_line.txt \
| filter single \
"model:LM_large_vocab.arpa" \
output_LM_small_vocab.
Note: single can be replaced with union or copy. For more information, read the help that is printed when you run the filter binary without arguments.
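Going by the help text quoted in the question, filter takes two inputs, a vocabulary and a model, and reads from stdin whichever one is not given as a file. It also expects the model in arpa (or raw) format, so filtering would happen on the .arpa file, before build_binary. The following is only a rough sketch of one way to produce the one-word-per-line vocabulary. It assumes the words the decoder can output are the target side of a Moses phrase table in the usual `source ||| target ||| scores` text layout. The helper name `target_vocab` and all file names in the usage comments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read phrase-table lines on stdin and print the
# unique target-side tokens, one per line.
# Assumes fields are separated by " ||| " and the target phrase is field 2.
target_vocab() {
    awk -F ' [|][|][|] ' '{ print $2 }' \
        | tr -s ' \t' '\n\n' \
        | grep -v '^$' \
        | LC_ALL=C sort -u
}

# Usage sketch (placeholder file names, not executed here):
#   gzip -cd phrase-table.gz | target_vocab > vocab.txt
#   cat vocab.txt | filter single model:text.arpa text.filtered.arpa
#   build_binary text.filtered.arpa text.filtered.binary
```

Whether the phrase table is the right source of vocabulary depends on the setup; the point is only that the vocabulary should cover every word the decoder may need to score during tuning.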