How do I save a Python NLTK alignment model for later use?

Mer*_*ako 13 python io nlp nltk machine-translation

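In Python, I'm using NLTK's alignment module to create word alignments between parallel texts. Aligning bitexts can be a time-consuming process, especially over large corpora. It would be nice to run the alignments as a batch job one day and reuse them later.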

from nltk import IBMModel1 as ibm
biverses = [...]  # list of AlignedSent objects
model = ibm(biverses, 20)

with open(path + "eng-taq_model.txt", 'w') as f:
    f.write(model.train(biverses, 20))  # makes empty file

Once I have created a model, how can I (1) save it to disk and (2) reuse it later?

alv*_*vas 7

The immediate answer is to pickle it, see https://wiki.python.org/moin/UsingPickle

But because IBMModel1 returns a lambda function, it's not possible to pickle it with the default pickle / cPickle (see https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L74 and https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L104)
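The failure can be reproduced without NLTK. The sketch below uses a hypothetical `ToyModel` class (not NLTK's implementation) and assumes the model keeps its probabilities in a `defaultdict` whose default factory is a lambda; the helper `try_pickle` is likewise made up for illustration.

```python
import pickle
from collections import defaultdict


class ToyModel:
    """Hypothetical stand-in for a model that stores a lambda-backed table."""

    def __init__(self):
        # Assumption: the table is a defaultdict with a lambda default factory.
        self.probabilities = defaultdict(lambda: 1e-12)
        self.probabilities[("haus", "house")] = 0.9


def try_pickle(obj):
    """Return True if the standard pickle module can serialise obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # The exception type for an unpicklable lambda varies by Python version.
        return False


model = ToyModel()

# The lambda inside the defaultdict blocks pickling of the whole object.
lambda_pickles = try_pickle(model)

# Copying the entries into a plain dict drops the lambda, so this succeeds.
plain_pickles = try_pickle(dict(model.probabilities))
```

Here `lambda_pickles` ends up `False` and `plain_pickles` ends up `True`, so the data itself is not the obstacle; only the function object is.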

So we'll use dill. First, install dill; see the related question "Can Python pickle lambda functions?"

$ pip install dill
$ python
>>> import dill as pickle

Then:

>>> import dill
>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
...
>>> exit()

To use the pickled model:

>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> bitexts = comtrans.aligned_sents()[:100]
>>> with open('model1.pk', 'rb') as fin:
...     ibm = pickle.load(fin)
... 
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']
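
The dump and load steps can be wrapped in two small helpers. This is a minimal sketch: `save_model` and `load_model` are hypothetical names, a plain dictionary stands in for a trained model, and the import falls back to the standard `pickle` when dill is not installed (in which case lambdas still cannot be serialised).

```python
import os
import tempfile

try:
    import dill as pickle  # can serialise lambdas; needs `pip install dill`
except ImportError:
    import pickle  # standard library fallback; cannot serialise lambdas


def save_model(model, path):
    """Write the model to path in binary mode."""
    with open(path, "wb") as fout:
        pickle.dump(model, fout)


def load_model(path):
    """Read a previously saved model back from path."""
    with open(path, "rb") as fin:
        return pickle.load(fin)


# Round trip with a plain dictionary standing in for a trained model.
toy_model = {("haus", "house"): 0.9, ("das", "the"): 0.8}
model_path = os.path.join(tempfile.mkdtemp(), "model1.pk")
save_model(toy_model, model_path)
restored = load_model(model_path)
```

Note the binary modes `'wb'` and `'rb'`: pickle data is bytes, so opening the file in text mode (as in the question) will not work.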

If you try to pickle the IBMModel1 object, which holds a lambda function, with the standard cPickle, you'll end up with this:

>>> import cPickle as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle function objects

(Note: the code snippets above were run with NLTK version 3.0.0)

In Python 3 with NLTK 3.0.0, you will face the same problem, because IBMModel1 returns a lambda function:

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
_pickle.PicklingError: Can't pickle <function IBMModel1.train.<locals>.<lambda> at 0x7fa37cf9d620>: attribute lookup <lambda> on nltk.align.ibm1 failed

>>> import dill
>>> with open('model1.pk', 'wb') as fout:
...     dill.dump(ibm, fout)
... 
>>> exit()

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> from nltk.corpus import comtrans
>>> with open('model1.pk', 'rb') as fin:
...     ibm = dill.load(fin)
... 
>>> bitexts = comtrans.aligned_sents()[:100]
>>> aligned_sent = ibm.aligned(bitexts[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'IBMModel1' object has no attribute 'aligned'
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']

(Note: in Python 3, pickle uses the C implementation that used to be cPickle; see http://docs.pythonsprints.com/python3_porting/py-porting.html)
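
For code that has to run on both Python 2 and Python 3, a common idiom is to try the old module name first. This is only a sketch of that idiom; the `sample` list is made-up data for the round trip.

```python
try:
    import cPickle as pickle  # Python 2: faster C implementation
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically

# Round trip through bytes to check that the chosen module works.
sample = ["Wiederaufnahme", "der", "Sitzungsperiode"]
round_tripped = pickle.loads(pickle.dumps(sample))
```

Either way, the standard module still cannot serialise a lambda, so dill remains necessary for the model above.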