4 python regex file-io nlp nltk
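I have this Python script in which I use the nltk library to parse, tokenize, tag and chunk some, let's say, random text from the web.
I need to format the output of chunked1, chunked2, chunked3 and write it to a file. These have type class 'nltk.tree.Tree'.
More specifically, I need to write only the lines that match the regular expressions chunkGram1, chunkGram2, chunkGram3.
How can I do that?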
#! /usr/bin/python2.7
import nltk
import re
import codecs
xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."]
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        #print tokenized
        #print tagged
        chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
        chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
        chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""
        chunkParser1 = nltk.RegexpParser(chunkGram1)
        chunked1 = chunkParser1.parse(tagged)
        chunkParser2 = nltk.RegexpParser(chunkGram2)
        chunked2 = chunkParser2.parse(tagged)
        chunkParser3 = nltk.RegexpParser(chunkGram3)
        chunked3 = chunkParser3.parse(tagged)
        #print chunked1
        #print chunked2
        #print chunked3
        # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:
        #     for i,line in enumerate(chunked1):
        #         if "JJ" in line:
        #             outfile.write(line)
        #         elif "NNP" in line:
        #             outfile.write(line)
processLanguage()
For now, when I try to run it, I get this error:
`Traceback (most recent call last):
  File "sentdex.py", line 47, in <module>
    processLanguage()
  File "sentdex.py", line 40, in processLanguage
    outfile.write(line)
  File "C:\Python27\lib\codecs.py", line 688, in write
    return self.writer.write(data)
  File "C:\Python27\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
TypeError: coercing to Unicode: need string or buffer, tuple found`
EDIT: After @Alvas's answer, I managed to do what I wanted. But now I would like to know how to strip all non-ASCII characters from a text corpus. Example:
#store cleaned file into variable
with open('path\to\file.txt', 'r') as infile:
    xstring = infile.readlines()
infile.close
def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])
for i, line in enumerate(xstring):
    line = remove_non_ascii(line)
#tokenize and tag text
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        print tokenized
        print tagged
processLanguage()
The above is taken from another answer here on S/O, but it doesn't seem to work. What might be wrong? The error I am getting is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
not in range(128)
First, please watch this video: https://www.youtube.com/watch?v=0Ef9GudbxXY

Now for the proper answer:
import re
import io
from nltk import pos_tag, word_tokenize, sent_tokenize, RegexpParser
xstring = u"An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."
chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
chunkParser1 = RegexpParser(chunkGram1)
chunked = [chunkParser1.parse(pos_tag(word_tokenize(sent)))
           for sent in sent_tokenize(xstring)]
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(str(chunk)+'\n\n')
[OUT]:
alvas@ubi:~$ python test2.py
Traceback (most recent call last):
  File "test2.py", line 18, in <module>
    fout.write(str(chunk)+'\n\n')
TypeError: must be unicode, not str
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
If you must stick with Python 2.7:
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(unicode(chunk)+'\n\n')
[OUT]:
alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
alvas@ubi:~$ python3 test2.py
Traceback (most recent call last):
  File "test2.py", line 18, in <module>
    fout.write(unicode(chunk)+'\n\n')
NameError: name 'unicode' is not defined
And strongly recommended if you must stick with py2.7, since the same code then runs on both Python 2 and Python 3:
from six import text_type
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(text_type(chunk)+'\n\n')
[OUT]:
alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
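The question asked for only the chunks that match the grammar, not the whole tree. Iterating directly over an nltk.tree.Tree yields plain (word, tag) tuples for unchunked tokens and nested Tree objects for chunks, which is why outfile.write(line) in the question raised the TypeError. Below is a minimal sketch of one way to keep only the matched chunks, assuming Python 3 and NLTK 3. The helper names extract_chunks, format_chunk and write_lines, and the hand-tagged sample sentence, are made up for illustration and are not part of the original answer.

```python
import io

from nltk import RegexpParser

# Same grammar as chunkGram1 in the question.
CHUNK_GRAMMAR = r"""Chunk: {<JJ\w?>*<NN>}"""


def extract_chunks(tagged_sentence, grammar=CHUNK_GRAMMAR, label="Chunk"):
    """Return every matched chunk as a list of (word, tag) pairs."""
    tree = RegexpParser(grammar).parse(tagged_sentence)
    # subtrees() walks the whole tree; the filter keeps only nodes
    # labelled "Chunk" and skips the root "S" node.
    return [subtree.leaves()
            for subtree in tree.subtrees(filter=lambda t: t.label() == label)]


def format_chunk(leaves):
    """Render one chunk as a single line such as 'electronic/JJ library/NN'."""
    return " ".join("{}/{}".format(word, tag) for word, tag in leaves)


def write_lines(path, lines):
    """Write one chunk per line as UTF-8 text (Python 3 str is already unicode)."""
    with io.open(path, "w", encoding="utf8") as fout:
        for line in lines:
            fout.write(line + "\n")


# Hand-tagged sample (an assumption for the demo) so that the sketch
# does not need the tokenizer or tagger models to be downloaded.
tagged = [("An", "DT"), ("electronic", "JJ"), ("library", "NN"),
          ("is", "VBZ"), ("a", "DT"), ("digital", "JJ"), ("repository", "NN")]

lines = [format_chunk(chunk) for chunk in extract_chunks(tagged)]
print(lines)
```

In the real script, tagged would come from pos_tag(word_tokenize(sent)) as in the answer above, and the same function can be called once per grammar (chunkGram1, chunkGram2, chunkGram3).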
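Your code has several problems, but the main culprit is that your for loop does not modify the contents of xstring.
I will address all the issues in your code here:
You cannot write paths like this with a single \, because \t will be interpreted as a tab character and \f as a form feed. You must double them. I know it is only an example here, but this kind of confusion comes up often: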
with open('path\\to\\file.txt', 'r') as infile:
    xstring = infile.readlines()
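To make the escaping issue concrete, here is a small sketch (Python 3 assumed) that compares the spelling from the question with three that keep the backslashes or avoid them. The describe helper is a made-up name used only for this demonstration.

```python
# In an ordinary string literal, '\t' is a tab and '\f' is a form feed.
broken = 'path\to\file.txt'       # spelling used in the question
doubled = 'path\\to\\file.txt'    # backslashes doubled, as suggested above
raw = r'path\to\file.txt'         # raw string: backslashes are kept literally
forward = 'path/to/file.txt'      # forward slashes are also accepted by open() on Windows


def describe(path):
    """Report length and whether control characters sneaked into the path."""
    return {
        "length": len(path),
        "has_tab": "\t" in path,
        "has_form_feed": "\f" in path,
    }


for name, value in [("broken", broken), ("doubled", doubled),
                    ("raw", raw), ("forward", forward)]:
    print(name, repr(value), describe(value))
```

The doubled and raw spellings produce the same string; the broken one is two characters shorter because each escape sequence collapses into a single control character.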
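The following infile.close line is wrong. It does not call the close method; it does not actually do anything. Furthermore, your file has already been closed by the with clause. If you see this line in any answer anywhere, please downvote that answer outright with a comment saying that file.close is wrong and should be file.close().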
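A short sketch of both points, assuming Python 3: the with block closes the file on exit, and a bare close without parentheses only looks the method up. The function names and the temporary sample file are assumptions made for the demonstration.

```python
import io
import os
import tempfile


def read_lines(path):
    """Read all lines; the with block closes the file when it exits."""
    with io.open(path, "r", encoding="utf8") as infile:
        lines = infile.readlines()
    # No explicit close is needed here: infile.closed is already True.
    return lines, infile.closed


def close_without_call(path):
    """Show that `infile.close` without parentheses leaves the file open."""
    infile = io.open(path, "r", encoding="utf8")
    infile.close                      # only references the method, calls nothing
    still_open = not infile.closed
    infile.close()                    # the actual call
    return still_open, infile.closed


# Temporary sample file so the sketch is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(sample_path, "w", encoding="utf8") as out:
    out.write("first line\nsecond line\n")

print(read_lines(sample_path))
print(close_without_call(sample_path))
```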
The following should work, but you need to be aware that it replaces every non-ASCII character with ' ', which will break words such as naïve and café:
def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])
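To illustrate that caveat, here is a sketch (Python 3 assumed) that puts the function above next to a hypothetical alternative, fold_to_ascii, which is not part of the original answer. It swaps in a different technique: Unicode NFKD decomposition followed by dropping whatever is still non-ASCII, so accented letters keep their base letter. Characters without a decomposition, such as curly quotes, are removed entirely, which may or may not be what you want.

```python
import unicodedata


def remove_non_ascii(line):
    """Version from the answer: every non-ASCII character becomes a space."""
    return ''.join([i if ord(i) < 128 else ' ' for i in line])


def fold_to_ascii(line):
    """Hypothetical alternative: strip accents instead of blanking letters."""
    # NFKD splits e.g. U+00E9 into 'e' plus a combining accent ...
    decomposed = unicodedata.normalize('NFKD', line)
    # ... and encoding with errors='ignore' drops the non-ASCII remainder.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


sample = u'na\u00efve caf\u00e9'        # "naïve café", written with escapes
print(repr(remove_non_ascii(sample)))  # 'na ve caf '
print(repr(fold_to_ascii(sample)))     # 'naive cafe'
```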
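But here is the reason your code fails with a Unicode exception: you are not modifying the elements of xstring at all. You are computing the line with the non-ASCII characters removed, yes, but that is a new value which is never stored back into the list: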
for i, line in enumerate(xstring):
    line = remove_non_ascii(line)
Instead it should be:
for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)
Or, my preferred and very Pythonic version:
xstring = [ remove_non_ascii(line) for line in xstring ]
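That said, these Unicode errors occur mainly because you are using Python 2.7 to handle pure Unicode text, something at which recent Python 3 versions are far ahead. So if you are just starting out with this task, I would recommend upgrading to Python 3.4+ soon.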
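As a rough sketch of what that looks like in Python 3: opening the file with an explicit encoding gives you text lines directly, so no implicit ASCII decoding takes place later. The utf8 encoding is an assumption here; the 0xe2 byte in the error message is consistent with UTF-8 punctuation such as curly quotes, but the real encoding of the corpus is not known from the question. The function names and the sample bytes are made up for the demonstration.

```python
import io
import os
import tempfile


def read_text_lines(path, encoding="utf8"):
    """Return the file's lines as text, decoded with the given encoding."""
    with io.open(path, "r", encoding=encoding) as infile:
        return infile.readlines()


def write_sample(path):
    """Create a small file containing a curly apostrophe (bytes e2 80 99)."""
    with open(path, "wb") as out:
        out.write(b"it\xe2\x80\x99s a library\n")


sample_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
write_sample(sample_path)

lines = read_text_lines(sample_path)
print(lines)
```

With the lines decoded this way, stripping non-ASCII characters becomes optional, and the list comprehension shown above can still be applied if plain ASCII is required.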