Python: untokenize a sentence

Question

There are many guides on how to tokenize a sentence, but I have not found any on how to do the opposite.

import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")
# The result I get is: ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']


Is there any function that reverts the tokenized sentence to its original state? For some reason, the function tokenize.untokenize() does not work.
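A note on that last point, assuming the function meant is the standard library's tokenize.untokenize(): that module tokenizes Python source code, not natural-language text, so it is not the inverse of nltk.word_tokenize. A minimal sketch of what it does handle (the sample source string is only an illustration):

```python
import io
import tokenize

# tokenize/untokenize operate on Python *source code*, not English sentences.
source = "x = 1 + 2\n"

# generate_tokens() expects a readline callable that returns str lines.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# untokenize() rebuilds source text from those Python tokens.
restored = tokenize.untokenize(tokens)
print(repr(restored))
```

A list of NLTK word tokens is just a list of strings, which is not the input this function is designed for.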

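Edit:

I know I can do something like the following, which probably solves the problem, but I am curious whether there is a built-in function for this: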

result = ' '.join(sentence).replace(' , ',',').replace(' .','.').replace(' !','!')
result = result.replace(' ?','?').replace(' : ',': ').replace(' \'', '\'')


Answer 1

ale*_*cxe 44

Nowadays (2016) NLTK has a built-in detokenizer, TreebankWordDetokenizer (the answer originally pointed to one called MosesDetokenizer):

from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'the quick brown'


You need a version of nltk that includes the detokenizer.

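As of April 10, 2018, Moses is no longer available in NLTK because of licensing issues: https://github.com/nltk/nltk/issues/2000 (4 upvotes)
But it seems to have moved here: https://github.com/alvations/sacremoses (2 upvotes)
When I use detokenize, I sometimes get an unwanted space before punctuation (before a period or comma). Has anyone else run into this, or does anyone know what the problem might be? (2 upvotes)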
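
For the stray space before punctuation described in the last comment, one possible workaround is a post-processing pass over the detokenized string. This is only a sketch using the standard library; the helper name strip_space_before_punct and the chosen set of punctuation marks are assumptions, not part of NLTK or sacremoses:

```python
import re

def strip_space_before_punct(text):
    """Remove whitespace that directly precedes common closing punctuation."""
    # Hypothetical helper: the character class below is an assumed choice.
    return re.sub(r"\s+([.,;:!?])", r"\1", text)

print(strip_space_before_punct("I found it , finally ."))
```

This does not explain why the detokenizer left the space in, and it would also close up cases where the space is intended (for example before a decimal such as " .5"), so treat it as a cleanup step for simple prose only.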

Answer 2

alv*_*vas 11

To reverse word_tokenize from nltk, I suggest looking at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize and doing some reverse engineering.

Short of doing crazy hacks on nltk, you can try this:

>>> import nltk
>>> import string
>>> nltk.word_tokenize("I've found a medicine for my disease.")
['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
>>> tokens = nltk.word_tokenize("I've found a medicine for my disease.")
>>> "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
"I've found a medicine for my disease."

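The one-liner above can be wrapped in a function. A sketch, with assumptions: the name join_tokens is made up for illustration, and the extra check for "n't" is an addition, because word_tokenize splits "don't" into "do" and "n't", which the startswith("'") test alone does not cover.

```python
import string

def join_tokens(tokens):
    """Join tokens with spaces, except before punctuation and clitics."""
    pieces = []
    for tok in tokens:
        attach = (
            tok.startswith("'")            # clitics such as 've, 's, 'll
            or tok == "n't"                # assumed extra case: do + n't
            or tok in string.punctuation   # single punctuation characters
        )
        # Attached tokens are glued to the previous piece; others get a space.
        pieces.append(tok if attach else " " + tok)
    return "".join(pieces).strip()

tokens = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
print(join_tokens(tokens))
```

Opening brackets and quotes are still glued to the preceding word by this rule, so it is only adequate for simple sentences.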

Answer 3

Ren*_*auf 6

Use token_utils.untokenize from here:

import re
def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .',  '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
         "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

tokenized = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
untokenize(tokenized)
# "I've found a medicine for my disease."


Archived:	11 years, 6 months ago
Views:	18113
Last activity:	6 years, 2 months ago