Ser*_*gey 92 python algorithm text split
Input: "tableapplechairtablecupboard..." many words
What would be an efficient algorithm to split such text into a list of words and get:
Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]
The first thing that comes to mind is to go through all possible words (starting with the first letter) and find the longest word possible, then continue from position=word_position+len(word).
P.S.
We have a list of all possible words.
The word "cupboard" can be "cup" and "board", so take the longest.
Language: Python, but the main thing is the algorithm itself.
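A minimal sketch of that greedy longest-match idea (the words set and the longest_match_split name are only for illustration; as the answers below show, a greedy scan can dead-end or pick a wrong split):
def longest_match_split(text, words):
    """Greedy sketch of the idea above: at each position take the longest
    dictionary word and jump past it (position = word_position + len(word))."""
    result = []
    position = 0
    max_len = max(len(w) for w in words)
    while position < len(text):
        # Try candidate lengths from longest to shortest.
        for length in range(min(max_len, len(text) - position), 0, -1):
            word = text[position:position + length]
            if word in words:
                result.append(word)
                position += length
                break
        else:
            raise ValueError("no dictionary word starts at position %d" % position)
    return result

# Tiny illustrative dictionary:
print(longest_match_split("tableapplechairtablecupboard",
                           {"table", "tab", "apple", "chair", "cup", "board", "cupboard"}))
# -> ['table', 'apple', 'chair', 'table', 'cupboard']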
Gen*_*man 166
A naive algorithm won't give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-word text.
(If you want an answer to your original question, which does not use word frequency, you need to refine what exactly is meant by "longest word": is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.)
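For instance (an illustrative sketch, not part of the original answer; words is the same frequency-ordered word list loaded in the code further down), two possible redefinitions of that line could look like this:
# Illustrative only: pick one of these redefinitions of wordcost,
# depending on what "best split" should mean.
words = open("words-by-frequency.txt").read().split()

# (a) fewest words wins: every dictionary word has the same cost
wordcost = dict((k, 1.0) for k in words)

# (b) prefer long words: short words are made relatively expensive
wordcost = dict((k, 1.0 / len(k)) for k in words)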
The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.
Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probabilities of the individual words, and it is easy to compute with dynamic programming. Instead of directly using the probability, we use a cost defined as the logarithm of the inverse of the probability in order to avoid overflows.
from math import log
# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))
You can use it with
s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))
I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.
Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.
Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.
After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so whats the fastest way of extraction thx a lot.
Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.
After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.
As you can see, it is essentially flawless. The most important part is to make sure your word list was trained on a corpus similar to what you will actually encounter, otherwise the results will be very bad.
The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.
If you need to process a very large consecutive string, it would be reasonable to split the string to avoid excessive memory usage. For example, you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will almost certainly have no effect on the quality; a rough sketch of this idea follows below.
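A rough sketch of that chunking idea, reusing infer_spaces from above (the helper name and the one-sided way the margin is handled here are my own simplification, not part of the original answer):
def infer_spaces_chunked(s, chunk=10000, margin=1000):
    """Sketch: segment a very long string block by block.  Each block gets
    `margin` extra characters of look-ahead so words near the cut are
    segmented with context; the next block restarts exactly where the kept
    words end, so nothing is lost or duplicated."""
    out = []
    start = 0
    while start < len(s):
        window = s[start:start + chunk + margin]
        words = infer_spaces(window).split()
        if start + chunk + margin >= len(s):
            out.extend(words)          # last block: keep everything
            break
        # Keep only words that end within the first `chunk` characters;
        # whatever falls inside the margin is redone in the next block.
        kept_len = 0
        for w in words:
            if kept_len + len(w) > chunk:
                break
            out.append(w)
            kept_len += len(w)
        if kept_len == 0:              # safety net: avoid an infinite loop
            out.append(words[0])
            kept_len = len(words[0])
        start += kept_len
    return " ".join(out)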
ker*_*son 38
Based on the excellent work in the top answer, I have created a pip package for easy use.
>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']
To install, run pip install wordninja.
The only differences are minor. It returns a list rather than a str, it works in python3, it includes the word list, and it properly splits even when non-alpha chars are present (like underscores, dashes, and so on).
Thanks again to Generic Human!
https://github.com/keredson/wordninja
unu*_*tbu 16
Here is a solution using recursive search:
def find_words(instring, prefix = '', words = None):
    if not instring:
        return []
    if words is None:
        words = set()
        with open('/usr/share/dict/words') as f:
            for line in f:
                words.add(line.strip())
    if (not prefix) and (instring in words):
        return [instring]
    prefix, suffix = prefix + instring[0], instring[1:]
    solutions = []
    # Case 1: prefix in solution
    if prefix in words:
        try:
            solutions.append([prefix] + find_words(suffix, '', words))
        except ValueError:
            pass
    # Case 2: prefix not in solution
    try:
        solutions.append(find_words(suffix, prefix, words))
    except ValueError:
        pass
    if solutions:
        return sorted(solutions,
                      key = lambda solution: [len(word) for word in solution],
                      reverse = True)[0]
    else:
        raise ValueError('no solution')

print(find_words('tableapplechairtablecupboard'))
print(find_words('tableprechaun', words = set(['tab', 'table', 'leprechaun'])))
yields
['table', 'apple', 'chair', 'table', 'cupboard']
['tab', 'leprechaun']
mik*_*iku 11
Using a trie data structure, which holds the list of possible words, it would not be too complicated to do the following:
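A minimal sketch of one way such a trie-based longest-match walk could look (the Trie class and trie_longest_split are illustrative names, not from a library, and the greedy strategy shares the limitations discussed elsewhere in this thread):
class Trie(object):
    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

def trie_longest_split(text, trie):
    """Walk the trie from each position and take the longest word found."""
    result, i = [], 0
    while i < len(text):
        node, j, last_end = trie, i, None
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.is_word:
                last_end = j          # remember the longest word seen so far
        if last_end is None:
            raise ValueError("no dictionary word starts at position %d" % i)
        result.append(text[i:last_end])
        i = last_end
    return result

# Example with a tiny illustrative dictionary:
trie = Trie()
for w in ("table", "apple", "chair", "cup", "cupboard", "board"):
    trie.insert(w)
print(trie_longest_split("tableapplechairtablecupboard", trie))
# -> ['table', 'apple', 'chair', 'table', 'cupboard']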
Unutbu's solution was quite close, but I find the code difficult to read, and it did not yield the expected result. Generic Human's solution has the drawback that it needs word frequencies, which does not fit all use cases.
Here is a simple solution using a Divide and Conquer algorithm.
find_words('cupboard') will return ['cupboard'] rather than ['cup', 'board'] (assuming cupboard, cup and board are in the dictionary). find_words('charactersin') could return ['characters', 'in'] or it might return ['character', 'sin'] (as seen below). You could very easily modify the algorithm to return all optimal solutions. Code:
words = set()
with open('/usr/share/dict/words') as f:
    for line in f:
        words.add(line.strip())

solutions = {}
def find_words(instring):
    # First check if instring is in the dictionary
    if instring in words:
        return [instring]
    # No... But maybe it's a result we already computed
    if instring in solutions:
        return solutions[instring]
    # Nope. Try to split the string at every position to recursively search for results
    best_solution = None
    for i in range(1, len(instring)):
        part1 = find_words(instring[:i])
        part2 = find_words(instring[i:])
        # Both parts MUST have a solution
        if part1 is None or part2 is None:
            continue
        solution = part1 + part2
        # Is the solution found "better" (fewer words) than the previous one?
        if best_solution is None or len(solution) < len(best_solution):
            best_solution = solution
    # Remember (memoize) this solution to avoid having to recompute it
    solutions[instring] = best_solution
    return best_solution
It takes about 5 seconds on my 3GHz machine:
result = find_words("thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot")
assert(result is not None)
print(' '.join(result))
the reis masses of text information of peoples comments which is parsed from html but there are no delimited character sin them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so whats the fastest way of extraction thx a lot
The answer by /sf/users/106108271/ is great. But the best implementation I have ever seen was written by Peter Norvig himself in his book "Beautiful Data".
Before I paste his code, let me expand on why Norvig's method is more accurate (although a little slower and longer in terms of code).
1) The data is a bit better, both in terms of size and in terms of precision (he uses a word count rather than a simple ranking). 2) More importantly, it is the logic behind n-grams that really makes the approach so accurate.
The example he provides in his book is the problem of splitting the string 'sitdown'. Now a non-bigram method of string splitting would consider p('sit') * p('down'), and if this is less than p('sitdown'), which will quite often be the case, it will NOT split it, but we would want it to (most of the time).
However, when you have the bigram model, you can value p('sit down') as a bigram vs p('sitdown'), and the former wins. Basically, if you don't use bigrams, the probabilities of the words you are splitting are treated as independent, which is not the case; some words are more likely to appear one after the other. Unfortunately, those are also the words that often get stuck together and confuse the splitter.
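As a toy illustration of that argument (the counts below are entirely made up; only the direction of the comparisons matters):
# Hypothetical counts, for illustration only.
unigram = {'sit': 300000, 'down': 800000, 'sitdown': 2000}
bigram  = {'sit down': 150000}
N = 10**9          # pretend total number of tokens

def p1(w):                      # unigram probability
    return unigram.get(w, 0) / float(N)

def p2(w, prev):                # bigram-conditioned probability
    return bigram.get(prev + ' ' + w, 0) / float(unigram[prev])

# Unigram model: the split loses, because p('sit')*p('down') is tiny.
print(p1('sit') * p1('down') < p1('sitdown'))         # True -> keeps 'sitdown'

# Bigram model: p('sit')*p('down'|'sit') beats p('sitdown').
print(p1('sit') * p2('down', 'sit') > p1('sitdown'))  # True -> splits it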
Here is the link to the data (it is data for 3 separate problems, and segmentation is only one of them; please read the chapter for details): http://norvig.com/ngrams/
And here is the link to the code: http://norvig.com/ngrams/ngrams.py
These links have been up for a while, but I'll copy-paste the segmentation part of the code here anyway.
import re, string, random, glob, operator, heapq
from collections import defaultdict
from math import log10

def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

def test(verbose=None):
    """Run some tests, taken from the chapter.
    Since the hillclimbing algorithm is randomized, some tests may fail."""
    import doctest
    print 'Running tests...'
    doctest.testfile('ngrams-test.txt', verbose=verbose)

################ Word Segmentation (p. 223)

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key):
        if key in self: return self[key]/self.N
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw = Pdist(datafile('count_1w.txt'), N, avoid_long_words)

#### segment2: second version, with bigram counts, (p. 226-227)

def cPw(word, prev):
    "Conditional probability of word, given previous word."
    try:
        return P2w[prev + ' ' + word]/float(Pw[prev])
    except KeyError:
        return Pw(word)

P2w = Pdist(datafile('count_2w.txt'), N)

@memo
def segment2(text, prev='<S>'):
    "Return (log P(words), words), where words is the best segmentation."
    if not text: return 0.0, []
    candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first))
                  for first,rem in splits(text)]
    return max(candidates)

def combine(Pfirst, first, (Prem, rem)):
    "Combine first and rem results into one (probability, words) pair."
    return Pfirst+Prem, [first]+rem