Is a string randomly generated or a plausible English word?

Question

ike*_*kel 6 java text data-mining text-mining

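I have a text corpus containing a number of strings. Some of these strings are English words and some are random, such as VmsVKmGMY6eQE4eMI. There is no limit on the number of characters in each string.

Is there a way to test whether a string is an English word? I am looking for an algorithm that can do this. The code is in Java, and I would rather not implement an additional dictionary.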

Answer 1

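I had to solve a closely related problem for a source code mining project. Although the package is written in Python rather than Java, it seemed worth mentioning here in case it is still useful. The package is Nostril (for "Nonsense String Evaluator"), and it is meant to determine whether strings extracted during source code mining are likely to be class/function/variable/etc. identifiers or random gibberish. Nostril does not use a dictionary, but it does include a fairly large table of n-gram frequencies to support its probabilistic assessment of text strings.

Example: the following code,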

from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))

will produce the following output:

bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense

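The project is on GitHub, and I welcome contributions. If you really need a Java implementation, perhaps we could make Nostril compatible with Python 2.7, and you could try running it from Java using Jython.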
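To give a feel for how an n-gram frequency table can rate a string without a dictionary, here is a minimal sketch of the general idea. It is not Nostril's actual implementation: the tiny `SAMPLE_TEXT`, the function names `train` and `score`, and the choice of letter bigrams with add-one smoothing are all assumptions made for illustration. A usable model would be trained on a much larger English corpus.

```python
import math
import re
from collections import Counter

# Tiny stand-in training text (an assumption for illustration only);
# a realistic model needs a far larger English corpus.
SAMPLE_TEXT = """
the quick brown fox jumps over the lazy dog while the string of words
in this sentence is there to teach the model which letter pairs are
common in english text and which ones are rare or never seen at all
"""


def bigrams(word):
    """Return the overlapping two-letter sequences of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def train(text):
    """Count letter bigrams inside each word of the training text."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        counts.update(bigrams(word))
    return counts


def score(s, counts):
    """Average log-probability of the string's letter bigrams.

    Higher (closer to zero) means the letter pairs look more like the
    training text. Non-letters are dropped before scoring.
    """
    letters = re.sub(r"[^a-z]", "", s.lower())
    pairs = bigrams(letters)
    if not pairs:
        return float("-inf")  # too short to judge
    # Add-one smoothing over all 26 * 26 possible letter pairs.
    total = sum(counts.values()) + 26 * 26
    return sum(math.log((counts[p] + 1) / total) for p in pairs) / len(pairs)


MODEL = train(SAMPLE_TEXT)

if __name__ == "__main__":
    for s in ["string", "getint", "zxqjvkw", "VmsVKmGMY6eQE4eMI"]:
        print("{}: {:.2f}".format(s, score(s, MODEL)))
```

To turn the score into a yes/no decision, you would pick a cutoff by scoring a set of strings you have already labelled as real or random. The same counting approach carries over to Java with a `HashMap<String, Integer>`.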

Answer 2

War*_*ord 1

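If you mean some kind of rule of thumb for telling English words apart from random text, then no, there isn't one. To get reasonable accuracy you will need to query an external source, whether that is the web, a dictionary, or a service.

If you only need to check whether a word exists, I would suggest WordNet. It is very simple to use and has a nice Java API called JWNL, which makes querying the WordNet dictionary a breeze.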
