Discovering "templates" in a given text?

Leg*_*end 5 language-agnostic nlp machine-learning data-mining nltk

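I have a large amount of text and I am trying to discover the most frequently occurring templates. I am thinking of solving this with the N-gram approach, and in fact it has also been suggested as a solution to this problem, but my requirement is slightly different. Just to clarify, I have some text like this: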

I wake up every day morning and read the newspaper and then go to work
I wake up every day morning and eat my breakfast and then go to work
I am not sure that this is the solution but I will try
I am not sure that this is the answer but I will try
I am not feeling well today but I will get the work done and deliver it tomorrow
I was not feeling well yesterday but I will get the work done and let you know by tomorrow

and I am trying to extract "templates" like these:

I wake up every day morning and ... and then go to work
I am not sure that this is the ... but I will try
I ... not feeling well ... but I will get the work done and ... tomorrow

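I am looking for an approach that scales to millions of lines of text, so I am wondering whether I can adapt the same N-gram approach to this problem, or whether there are other options?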

Fre*_*Foo 6

Millions of lines of text isn't really a big number :)

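What you are looking for is at least similar to collocation discovery. You could try computing pointwise mutual information (PMI) over n-grams. See Manning & Schütze (1999) for a treatment of the problem and other approaches to solving it.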
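
To make the PMI suggestion concrete, here is a minimal sketch in plain Python, restricted to bigrams. It is not taken from the answer: the function name `bigram_pmi`, the `min_count` cutoff, and the lowercase whitespace tokenization are all assumptions made for illustration. The score for a pair is log2(P(x, y) / (P(x) P(y))), with probabilities estimated from raw counts.

```python
import math
from collections import Counter


def bigram_pmi(lines, min_count=2):
    """Return {(w1, w2): PMI} for adjacent word pairs seen at least min_count times.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    unigrams = Counter()
    bigrams = Counter()
    for line in lines:
        tokens = line.lower().split()
        unigrams.update(tokens)
        # Pair each token with its right-hand neighbour within the same line.
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimates are unreliable
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


sample = [
    "I wake up every day morning and read the newspaper and then go to work",
    "I wake up every day morning and eat my breakfast and then go to work",
    "I am not sure that this is the solution but I will try",
    "I am not sure that this is the answer but I will try",
    "I am not feeling well today but I will get the work done and deliver it tomorrow",
    "I was not feeling well yesterday but I will get the work done and let you know by tomorrow",
]

scores = bigram_pmi(sample)
# Print the ten highest-scoring pairs.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:10]:
    print(pair, round(score, 2))
```

PMI tends to reward pairs made of rare words, which is why the sketch drops pairs below `min_count`. Turning high-scoring n-grams into templates with gaps such as `...` would need a further step that this sketch does not attempt.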