Generating N-grams from a sentence

Pre*_*bia 28 java lucene nlp n-gram

How can I generate the n-grams of a string such as:

String Input="This is my car."

I want to generate n-grams from this input:

Input Ngram size = 3

The output should be:

This
is
my
car

This is
is my
my car

This is my
is my car

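Please give me some ideas on how to implement this in Java, or point me to a library that already does it.

I have tried NGramTokenizer, but it produces n-grams of character sequences, and I want n-grams of word sequences.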

aio*_*obe 42

I believe this does what you want:

import java.util.*;

public class Test {

    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i+n));
        return ngrams;
    }

    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}

Output:

This
is
my
car.

This is
is my
my car.

This is my
is my car.
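The output above keeps the trailing period on `car.`, whereas the question asks for `car`. A minimal sketch of one way to handle that, assuming a "word" is any run of letters or digits (so `don't` would split into `don` and `t`). The class name `WordNgrams` and the methods `tokenize` and `allNgrams` are made up for illustration and are not part of the original answer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordNgrams {

    // Assumption: a word is a run of letters or digits; everything else is a separator.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Collects the words of the text in order, dropping punctuation and whitespace.
    public static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    // Returns all word n-grams of sizes 1..maxN, shortest size first.
    public static List<String> allNgrams(int maxN, String text) {
        List<String> words = tokenize(text);
        List<String> result = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= words.size(); i++) {
                result.add(String.join(" ", words.subList(i, i + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String ngram : allNgrams(3, "This is my car.")) {
            System.out.println(ngram);
        }
    }
}
```

For the input from the question this prints the nine n-grams listed there, ending in `car` rather than `car.`.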

An "on-demand" solution implemented as an iterator:

import java.util.Iterator;

class NgramIterator implements Iterator<String> {

    String[] words;
    int pos = 0, n;

    public NgramIterator(int n, String str) {
        this.n = n;
        words = str.split(" ");
    }

    public boolean hasNext() {
        return pos < words.length - n + 1;
    }

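    public String next() {
        // Honor the Iterator contract instead of failing with an index error.
        if (!hasNext())
            throw new java.util.NoSuchElementException();
        StringBuilder sb = new StringBuilder();
        for (int i = pos; i < pos + n; i++)
            sb.append(i > pos ? " " : "").append(words[i]);
        pos++;
        return sb.toString();
    }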

    public void remove() {
        throw new UnsupportedOperationException();
    }
}
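On Java 8 or later, the same on-demand idea can also be sketched with a lazy `Stream`. This is an alternative formulation, not part of the original answer; the class name `NgramStream` is made up for illustration, and rejecting a non-positive `n` is an assumption about the desired behavior:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class NgramStream {

    // Lazily yields the word n-grams of size n; each one is built only when consumed.
    public static Stream<String> ngrams(int n, String str) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> words = Arrays.asList(str.split(" "));
        // When n exceeds the word count the range is empty, so the stream is empty too.
        return IntStream.rangeClosed(0, words.size() - n)
                .mapToObj(i -> String.join(" ", words.subList(i, i + n)));
    }

    public static void main(String[] args) {
        List<String> bigrams = ngrams(2, "This is my car").collect(Collectors.toList());
        System.out.println(bigrams);
    }
}
```

Like the iterator, this splits on single spaces only, so punctuation stays attached to the neighboring word.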


Sha*_*ore 24

You are looking for ShingleFilter.

Update: the link points to version 3.0.2. In newer versions of Lucene, this class may live in a different package.


Lan*_*dei 6

This code returns an array of all the n-grams of the given length:

public static String[] ngrams(String s, int len) {
    String[] parts = s.split(" ");
    String[] result = new String[parts.length - len + 1];
    for(int i = 0; i < parts.length - len + 1; i++) {
       StringBuilder sb = new StringBuilder();
       for(int k = 0; k < len; k++) {
           if(k > 0) sb.append(' ');
           sb.append(parts[i+k]);
       }
       result[i] = sb.toString();
    }
    return result;
}

For example:

System.out.println(Arrays.toString(ngrams("This is my car", 2)));
//--> [This is, is my, my car]
System.out.println(Arrays.toString(ngrams("This is my car", 3)));
//--> [This is my, is my car] 

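  • What do you expect in these cases? I would suggest putting a check at the beginning of the method and returning an empty array. In general, I see few SO answers with sophisticated error handling. (6 upvotes)
  • `ngrams("This is my car", -3)` (sorry, couldn't resist) (2 upvotes)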
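The guard suggested in the comments could look roughly like this. As posted, the method returns an array of eight empty strings for `ngrams("This is my car", -3)` and throws `NegativeArraySizeException` once `len` exceeds the word count by more than one. The sketch below returns an empty array in both situations; the class name `SafeNgrams` is made up for illustration, and treating `len <= 0` as "no n-grams" is an assumption rather than something the answer specifies:

```java
import java.util.Arrays;

class SafeNgrams {

    // Returns all word n-grams of length len, or an empty array when none can exist.
    public static String[] ngrams(String s, int len) {
        String[] parts = s.split(" ");
        // Guard: a non-positive length, or one longer than the sentence, has no n-grams.
        if (len <= 0 || len > parts.length) {
            return new String[0];
        }
        String[] result = new String[parts.length - len + 1];
        for (int i = 0; i < result.length; i++) {
            // Join the len consecutive words starting at index i.
            result[i] = String.join(" ", Arrays.copyOfRange(parts, i, i + len));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(ngrams("This is my car", 2)));
        System.out.println(Arrays.toString(ngrams("This is my car", -3)));
        System.out.println(Arrays.toString(ngrams("This is my car", 10)));
    }
}
```

Throwing `IllegalArgumentException` for a non-positive `len` would be an equally reasonable choice; which one fits depends on how callers are expected to use the method.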