How can I classify words in text into categories such as names, numbers, money, dates, etc.?

Question

Ren*_*ani 1 java nlp classification named-entity-recognition text-mining

I asked a few questions about text mining a week ago and was somewhat confused, but now I know what I want to do.

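Situation: I have many downloaded pages with HTML content. Some of them, for example, may be text from a blog. They are unstructured and come from different sites.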

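What I want to do: I will split all the words on whitespace, and I want to classify each word, or group of words, into predefined categories such as names, numbers, phones, emails, URLs, dates, money, temperatures, etc.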
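
As a rough illustration of the task described above, categories with a regular surface form (emails, URLs, dates, money, temperatures, numbers) can often be caught with plain regular expressions, while names generally need a trained model. The sketch below uses only the Java standard library. The class name, the category labels, and the patterns themselves are assumptions made up for this example, and the patterns are deliberately loose.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class PatternTokenClassifier {
    // Hypothetical category names and loose patterns; tune them for real data.
    // Insertion order matters: more specific patterns are tried first.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("EMAIL", Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+"));
        PATTERNS.put("URL", Pattern.compile("(https?://|www\\.)\\S+"));
        PATTERNS.put("DATE", Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{2,4}|\\d{4}-\\d{2}-\\d{2}"));
        PATTERNS.put("MONEY", Pattern.compile("(R\\$|US\\$|\\$|\u20AC)\\d+([.,]\\d+)*"));
        PATTERNS.put("TEMPERATURE", Pattern.compile("-?\\d+([.,]\\d+)?\u00B0[CF]"));
        PATTERNS.put("NUMBER", Pattern.compile("-?\\d+([.,]\\d+)*"));
    }

    /** Returns the first category whose pattern matches the whole token, or "OTHER". */
    public static String classify(String token) {
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(token).matches()) {
                return entry.getKey();
            }
        }
        return "OTHER";
    }

    public static void main(String[] args) {
        String text = "Email joao@example.com or visit http://example.com "
                + "before 2011-08-15 , it costs R$50,00 at 30\u00B0C";
        // Split on runs of whitespace, then label each token independently.
        for (String token : text.trim().split("\\s+")) {
            System.out.println(token + " -> " + classify(token));
        }
    }
}
```

Tokens that match none of the patterns fall through to OTHER, which is where a statistical name finder would take over.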

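What I know: I know the concepts of, or have at least heard about, natural language processing, named entity recognizers, POS tagging, naive Bayes, HMMs, training, and many other things used for classification. But there are several NLP libraries with different classifiers and different ways of doing this, and I don't know which one to use or how to proceed.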

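What I need: I need a code example, from a classifier, an NLP library, or anything else, that can classify each word of a text rather than the text as a whole. Something like this: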

```
// This is pseudo-code for what I want, not an implementation

classifier.trainFromFile("file-with-train-words.txt");
words = text.split(" ");
for (String word : words) {
    classifiedWord = classifier.classify(word);
    System.out.println(classifiedWord.getType());
}
```

Can anyone help me? I'm confused by the various APIs, classifiers, and algorithms.

Answer 1

wco*_*len 5

You should try Apache OpenNLP. It is easy to use and customize.

If you are doing this for Portuguese, the project documentation has information on how to do it using the Amazonia Corpus. The supported types are:

Person, Organization, Group, Place, Event, ArtProd, Abstract, Thing, Time and Numeric.

Download OpenNLP and the Amazonia Corpus. Extract them and copy the file amazonia.ad to the apache-opennlp-1.5.1-incubating folder.
Run the TokenNameFinderConverter tool to convert the Amazonia corpus to the OpenNLP format:
```
bin/opennlp TokenNameFinderConverter ad -encoding ISO-8859-1 -data amazonia.ad -lang pt > corpus.txt
```
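Train the model (change the encoding to the encoding of the corpus.txt file, which should be your system default encoding. This command can take a few minutes):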
```
bin/opennlp TokenNameFinderTrainer -lang pt -encoding UTF-8 -data corpus.txt -model pt-ner.bin -cutoff 20
```

Run it from the command line (you should enter only one sentence at a time, and the tokens should be separated by spaces):

```
$ bin/opennlp TokenNameFinder pt-ner.bin
Loading Token Name Finder model ... done (1,112s)
Meu nome é João da Silva , moro no Brasil . Trabalho na Petrobras e tenho 50 anos .
Meu nome é <START:person> João da Silva <END> , moro no <START:place> Brasil <END> . <START:abstract> Trabalho <END> na <START:abstract> Petrobras <END> e tenho <START:numeric> 50 anos <END> .
```

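If the annotated line printed by the command-line tool needs to be consumed by another program, one option is to pull the entities out with a regular expression. The following is a minimal sketch that assumes the output always has the shape shown above, with a space after the opening marker and before the closing one. The class and method names are made up for this example and are not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnnotatedOutputParser {
    // Matches "<START:type> tokens <END>"; group 1 is the type, group 2 the tokens.
    private static final Pattern ENTITY =
            Pattern.compile("<START:([^>]+)>\\s+(.*?)\\s+<END>");

    /** Returns one "type<TAB>text" string per annotated entity, in order of appearance. */
    public static List<String> parse(String annotatedLine) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(annotatedLine);
        while (m.find()) {
            entities.add(m.group(1) + "\t" + m.group(2));
        }
        return entities;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only.
        String line = "Meu nome \u00E9 <START:person> Jo\u00E3o da Silva <END> , "
                + "moro no <START:place> Brasil <END> .";
        for (String entity : parse(line)) {
            System.out.println(entity);
        }
    }
}
```

When the Java API is available, reading the spans directly is likely more robust than parsing printed text.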

Run it using the API:

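```
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class PortugueseNameFinder {
  public static void main(String[] args) throws IOException {
    // declare the model outside the try block so it is still in scope below
    TokenNameFinderModel model;

    // try-with-resources closes the stream even if loading the model fails
    try (InputStream modelIn = new FileInputStream("pt-ner.bin")) {
      model = new TokenNameFinderModel(modelIn);
    }

    // load the name finder
    NameFinderME nameFinder = new NameFinderME(model);

    // pass the token array to the name finder
    String[] toks = {"Meu","nome","é","João","da","Silva",",","moro","no","Brasil",".","Trabalho","na","Petrobras","e","tenho","50","anos","."};

    // the Span objects will show the start and end of each name, also the type
    Span[] nameSpans = nameFinder.find(toks);

    // start is the first token index of the name, end is one past the last
    for (Span span : nameSpans) {
      System.out.println(span.getType() + " [" + span.getStart() + ", " + span.getEnd() + ")");
    }
  }
}
```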

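To make the span indexes concrete, here is a small self-contained sketch of how a start/end pair maps back to the tokens. It assumes the start index is inclusive and the end index is exclusive. NameSpan is a hypothetical stand-in written for this example so that it runs without the OpenNLP jar, and the two spans in main are written by hand rather than produced by a model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SpanToText {
    /** Hypothetical stand-in for a name span: start inclusive, end exclusive (token indexes). */
    public static final class NameSpan {
        final int start;
        final int end;
        final String type;

        public NameSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Joins the tokens covered by each span and prefixes them with the span type. */
    public static List<String> describe(String[] tokens, List<NameSpan> spans) {
        List<String> out = new ArrayList<>();
        for (NameSpan span : spans) {
            String text = String.join(" ", Arrays.copyOfRange(tokens, span.start, span.end));
            out.add(span.type + ": " + text);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] toks = {"Meu", "nome", "\u00E9", "Jo\u00E3o", "da", "Silva", ",",
                "moro", "no", "Brasil", "."};
        // Hand-written spans for illustration only.
        List<NameSpan> spans = Arrays.asList(
                new NameSpan(3, 6, "person"),
                new NameSpan(9, 10, "place"));
        for (String line : describe(toks, spans)) {
            System.out.println(line);
        }
    }
}
```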

To evaluate your model you can use 10-fold cross-validation (only available in 1.5.2-INCUBATOR; to use it today you need the SVN trunk). It can take several hours:
```
bin/opennlp TokenNameFinderCrossValidator -lang pt -encoding UTF-8 -data corpus.txt -cutoff 20
```
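
For readers unfamiliar with the term, k-fold cross-validation splits the corpus into k parts, trains on k-1 of them, tests on the remaining one, and repeats so that every part is held out once. The sketch below only illustrates that splitting idea with a simple round-robin assignment. It is an assumption made for this example and not a description of how the OpenNLP tool partitions the data internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KFoldSplit {
    /** Assigns item i to fold i % k and returns the k folds. */
    public static <T> List<List<T>> folds(List<T> items, int k) {
        List<List<T>> result = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            result.add(new ArrayList<T>());
        }
        for (int i = 0; i < items.size(); i++) {
            result.get(i % k).add(items.get(i));
        }
        return result;
    }

    /** Training data for one round: every fold except the held-out one. */
    public static <T> List<T> trainingSet(List<List<T>> folds, int heldOut) {
        List<T> train = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++) {
            if (f != heldOut) {
                train.addAll(folds.get(f));
            }
        }
        return train;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("s1", "s2", "s3", "s4", "s5", "s6", "s7");
        List<List<String>> folds = folds(sentences, 3);
        for (int round = 0; round < folds.size(); round++) {
            System.out.println("test=" + folds.get(round)
                    + " train=" + trainingSet(folds, round));
        }
    }
}
```
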
Improve precision/recall by using custom feature generation (check the documentation), for example by adding a name dictionary.
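
As a reminder of what these two numbers measure, the sketch below computes precision, recall, and F1 from a set of gold spans and a set of predicted spans. It assumes exact-match scoring, where a predicted span counts as correct only if its start, end, and type all agree with a gold span. Spans are encoded as plain strings such as "3-6-person", which is a convention invented for this example, as is the class name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SpanScores {
    /** Number of predicted spans that also appear in the gold set. */
    static int truePositives(Set<String> gold, Set<String> predicted) {
        Set<String> both = new HashSet<>(predicted);
        both.retainAll(gold);
        return both.size();
    }

    /** Fraction of predicted spans that are correct. */
    public static double precision(Set<String> gold, Set<String> predicted) {
        return predicted.isEmpty() ? 0.0
                : (double) truePositives(gold, predicted) / predicted.size();
    }

    /** Fraction of gold spans that were found. */
    public static double recall(Set<String> gold, Set<String> predicted) {
        return gold.isEmpty() ? 0.0
                : (double) truePositives(gold, predicted) / gold.size();
    }

    /** Harmonic mean of precision and recall. */
    public static double f1(Set<String> gold, Set<String> predicted) {
        double p = precision(gold, predicted);
        double r = recall(gold, predicted);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Made-up spans for illustration, encoded as "start-end-type".
        Set<String> gold = new HashSet<>(Arrays.asList(
                "3-6-person", "9-10-place", "13-14-organization", "16-18-numeric"));
        Set<String> predicted = new HashSet<>(Arrays.asList(
                "3-6-person", "9-10-place", "11-12-abstract"));
        System.out.println("precision=" + precision(gold, predicted));
        System.out.println("recall=" + recall(gold, predicted));
        System.out.println("f1=" + f1(gold, predicted));
    }
}
```

A name dictionary would typically help recall by recovering names the model missed, at some risk to precision if the dictionary contains ambiguous words.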

Archived:	14 years, 4 months ago
Views:	7067
Last activity:	8 years, 2 months ago