Extracting nationalities and countries from text

use*_*258 6 python nlp nltk pos-tagger

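I want to use nltk to extract all mentions of countries and nationalities from a text. I used POS tagging to extract all tokens tagged as GPE, but the results are not satisfactory.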

import nltk

abstract = "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO.Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with Graves' disease and TO (cases) and 147 patients with Graves' disease without TO (controls). "

# Tokenize, POS-tag, then run NLTK's named-entity chunker
tokens = nltk.tokenize.wordpunct_tokenize(abstract)
pos_tags = nltk.pos_tag(tokens)
nes = nltk.ne_chunk(pos_tags)

places = []
for ne in nes:
    # Named entities come back as subtrees; keep only those labelled GPE
    if isinstance(ne, nltk.tree.Tree) and ne.label() == 'GPE':
        places.append(' '.join(token for token, tag in ne.leaves()))

# Fall back to a placeholder only after the whole text has been scanned
if not places:
    places.append("N/A")

The result I get is:

['Thyroid', 'Australian', 'Caucasian', 'Graves']

Some of these are nationalities, but others are just nouns.

So what am I doing wrong, or is there another way to extract this information?

use*_*258 5

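So, after the helpful comments, I looked more closely at various NER tools to find the best one for recognizing mentions of nationalities and countries, and found that spaCy has a NORP entity type that extracts nationalities effectively. https://spacy.io/docs/usage/entity-recognition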
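For anyone who wants to try this, here is a minimal sketch of how the filtering could look. It assumes spaCy is installed along with an English model (`en_core_web_sm` is my assumption here; any English model that emits GPE and NORP labels should do). The helper names `extract_places` and `extract_places_from_text` are made up for illustration and are not part of spaCy.

```python
def extract_places(doc, labels=("GPE", "NORP")):
    """Return the text of every entity in `doc` whose label is in `labels`.

    `doc` is expected to behave like a spaCy Doc: it exposes `.ents`,
    and each entity has `.text` and `.label_`.
    """
    return [ent.text for ent in doc.ents if ent.label_ in labels]


def extract_places_from_text(text, model="en_core_web_sm"):
    """Run a spaCy pipeline over `text` and keep GPE / NORP entities."""
    # Imported here so the filtering helper above works without spaCy installed.
    import spacy

    # Assumption: the named model has already been downloaded.
    nlp = spacy.load(model)
    return extract_places(nlp(text))
```

Calling `extract_places_from_text(abstract)` on the abstract from the question would return whatever GPE and NORP entities the chosen model finds; the exact list depends on the model version, so I am not quoting an output here.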


Aer*_*rin 3

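If you want to extract country names, what you need is an NER tagger, not a POS tagger.

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate elements in text and classify them into predefined categories such as names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.

Check out the Stanford NER tagger!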

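# Newer NLTK releases name this class StanfordNERTagger (formerly NERTagger)
from nltk.tag.stanford import StanfordNERTagger

# Paths to the Stanford NER model and jar; adjust to your installation
st = StanfordNERTagger('../ner-model.ser.gz', '../stanford-ner.jar')
# text is the input string; the result is a list of (token, tag) pairs
tagging = st.tag(text.split())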
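The tagger returns one (token, tag) pair per token, so multi-word names still have to be stitched back together. Below is a rough sketch of one way to do that. It assumes the model marks places with a `LOCATION` tag (as the standard Stanford English models do) and everything else with other tags such as `O`. The helper `group_entities` and the sample list are hypothetical; the sample is hand-written, not real tagger output.

```python
from itertools import groupby


def group_entities(tagged, wanted=("LOCATION",)):
    """Join runs of consecutive tokens that share a wanted tag into phrases."""
    phrases = []
    # groupby yields one group per run of adjacent pairs with the same tag
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        if tag in wanted:
            phrases.append(" ".join(token for token, _ in group))
    return phrases


# Hand-written sample in the (token, tag) shape, for illustration only
sample = [
    ("patients", "O"), ("from", "O"),
    ("New", "LOCATION"), ("Zealand", "LOCATION"),
    ("and", "O"), ("Australia", "LOCATION"),
]
print(group_entities(sample))  # ['New Zealand', 'Australia']
```

One limitation of this approach: two different places that appear directly next to each other with no other token in between would be merged into a single phrase, because the tags alone do not mark entity boundaries.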