Parsing locations, person names, and dates from strings with NLTK

Shi*_*dim 7 python nlp corpus nltk

I have many strings like the following:

  1. ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
  2. KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
  3. ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin

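How can I use NLTK to strip the dateline portion and identify the date, location, and person names?

With POS tagging I can find the parts of speech, but I need to identify locations, dates, and person names. How can I do this?

Update:

Note: I don't want to make another HTTP request. I need to parse this with my own code. If there is a library for this, using it is fine.

Update:

I tried ne_chunk, but no luck.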

import nltk

def pchunk(t):
    w_tokens = nltk.word_tokenize(t)
    pt = nltk.pos_tag(w_tokens)
    ne = nltk.ne_chunk(pt)
    print(ne)

# txts is a list of those 3 sentences.
for t in txts:
    print(t)
    pchunk(t)

The output is as follows:

ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab

(S
  ISLAMABAD/NNP
  :/:
  Chief/NNP
  Justice/NNP
  (PERSON Iftikhar/NNP Muhammad/NNP Chaudhry/NNP)
  said/VBD
  that/IN
  (ORGANIZATION National/NNP Accountab/NNP))

KARACHI, July 24 -- Police claimed to have arrested several suspects in separate

(S
  (GPE KARACHI/NNP)
  ,/,
  July/NNP
  24/CD
  --/:
  Police/NNP
  claimed/VBD
  to/TO
  have/VB
  arrested/VBN
  several/JJ
  suspects/NNS
  in/IN
  separate/JJ)

ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin

(S
  (GPE ALUM/NN)
  (ORGANIZATION KULAM/NN)
  ,/,
  (PERSON Sri/NNP Lanka/NNP)
  --/:
  As/IN
  gray-bellied/JJ
  clouds/NNS
  started/VBN
  to/TO
  blot/VB
  out/RP
  the/DT
  scorchin/NN)

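Look closely. KARACHI is recognized correctly as a GPE, but Sri Lanka is labeled PERSON, and ISLAMABAD is only tagged NNP instead of being recognized as a GPE.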
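Since the all-caps dateline is the part the chunker handles worst, one possible workaround is to split it off with a regular expression before tagging the rest. The following is only a rough sketch based on the three sample strings above: the names `split_dateline`, `DATELINE_RE`, and `DATE_RE` are made up for illustration, and the pattern assumes the dateline starts with an all-caps place name and ends with `:` or `--`.

```python
import re

# Assumed dateline shape (taken from the three samples only):
#   CITY[, extra] (":" or "--") body
DATELINE_RE = re.compile(
    r"^(?P<city>[A-Z][A-Z ]+?)"      # all-caps place name, e.g. "ALUM KULAM"
    r"(?:,\s*(?P<extra>[^:]+?))?"    # optional ", July 24" or ", Sri Lanka"
    r"\s*(?::|--)\s*"                # separator between dateline and body
    r"(?P<body>.+)$",
    re.DOTALL,
)

# Only "Month day" dates with full English month names are recognized here.
DATE_RE = re.compile(
    r"^(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2}$"
)


def split_dateline(text):
    """Return a dict with the city, region, date string, and remaining body."""
    match = DATELINE_RE.match(text)
    if match is None:
        # No recognizable dateline: hand the whole string back as the body.
        return {"city": None, "region": None, "date": None, "body": text}
    extra = match.group("extra")
    is_date = bool(extra and DATE_RE.match(extra))
    return {
        "city": match.group("city").strip(),
        "region": None if is_date else extra,
        "date": extra if is_date else None,
        "body": match.group("body"),
    }


if __name__ == "__main__":
    print(split_dateline("KARACHI, July 24 -- Police claimed to have arrested"))
```

The `body` value can then be passed to `pchunk` on its own, so the named-entity chunker only sees ordinary mixed-case sentence text. Datelines that do not follow the assumed shape are returned unchanged.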

Bla*_*sad 2

If using an API instead of your own code is acceptable for your requirements, the Wit API can do this for you easily.


Wit will also resolve date/time tokens into normalized dates.
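For comparison, normalizing a dateline token such as "July 24" can also be done locally with the standard library. This is a minimal sketch, not what Wit does internally: the helper name `normalize_dateline_date` is hypothetical, and because the samples carry no year, the caller has to supply one.

```python
from datetime import date, datetime


def normalize_dateline_date(text, year):
    """Parse a 'Month day' string such as 'July 24' into a datetime.date.

    The year is an assumption supplied by the caller, since the dateline
    itself does not include one. "%B" matches full month names in the
    current locale (English under the default C locale).
    """
    parsed = datetime.strptime(f"{text} {year}", "%B %d %Y")
    return parsed.date()


if __name__ == "__main__":
    print(normalize_dateline_date("July 24", 2012).isoformat())
```

`datetime.strptime` raises `ValueError` for strings that do not match the format, so unexpected dateline formats fail loudly rather than silently.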

To get started, you only need to provide a few examples.

  • **This is not an answer.** I don't want to rely on an external service. (7 upvotes)