I am trying to extract proper nouns (person names and organization names) from very small chunks of text such as SMS messages. A basic parser that looks up proper nouns with NLTK WordNet can find the nouns, but the problem is that a proper noun is not always capitalized; for text like the one below, a name such as sumit is not recognized as a proper noun:
>>> from nltk import pos_tag
>>> sentence = "i spoke with sumit and rajesh and Samit about the gridlock situation last night @ around 8 pm last nite"
>>> tagged_sent = pos_tag(sentence.split())
>>> print(tagged_sent)
[('i', 'PRP'), ('spoke', 'VBP'), ('with', 'IN'), ('sumit', 'NN'), ('and', 'CC'), ('rajesh', 'JJ'), ('and', 'CC'), ('Samit', 'NNP'), ('about', 'IN'), ('the', 'DT'), ('gridlock', 'NN'), ('situation', 'NN'), ('last', 'JJ'), ('night', 'NN'), ('@', 'IN'), ('around', 'IN'), ('8', 'CD'), ('pm', 'NN'), ('last', 'JJ'), ('nite', 'NN')]
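To double-check that casing, rather than the tokens themselves, is what drives the tagging, here is a small sketch of my own (not part of the original session) that re-tags the sentence with the names capitalized:

from nltk import pos_tag

# Same sentence, but with the names capitalized; with the default
# averaged perceptron tagger they are then typically tagged NNP.
sentence = "i spoke with Sumit and Rajesh and Samit about the gridlock situation last night"
print(pos_tag(sentence.split()))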
I am trying to extract keywords from a piece of text using StanfordNERTagger and nltk.
import re
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

docText = "John Donk works for POI. Brian Jones wants to meet with Xyz Corp. for measuring POI's Short Term performance Metrics."

words = re.split(r"\W+", docText)
stops = set(stopwords.words("english"))
# remove stop words and very short tokens from the list
words = [w for w in words if w not in stops and len(w) > 2]
filteredText = " ".join(words)
print(filteredText)

# model names assume the Stanford jars/models are reachable via CLASSPATH / STANFORD_MODELS
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')

# keep only the tokens that the Stanford POS tagger marks as proper nouns (NNP)
stanfordPosTagList = [word for word, pos in stp.tag(filteredText.split()) if pos == 'NNP']
print("Stanford POS Tagged")
print(stanfordPosTagList)

tagged = stn.tag(stanfordPosTagList)
print …
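In case it helps with that last step, here is a minimal sketch (my own continuation, assuming the goal is person and organization names) of filtering the NER output by label; the 3-class model distinguishes PERSON, ORGANIZATION and LOCATION:

# `tagged` is the list of (token, label) pairs returned by stn.tag above
keywords = [token for token, label in tagged if label in ('PERSON', 'ORGANIZATION')]
print(keywords)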
Is there a package that can remove proper nouns from a sentence using Python?
I know that packages such as NLTK, Stanford, and TextBlob can do the job (remove names), but they also remove many words that start with a capital letter yet are not proper nouns.
Also, I cannot use a dictionary of names, because it would be very large and would keep growing as data keeps being added to the database.
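One direction that needs neither a name dictionary nor blanket removal of capitalized words, sketched here as a rough idea rather than a ready-made package, is to drop only the tokens that NLTK's bundled named-entity chunker places inside PERSON (or ORGANIZATION) chunks:

import nltk
# requires the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words data packages

def remove_proper_nouns(sentence):
    # POS-tag, chunk named entities, then drop tokens inside PERSON/ORGANIZATION chunks
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
    kept = []
    for node in tree:
        if isinstance(node, nltk.Tree):
            if node.label() in ('PERSON', 'ORGANIZATION'):
                continue  # skip named entities
            kept.extend(token for token, pos in node.leaves())
        else:
            kept.append(node[0])
    return " ".join(kept)

print(remove_proper_nouns("Brian Jones wants to meet with Xyz Corp. about the metrics."))

The chunker will still miss lowercase names like sumit, so this only addresses the over-removal of capitalized non-names, not the casing problem from the first question.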