Extracting city names from text using Python

Geo*_*geC 1 python validation normalization

I have a dataset in which one column is titled "What is your location and time zone?"

This means we have entries like

  1. Denmark, CET
  2. Location is Devon, England, GMT time zone
  3. Australia. Australian Eastern Standard Time +10h UTC

and even

  1. My location is Eugene, Oregon for most of the year, or Seoul, South Korea, depending on school holidays. My primary time zone is the Pacific time zone.
  2. For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September 2015 I will be in Boston, United States (EDT).

Is there a way to extract the city, country, and time zone from entries like these?

My idea was to take all country names (including abbreviated forms), plus city names and time zones, and build a lookup array from an open-source dataset. Then, if any word in my dataset matches a city/country/time zone or its short form, I would write it into a new column in the same dataset and count the occurrences.

Would that work?
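A minimal sketch of that lookup idea is below. The three sets are illustrative placeholders (in practice they would be loaded from an open dataset such as GeoNames, and would include abbreviations):

```python
import re

# Illustrative gazetteer only; a real run would load these sets
# from an open dataset (e.g. GeoNames) including short forms.
COUNTRIES = {'denmark', 'england', 'australia', 'norway', 'israel'}
CITIES = {'devon', 'eugene', 'seoul', 'london', 'boston'}
TIMEZONES = {'cet', 'gmt', 'edt', 'utc', 'pacific'}

def extract_places(text):
    """Return (countries, cities, timezones) found in free-form text."""
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
    return (sorted(words & COUNTRIES),
            sorted(words & CITIES),
            sorted(words & TIMEZONES))

print(extract_places("Location is Devon, England, GMT time zone"))
# (['england'], ['devon'], ['gmt'])
```

Note that single-word matching misses multi-word names such as "South Korea" or "United Kingdom"; handling those needs phrase matching against the gazetteer rather than a per-word set intersection.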

=========== EDIT: reply to the NLTK-based answer ===========

Running the same code as alecxe's answer gives:

Traceback (most recent call last):
  File "E:\SBTF\ntlk_test.py", line 19, in <module>
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\__init__.py", line 110, in pos_tag
    tagger = PerceptronTagger()
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 141, in __init__
    self.load(AP_MODEL_LOC)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 209, in load
    self.model.weights, self.tagdict, self.classes = load(loc)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 924, in _open
    return urlopen(resource_url)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 454, in _open
    'unknown_open', req)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 1265, in unknown_open
    raise URLError('unknown url type: %s' % type)
URLError: <urlopen error unknown url type: c>
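The final URLError hints at the cause: this nltk version passes the tagger model's location straight to urlopen, and a Windows path like C:\... parses as a URL whose scheme is the drive letter "c", for which no handler exists. This was a known nltk bug on Windows; upgrading nltk (pip install -U nltk) and re-running nltk.download('averaged_perceptron_tagger') usually resolves it. The snippet below (with an illustrative path) just demonstrates the mis-parse:

```python
from urllib.parse import urlparse  # urlparse in urllib2 on Python 2

# A Windows filesystem path fed to a URL parser: the drive letter
# is taken as the URL scheme, which is why urlopen raises
# "unknown url type: c".
path = r"C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tagger.pickle"
print(urlparse(path).scheme)  # prints "c"
```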

ale*_*cxe 8

I would use natural language processing with nltk, which provides named-entity recognition.

The example (heavily based on this gist) tokenizes each line of the file, splits it into chunks, and recursively collects the NE (named entity) labels from each chunk. More explanation here:

import nltk

# Requires the nltk data packages punkt, averaged_perceptron_tagger,
# maxent_ne_chunker and words (install via nltk.download()).

def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            # Subtree tagged as a named entity: join its tokens.
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            # Recurse into nested subtrees.
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

with open('sample.txt', 'r') as f:
    for line in f:
        sentences = nltk.sent_tokenize(line)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
        # binary=True collapses all entity types into a single 'NE' label.
        chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

        entities = []
        for tree in chunked_sentences:
            entities.extend(extract_entity_names(tree))

        print(entities)

For a sample.txt containing:

Denmark, CET
Location is Devon, England, GMT time zone
Australia. Australian Eastern Standard Time. +10h UTC.
My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)

it prints:

['Denmark', 'CET']
['Location', 'Devon', 'England', 'GMT']
['Australia', 'Australian Eastern Standard Time']
['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific']
['London', 'United Kingdom', 'Norway', 'Israel', 'London', 'United Kingdom', 'Boston', 'United States', 'EDT']

The output isn't perfect, but it may be a good starting point for you.
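One way to push this output further (a sketch, not part of the answer above): run the extracted entity strings through lookup tables to separate countries and time zones from the remaining candidates, which are mostly cities. The sets below are hand-made placeholders; full lists could come from open datasets such as GeoNames or the IANA tz database:

```python
# Hand-made lookup tables for post-processing the NER output;
# a real pipeline would load these from open datasets.
COUNTRY_NAMES = {'Denmark', 'England', 'Australia', 'Norway', 'Israel',
                 'South Korea', 'United Kingdom', 'United States'}
TZ_NAMES = {'CET', 'GMT', 'UTC', 'EDT', 'Pacific',
            'Australian Eastern Standard Time'}

def classify(entities):
    """Split NER output into (countries, timezones, other) buckets."""
    countries, timezones, other = [], [], []
    for name in entities:
        if name in COUNTRY_NAMES:
            countries.append(name)
        elif name in TZ_NAMES:
            timezones.append(name)
        else:
            other.append(name)  # mostly cities, plus noise like 'Location'
    return countries, timezones, other

print(classify(['Location', 'Devon', 'England', 'GMT']))
# (['England'], ['GMT'], ['Location', 'Devon'])
```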

  • How does this work? It seems like sorcery. (4 upvotes)
  • @Racialz nltk often surprises! I'm not an NLP expert, but I've tried to add more explanation and links for further reading. Thanks for asking for details! (2 upvotes)