Extracting country names from author affiliations

Shr*_*nik 6 text nlp r

I am currently exploring ways to extract country names from author affiliations (PubMed articles). My sample data looks like this:

Mechanical and Production Engineering Department, National University of Singapore.

Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.

Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.

Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285.

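Initially I tried removing the punctuation, splitting the vector into words, and comparing those words against a list of country names from Wikipedia, but I was not successful.

Can anyone suggest a better way of doing this? I would prefer a solution in R, because I have to do further analysis and generate graphics in R.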

And*_*rie 7

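Here is a simple solution that may get you some of the way. It uses a database of city and country data that ships with the maps package. If you can get hold of a better database, it should be simple to modify the code.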

library(maps)
library(plyr)

# Load data from package maps
data(world.cities)

# Create test data
aa <- c(
    "Mechanical and Production Engineering Department, National University of Singapore.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
    "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
)

# Remove punctuation from data
# Argument order is gsub(pattern, replacement, x)
caa <- gsub("[[:punct:]]", "", aa)

# Split data at word boundaries
saa <- strsplit(caa, " ")

# Match on cities in world.cities
# Assumes that if multiple matches, the last takes precedence, i.e. max()
llply(saa, function(x)x[max(which(x %in% world.cities$name))])

# Match on country names in world.cities$country.etc
llply(saa, function(x)x[which(x %in% world.cities$country.etc)])

Here are the results for cities:

[[1]]
[1] "Singapore"

[[2]]
[1] "Cambridge"

[[3]]
[1] "Cambridge"

[[4]]
[1] "Indianapolis"

And the results for countries:

[[1]]
[1] "Singapore"

[[2]]
[1] "UK"

[[3]]
[1] "UK"

[[4]]
character(0)
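
The fourth affiliation returns character(0) because it contains no country name at all, only a city, a US state abbreviation and a ZIP code. One possible fallback, sketched below, is to look for a state abbreviation followed by a ZIP code. The helper name `guess_usa` and the label "USA" are assumptions made for this illustration and are not part of the code above; `state.abb` is a vector that ships with base R.

```r
# Test data (same as above)
aa <- c(
    "Mechanical and Production Engineering Department, National University of Singapore.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
    "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
)

# Hypothetical helper: returns "USA" when an affiliation contains a
# two-letter US state abbreviation followed by a 5-digit ZIP code
# (optionally ZIP+4), and NA otherwise.
guess_usa <- function(affil) {
    # state.abb comes from the base R datasets package: "AL", "AK", ...
    states  <- paste(state.abb, collapse = "|")
    pattern <- paste0("\\b(", states, ")\\s+[0-9]{5}(-[0-9]{4})?\\b")
    ifelse(grepl(pattern, affil, perl = TRUE), "USA", NA_character_)
}

guess_usa(aa)
```

This must run on the original strings, before punctuation is removed and before the text is split into words, because the pattern relies on the abbreviation and the ZIP code being adjacent. It will miss US affiliations that give no ZIP code.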

With a bit of data cleaning, you may be able to build on this.
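
As one example of such cleaning: matching word by word can only find single-word names, so "United States" or "South Korea" would be missed, and spelling variants such as "U.K." and "UK" have to be normalised first. A minimal sketch of an alternative is to match a small alias table against the whole affiliation string. The table `aliases` and the function `find_country` are hypothetical and only cover a few names for illustration; a real table would need to be much longer.

```r
# Test data (same as above)
aa <- c(
    "Mechanical and Production Engineering Department, National University of Singapore.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
    "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
    "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
)

# Hypothetical alias table: regular expression -> canonical country name
aliases <- c(
    "U\\.?K\\.?|United Kingdom"                    = "UK",
    "U\\.?S\\.?A\\.?|United States( of America)?"  = "USA",
    "Singapore"                                    = "Singapore",
    "South Korea|Republic of Korea"                = "South Korea"
)

# Return the canonical names of all aliases found in one affiliation string
find_country <- function(affil) {
    hits <- vapply(names(aliases), function(p) {
        # Require a non-letter (or string start/end) on both sides so that
        # an alias cannot match in the middle of a longer word
        grepl(paste0("(^|[^A-Za-z])(", p, ")([^A-Za-z]|$)"), affil, perl = TRUE)
    }, logical(1))
    unique(unname(aliases[hits]))
}

lapply(aa, find_country)
```

Because the match runs on the untouched string, no punctuation removal or splitting is needed. Affiliations without any listed alias, such as the fourth one, still return character(0).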