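I am currently exploring the possibility of extracting country names from author affiliations (PubMed articles). My sample data looks like this:
Mechanical and Production Engineering Department, National University of Singapore.
Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.
Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.
Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285.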
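Solution
Here is a simple solution that might get you some of the way. It uses the city and country data that ships with the maps package (the world.cities dataset) as its lookup database. If you can get hold of a better database, it should be simple to modify the code.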
    library(maps)
    library(plyr)

    # Load city and country data from package maps
    data(world.cities)

    # Create test data
    aa <- c(
      "Mechanical and Production Engineering Department, National University of Singapore.",
      "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, U.K.",
      "Cancer Research Campaign Mammalian Cell DNA Repair Group, Department of Zoology, Cambridge, UK.",
      "Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285."
    )

    # Remove punctuation from data (this also turns "U.K." into "UK")
    caa <- gsub(pattern = "[[:punct:]]", replacement = "", x = aa)

    # Split data at word boundaries
    saa <- strsplit(caa, " ")

    # Match on cities in world.cities
    # Assumes that if there are multiple matches, the last takes precedence, i.e. max()
    llply(saa, function(x) x[max(which(x %in% world.cities$name))])

    # Match on countries in world.cities$country.etc
    llply(saa, function(x) x[which(x %in% world.cities$country.etc)])
This is the result for cities:
    [[1]]
    [1] "Singapore"

    [[2]]
    [1] "Cambridge"

    [[3]]
    [1] "Cambridge"

    [[4]]
    [1] "Indianapolis"
And this is the result for countries:
    [[1]]
    [1] "Singapore"

    [[2]]
    [1] "UK"

    [[3]]
    [1] "UK"

    [[4]]
    character(0)
With a bit of data cleaning, you may be able to do something with this.
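One possible cleaning step, sketched here rather than taken from the original answer, is to fill in a missing country from the matched city, as in the fourth affiliation, where a city is found but no country. The helper name `city_to_country` and the toy table `toy_cities` are hypothetical. The function assumes a lookup table with the columns `name`, `country.etc` and `pop`, as in `world.cities`, and it assumes that the most populous city should win when several cities share a name, which will not always be the right choice.

```r
# Hypothetical helper: look up the country of a city name in a lookup table.
# `cities` is assumed to have the columns name, country.etc and pop,
# like maps::world.cities.
city_to_country <- function(city, cities) {
  if (is.na(city)) return(NA_character_)
  hits <- cities[which(cities$name == city), , drop = FALSE]
  if (nrow(hits) == 0) return(NA_character_)
  # Assumption: when several cities share a name, prefer the most populous one
  as.character(hits$country.etc[which.max(hits$pop)])
}

# Toy lookup table with made-up population values, for illustration only
toy_cities <- data.frame(
  name        = c("Indianapolis", "Cambridge", "Cambridge"),
  country.etc = c("USA", "UK", "USA"),
  pop         = c(3, 2, 1),
  stringsAsFactors = FALSE
)

city_to_country("Indianapolis", toy_cities)  # single match
city_to_country("Cambridge", toy_cities)     # ambiguous name, largest pop wins
city_to_country("Atlantis", toy_cities)      # no match, returns NA
```

To use it with the real data, pass `world.cities` in place of `toy_cities`. Ambiguous names such as Cambridge are better resolved from a country found in the affiliation itself whenever one is present.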