I am trying to extract information from an XML file from ClinicalTrials.gov. The file is organized as follows:
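
    <clinical_study>
      ...
      <brief_title>
      ...
      <location>
        <facility>
          <name>
          <address>
            <city>
            <state>
            <zip>
            <country>
        </facility>
        <status>
        <contact>
          <last_name>
          <phone>
          <email>
        </contact>
      </location>
      <location>
        ...
      </location>
      ...
    </clinical_study>

I can extract all location nodes from the XML file with the R XML package from CRAN, using the following code:

    library(XML)
    clinicalTrialUrl <- "http://clinicaltrials.gov/ct2/show/NCT01480479?resultsxml=true"
    xmlDoc <- xmlParse(clinicalTrialUrl, useInternalNodes = TRUE)
    locations <- xmlToDataFrame(getNodeSet(xmlDoc, "//location"))

This works well. However, if you look at the data frame, you will notice that xmlToDataFrame lumps everything under <facility> into a single concatenated string. The solution is to write code that builds the data frame column by column. Alternatively, you can flatten the XML first, so that every leaf element becomes its own column; the flatten_xml function below does this.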
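The column-by-column approach mentioned above could look roughly like the following. This is only a sketch under stated assumptions: the inline document, its values, and the helper `first_value` are made up for illustration, and the relative XPath expressions assume the layout shown earlier. Each column is extracted with its own XPath, evaluated relative to one location node at a time, so a location that lacks an element gets NA instead of shifting the other values.

```r
library(XML)

# Hypothetical two-location document that mimics the layout described above.
xml_text <- '
<clinical_study>
  <brief_title>Example study</brief_title>
  <location>
    <facility>
      <name>Site A</name>
      <address>
        <city>Birmingham</city>
        <state>Alabama</state>
      </address>
    </facility>
    <status>Recruiting</status>
    <contact>
      <last_name>Jane Doe</last_name>
    </contact>
  </location>
  <location>
    <facility>
      <name>Site B</name>
      <address>
        <city>Berkeley</city>
        <state>California</state>
      </address>
    </facility>
    <status>Withdrawn</status>
  </location>
</clinical_study>'

doc <- xmlParse(xml_text, asText = TRUE)
location_nodes <- getNodeSet(doc, "//location")

# Hypothetical helper: text of the first node matching a relative XPath,
# or NA when the location has no such element.
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# One relative XPath per output column (column names chosen for illustration).
columns <- c(
  name    = "./facility/name",
  city    = "./facility/address/city",
  state   = "./facility/address/state",
  status  = "./status",
  contact = "./contact/last_name"
)

# Build each column separately, then combine them into one data frame.
locations <- as.data.frame(
  lapply(columns, function(path) {
    vapply(location_nodes, first_value, character(1), path = path)
  }),
  stringsAsFactors = FALSE
)
```

The drawback of this style is that every column has to be listed by hand, which is what the flattening approach avoids.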

    # Recursively collect every leaf value, named after its enclosing element.
    flatten_xml <- function(x) {
      if (length(xmlChildren(x)) == 0)
        structure(list(xmlValue(x)), .Names = xmlName(xmlParent(x)))
      else
        Reduce(append, lapply(xmlChildren(x), flatten_xml))
    }

    # One single-row data frame per location node.
    dfs <- lapply(getNodeSet(xmlDoc, "//location"), function(x) data.frame(flatten_xml(x)))

    # Add any missing columns as NA so the rows can be bound together.
    allnames <- unique(c(lapply(dfs, colnames), recursive = TRUE))
    df <- do.call(rbind, lapply(dfs, function(df) {
      df[, setdiff(allnames, colnames(df))] <- NA
      df
    }))

    head(df)
    #    city         state       zip    country        status      last_name          phone         email                     last_name.1
    # 1  Birmingham   Alabama     35294  United States  Recruiting  Louis B Nabors,MD  205-934-1813  bnabors@uab.edu           Louis B Nabors,MD
    # 2  Mobile       Alabama     36604  United States  Recruiting  Melanie Alford,RN  251-445-9649  malford@usouthal.edu      Pamela Francisco,CCRP
    # 3  Phoenix      Arizona     85013  United States  Recruiting  Lynn Ashby,MD      602-406-6262  LASHBY@CHW.EDU            Lynn Ashby,MD
    # 4  Tucson       Arizona     85724  United States  Recruiting  Jamie Holt         520-626-6800  jholt1@email.arizona.edu  Baldassarre Stea,MD,PhD
    # 5  Little Rock  Arkansas    72205  United States  Recruiting  Wilma Brooks,RN    501-686-8530  ALEubanks@uams.edu        Amanda Eubanks,APN
    # 6  Berkeley     California  94704  United States  Withdrawn   <NA>               <NA>          <NA>                      <NA>
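In the output above, the second `last_name` value ends up in a column called `last_name.1`, because `data.frame` de-duplicates repeated names. If that is a problem, one possible variant is to name each leaf by its path below the location node. The sketch below is not part of the original answer: `flatten_with_path` is a hypothetical helper, the inline document and its values are invented, and the `<contact_backup>` element is an assumption about where the second `last_name` comes from. Like `flatten_xml`, it assumes every leaf element contains text.

```r
library(XML)

# Hypothetical document, written without whitespace between elements so that
# the only text nodes are the leaf values.
xml_text <- paste0(
  "<clinical_study>",
  "<location>",
  "<facility><name>Site A</name><address><city>Birmingham</city></address></facility>",
  "<status>Recruiting</status>",
  "<contact><last_name>Jane Doe</last_name></contact>",
  "<contact_backup><last_name>John Roe</last_name></contact_backup>",
  "</location>",
  "<location>",
  "<facility><name>Site B</name><address><city>Berkeley</city></address></facility>",
  "<status>Withdrawn</status>",
  "</location>",
  "</clinical_study>"
)
doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: like flatten_xml, but each leaf value is named by the
# dot-separated path of its enclosing elements.
flatten_with_path <- function(node, path = character(0)) {
  kids <- xmlChildren(node)
  if (length(kids) == 0) {
    # Leaf (a text node): 'path' already ends with the enclosing element.
    return(structure(list(xmlValue(node)), .Names = paste(path, collapse = ".")))
  }
  Reduce(append, lapply(kids, function(kid) {
    flatten_with_path(kid, c(path, xmlName(node)))
  }))
}

# One single-row data frame per location, with the "location." prefix dropped.
rows <- lapply(getNodeSet(doc, "//location"), function(loc) {
  values <- flatten_with_path(loc)
  names(values) <- sub("^location\\.", "", names(values))
  as.data.frame(values, stringsAsFactors = FALSE)
})

# Pad missing columns with NA, put columns in a common order, then bind rows.
all_names <- unique(unlist(lapply(rows, colnames)))
padded <- lapply(rows, function(d) {
  for (nm in setdiff(all_names, colnames(d))) d[[nm]] <- NA_character_
  d[all_names]
})
result <- do.call(rbind, padded)
```

With path-based names, the two contact names land in `contact.last_name` and `contact_backup.last_name` instead of `last_name` and `last_name.1`, at the cost of longer column names.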