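For a homework assignment, I am trying to convert an XML file into a data frame in R. I have tried many different things and searched the internet for ideas, but have had no success. Here is my code so far: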
    library(XML)

    # URL of the XML file to parse
    myUrl <- 'http://www.ggobi.org/book/data/olive.xml'
    doc <- xmlParse(myUrl)
    root <- xmlRoot(doc)

    # apply xmlValue to every child of every top-level node
    dataFrame <- xmlSApply(root, function(x) xmlSApply(x, xmlValue))
    data.frame(t(dataFrame), row.names = NULL)
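It may not be as verbose as the XML package, but xml2 does not have the memory leaks and is laser-focused on data extraction. I use trimws, which is a fairly recent addition to R core.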
    library(xml2)

    pg <- read_xml("http://www.ggobi.org/book/data/olive.xml")

    # get all the <record>s
    recs <- xml_find_all(pg, "//record")

    # extract and clean all the columns
    vals <- trimws(xml_text(recs))

    # extract and clean (if needed) the area names
    labs <- trimws(xml_attr(recs, "label"))

    # mine the column names from the two variable descriptions
    # this XPath construct lets us grab either the <categ…> or <real…> tags
    # and then grabs the 'name' attribute of them
    cols <- xml_attr(xml_find_all(pg, "//data/variables/*[self::categoricalvariable or self::realvariable]"), "name")

    # this converts each set of <record> columns to a data frame
    # after first converting each row to numeric and assigning
    # names to each column (making it easier to do the matrix to data frame conv)
    dat <- do.call(rbind, lapply(strsplit(vals, "\\ +"), function(x) {
      data.frame(rbind(setNames(as.numeric(x), cols)))
    }))

    # then assign the area name column to the data frame
    dat$area_name <- labs

    head(dat)
    ##   region area palmitic palmitoleic stearic oleic linoleic linolenic
    ## 1      1    1     1075          75     226  7823      672        NA
    ## 2      1    1     1088          73     224  7709      781        31
    ## 3      1    1      911          54     246  8113      549        31
    ## 4      1    1      966          57     240  7952      619        50
    ## 5      1    1     1051          67     259  7771      672        50
    ## 6      1    1      911          49     268  7924      678        51
    ##   arachidic eicosenoic    area_name
    ## 1        60         29 North-Apulia
    ## 2        61         29 North-Apulia
    ## 3        63         29 North-Apulia
    ## 4        78         35 North-Apulia
    ## 5        80         46 North-Apulia
    ## 6        70         44 North-Apulia
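If you want to try the same steps without network access, something like the following sketch may help. The inline document is a made-up miniature that only imitates the layout the XPath above relies on (a `variables` block followed by `record` elements); the column names and values in it are invented for illustration. It also swaps in a slightly different final step: the split strings are bound into one character matrix and converted column by column, instead of building one data frame per record.

```r
library(xml2)

# Hypothetical miniature document; NOT the real olive.xml
xml_txt <- '<ggobidata>
  <data name="mini">
    <variables count="3">
      <categoricalvariable name="region"/>
      <realvariable name="palmitic"/>
      <realvariable name="oleic"/>
    </variables>
    <records count="2">
      <record label="North-Apulia"> 1 1075 7823 </record>
      <record label="Calabria"> 1 1088 na </record>
    </records>
  </data>
</ggobidata>'

pg <- read_xml(xml_txt)

recs <- xml_find_all(pg, "//record")
vals <- trimws(xml_text(recs))            # whitespace-separated values per record
labs <- trimws(xml_attr(recs, "label"))   # area names
cols <- xml_attr(xml_find_all(pg, "//data/variables/*"), "name")

# one row per record, one column per variable (all character at this point)
mat <- do.call(rbind, strsplit(vals, "[[:space:]]+"))
dat <- as.data.frame(mat, stringsAsFactors = FALSE)
names(dat) <- cols

# convert to numeric; non-numeric entries such as "na" become NA
dat[] <- lapply(dat, function(x) suppressWarnings(as.numeric(x)))
dat$area_name <- labs
```

Binding once into a matrix avoids creating a separate data frame for every record, which may matter for larger files; whether it is worth it for a few hundred records is a judgment call.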
UPDATE
I'd now do that last bit this way:
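    library(tidyverse)

    # split each record on whitespace, turn it into a one-row tibble,
    # row-bind the results, then add the area names
    # (as_tibble() replaces as_data_frame(), which is deprecated in current tibble)
    strsplit(vals, "[[:space:]]+") %>%
      map_df(~as_tibble(as.list(setNames(., cols)))) %>%
      mutate(area_name = labs)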
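One thing to watch with the tidyverse version: strsplit returns strings, so every measurement column comes out as character rather than numeric. A minimal sketch of one way to convert them, assuming dplyr 1.0 or later for `across()`; the `vals`, `cols` and `labs` vectors here are small hypothetical stand-ins for the ones extracted earlier:

```r
library(tidyverse)

# Hypothetical stand-ins for the vectors extracted from the XML
vals <- c("1 1075 7823", "1 1088 7709")
cols <- c("region", "palmitic", "oleic")
labs <- c("North-Apulia", "Calabria")

dat <- strsplit(vals, "[[:space:]]+") %>%
  map_df(~ as_tibble(as.list(setNames(.x, cols)))) %>%
  # convert only the columns that came from the XML values
  mutate(across(all_of(cols), as.numeric)) %>%
  mutate(area_name = labs)
```

With the real file, entries that are not numbers would turn into NA with a warning during the conversion.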