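I need to find and combine information in some huge XML files (doc <- xmlInternalTreeParse(file.name, useInternalNodes = TRUE, trim = TRUE) makes my 16GB computer start swapping to disk before it finishes), and I have been following the excellent instructions at
http://www.omegahat.org/RSXML/Overview.html.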
<?xml version="1.0" ?>
<TABLE>
  <SCHOOL>
    <NAME> School1 </NAME>
    <GRADES>
      <STUDENT> Fred </STUDENT>
      <TEST1> 66 </TEST1>
      <TEST2> 80 </TEST2>
      <FINAL> 70 </FINAL>
    </GRADES>
    <TEAMS>
      <SOCCER> SoccerTeam1 </SOCCER>
      <HOCKEY> HockeyTeam1 </HOCKEY>
    </TEAMS>
  </SCHOOL>
  <SCHOOL>
    <NAME> School2 </NAME>
    <GRADES>
      <STUDENT> Wilma </STUDENT>
      <TEST1> 97 </TEST1>
      <TEST2> 91 </TEST2>
      <FINAL> 98 </FINAL>
    </GRADES>
    <TEAMS>
      <SOCCER> SoccerTeam2 </SOCCER>
    </TEAMS>
  </SCHOOL>
</TABLE>
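For the students of each school, I need to list the hockey team and the school name. For the example above, the desired output is "Fred", "HockeyTeam1", "School1". The real file has thousands of schools, hockey teams and players.
How can I use xmlEventParse to parse the file and extract this information? I tried extracting all text fields from the file, but after waiting several hours there was still no output. Note: the real file is more deeply nested than this, so stepping down a fixed number of levels to find the information is not an option.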
We'll use the XML package
library(XML)
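and create a closure containing a function that handles 'SCHOOL' nodes, plus two helper functions for retrieving the results once parsing is done. The SCHOOL function is called on every SCHOOL node. If it finds a hockey team, it uses /SCHOOL/NAME/text() as the 'key', and /SCHOOL/TEAMS/HOCKEY/text() and //STUDENT/text() (or /SCHOOL/GRADES/STUDENT/text()) as the values. A message is printed every 1000 (by default) schools with a hockey team, so there is some indication of progress. Afterwards, use the 'get' function to retrieve the results.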
teams <- function(progress = 1000) {
    res <- new.env(parent = emptyenv())   # for results
    it <- 0L                              # iterator -- schools with a hockey team
    list(SCHOOL = function(elt) {
        ## handle 'SCHOOL' nodes
        if (getNodeSet(elt, "not(/SCHOOL/TEAMS/HOCKEY)"))
            ## early exit -- no hockey team
            return(NULL)
        it <<- it + 1L
        if (it %% progress == 0L)
            message(it)
        school <- getNodeSet(elt, "string(/SCHOOL/NAME/text())") # 'key'
        res[[school]] <-
            list(team = getNodeSet(elt,
                     "normalize-space(/SCHOOL/TEAMS/HOCKEY/text())"),
                 students = xpathSApply(elt, "//STUDENT", xmlValue))
    }, getres = function() {
        ## retrieve the 'res' environment when done
        res
    }, get = function() {
        ## retrieve 'res' environment as a data.frame
        school <- ls(res)
        ## fetch entries by key so they line up with 'school'
        ## (eapply() visits an environment in arbitrary order)
        entries <- mget(school, envir = res)
        team <- unlist(lapply(entries, "[[", "team"), use.names = FALSE)
        student <- lapply(entries, "[[", "students")
        len <- vapply(student, length, integer(1))
        data.frame(school = rep(school, len), team = rep(team, len),
                   student = unlist(student, use.names = FALSE))
    })
}
We use these functions as follows:
branches <- teams()
xmlEventParse("event.xml", handlers = NULL, branches = branches)
branches$get()
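The closure pattern behind teams() can also be tried in isolation, without the XML package or a large file. The following is only a minimal sketch under stated assumptions: plain lists stand in for SCHOOL nodes, and make_collector, handle, and the fields name, hockey and students are hypothetical names made up for this illustration. It shows how the handler accumulates one entry per school in an environment, and how get() expands those entries into one row per student.

```r
# Hypothetical stand-in for teams(): same closure shape, but the handler
# receives plain lists instead of XML nodes, so it runs in base R alone.
make_collector <- function() {
    res <- new.env(parent = emptyenv())   # results, keyed by school name
    list(
        handle = function(rec) {
            if (is.null(rec$hockey))      # early exit -- no hockey team
                return(invisible(NULL))
            res[[rec$name]] <- list(team = rec$hockey, students = rec$students)
            invisible(NULL)
        },
        get = function() {
            keys <- ls(res)                       # sorted school names
            entries <- mget(keys, envir = res)    # same order as 'keys'
            students <- lapply(entries, `[[`, "students")
            team <- vapply(entries, `[[`, character(1), "team",
                           USE.NAMES = FALSE)
            len <- lengths(students, use.names = FALSE)
            # repeat school and team once per student of that school
            data.frame(school = rep(keys, len),
                       team = rep(team, len),
                       student = unlist(students, use.names = FALSE),
                       stringsAsFactors = FALSE)
        }
    )
}

collector <- make_collector()
collector$handle(list(name = "School1", hockey = "HockeyTeam1",
                      students = "Fred"))
collector$handle(list(name = "School2", hockey = NULL, students = "Wilma"))
out <- collector$get()
print(out)
```

Because the results live in an environment captured by the closure, the handler can be called any number of times without copying what has been collected so far. That matters when it is invoked once per SCHOOL node in a very large file.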