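The XML package is widely used throughout the R universe.
So please treat this as a follow-up post and/or reference, with hopefully informative illustrations and a concise statement of the problem.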
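The problem
Once an XML/HTML document has been parsed so that it can be searched with XPath, C pointers are used internally (AFAIU). It looks like, at least on MS Windows (I'm running Windows 8.1, 64-bit), these references are not properly recognized by the garbage collector. As a result, the consumed memory is not freed properly, which at some point freezes the R process.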
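Central findings so far
It seems to me that XML::free and/or gc fail to recognize all of the memory involved when XML/HTML documents are parsed via xmlParse or htmlParse and subsequently processed with xpathApply and the like:
The memory usage reported by the OS for the task (Rterm.exe) increases very significantly, while the memory of the R process as reported "from within R" (function memory.size) increases only moderately by comparison. See the list elements mem_r, mem_os and ratio before and after the substantial parsing cycles below.
All in all, even when throwing in everything that is recommended (free, rm and gc), memory usage always increases when xmlParse etc. is called. It is only a question of how much. So IMHO there must still be something that is not working correctly.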
Illustration
I borrowed the profiling code from Duncan's Omegahat site.
Some preparation:
    Sys.setenv("LANGUAGE"="en")
    require("compiler")
    require("XML")

    > sessionInfo()
    R version 3.1.0 (2014-04-10)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
    [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
    [5] LC_TIME=German_Germany.1252

    attached base packages:
    [1] compiler  stats     graphics  grDevices utils     datasets  methods
    [8] base

    other attached packages:
    [1] XML_3.98-1.1
The functions we need:
    ## Memory of an OS task as reported by Windows' tasklist (fifth CSV column)
    getTaskMemoryByPid <- cmpfun(function(
      pid=Sys.getpid()
    ) {
      cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
      mem <- read.csv(text=shell(cmd, intern=TRUE), stringsAsFactors=FALSE)[,5]
      mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
      mem
    }, options=list(suppressAll=TRUE))

    ## Parse an XML document n times and record memory before and after
    memoryLeak <- cmpfun(function(
      x=system.file("exampleData", "mtcars.xml", package="XML"),
      n=10000,
      use_text=FALSE,
      xpath=FALSE,
      free_doc=FALSE,
      clean_up=FALSE,
      detailed=FALSE
    ) {
      if (use_text) {
        x <- readLines(x)
      }
      ## Before //
      mem_os <- getTaskMemoryByPid()
      mem_r <- memory.size()
      prof_1 <- memory.profile()
      mem_before <- list(mem_r=mem_r, mem_os=mem_os, ratio=mem_os/mem_r)
      ## Per run //
      mem_perrun <- lapply(1:n, function(ii) {
        doc <- xmlParse(x, asText=use_text)
        if (xpath) {
          res <- xpathApply(doc=doc, path="/blah", fun=xmlValue)
          rm(res)
        }
        if (free_doc) {
          free(doc)
        }
        rm(doc)
        out <- NULL
        if (detailed) {
          out <- list(
            profile=memory.profile(),
            size=memory.size()
          )
        }
        out
      })
      has_perrun <- any(sapply(mem_perrun, length) > 0)
      if (!has_perrun) {
        mem_perrun <- NULL
      }
      ## Garbage collect //
      mem_gc <- NULL
      if (clean_up) {
        gc()
        tmp <- gc()
        mem_gc <- list(gc_mb=tmp["Ncells", "(Mb)"])
      }
      ## After //
      mem_os <- getTaskMemoryByPid()
      mem_r <- memory.size()
      prof_2 <- memory.profile()
      ## mem_os is stored here as well; increase_os below depends on it
      mem_after <- list(mem_r=mem_r, mem_os=mem_os, ratio=mem_os/mem_r)
      list(
        before=mem_before,
        perrun=mem_perrun,
        gc=mem_gc,
        after=mem_after,
        comparison_r=data.frame(
          before=prof_1,
          after=prof_2,
          increase=round((prof_2/prof_1)-1, 4)
        ),
        increase_r=(mem_after$mem_r/mem_before$mem_r)-1,
        increase_os=(mem_after$mem_os/mem_before$mem_os)-1
      )
    }, options=list(suppressAll=TRUE))
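The one locale-sensitive step in getTaskMemoryByPid is turning the fifth tasklist column into a number. Here is a minimal sketch of that step in isolation, which may help when adapting the function to another machine. It assumes the German-locale format "51.072 K" that the regular expression above targets. parse_tasklist_mem is a hypothetical helper name, and the sample string with a comma as thousands separator is an assumption about what other locales may print, not something taken from the output above.

```r
# Hypothetical helper (not part of the original code): keep only the digits of
# the tasklist memory column, then divide by 1000 exactly as
# getTaskMemoryByPid() does.
parse_tasklist_mem <- function(x) {
  as.numeric(gsub("[^0-9]", "", x)) / 1000
}

# The expression used in getTaskMemoryByPid(), wrapped for comparison
original_parse <- function(x) {
  as.numeric(gsub("\\.|\\s|K", "", x)) / 1000
}

german  <- "51.072 K"  # format the original regular expression is written for
english <- "51,072 K"  # ASSUMED format with a comma as thousands separator

parse_tasklist_mem(german)
parse_tasklist_mem(english)
original_parse(german)
suppressWarnings(original_parse(english))  # NA, because the comma is not stripped
```

If the function returns NA on your system, a separator mismatch of this kind is one thing worth checking.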
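Results
Scenario 1
Quick facts: garbage collection enabled, the XML doc is parsed n times but not searched via xpathApply.
Note the ratio of OS memory to R memory:
Before: 1.364832
After: 1.322702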
    res <- memoryLeak(clean_up=TRUE, n=50000)
    save(res, file=file.path(tempdir(), "memory-profile-1.rdata"))

    > res
    $before
    $before$mem_r
    [1] 37.42

    $before$mem_os
    [1] 51.072

    $before$ratio
    [1] 1.364832

    $perrun
    NULL

    $gc
    $gc$gc_mb
    [1] 45

    $after
    $after$mem_r
    [1] 63.21

    $after$mem_os
    [1] 83.608

    $after$ratio
    [1] 1.322702

    $comparison_r
                before  after increase
    NULL             1      1   0.0000
    symbol        7387   7392   0.0007
    pairlist    190383 390633   1.0518
    closure       5077  55085   9.8499
    environment   1032  51032  48.4496
    promise       5226 105226  19.1351
    language     54675  54791   0.0021
    special         44     44   0.0000
    builtin        648    648   0.0000
    char          8746   8763   0.0019
    logical       9081   9084   0.0003
    integer      22804  22807   0.0001
    double        2773   2783   0.0036
    complex          1      1   0.0000
    character    44522  94569   1.1241
    ...              0      0      NaN
    any              0      0      NaN
    list         19946  19951   0.0003
    expression       1      1   0.0000
    bytecode     16049  16050   0.0001
    externalptr   1487   1487   0.0000
    weakref        391    391   0.0000
    raw            392    392   0.0000
    S4            1392   1392   0.0000

    $increase_r
    [1] 0.6892036

    $increase_os
    [1] 0.6370614
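The relative increases in comparison_r are hard to compare across object types because the starting counts differ so much. As a rough aid for reading the table, the sketch below converts the Scenario 1 counts into growth per parse iteration. The numbers are copied from the output above. The choice of rows and the variable names are mine, and the result says nothing about why these objects survive gc().

```r
# Object counts copied from the Scenario 1 output (n = 50000 iterations)
n <- 50000
before <- c(pairlist = 190383, closure = 5077, environment = 1032,
            promise = 5226, character = 44522)
after  <- c(pairlist = 390633, closure = 55085, environment = 51032,
            promise = 105226, character = 94569)

# Absolute growth per object type, and growth per parse iteration
growth <- after - before
per_iteration <- round(growth / n, 1)
per_iteration
```

On these numbers, each iteration leaves behind roughly four pairlists, two promises, and one closure, one environment and one character vector on the R side.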
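Scenario 2
Quick facts: garbage collection enabled, free is explicitly called, the XML doc is parsed n times but not searched via xpathApply.
Note the ratio of OS memory to R memory:
Before: 1.315249
After: 1.222143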
    res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, n=50000)
    save(res, file=file.path(tempdir(), "memory-profile-2.rdata"))

    > res
    $before
    $before$mem_r
    [1] 63.48

    $before$mem_os
    [1] 83.492

    $before$ratio
    [1] 1.315249

    $perrun
    NULL

    $gc
    $gc$gc_mb
    [1] 69.3

    $after
    $after$mem_r
    [1] 95.92

    $after$mem_os
    [1] 117.228

    $after$ratio
    [1] 1.222143

    $comparison_r
                before  after increase
    NULL             1      1   0.0000
    symbol        7454   7454   0.0000
    pairlist    392455 592466   0.5096
    closure      55104 105104   0.9074
    environment  51032 101032   0.9798
    promise     105226 205226   0.9503
    language     55592  55592   0.0000
    special         44     44   0.0000
    builtin        648    648   0.0000
    char          8847   8848   0.0001
    logical       9141   9141   0.0000
    integer      23109  23111   0.0001
    double        2802   2807   0.0018
    complex          1      1   0.0000
    character    94775 144781   0.5276
    ...              0      0      NaN
    any              0      0      NaN
    list         20174  20177   0.0001
    expression       1      1   0.0000
    bytecode     16265  16265   0.0000
    externalptr   1488   1487  -0.0007
    weakref        392    391  -0.0026
    raw            393    392  -0.0025
    S4            1392   1392   0.0000

    $increase_r
    [1] 0.5110271

    $increase_os
    [1] 0.4040627
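Scenario 3
Quick facts: garbage collection enabled, free is explicitly called, the XML doc is parsed n times and searched via xpathApply each time.
Note the ratio of OS memory to R memory:
Before: 1.220429
After: 13.15629 (!)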
    res <- memoryLeak(clean_up=TRUE, xpath=TRUE, n=50000)
    save(res, file=file.path(tempdir(), "memory-profile-3.rdata"))

    > res
    $before
    $before$mem_r
    [1] 95.94

    $before$mem_os
    [1] 117.088

    $before$ratio
    [1] 1.220429

    $perrun
    NULL

    $gc
    $gc$gc_mb
    [1] 93.4

    $after
    $after$mem_r
    [1] 124.64

    $after$mem_os
    [1] 1639.8

    $after$ratio
    [1] 13.15629

    $comparison_r
                before  after increase
    NULL             1      1   0.0000
    symbol        7454   7460   0.0008
    pairlist    592458 793042   0.3386
    closure     105104 155110   0.4758
    environment 101032 151032   0.4949
    promise     205226 305226   0.4873
    language     55592  55882   0.0052
    special         44     44   0.0000
    builtin        648    648   0.0000
    char          8847   8867   0.0023
    logical       9142   9162   0.0022
    integer      23109  23112   0.0001
    double        2802   2832   0.0107
    complex          1      1   0.0000
    character   144775 194819   0.3457
    ...              0      0      NaN
    any              0      0      NaN
    list         20174  20177   0.0001
    expression       1      1   0.0000
    bytecode     16265  16265   0.0000
    externalptr   1488   1487  -0.0007
    weakref        392    391  -0.0026
    raw            393    392  -0.0025
    S4            1392   1392   0.0000

    $increase_r
    [1] 0.2991453

    $increase_os
    [1] 13.00485
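To make the gap between the two views of memory explicit, the sketch below recomputes the Scenario 3 ratios and relative increases from the four reported values, using the same formulas as memoryLeak. The input numbers are copied from the output above; the variable names are only illustrative.

```r
# Values reported for Scenario 3 (same units as in the output above)
before <- list(mem_r = 95.94,  mem_os = 117.088)
after  <- list(mem_r = 124.64, mem_os = 1639.8)

# Ratio of OS-reported memory to R-reported memory
ratio_before <- before$mem_os / before$mem_r
ratio_after  <- after$mem_os / after$mem_r

# Relative increase as seen from within R and as seen by the OS
increase_r  <- after$mem_r / before$mem_r - 1
increase_os <- after$mem_os / before$mem_os - 1

round(c(ratio_before = ratio_before, ratio_after = ratio_after,
        increase_r = increase_r, increase_os = increase_os), 4)
```

The R-side figure grows by about 30%, while the OS-side figure grows by about 1300%, which is the discrepancy the (!) above refers to.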
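I also tried different versions. Well, I tried to try ;-)
From source, from omegahat.org
FYI: the latest Rtools 3.1 is installed and included in the Windows PATH (e.g. installing stringr from source works fine).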
    > install.packages("XML", repos="http://www.omegahat.org/R", type="source")
    trying URL 'http://www.omegahat.org/R/src/contrib/XML_3.98-1.tar.gz'
    Content type 'application/x-gzip' length 1543387 bytes (1.5 Mb)
    opened URL
    downloaded 1.5 Mb

    * installing *source* package 'XML' ...
    Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
    Warning: running command 'sh ./configure.win' had status 1
    ERROR: configuration failed for package 'XML'
    * removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
    * restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'

    The downloaded source packages are in
        'C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\downloaded_packages'
    Warning messages:
    1: running command '"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" CMD INSTALL -l "R:\home\apps\lsqmapps\apps\r\R-3.1.0\library" C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/downloaded_packages/XML_3.98-1.tar.gz' had status 1
    2: In install.packages("XML", repos = "http://www.omegahat.org/R", :
      installation of package 'XML' had non-zero exit status
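GitHub
I did not follow the recommendation in the README of the GitHub repo, because it points to this directory, which only contains a tar.gz of version 3.94-0 (while CRAN is already at 3.98-1.1).
Even though the README says the GitHub repo is not in standard R package structure, I tried install_github anyway, and failed ;-)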
    require("devtools")
    > install_github(repo="XML", username="omegahat")
    Installing github repo XML/master from omegahat
    Downloading master.zip from https://github.com/omegahat/XML/archive/master.zip
    Installing package from C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/master.zip
    Installing XML
    "R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" --vanilla CMD INSTALL \
      "C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\devtools15c82d7c2b4c\XML-master" \
      --library="R:/home/apps/lsqmapps/apps/r/R-3.1.0/library" --with-keep.source \
      --install-tests

    * installing *source* package 'XML' ...
    Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
    Warning: running command 'sh ./configure.win' had status 1
    ERROR: configuration failed for package 'XML'
    * removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
    * restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
    Error: Command failed (1)
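> read_xml() to read an XML file
> xml_children() to get the child nodes of a node
> xml_text() to get the text inside a tag
> xml_attrs() to get a character vector of a node's attributes and values, which can be converted to a named list with as.list()
Note that once you are done you still need to make sure you rm() your XML node objects and force garbage collection with gc(), but then the memory actually is released to the O/S (disclaimer: only tested on Windows 7, but that seems to be the most "memory-leaky" platform anyway).
Hope this helps someone!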
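The four function names listed above match the API of the xml2 package, so the sketch below assumes that is the package meant. It walks through read, extract, rm() and gc() on a small inline document. The sample XML and the variable names are made up for illustration, and the sketch does not by itself demonstrate that memory is returned to the O/S.

```r
library(xml2)  # ASSUMPTION: the functions above are taken to be xml2's

# Made-up sample document; read_xml() also accepts a file path or URL
doc <- read_xml(
  "<cars><car id='1' make='Mazda'>RX4</car><car id='2' make='Datsun'>710</car></cars>"
)

nodes <- xml_children(doc)                  # child nodes of the root
texts <- xml_text(nodes)                    # text inside each <car> tag
attrs <- lapply(xml_attrs(nodes), as.list)  # one named list per node

# Copy what you need into plain R objects, then drop the node objects
# and force a garbage collection, as recommended above
rm(nodes, doc)
invisible(gc())

texts
attrs[[1]]$make
```

Extracting the values into ordinary vectors and lists before calling rm() keeps the results usable after the document itself has been released.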