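I am using rvest in R to scrape a web page, and I want to extract the equivalent of innerHTML from a node, in particular so that I can convert line breaks (<br>) into newline characters before applying html_text.
An example of the desired functionality: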
    library(rvest)
    doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')
    innerHTML(doc, ".pp")

It should produce the following output:

    [1] "<p class=\"pp\">First Line<br>Second Line</p>"
With rvest 0.2, this can be achieved with toString.XMLNode:

    # run under rvest 0.2
    library(XML)
    html('<html><p class="pp">First Line<br />Second Line</p>') %>%
      html_node(".pp") %>%
      toString.XMLNode
    [1] "<p class=\"pp\">First Line<br>Second Line</p>"
With the newer rvest 0.2.0.900, this no longer works:

    # run under rvest 0.2.0.900
    library(XML)
    html_node(doc, ".pp") %>%
      toString.XMLNode
    [1] "{xml_node}\n<p>\n[1] <br/>"
The desired functionality is essentially available in the write_xml function of the xml2 package, which rvest now depends on. If only write_xml could return its output to a variable instead of insisting on writing to a file. (It does not accept a textConnection either.)

    # extract innerHTML, workaround: write/read to/from temp file
    html_innerHTML <- function(x, css, xpath) {
      file <- tempfile()
      html_node(x, css) %>% write_xml(file)
      txt <- readLines(file, warn = FALSE)
      unlink(file)
      txt
    }
    html_innerHTML(doc, ".pp")
    [1] "<p class=\"pp\">First Line<br>Second Line</p>"
With that, I can then convert the line-break tags into newline characters:

    html_innerHTML(doc, ".pp") %>%
      gsub("<br\\s*/?\\s*>", "\n", .) %>%
      read_html %>%
      html_text
    [1] "First Line\nSecond Line"
Solution
As @r2evans pointed out, as.character(doc) is the solution.
As for your last code snippet, which aims to extract the <br>-separated text from a node while converting <br> to newlines, there is a workaround in comment #2 of the currently unresolved rvest issue #175.
A simplified version for this problem:

    doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

    # r2evans's solution:
    as.character(rvest::html_node(doc, xpath = "//p"))
    ## [1] "<p class=\"pp\">First Line<br>Second Line</p>"

    # rentrop@github's solution, simplified:
    innerHTML <- function(x, trim = FALSE, collapse = "\n") {
      paste(xml2::xml_find_all(x, ".//text()"), collapse = collapse)
    }
    innerHTML(doc)
    ## [1] "First Line\nSecond Line"
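For readers on a later rvest release, a shorter route may be available. The sketch below assumes rvest 1.0.0 or newer, where html_text2() and html_element() exist; neither is part of the versions discussed in this thread, so treat it as an alternative rather than the method described above. html_text2() converts <br> tags to "\n" while extracting the text, so no intermediate innerHTML string is needed.

```r
library(rvest)

# Same sample document as in the question
doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# Assumption: rvest >= 1.0.0. html_text2() turns <br> into "\n" while
# extracting text, so the gsub()/re-parse step is not required.
txt <- html_text2(html_element(doc, ".pp"))
txt
```

If the raw markup is still needed, as.character() on the selected node remains the way to get it.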