I took the following code from the rNomads package and modified it a bit.
When I first ran it I got:
> WebCrawler(url = "www.bikeforums.net")
[1] "www.bikeforums.net"
[1] "www.bikeforums.net"
Warning message:
XML content does not seem to be XML: 'www.bikeforums.net'
Here is the code:
require("XML") # cleaning workspace rm(list = ls()) # This function recursively searches for links in the given url and follows every single link. # It returns a list of the final (dead end) URLs. # depth - How many links to return. This avoids having to recursively scan hundreds of links. Defaults to NULL,which returns everything. WebCrawler <- function(url,depth = NULL,verbose = TRUE) { doc <- XML::htmlParse(url) links <- XML::xpathSApply(doc,"//a/@href") XML::free(doc) if(is.null(links)) { if(verbose) { print(url) } return(url) } else { urls.out <- vector("list",length = length(links)) for(link in links) { if(!is.null(depth)) { if(length(unlist(urls.out)) >= depth) { break } } urls.out[[link]] <- WebCrawler(link,depth = depth,verbose = verbose) } return(urls.out) } } # Execution WebCrawler(url = "www.bikeforums.net")
Any suggestions as to what I am doing wrong?
UPDATE
Hi everyone,
I started this bounty because I think the R community needs a function like this that can crawl webpages. The solution that wins the bounty should show a function that takes two parameters:
WebCrawler(url = "www.bikeforums.net", xpath = "\\title")
As output I would like a data frame with two columns: the link to the site, and, if the example xpath expression matched, a column with the matched expression.
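To make this concrete, here is a minimal sketch of the interface I have in mind, built on the XML package. The scrape.page helper, the breadth-first queue, and the //title XPath in the commented call are illustrative assumptions on my part, not existing rNomads or XML functions.

require("XML")

# Rough sketch only: scrape one page, returning its outgoing links and the
# first match of the supplied XPath expression (NA if nothing matched).
scrape.page <- function(url, xpath) {
  doc <- tryCatch(XML::htmlParse(url), error = function(e) NULL)
  if (is.null(doc)) {
    return(list(links = character(0), match = NA_character_))
  }
  links <- XML::xpathSApply(doc, "//a/@href")
  hit   <- XML::xpathSApply(doc, xpath, XML::xmlValue)
  XML::free(doc)
  list(links = if (is.null(links)) character(0) else as.character(links),
       match = if (length(hit) == 0) NA_character_ else hit[[1]])
}

# Rough sketch only: breadth-first crawl up to 'depth' pages, collecting one
# row per visited URL.
WebCrawler <- function(url, xpath, depth = 10) {
  to.visit <- url
  visited  <- character(0)
  matches  <- character(0)
  while (length(to.visit) > 0 && length(visited) < depth) {
    current  <- to.visit[1]
    to.visit <- to.visit[-1]
    if (current %in% visited) next
    page     <- scrape.page(current, xpath)
    visited  <- c(visited, current)
    matches  <- c(matches, page$match)
    # queue only absolute links for further crawling
    to.visit <- c(to.visit, page$links[grepl("^http", page$links)])
  }
  data.frame(url = visited, match = matches, stringsAsFactors = FALSE)
}

# Example call (note the explicit http:// prefix):
# WebCrawler(url = "http://www.bikeforums.net", xpath = "//title")

Here the depth argument simply caps the number of pages visited.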
I really appreciate your replies.
Insert the following code below the line links <- XML::xpathSApply(doc, "//a/@href") in your function.
links <- XML::xpathSApply(doc, "//a/@href")
links1 <- links[grepl("http", links)]                 # as @Floo0 pointed out, this captures non-relative links
links2 <- paste0(url, links[!grepl("http", links)])   # and this captures relative links
links <- c(links1, links2)
Also remember to pass the URL as http://www...
Also, you are not updating your urls.out list anywhere. As written, it always ends up as an empty list with the same length as links: the pre-allocated slots are never filled, because urls.out[[link]] indexes by the link text rather than by position.
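For reference, here is a minimal sketch of the corrected loop, keeping the rest of the WebCrawler function from the question unchanged; the idea is to index urls.out by position (seq_along) so the pre-allocated slots actually receive the recursive results. The variables links, depth and verbose come from the enclosing function.

# inside WebCrawler(), after links has been built as above
urls.out <- vector("list", length = length(links))
for (i in seq_along(links)) {
  if (!is.null(depth) && length(unlist(urls.out)) >= depth) {
    break
  }
  urls.out[[i]] <- WebCrawler(links[i], depth = depth, verbose = verbose)
}
return(urls.out)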