I am extracting user comments from a number of websites (such as reddit.com), and YouTube is another juicy source of information for me. My existing scraper is written in R:
library(RCurl)
library(XML)

# x is the url
html <- getURL(x)
doc  <- htmlParse(html, asText = TRUE)
txt  <- xpathSApply(doc,
                    "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]",
                    xmlValue)
This does not work on YouTube data. In fact, if you look at the source of a YouTube video like this one, you will find that the comments do not appear in the source code.
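A quick way to confirm this from R — a minimal sketch, assuming the comment nodes carry the .comment-text-content class used in the answer below — is to fetch the page statically and query for those nodes; the query comes back empty, because YouTube injects the comments with JavaScript after the initial page load:

library(RCurl)
library(XML)

html <- getURL("https://www.youtube.com/watch?v=qRC4Vk6kisY")
doc  <- htmlParse(html, asText = TRUE)

# returns an empty list: the comment nodes are not in the static source
xpathSApply(doc, "//*[contains(@class,'comment-text-content')]", xmlValue)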
Does anyone have any suggestions on how to extract the data in this case?
Many thanks!
Solution
Following the answer in:
R: rvest: scraping a dynamic ecommerce page
you can do the following:
devtools::install_github("ropensci/RSelenium") # install from github
library(RSelenium)
library(rvest)

pJS <- phantom(pjs_cmd = "PATH TO phantomjs.exe") # as I am using Windows
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()
remDr$navigate("https://www.youtube.com/watch?v=qRC4Vk6kisY")
remDr$getTitle()[[1]]
# [1] "YouTube"

# scroll down to trigger loading of the comments
for(i in 1:5){
  remDr$executeScript(paste("scroll(0,", i*10000, ");"))
  Sys.sleep(3)
}

# get the rendered page source and parse it via rvest
page_source <- remDr$getPageSource()
author <- html(page_source[[1]]) %>% html_nodes(".user-name") %>% html_text()
text   <- html(page_source[[1]]) %>% html_nodes(".comment-text-content") %>% html_text()

# combine the data in a data.frame
dat <- data.frame(author = author, text = text)

Result:

> head(dat)
              author                                                                                        text
1 Kikyo bunny simpie  Omg I love fluffy puff she's so adorable when she was dancing on a rainbow it's so cute!!!
2   Tatjana Celinska                                                                                      Ciao 0
3      Yvette Austin                                                                      GET OUT OF MY HEAD!!!!
4           Susan II                                                                              Watch narhwals
5        Greg Ginger         who in the entire fandom never watched this,should be ashamed,\n\nPFFFTT!!!
6        Arnav Sinha                                                                  LOL what the hell is this?
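The trick here is that PhantomJS is a headless browser, so getPageSource() returns the DOM after the page's JavaScript has run, not the static source that getURL() sees. The scroll loop matters too: it is what makes YouTube lazy-load the comment section before the source is read, and the Sys.sleep(3) pauses give each batch time to arrive.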
Comment 1: You need the github version, see rselenium | get youtube page source
Comment 2:
This code gets you the initial 44 comments. Some comments have a "show all answers" link that has to be clicked, and to see further comments you also have to click the "Show more" button at the bottom of the page.
Clicking is covered in this excellent RSelenium tutorial:
http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html
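Building on that tutorial, here is a minimal sketch of such a click, reusing the remDr session from the answer above. The CSS selector is an assumption for illustration only; the actual class of the "Show more" button has to be read off the live page:

# NOTE: ".comment-section-renderer-paginator" is a hypothetical selector;
# inspect the live page to find the real class of the "Show more" button.
show_more <- remDr$findElement(using = "css selector",
                               value = ".comment-section-renderer-paginator")
show_more$clickElement()
Sys.sleep(3) # give the additional comments time to load
page_source <- remDr$getPageSource() # then re-parse with rvest as above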