html – Web scraping across multiple pages in R

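I am working on a web-scraping program that searches for a specific wine and returns a list of local wines of that variety. The problem I am running into is multi-page results. The code below is a basic example of what I am using: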
library(rvest)

url2 <- "http://www.winemag.com/?s=washington+merlot&search_type=reviews"
htmlpage2 <- read_html(url2)
# select the title node of each review listing, then extract its text
names2 <- html_nodes(htmlpage2, ".review-listing .title")
Wines2 <- html_text(names2)

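For this particular search there are 39 pages of results. I know the URL changes to http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=2, but is there a simple way to make the code loop through all the returned pages and compile the results from all 39 pages into a single list? I know I could enter every URL by hand, but that seems like overkill.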

Solution

If you want all the information as a data.frame, you can use purrr::map_df() to do something like this:
library(rvest)
library(purrr)

# "+" encodes the space in the query; a literal "%20" would clash with sprintf()'s format syntax
url_base <- "http://www.winemag.com/?s=washington+merlot&drink_type=wine&page=%d"

map_df(1:39,function(i) {

  # simple but effective progress indicator
  cat(".")

  pg <- read_html(sprintf(url_base,i))

  data.frame(
    wine        = html_text(html_nodes(pg, ".review-listing .title")),
    excerpt     = html_text(html_nodes(pg, "div.excerpt")),
    rating      = gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
    appellation = html_text(html_nodes(pg, "span.appellation")),
    # strip the dollar sign from the price text
    price       = gsub("\\$", "", html_text(html_nodes(pg, "span.price"))),
    stringsAsFactors = FALSE
  )

}) -> wines
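
A loop over 39 requests stops at the first page that fails to download. One possible way to guard against that is to wrap the fetch in tryCatch() and pause briefly between requests. This is only a sketch: fetch_page_safely, its fetch argument, and the one-second default pause are names and values introduced here for illustration and are not part of the answer above.

```r
# Wrap a page fetch so that a failing page yields NULL instead of
# aborting the whole run. `fetch` defaults to rvest::read_html; it is a
# parameter (an assumption of this sketch) so the wrapper can be tried
# without network access.
fetch_page_safely <- function(url, fetch = rvest::read_html, pause = 1) {
  Sys.sleep(pause)  # brief pause between requests to go easy on the server
  tryCatch(
    fetch(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}
```

Inside the function passed to map_df(), pg could then be obtained with fetch_page_safely(sprintf(url_base, i)), and pages for which it returns NULL would need to be skipped before the nodes are extracted.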

dplyr::glimpse(wines)
## Observations: 1,170
## Variables: 5
## $wine        (chr) "Charles Smith 2012 Royal City Syrah (Columbia Valley (WA)...
## $excerpt     (chr) "Green olive,green stem and fresh herb aromas are at the ...
## $rating      (chr) "96","95","94","93","93"...
## $appellation (chr) "Columbia Valley","Columbia Valley","...
## $price       (chr) "140","70","20","40","135","50","60","3...
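
The glimpse() output shows rating and price as character columns, so they would need converting before any sorting or arithmetic. A minimal sketch in base R follows. The wines_sample data frame and its values are a made-up stand-in for the scraped result, and the "N/A" entry is a hypothetical example of a price that does not parse.

```r
# Stand-in for the scraped result: a few rows with character columns,
# mirroring the column types reported by glimpse(). Values are illustrative.
wines_sample <- data.frame(
  rating = c("96", "95", "94"),
  price  = c("140", "70", "N/A"),  # "N/A" is a hypothetical non-numeric entry
  stringsAsFactors = FALSE
)

# as.numeric() turns unparseable strings into NA and emits a warning;
# suppressWarnings() silences the warning while keeping the NA.
wines_sample$rating <- as.numeric(wines_sample$rating)
wines_sample$price  <- suppressWarnings(as.numeric(wines_sample$price))

str(wines_sample)
```

The same two assignments could be applied to the wines data frame itself once the scrape has finished.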
