Scraping data from LinkedIn using RSelenium and rvest

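I am trying to scrape some data from the LinkedIn profiles of well-known people, and I have run into a few problems. I would like to do the following: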

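On Hadley Wickham's page (https://www.linkedin.com/in/hadleywickham/), I would like to use RSelenium to log in and click "Show 1 more education" and also "Show 1 more experience". (Note that Hadley's page does not offer "Show 1 more experience", but it does offer "Show 1 more education".) Clicking "Show more experience/education" would let me scrape the full education and experience lists from the page. Alternatively, Ted Cruz's page has a "Show 5 more experiences" option, which I would also like to expand and scrape.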

Code:

library(RSelenium)
library(rvest)
library(stringr)
library(xml2)

userID = "myEmailLogin" # The linkedIn email to login
passID = "myPassword"   # and LinkedIn password

try(rsDriver(port = 4444L,browser = 'firefox'))
remDr <- remoteDriver()
remDr$open()
remDr$navigate("https://www.linkedin.com/login")

user <- remDr$findElement(using = 'id',"username")
user$sendKeysToElement(list(userID,key="tab"))

pass <- remDr$findElement(using = 'id',"password")
pass$sendKeysToElement(list(passID,key="enter"))

Sys.sleep(5) # give the page time to fully load
# Navigate to individual profiles
# remDr$navigate("https://www.linkedin.com/in/thejlo/") # Jennifer Lopez
# remDr$navigate("https://www.linkedin.com/in/cruzted/") # Ted Cruz
remDr$navigate("https://www.linkedin.com/in/hadleywickham/") # Hadley Wickham 

Sys.sleep(5) # give the page time to fully load
html <- remDr$getPageSource()[[1]]


signals <- read_html(html)

personFullNameLocationXPath <- '/html/body/div[9]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/ul[1]/li[1]'
personName <- signals %>%
  html_nodes(xpath = personFullNameLocationXPath) %>% 
  html_text()

personTagLineXPath = '/html/body/div[9]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/h2'
personTagLine <- signals %>% 
  html_nodes(xpath = personTagLineXPath) %>% 
  html_text()

personLocationXPath <- '//*[@id="ember49"]/div[2]/div[2]/div[1]/ul[2]/li[1]'
personLocation <- signals %>% 
  html_nodes(xpath = personLocationXPath) %>% 
  html_text()

# Strip line breaks and surrounding whitespace, and keep the cleaned value
personLocation <- personLocation %>% 
  gsub("[\r\n]","",.) %>% 
  str_trim(.)

# Here is where I have problems

personExperienceTotalXPath = '//*[@id="experience-section"]/ul'
personExperienceTotal <- signals %>% 
  html_nodes(xpath = personExperienceTotalXPath) %>% 
  html_text()

The last one, personExperienceTotal, is where things go wrong: I can't seem to scrape the experience-section. When I use my own LinkedIn URL (or some random person's), it seems to work...
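
A possible explanation, offered as an assumption and not confirmed here, is that longer profiles render their sections lazily as the page scrolls, so a fixed Sys.sleep(5) can capture the page source before the experience section exists. A minimal sketch of an explicit wait that scrolls until the element appears is shown below. The helper name wait_for_element and its timeout and step parameters are hypothetical; only findElements() and executeScript() are RSelenium calls.

```r
# Scroll down one viewport at a time until an element matching `css`
# exists in the page, or until `timeout` seconds have passed.
# `remDr` is assumed to be an open RSelenium remoteDriver.
# Returns TRUE if the element was found, FALSE on timeout.
wait_for_element <- function(remDr, css, timeout = 15, step = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # findElements() returns an empty list when nothing matches
    found <- remDr$findElements(using = "css selector", value = css)
    if (length(found) > 0) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    # Scroll further down to trigger any lazily loaded content
    remDr$executeScript("window.scrollBy(0, window.innerHeight);")
    Sys.sleep(step)
  }
}

# Usage (assumes the logged-in `remDr` session from the code above):
# if (wait_for_element(remDr, "#experience-section")) {
#   html <- remDr$getPageSource()[[1]]
# }
```

Polling for the element avoids guessing a sleep duration, and the FALSE return makes it easy to tell a missing section apart from an empty one.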

My question is: how do I click the expand experience/education buttons and then scrape those sections?
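
One way this could be approached, sketched under assumptions and not tested against LinkedIn's current markup: find the "Show more" buttons inside the section with RSelenium, click them until none remain, then re-read the page source and parse it with rvest. The helper name expand_section, the button class pv-profile-section__see-more-inline, and the education-section id are assumptions that should be checked in the browser inspector, since LinkedIn's class names change.

```r
# Click every "Show ... more ..." button inside one profile section.
# `remDr` is assumed to be an open RSelenium remoteDriver; `section_id` is the
# section's id attribute (e.g. "experience-section", as in the XPath above).
# `button_css` is an assumed class name; verify it in the browser inspector.
# Returns (invisibly) the number of clicks performed.
expand_section <- function(remDr, section_id,
                           button_css = "button.pv-profile-section__see-more-inline",
                           pause = 1, max_clicks = 20) {
  selector <- paste0("#", section_id, " ", button_css)
  clicks <- 0
  while (clicks < max_clicks) {
    buttons <- remDr$findElements(using = "css selector", value = selector)
    # Keep only buttons whose label mentions "more", so that a
    # "Show fewer" toggle is never clicked by mistake
    is_more <- vapply(
      buttons,
      function(b) grepl("more", b$getElementText()[[1]], ignore.case = TRUE),
      logical(1)
    )
    buttons <- buttons[is_more]
    if (length(buttons) == 0) break
    buttons[[1]]$clickElement()
    clicks <- clicks + 1
    Sys.sleep(pause)  # give the newly revealed items time to render
  }
  invisible(clicks)
}

# Usage (assumes the logged-in `remDr` session from the code above):
# expand_section(remDr, "experience-section")
# expand_section(remDr, "education-section")
# signals <- rvest::read_html(remDr$getPageSource()[[1]])
# experience <- rvest::html_text(
#   rvest::html_nodes(signals, "#experience-section li")
# )
```

The page source has to be fetched again after clicking, because the html captured earlier does not include the expanded items. Selecting by section id and a relative CSS path is also less brittle than the absolute /html/body/div[9]/... XPaths, which break whenever the page layout shifts.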
