
Review-scraping loop with Selenium and Python repeats the first entry instead of moving to the next one

I am currently trying to automatically collect reviews from Tripadvisor (https://www.tripadvisor.com/Attraction_Review-g186338-d553603-Reviews-London_Eye-London_England.html) and save them to a CSV file using Selenium and Python. I came across this code, which works for restaurants and hotels but not for "Things to Do": https://bitbucket.org/devlobeslab/com.lobeslab.webseries.python/src/master/scraping/code/scraper.py

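I have modified most of the code and managed to store the first data entry in the CSV and then move on to the next page. However, on every page the first entry is written to the CSV 10 times before the program moves to the next page, instead of going through the 10 different reviews. If anyone knows where the problem lies, that would be very helpful!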

Website elements:



I also tried writing ".//div..." for score, DATE, TITLE, and REVIEW_TEXT, as suggested in answers to similar questions, but unfortunately that raises a "NoSuchElementException".

Loop:

from selenium.webdriver.common.by import By

NEXT_BTN = (By.XPATH, "//a[@aria-label='Next page']")
REVIEW_LIST = (By.XPATH, "//div[@class='_1c8_1ITO']")
REVIEWS = (By.XPATH, "//div[@class='_1c8_1ITO']/div")
score = (By.XPATH, "//div[@class='_1c8_1ITO']/div[1]/span/span/div[3]/*[local-name()='svg' and @class='zWXXYhVR' and contains(@title,'bubbles')]")
DATE = (By.XPATH, "//div[@class='_3JxPDYSx']")
TITLE = (By.XPATH, "//div[@class='DrjyGw-P _1SRa-qNz _19gl_zL- _1z-B2F-n _2AAjjcx8']/span[1]")
REVIEW_TEXT = (By.XPATH, "//div[@class='DrjyGw-P _26S7gyB4 _2nPM5Opx']/span[@class='_2tsgCuqy']")

def find_element(find_from, element):
    # element is a (By, selector) tuple such as TITLE above
    return find_from.find_element(element[0], element[1])

HTML of the website: Screenshot of the structure

Thanks!!

Solution

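The problem, as already mentioned, is that your code is not dynamic.

The XPath you use for "REVIEWS" is fixed to the first entry. My suggestion is to iterate over that XPath. For example, on the page you shared, the first entry has this XPath

//*[@id="tab-data-qa-reviews-0"]/div/div[5]/div[1]

and the second has this XPath

//*[@id="tab-data-qa-reviews-0"]/div/div[5]/div[2]

From the above, we can infer that you need to change the index in the last pair of brackets to step through the reviews the way you want.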
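
The idea above can be sketched roughly as follows. This is only an illustration, not a tested scraper for the live page: the helper names `review_xpath` and `collect_review_texts` are hypothetical, the default of 10 reviews per page comes from the question, and the base XPath is the one quoted above, which may well change whenever Tripadvisor updates its markup.

```python
from typing import List

# Base XPath quoted in the answer; only the trailing index differs per review.
REVIEW_XPATH_TEMPLATE = '//*[@id="tab-data-qa-reviews-0"]/div/div[5]/div[{index}]'


def review_xpath(index: int) -> str:
    """Return the XPath of the index-th review container (XPath positions start at 1)."""
    if index < 1:
        raise ValueError("XPath positions are 1-based")
    return REVIEW_XPATH_TEMPLATE.format(index=index)


def collect_review_texts(driver, reviews_per_page: int = 10) -> List[str]:
    """Visit each review container in turn and return its visible text.

    `driver` is assumed to be a Selenium WebDriver already on the reviews page.
    """
    texts = []
    for i in range(1, reviews_per_page + 1):
        # "xpath" is the string value of Selenium's By.XPATH constant.
        container = driver.find_element("xpath", review_xpath(i))
        texts.append(container.text)
    return texts
```

If you then look up the score, date, or title inside each `container`, start those XPaths with `.//` rather than `//`: an expression beginning with `//` is evaluated against the whole document even when called on an element, so it keeps returning the first match on the page.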
