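How to find or select elements in Python for scraping with BeautifulSoup
I can't figure out how to select the following items in the table with class="table-info".
Using Python and BeautifulSoup, I want to extract:
- Phone
- E-mail
- Website
- Main activity (the text of the li element without its div): "Computer consultancy activities".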
<table class="table-info"> <tbody> <tr> <td class="col-1"> <div class="col-1-text">Business name</div> </td> <td class="col-2"> <div class="col-2-text">Company XYZ</div> </td> </tr> <tr> <td class="col-1"> <div class="col-1-text">Register code:</div> </td> <td class="col-2"> <div class="col-2-text">112233558</div> </td> </tr> <tr> <td class="col-1"> <div class="col-1-text">Operating address:</div> </td> <td class="col-2"> <div class="col-2-text"><a target="googlemaps" href="https://www.google.com/maps/place/Some-location" class="link-location">Some location strt. 233</a></div> </td> </tr> <tr> <td class="col-1"> <div class="col-1-text">Legal address</div> </td> <td class="col-2"> <div class="col-2-text"> <a class="link-location" href="https://www.google.com/maps/place/Some-location" target="_new">Some location </a> </div> </td> </tr> <tr> <td class="col-1"> <div class="col-1-text">VAT No:</div> </td> <td class="col-2"> <div class="col-2-text"><a href="javascript:void(0)" onclick="return getVAT(this,'12345678')">Get VAT liability</a></div> </td> </tr> <tr> <td class="col-1"> <div class="col-1-text">Age:</div> </td> <td class="col-2"> <div class="col-2-text">1 year 3 months</div> </td> </tr> <tr> <td class="col-1"> <div class="col-1-text">Founded:</div> </td> <td class="col-2"> <div class="col-2-text">20/09/2019</div> </td> </tr> <tr> <td class="col-1"> <div class="col-1-text">Capital:</div> </td> <td class="col-2"> <div class="col-2-text">2000 USD</div> </td> </tr> <tr> <td colspan="2" class="sep"></td> </tr> <tr> <td class="col-1"> <div class="col-1-text">Phone:</div> </td> <td class="col-2"> <div class="col-2-text">123456789</div> </td> </tr> <tr> <td class="col-1"> <div class="col-1-text">E-mail:</div> </td> <td class="col-2"> <div class="col-2-text"><a href="mailto:some@one.com">some@one.com</a></div> </td> </tr> <tr> <td colspan="2" class="sep"></td> </tr> <tr> <td class="col-1"> <div class="col-1-text">Representatives:</div> </td> <td class="col-2"> <div 
class="col-2-text"> <div class="box-message"> <p class="desc">To access information,please</p> <p> <a href="#" onclick="return loginClicked(this,'#');" class="btn btn-small btn-purple link-login">Log in</a> </p> </div> </div> </td> </tr> <tr> <td colspan="2" class="sep"></td> </tr> <tr> <td class="col-1"> <div class="col-1-text"> Main activity: <span class="tip info" title="" data-original-title="Activities are classified according to EMTAK 2008"></span> </div> </td> <td class="col-2"> <div class="col-2-text" id="activity_top5ffe2eab23d13"> <ul> <li> Computer consultancy activities <div class="main_activities_top_link_wrapper"> <a href="https://www.somesite.com/" target="_blank" onclick="ga('send','event','check','top_btn','Anonym');" class="btn btn-simple btn-open-graph"> <span>Open TOP 20</span> </a> </div> </li> </ul> </div> </td> </tr> </tbody>
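Note: the HTML above is one sample query result, but sometimes a query result (company) has no e-mail or no website. So it is important that the code does not raise an error when it cannot find the HTML it is looking for. I think it is better to rely on class names or IDs than on counting how deeply the tables/divs are nested (XPath).
My code, which does not work properly at the moment: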
import csv
import requests
import datetime
import time
from requests import get
from bs4 import BeautifulSoup
with open('data.csv',encoding='utf8') as csvfile:
    reader = csv.reader(csvfile,delimiter=';')
    next(reader)
    count = 0
    for row in reader:
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        url = f'https://www.somedomain.com/result?country=en&q={row[1]}'
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        cookies = {'__test': '1bb6e881021f013463740eeb74840b18'}
        content = get(url,headers=headers,cookies=cookies).content
        soup = BeautifulSoup(content,"lxml")
        table_info = soup.select_one('.table-info')
        mail = table_info.select_one('.col-2 a[href^=mailto]')
        mail = mail.get('href')
        mail_clean = mail.split(':')[1]
        website = soup.find(text='Website:')
        website = table_info.select_one('.col-2 a[target^=_blank]')
        website = website.get('href')
        collected_data = row[1],mail_clean,website,timestamp
        data_list = [["Regcode","Email","Website","Timestamp"],collected_data]
        with open('extracted.csv','w',newline='') as file:
            writer = csv.writer(file,delimiter=';')
            writer.writerows(data_list)
        print(row[1],"|",timestamp)
        #print("Waiting 3 seconds...")
        #time.sleep(3)
        count+=1
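Solution
Have you considered using CSS selectors that count the table's child rows? If your table always mirrors the sample code, the nth-child pseudo-class may be the easier route. Keep in mind that positional selectors shift when a row such as E-mail is absent.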
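- Phone: tr:nth-child(10) .col-2-text
- E-mail: tr:nth-child(11) a
- Website: span (the sample HTML has no Website row, so check this one against a page that has one)
- Main activity: li (the text of this element also includes the nested "Open TOP 20" link text)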
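As a rough sketch of how the positional selectors above could be applied without raising an error when a row is absent: the helper name text_or_none and the filler rows are made up for illustration, and the table is a shortened stand-in that only keeps the row positions of the sample (separator at row 9, Phone at row 10, E-mail at row 11).

```python
from bs4 import BeautifulSoup


def make_row(label, value_html):
    """Build one label/value row in the same shape as the sample table."""
    return (
        '<tr><td class="col-1"><div class="col-1-text">' + label + "</div></td>"
        '<td class="col-2"><div class="col-2-text">' + value_html + "</div></td></tr>"
    )


SEPARATOR = '<tr><td colspan="2" class="sep"></td></tr>'

# Rows 1-8 are placeholders; only their count matters for nth-child.
rows = [make_row("Filler %d:" % i, "x") for i in range(1, 9)]
rows.append(SEPARATOR)                                   # row 9
rows.append(make_row("Phone:", "123456789"))             # row 10
rows.append(make_row("E-mail:", '<a href="mailto:some@one.com">some@one.com</a>'))  # row 11
html = '<table class="table-info"><tbody>' + "".join(rows) + "</tbody></table>"


def text_or_none(root, selector):
    """Hypothetical helper: stripped text of the first match, or None if nothing matches."""
    if root is None:
        return None
    node = root.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.table-info")

phone = text_or_none(table, "tr:nth-child(10) .col-2-text")
email = text_or_none(table, "tr:nth-child(11) a")
missing = text_or_none(table, "tr:nth-child(40) a")  # no such row, so None instead of an exception

print(phone, email, missing)
```

The guard only protects against a selector that matches nothing. If a company page drops the Phone row, row 10 becomes the E-mail row, and a positional selector will silently return the wrong field.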
I used Selector Gadget to grab these selectors. You may want to run it directly on your page to see whether there are others that are easier to use.
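Since the question says that rows such as E-mail or Website may be missing, an alternative to counting rows is to look each row up by its label text in .col-1-text. The following is only a sketch under assumptions: value_cell and own_text are hypothetical helper names, the HTML is a trimmed copy of the sample, and the "Website" label is taken from the question's code because the sample contains no such row (so the shape of that row is a guess).

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the sample table: Phone, E-mail and Main activity rows only.
html = """
<table class="table-info"><tbody>
<tr><td class="col-1"><div class="col-1-text">Phone:</div></td>
    <td class="col-2"><div class="col-2-text">123456789</div></td></tr>
<tr><td class="col-1"><div class="col-1-text">E-mail:</div></td>
    <td class="col-2"><div class="col-2-text"><a href="mailto:some@one.com">some@one.com</a></div></td></tr>
<tr><td class="col-1"><div class="col-1-text"> Main activity: <span class="tip info"></span> </div></td>
    <td class="col-2"><div class="col-2-text"><ul><li> Computer consultancy activities
      <div class="main_activities_top_link_wrapper"><a href="https://www.somesite.com/"><span>Open TOP 20</span></a></div>
    </li></ul></div></td></tr>
</tbody></table>
"""


def value_cell(table, label):
    """Hypothetical helper: the .col-2-text div of the row whose label starts with `label`, else None."""
    if table is None:
        return None
    for row in table.select("tr"):
        name = row.select_one(".col-1-text")
        if name is not None and name.get_text(strip=True).startswith(label):
            return row.select_one(".col-2-text")
    return None


def own_text(tag):
    """Hypothetical helper: join only the tag's direct text nodes, skipping text inside child tags."""
    if tag is None:
        return None
    return "".join(c for c in tag.children if isinstance(c, NavigableString)).strip()


soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.table-info")

phone_cell = value_cell(table, "Phone")
phone = phone_cell.get_text(strip=True) if phone_cell is not None else None

email_cell = value_cell(table, "E-mail")
email_link = email_cell.select_one('a[href^="mailto:"]') if email_cell is not None else None
email = email_link["href"].split(":", 1)[1] if email_link is not None else None

# Assumption: a Website row, when present, holds a plain link in its value cell.
website_cell = value_cell(table, "Website")
website_link = website_cell.select_one("a[href]") if website_cell is not None else None
website = website_link["href"] if website_link is not None else None

activity_cell = value_cell(table, "Main activity")
activity = own_text(activity_cell.select_one("li")) if activity_cell is not None else None

print(phone, email, website, activity)
```

Every lookup falls back to None when its row or link is absent, so a company without an e-mail or website produces an empty field rather than an AttributeError. Reading only the li element's direct text nodes leaves out the "Open TOP 20" text from the nested div.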