
requests-html: finding a td by its content

How to find a td by its content with requests-html

So I am trying to scrape this table using requests-html:

<table class="pet-listing__list rescue-details">
<tbody>
<tr>
<td>Rescue group:</td>
<td><a href="/groups/10282/Dog-Rescue-Newcastle">Dog Rescue Newcastle</a></td>
</tr>
<tr>
<td>PetRescue ID:</td>
<td>802283</td>
</tr>
<tr>
<td>Location:</td>
<td>Toronto,NSW</td>
</tr>
<tr>
<td class="first age">Age:</td>
<td class="first age">1 year 2 months</td>
</tr>
<tr>
<td class="adoption_fee">Adoption fee:</td>
<td class="adoption_fee">$550.00</td>
</tr>
<tr>
<td class="desexed">Desexed:</td>
<td class="desexed"><span class="boolean-image-true boolean-image-yes">Yes</span></td>
</tr>
<tr>
<td class="vaccinated">Vaccinated:</td>
<td class="vaccinated"><span class="boolean-image-true boolean-image-yes">Yes</span></td>
</tr>
<tr>
<td class="wormed">Wormed:</td>
<td class="wormed"><span class="boolean-image-true boolean-image-yes">Yes</span></td>
</tr>
<tr>
<td class="microchip_number">microchip number:</td>
<td class="microchip_number">OnFile</td>
</tr>
<tr>
<td class="rehoming_organisation_id">Rehoming organisation:</td>
<td class="rehoming_organisation_id">R251000026</td>
</tr>
</tbody>
</table>

The documentation doesn't seem to mention a way to find the next td, for example if I want to scrape the dog's rescue group or location. Is there a way to scrape those cells from the table using requests-html alone, or do I need to hand the page off to something like bs4/lxml for parsing?

The code so far (which raises an error, because HTMLSession.html.find does not accept a text attribute the way bs4 does):

class PetBarnCrawler(DogCrawler):
    """Looks for dogs on Petbarn"""
    def __init__(self, url="https://www.petrescue.com.au/listings/search/dogs"):
        super(PetBarnCrawler, self).__init__(url)

    def _get_dogs(self, **kwargs):
        """Get listing of all dogs"""
        for html in self.current_page.html:
            # grab all the dogs on the page
            dog_previews = html.find("article.cards-listings-preview")
            for preview in dog_previews:
                new_session = HTMLSession()
                page_link = preview.find("a.cards-listings-preview__content")[0].attrs["href"]
                dog_page = new_session.get(page_link)
                # populate the dictionary with all the parameters of interest
                this_dog = {
                    "id": os.path.split(urllib.parse.urlparse(dog_page.url).path)[1],
                    "url": page_link,
                    "name": dog_page.html.find(".pet-listing__content__name"),
                    "breed": dog_page.html.find(".pet-listing__content__breed"),
                    "age": dog_page.html.find("td.age")[1],
                    "price": dog_page.html.find("td.adoption_fee")[1],
                    "desexed": dog_page.html.find("td.desexed")[1],
                    "vaccinated": dog_page.html.find("td.vaccinated")[1],
                    "wormed": dog_page.html.find("td.wormed")[1],
                    "feature": dog_page.html.find(".pet-listing__content__feature"),
                    "rescue_group": dog_page.html.find("td", text="Rescue group:").find_next("td"),
                    "rehoming_organisation_id": dog_page.html.find("td.rehoming_organisation_id")[1],
                    "location": dog_page.html.find("td", text="Location:").find_next("td"),
                    "description": dog_page.html.find(".personality"),
                    "medical_notes": dog_page.html.find("."),
                    "adoption_process": dog_page.html.find(".adoption_process"),
                }
                self.dogs.append(this_dog)
                new_session.close()

Solutions

Something like this should solve your problem.

tr = table.findAll(['tr'])[3]

The [3] specifies the position: indexing is zero-based, so this selects the fourth tr in the table.
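The answer above uses BeautifulSoup's findAll; as a minimal offline sketch of the same positional idea, using only the standard library's ElementTree on a trimmed copy of the table from the question (so no network access or bs4 is needed):

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the table from the question.
TABLE = """<table>
<tr><td>Rescue group:</td><td>Dog Rescue Newcastle</td></tr>
<tr><td>PetRescue ID:</td><td>802283</td></tr>
<tr><td>Location:</td><td>Toronto,NSW</td></tr>
</table>"""

root = ET.fromstring(TABLE)
rows = root.findall(".//tr")   # all rows, in document order
location_row = rows[2]         # zero-based: index 2 is the third row
label, value = [td.text for td in location_row.findall("td")]
print(label, value)  # Location: Toronto,NSW
```

Note that position-based indexing breaks as soon as the site adds or reorders rows; matching on the label cell's text, as the accepted answer further down does, is more robust.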

Update: 09/25

After taking a further look at the site and inspecting the tags, the location details you are looking for are stored under the following class: 'cards-listings-preview__content__section__location'

This code let me scrape the location details from the site.

location = soup.find_all('strong', attrs={'class': 'cards-listings-preview__content__section__location'})

It turns out I had not read the documentation carefully enough.

The XPath query support built into requests-html is sufficient on its own; there is no need to walk the document tree with a library like bs4 or lxml:

{
    ...
    "location": dog_page.html.xpath("//tr[td='Location:']/td[2]")[0].text,
    ...
}

cf. this post: XPath: Get following sibling
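The same label-matching XPath can be checked offline. Here is a minimal sketch using only the standard library's ElementTree, which supports this subset of XPath (note that ElementTree paths must start with `.` rather than `//`), run against a trimmed copy of the table from the question:

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the table from the question.
TABLE = """<table>
<tr><td>Rescue group:</td><td><a href="/groups/10282">Dog Rescue Newcastle</a></td></tr>
<tr><td>Location:</td><td>Toronto,NSW</td></tr>
<tr><td class="adoption_fee">Adoption fee:</td><td class="adoption_fee">$550.00</td></tr>
</table>"""

root = ET.fromstring(TABLE)

# Select the tr whose label td reads 'Location:', then take its second td.
cell = root.findall(".//tr[td='Location:']/td[2]")[0]
print(cell.text)  # Toronto,NSW
```

In requests-html itself the expression is the one shown in the answer, `dog_page.html.xpath("//tr[td='Location:']/td[2]")`, since its xpath method accepts full XPath 1.0 queries against the fetched page.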
