
Removing unwanted URL parts with bs4

How to remove unwanted URL parts with bs4

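I need the slugs of all the articles on a page. I use bs4 to get the href of every article link, but some of the links include an extra URL prefix that I don't need. I want to strip that prefix. This is the code I used: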

import requests
import re
from bs4 import BeautifulSoup

# fetch the profile page and parse its HTML
r = requests.get('https://davidventuri.medium.com/')
soup = BeautifulSoup(r.text,'html.parser')

# collect every article link by its CSS classes
all_slugs = soup.find_all('a',{'class': 'dn br'})

for i in range(len(all_slugs)):
    slug = all_slugs[i]['href']
    print(slug)

These are the hrefs I get:

/this-is-not-a-real-data-science-degree-d170c660c1cf

/not-a-real-degree-data-science-curriculum-2021-19ba9af2c1d4

/bitcoin-learning-path-9ed73f2f11d9

/your-first-day-of-school-eaf363b19ded

https://medium.com/free-code-camp/an-overview-of-every-data-visualization-course-on-the-internet-9ccf24ea9c9b

https://medium.com/free-code-camp/the-best-data-science-courses-on-the-internet-ranked-by-your-reviews-6dc5b910ea40

https://medium.com/free-code-camp/every-single-machine-learning-course-on-the-internet-ranked-by-your-reviews-3c4a7b8026c0

https://medium.com/free-code-camp/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0

/how-ai-is-revolutionizing-mental-health-care-a7cec436a1ce

https://medium.com/free-code-camp/i-ranked-all-the-best-data-science-intro-courses-based-on-thousands-of-data-points-db5dc7e3eb8e

What I actually want is this:

/this-is-not-a-real-data-science-degree-d170c660c1cf

/not-a-real-degree-data-science-curriculum-2021-19ba9af2c1d4

/bitcoin-learning-path-9ed73f2f11d9

/your-first-day-of-school-eaf363b19ded

/an-overview-of-every-data-visualization-course-on-the-internet-9ccf24ea9c9b

/the-best-data-science-courses-on-the-internet-ranked-by-your-reviews-6dc5b910ea40

/every-single-machine-learning-course-on-the-internet-ranked-by-your-reviews-3c4a7b8026c0

/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0

/how-ai-is-revolutionizing-mental-health-care-a7cec436a1ce

/i-ranked-all-the-best-data-science-intro-courses-based-on-thousands-of-data-points-db5dc7e3eb8e

How can I remove the extra parts, using a regular expression or some other way?

Solution

If the substring to remove is always the same, you can skip regex and use replace(), like this:

slug = a['href'].replace('https://medium.com/free-code-camp','')

Example

import requests
from bs4 import BeautifulSoup

r = requests.get('https://davidventuri.medium.com/')
soup = BeautifulSoup(r.text,'html.parser')

all_slugs = soup.find_all('a',{'class': 'dn br'})

for a in all_slugs:
    slug = a['href'].replace('https://medium.com/free-code-camp','')
    print(slug)

Output

/this-is-not-a-real-data-science-degree-d170c660c1cf
/not-a-real-degree-data-science-curriculum-2021-19ba9af2c1d4
/bitcoin-learning-path-9ed73f2f11d9
/your-first-day-of-school-eaf363b19ded
/an-overview-of-every-data-visualization-course-on-the-internet-9ccf24ea9c9b
/the-best-data-science-courses-on-the-internet-ranked-by-your-reviews-6dc5b910ea40
/every-single-machine-learning-course-on-the-internet-ranked-by-your-reviews-3c4a7b8026c0
/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0
/how-ai-is-revolutionizing-mental-health-care-a7cec436a1ce
/i-ranked-all-the-best-data-science-intro-courses-based-on-thousands-of-data-points-db5dc7e3eb8e

Edit

You can also use split(), which keeps only the last path segment. Note that the result has no leading slash:

slug = a['href'].split('/')[-1]
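
If the links might come from more than one publication, or carry a query string or trailing slash, a slightly more defensive variant could parse the URL first. The following is only a sketch under those assumptions; the helper name slug_from_href is made up for illustration and is not part of bs4 or requests. It uses urlparse from the standard library and puts the leading slash back so the result matches the format asked for in the question.

```python
from urllib.parse import urlparse


def slug_from_href(href):
    """Return the last path segment of href, prefixed with a slash.

    Hypothetical helper for illustration; works for both relative
    hrefs ('/some-slug') and absolute URLs.
    """
    # urlparse separates the path from the scheme, host, query and fragment
    path = urlparse(href).path
    # rstrip guards against a trailing slash; rsplit keeps only the last segment
    last_segment = path.rstrip('/').rsplit('/', 1)[-1]
    return '/' + last_segment


if __name__ == '__main__':
    # sample hrefs taken from the question, no network access needed
    samples = [
        '/bitcoin-learning-path-9ed73f2f11d9',
        'https://medium.com/free-code-camp/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0',
    ]
    for href in samples:
        print(slug_from_href(href))
```

Inside the scraping loop this would be called as slug = slug_from_href(a['href']). Unlike replace(), it does not depend on the prefix being exactly 'https://medium.com/free-code-camp'.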
