How to remove unwanted URL prefixes from hrefs scraped with bs4
I need the slugs of all the articles on a page. I use bs4 to get the href of every article link, but some of the articles link to another URL that I don't need. I want to strip those parts out. I used this code:
import requests
import re
from bs4 import BeautifulSoup

r = requests.get('https://davidventuri.medium.com/')
soup = BeautifulSoup(r.text, 'html.parser')
all_slugs = soup.find_all('a', {'class': 'dn br'})
for i in range(len(all_slugs)):
    slug = all_slugs[i]['href']
    print(slug)
These are the hrefs I get:
/this-is-not-a-real-data-science-degree-d170c660c1cf
/not-a-real-degree-data-science-curriculum-2021-19ba9af2c1d4
/bitcoin-learning-path-9ed73f2f11d9
/your-first-day-of-school-eaf363b19ded
https://medium.com/free-code-camp/an-overview-of-every-data-visualization-course-on-the-internet-9ccf24ea9c9b
https://medium.com/free-code-camp/the-best-data-science-courses-on-the-internet-ranked-by-your-reviews-6dc5b910ea40
https://medium.com/free-code-camp/every-single-machine-learning-course-on-the-internet-ranked-by-your-reviews-3c4a7b8026c0
https://medium.com/free-code-camp/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0
/how-ai-is-revolutionizing-mental-health-care-a7cec436a1ce
https://medium.com/free-code-camp/i-ranked-all-the-best-data-science-intro-courses-based-on-thousands-of-data-points-db5dc7e3eb8e
What I actually want is:
/this-is-not-a-real-data-science-degree-d170c660c1cf
/not-a-real-degree-data-science-curriculum-2021-19ba9af2c1d4
/bitcoin-learning-path-9ed73f2f11d9
/your-first-day-of-school-eaf363b19ded
/an-overview-of-every-data-visualization-course-on-the-internet-9ccf24ea9c9b
/the-best-data-science-courses-on-the-internet-ranked-by-your-reviews-6dc5b910ea40
/every-single-machine-learning-course-on-the-internet-ranked-by-your-reviews-3c4a7b8026c0
/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0
/how-ai-is-revolutionizing-mental-health-care-a7cec436a1ce
/i-ranked-all-the-best-data-science-intro-courses-based-on-thousands-of-data-points-db5dc7e3eb8e
How can I strip them out, with a regex or some other way?
Solution
If the substring is always the same, you can simply use replace() without needing a regex, like so:
slug = a['href'].replace('https://medium.com/free-code-camp','')
Example
import requests
from bs4 import BeautifulSoup

r = requests.get('https://davidventuri.medium.com/')
soup = BeautifulSoup(r.text, 'html.parser')
all_slugs = soup.find_all('a', {'class': 'dn br'})
for a in all_slugs:
    slug = a['href'].replace('https://medium.com/free-code-camp', '')
    print(slug)
Output

/this-is-not-a-real-data-science-degree-d170c660c1cf
/not-a-real-degree-data-science-curriculum-2021-19ba9af2c1d4
/bitcoin-learning-path-9ed73f2f11d9
/your-first-day-of-school-eaf363b19ded
/an-overview-of-every-data-visualization-course-on-the-internet-9ccf24ea9c9b
/the-best-data-science-courses-on-the-internet-ranked-by-your-reviews-6dc5b910ea40
/every-single-machine-learning-course-on-the-internet-ranked-by-your-reviews-3c4a7b8026c0
/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0
/how-ai-is-revolutionizing-mental-health-care-a7cec436a1ce
/i-ranked-all-the-best-data-science-intro-courses-based-on-thousands-of-data-points-db5dc7e3eb8e
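Since the question asks about a regex, the same stripping can also be done with re.sub from the standard library. A minimal sketch (the prefix pattern and helper name are assumptions based on the URLs above, not part of the original answer):

```python
import re

# Hypothetical helper: strip an absolute Medium publication prefix,
# leaving relative hrefs untouched.
def strip_prefix(href):
    return re.sub(r'^https?://medium\.com/free-code-camp', '', href)

print(strip_prefix('https://medium.com/free-code-camp/bitcoin-learning-path-9ed73f2f11d9'))
# Relative hrefs don't match the pattern, so they pass through unchanged.
print(strip_prefix('/your-first-day-of-school-eaf363b19ded'))
```

The `^` anchor makes sure only a leading prefix is removed, never an occurrence in the middle of a slug.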
Edit: you can also use split() to take the last path segment (note this drops the leading slash):

slug = a['href'].split('/')[-1]
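If the absolute links might point at publications other than free-code-camp, hard-coding the prefix breaks. A more general sketch (using only the standard library's urllib.parse; the helper name is an assumption, not from the original answer) keeps just the final path segment of any href:

```python
from urllib.parse import urlparse

def to_slug(href):
    # urlparse handles both relative hrefs ('/foo-123') and absolute
    # URLs ('https://medium.com/pub/foo-123'); keep the last segment.
    path = urlparse(href).path
    return '/' + path.rstrip('/').split('/')[-1]

print(to_slug('/bitcoin-learning-path-9ed73f2f11d9'))
print(to_slug('https://medium.com/free-code-camp/dive-into-deep-learning-with-these-23-online-courses-bf247d289cc0'))
```

Unlike the replace() approach, this also normalizes trailing slashes and preserves the leading '/' shown in the desired output.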