Python 3: downloading the latest link found with Beautiful Soup

In my Python script I load a web page and parse it with Beautiful Soup. How can I download only the latest (most recent) file?

  <a href="BAGGEM0498L-15012021.zip">BAGGEM0498L-15012021.zip</a>       2021-01-19 06:56  3.6M
  <a href="BAGGEM0498L-15022021.zip">BAGGEM0498L-15022021.zip</a>       2021-02-15 21:57  3.6M
  <a href="BAGGEM0498L-15102020.zip">BAGGEM0498L-15102020.zip</a>       2020-10-24 03:19  3.6M
  <a href="BAGGEM0498L-15112020.zip">BAGGEM0498L-15112020.zip</a>       2020-11-15 15:02  3.6M
  <a href="BAGGEM0498L-15122020.zip">BAGGEM0498L-15122020.zip</a>       2020-12-15 13:48  3.6M

The actual URL of the page is https://extracten.bag.kadaster.nl/lvbag/extracten/Gemeente LVC/0498/

Solution

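If you use the file name to decide the order, you first need to extract the date and convert it to a datetime object. The date in each name is written as DDMMYYYY, so sorting the names as plain strings does not give chronological order. Build a list of file names together with their dates, then sort the list by date. For example: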

from bs4 import BeautifulSoup
from datetime import datetime

html = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /lvbag/extracten/Gemeente LVC/0498</title>
 </head>
 <body>
<h1>Index of /lvbag/extracten/Gemeente LVC/0498</h1>
<pre>      <a href="?C=N;O=D">Name</a>                           <a href="?C=M;O=A">Last modified</a>      <a href="?C=S;O=A">Size</a>  <hr>      <a href="/lvbag/extracten/Gemeente%20LVC/">Parent Directory</a>                                    -   
      <a href="BAGGEM0498L-15012021.zip">BAGGEM0498L-15012021.zip</a>       2021-01-19 06:56  3.6M  
      <a href="BAGGEM0498L-15022021.zip">BAGGEM0498L-15022021.zip</a>       2021-02-15 21:57  3.6M  
      <a href="BAGGEM0498L-15102020.zip">BAGGEM0498L-15102020.zip</a>       2020-10-24 03:19  3.6M  
      <a href="BAGGEM0498L-15112020.zip">BAGGEM0498L-15112020.zip</a>       2020-11-15 15:02  3.6M  
      <a href="BAGGEM0498L-15122020.zip">BAGGEM0498L-15122020.zip</a>       2020-12-15 13:48  3.6M  
<hr></pre>
</body></html>"""


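soup = BeautifulSoup(html, "html.parser")
files = []

for a in soup.find_all('a'):
    href = a['href']

    # Skip the column-sorting and parent-directory links
    if href.endswith('.zip'):
        # "BAGGEM0498L-15022021.zip" -> "15022021" -> datetime(2021, 2, 15)
        date = datetime.strptime(href.split('.')[0].split('-')[1], '%d%m%Y')
        files.append([date, href])

# Newest first
files.sort(key=lambda x: x[0], reverse=True)
print("Latest:", files[0][1])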

This gives you:

Latest: BAGGEM0498L-15022021.zip
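
As a more compact variant of the same idea, the sketch below picks the newest name with max() instead of sorting the whole list. The helper names date_from_name and latest_zip and the regular expression are hypothetical, chosen for illustration; the pattern assumes every file name ends in -DDMMYYYY.zip, as in the listing above. It also shows why comparing the names as plain strings picks the wrong file.

```python
import re
from datetime import datetime

# Assumption: names look like "BAGGEM0498L-15022021.zip" (date as DDMMYYYY)
DATE_IN_NAME = re.compile(r"-(\d{8})\.zip$")

def date_from_name(href):
    """Return the date embedded in the file name, or None if there is none."""
    match = DATE_IN_NAME.search(href)
    return datetime.strptime(match.group(1), "%d%m%Y") if match else None

def latest_zip(hrefs):
    """Return the href with the most recent embedded date, or None."""
    dated = [(date_from_name(h), h) for h in hrefs]
    dated = [(d, h) for d, h in dated if d is not None]
    return max(dated, key=lambda pair: pair[0])[1] if dated else None

hrefs = [
    "?C=N;O=D",  # column-sorting link, ignored because it has no date
    "BAGGEM0498L-15012021.zip",
    "BAGGEM0498L-15022021.zip",
    "BAGGEM0498L-15102020.zip",
    "BAGGEM0498L-15112020.zip",
    "BAGGEM0498L-15122020.zip",
]

print("By date:  ", latest_zip(hrefs))   # BAGGEM0498L-15022021.zip
print("By string:", max(hrefs[1:]))      # BAGGEM0498L-15122020.zip (wrong)
```

Because the day and month come before the year in the name, the string comparison ranks December 2020 above February 2021; parsing the date avoids that.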

The zip file can then be downloaded automatically as follows:

import requests
from bs4 import BeautifulSoup
from datetime import datetime

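url = "https://extracten.bag.kadaster.nl/lvbag/extracten/Gemeente%20LVC/0498/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
files = []

for a in soup.find_all('a'):
    href = a['href']

    if href.endswith('.zip'):
        date = datetime.strptime(href.split('.')[0].split('-')[1], '%d%m%Y')
        files.append([date, href])

# Newest first
files.sort(key=lambda x: x[0], reverse=True)
filename = files[0][1]
print("Latest:", filename)

# Download the zip file; fetch it first so a failed request leaves no empty file
r_zip = requests.get(f'{url}{filename}')
r_zip.raise_for_status()

with open(filename, 'wb') as f_zip:
    f_zip.write(r_zip.content)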
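
The download above builds the file URL with f'{url}{filename}', which works because the href is a bare file name and url ends with a slash. As an optional alternative, the standard library's urllib.parse.urljoin resolves an href against the page URL the way a browser would, so it also copes with hrefs that start with a slash. This is a minimal sketch; the variable name file_url is introduced here for illustration.

```python
from urllib.parse import urljoin

url = "https://extracten.bag.kadaster.nl/lvbag/extracten/Gemeente%20LVC/0498/"
filename = "BAGGEM0498L-15022021.zip"

# A relative href is resolved against the directory of the page URL
file_url = urljoin(url, filename)
print(file_url)

# An href starting with "/" is resolved against the host instead
parent_url = urljoin(url, "/lvbag/extracten/Gemeente%20LVC/")
print(parent_url)
```

Note that urljoin still relies on the trailing slash in url: without it, the last path segment (0498) would be treated as a file name and replaced.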