How to get the next blob from the list_blobs method of the Python Google Cloud Storage library

It seems to me that google.cloud.storage.Client::list_blobs returns an HTTPIterator, which is not a proper Python iterator. See below:

import google.cloud.storage as gcs

client = gcs.Client()

blobs = client.list_blobs("mybucket")
blob = next(blobs)  # TypeError: 'HTTPIterator' object is not an iterator

blob = blobs.__next__()  # AttributeError: 'HTTPIterator' object has no attribute '__next__'

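I am looking for a solution that does not iterate over the whole iterator. The only solution I could come up with is a silly hack: a for loop that breaks after the first iteration.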

Solution

Without going into the details of the page iterator, you can simply convert the iterator to a list:

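bucketName = "mybucket"  # bucket name taken from the question
blobs = client.list_blobs(bucketName)
blob_list = list(blobs)  # this fetches every object in the bucket

# First blob
print(blob_list[0].name)

# Second blob
print(blob_list[1].name)

# Of course you can check the number of list items with len()
count = len(blob_list)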

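In practice, it is important to understand that list_blobs() does not fetch everything at once. Typically the library fetches 1,000 objects at a time. This is called pagination. For a bucket with 1,500 objects, iterating fetches two pages of objects (1,000 objects and then 500 objects). However, fewer than 1,000 objects may be returned.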

blobs = client.list_blobs(bucketName)
for page in blobs.pages:
    print('Page number: ', blobs.page_number)
    print('Count:       ', page.num_items)

Output:

Page number:  1
Count:        1000
Page number:  2
Count:        500
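
The page arithmetic can be sketched without touching a real bucket. The helper below is hypothetical (page_sizes is not part of the library) and assumes a fixed limit of 1,000 objects per page, with every page except the last filled to that limit. A real listing may return smaller pages.

```python
def page_sizes(total_objects, page_limit=1000):
    """Return the number of items on each page for a listing of total_objects.

    Hypothetical helper for illustration only: it assumes every page except
    the last one is filled up to page_limit.
    """
    if total_objects < 0 or page_limit <= 0:
        raise ValueError("total_objects must be >= 0 and page_limit must be > 0")
    full_pages, remainder = divmod(total_objects, page_limit)
    sizes = [page_limit] * full_pages
    if remainder:
        # The last page holds whatever is left over.
        sizes.append(remainder)
    return sizes


print(page_sizes(1500))  # [1000, 500]
```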

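When you convert the page iterator to a list, all objects are fetched. For a large bucket, this can take a long time just to show the first and the next object.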
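
To get only the first object, or the first few, without materializing the whole listing, one option is to call iter() on the result and pull items with next() or itertools.islice. The sketch below assumes only that the object returned by list_blobs is iterable, which the for-loop workaround in the question already relies on. FakePagedListing, first_item and first_n are hypothetical names used for illustration. The fake class stands in for the real listing so the example runs without credentials, and it counts how many pages were requested.

```python
from itertools import islice


class FakePagedListing:
    """Stand-in for a paged listing: iterable, but it has no __next__ method."""

    def __init__(self, names, page_size=1000):
        self._names = list(names)
        self._page_size = page_size
        self.pages_fetched = 0  # how many simulated requests were made

    def _fetch_page(self, start):
        # Simulates one API request returning at most page_size items.
        self.pages_fetched += 1
        return self._names[start:start + self._page_size]

    def __iter__(self):
        # Generator: a page is fetched only when the consumer needs it.
        start = 0
        while start < len(self._names):
            page = self._fetch_page(start)
            yield from page
            start += len(page)


def first_item(iterable, default=None):
    """Return the first item of any iterable, or default if it is empty."""
    return next(iter(iterable), default)


def first_n(iterable, n):
    """Return up to n leading items without consuming the rest."""
    return list(islice(iter(iterable), n))


listing = FakePagedListing(f"object-{i}" for i in range(1500))
print(first_item(listing))    # object-0
print(listing.pages_fetched)  # 1

# To walk forward item by item, keep the iterator returned by iter().
it = iter(FakePagedListing(f"object-{i}" for i in range(1500)))
first, second = next(it), next(it)
print(first, second)          # object-0 object-1
```

With the real client this would be called as first_item(client.list_blobs("mybucket")). As far as I know, the library's iterator can only be started once, so keep the object returned by iter() if you need further items. list_blobs also accepts a max_results argument that caps the number of objects returned.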

For a better understanding, study the source code of the page iterator.

Page Iterators