Why does urllib.parse fail to split URLs of the form <scheme>:<number> correctly in all cases?

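If I pass in a URL of the form <scheme>:<integer>, neither of the two functions splits off the scheme correctly, and whether it fails depends on the scheme used. If I change <integer> by adding a non-digit character, it works as expected. (I am using Python 3.8.8.)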

>>> from urllib.parse import urlparse
>>> urlparse("custom:12345")  # does not work
ParseResult(scheme='', netloc='', path='custom:12345', params='', query='', fragment='')
>>> urlparse("zip:12345")  # does not work
ParseResult(scheme='', netloc='', path='zip:12345', params='', query='', fragment='')
>>> urlparse("custom:12345d")  # this works as expected
ParseResult(scheme='custom', netloc='', path='12345d', params='', query='', fragment='')
>>> urlparse("custom:12345.")  # so does this
ParseResult(scheme='custom', netloc='', path='12345.', params='', query='', fragment='')
>>> urlparse("http:12345")  # for some reason this works (!?)
ParseResult(scheme='http', netloc='', path='12345', params='', query='', fragment='')
>>> urlparse("https:12345")  # yet this does not
ParseResult(scheme='', netloc='', path='https:12345', params='', query='', fragment='')
>>> urlparse("ftp:12345")  # no luck here either
ParseResult(scheme='', netloc='', path='ftp:12345', params='', query='', fragment='')

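According to Wikipedia, a URI requires a scheme. An empty scheme should correspond to a URI reference, which should treat <scheme>:<number> as a schemeless (relative) path containing a colon only if it is preceded by ./.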

So why does this break in the way shown above? I would expect every case above to split the URI/URL into <scheme>:<number>, with <number> as the path.

Solution

You see different results when the path contains a non-digit character because of this section of the urllib.parse source:

# make sure "url" is not actually a port number (in which case
# "scheme" is really part of the path)
rest = url[i+1:]
if not rest or any(c not in '0123456789' for c in rest):
    # not a port number
    scheme, url = url[:i].lower(), rest

In Python 3.8, if the input has the form "<stuff>:<numbers>", the numbers are assumed to be a port. In that case stuff is not treated as a scheme, and the whole string ends up in the path.
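
To make that decision easier to follow, here is a minimal standalone sketch of the logic, assuming the check quoted above plus the separate 'http' shortcut that 3.8 applied before it. The name split_scheme_py38_style is hypothetical and is not part of urllib.parse; the real function also parses netloc, query and fragment, which this sketch leaves out.

```python
import string

# Characters accepted in a scheme; assumed to mirror the set urllib.parse checks.
SCHEME_CHARS = string.ascii_letters + string.digits + "+-."


def split_scheme_py38_style(url):
    """Return (scheme, rest), imitating the 3.8 port-number check.

    Hypothetical helper for illustration only.
    """
    i = url.find(":")
    if i <= 0:
        # no colon, or nothing before it: no scheme
        return "", url
    if url[:i] == "http":
        # 3.8 handled "http" in a shortcut that never reached the port check
        return "http", url[i + 1:]
    if any(c not in SCHEME_CHARS for c in url[:i]):
        # the part before the colon cannot be a scheme
        return "", url
    rest = url[i + 1:]
    if not rest or any(c not in "0123456789" for c in rest):
        # not a port number, so the prefix is accepted as a scheme
        return url[:i].lower(), rest
    # only digits after the colon: assumed to be a port, so no scheme is split off
    return "", url


for example in ("custom:12345", "custom:12345d", "http:12345", "https:12345"):
    print(example, "->", split_scheme_py38_style(example))
```

Under these assumptions, "custom:12345" and "https:12345" come back with an empty scheme, while "custom:12345d" and "http:12345" are split, which matches the session in the question.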

This was reported as a bug and (after quite a bit of back and forth!) fixed in Python 3.9; the section above was simply rewritten as:

scheme, url = url[:i].lower(), url[i+1:]

(and some special casing for url[:i] == 'http' was removed).
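
If you are stuck on 3.8 and need the same result on every version, one possible workaround is to split the scheme yourself before handing the rest to urllib.parse. The sketch below assumes the RFC 3986 scheme grammar (a letter followed by letters, digits, "+", "-" or "."); the helper name split_scheme and the regular expression are my own, not something urllib.parse provides.

```python
import re

# RFC 3986 scheme: a letter, then any mix of letters, digits, "+", "-" and "."
_SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):(.*)$", re.DOTALL)


def split_scheme(url):
    """Return (scheme, rest) without guessing whether rest is a port.

    Hypothetical helper; returns ("", url) when no scheme is present.
    """
    match = _SCHEME_RE.match(url)
    if match is None:
        return "", url
    # schemes are case-insensitive, so normalize to lower case
    return match.group(1).lower(), match.group(2)


print(split_scheme("custom:12345"))    # ('custom', '12345')
print(split_scheme("./custom:12345"))  # ('', './custom:12345')
```

Keep in mind the trade-off: this helper also reports "localhost" as the scheme of a bare host:port string such as "localhost:8080", so it only fits if your inputs are known to carry a scheme or to be relative references.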