Random 3-4 character strings in an HTTP response in Python


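I'm trying to make a request using the socket module in Python. It successfully sends the request, gets the response, and decodes it. When I look at the HTML document, everything is correct except that there are random strings, 3-4 characters long, scattered through it. I think my code is correct, but I'm not 100% sure. Here is my code: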

import select
import socket
import ssl

from bs4 import BeautifulSoup

newline = "\r\n"  # assumed definition: the HTTP line terminator

def recive_data(get, timeout):
  # Wait up to `timeout` seconds for the socket to become readable.
  ready = select.select([get], [], [], timeout)
  if ready[0]:
    return get.recv(4096)
  return b""

def get_file(website, port, file, https=False):
  data = []
  new_data = ""

  if https:
    get = ssl.create_default_context().wrap_socket(socket.socket(socket.AF_INET, socket.SOCK_STREAM), server_hostname=website)
  else:
    get = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  get.connect((website, port))
  get.sendall(f"GET {file} HTTP/1.1\r\nHost: {website}:{port}\r\n\r\n".encode())
  # Keep reading until nothing arrives within the timeout.
  while True:
    new_data = recive_data(get, 5).decode()
    if new_data != "" and new_data != None:
      data.append(new_data)
      new_data = ""
    else:
      break

  data = "".join(data)
  # Headers end at the first blank line; the body runs up to the final "0" line.
  header = data[0:data.find(newline+newline)]
  data = data[data.find(newline+newline):data.rfind(f"{newline}0{newline}{newline}")]

  data = BeautifulSoup(data, 'html.parser').prettify()

  get.close()
  return (header, data)

If I pass in https://stackoverflow.com, it outputs:

30d
<!DOCTYPE html>
<html class="html__responsive html__unpinned-leftnav">
 <head>
  <title>
   Stack Overflow - Where Developers Learn, Share, &amp; Build Careers
  </title>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196" rel="shortcut icon"/>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" rel="apple-touch-icon"/>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" rel="image_src"/>
  <link href="/opensearch.xml" rel="search" title="Stack Overflow" type="application/opensearchdescription+xml"/>
  <meta content="Stack Overflow is the largest, most trusted online communi
20d0
ty for developers to learn, share their programming knowledge, and build their careers." name="description"/>
  <meta content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0" name="viewport"/>
  <meta content="website" property="og:type">

And so on... However, some websites have more of these strings than others, and I can't figure out why. Any help is greatly appreciated!

Solution

The last line of the response headers gives you a clue:

HTTP/1.1 200 OK
Connection: keep-alive
cache-control: private
...
transfer-encoding: chunked

The transfer-encoding header means that what follows the headers is not plain HTML. From the spec:

   The chunked encoding modifies the body of a message in order to
   transfer it as a series of chunks, each with its own size indicator,
   followed by an OPTIONAL trailer containing entity-header fields
...
   The chunk-size field is a string of hex digits indicating the size of
   the chunk. The chunked encoding is ended by any chunk whose size is
   zero, followed by the trailer, which is terminated by an empty line.

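In other words, what you're seeing is a hexadecimal number giving the number of bytes in the next chunk. There may be more than one chunk. You need to check for that HTTP header and, if it is present, find all the chunks and join them together before parsing the page as HTML.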
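
As a rough sketch of that last step, the helpers below check the header block for chunked encoding and, if it is present, strip the size lines and join the chunks. The names `is_chunked`, `decode_chunked`, and `split_response` are made up for this illustration and are not part of any library. The sketch assumes the whole response has already been received, and that it is still raw bytes rather than a decoded string: chunk sizes count bytes, so decoding first can throw the offsets off when the page contains non-ASCII characters. Chunk extensions and trailers are ignored.

```python
def is_chunked(header: bytes) -> bool:
    """Return True if the header block declares Transfer-Encoding: chunked."""
    for line in header.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"transfer-encoding":
            return b"chunked" in value.lower()
    return False


def decode_chunked(body: bytes) -> bytes:
    """Join the chunks of a chunked body, dropping the hex size lines."""
    decoded = bytearray()
    pos = 0
    while True:
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("truncated chunk-size line")
        # The size may be followed by ";extension", which is ignored here.
        size_text = body[pos:line_end].split(b";", 1)[0].decode("ascii").strip()
        size = int(size_text, 16)
        if size == 0:
            break  # last chunk; any trailer after it is ignored
        start = line_end + 2
        chunk = body[start:start + size]
        if len(chunk) < size:
            raise ValueError("truncated chunk data")
        decoded += chunk
        pos = start + size + 2  # skip the CRLF that follows the chunk data
    return bytes(decoded)


def split_response(raw: bytes):
    """Split a raw HTTP response into (header, body), de-chunking if needed."""
    header, _, body = raw.partition(b"\r\n\r\n")
    if is_chunked(header):
        body = decode_chunked(body)
    return header, body


# A small hand-written response with two chunks of 5 and 7 bytes.
raw = (b"HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n"
       b"5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n")
header, body = split_response(raw)
print(body.decode())  # Hello, world
```

In the question's code, this would mean collecting the `recv` results as bytes, joining them with `b"".join(...)`, and calling `.decode()` only on the de-chunked body before handing it to BeautifulSoup. For scale, the `30d` in the output above is the size line of the first chunk: `int("30d", 16)` is 781 bytes.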