Random 3-4 character strings in an HTTP response in Python

How to fix random 3-4 character strings in an HTTP response in Python

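I am trying to make a request using the socket module in Python. It successfully sends the request, gets the response, and decodes it. When I look at the HTML document, everything is correct except that random strings 3-4 characters long appear inside it. I think my code is correct, but I'm not 100% sure. Here is my code: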

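import select
import socket
import ssl

from bs4 import BeautifulSoup

# Assumption: `newline` is not defined in the posted snippet; the search for
# "<newline>0<newline><newline>" below implies the HTTP line terminator.
newline = "\r\n"

def recive_data(get, timeout):
  # Wait up to `timeout` seconds for the socket to become readable.
  # select.select takes three lists (read, write, exceptional) plus the timeout.
  ready = select.select([get], [], [], timeout)
  if ready[0]:
    return get.recv(4096)
  return b""

def get_file(website, port, file, https=False):
  data = []
  new_data = ""

  if https:
    get = ssl.create_default_context().wrap_socket(socket.socket(socket.AF_INET, socket.SOCK_STREAM), server_hostname=website)
  else:
    get = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  get.connect((website, port))
  get.sendall(f"GET {file} HTTP/1.1\r\nHost: {website}:{port}\r\n\r\n".encode())
  # Read until no more data arrives within the timeout
  while True:
    new_data = recive_data(get, 5).decode()
    if new_data != "" and new_data != None:
      data.append(new_data)
      new_data = ""
    else:
      break

  data = "".join(data)
  # Headers end at the first blank line; the body is cut at the final zero-size chunk marker
  header = data[0:data.find(newline+newline)]
  data = data[data.find(newline+newline):data.rfind(f"{newline}0{newline}{newline}")]

  data = BeautifulSoup(data, 'html.parser').prettify()

  get.close()
  return (header, data)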

If I enter https://stackoverflow.com, it outputs

30d
<!DOCTYPE html>
<html class="html__responsive html__unpinned-leftnav">
 <head>
  <title>
   Stack Overflow - Where Developers Learn, Share, &amp; Build Careers
  </title>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196" rel="shortcut icon"/>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" rel="apple-touch-icon"/>
  <link href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" rel="image_src"/>
  <link href="/opensearch.xml" rel="search" title="Stack Overflow" type="application/opensearchdescription+xml"/>
  <meta content="Stack Overflow is the largest, most trusted online communi
20d0
ty for developers to learn, share their programming knowledge, and build their careers." name="description"/>
  <meta content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0" name="viewport"/>
  <meta content="website" property="og:type">

And so on... However, some websites have more of these strings than others, and I can't figure out why. Any help is greatly appreciated!

Solution

The last line of the headers in the response gives you a clue:

HTTP/1.1 200 OK
Connection: keep-alive
cache-control: private
...
transfer-encoding: chunked

The transfer-encoding header means that what follows the headers is not plain HTML. From the spec:

   The chunked encoding modifies the body of a message in order to
   transfer it as a series of chunks, each with its own size indicator,
   followed by an OPTIONAL trailer containing entity-header fields.
...
   The chunk-size field is a string of hex digits indicating the size of
   the chunk. The chunked encoding is ended by any chunk whose size is
   zero, followed by the trailer, which is terminated by an empty line.
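
To make the quoted rule concrete, here is a small sketch of how a chunk-size line could be read. The function name parse_chunk_size is hypothetical (it is not part of the question's code), and the inputs are the two stray strings from the output shown in the question.

```python
def parse_chunk_size(line: bytes) -> int:
    """Return the size announced by a chunk-size line such as b"30d\r\n"."""
    # Optional chunk extensions follow a ";" and are ignored in this sketch.
    size_field = line.split(b";", 1)[0].strip()
    # The size is written as hexadecimal digits.
    return int(size_field.decode("ascii"), 16)


print(parse_chunk_size(b"30d\r\n"))   # 781
print(parse_chunk_size(b"20d0\r\n"))  # 8400
print(parse_chunk_size(b"0\r\n"))     # 0, which marks the last chunk
```

So "30d" announces a chunk of 781 bytes and "20d0" one of 8400 bytes.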

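In other words, what you are seeing is a hexadecimal number giving the number of bytes in the next chunk. There may be more than one chunk. You need to check for that HTTP header and, if it is present, find all the chunks and join them together before parsing the page as HTML.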
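
A minimal sketch of that check-and-join step might look like the following. The helper names is_chunked and decode_chunked and the sample response are illustrative assumptions, not part of the original code. The sketch works on bytes rather than decoded text because chunk sizes count bytes, so with this approach the response would be decoded only after the chunks are joined. Trailer fields are ignored.

```python
def is_chunked(header_block: bytes) -> bool:
    """Check a raw header block for 'Transfer-Encoding: chunked'."""
    for line in header_block.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"transfer-encoding":
            return b"chunked" in value.lower()
    return False


def decode_chunked(body: bytes) -> bytes:
    """Join the chunks of a complete chunked body into the original payload."""
    payload = bytearray()
    pos = 0
    while True:
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("incomplete chunk-size line")
        # Hex size, optionally followed by ";extensions" (ignored here).
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        if size == 0:
            break  # last chunk; any trailer fields are ignored
        start = line_end + 2
        if start + size > len(body):
            raise ValueError("truncated chunk")
        payload += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that ends the chunk data
    return bytes(payload)


# Made-up response used only to exercise the helpers.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"transfer-encoding: chunked\r\n"
       b"\r\n"
       b"5\r\nHello\r\n"
       b"7\r\n, world\r\n"
       b"0\r\n\r\n")
header_block, _, body = raw.partition(b"\r\n\r\n")
if is_chunked(header_block):
    body = decode_chunked(body)
print(body.decode())  # Hello, world
```

In the question's get_file, this would mean collecting the raw bytes from recv, splitting off the headers, joining the chunks, and only then calling decode() and passing the result to BeautifulSoup.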
