I am trying to open all links that contain pid, but I run into two scenarios:
> It opens all URLs (I mean even the junk URLs)

    def get_links(self):
        links = []
        host = urlparse( self.url ).hostname
        scheme = urlparse( self.url ).scheme
        domain_link = scheme+'://'+host
        pattern = re.compile(r'(/pid/)')

        for a in self.soup.find_all(href=True):
            href = a['href']
            if not href or len(href) <= 1:
                continue
            elif 'javascript:' in href.lower():
                continue
            elif 'forgotpassword' in href.lower():
                continue
            elif 'images' in href.lower():
                continue
            elif 'seller-account' in href.lower():
                continue
            elif 'review' in href.lower():
                continue
            else:
                href = href.strip()
            if href[0] == '/':
                href = (domain_link + href).strip()
            elif href[:4] == 'http':
                href = href.strip()
            elif href[0] != '/' and href[:4] != 'http':
                href = ( domain_link + '/' + href ).strip()
            if '#' in href:
                indx = href.index('#')
                href = href[:indx].strip()
            if href in links:
                continue
            links.append(self.re_encode(href))

        return links

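> In this case it only opens URLs that contain pid, but it does not follow links and stays limited to the home page. It crashes after opening a few links with pid.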

    def get_links(self):
        links = []
        host = urlparse( self.url ).hostname
        scheme = urlparse( self.url ).scheme
        domain_link = scheme+'://'+host
        pattern = re.compile(r'(/pid/)')

        for a in self.soup.find_all(href=True):
            if pattern.search(a['href']) is not None:
                href = a['href']
                if not href or len(href) <= 1:
                    continue
                elif 'javascript:' in href.lower():
                    continue
                elif 'forgotpassword' in href.lower():
                    continue
                elif 'images' in href.lower():
                    continue
                elif 'seller-account' in href.lower():
                    continue
                elif 'review' in href.lower():
                    continue
                else:
                    href = href.strip()
                if href[0] == '/':
                    href = (domain_link + href).strip()
                elif href[:4] == 'http':
                    href = href.strip()
                elif href[0] != '/' and href[:4] != 'http':
                    href = ( domain_link + '/' + href ).strip()
                if '#' in href:
                    indx = href.index('#')
                    href = href[:indx].strip()
                if href in links:
                    continue
                links.append(self.re_encode(href))

        return links

Solution
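Maybe I am missing something, but why not put a plain if statement inside the for loop instead of using the regular expression? It would look like this: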

    def get_links(self):
        links = []
        host = urlparse( self.url ).hostname
        scheme = urlparse( self.url ).scheme
        domain_link = scheme+'://'+host

        for a in self.soup.find_all(href=True):
            href = a['href']
            if not href or len(href) <= 1:
                continue
            if href.lower().find("/pid/") != -1:
                if 'javascript:' in href.lower():
                    continue
                elif 'forgotpassword' in href.lower():
                    continue
                elif 'images' in href.lower():
                    continue
                elif 'seller-account' in href.lower():
                    continue
                elif 'review' in href.lower():
                    continue
                if href[0] == '/':
                    href = (domain_link + href).strip()
                elif href[:4] == 'http':
                    href = href.strip()
                elif href[0] != '/' and href[:4] != 'http':
                    href = ( domain_link + '/' + href ).strip()
                if '#' in href:
                    indx = href.index('#')
                    href = href[:indx].strip()
                if href in links:
                    continue
                links.append(self.re_encode(href))

        return links

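For comparison, here is a minimal standalone sketch of the same filtering idea that can be run without the crawler class. It is not the asker's code: the names `extract_pid_links` and `SKIP_WORDS` are made up for illustration, it takes a plain list of href strings instead of `self.soup`, and it leaves out `re_encode` because that method is not shown in the question. It also swaps the manual string concatenation for `urljoin` and `urldefrag` from the standard library, which is an assumption about what is acceptable here.

```python
from urllib.parse import urljoin, urldefrag

# Hypothetical names, not part of the original code.
SKIP_WORDS = ('javascript:', 'forgotpassword', 'images',
              'seller-account', 'review')


def extract_pid_links(base_url, hrefs):
    """Return absolute, fragment-free, de-duplicated URLs containing /pid/."""
    links = []
    for href in hrefs:
        href = (href or '').strip()
        if len(href) <= 1:
            continue
        lowered = href.lower()
        # Plain substring test instead of a compiled regular expression.
        if '/pid/' not in lowered:
            continue
        if any(word in lowered for word in SKIP_WORDS):
            continue
        # urljoin resolves relative hrefs against the page URL;
        # urldefrag drops everything from '#' onward.
        absolute = urldefrag(urljoin(base_url, href)).url
        if absolute not in links:
            links.append(absolute)
    return links
```

One difference to be aware of: `urljoin` resolves an href without a leading slash relative to the current page path, whereas the code above always prepends the domain root, so the two can produce different URLs for such links.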
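In addition, I removed the following line, because I believe that otherwise your code would never reach the lower part of the loop, since every branch would end in continue.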

    else:
        continue

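To illustrate why that line matters, here is a small sketch of the control flow. It assumes the `else: continue` closed the if/elif chain of skip checks, which is a guess, since the posted code does not show exactly where it sat. The function names and the sample data are invented for the demonstration.

```python
def with_trailing_else(items):
    reached = []
    for item in items:
        if not item:
            continue
        elif 'skip' in item:
            continue
        else:
            continue  # every remaining item ends up here
        reached.append(item)  # never executed
    return reached


def without_trailing_else(items):
    reached = []
    for item in items:
        if not item:
            continue
        elif 'skip' in item:
            continue
        reached.append(item)  # now reached for items that pass the checks
    return reached


sample = ['a', 'skip-me', '', 'b']
print(with_trailing_else(sample))
print(without_trailing_else(sample))
```

With the trailing `else`, each iteration hits a `continue` in one branch or another, so nothing below the chain runs and the list stays empty.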