Removing HTML tags from raw HTML code without Python imports

My course project requires me to extract plain text from HTML code without importing any libraries. What I have tried is below, but it is very slow on large HTML files.

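def cleanTags(inStr):
    # Remove one tag per iteration until no "<" and ">" remain.
    while "<" in inStr and ">" in inStr:
        a = inStr.find('<')
        b = inStr.find('>')
        # replace() rescans the whole string and builds a new copy on every pass
        inStr = inStr.replace(inStr[a:b+1], '')
        print("deleted")
    return inStr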

Solution

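In this case a regular expression is the best fit, although it does require importing the standard-library re module: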

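import re

def cleanTags(inStr):
    # Non-greedy match of everything from "<" up to the next ">"
    clean = re.compile(r'<.*?>')
    # Replace each tag with the empty string
    cleantxt = re.sub(clean, '', inStr)
    return cleantxt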
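
Since the question rules out imports and the regex version needs `import re`, an import-free alternative may be worth sketching. The loop in the question is slow largely because every `replace` call rescans the whole string and builds a new copy, so the work grows roughly quadratically with the number of tags. The version below is only a minimal sketch under simple assumptions: it scans the string once, collects the text between tags in a list, and joins it at the end. The name `clean_tags` is chosen here for illustration and is not part of the original post. It treats anything between `<` and the next `>` as a tag, so it does not handle comments or scripts containing `>` or decode entities such as `&amp;`.

```python
def clean_tags(in_str):
    """Strip <...> tags in a single left-to-right pass, with no imports."""
    parts = []  # text chunks found between tags
    pos = 0     # current scan position
    while True:
        lt = in_str.find('<', pos)
        if lt == -1:
            # No more tags: keep the remaining text and stop.
            parts.append(in_str[pos:])
            break
        gt = in_str.find('>', lt + 1)
        if gt == -1:
            # Unclosed "<": treat the rest as plain text.
            parts.append(in_str[pos:])
            break
        parts.append(in_str[pos:lt])  # text before the tag
        pos = gt + 1                  # resume right after the tag
    return ''.join(parts)


print(clean_tags('<p>Hello, <b>world</b>!</p>'))  # Hello, world!
```

Each character is visited a bounded number of times, so the running time should scale linearly with the input size rather than with the input size times the number of tags.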