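Converting HTML to Plain Text in Python

This article walks through how to convert HTML to plain text in Python, shared here for reference. The details are as follows:

A project of mine needed HTML converted to plain text today. A quick search online showed that Python offers many different ways to do this.

Here are the two methods I tried myself today, for the benefit of anyone who comes after. The first one uses nltk: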

1. Install nltk; it is available from PyPI.

(Note: it depends on the following packages: numpy, PyYAML)
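Before running the test code, it can help to confirm that the packages listed above are importable. The following is a minimal sketch, assuming Python 3; the helper name `missing_packages` and the `REQUIRED` mapping are made up for illustration. The one non-obvious detail is that PyYAML is imported under the module name `yaml`.

```python
import importlib.util

# Package name -> import name. PyYAML installs a module called "yaml".
REQUIRED = {"nltk": "nltk", "numpy": "numpy", "PyYAML": "yaml"}


def missing_packages(required=REQUIRED):
    """Return the package names whose top-level module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


# Prints an empty list when everything is installed.
print(missing_packages())
```

Any name that shows up in the printed list still needs to be installed.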

2. Test code (run under Python 2 with NLTK 2.x; in NLTK 3, nltk.clean_html raises NotImplementedError):

>>> import nltk
>>> aa = r'''
<html>
    <body>
 <b>Project:</b> DeHTML<br>
 <b>Description</b>:<br>
 This small script is intended to allow conversion from HTML markup to
 plain text.
    </body>
</html>
'''
>>> aa
'\n<html>\n            <body>\n                <b>Project:</b> DeHTML<br>\n                <b>Description</b>:<br>\n                This small script is intended to allow conversion from HTML markup to \n                plain text.\n            </body>\n        </html>\n        '
>>> print nltk.clean_html(aa)
Project: DeHTML
     Description :
    This small script is intended to allow conversion from HTML markup to
    plain text.
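If the same call has to run where nltk is not installed, or where `clean_html` is unavailable, a fallback can be sketched with the standard library alone. The following is a minimal Python 3 sketch and not part of the original test; `clean_html_compat`, `strip_tags_stdlib` and `_TextCollector` are hypothetical names chosen for illustration. The fallback only drops the tags and keeps the text nodes as they are, so its whitespace will differ from the nltk output shown above.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect every text node and ignore all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_stdlib(html):
    """Standard-library fallback: return the concatenated text nodes."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


def clean_html_compat(html):
    """Try nltk.clean_html first; fall back when it cannot be used."""
    try:
        import nltk
        return nltk.clean_html(html)
    except (ImportError, AttributeError, NotImplementedError):
        # nltk is missing, or this nltk version no longer provides clean_html.
        return strip_tags_stdlib(html)


sample = "<html><body><b>Project:</b> DeHTML<br></body></html>"
cleaned = clean_html_compat(sample)
print(cleaned)
```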

Method 2:

If nltk feels too heavyweight for such a small job, you can write the code yourself. The code is as follows:


try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        # Keep non-empty text nodes, collapsing runs of whitespace.
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        # <p> starts a new paragraph; <br> starts a new line.
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <br/>.
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()

def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        # On a parse error, print the traceback and return the input unchanged.
        print_exc(file=stderr)
        return text

def main():
    text = r'''
    <html>
        <body>
            <b>Project:</b> DeHTML<br>
            <b>Description</b>:<br>
            This small script is intended to allow conversion from HTML markup to
            plain text.
        </body>
    </html>
    '''
    print(dehtml(text))

if __name__ == '__main__':
    main()

Output:

>>> ================================ RESTART ================================
>>>
Project: DeHTML
Description :
This small script is intended to allow conversion from HTML markup to plain text.
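The space before the colon in "Description :" appears because handle_data appends a space after every text node, so "Description" and ":" end up separated. If that matters, the result can be cleaned up afterwards. The following is a minimal sketch of one way to do it, not part of the original script; the helper name `tidy` and the set of punctuation marks it handles are assumptions made for illustration.

```python
import re


def tidy(text):
    """Remove spaces left before punctuation and at the end of each line."""
    # "Description :" -> "Description:"
    text = re.sub(r" +([:;,.!?])", r"\1", text)
    # Drop trailing spaces line by line, keeping the line breaks.
    return "\n".join(line.rstrip() for line in text.split("\n"))


raw = (
    "Project: DeHTML \n"
    "Description : \n"
    "This small script is intended to allow conversion from HTML markup to plain text."
)
print(tidy(raw))
```

In the script above, this would be applied as `tidy(dehtml(text))`.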

I hope this article is helpful for your Python programming.
