How to iterate through a text file in Python to strip tags and normalize the text
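I am trying to iterate through a text file to remove tags, punctuation, and stopwords. I scraped the text (newspaper articles) from a website using Beautiful Soup with Python 3.8.3. It returns a list, which I saved to a file. However, I am not sure how to use a loop to process every item in the text file.
In the code below I used listfilereduced.txt (which contains data from a single news article), but I would like to run the same code on listfile.txt (which contains data from multiple articles). Any help would be greatly appreciated!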
#This text file contains just one news item
with open('listfilereduced.txt','r',encoding='utf8') as my_file:
    rawData = my_file.read()
print(rawData)
#Separating body text from other data
articleStart = rawData.find("<div class=\"story-element story-element-text\">")
articleData = rawData[:articleStart]
articleBody = rawData[articleStart:]
print(articleData)
print("*******")
print(articleBody)
print("*******")
#First,I define a function to strip tags from the body text
def stripTags(pageContents):
    insideTag = 0
    text = ''
    for char in pageContents:
        if char == '<':
            insideTag = 1
        elif (insideTag == 1 and char == '>'):
            insideTag = 0
        elif insideTag == 1:
            continue
        else:
            text += char
    return text
#Calling the function
articleBodyText = stripTags(articleBody)
print(articleBodyText)
#Isolating article title and publication date
TitleEndLoc = articleData.find("</h1>")
dateStartLoc = articleData.find("<div class=\"storyPageMetaData-m__publish-time__19bdV\">")
dateEndLoc=articleData.find("<div class=\"Meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg\">")
titleString = articleData[:TitleEndLoc]
dateString = articleData[dateStartLoc:dateEndLoc]
#Call stripTags function to clean
articleTitle= stripTags(titleString)
articleDate = stripTags(dateString)
print(articleTitle)
print(articleDate)
#Cleaning the date a bit more
startLocDate = articleDate.find(":")
endLocDate = articleDate.find(",")
articleDateClean = articleDate[startLocDate+2:endLocDate]
print(articleDateClean)
#save all this data to a dictionary that stores the title, date and body text
PAloTextDict = {"Title": articleTitle,"Date": articleDateClean,"Text": articleBodyText}
print(PAloTextDict)
#normalize text by:
#1. Splitting paragraphs of text into lists of words
articleBodyWordList = articleBodyText.split()
print(articleBodyWordList)
#2.Removing punctuation and stopwords
#https://bnlp.readthedocs.io/en/latest/
from bnlp.corpus import stopwords,punctuations
#A. Remove punctuation first
listnopunct = []
for word in articleBodyWordList:
    for mark in punctuations:
        word=word.replace(mark,'')
    listnopunct.append(word)
print(listnopunct)
#B. removing stopwords
banglastopwords = stopwords()
print(banglastopwords)
cleanList=[]
for word in listnopunct:
    if word in banglastopwords:
        continue
    else:
        cleanList.append(word)
print(cleanList)
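
One possible way to extend the steps above to a file with many articles is to wrap them in a function and call it once per article. The sketch below rests on an assumption the post does not confirm: that listfile.txt holds one article per line. If the saved list is laid out differently, only the splitting step would need to change. The names `strip_tags`, `process_article`, and `process_articles` are hypothetical helpers introduced for illustration, and the date extraction is left out for brevity; it would slot into `process_article` in the same way as the title.

```python
# Marker that starts the body text, copied from the question's code.
BODY_MARKER = '<div class="story-element story-element-text">'


def strip_tags(page_contents):
    """Drop everything between '<' and '>' (same idea as stripTags in the question)."""
    inside_tag = False
    chars = []
    for char in page_contents:
        if char == '<':
            inside_tag = True
        elif char == '>' and inside_tag:
            inside_tag = False
        elif not inside_tag:
            chars.append(char)
    return ''.join(chars)


def process_article(raw_article, stop_words, punctuation_marks):
    """Turn one article's raw HTML into a dict with title, body text, and clean words."""
    body_start = raw_article.find(BODY_MARKER)
    if body_start == -1:
        # No body marker found: treat the whole string as metadata.
        body_start = len(raw_article)
    article_data = raw_article[:body_start]
    body_text = strip_tags(raw_article[body_start:])

    title_end = article_data.find('</h1>')
    title = strip_tags(article_data[:title_end]).strip() if title_end != -1 else ''

    clean_words = []
    for word in body_text.split():
        for mark in punctuation_marks:
            word = word.replace(mark, '')
        # Skip words that became empty and words in the stopword collection.
        if word and word not in stop_words:
            clean_words.append(word)
    return {"Title": title, "Text": body_text, "Words": clean_words}


def process_articles(raw_articles, stop_words, punctuation_marks):
    """Apply process_article to every non-blank item of an iterable of strings."""
    # ASSUMPTION: each item (for example, each line of the file) is one article.
    return [
        process_article(item, stop_words, punctuation_marks)
        for item in raw_articles
        if item.strip()
    ]


# Usage with the question's file and bnlp (commented out because it needs both):
# from bnlp.corpus import stopwords, punctuations
# with open('listfile.txt', 'r', encoding='utf8') as my_file:
#     all_articles = process_articles(my_file, set(stopwords()), punctuations)
```

Passing the open file object works because iterating over it yields one line at a time. Converting the stopwords to a set is optional; it only makes the `not in` check faster on long articles.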