
Has anyone run into duplicated text when using Trafilatura?


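I ran into duplicated text when scraping the web page below with trafilatura, even though I set deduplicate=True. Does anyone know whether this is a shortcoming of the package, or whether there is a parameter I can toggle to get rid of this behavior?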

https://www.federalreserve.gov/monetarypolicy/fomcminutes20200429.htm

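import trafilatura

# Page being scraped (the FOMC minutes linked above)
url = 'https://www.federalreserve.gov/monetarypolicy/fomcminutes20200429.htm'
# fetch_url returns the page HTML, or None if the download fails
downloaded = trafilatura.fetch_url(url)
text = trafilatura.extract(downloaded, target_language='en', include_tables=True, deduplicate=True)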

Output

The full text is too long to quote here, but several snippets are duplicated:

**Notation Vote**
By notation Vote completed on May 19, 2020, the Committee unanimously approved the minutes of the Committee meeting held on April 28–29, 2020.
Notation Vote
By notation Vote completed on May 19, 2020.
**Staff Economic Outlook**
The projection for the U.S. economy prepared by the staff for the June FOMC meeting was downgraded, on balance, as compared with the April meeting forecast in response to information on the spread of the coronavirus and changes in the measures undertaken to contain it both at home and abroad, along with incoming economic data. U.S. real GDP was forecast to show a historically large decline in the second quarter of this year, and the unemployment rate was expected to be sharply higher than in the first quarter. The substantial fiscal policy measures and appreciable support from monetary policy, along with the Federal Reserve's liquidity and lending facilities, were expected to help mitigate the deterioration in current economic conditions and to help boost the recovery.
Staff Economic Outlook
The projection for the U.S. economy prepared by the staff for the June FOMC meeting was downgraded, were expected to help mitigate the deterioration in current economic conditions and to help boost the recovery.
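
One possible workaround, independent of any trafilatura setting, is to filter the extracted text afterwards. The following is only a minimal sketch under stated assumptions: `drop_repeated_lines` and `_normalize` are hypothetical helper names, not part of trafilatura. The sketch assumes a repeat is either the same line again (ignoring `**` markers, case, and extra whitespace) or a shortened copy that is a prefix of an earlier line. It would not catch a repeat that drops words from the middle of a paragraph, and it could wrongly drop a short line that happens to begin an earlier one.

```python
import re


def _normalize(line: str) -> str:
    """Build a comparison key: no bold markers, collapsed whitespace, lowercase."""
    key = line.replace("**", "")
    key = re.sub(r"\s+", " ", key).strip().lower()
    return key.rstrip(".")


def drop_repeated_lines(text: str) -> str:
    """Keep the first occurrence of each line; drop later exact or prefix repeats."""
    seen = []  # normalized keys of the lines kept so far
    kept = []  # original lines to return
    for line in text.splitlines():
        key = _normalize(line)
        if not key:
            kept.append(line)  # leave blank lines untouched
            continue
        # Assumption: a line is a repeat if an earlier line starts with it.
        if any(earlier.startswith(key) for earlier in seen):
            continue
        seen.append(key)
        kept.append(line)
    return "\n".join(kept)
```

With the snippet from the question, this could be applied as `clean = drop_repeated_lines(text)` on the string returned by `trafilatura.extract`.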
