Web scraping with Newspaper3k returns only 50 articles

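I am trying to scrape data from a French website with newspaper3k, but the result contains only 50 articles. The site has far more than 50 articles. Where am I going wrong?

My goal is to scrape all the articles on this site.

This is what I tried: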

import newspaper

legorafi_paper = newspaper.build('http://www.legorafi.fr/',memoize_articles=False)

# Empty list to put all urls
papers = []

for article in legorafi_paper.articles:
    papers.append(article.url)

print(legorafi_paper.size())

This prints 50 articles.

I don't understand why newspaper3k scrapes only 50 articles and no more.

Update on what I have tried:

def Foo(firstTime = []):
    if firstTime == []:
        WebDriverWait(driver,30).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"div#appconsent>iframe")))
        firstTime.append('Not Empty')
    else:
        print('Cookies already accepted')


%%time


categories = ['societe','politique']


import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

import newspaper
import requests
from newspaper.utils import BeautifulSoup
from newspaper import Article

categories = ['people','sports']
papers = []
# set of article urls already seen (used further down when collecting links)
urls_set = set()


driver = webdriver.Chrome(executable_path="/Users/name/Downloads/chromedriver 4")
driver.get('http://www.legorafi.fr/')


for category in categories:
    url = 'http://www.legorafi.fr/category/' + category
    #WebDriverWait(self.driver,10)
    driver.get(url)
    Foo()
    WebDriverWait(driver,30).until(EC.element_to_be_clickable((By.CSS_SELECTOR,"button.button--filled>span.baseText"))).click()

    pagesToGet = 2

    title = []
    content = []
    for page in range(1,pagesToGet+1):
        print('Processing page :',page)
        #url = 'http://www.legorafi.fr/category/france/politique/page/'+str(page)
        print(driver.current_url)
        #print(url)

        time.sleep(3)

        raw_html = requests.get(url)
        soup = BeautifulSoup(raw_html.text,'html.parser')
        for articles_tags in soup.findAll('div',{'class': 'articles'}):
            for article_href in articles_tags.find_all('a',href=True):
                if not str(article_href['href']).endswith('#commentaires'):
                    urls_set.add(article_href['href'])
                    papers.append(article_href['href'])


        for url in papers:
            article = Article(url)
            article.download()
            article.parse()
            if article.title not in title:
                title.append(article.title)
            if article.text not in content:
                content.append(article.text)
            #print(article.title,article.text)

        time.sleep(5)
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        driver.find_element_by_xpath("//a[contains(text(),'Suivant')]").click()
        time.sleep(10)

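Solution

Update 09-21-2020

I rechecked your code and it works correctly, in that it extracts all the articles on Le Gorafi's main page. The articles on this page are highlights from the category pages, such as society, sports, etc.

The example below is from the main page's source code. Each of these articles is also listed on the sports category page.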

<div class="cat sports">
    <a href="http://www.legorafi.fr/category/sports/">
       <h4>Sports</h4>
          <ul>
              <li>
                 <a href="http://www.legorafi.fr/2020/07/24/chaque-annee-25-des-lutteurs-doivent-etre-operes-pour-defaire-les-noeuds-avec-leur-bras/" title="Voir l'article 'Chaque année,25% des lutteurs doivent être opérés pour défaire les nœuds avec leur bras'">
                  Chaque année,25% des lutteurs doivent être opérés pour défaire les nœuds avec leur bras</a>
              </li>
               <li>
                <a href="http://www.legorafi.fr/2020/07/09/frank-mccourt-lom-nest-pas-a-vendre-sauf-contre-beaucoup-dargent/" title="Voir l'article 'Frank McCourt « L'OM n'est pas à vendre sauf contre beaucoup d'argent »'">
                  Frank McCourt « L'OM n'est pas à vendre sauf contre beaucoup d'argent </a>
              </li>
              <li>
                <a href="http://www.legorafi.fr/2020/06/10/euphorique-un-parieur-appelle-son-fils-betclic/" title="Voir l'article 'Euphorique,un parieur appelle son fils Betclic'">
                  Euphorique,un parieur appelle son fils Betclic                 </a>
              </li>
           </ul>
               <img src="http://www.legorafi.fr/wp-content/uploads/2015/08/rubrique_sport1-300x165.jpg"></a>
        </div>
              </div>

There appear to be 35 unique article entries on the main page.

import newspaper

legorafi_paper = newspaper.build('http://www.legorafi.fr',memoize_articles=False)

papers = []
urls_set = set()
for article in legorafi_paper.articles:
    # check to see if the article url is not within the urls_set
    if article.url not in urls_set:
        # add the unique article url to the set
        urls_set.add(article.url)
        # remove all links for article commentaires
        if not str(article.url).endswith('#commentaires'):
            papers.append(article.url)

print(len(papers))
# output: 35

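If I change the URL in the code above to http://www.legorafi.fr/category/sports, it returns the same number of articles as http://www.legorafi.fr. After looking at the source code for Newspaper on GitHub, it seems that the module uses urlparse, and that it works from the netloc segment of the parsed URL. The netloc is www.legorafi.fr in both cases. According to an open issue, this is a known problem with Newspaper.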
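To see why the two URLs could look identical to a tool that keys on the netloc, here is a minimal sketch using only the standard library. It is not Newspaper's own code, just an illustration of what urlparse returns for the home page and the sports category page.

```python
from urllib.parse import urlparse

# The two URLs tried above: the home page and the sports category page.
home = urlparse("http://www.legorafi.fr")
sports = urlparse("http://www.legorafi.fr/category/sports")

# Both URLs share the same network location ...
print(home.netloc)
print(sports.netloc)

# ... and differ only in the path, which is lost if only netloc is kept.
print(repr(home.path))
print(repr(sports.path))

same_site = home.netloc == sports.netloc
print(same_site)
```

If the category path is dropped and only the netloc is kept, both builds start from the same site root, which would be consistent with getting the same article count.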

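Obtaining all the articles is more complicated, because you have to use some additional modules, including requests and BeautifulSoup. The latter can be imported from newspaper. The code below can be refined to obtain all the articles within the source code of the main page and the category pages using requests and BeautifulSoup.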

import newspaper
import requests
from newspaper.utils import BeautifulSoup

papers = []
urls_set = set()

legorafi_paper = newspaper.build('http://www.legorafi.fr',fetch_images=False,memoize_articles=False)
for article in legorafi_paper.articles:
   if article.url not in urls_set:
     urls_set.add(article.url)
     if not str(article.url).endswith('#commentaires'):
       papers.append(article.url)

 
legorafi_urls = {
    'monde-libre': 'http://www.legorafi.fr/category/monde-libre',
    'politique': 'http://www.legorafi.fr/category/france/politique',
    'societe': 'http://www.legorafi.fr/category/france/societe',
    'economie': 'http://www.legorafi.fr/category/france/economie',
    'culture': 'http://www.legorafi.fr/category/culture',
    'people': 'http://www.legorafi.fr/category/people',
    'sports': 'http://www.legorafi.fr/category/sports',
    'hi-tech': 'http://www.legorafi.fr/category/hi-tech',
    'sciences': 'http://www.legorafi.fr/category/sciences',
    'ledito': 'http://www.legorafi.fr/category/ledito/',
}


for category,url in legorafi_urls.items():
   raw_html = requests.get(url)
   soup = BeautifulSoup(raw_html.text,'html.parser')
   for articles_tags in soup.findAll('div',{'class': 'articles'}):
      for article_href in articles_tags.find_all('a',href=True):
         if not str(article_href['href']).endswith('#commentaires'):
           urls_set.add(article_href['href'])
           papers.append(article_href['href'])

print(len(papers))
# output: 155
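Because the category loop appends every link it finds to papers, the same article can appear more than once in that list. A minimal sketch of one way to clean it up afterwards is shown here. The helper name unique_article_urls is hypothetical, and treating URLs that differ only by a trailing slash as the same article is an assumption, not something the site guarantees.

```python
def unique_article_urls(hrefs):
    """Return hrefs in first-seen order, without duplicates or comment links."""
    seen = set()
    unique = []
    for href in hrefs:
        # Skip links that point at an article's comment section.
        if href.endswith('#commentaires'):
            continue
        # Assumption: a trailing slash does not change which article is meant.
        key = href.rstrip('/')
        if key not in seen:
            seen.add(key)
            unique.append(href)
    return unique


# Small offline sample built from links shown in the home page source above.
sample = [
    'http://www.legorafi.fr/2020/06/10/euphorique-un-parieur-appelle-son-fils-betclic/',
    'http://www.legorafi.fr/2020/06/10/euphorique-un-parieur-appelle-son-fils-betclic/#commentaires',
    'http://www.legorafi.fr/2020/06/10/euphorique-un-parieur-appelle-son-fils-betclic',
    'http://www.legorafi.fr/2020/07/24/chaque-annee-25-des-lutteurs-doivent-etre-operes-pour-defaire-les-noeuds-avec-leur-bras/',
]
cleaned = unique_article_urls(sample)
print(len(cleaned))
```

Calling unique_article_urls(papers) after the loops would give a count of distinct articles rather than a count of links found.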

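If you need to obtain the articles listed on the sub-pages of a category page (politique currently has 120 sub-pages), then you will have to use something like Selenium to click the links.

Hopefully this code helps you get closer to achieving your objective.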
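As a possible alternative to clicking through with Selenium, the commented-out line in the question suggests that sub-pages follow a /page/N URL pattern. The sketch below only builds those URLs. The function name category_page_urls is hypothetical, and whether every category (and page 1 in particular) actually responds at these addresses is an assumption that would need to be checked against the site.

```python
def category_page_urls(category_url, pages_to_get):
    """Build the sub-page URLs of a category, assuming a /page/N pattern."""
    # Normalise the base so 'category/ledito/' and 'category/ledito' behave alike.
    base = category_url.rstrip('/')
    return [f"{base}/page/{page}" for page in range(1, pages_to_get + 1)]


page_urls = category_page_urls('http://www.legorafi.fr/category/france/politique', 3)
for page_url in page_urls:
    print(page_url)
```

Each generated URL could then be fetched with requests and parsed with BeautifulSoup in the same way as the category pages above, stopping when a request no longer returns a listing.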
