How to fix an error when converting data extracted from links into a DataFrame with BeautifulSoup and pandas

The script uses BeautifulSoup to extract data from a list of URLs and converts that data into a DataFrame so it can be exported to an Excel file.

The problem is that when I try to convert the data into a DataFrame, it raises the following error:

vals['b']

My question is: how do I fix this error?

I think that if an extracted value is empty, I could still include it and show NaN or null instead.

Traceback:

Traceback (most recent call last):
  File "f:\AIenv\web_scrapping\job_desc_email.py",line 144,in <module>
    scrap_website()
  File "f:\AIenv\web_scrapping\job_desc_email.py",line 88,in scrap_website
    convert_to_dataFrame(joineddd)
  File "f:\AIenv\web_scrapping\job_desc_email.py",line 98,in convert_to_dataFrame
    df = pd.DataFrame(joineddd,columns=["link","location","Company_Industry","Company_Type","Job_Role","Employment_Type","Monthly_Salary_Range","Number_of_Vacancies","Career_Level","Years_of_Experience","Residence_Location","Gender","Nationality","Degree","Age"])
  File "F:\AIenv\lib\site-packages\pandas\core\frame.py",line 509,in __init__
    arrays,columns = to_arrays(data,columns,dtype=dtype)
  File "F:\AIenv\lib\site-packages\pandas\core\internals\construction.py",line 524,in to_arrays
    return _list_to_arrays(data,coerce_float=coerce_float,line 567,in _list_to_arrays
    raise ValueError(e) from e
ValueError: 15 columns passed,passed data had 13 columns

Solution

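I ran your code on my machine and did not get any error. However, I did notice a performance issue in it. First, stop printing to the screen on every loop iteration just to confirm that the script is working. Doing so is a significant performance cost: Related Question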

If you really want to know whether your code is working, print once every 100 iterations, like this:

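import requests
from bs4 import BeautifulSoup

# `links` is the list of URLs from the original script.
for index, link in enumerate(links):
    if index % 100 == 0 and index != 0:  # report progress every 100 pages
        print(f"Scraping page {index}.")
    s = BeautifulSoup(requests.get(link).content, "lxml")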

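The error is self-explanatory: you declared 15 column names, but the rows in your data hold only 13 values. That means some of the fields you scrape came back empty or missing, so by the time convert_to_dataFrame() runs at the end of the script the rows are shorter than the column list. Before appending any value, check that the element you are looking for actually exists on the page, and store a placeholder such as NaN when it does not. There are several ways to do this. Alternatively, you can create a function called clear_dataframe() and pass the list to it at the end of the scraping process, so the rows are normalized before the DataFrame is built.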
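Both ideas can be sketched roughly as follows. This is only an illustration under stated assumptions, not your actual script: I have not seen the markup of your pages, so the sketch assumes each field sits in a table row with a `<th>` label and a `<td>` value. The names `extract_row` and `LOOKUP` are made up for the example, and the body of `clear_dataframe()` is just one possible interpretation of the helper mentioned above. The column names are copied from your traceback. It uses the built-in "html.parser" so it runs without lxml.

```python
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# Column names taken from the traceback in the question.
COLUMNS = ["link", "location", "Company_Industry", "Company_Type", "Job_Role",
           "Employment_Type", "Monthly_Salary_Range", "Number_of_Vacancies",
           "Career_Level", "Years_of_Experience", "Residence_Location",
           "Gender", "Nationality", "Degree", "Age"]

# Hypothetical helper: maps a normalized page label to its column name.
LOOKUP = {name.lower(): name for name in COLUMNS}


def extract_row(link, html):
    """Build one row as a dict; any field not found on the page stays NaN.

    Assumes (hypothetically) that fields are laid out as <tr><th>label</th><td>value</td></tr>.
    """
    soup = BeautifulSoup(html, "html.parser")
    row = dict.fromkeys(COLUMNS, np.nan)  # every column starts out as NaN
    row["link"] = link
    for tr in soup.find_all("tr"):
        label, value = tr.find("th"), tr.find("td")
        if label is None or value is None:
            continue  # element missing on this page: keep the NaN placeholder
        key = label.get_text(strip=True).lower().replace(" ", "_")
        text = value.get_text(strip=True)
        if key in LOOKUP and text:
            row[LOOKUP[key]] = text
    return row


def clear_dataframe(rows, columns=COLUMNS):
    """Pad list-style rows with trailing NaN so each has len(columns) values.

    Padding at the end only lines up correctly when the missing fields are
    the last ones; keyed rows as in extract_row() avoid that limitation.
    """
    padded = [list(r) + [np.nan] * (len(columns) - len(r)) for r in rows]
    return pd.DataFrame(padded, columns=columns)
```

With rows built as dicts, `pd.DataFrame(rows, columns=COLUMNS)` always receives all 15 keys, so the column count can no longer disagree with the data. The padding variant is the quicker patch for rows that are already plain lists, but it cannot tell which field was missing.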
