Error with the entity matching package deepmatcher

How can I resolve this error from the entity matching package deepmatcher?

The error is below:

ValueError                                Traceback (most recent call last)
<ipython-input-6-2d323ffe212f> in <module>()
----> 1 train,validation,test = dm.data.process(path='/content/',train='train.csv',validation='validation.csv',test='test.csv')

1 frames
/usr/local/lib/python3.7/dist-packages/deepmatcher/data/process.py in _check_header(header,id_attr,left_prefix,right_prefix,label_attr,ignore_columns)
     32         if attr not in (id_attr,label_attr) and attr not in ignore_columns:
     33             if not attr.startswith(left_prefix) and not attr.startswith(right_prefix):
---> 34                 raise ValueError('Attribute ' + attr + ' is not a left or a right table '
     35                                  'column,not a label or id and is not ignored. Not sure '
     36                                  'what it is...')

ValueError: Attribute ltable_id is not a left or a right table column,not a label or id and is not ignored. Not sure what it is...

I am using the dataset at

http://pages.cs.wisc.edu/~anhai/data1/deepmatcher_data/Textual/Company/company_exp_data.zip

for this study, because an earlier test with my own dataset gave the same error.

Code:

import deepmatcher as dm
train, validation, test = dm.data.process(path='/content/', train='train.csv', validation='validation.csv', test='test.csv')

That's all. I am following the repo github.com/anhaidgroup/deepmatcher

I am looking for a better understanding and a possible solution. Thanks in advance.

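Solution

I haven't tested it, but the error message suggests that the package expects specially named columns: every column must be the id, the label, an ignored column, or start with the left-table or right-table prefix.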
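To make the message concrete, here is a minimal sketch that mirrors the check visible in the traceback (lines 32 to 34 of `_check_header`). It is not the library's code: the function name `unrecognized_columns` is hypothetical, the default values are assumed from the documentation quoted in this answer, and the sample header is illustrative rather than copied from the dataset.

```python
def unrecognized_columns(header, id_attr="id", label_attr="label",
                         left_prefix="left_", right_prefix="right_",
                         ignore_columns=()):
    """Return the column names the prefix check would reject.

    Hypothetical helper; defaults are assumptions based on the docs.
    """
    rejected = []
    for attr in header:
        # The id, the label and explicitly ignored columns are always accepted.
        if attr in (id_attr, label_attr) or attr in ignore_columns:
            continue
        # Everything else must start with the left or the right prefix.
        if not attr.startswith(left_prefix) and not attr.startswith(right_prefix):
            rejected.append(attr)
    return rejected


# Illustrative header using the ltable_/rtable_ naming from the error message.
header = ["id", "label", "ltable_id", "rtable_id",
          "ltable_content", "rtable_content"]

print(unrecognized_columns(header))
print(unrecognized_columns(header, left_prefix="ltable_", right_prefix="rtable_"))
```

With the assumed default prefixes every `ltable_`/`rtable_` column is rejected; with matching prefixes the list is empty.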

A look at the deepmatcher repo page confirms this.

It has an example table with columns named Left ... and Right ...

It also links to Get Started, where you can see

Step 1. Process labeled data

"Left" attributes (required): ...
   These column names are expected to be prefixed with "left_" by default.

"Right" attributes (required): ...
   These column names are expected to be prefixed with "right_" by default.

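This shows that the columns need the prefixes left_ and right_, but your data has the columns ltable_id and rtable_id. So you would have to rename the columns after loading the data and before using it with deepmatcher.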
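A minimal sketch of the renaming approach, using only the standard library. The helper names `rename_prefixes` and `rewrite_header` are hypothetical, and the sample CSV is made up for illustration; with real files you would pass open file objects instead of `io.StringIO`.

```python
import csv
import io


def rename_prefixes(columns, mapping):
    """Replace a leading prefix in each column name according to mapping."""
    renamed = []
    for name in columns:
        for old, new in mapping.items():
            if name.startswith(old):
                name = new + name[len(old):]
                break  # apply at most one prefix replacement per column
        renamed.append(name)
    return renamed


def rewrite_header(src, dst, mapping):
    """Copy CSV rows from src to dst, renaming only the header row."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(rename_prefixes(next(reader), mapping))
    writer.writerows(reader)


PREFIX_MAP = {"ltable_": "left_", "rtable_": "right_"}

# Made-up sample data standing in for train.csv.
sample = io.StringIO("id,label,ltable_id,rtable_id\n0,1,10,20\n")
out = io.StringIO()
rewrite_header(sample, out, PREFIX_MAP)
print(out.getvalue())
```

Only the header row changes; the data rows are copied through untouched.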

EDIT:

The repo also links to Data Processing,

where you can see

"Left" attributes (required): ...
  This can be customized by setting the left_prefix parameter (e.g., use "ltable_" as the prefix).
"Right" attributes (required): ...
  This can be customized by setting the right_prefix parameter (e.g., use "rtable_" as the prefix).

and it shows example code

dm.data.process(... left_prefix='left_', right_prefix='right_', ...)

which means you can keep your column names and simply do

dm.data.process(... left_prefix='ltable_', right_prefix='rtable_', ...)

EDIT:

I tested it, and it resolves this error with company_exp_data.zip.

import deepmatcher as dm

train, validation, test = dm.data.process(
    path='/content/',  # or path='exp_data'
    train='train.csv', validation='valid.csv', test='test.csv',
    left_prefix='ltable_', right_prefix='rtable_',
)

But then a different problem appears.

RuntimeError: Google drive link https://drive.google.com/uc?export=download&id=1Vih8gAmgBnuYDxfblbT94P6WjB7s1ZSh is currently unavailable,because the quota was exceeded.

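It tries to read some data from Google Drive, but the quota was exceeded.

Maybe the data has to be downloaded manually and the source code changed to load it from the local machine, but that is a topic for a new question. Alternatively, this problem should be reported to the module's author, who should host the data on another server and update the source code.

Summary: your problem is that you did not read the documentation.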