AWS Neptune performance / Gremlin issue


I am loading data into Neptune using Gremlin. The Neptune infrastructure uses a db.r5.4xlarge instance (16 vCPUs). The data is loaded into Neptune with PySpark from an AWS Glue job running 5 workers.

The data is loaded by upserting a deduplicated dataset, with records batched together (50 records per batch) and sent to Neptune as a single query.
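To make the batching step concrete, here is a minimal sketch of splitting a deduplicated record stream into groups of 50. The helper name `batched` and the sample records are hypothetical and only illustrate the idea; the actual job runs in PySpark, which is not shown here.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

Record = TypeVar("Record")


def batched(records: Iterable[Record], batch_size: int = 50) -> Iterator[List[Record]]:
    """Yield consecutive lists holding at most batch_size records each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical sample data: 120 vertex records with an id and a label source.
records = [{"id": f"v{i}", "source": "device"} for i in range(120)]
batch_sizes = [len(batch) for batch in batched(records)]
print(batch_sizes)  # [50, 50, 20]
```

Each yielded batch would then be turned into one chained upsert query.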

Vertices: all vertices to be loaded into the graph are computed after deduplication (no duplicate nodes).

Query used:

g.V().has(T.id,record.id).fold().coalesce(__.unfold(),__.addV(record.source).property(T.id,record.id))
.V().has(T.id,record.id).fold().coalesce(__.unfold(),__.addV(record.source).property(T.id,record.id))
(repeated for 48 more items)
.next()

Loading 2.45 million unique vertices took 5 minutes.

Edges: all edges to be loaded into the graph are computed after deduplication (no duplicate edges).
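The edge upsert query in this post matches existing edges with `bothE()`, so an edge from A to B and one from B to A would be treated as the same edge. A minimal sketch of deduplicating under that reading follows. Whether the real job deduplicates this way is an assumption, and the sample data is hypothetical; only the field names `id1` and `id2` come from the queries in this post.

```python
from typing import Dict, Iterable, List, Tuple


def dedupe_undirected(edges: Iterable[dict]) -> List[dict]:
    """Keep the first edge seen for each unordered (id1, id2) pair."""
    first_seen: Dict[Tuple[str, str], dict] = {}
    for edge in edges:
        # Sorting the endpoints makes (a, b) and (b, a) share one key.
        key = tuple(sorted((edge["id1"], edge["id2"])))
        first_seen.setdefault(key, edge)
    return list(first_seen.values())


# Hypothetical sample data.
edges = [
    {"id1": "a", "id2": "b", "count": 1},
    {"id1": "b", "id2": "a", "count": 2},  # same pair, reversed direction
    {"id1": "a", "id2": "c", "count": 1},
]
unique_edges = dedupe_undirected(edges)
print(len(unique_edges))  # 2
```

If direction matters for the data, the sorting step should be dropped so that each ordered pair stays distinct.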

Query used:

g.V(edgeData.id1).bothE().where(__.otherV().hasId(edgeData.id2)).fold().coalesce(__.unfold(),__.addE('coincided_with').from_(__.V(edgeData.id1)).to(__.V(edgeData.id2))).property(Cardinality.single,'timestamp',edgeData.timestamp).property(Cardinality.single,'count',edgeData.count)
.V(edgeData.id1).bothE().where(__.otherV().hasId(edgeData.id2)).fold().coalesce(__.unfold(),__.addE('coincided_with').from_(__.V(edgeData.id1)).to(__.V(edgeData.id2))).property(Cardinality.single,'timestamp',edgeData.timestamp).property(Cardinality.single,'count',edgeData.count)
(repeated for 48 more items)
.next()

Loading 1.88M unique edges with properties took 21 minutes.

If we only create the edges, without setting any properties on them:

Query used:

g.V(edgeData.id1).bothE().where(__.otherV().hasId(edgeData.id2)).fold().coalesce(__.unfold(),__.addE('coincided_with').from_(__.V(edgeData.id1)).to(__.V(edgeData.id2)))
.V(edgeData.id1).bothE().where(__.otherV().hasId(edgeData.id2)).fold().coalesce(__.unfold(),__.addE('coincided_with').from_(__.V(edgeData.id1)).to(__.V(edgeData.id2)))
(repeated for 48 more items)
.next()

Loading 1.88M unique edges without properties took 4 minutes.
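For comparison, the three timings reported so far can be converted into throughput figures. The sketch below does only that arithmetic. The projection to the 40M edges mentioned in this post assumes the rate stays constant at larger scale, which is an assumption and not something measured here.

```python
def records_per_second(record_count: int, minutes: float) -> float:
    """Average load rate for record_count records loaded in the given minutes."""
    return record_count / (minutes * 60)


def projected_hours(record_count: int, rate_per_second: float) -> float:
    """Hours needed for record_count records, assuming the rate stays constant."""
    return record_count / rate_per_second / 3600


vertex_rate = records_per_second(2_450_000, 5)
edge_rate_with_props = records_per_second(1_880_000, 21)
edge_rate_no_props = records_per_second(1_880_000, 4)
slowdown = edge_rate_no_props / edge_rate_with_props

hours_with_props = projected_hours(40_000_000, edge_rate_with_props)
hours_no_props = projected_hours(40_000_000, edge_rate_no_props)

print(f"vertices:               {vertex_rate:,.0f}/s")           # about 8,167/s
print(f"edges with properties:  {edge_rate_with_props:,.0f}/s")  # about 1,492/s
print(f"edges without:          {edge_rate_no_props:,.0f}/s")    # about 7,833/s
print(f"slowdown from properties: {slowdown:.2f}x")              # 5.25x
print(f"40M edges with properties: {hours_with_props:.1f} h")    # about 7.4 h
print(f"40M edges without:         {hours_no_props:.1f} h")      # about 1.4 h
```

On these numbers, setting two properties per edge makes the load roughly 5.25 times slower than creating the edges alone.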

Performance issues:

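Ideally we should not see any ConcurrentModification exceptions when inserting vertices, but we get them frequently, even when creating vertices in a fresh Neptune instance (db.r5.4xlarge). We work around them by retrying, but when inserting edges between vertices (A -> B), the insert still fails even after 10 retries at 300 ms intervals. Overall, we end up spending more time inserting our data. Is there a way to avoid the concurrency exceptions, given that we already avoid concurrent scenarios?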
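When edge properties are added during the batch upsert, the load takes much longer than upserting edges without properties. For example, when adding 2 properties to each edge: 1.8M edges with properties took close to 21 minutes, while 1.8M edges without properties took close to 4 minutes. Creating edges with properties is much slower. Is there any way to speed up loading edges with properties? (We have 40M edges, so the insert takes much longer.)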
When we add more parallel workers, the load gets slower and produces more concurrency errors (CPU load is around 50% and does not max out).
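The retry behaviour described in the first issue uses a fixed 300 ms interval. As a point of comparison, here is a minimal sketch of a different technique, exponential backoff with full jitter, which spreads retries from competing workers over time. Everything in it is hypothetical: the `ConcurrentModificationError` class is a local stand-in, and how the real exception is detected from the Gremlin driver's error is left as an assumption.

```python
import random
import time


class ConcurrentModificationError(Exception):
    """Local stand-in for the server-side ConcurrentModification error."""


def run_with_retries(operation, is_retryable, max_attempts=10,
                     base_delay=0.3, max_delay=10.0,
                     sleep=time.sleep, rng=random.random):
    """Call operation(), retrying retryable failures with jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or on errors that retrying cannot fix.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            # Full jitter: wait a random time up to the exponential ceiling.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * ceiling)


# Simulated batch upsert that conflicts twice and then succeeds.
calls = {"n": 0}


def flaky_upsert():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConcurrentModificationError("simulated conflict")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = run_with_retries(
    flaky_upsert,
    lambda exc: isinstance(exc, ConcurrentModificationError),
    sleep=delays.append,
)
print(result, calls["n"], len(delays))  # ok 3 2
```

The `sleep` and `rng` parameters are injected only so the sketch can be exercised without real waiting. Whether this kind of backoff reduces the conflicts seen here has not been verified.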

Any suggestions for improving performance would be a great help.
