Is there a way to optimize this gremlin query?

Cha*_*ard 3 graph-databases gremlin amazon-neptune

I have a graph database which looks like this (simplified) diagram:

Graph diagram

Each unique ID has many properties, which are represented as edges from the ID to unique values of that property. Basically that means that if two ID nodes have the same email, then their has_email edges will both point to the same node. In the diagram, the two shown IDs share both a first name and a last name.

I'm having difficulty writing an efficient Gremlin query to find matching IDs, for a given set of "matching rules". A matching rule will consist of a set of properties which must all be the same for IDs to be considered to have come from the same person. The query I'm currently using to match people based on their first name, last name, and email looks like:

g.V().match(
    __.as("id").hasId("some_id"),
    __.as("id")
        .out("has_firstName")
        .in("has_firstName")
        .as("firstName"),
    __.as("id")
        .out("has_lastName")
        .in("has_lastName")
        .as("lastName"),
    __.as("id")
        .out("has_email")
        .in("has_email")
        .as("email"),
    where("firstName", eq("lastName")),
    where("firstName", eq("email")),
    where("firstName", neq("id"))
).select("firstName")

The query returns a list of IDs which match the input some_id.

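When this query tries to match an ID with a particularly common first name, it becomes very, very slow. I suspect that match is the problem, but so far I have had no luck finding an alternative.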

Dan*_*itz 5

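The performance of this query depends on the edge degree of the vertices in your graph. Because many people share the same first name, a given firstName vertex is likely to have a very large number of incoming edges. You can make some assumptions: fewer people share a last name than share a first name, and fewer still share an email address. With that knowledge, you can start the traversal at the vertex with the lowest degree and filter from there: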

g.V().hasId("some_id").as("id").
  out("has_email").in("has_email").where(neq("id")).
  filter(out("has_lastName").where(__.in("has_lastName").as("id"))).
  filter(out("has_firstName").where(__.in("has_firstName").as("id")))

This way, performance depends mainly on the vertex with the lowest edge degree.
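To illustrate why the starting property matters, here is a minimal Python sketch that models the graph with plain dictionaries rather than Gremlin. Everything in it is hypothetical and for illustration only: the `PropertyGraph` class, the `match_ids` helper, and the sample data are not part of TinkerPop or Neptune, and the model assumes each ID has exactly one value per property. It counts how many candidate IDs must be examined when expanding through the email vertex versus the first-name vertex.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory stand-in for the graph (hypothetical, not a Gremlin API)."""

    def __init__(self):
        self.out_edges = defaultdict(dict)  # id -> {label: value}
        self.in_edges = defaultdict(set)    # (label, value) -> set of ids

    def add(self, id_, **props):
        # Assumption: one value per property label for each ID.
        for label, value in props.items():
            self.out_edges[id_][label] = value
            self.in_edges[(label, value)].add(id_)


def match_ids(graph, start_id, expand_label, filter_labels):
    """Expand through one shared value vertex, then filter on the other labels.

    Returns the matching IDs and the number of candidates examined.
    """
    own = graph.out_edges[start_id]
    # Mirrors out(label).in(label).where(neq("id")) in the traversal above.
    candidates = graph.in_edges[(expand_label, own[expand_label])] - {start_id}
    matches = {
        c for c in candidates
        if all(graph.out_edges[c].get(label) == own[label] for label in filter_labels)
    }
    return matches, len(candidates)


g = PropertyGraph()
g.add("id1", firstName="John", lastName="Smith", email="js@example.com")
g.add("id2", firstName="John", lastName="Smith", email="js@example.com")
# 1000 unrelated people who happen to share the common first name.
for i in range(1000):
    g.add(f"other{i}", firstName="John", lastName=f"L{i}", email=f"u{i}@example.com")

by_email, seen_email = match_ids(g, "id1", "email", ["lastName", "firstName"])
by_first, seen_first = match_ids(g, "id1", "firstName", ["lastName", "email"])

print(by_email, seen_email)  # {'id2'} 1
print(by_first, seen_first)  # {'id2'} 1001
```

Both orderings return the same match, but expanding through the email vertex examines 1 candidate while expanding through the first-name vertex examines 1001. A real graph engine has different cost characteristics, so treat this only as an intuition for the ordering argument.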