Titan "supernodes"

gak*_*gak 5 groovy cassandra graph-databases titan

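I'm developing an application that works well with a graph database (Titan), except that it has problems with vertices that have many edges, i.e. supernodes.

The supernodes link above points to a blog post by the Titan authors that explains a way to resolve the problem. The solution seems to be reducing the number of vertices by filtering on edges.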

Unfortunately I want to groupCount attributes of edges or vertices. For example I have 1 million users and each user belongs to a country. How can I do a fast groupCount to work out the number of users in each country?

What I've tried so far can be shown in this elaborate groovy script:

g = TitanFactory.open('titan.properties')  // Cassandra
r = new Random(100)
people = 1e6

def newKey(g, name, type) {
    return g
        .makeType()
        .name(name)
        .simple()
        .functional()
        .indexed()
        .dataType(type)
        .makePropertyKey()
}

def newLabel(g, name, key) {
    return g
        .makeType()
        .name(name)
        .primaryKey(key)
        .makeEdgeLabel()
}

country = newKey(g, 'country', String.class)
newLabel(g, 'lives', country)

g.stopTransaction(SUCCESS)

root = g.addVertex()
countries = ['AU', 'US', 'CN', 'NZ', 'UK', 'PL', 'RU', 'NL', 'FR', 'SP', 'IT']

(1..people).each {
    country = countries[(r.nextFloat() * countries.size()).toInteger()]
    g.startTransaction()
    person = g.addVertex([name: 'John the #' + it])
    g.addEdge(g.getVertex(root.id), person, 'lives', [country: country])
    g.stopTransaction(SUCCESS)
}

t0 = new Date().time

m = [:]    
root = g.getVertex(root.id)
root.outE('lives').country.groupCount(m).iterate()

t1 = new Date().time

println "groupCount seconds: " + ((t1 - t0) / 1000)

Basically there is one root node (because Titan has no "all nodes" lookup), linked to many person vertices via edges that carry the country property. When I run the groupCount() on 1 million vertices, it takes over a minute.

I realise Titan is probably iterating over each edge and collecting counts, but is there a way to make this run faster in Titan, or any other graph database? Can the index itself be counted so it doesn't have to traverse? Are my indexes correct?
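To make the cost concrete, here is a minimal plain-Groovy sketch of what a group count over edge properties amounts to. It does not use Titan at all: the `edges` list, the `touched` counter and the reduced edge count are assumptions made purely for illustration, with each map standing in for one 'lives' edge.

```groovy
// Plain-Groovy model of a group count over edge properties; no Titan involved.
// Each map stands in for one 'lives' edge carrying a 'country' property (hypothetical data).
def countries = ['AU', 'US', 'CN', 'NZ']
def r = new Random(100)
def edges = (1..10000).collect { [country: countries[r.nextInt(countries.size())]] }

// One pass over every edge, incrementing a counter per country.
def counts = [:].withDefault { 0 }
int touched = 0
edges.each { e ->
    counts[e.country] += 1
    touched++
}

println counts
println "edges touched: ${touched}"
```

The work grows linearly with the number of edges on the root vertex, which is why a single supernode with a million edges dominates the running time.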

小智 8

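If you define "country" as the primary key of the "lives" label, then you can retrieve all people for a particular country more quickly. However, in your case you are interested in a group count, which requires all of the edges of that root node to be retrieved in order to iterate over them and bucket the countries.

Hence, this kind of analytical query is much better suited to the graph analytics framework Faunus. It does not need a root vertex, because it executes the groupCount by way of a complete database scan and thereby avoids the supernode problem. Faunus also uses Gremlin as its query language, so you only have to modify your query slightly: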
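The difference between the two access patterns can be sketched in plain Groovy. This is only a model under an assumption: that a primary key keeps a vertex's edges ordered by that key, so a lookup for one country can jump to its slice. The `edges` list and the `lookup` closure are hypothetical names, and nothing here reflects Titan's actual storage layout.

```groovy
// Hypothetical model: edge 'country' values kept in sorted order,
// as a primary key on the label is assumed to arrange them.
def edges = ['US', 'AU', 'NZ', 'AU', 'US', 'US', 'CN'].sort()   // AU AU CN NZ US US US

// Lookup for one country: binary-search to the start of its slice,
// then read only the matching entries. Returns how many were read.
def lookup = { List<String> sorted, String key ->
    int lo = 0, hi = sorted.size()
    while (lo < hi) {
        int mid = (lo + hi).intdiv(2)
        if (sorted[mid] < key) lo = mid + 1 else hi = mid
    }
    int read = 0
    while (lo + read < sorted.size() && sorted[lo + read] == key) read++
    return read
}

// Group count: every entry has to be visited, sorted or not.
def counts = edges.countBy { it }

println "AU lookup read ${lookup(edges, 'AU')} of ${edges.size()} entries"
println "group count visited all ${edges.size()} entries: ${counts}"
```

The sorted order helps the single-country lookup skip most entries, but it does not let the group count avoid visiting each edge once.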

g.V.country.groupCount.cap...

HTH, Matthias