I imported my data with Michael Hunger's batch importer, which created:
4,612,893 nodes
14,495,063 properties
node properties are indexed.
5,300,237 relationships
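{ Problem } Cypher queries run far too slowly, almost at a crawl: a simple traversal takes more than 5 minutes to return its result set. Please let me know how to tune the server for better performance and what I am doing wrong.
Store details: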
-rw-r--r-- 1 root root 567M Jul 12 12:42 data/graph.db/neostore.propertystore.db
-rw-r--r-- 1 root root 167M Jul 12 12:42 data/graph.db/neostore.relationshipstore.db
-rw-r--r-- 1 root root 40M Jul 12 12:42 data/graph.db/neostore.nodestore.db
-rw-r--r-- 1 root root 7.8M Jul 12 12:42 data/graph.db/neostore.propertystore.db.strings
-rw-r--r-- 1 root root 330 Jul 12 12:42 data/graph.db/neostore.propertystore.db.index.keys
-rw-r--r-- 1 root root 292 Jul 12 12:42 data/graph.db/neostore.relationshiptypestore.db.names
-rw-r--r-- 1 root root 153 Jul 12 12:42 data/graph.db/neostore.propertystore.db.arrays
-rw-r--r-- 1 root root 88 Jul 12 12:42 data/graph.db/neostore.propertystore.db.index
-rw-r--r-- 1 root root 69 Jul 12 12:42 data/graph.db/neostore
-rw-r--r-- 1 root root 58 Jul 12 12:42 data/graph.db/neostore.relationshiptypestore.db
-rw-r--r-- 1 root root 9 Jul 12 12:42 data/graph.db/neostore.id
-rw-r--r-- 1 root root 9 Jul 12 12:42 data/graph.db/neostore.nodestore.db.id
-rw-r--r-- 1 root root 9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.arrays.id
-rw-r--r-- 1 root root 9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.id
-rw-r--r-- 1 root root 9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.index.id
-rw-r--r-- 1 root root 9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.index.keys.id
-rw-r--r-- 1 root root 9 Jul 12 12:42 data/graph.db/neostore.propertystore.db.strings.id
-rw-r--r-- 1 root root 9 Jul 12 12:42 data/graph.db/neostore.relationshipstore.db.id
-rw-r--r-- 1 root root 9 Jul 12 12:42 data/graph.db/neostore.relationshiptypestore.db.id
-rw-r--r-- 1 root root 9 Jul 12 12:42 data/graph.db/neostore.relationshiptypestore.db.names.id
I am using:
neo4j-community-1.9.1
java version "1.7.0_25"
Amazon EC2 m1.large instance with Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-40-virtual x86_64)
RAM ~8GB.
EBS 200 GB, neo4j is running on EBS volume.
Invoked as ./neo4j-community-1.9.1/bin/neo4j start
Below is the neo4j server info:
neostore.nodestore.db.mapped_memory 161M
neostore.relationshipstore.db.mapped_memory 714M
neostore.propertystore.db.mapped_memory 90M
neostore.propertystore.db.index.keys.mapped_memory 1M
neostore.propertystore.db.strings.mapped_memory 130M
neostore.propertystore.db.arrays.mapped_memory 130M
mapped_memory_page_size 1M
all_stores_total_mapped_memory_size 500M
{ Data Model } It resembles a social graph:
User-User
User-[:FOLLOWS]->User
User-Item
User-[:CREATED]->Item
User-[:LIKE]->Item
User-[:COMMENT]->Item
User-[:VIEW]->Item
Cluster-User
User-[:FACEBOOK]->SocialLogin_Cluster
Cluster-Item
Item-[:KIND_OF]->Type_Cluster
Cluster-Cluster
Cluster-[:KIND_OF]->Type_Cluster
{ Some queries } and their timings:
START u=node(467242)
MATCH u-[r1:LIKE|COMMENT]->a<-[r2:LIKE|COMMENT]-lu-[r3:LIKE]-b
WHERE NOT(a=b)
RETURN u,COUNT(b)
Query took 1015348 ms. Returned a count of 70956115.
START a=node:nodes(kind="user")
RETURN a,length(a-[:CREATED|LIKE|COMMENT|FOLLOWS]-()) AS cnt
ORDER BY cnt DESC
LIMIT 10
Query took 231613 ms.
As suggested, I upgraded the box to M1.xlarge and M2.2xlarge.
I tuned the properties below and ran from instance storage (as opposed to EBS).
neo4j.properties
neostore.nodestore.db.mapped_memory=1800M
neostore.relationshipstore.db.mapped_memory=1800M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=150M
neostore.propertystore.db.arrays.mapped_memory=10M
neo4j-wrapper.conf
wrapper.java.additional.1=-d64
wrapper.java.additional.1=-server
wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.initmemory=4098
wrapper.java.maxmemory=8192
But queries like the one below still take about 5-8 minutes to run, which is unacceptable from a recommendation point of view.
Query:
START u=node(467242)
MATCH u-[r1:LIKE]->a<-[r2:LIKE]-lu-[r3:LIKE]-b
RETURN u,COUNT(b)
{ Profiling }
neo4j-sh (0)$ profile START u=node(467242) MATCH u-[r1:LIKE|COMMENT]->a<-[r2:LIKE|COMMENT]-lu-[r3:LIKE]-b RETURN u,COUNT(b);
==> +-------------------------+
==> | u | COUNT(b) |
==> +-------------------------+
==> | Node[467242] | 70960482 |
==> +-------------------------+
==> 1 row
==>
==> ColumnFilter(symKeys=["u", " INTERNAL_AGGREGATEad2ab10d-cfc3-48c2-bea9-be4b9c1b5595"], returnItemNames=["u", "COUNT(b)"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=["u"], aggregates=["( INTERNAL_AGGREGATEad2ab10d-cfc3-48c2-bea9-be4b9c1b5595,Count)"], _rows=1, _db_hits=0)
==> TraversalMatcher(trail="(u)-[r1:LIKE|COMMENT WHERE true AND true]->(a)<-[r2:LIKE|COMMENT WHERE true AND true]-(lu)-[r3:LIKE WHERE true AND true]-(b)", _rows=70960482, _db_hits=71452891)
==> ParameterPipe(_rows=1, _db_hits=0)
neo4j-sh (0)$ profile START u=node(467242) MATCH u-[r1:LIKE|COMMENT]->a<-[r2:LIKE|COMMENT]-lu-[r3:LIKE]-b RETURN count(distinct a),COUNT(distinct b),COUNT(*);
==> +--------------------------------------------------+
==> | count(distinct a) | COUNT(distinct b) | COUNT(*) |
==> +--------------------------------------------------+
==> | 1950 | 91294 | 70960482 |
==> +--------------------------------------------------+
==> 1 row
==>
==> ColumnFilter(symKeys=[" INTERNAL_AGGREGATEe6b94644-0a55-43d9-8337-491ac0b29c8c", " INTERNAL_AGGREGATE1cfcd797-7585-4240-84ef-eff41a59af33", " INTERNAL_AGGREGATEea9176b2-1991-443c-bdd4-c63f4854d005"], returnItemNames=["count(distinct a)", "COUNT(distinct b)", "COUNT(*)"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=[], aggregates=["( INTERNAL_AGGREGATEe6b94644-0a55-43d9-8337-491ac0b29c8c,Distinct)", "( INTERNAL_AGGREGATE1cfcd797-7585-4240-84ef-eff41a59af33,Distinct)", "( INTERNAL_AGGREGATEea9176b2-1991-443c-bdd4-c63f4854d005,CountStar)"], _rows=1, _db_hits=0)
==> TraversalMatcher(trail="(u)-[r1:LIKE|COMMENT WHERE true AND true]->(a)<-[r2:LIKE|COMMENT WHERE true AND true]-(lu)-[r3:LIKE WHERE true AND true]-(b)", _rows=70960482, _db_hits=71452891)
==> ParameterPipe(_rows=1, _db_hits=0)
Please let me know which configuration and neo4j startup parameters to use for tuning. Thanks in advance.
Mic*_*ger 14
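I ran this against your dataset on my MacBook Air, which has little RAM and CPU.
With more memory mapping, the GCR cache, and more heap for caches, you will get much faster results than I did. Also make sure to use parameters in your queries.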
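As a minimal sketch of what "use parameters" can look like from a client, the snippet below builds a request body in which the node id is passed separately from the query text. It assumes the Neo4j 1.9 REST Cypher endpoint accepts a JSON body with "query" and "params" keys; the parameter name userId and the helper build_cypher_request are hypothetical, and nothing is sent over the network.

```python
import json


def build_cypher_request(user_id):
    """Build a JSON body for a parameterized Cypher query.

    The node id travels in "params" instead of being spliced into the
    query string, so the query text is identical for every user and the
    server can reuse its parsed form.
    """
    query = (
        "START u=node({userId}) "  # {userId} is a hypothetical parameter name
        "MATCH u-[:LIKE|COMMENT]->a "
        "RETURN count(distinct a)"
    )
    return json.dumps({"query": query, "params": {"userId": user_id}})


body = build_cypher_request(467242)
print(body)
```

The point of the design is that only the "params" value changes between calls, while the query string stays constant.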
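You are running into a combinatorial explosion.
Each step of the path multiplies the rows of your matched subgraphs by the number of relationships at that step.
See for instance here: you end up with 269268 matches, but you only have 81674 distinct lu's.
The problem is that the next match is expanded for every row. So if you use distinct in between to limit the sizes again, you will handle far less data, possibly by orders of magnitude. The same goes for the next level.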
START u=node(467242)
MATCH u-[:LIKED|COMMENTED]->a
WITH distinct a
MATCH a<-[r2:LIKED|COMMENTED]-lu
RETURN count(*),count(distinct a),count(distinct lu);
+---------------------------------------------------+
| count(*) | count(distinct a) | count(distinct lu) |
+---------------------------------------------------+
| 269268 | 1952 | 81674 |
+---------------------------------------------------+
1 row
895 ms
START u=node(467242)
MATCH u-[:LIKED|COMMENTED]->a
WITH distinct a
MATCH a<-[:LIKED|COMMENTED]-lu
WITH distinct lu
MATCH lu-[:LIKED]-b
RETURN count(*),count(distinct lu), count(distinct b)
;
+---------------------------------------------------+
| count(*) | count(distinct lu) | count(distinct b) |
+---------------------------------------------------+
| 2311694 | 62705 | 91294 |
+---------------------------------------------------+
Here you have 2.3M total matches and 91k distinct elements, a ratio of roughly 25x, so more than one order of magnitude.
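To illustrate why deduplicating between hops shrinks the row count, here is a small Python sketch on a made-up toy graph. The data and function names are hypothetical and are not taken from the dataset above. It compares expanding every path with collapsing to distinct nodes after each hop, which is what the WITH distinct steps do.

```python
from collections import defaultdict

# Toy data (hypothetical): which users like which items.
likes = {
    "u": ["a1", "a2", "a3"],
    "x": ["a1", "a2", "a3", "b1"],
    "y": ["a1", "a2", "a3", "b1", "b2"],
}

# Reverse index: item -> users who like it.
liked_by = defaultdict(list)
for user, items in likes.items():
    for item in items:
        liked_by[item].append(user)


def expand_all_paths(start):
    """One row per path start -> a <- lu -> b, like a single long MATCH."""
    rows = []
    for a in likes[start]:
        for lu in liked_by[a]:
            if lu == start:
                continue  # would reuse the same LIKE relationship
            for b in likes[lu]:
                if b == a:
                    continue  # would walk back along the same relationship
                rows.append((a, lu, b))
    return rows


def expand_with_distinct(start):
    """Collapse to distinct nodes after each hop before expanding again."""
    items = set(likes[start])  # WITH distinct a
    others = {lu for a in items for lu in liked_by[a] if lu != start}  # WITH distinct lu
    return [(lu, b) for lu in others for b in likes[lu]]


all_rows = expand_all_paths("u")
dedup_rows = expand_with_distinct("u")
print(len(all_rows), len(dedup_rows))
print({b for _, _, b in all_rows} == {b for _, b in dedup_rows})
```

On this toy graph the full expansion produces 21 rows and the deduplicated one 9, while both reach the same set of distinct b items. The gap widens quickly as users share more items.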
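This is a huge aggregation, which makes it a BI/statistics query rather than an OLTP query. Usually you can store the result, e.g. on the user node, and only re-run the query in the background.
Queries like your second one are again graph-global queries (statistics/BI), in this case the top 10 users.
Usually you would run these in the background (e.g. once per day or per hour) and connect the top 10 user nodes to a special node or index, which can then be queried in a few milliseconds.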
START a=node:nodes(kind="user") RETURN count(*);
+----------+
| count(*) |
+----------+
| 3889031 |
+----------+
1 row
27329 ms
After all, you are running a match across the whole graph, i.e. about 4M users. That is a graph-global query, not a graph-local one.
START n=node:nodes(kind="top-user")
MATCH n-[r?:TOP_USER]-()
DELETE r
WITH distinct n
START a=node:nodes(kind="user")
MATCH a-[:CREATED|LIKED|COMMENTED|FOLLOWS]-()
WITH n, a,count(*) as cnt
ORDER BY cnt DESC
LIMIT 10
CREATE a-[:TOP_USER {count:cnt} ]->n;
+-------------------+
| No data returned. |
+-------------------+
Relationships created: 10
Properties set: 10
Relationships deleted: 10
70316 ms
Then the query would be:
START n=node:nodes(kind="top-user")
MATCH n-[r:TOP_USER]-a
RETURN a, r.count
ORDER BY r.count DESC;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| a | r.count |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
….
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows
4 ms
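OK, first of all, that is a very large graph for only 8GB of memory. You should seriously consider getting a bigger box. Neo4j actually provides a very nice hardware calculator that lets you determine what suits your needs: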
http://neotechnology.com/calculatorv2/
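As a rough estimate (there are more relevant metrics for determining size), their calculator suggests you should be dedicating at least 10GB.
Secondly, Neo4j, like any graph database, has trouble with nodes that have a very large number of connections. If you want to tune your instance to perform better (after getting a bigger box), I would suggest looking for any massive nodes with a large number of connections, as these will seriously impact performance.
After seeing your examples, I am fairly sure you have a graph in which some nodes have far more connections than the others. This will inherently slow down your performance. You might also try narrower queries. Especially when you are already working on a server that is too small, you do not want to run that kind of extremely heavy query with a huge result.
There are a few things in your queries that could be cleaned up, but I really urge you to get a properly sized box for your graph and to actually inspect how many connections your most-connected nodes have.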
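One way to do that inspection, sketched here in Python under the assumption that you can export relationships as (source, target) pairs, is to count the degree of every node and look at the largest values. The edge list and the function name top_degree_nodes are hypothetical.

```python
from collections import Counter


def top_degree_nodes(edges, n=3):
    """Return the n nodes with the most relationships as (node, degree) pairs."""
    degree = Counter()
    for source, target in edges:
        # Count both ends, since traversals can enter a node from either side.
        degree[source] += 1
        degree[target] += 1
    return degree.most_common(n)


# Hypothetical edge list; in practice this would come from an export of the graph.
edges = [
    ("user1", "item1"), ("user2", "item1"), ("user3", "item1"),
    ("user1", "item2"), ("user4", "item1"),
]
print(top_degree_nodes(edges, n=2))
```

Nodes whose degree is far above the rest are the ones most likely to dominate traversal time.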
It also looks like you have an artificial cap on your Java heap size. If you try starting Java with a command like the following:
java -Xmx8g //Other stuff
You will allocate 8 gigabytes instead of the standard ~500 megabytes, which would also help.