How can I optimize a SPARQL query that returns optional properties?

RMo*_*sey 3 semantic-web sparql marklogic marklogic-8

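How can I optimize a SPARQL query like the one below?

The purpose of this query is to:

  1. Select the resources (here, Country resources where countryCode = "US").
  2. Fetch the optional properties defined on those resources.

Unfortunately, the OPTIONAL blocks are being evaluated before the parent block, which causes the query engine to load all the data for all countries.

What I want is something like LEFT OUTER JOIN behavior, but the query engine is not treating it that way.

What can I do to improve the query's performance?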

SELECT  *
WHERE
  { 
    ?type (rdfs:subClassOf)* gj:Country .
    ?this_0  rdf:type        ?type ;
             gn:countryCode  "US"
    # each of these blocks is executed as a standalone query in the engine
    OPTIONAL
      { ?this_0  gn:countryCode  ?countryCode_1}
    OPTIONAL
      { ?this_0  gn:name  ?name_2}
    OPTIONAL
      { ?this_0 gj:cscId  ?cscId_3} 
  }
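
The LEFT OUTER JOIN behavior described above can be pictured with a small Python sketch. It only illustrates the intended evaluation order (restrict by the required patterns first, then look up optional properties for the surviving subjects). The in-memory triples and the `left_outer_join` helper are hypothetical, and nothing here reflects how MarkLogic actually executes the query.

```python
# Toy triples (subject, predicate, object); the data is made up for illustration.
TRIPLES = [
    ("ex:us", "rdf:type", "gj:Country"),
    ("ex:us", "gn:countryCode", "US"),
    ("ex:us", "gn:name", "United States"),
    ("ex:fr", "rdf:type", "gj:Country"),
    ("ex:fr", "gn:countryCode", "FR"),
    ("ex:fr", "gj:cscId", "csc-42"),
]


def objects(subject, predicate):
    """Return all objects stored for a given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def left_outer_join(required, optional):
    """Restrict subjects by the required patterns first, then look up each
    optional predicate only for the subjects that survived."""
    subjects = {s for s, _, _ in TRIPLES}
    matches = sorted(
        s for s in subjects
        if all(obj in objects(s, pred) for pred, obj in required)
    )
    rows = []
    for s in matches:
        row = {"this_0": s}
        for pred, var in optional:
            found = objects(s, pred)
            # Simplification: keep only the first match. An unmatched
            # OPTIONAL leaves the variable unbound (None here).
            row[var] = found[0] if found else None
        rows.append(row)
    return rows


rows = left_outer_join(
    required=[("rdf:type", "gj:Country"), ("gn:countryCode", "US")],
    optional=[("gn:name", "name_2"), ("gj:cscId", "cscId_3")],
)
print(rows)
```

The point of the sketch is that the optional lookups touch only the subjects that passed the restriction, rather than every subject carrying the optional predicate.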

I'm using the SPARQL REST endpoint in MarkLogic 8.4.

Update:

I tried running the query with the optimize=2 option, but it didn't give me a noticeable performance improvement:

/v1/graphs/sparql?optimize=2
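
As a minimal sketch of how such a request URL might be assembled, the Python below appends `optimize=2` to the endpoint path shown above. The host, port, and sample query are hypothetical, and the `query` parameter name is an assumption about how the endpoint accepts the query text on a GET request; check it against the REST API documentation for your MarkLogic version.

```python
from urllib.parse import urlencode

# Hypothetical host and port; substitute your own REST instance.
BASE_URL = "http://localhost:8000/v1/graphs/sparql"

# A deliberately small query with a full IRI so that no PREFIX lines are needed.
QUERY = (
    'SELECT * WHERE { ?this_0 '
    '<http://www.geonames.org/ontology#countryCode> "US" } LIMIT 10'
)


def build_sparql_url(query, optimize=None):
    """Build the GET URL, adding the optimize option only when requested."""
    params = {"query": query}  # assumed parameter name for the query text
    if optimize is not None:
        params["optimize"] = str(optimize)
    return BASE_URL + "?" + urlencode(params)


url = build_sparql_url(QUERY, optimize=2)
print(url)
```

Building the URL with `urlencode` keeps the query text properly escaped, which matters once the query contains quotes, angle brackets, or `#` characters.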

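Related: How do I specify options in MarkLogic's SPARQL REST endpoint?

Update 2:

Even when I make one of the optional properties required, the query still runs slowly: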

WHERE
  {
        ?type (rdfs:subClassOf)* gj:Country .
        ?this_0  rdf:type        ?type ;
             gn:countryCode  "US"; gj:cscId ?cscId_3 ;
  }

Do I need to do anything special to index the gj:cscId property?

Update 3:

Below is the profile information from Query Console.

[Image: query profile]

Update 4:

Below is the diagnostic trace information:

2017-04-27 13:30:17.238 Info: [Event:id=SPARQL Value Frequencies] sessionKey=13846462700334370907 namedGraphs=0 values=
2017-04-27 13:30:17.238 Info: <triple-value-statistics count="154569757" unique-subjects="25445373" unique-predicates="104" unique-objects="67520361" xmlns="cts:triple-value-statistics">
2017-04-27 13:30:17.238 Info:   <triple-value-entries>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="181">
2017-04-27 13:30:17.238 Info:       <triple-value>http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="2" unique-predicates="2" unique-objects="2"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="179" unique-subjects="179" unique-predicates="4"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="15">
2017-04-27 13:30:17.238 Info:       <triple-value>http://www.w3.org/2000/01/rdf-schema#subClassOf</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="15" unique-subjects="15" unique-objects="5"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="8739716">
2017-04-27 13:30:17.238 Info:       <triple-value>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="8359510" unique-subjects="8341619" unique-objects="14"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="8697064">
2017-04-27 13:30:17.238 Info:       <triple-value>http://www.geonames.org/ontology#countryCode</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="2" unique-predicates="2" unique-objects="2"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="8323137" unique-subjects="8323137" unique-objects="517"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="1" unique-subjects="1" unique-predicates="1"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="2119305">
2017-04-27 13:30:17.238 Info:       <triple-value datatype="http://www.w3.org/2001/XMLSchema#string">US</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="2061783" unique-subjects="2061783" unique-predicates="3"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:     <triple-value-entry count="13946907">
2017-04-27 13:30:17.238 Info:       <triple-value>http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId</triple-value>
2017-04-27 13:30:17.238 Info:       <subject-statistics count="3" unique-predicates="3" unique-objects="3"/>
2017-04-27 13:30:17.238 Info:       <predicate-statistics count="11739004" unique-subjects="11739004" unique-objects="11739004"/>
2017-04-27 13:30:17.238 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info:     </triple-value-entry>
2017-04-27 13:30:17.238 Info:   </triple-value-entries>
2017-04-27 13:30:17.238 Info: </triple-value-statistics>
2017-04-27 13:30:17.239 Info: [Event:id=SPARQL AST] sessionKey=13846462700334370907
2017-04-27 13:30:17.239 Info:   initialPlan=SPARQLModule[
2017-04-27 13:30:17.239 Info:   Prolog[]
2017-04-27 13:30:17.239 Info:   SPARQLSelect[SPARQLProject[order()
2017-04-27 13:30:17.239 Info:       GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info:       GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info:       GraphNode[Var cscId_3 2]
2017-04-27 13:30:17.239 Info:       SPARQLLeftNestedLoopJoin[order() hash(1==1) scatter(1 = 1)
2017-04-27 13:30:17.239 Info:         SPARQLNestedLoopJoin[order() hash(1==1) scatter(1 = 1)
2017-04-27 13:30:17.239 Info:           SPARQLScatterJoin[order(0,1) hash(0==0) scatter(0 = 0)
2017-04-27 13:30:17.239 Info:             SPARQLZeroOrOne[
2017-04-27 13:30:17.239 Info:               GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info:               GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.239 Info:               SPARQLScatterOneOrMore[
2017-04-27 13:30:17.239 Info:                 GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info:                 GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.239 Info:                 GraphNode[Var ANON7634081659815295853 1]
2017-04-27 13:30:17.239 Info:                 GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.239 Info:                 TriplePattern[order(0,1) PSO
2017-04-27 13:30:17.239 Info:                   GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.239 Info:                   GraphNode[IRI <http://www.w3.org/2000/01/rdf-schema#subClassOf>]
2017-04-27 13:30:17.239 Info:                   GraphNode[Var ANON7634081659815295853 1]]]]
2017-04-27 13:30:17.239 Info:             TriplePattern[order(0,1) OPS
2017-04-27 13:30:17.239 Info:               GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info:               GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>]
2017-04-27 13:30:17.239 Info:               GraphNode[Var type 0]]]
2017-04-27 13:30:17.239 Info:           TriplePattern[order(1) SOP
2017-04-27 13:30:17.239 Info:             GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info:             GraphNode[IRI <http://www.geonames.org/ontology#countryCode>]
2017-04-27 13:30:17.239 Info:             GraphNode[Literal "US"]]]
2017-04-27 13:30:17.239 Info:         TriplePattern[order(1,2) PSO
2017-04-27 13:30:17.239 Info:           GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info:           GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId>]
2017-04-27 13:30:17.239 Info:           GraphNode[Var cscId_3 2]]]]]]
2017-04-27 13:30:17.239 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 optimize=1 r=3 t=1.28811 os=360 is=15 mutations=30 seed=7088858925989728751
2017-04-27 13:30:17.239 Info:   initialCost=(m:5.99223e+11,r:0,io:(52.9404/167736/1.17487e+09),cpu(1):(0/1.77017e+08/1.18652e+12),mem:8185,c:1.03266e+07,crd:[14,2.06178e+06,1.03266e+07])
2017-04-27 13:30:17.320 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=0
2017-04-27 13:30:17.320 Info:   cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.320 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=1
2017-04-27 13:30:17.320 Info:   cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=2
2017-04-27 13:30:17.326 Info:   cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907
2017-04-27 13:30:17.326 Info:   bestCost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL AST] sessionKey=13846462700334370907
2017-04-27 13:30:17.326 Info:   plan=SPARQLModule[
2017-04-27 13:30:17.326 Info:   Prolog[]
2017-04-27 13:30:17.326 Info:   SPARQLSelect[SPARQLProject[order(1,0)
2017-04-27 13:30:17.326 Info:       GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info:       GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info:       GraphNode[Var cscId_3 2]
2017-04-27 13:30:17.326 Info:       SPARQLRightMergeJoin[order(1,0) hash(1==1) scatter()
2017-04-27 13:30:17.326 Info:         TriplePattern[order(1,2) PSO
2017-04-27 13:30:17.326 Info:           GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info:           GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId>]
2017-04-27 13:30:17.326 Info:           GraphNode[Var cscId_3 2]]
2017-04-27 13:30:17.326 Info:         SPARQLHashJoin[order(1,0) hash(0==0) scatter()
2017-04-27 13:30:17.326 Info:           SPARQLZeroOrOne[
2017-04-27 13:30:17.326 Info:             GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info:             GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.326 Info:             SPARQLBloomOneOrMore[
2017-04-27 13:30:17.326 Info:               GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.326 Info:               GraphNode[Var ANON7634081659815295853 1]
2017-04-27 13:30:17.326 Info:               GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.326 Info:               GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info:               TriplePattern[order(0,1) PSO
2017-04-27 13:30:17.326 Info:                 GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.326 Info:                 GraphNode[IRI <http://www.w3.org/2000/01/rdf-schema#subClassOf>]
2017-04-27 13:30:17.326 Info:                 GraphNode[Var ANON7634081659815295853 1]]]]
2017-04-27 13:30:17.326 Info:           SPARQLMergeJoin[order(1,0) hash(1==1) scatter()
2017-04-27 13:30:17.326 Info:             TriplePattern[order(1) OPS
2017-04-27 13:30:17.326 Info:               GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info:               GraphNode[IRI <http://www.geonames.org/ontology#countryCode>]
2017-04-27 13:30:17.326 Info:               GraphNode[Literal "US"]]
2017-04-27 13:30:17.326 Info:             TriplePattern[order(1,0) PSO
2017-04-27 13:30:17.326 Info:               GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info:               GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>]
2017-04-27 13:30:17.326 Info:               GraphNode[Var type 0]]]]]]]]

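Update 5:

In some use cases, I found that the property path expression for ?type can be eliminated from the query. In one such case, performance improved by two orders of magnitude: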

WHERE
  { 
    ?this_0  rdf:type        gj:Country ;
             gn:countryCode  "US"
    # each of these blocks is executed as a standalone query in the engine
    OPTIONAL
      { ?this_0  gn:countryCode  ?countryCode_1}
    OPTIONAL
      { ?this_0  gn:name  ?name_2}
    OPTIONAL
      { ?this_0 gj:cscId  ?cscId_3} 
  }

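Because this solution changes the query's output, it doesn't work for every use case.

It seems the problem is not OPTIONAL itself, but rather that the property path expression confuses the query planner, so the properties in the OPTIONAL blocks end up being looked up independently (which performs poorly).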

Joh*_*son 5

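The query optimizer relies on statistics to determine the best order of operations. Usually there is a restrictive triple pattern that can be used to constrain the remaining operations using scatter joins.

In your case, the statistics don't offer such an obviously restrictive triple pattern. You can see from the triple-value-statistics output that the string "US" appears as an object 2061783 times, so it is not very restrictive.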
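
To put those statistics in proportion, here is a rough back-of-the-envelope calculation in Python. The counts are copied from the triple-value-statistics trace in the question; the `object_selectivity` helper is a hypothetical illustration and not the cost model the optimizer actually uses.

```python
# Counts copied from the triple-value-statistics trace in the question.
TOTAL_TRIPLES = 154_569_757
OBJECT_COUNTS = {
    '"US"': 2_061_783,   # object-statistics count for the literal "US"
    "gj:Country": 179,   # object-statistics count for the gj:Country IRI
}


def object_selectivity(count, total=TOTAL_TRIPLES):
    """Fraction of all triples that have this value in the object position."""
    return count / total


for value, count in OBJECT_COUNTS.items():
    print(f"{value}: {count} matches, selectivity {object_selectivity(count):.2e}")

# How many times more triples the "US" literal matches than gj:Country.
ratio = OBJECT_COUNTS['"US"'] / OBJECT_COUNTS["gj:Country"]
print(f'"US" matches roughly {ratio:,.0f} times as many triples as gj:Country')
```

A pattern that still matches millions of triples leaves a large intermediate result for every later join, which is why it makes a poor starting point.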

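The gj:Country IRI is restrictive (it appears 179 times in the object position), but unfortunately you need to use it on the right-hand side of a transitive closure operator. It is hard to predict how many results a transitive closure operator will return, because that depends heavily on the actual data.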
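
The data dependence of a transitive closure can be seen in a toy Python sketch. The `subclasses_zero_or_more` function and both class hierarchies are made up for illustration; this is a plain breadth-first traversal, not a description of MarkLogic's implementation.

```python
from collections import deque


def subclasses_zero_or_more(root, subclass_edges):
    """Return every class that reaches root by following rdfs:subClassOf
    zero or more times (so root itself is always included)."""
    children = {}
    for child, parent in subclass_edges:
        children.setdefault(parent, []).append(child)
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:  # guards against cycles in the hierarchy
                seen.add(child)
                queue.append(child)
    return seen


# Two made-up hierarchies with the same number of (child, parent) edges.
flat = [("ex:A", "gj:Country"), ("ex:B", "ex:Other"), ("ex:C", "ex:Other")]
deep = [("ex:A", "gj:Country"), ("ex:B", "ex:A"), ("ex:C", "ex:B")]

print(len(subclasses_zero_or_more("gj:Country", flat)))
print(len(subclasses_zero_or_more("gj:Country", deep)))
```

Both hierarchies have three subClassOf edges, yet the closure under gj:Country differs in size, so simple edge counts alone cannot tell a planner how many bindings for the class variable to expect.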

You may find that using a property path like the one below lets MarkLogic avoid the zero-or-one operator, which might give a small performance gain:

?this_0 a/rdfs:subClassOf* gj:Country .

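If you know that (for example) there is only one gj:Country with the country code "US", you can add a limit to give the optimizer a hint about how to handle the query, i.e.: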

select * {
  {
    select * {
      ?this_0 a/rdfs:subClassOf* gj:Country .
      ?this_0  gn:countryCode  'US' .
    } limit 1
  }
  OPTIONAL { ?this_0 gj:cscId  ?cscId_3 } 
}