We are hitting a GraphX error when calling the connectedComponents function; it fails with: java.lang.ArrayIndexOutOfBoundsException: -1
I found this bug report: https://issues.apache.org/jira/browse/SPARK-5480
Has anyone else run into this problem and, if so, how did you fix or work around it? This is running on Spark 1.6.2 with Scala 2.10.
Stack trace from the shell:
17/10/13 17:05:58 ERROR TaskSetManager: Task 12 in stage 2036.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 2036.0 failed 4 times, most recent failure: Lost task 12.3 in stage 2036.0 (TID 106840, cl-bigdata5.hosting.dbg.internal): java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:120)
at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:118)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:51)
at …
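For what it's worth, one thing worth ruling out first (a hedged sketch, not a confirmed fix for SPARK-5480): edges that reference vertex ids missing from the vertex RDD. Building the graph with a default vertex attribute guarantees every edge endpoint exists; the names below are illustrative and assume a SparkContext sc.
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Sketch only: supply defaultVertexAttr so an edge endpoint missing from
// `vertices` (here 3L) still gets a vertex instead of a dangling reference.
val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq((1L, "a"), (2L, "b")))
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
val graph = Graph(vertices, edges, defaultVertexAttr = "missing")
val cc = graph.connectedComponents().vertices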
When I tried to implement an algorithm in GraphX with Scala, I found no way to activate all of the vertices in the next superstep. How can I send a message to all graph vertices? In my algorithm there are supersteps that should be executed by all vertices, whether or not they receive a message, because even receiving no message is an event that should be handled in the next iteration.
Here is the official SSSP example implemented in Pregel-style logic. You can see that only vertices that receive a message execute their program in the next iteration. In my case, however, I want the pregel function to run strictly iteratively, i.e., in every superstep each vertex executes its program and can vote to halt if needed! The reasoning in this example doesn't match the logic of the Pregel paper. Any ideas on how to implement the real Pregel logic?
val graph: Graph[Long, Double] =
GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42 // The ultimate source
// Initialize the graph such that all vertices except the root have distance infinity.
val initialGraph = graph.mapVertices((id, _) =>
if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
(id, dist, newDist) => math.min(dist, newDist), // Vertex Program
triplet => { // Send Message
if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
} …
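For reference, here is a sketch of one workaround pattern (my own construction, not the official Pregel API): have sendMsg always emit a message so that every vertex with an in-edge stays active, and record the "vote to halt" in the vertex attribute itself. doWork and shouldHalt are hypothetical placeholders for your per-superstep program; initialGraph is the Graph[Double, Double] built above.
// Hypothetical per-superstep program and convergence test (placeholders).
def doWork(value: Double, msg: Double): Double = math.min(value, msg)
def shouldHalt(value: Double): Boolean = false   // return true to vote to halt

val result = initialGraph
  .mapVertices((_, v) => (v, false))             // state = (value, votedToHalt)
  .pregel(Double.PositiveInfinity, maxIterations = 50)(
    (id, state, msg) => { val v = doWork(state._1, msg); (v, shouldHalt(v)) },
    t => if (t.srcAttr._2) Iterator.empty        // halted vertices go quiet
         else Iterator((t.dstId, t.srcAttr._1 + t.attr)),  // keep dst active
    (a, b) => math.min(a, b))
// Caveat: a vertex with no in-edges only runs in superstep 0; emulating full
// Pregel semantics for such vertices would need self-loops or a graph rebuild.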
I need some help confirming my choices... and to see whether you can give me some information. My storage database is TitanDB on Cassandra, and I have a very large graph. My goal is to run MLlib on that graph afterwards.
My first idea was to use Titan together with GraphX, but I haven't found anything usable, or it's still under development... TinkerPop isn't ready yet. So I'm looking at Giraph. Titan can communicate with TinkerPop's Rexster.
My question is: what would be the benefit of using Giraph? Gremlin seems to be distributed as well.
Thank you very much for any explanation. I don't think I really understand the difference between Gremlin and Giraph (or GraphX).
Have a nice day.
I have an out-of-memory problem with Apache Spark (GraphX). The application runs, but shuts down after a while. I'm using Spark 1.2.0. The cluster has enough memory and plenty of cores. Other applications that don't use GraphX run without problems. The application uses Pregel.
I submit the application in Hadoop YARN mode:
HADOOP_CONF_DIR=/etc/hadoop/conf spark-submit --class DPFile --deploy-mode cluster --master yarn --num-executors 4 --driver-memory 10g --executor-memory 6g --executor-cores 8 --files log4j.properties spark_routing_2.10-1.0.jar road_cr_big2 1000
Spark configuration:
val conf = new SparkConf(true)
.set("spark.eventLog.overwrite", "true")
.set("spark.driver.extraJavaOptions", "-Dlog4j.configuration=log4j.properties")
.set("spark.yarn.applicationMaster.waitTries", "60")
.set("yarn.log-aggregation-enable","true")
.set("spark.akka.frameSize", "500")
.set("spark.akka.askTimeout", "600")
.set("spark.core.connection.ack.wait.timeout", "600")
.set("spark.akka.timeout","1000")
.set("spark.akka.heartbeat.pauses","60000")
.set("spark.akka.failure-detector.threshold","3000.0")
.set("spark.akka.heartbeat.interval","10000")
.set("spark.ui.retainedStages","100")
.set("spark.ui.retainedJobs","100")
.set("spark.driver.maxResultSize","4G")
Thank you for your answers.
Log:
ERROR Utils: Uncaught exception in thread SparkListenerBus
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuilder.append(StringBuilder.java:132)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
at …
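Not a guaranteed diagnosis, but note where the OOM is thrown: in the driver's SparkListenerBus thread while growing a StringBuilder, which is consistent with event serialization on the driver rather than executor pressure. Since spark.eventLog.overwrite is set above, event logging appears to be enabled. A hedged sketch of settings to try (standard Spark options, illustrative values):
import org.apache.spark.SparkConf

val conf = new SparkConf(true)
  .set("spark.eventLog.enabled", "false")   // stop building large event-log strings on the driver
  .set("spark.ui.retainedStages", "50")     // retain less history on the driver
  .set("spark.ui.retainedJobs", "50")
// Alternatively, raise the driver heap via spark-submit --driver-memory.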
Hi, I'm new to the graph world. I've been assigned graph-processing work, and since I know Apache Spark I thought of using it to process a large graph. Then I came across Gephi, which provides a nice GUI for manipulating graphs. Does GraphX have such a tool, or is it mainly a parallel graph-processing library? Can I import JSON graph data from Gephi into GraphX? Please guide me. I know this is a basic but valid question. Thanks in advance.
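As far as I know, GraphX is primarily a parallel graph-processing library with no bundled GUI. One hedged sketch of bridging to Gephi (my own approach, not an official integration): Gephi can import plain CSV edge lists through its spreadsheet importer, so you can export a GraphX graph's edges as CSV. The output path and generated graph below are illustrative, and sc is an assumed SparkContext.
import org.apache.spark.graphx.util.GraphGenerators

// Sketch: dump edges as "source,target,weight" lines for Gephi's CSV importer.
val graph = GraphGenerators.logNormalGraph(sc, numVertices = 100)
graph.edges
  .map(e => s"${e.srcId},${e.dstId},${e.attr}")
  .saveAsTextFile("hdfs:///tmp/graph-for-gephi")   // illustrative output path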
Suppose we are given the following input in Apache Spark GraphX:
Vertex RDD:
val vertexArray = Array(
(1L, "Alice"),
(2L, "Bob"),
(3L, "Charlie"),
(4L, "David"),
(5L, "Ed"),
(6L, "Fran")
)
Edge RDD:
val edgeArray = Array(
Edge(1L, 2L, 1),
Edge(2L, 3L, 1),
Edge(3L, 4L, 1),
Edge(5L, 6L, 1)
)
I need all connected components together with the nodes they contain, in Apache Spark GraphX:
1,[1,2,3,4]
5,[5,6]
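As a sketch of one way to produce exactly that output: connectedComponents labels each vertex with the smallest VertexId in its component, so grouping by that label yields the members (assumes a SparkContext sc):
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val vertexRDD: RDD[(VertexId, String)] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph = Graph(vertexRDD, edgeRDD)

// (vertexId, componentId) -> (componentId, all member ids)
val components = graph.connectedComponents().vertices
  .map { case (id, cc) => (cc, id) }
  .groupByKey()

components.collect().foreach { case (cc, ids) =>
  println(s"$cc,[${ids.toSeq.sorted.mkString(",")}]")
}
// 1,[1,2,3,4]
// 5,[5,6]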
I'm using spark-shell to run my code. In the code I define a function and then call it with its arguments.
The problem is that I get the following error when the function is called.
error: type mismatch;
found : org.apache.spark.graphx.Graph[VertexProperty(in class $iwC)(in class $iwC)(in class $iwC)(in class $iwC),String]
required: org.apache.spark.graphx.Graph[VertexProperty(in class $iwC)(in class $iwC)(in class $iwC)(in class $iwC),String]
What is the cause of this error? Does it have anything to do with the Graph data type in Spark?
Code: here is the part of my code involving the definition and the call of the function countpermissions.
class VertexProperty(val id:Long) extends Serializable
case class User(val userId:Long, val userCode:String, val Name:String, val Surname:String) extends VertexProperty(userId)
case class Entitlement(val entitlementId:Long, val name:String) extends VertexProperty(entitlementId)
def countpermissions(es:String, sg:Graph[VertexProperty,String]):Long = {
return 0
}
val triplets = graph.triplets
val temp = triplets.map(t => t.attr)
val distinct_edge_string = temp.distinct
var bcast_graph = sc.broadcast(graph) …
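For what it's worth, the two types in that error differ only in their $iwC wrappers: spark-shell compiles each input line into its own wrapper class, so a class like VertexProperty that is defined on one line and later redefined or referenced across lines can end up as two incompatible compiled copies. A common workaround is to define the whole hierarchy and everything that uses it in a single :paste block, or to compile it into a JAR and pass it with --jars. A sketch of the :paste route:
// In spark-shell: type :paste, paste all of this at once, then press Ctrl+D.
// One REPL compilation unit means one compiled copy of VertexProperty.
import org.apache.spark.graphx._

class VertexProperty(val id: Long) extends Serializable
case class User(userId: Long, userCode: String, Name: String, Surname: String)
  extends VertexProperty(userId)
case class Entitlement(entitlementId: Long, name: String)
  extends VertexProperty(entitlementId)

def countpermissions(es: String, sg: Graph[VertexProperty, String]): Long = 0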
Spark version 1.6.1.
Creating the Edge and Vertex RDDs:
val vertices_raw = sqlContext.read.json("vertices.json.gz")
val vertices = vertices_raw.rdd.map(row=> ((row.getAs[String]("toid").stripPrefix("osgb").toLong),row.getAs[String]("index")))
val verticesRDD: RDD[(VertexId, String)] = vertices
val edges_raw = sqlContext.read.json("edges.json.gz")
val edgesRDD = edges_raw.rdd.map(row=>(Edge(row.getAs[String]("positiveNode").stripPrefix("osgb").toLong, row.getAs[String]("negativeNode").stripPrefix("osgb").toLong, row.getAs[Double]("length"))))
I have an edgesRDD that I can inspect:
[IN] edgesRDD
res10: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Double]] = MapPartitionsRDD[19] at map at <console>:38
[IN] edgesRDD.foreach(println)
Edge(5000005125036254,5000005125036231,42.26548472559799)
Edge(5000005125651333,5000005125651330,29.557979625165135)
Edge(5000005125651329,5000005125651330,81.9310872300414)
I have a verticesRDD:
[IN] verticesRDD
res12: org.apache.spark.rdd.RDD[(Long, String)] = MapPartitionsRDD[9] at map at <console>:38
[IN] verticesRDD.foreach(println)
(5000005125651331,343722)
(5000005125651332,343723)
(5000005125651333,343724)
I combine these to create a graph:
[IN] val graph: Graph[(String),Double] = Graph(verticesRDD, edgesRDD)
graph: org.apache.spark.graphx.Graph[String,Double] = org.apache.spark.graphx.impl.GraphImpl@303bbd02
I can inspect the graph object's …
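For completeness, an illustrative sketch of a few standard ways to inspect the assembled graph, using the same names as above (numVertices and numEdges come from GraphOps):
graph.vertices.take(3).foreach(println)   // (VertexId, index) pairs
graph.edges.take(3).foreach(println)      // Edge(src, dst, length)
println(s"${graph.numVertices} vertices, ${graph.numEdges} edges")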
I created a directed graph using GraphX:
#src->dest
a -> b 34
a -> c 23
b -> e 10
c -> d 12
d -> c 12
c -> d 11
I want to get all two-hop neighbors, like this:
a -> e 44
a -> d 34
My graph is very large, so I want to do this elegantly and efficiently. Does anyone have suggestions on the best approach with a Graph instance?
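One sketch of an approach (mine, on plain RDD tuples rather than the Graph API; for a really large graph you may want aggregateMessages or a partition-aware join instead): self-join the edge list on the middle vertex, sum the two hop weights, and keep the minimum per (src, dst) pair, which reproduces the expected output above. Assumes a SparkContext sc.
// Edge list from the example above, as (src, dst, weight) tuples.
val edges = sc.parallelize(Seq(
  ("a", "b", 34), ("a", "c", 23), ("b", "e", 10),
  ("c", "d", 12), ("d", "c", 12), ("c", "d", 11)))

val byDst = edges.map { case (s, d, w) => (d, (s, w)) }   // first hop, keyed by the middle vertex
val bySrc = edges.map { case (s, d, w) => (s, (d, w)) }   // second hop, keyed by the middle vertex

val twoHop = byDst.join(bySrc)
  .map { case (_, ((src, w1), (dst, w2))) => ((src, dst), w1 + w2) }
  .filter { case ((src, dst), _) => src != dst }          // drop a->b->a round trips
  .reduceByKey((a, b) => math.min(a, b))                  // keep the shortest 2-hop path

twoHop.collect().foreach { case ((s, d), w) => println(s"$s -> $d $w") }
// a -> e 44
// a -> d 34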
I'm trying to use connected components, but I'm having trouble scaling it. Here is what I have -
// get vertices
val vertices = stage_2.flatMap(x => GraphUtil.getVertices(x)).cache
// get edges
val edges = stage_2.map(x => GraphUtil.getEdges(x)).filter(_ != null).flatMap(x => x).cache
// create graph
val identityGraph = Graph(vertices, edges)
// get connected components
val cc = identityGraph.connectedComponents.vertices
Where GraphUtil has helper functions to return vertices and edges. At this point, my graph has about 1 million nodes and 2 million edges (incidentally, this is expected to grow to 100 million nodes). My graph is quite sparse, so I expect plenty of small subgraphs.
When I run the above, I keep getting java.lang.OutOfMemoryError: Java heap space. I have already tried executor-memory 32g and a 15-node cluster with a YARN container size of 45g.
Here are the exception details:
16/10/26 10:32:26 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at java.lang.StringBuilder.toString(StringBuilder.java:405)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:360)
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:98)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2216)
at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:32)
at …
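Two hedged observations, not a guaranteed fix: the trace shows the driver's SparkListenerBus thread dying while json4s serializes an event, which points at event logging on the driver (so executor-memory won't help there); and connectedComponents is iterative, so the RDD lineage grows with every step. A sketch of both mitigations, with illustrative values and names reused from the snippet above:
// Driver side: disable the event log, or raise spark-submit --driver-memory.
// (spark.eventLog.enabled is a standard Spark setting.)
// --conf spark.eventLog.enabled=false

// Lineage side: checkpoint before running the iterative algorithm.
sc.setCheckpointDir("hdfs:///tmp/cc-checkpoints")   // illustrative directory
val identityGraph = Graph(vertices, edges)
identityGraph.checkpoint()                          // Graph.checkpoint() truncates RDD lineage
val cc = identityGraph.connectedComponents.vertices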