Dee*_*ems · 5 · java · cpu-usage · apache-spark
I am running a Spark job. I have 4 cores and worker memory set to 5G. Application master is on another machine in the same network, and does not host any workers. This is my code:
private void myClass() {
    // Configuration of the Spark context
    SparkConf conf = new SparkConf()
            .setAppName("myWork")
            .setMaster("spark://myHostIp:7077")
            .set("spark.driver.allowMultipleContexts", "true");
    // Creation of the Spark context in which we will run the algorithm
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Algorithm
    for (int i = 0; i < 200; i++) {
        System.out.println("===============================================================");
        System.out.println("iteration : " + i);
        System.out.println("===============================================================");
        // Seed list: one entry per object to create
        ArrayList<Boolean> list = new ArrayList<Boolean>();
        for (int j = 0; j < 1900; j++) {
            list.add(true);
        }
        JavaRDD<myObj> ratings = sc.parallelize(list, 100)
                .map(bool -> new myObj())
                .map(obj -> this.setupObj(obj))
                .map(obj -> this.moveObj(obj))
                .cache();
        int[] stuff = ratings
                .map(obj -> obj.getStuff())
                .reduce((obj1, obj2) -> this.mergeStuff(obj1, obj2));
        this.setStuff(stuff);
        ArrayList<TabObj> tabObj = ratings
                .map(obj -> this.objToTabObjAsTab(obj))
                .reduce((obj1, obj2) -> this.mergeTabObj(obj1, obj2));
        ratings.unpersist(false);
        this.setTabObj(tabObj);
    }
    sc.close();
}
When I start it, I can see progress in the Spark UI, but it is really slow (I have to set the parallelize partition count quite high, otherwise I get a timeout issue). I thought it was a CPU bottleneck, but the JVM's CPU consumption is actually very low (0% most of the time, sometimes a bit more than 5%...).
According to the monitor, the JVM is using around 3G of memory, with only 19M of it cached.
The master host has 4 cores, and less memory (4G). That machine shows 100% CPU consumption (a full core) and I don't understand why it is that high... It just has to send partitions to the worker on the other machine, right?
Why is CPU consumption low on the worker, and high on the master?
Make sure you submit your Spark job to the cluster via YARN or Mesos; otherwise it may run only on your master node.
Since your code is quite simple, the computation should finish very quickly, but I would suggest testing CPU consumption with the word-count example against a few GB of input.
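As a concrete version of that suggestion, here is a minimal word-count sketch in local mode. The input and output paths are placeholders you would point at your own data; the class and file names are hypothetical.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCountCpuTest {
    public static void main(String[] args) {
        // local[*] uses all cores of this machine, so the CPU load
        // shows up where you are watching it
        SparkConf conf = new SparkConf()
                .setAppName("wordcount-cpu-test")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Point this at a few GB of text to produce a measurable load
        JavaRDD<String> lines = sc.textFile("/path/to/big/input.txt");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        counts.saveAsTextFile("/path/to/output");
        sc.close();
    }
}
```

If the worker cores stay busy on this job but not on yours, the bottleneck is in your job's structure rather than the cluster setup.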
Please use "local[*]". The * means it will use all of your cores for the computation.
SparkConf sparkConf = new SparkConf().set("spark.driver.host", "localhost").setAppName("unit-testing").setMaster("local[*]");
Reference: https://spark.apache.org/docs/latest/configuration.html
In Spark, many factors affect CPU and memory usage, such as the number of executors and the spark.executor.memory you choose to assign to each.
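For example, those settings can be passed on the command line when submitting to the standalone master; the class name, jar, and the numbers below are hypothetical values you would tune to your cluster:

```shell
spark-submit \
  --master spark://myHostIp:7077 \
  --class my.package.MyJob \
  --executor-memory 4g \
  --executor-cores 4 \
  --total-executor-cores 8 \
  myJob.jar
```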