Posts by gun*_*erd

Python vs Scala (for Spark jobs)

I'm new to Spark and am currently exploring it by playing with pyspark and spark-shell.

So here's the situation: I run the same Spark job with pyspark and with spark-shell.

This is from pyspark:

textfile = sc.textFile('/var/log_samples/mini_log_2')
textfile.count()

And this one is from spark-shell:

textfile = sc.textFile("file:///var/log_samples/mini_log_2")
textfile.count()

I tried both runs twice. The first one (Python) took 30-35 seconds, while the second one (Scala) took about 15 seconds. I'm curious what could cause this difference in performance. Is it the choice of language, or does spark-shell do something in the background that pyspark doesn't?
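One variable worth ruling out before blaming the language: the two snippets read different URIs — pyspark is given /var/log_samples/mini_log_2 (resolved against the default filesystem), while spark-shell gets an explicit file:// path. A small pyspark sketch for timing the same read with the same scheme (the helper is my own, not a Spark API; `sc` is the usual shell SparkContext):

```python
import time

def timed_count(sc, path):
    """Count lines in `path` and report the wall-clock time taken.
    Hypothetical helper for comparing runs across shells."""
    start = time.time()
    n = sc.textFile(path).count()
    return n, time.time() - start

# In the pyspark shell, mirroring the spark-shell run exactly:
# n, secs = timed_count(sc, "file:///var/log_samples/mini_log_2")
```

If the two shells still differ with identical URIs, the gap is more likely Python worker overhead than I/O.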

UPDATE

So I ran some tests on a larger data set, about 550 GB in total (zipped). I'm using Spark Standalone as the master.

I observed that when using pyspark, tasks are distributed evenly among the executors. When using spark-shell, however, tasks are not distributed equally: more powerful machines get more tasks and weaker machines get fewer.

With spark-shell the job finished in 25 minutes, while with pyspark it took about 55 minutes. How can I make Spark Standalone distribute tasks under pyspark the way it does under spark-shell?

[Screenshot: spark-shell]

[Screenshot: Pyspark]
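I can't tell from here why the standalone scheduler balances one shell and not the other, but a common mitigation is to split the input into more, smaller tasks, so slower workers never sit on one large chunk. A hedged sketch — the helper and the 3-tasks-per-core heuristic are my own, not a Spark API:

```python
def suggested_partitions(total_cores, slots_per_core=3):
    """Rule-of-thumb partition count: several tasks per core, so faster
    machines keep pulling new work while slower ones finish (assumption:
    smaller, more numerous tasks smooth out a heterogeneous cluster)."""
    return total_cores * slots_per_core

# Hypothetical usage from the pyspark shell:
# rdd = sc.textFile("/var/log_samples/mini_log_2",
#                   minPartitions=suggested_partitions(sc.defaultParallelism))
```

The second argument to `sc.textFile` (`minPartitions`) is a real pyspark parameter; the right multiplier depends on your cluster.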

python scala apache-spark pyspark

8 votes · 1 answer · 3078 views

How can I list all available search templates in elasticsearch?

I tried localhost:9200/_template, but that seems to list index templates.

I want to list the search templates that exist in the system. Is there a way to do that?
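In recent Elasticsearch versions, search templates are stored as mustache-language stored scripts, so they appear in the cluster state rather than under _template — e.g. via GET /_cluster/state/metadata?filter_path=metadata.stored_scripts. A sketch of filtering them out of such a response (the response shape shown is an assumption; verify it against your version):

```python
def stored_search_templates(cluster_state_json):
    """Pick the mustache-language stored scripts (i.e. search templates)
    out of a /_cluster/state/metadata response body."""
    scripts = cluster_state_json.get("metadata", {}).get("stored_scripts", {})
    return [name for name, body in scripts.items()
            if body.get("lang") == "mustache"]

# Hypothetical response shape:
sample = {"metadata": {"stored_scripts": {
    "my_search_tpl": {"lang": "mustache", "source": "{\"query\": {}}"},
    "my_painless":   {"lang": "painless", "source": "return 1"},
}}}
print(stored_search_templates(sample))  # -> ['my_search_tpl']
```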

elasticsearch

6 votes · 2 answers · 4374 views

Ambari 2.0 installation fails with "<urlopen error [Errno 111] Connection refused>"

I'm trying to set up a Hadoop cluster with Ambari 2.0, but it fails during the installation stage. Here is the failure log from one of the data nodes:

stderr:   /var/lib/ambari-agent/data/errors-416.txt

Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-ANY/scripts/hook.py", line 34, in <module>
    BeforeAnyHook().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 214, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-ANY/scripts/hook.py", line 29, in hook
    setup_jce()
  File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-ANY/scripts/shared_initialization.py", line 40, in setup_jce
    content = DownloadSource(format("{jce_location}/{jce_policy_zip}")),
  File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 148, in __init__
    self.env.run()
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 152, in run
    self.run_action(resource, action)
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 118, in run_action
    provider_action()
  File "/usr/lib/python2.6/site-packages/resource_management/core/providers/system.py", line 108, in action_create
    content = self._get_content()
  File "/usr/lib/python2.6/site-packages/resource_management/core/providers/system.py", line 150, in _get_content
    return content()
  File "/usr/lib/python2.6/site-packages/resource_management/core/source.py", …
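The traceback shows the agent failing in setup_jce while downloading the JCE policy archive from {jce_location}/{jce_policy_zip}, and Errno 111 means the TCP connection itself was refused — typically the Ambari server's resource port not listening, or a firewall in between. A small probe you could run from the failing data node (host and port are placeholders; Ambari's web port is 8080 by default):

```python
import socket

def can_connect(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds; a refused
    connection (Errno 111) or a timeout both come back as False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. from the node that logged the error:
# can_connect("ambari-server.example.com", 8080)
```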

hadoop bigdata hortonworks-data-platform ambari

5 votes · 1 answer · 3923 views

Error from python worker: /bin/python: No module named pyspark

I'm trying to set up a nice Spark development environment with ipython. First I start ipython, then:

import findspark
findspark.init()

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
conf = SparkConf()
conf.setMaster('yarn-client')
sc = SparkContext(conf=conf)

Here is the application UI; I can see the executors on the worker nodes.

[Screenshot: application UI]

However, when I try this:

rdd = sc.textFile("/LOGS/201511/*/*")
rdd.first()

I get:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, d142.dtvhadooptest.com): org.apache.spark.SparkException:
Error from python worker:
  /bin/python: No module named pyspark
PYTHONPATH was:
  /data/sdb/hadoop/yarn/local/usercache/hdfs/filecache/64/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
        at …
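The stack trace shows that the executor's PYTHONPATH contains only the Spark assembly jar, so the Python workers on the YARN nodes cannot import pyspark. One common fix for Spark 1.x on YARN is to ship pyspark.zip and the py4j zip with the job and put them on the executors' PYTHONPATH. A sketch of the conf entries (the paths and the py4j version are assumptions — check $SPARK_HOME/python/lib on your cluster):

```python
import os

def executor_pyspark_conf(spark_home, py4j_zip="py4j-0.8.2.1-src.zip"):
    """Build the conf entries that ship the pyspark sources to YARN
    executors and put them on the workers' PYTHONPATH (Spark 1.x sketch)."""
    py_lib = os.path.join(spark_home, "python", "lib")
    return {
        # Distribute both zips alongside the job...
        "spark.yarn.dist.files": ",".join(
            os.path.join(py_lib, z) for z in ("pyspark.zip", py4j_zip)),
        # ...and make the workers look inside them.
        "spark.executorEnv.PYTHONPATH": "pyspark.zip:" + py4j_zip,
    }

# Hypothetical usage before creating the SparkContext:
# for k, v in executor_pyspark_conf("/usr/hdp/current/spark-client").items():
#     conf.set(k, v)
```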

python ipython ipython-notebook apache-spark pyspark

4 votes · 1 answer · 5161 views

No match found if the regex is defined inside a function

Scala newbie here! I'm trying to define a function that takes a string as input and returns a part of that string. When I do this manually with a regex it works fine, but when I define it inside a function it doesn't seem to find a match. Can someone explain this to me?

This is my string:

val str = """1.1.1.1 - - [30/Apr/2015:13:23:20 +0200] "GET /S1/HLS_LIVE/slowturk/32/prog_index21964.ts?key=36ec178eee7ae44f1b204aec4627a120&app=com.radyolar.slowturk.iphone HTTP/1.1" 200 0 "-" "AppleCoreMedia/1.0.0.12F70 (iPhone; U; CPU OS 8_3 like Mac OS X; de_de)" "-" 0.005 ut="0.005" cs="MISS""""

And here is the function definition:

def foo(record: String): String = {
    val p_ip = "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})"
    val p_client = "(\\S+)"
    val p_user = "(\\S+)"
    val p_dateTime = "(\\[.+?\\])"
    val p_request = "\"(.+?)\""
    val p_status = "(\\d{3})"
    val p_bytes = "(\\S+)"
    val p_referer = "(\\S+)"
    val p_agent = "\\\"([^\"]+)\\\""
    val p_forward = "(\\S+)"
    val p_req_time …
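For reference, a Python sketch of the same field extraction — the group names mirror the p_* pieces above. This is an analogue for illustration, not the poster's Scala code, and it matches only the leading fields:

```python
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+'  # client IP
    r'(?P<client>\S+)\s+(?P<user>\S+)\s+'  # identd / user ("-" here)
    r'(?P<datetime>\[.+?\])\s+'            # [30/Apr/2015:13:23:20 +0200]
    r'"(?P<request>.+?)"\s+'               # "GET /path HTTP/1.1"
    r'(?P<status>\d{3})\s+'                # HTTP status code
    r'(?P<bytes>\S+)'                      # response size
)

def parse_log_line(line):
    """Return a dict of the leading access-log fields, or None on no match."""
    m = LOG_PATTERN.search(line)
    return m.groupdict() if m else None

line = ('1.1.1.1 - - [30/Apr/2015:13:23:20 +0200] '
        '"GET /S1/HLS_LIVE/slowturk/32/prog_index21964.ts HTTP/1.1" '
        '200 0 "-" "AppleCoreMedia/1.0.0.12F70 (iPhone; ...)" "-" 0.005')
```

Note that `search` finds the pattern anywhere in the line; an anchored full-string match behaves differently, which is worth checking when a regex works interactively but not inside a function.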

regex scala

1 vote · 1 answer · 475 views
