Posts by Wan*_*rer

What is the purpose of caching an RDD in Apache Spark?

I am new to Apache Spark and have a few basic questions that I could not resolve while reading the Spark material; every source explains things in its own style. I am practicing with PySpark in a Jupyter notebook on Ubuntu.

As I understand it, when I run the command below, the data in testfile.csv is partitioned and stored in the memory of the respective nodes (I know evaluation is lazy and nothing is processed until an action is seen), but still:

rdd1 = sc.textFile("testfile.csv")

My question is: when I run the transformation and action commands below, where is the rdd2 data stored?

1. Is it stored in memory?

rdd2 = rdd1.map( lambda x: x.split(",") )

rdd2.count()

I know the data in rdd2 remains available until I close the Jupyter notebook. Then why is cache() needed, since rdd2 is available for all transformations anyway? I heard that the in-memory data is cleared after all the transformations complete; what is going on there?

  2. What is the difference between keeping an RDD in memory and calling cache()?

    rdd2.cache()
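
A minimal sketch of the difference, assuming the same sc and testfile.csv as above: without cache(), every action replays the whole lineage from the file; after cache(), the first action materializes the partitions in executor memory and later actions reuse them. cache() is just persist() with the default MEMORY_ONLY storage level; without it, Spark is free to discard the computed partitions once a job finishes, which is why the data seems to be "cleared after all the transformations".

# Without cache(): each action replays the full lineage
# (read testfile.csv, split every line) from scratch.
rdd2 = sc.textFile("testfile.csv").map(lambda x: x.split(","))
rdd2.count()   # reads and parses the file
rdd2.count()   # reads and parses the file again

# With cache(): mark rdd2 for in-memory persistence.
rdd2.cache()
rdd2.count()   # computes once and stores the partitions in memory
rdd2.count()   # served from the cached partitions, no recomputation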

caching apache-spark rdd pyspark

5 votes · 1 answer · 3109 views

What is the difference between Apache Spark and Apache Arrow?

What is the difference between Apache Arrow and Apache Spark? Will Apache Arrow replace Hadoop?
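
In short: Spark is a distributed execution engine, while Arrow is a language-independent columnar memory format that engines exchange data through, so neither replaces the other (and Arrow does not replace Hadoop's storage layer either). A sketch of where the two meet in PySpark follows; the config key name is version-dependent (spark.sql.execution.arrow.enabled in Spark 2.3+, spark.sql.execution.arrow.pyspark.enabled in 3.x), so treat the exact name as an assumption about your version.

# Arrow is a memory format, not an engine; Spark can use it to
# move data to pandas in columnar batches instead of pickling rows.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.range(1000)
pdf = df.toPandas()   # transferred via Arrow record batches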

hadoop bigdata apache-spark apache-arrow

4 votes · 1 answer · 1714 views

How should I integrate a Jupyter notebook with pyspark on Ubuntu 12.04?

I am new to Pyspark. I installed "bash Anaconda2-4.0.0-Linux-x86_64.sh" on Ubuntu and also installed pyspark. Everything works fine in the terminal, but I want to work in Jupyter. I created a profile in my Ubuntu terminal as follows:

wanderer@wanderer-VirtualBox:~$ ipython profile create pyspark
[ProfileCreate] Generating default config file: u'/home/wanderer/.ipython/profile_pyspark/ipython_config.py'
[ProfileCreate] Generating default config file: u'/home/wanderer/.ipython/profile_pyspark/ipython_kernel_config.py'

wanderer@wanderer-VirtualBox:~$ export ANACONDA_ROOT=~/anaconda2
wanderer@wanderer-VirtualBox:~$ export PYSPARK_DRIVER_PYTHON=$ANACONDA_ROOT/bin/ipython
wanderer@wanderer-VirtualBox:~$ export PYSPARK_PYTHON=$ANACONDA_ROOT/bin/python

wanderer@wanderer-VirtualBox:~$ cd spark-1.5.2-bin-hadoop2.6/
wanderer@wanderer-VirtualBox:~/spark-1.5.2-bin-hadoop2.6$ PYTHON_OPTS="notebook" ./bin/pyspark
Python 2.7.11 |Anaconda 4.0.0 (64-bit)| (default, Dec  6 2015, 18:08:32) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details …
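One thing stands out in the transcript: PYTHON_OPTS is not a variable the pyspark launcher reads, which is why a plain shell opened instead of a notebook. On Spark 1.x the launcher honors PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS (older releases also accepted IPYTHON_OPTS), so a launch along these lines should bring up the notebook; the paths below are a sketch reusing the exports already made above.

wanderer@wanderer-VirtualBox:~/spark-1.5.2-bin-hadoop2.6$ export PYSPARK_DRIVER_PYTHON=$ANACONDA_ROOT/bin/jupyter
wanderer@wanderer-VirtualBox:~/spark-1.5.2-bin-hadoop2.6$ export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
wanderer@wanderer-VirtualBox:~/spark-1.5.2-bin-hadoop2.6$ ./bin/pyspark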

ipython apache-spark pyspark jupyter jupyter-notebook

3 votes · 2 answers · 10k views

Error creating a Hive table to load Twitter data

I am trying to create an external table and load Twitter data into it. While creating the table I get the following error, and I am unable to track down its cause.

hive> ADD JAR /usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar
    > ;
Added [/usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar] to class path
Added resources: [/usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar]
hive> CREATE EXTERNAL TABLE tweets (
    >    id BIGINT,
    >    created_at STRING,
    >    source STRING,
    >    favorited BOOLEAN,
    >    retweeted_status STRUCT<
    >      text:STRING,
    >      user:STRUCT<screen_name:STRING,name:STRING>,
    >      retweet_count:INT>,
    >    entities STRUCT<
    >      urls:ARRAY<STRUCT<expanded_url:STRING>>,
    >      user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    >      hashtags:ARRAY<STRUCT<text:STRING>>>,
    >    text STRING,
    >    user STRUCT<
    >      screen_name:STRING,
    >      name:STRING,
    >      friends_count:INT,
    >      followers_count:INT,
    >      statuses_count:INT,
    >      verified:BOOLEAN,
    >      utc_offset:INT,
    >      time_zone:STRING>,
    >    in_reply_to_screen_name STRING
    >  )
    >  PARTITIONED …
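The statement above is cut off at PARTITIONED. For reference, this DDL matches Cloudera's cdh-twitter-example (the same hive-serdes jar is added in the session); in that tutorial the statement ends roughly as below, where the partition column, SerDe class, and HDFS location are the tutorial's values, not something recovered from this session, so adjust them to your setup. Also worth checking: user is a reserved keyword in newer Hive versions and is a frequent cause of parse errors in this exact table; quoting it with backticks (`user`) is the usual workaround.

PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';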

twitter hadoop hive bigdata flume

1 vote · 1 answer · 5990 views

PySpark SparkContext NameError: 'sc' in Jupyter

I am new to pyspark and want to use pyspark with an IPython notebook on my Ubuntu 12.04 machine. Below is the configuration for pyspark and the IPython notebook.

sparkuser@Ideapad:~$ echo $JAVA_HOME
/usr/lib/jvm/java-8-oracle

# Path for Spark
sparkuser@Ideapad:~$ ls /home/sparkuser/spark/
bin    CHANGES.txt  data  examples  LICENSE   NOTICE  R          RELEASE  scala-2.11.6.deb
build  conf         ec2   lib       licenses  python  README.md  sbin     spark-1.5.2-bin-hadoop2.6.tgz

I installed Anaconda2 4.0.0, and the anaconda directory contains:

sparkuser@Ideapad:~$ ls anaconda2/
bin  conda-meta  envs  etc  Examples  imports  include  lib  LICENSE.txt  mkspecs  pkgs  plugins  share  ssl  tests

Create a PySpark profile for IPython:

ipython profile create pyspark

sparkuser@Ideapad:~$ cat .bashrc

export SPARK_HOME="$HOME/spark"
export PYSPARK_SUBMIT_ARGS="--master local[2]"
# added by Anaconda2 4.0.0 installer
export PATH="/home/sparkuser/anaconda2/bin:$PATH"

Create a file named ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py:

sparkuser@Ideapad:~$ cat .ipython/profile_pyspark/startup/00-pyspark-setup.py 
import os
import …
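The startup file is truncated above. For reference, a typical 00-pyspark-setup.py for a Spark 1.5.2 layout looks like the sketch below; the py4j zip name must match whatever actually sits in $SPARK_HOME/python/lib on your machine, so verify the version before copying it.

import os
import sys

# Point Python at the PySpark sources shipped with Spark.
spark_home = os.environ.get('SPARK_HOME')
sys.path.insert(0, os.path.join(spark_home, 'python'))
# The py4j version must match the zip in $SPARK_HOME/python/lib
# (py4j-0.8.2.1 ships with Spark 1.5.2).
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Run Spark's own shell bootstrap, which creates the SparkContext `sc`.
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

With that startup file in place, opening a notebook under the pyspark profile runs Spark's shell bootstrap and defines sc, which is exactly the name the NameError complains about.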

ipython anaconda apache-spark pyspark jupyter-notebook

0 votes · 1 answer · 7484 views