I am new to Apache Spark and have a few basic questions that I could not understand while reading the Spark material; every resource has its own style of explanation. I am practising with a PySpark Jupyter notebook on Ubuntu.
As per my understanding, when I run the command below, the data in testfile.csv is partitioned and stored in the memory of the respective nodes (I do know it is lazily evaluated and nothing is processed until an action is seen), but still:
rdd1 = sc.textFile("testfile.csv")
My question is: when I run the transformation and action commands below, where does the rdd2 data get stored?
1. Is it stored in memory?
rdd2 = rdd1.map( lambda x: x.split(",") )
rdd2.count()
I know the data in rdd2 will stay available until I close the Jupyter notebook. Then what is the need for cache(), since rdd2 is available for all further transformations anyway? I have heard that the in-memory data is cleared once all the transformations are done - what does that mean?
Is there any difference between keeping an RDD in memory and calling cache()?
rdd2.cache()
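A minimal sketch, assuming only a running SparkContext sc and the same testfile.csv, of what cache() changes: without it every action recomputes rdd2 from the file, while after cache() the first action materialises the partitions in executor memory and later actions reuse them until they are evicted or unpersisted:

# Minimal sketch, assuming a running SparkContext `sc` and a local testfile.csv.
rdd1 = sc.textFile("testfile.csv")
rdd2 = rdd1.map(lambda x: x.split(","))

# Without cache(): every action re-reads the file and re-runs the map().
print(rdd2.count())
print(rdd2.count())   # recomputed from scratch again

# cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
rdd2.cache()
print(rdd2.count())   # this action materialises the partitions in executor memory
print(rdd2.first())   # later actions reuse the cached partitions

rdd2.unpersist()      # free the cached blocks explicitly when done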
What is the difference between Apache Arrow and Apache Spark? Will Apache Arrow replace Hadoop?
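They are different kinds of projects: Spark is a distributed execution engine, while Arrow is an in-memory columnar data format used to exchange data efficiently between systems (for example between Spark's JVM and Python), so neither of them replaces Hadoop. As a hedged sketch - it needs a newer Spark than the 1.5.2 used above, since Arrow support arrived around Spark 2.3, plus the pyarrow package - the place Arrow typically shows up in PySpark is when converting a DataFrame to pandas:

# Sketch only: assumes Spark 2.3+ with pyarrow installed; it will not run on Spark 1.5.2.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()
df = spark.range(0, 1000000)

# Enable Arrow-based columnar transfer between the JVM and the Python driver.
# On Spark 2.3/2.4 the key is spark.sql.execution.arrow.enabled instead.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = df.toPandas()   # with Arrow enabled this avoids row-by-row serialisation
print(pdf.shape)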
I am new to PySpark. I installed Anaconda on Ubuntu with "bash Anaconda2-4.0.0-Linux-x86_64.sh" and also installed pyspark. Everything works fine in the terminal, but I want to work in Jupyter. I created the profile in my Ubuntu terminal as shown below:
wanderer@wanderer-VirtualBox:~$ ipython profile create pyspark
[ProfileCreate] Generating default config file: u'/home/wanderer/.ipython/profile_pyspark/ipython_config.py'
[ProfileCreate] Generating default config file: u'/home/wanderer/.ipython/profile_pyspark/ipython_kernel_config.py'
wanderer@wanderer-VirtualBox:~$ export ANACONDA_ROOT=~/anaconda2
wanderer@wanderer-VirtualBox:~$ export PYSPARK_DRIVER_PYTHON=$ANACONDA_ROOT/bin/ipython
wanderer@wanderer-VirtualBox:~$ export PYSPARK_PYTHON=$ANACONDA_ROOT/bin/python
wanderer@wanderer-VirtualBox:~$ cd spark-1.5.2-bin-hadoop2.6/
wanderer@wanderer-VirtualBox:~/spark-1.5.2-bin-hadoop2.6$ PYTHON_OPTS="notebook" ./bin/pyspark
Python 2.7.11 |Anaconda 4.0.0 (64-bit)| (default, Dec 6 2015, 18:08:32)
Type "copyright", "credits" or "license" for more information.
IPython 4.1.2 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details …
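For reference, the pyspark launcher reads PYSPARK_DRIVER_PYTHON_OPTS (and, in Spark 1.x, IPYTHON_OPTS) rather than PYTHON_OPTS, which is why a plain IPython shell starts here instead of a notebook. A small hedged check that can be run inside whichever session does start, to confirm whether the launcher has injected a SparkContext named sc:

# Sketch: run inside the session started by ./bin/pyspark.
# The launcher normally creates a SparkContext called `sc`; this only confirms it is there.
try:
    print("SparkContext OK:", sc.version, sc.master)
except NameError:
    print("No `sc` found - this session was not started through bin/pyspark")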
I am trying to create an external table and load Twitter data into it. While creating the table I get the following error and cannot track it down.
hive> ADD JAR /usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar
> ;
Added [/usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar] to class path
Added resources: [/usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar]
hive> CREATE EXTERNAL TABLE tweets (
> id BIGINT,
> created_at STRING,
> source STRING,
> favorited BOOLEAN,
> retweeted_status STRUCT<
> text:STRING,
> user:STRUCT<screen_name:STRING,name:STRING>,
> retweet_count:INT>,
> entities STRUCT<
> urls:ARRAY<STRUCT<expanded_url:STRING>>,
> user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
> hashtags:ARRAY<STRUCT<text:STRING>>>,
> text STRING,
> user STRUCT<
> screen_name:STRING,
> name:STRING,
> friends_count:INT,
> followers_count:INT,
> statuses_count:INT,
> verified:BOOLEAN,
> utc_offset:INT,
> time_zone:STRING>,
> in_reply_to_screen_name STRING
> )
> PARTITIONED …
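The CREATE TABLE statement is cut off above, so the error itself is not visible here. As a side cross-check using a different technique from the Hive SerDe approach (sketched in PySpark because that is what the rest of this post uses): reading the raw tweet JSON with the DataFrame API shows the actual field names and nesting, which can then be compared with the STRUCT columns declared above. The HDFS path below is a placeholder assumption:

# Hedged sketch using the Spark 1.5-era API; the path is an assumption, adjust to wherever Flume writes the tweets.
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
tweets = sqlContext.read.json("hdfs:///user/flume/tweets/")   # placeholder path
tweets.printSchema()                                          # compare with the Hive STRUCT definitions
tweets.select("id", "text", "user.screen_name").show(5)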
I am new to pyspark and want to use pyspark with an IPython notebook on my Ubuntu 12.04 machine. Below is the configuration of pyspark and the IPython notebook.
sparkuser@Ideapad:~$ echo $JAVA_HOME
/usr/lib/jvm/java-8-oracle
# Path for Spark
sparkuser@Ideapad:~$ ls /home/sparkuser/spark/
bin CHANGES.txt data examples LICENSE NOTICE R RELEASE scala-2.11.6.deb
build conf ec2 lib licenses python README.md sbin spark-1.5.2-bin-hadoop2.6.tgz
Run Code Online (Sandbox Code Playgroud)
I have installed Anaconda2 4.0.0; the Anaconda path is:
sparkuser@Ideapad:~$ ls anaconda2/
bin conda-meta envs etc Examples imports include lib LICENSE.txt mkspecs pkgs plugins share ssl tests
Create a PySpark profile for IPython:
ipython profile create pyspark
sparkuser@Ideapad:~$ cat .bashrc
export SPARK_HOME="$HOME/spark"
export PYSPARK_SUBMIT_ARGS="--master local[2]"
# added by Anaconda2 4.0.0 installer
export PATH="/home/sparkuser/anaconda2/bin:$PATH"
Create a file named ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py:
sparkuser@Ideapad:~$ cat .ipython/profile_pyspark/startup/00-pyspark-setup.py
import os
import …
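The contents of 00-pyspark-setup.py are cut off above. As a hedged sketch, such a startup file usually just puts Spark's Python bindings on sys.path and creates a SparkContext, roughly like this (the bundled py4j version differs between Spark releases, hence the glob):

# Hedged sketch of a typical 00-pyspark-setup.py; assumes SPARK_HOME is set as in the .bashrc above.
import glob
import os
import sys

spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[2]").setAppName("ipython-notebook")
sc = SparkContext(conf=conf)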