I use Hadoop 2.6.0.2.2.0.0-2041 with Hive 0.14.0.2.2.0.0-2041.
After building Spark with the command:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests package
I tried to run the Pi example on YARN with the following commands:
export HADOOP_CONF_DIR=/etc/hadoop/conf
/var/home2/test/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--executor-memory 3G \
--num-executors 50 \
hdfs:///user/test/jars/spark-examples-1.3.0-hadoop2.4.0.jar \
1000
I get the exception: application_1427875242006_0029 failed 2 times due to AM Container for appattempt_1427875242006_0029_000002 exited with exitCode: 1, which is in fact Diagnostics: Exception from container-launch (see the logs below).
The application tracking URL shows the following message:
java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all
and:
Error: Could …

When trying to find the difference between two dates in weeks:
import pandas as pd
def diff(start, end):
    x = millis(end) - millis(start)
    return x / (1000 * 60 * 60 * 24 * 7 * 1000)

def millis(s):
    return pd.to_datetime(s).to_datetime64()
diff("2013-06-10","2013-06-16")
As a result I get:
Out[15]: numpy.timedelta64(857,'ns')
This is obviously wrong. My questions:

How do I get the difference in weeks, instead of a rounded value in nanoseconds?
How do I extract the numeric value from a numpy.timedelta64 object?
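The nanosecond result comes from dividing the timedelta64 by plain integers instead of by another time unit. A minimal standard-library sketch of one way to get the difference in weeks, using the dates from the snippet above (in pandas, dividing the resulting Timedelta by np.timedelta64(1, 'W') should give the same float, which also answers how to pull a plain number out of a timedelta64):

```python
from datetime import date

def diff_weeks(start, end):
    """Difference between two ISO-formatted date strings, in weeks (as a float)."""
    delta = date.fromisoformat(end) - date.fromisoformat(start)  # a timedelta
    return delta.days / 7

print(diff_weeks("2013-06-10", "2013-06-16"))  # 6 days, i.e. about 0.857 weeks
```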
How can I concatenate all the lists in:
org.apache.spark.rdd.RDD[List[Record]]
to get a single collection:
val values: org.apache.spark.rdd.RDD[Record]
Any ideas?
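This is what flatMap is for: something like values.flatMap(identity) (equivalently .flatMap(x => x)) should collapse an RDD[List[Record]] into an RDD[Record], since flatMap emits every element of each inner collection. A local Python sketch of the same flattening operation on plain lists (the sample data is made up):

```python
from itertools import chain

# Stand-in for the RDD's elements: each element is itself a list of records
record_lists = [["a", "b"], ["c"], ["d", "e"]]

# flatMap(identity): emit every inner element, dropping one nesting level
flattened = list(chain.from_iterable(record_lists))
print(flattened)  # ['a', 'b', 'c', 'd', 'e']
```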
What is the best way to find the difference (complement) D of two sequences A and B, where D = A - B is the sequence of all items that belong to A but not to B? For example:
val A = Seq((1,1), (2,1), (3,1), (4,1), (5,1))
val B = Seq((1,1), (5,1))
to get:
val D = Seq((2,1), (3,1), (4,1))
Filtering A by the elements of B does not seem to be an efficient solution for "long" sequences. Any other ideas?
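A common trick is to turn B into a hash set first, so each membership test is O(1) and the whole difference costs O(|A| + |B|) rather than the O(|A| × |B|) of naive filtering; in Scala that would be roughly val bSet = B.toSet; val D = A.filterNot(bSet.contains). A Python sketch of the same idea, using the data above:

```python
def seq_difference(a, b):
    """Items of a that are not in b, keeping a's order; O(len(a) + len(b))."""
    b_set = set(b)  # hashing b once makes each membership test O(1)
    return [x for x in a if x not in b_set]

A = [(1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
B = [(1, 1), (5, 1)]
print(seq_difference(A, B))  # [(2, 1), (3, 1), (4, 1)]
```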
After building Spark 1.3.0 from the root directory, when building the examples directory, no matter which command I use:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package
or just:
mvn -DskipTest clean package
I get:
[ERROR] Failed to execute goal org.scalastyle:scalastyle-maven-plugin:0.4.0:check (default) on project spark-examples_2.10: Failed during scalastyle execution: Unable to find configuration file at location scalastyle-config.xml -> [Help 1] [ERROR]
How do I find the counts of the elements of a series? This code:
import pandas as pd
d = { 'x' : [1,2,2,2,3,4,5,5,7] }
df = pd.DataFrame(d)
cnt1 = len(df[df.x == 1])
cnt2 = len(df[df.x == 2])
cnt3 = len(df[df.x == 3])
...
does not help much. Is there a way to count how many times each element occurs, so that the result is a dictionary of element/count pairs, like:
cnts = {'1':1, '2': 3, '3':1, ...}
Or some other structure that is easy to look up and iterate over?
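This is what pandas' Series.value_counts is for: df['x'].value_counts().to_dict() should yield exactly such a dictionary (with the elements themselves as keys, not strings). The same result can be sketched with the standard library's collections.Counter, using the column values from above:

```python
from collections import Counter

x = [1, 2, 2, 2, 3, 4, 5, 5, 7]  # the same values as the DataFrame column above

# Counter is a dict subclass mapping element -> number of occurrences
cnts = Counter(x)
print(dict(cnts))  # {1: 1, 2: 3, 3: 1, 4: 1, 5: 2, 7: 1}
```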