I want to compute a cumulative sum in Spark. This is the register table (input):
+---------------+-------------------+----+----+----+
| product_id| date_time| ack|val1|val2|
+---------------+-------------------+----+----+----+
|4008607333T.upf|2017-12-13:02:27:01|3-46| 53| 52|
|4008607333T.upf|2017-12-13:02:27:03|3-47| 53| 52|
|4008607333T.upf|2017-12-13:02:27:08|3-46| 53| 52|
|4008607333T.upf|2017-12-13:02:28:01|3-47| 53| 52|
|4008607333T.upf|2017-12-13:02:28:07|3-46| 15| 1|
+---------------+-------------------+----+----+----+
Hive query:
select *,
       sum(val1) over (partition by product_id, ack
                       order by date_time
                       rows between unbounded preceding and current row) val1_sum,
       sum(val2) over (partition by product_id, ack
                       order by date_time
                       rows between unbounded preceding and current row) val2_sum
from test
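As a sanity check on the window semantics (plain Python, not Spark code): the query partitions by (product_id, ack) and accumulates in date_time order, which can be sketched like this with the rows from the table above:

```python
from collections import defaultdict

# Input rows from the register table above.
rows = [
    ("4008607333T.upf", "2017-12-13:02:27:01", "3-46", 53, 52),
    ("4008607333T.upf", "2017-12-13:02:27:03", "3-47", 53, 52),
    ("4008607333T.upf", "2017-12-13:02:27:08", "3-46", 53, 52),
    ("4008607333T.upf", "2017-12-13:02:28:01", "3-47", 53, 52),
    ("4008607333T.upf", "2017-12-13:02:28:07", "3-46", 15, 1),
]

# Running totals keyed by the partition columns (product_id, ack); processing
# rows in date_time order mirrors "rows between unbounded preceding and current row".
val1_sum = defaultdict(int)
val2_sum = defaultdict(int)
result = []
for product_id, date_time, ack, val1, val2 in sorted(rows, key=lambda r: r[1]):
    key = (product_id, ack)
    val1_sum[key] += val1
    val2_sum[key] += val2
    result.append((product_id, date_time, ack, val1, val2,
                   val1_sum[key], val2_sum[key]))
```

Each `result` row ends with the two running sums for that row's partition, matching the output table below.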
Output:
+---------------+-------------------+----+----+----+--------+--------+
| product_id| date_time| ack|val1|val2|val1_sum|val2_sum|
+---------------+-------------------+----+----+----+--------+--------+
|4008607333T.upf|2017-12-13:02:27:01|3-46| 53| 52| 53| 52|
|4008607333T.upf|2017-12-13:02:27:08|3-46| 53| …

I want to create a JSON array for EMR steps. I have already created the array for a single JSON string. Here is my bash code -
export source="s3a://sourcebucket"
export destination="s3a://destinationbucket"
# Interpolate via the jq --arg bindings; the original referenced unrelated
# shell variables ($sourcepath/$destinationpath) instead of $source/$destination.
EMR_DISTCP_STEPS=$( jq -n \
  --arg source "$source" \
  --arg destination "$destination" \
  '[{
     "Name": "S3DistCp step",
     "HadoopJarStep": {
       "Args": ["s3-dist-cp", "--s3Endpoint=s3.amazonaws.com", "--src=\($source)", "--dest=\($destination)"],
       "Jar": "command-runner.jar"
     },
     "ActionOnFailure": "CONTINUE"
   }]' )
Output:
echo $EMR_DISTCP_STEPS
[{ "Name": "S3DistCp step", "HadoopJarStep": { "Args": [ "s3-dist-cp", "--s3Endpoint=s3.amazonaws.com", "--src=s3a://sourcebucket", "--dest=s3a://destinationbucket" ], "Jar": "command-runner.jar" }, "ActionOnFailure": "CONTINUE" }]
Now I want to create a JSON array with multiple source and destination pairs. Desired output below.
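One way to build such an array, sketched in Python rather than jq (the source and destination lists here are hypothetical): construct one step object per pair and serialize the whole list.

```python
import json

# Hypothetical bucket lists; in practice these would come from the environment.
sources = ["s3a://sourcebucket1", "s3a://sourcebucket2"]
destinations = ["s3a://destinationbucket1", "s3a://destinationbucket2"]

# One S3DistCp step object per (source, destination) pair.
steps = [
    {
        "Name": "S3DistCp step",
        "HadoopJarStep": {
            "Args": ["s3-dist-cp", "--s3Endpoint=s3.amazonaws.com",
                     f"--src={src}", f"--dest={dst}"],
            "Jar": "command-runner.jar",
        },
        "ActionOnFailure": "CONTINUE",
    }
    for src, dst in zip(sources, destinations)
]

emr_distcp_steps = json.dumps(steps)
```

The same idea in jq would be to emit one object per pair and collect them into an array; the sketch above only illustrates the target structure.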
[{ "Name": "S3DistCp step", "HadoopJarStep": { "Args": [ "s3-dist-cp", "--s3Endpoint=s3.amazonaws.com", "--src=s3a://sourcebucket1", "--dest=s3a://destinationbucket1" ], "Jar": "command-runner.jar" }, …

I want to delete all files under the volume directory. The directory is inside a Kubernetes pod, so I am using the exec command.
My command -
kubectl exec $POD -- rm -rf /usr/local/my-app/volume/*
The above command does not work; it produces no output on the terminal. I tried the following command and it works -
kubectl exec $POD -- rm -rf /usr/local/my-app/volume
But that deletes the directory itself, which I cannot do because it is used as a mount point.
How can I achieve this?
Thanks
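The likely cause (stated as an assumption): `kubectl exec` runs the command directly, without a shell inside the container, so the `*` glob is never expanded and `rm` receives a literal `*`; wrapping the command as `kubectl exec $POD -- sh -c 'rm -rf /usr/local/my-app/volume/*'` lets a shell inside the pod expand it. The glob behavior itself can be demonstrated locally, with no Kubernetes involved (scratch directory is hypothetical):

```python
import os
import subprocess
import tempfile

# Scratch "volume" directory standing in for /usr/local/my-app/volume.
vol = tempfile.mkdtemp()
open(os.path.join(vol, "file.txt"), "w").close()
os.mkdir(os.path.join(vol, "sub"))

# Without a shell, rm receives a literal '*' and removes nothing
# (-f suppresses the "no such file or directory" error).
subprocess.run(["rm", "-rf", os.path.join(vol, "*")])
still_there = sorted(os.listdir(vol))

# With sh -c, the shell expands the glob: the contents go, the directory stays.
subprocess.run(["sh", "-c", "rm -rf " + vol + "/*"])
emptied = os.listdir(vol)
```

After the `sh -c` variant, the directory still exists but is empty, which is exactly the behavior wanted for a mount point.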
I have a dataframe with a JSON column holding JSON strings; example below. There are 3 columns - a, b, c. Column c is StringType.
| a | b | c |
--------------------------------------------------------
|77 |ABC | {"12549":38,"333513":39} |
|78 |ABC | {"12540":38,"333513":39} |
I want to promote the JSON keys to dataframe columns (a pivot). Example below -
| a | b | 12549 | 333513 | 12540 |
---------------------------------------------
|77 |ABC |38 |39 | null |
|78 |ABC | null |39 | 38 |
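In Spark this would typically be done by parsing column c (e.g. with `from_json` and an explicit schema) and then reshaping; purely as an illustration of the target reshaping, the same transformation in plain Python looks like this:

```python
import json

# Input rows; column c holds a JSON string, as in the dataframe above.
rows = [
    {"a": 77, "b": "ABC", "c": '{"12549":38,"333513":39}'},
    {"a": 78, "b": "ABC", "c": '{"12540":38,"333513":39}'},
]

# The union of all JSON keys becomes the new set of columns.
keys = sorted({k for r in rows for k in json.loads(r["c"])})

# One output row per input row; keys absent from a row become None (null).
flat = []
for r in rows:
    parsed = json.loads(r["c"])
    row = {"a": r["a"], "b": r["b"]}
    row.update({k: parsed.get(k) for k in keys})
    flat.append(row)
```

Note that keys missing from a given row surface as None, matching the nulls in the expected output above.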