I am working with a DataFrame that has two columns, mvv and count.
+---+-----+
|mvv|count|
+---+-----+
|  1|    5|
|  2|    9|
|  3|    3|
|  4|    1|
+---+-----+
I would like to get two lists, one holding the mvv values and one holding the count values. Something like:
mvv = [1,2,3,4]
count = [5,9,3,1]
So I tried the following code. The first line should return a Python list of rows, and I wanted to look at the first value:
mvv_list = mvv_count_df.select('mvv').collect()
firstvalue = mvv_list[0].getInt(0)
But the second line fails with this error message:
AttributeError: getInt
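One likely explanation: getInt belongs to the Scala/Java Row API, while the Row objects returned by collect() in PySpark behave like tuples and are read by position or by field name. The sketch below uses a collections.namedtuple called Row as a stand-in (an assumption, so that it runs without a Spark session) for what mvv_count_df.collect() would return.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (assumption: lets the sketch run without Spark).
Row = namedtuple("Row", ["mvv", "count"])

# What mvv_count_df.collect() would return for the table in the question.
rows = [Row(1, 5), Row(2, 9), Row(3, 3), Row(4, 1)]

# Rows are read by position (or by field name), not with getInt.
first_value = rows[0][0]

# One list per column, built with list comprehensions.
mvv = [row[0] for row in rows]
count = [row[1] for row in rows]

print(first_value)  # 1
print(mvv)          # [1, 2, 3, 4]
print(count)        # [5, 9, 3, 1]
```

With a real DataFrame the same comprehensions would run over mvv_count_df.collect(). Keep in mind that collect() brings every row to the driver, so it suits small results.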
Hi everyone, my Elasticsearch cluster has 100 indices and I would like to delete them all with a single query. Their names all start with myindex:
myindex-1
myindex-2
myindex-3
myindex-4
.
.
.
myindex-100
When I try this query, it does not work:
curl -XDELETE http://localhost:9200/myindex*
I get:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Wildcard expressions or all indices are not allowed"}],"type":"illegal_argument_exception","reason":"Wildcard expressions or all indices are not allowed"},"status":400}
Do you have any idea?
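This error is what Elasticsearch returns when the setting action.destructive_requires_name is true, which forbids wildcards in destructive requests. One option is to set it to false. Another, sketched below, is to keep the safeguard and name the indices explicitly, since the delete index API accepts comma-separated names. The helper build_delete_url is hypothetical, and the sketch only builds the URL without sending it; the host and the myindex-1 to myindex-100 names come from the question.

```python
def build_delete_url(base_url, prefix, first, last):
    """Return a URL naming every index explicitly (hypothetical helper)."""
    names = [f"{prefix}-{i}" for i in range(first, last + 1)]
    return f"{base_url}/{','.join(names)}"


# Host and index names are taken from the question.
url = build_delete_url("http://localhost:9200", "myindex", 1, 100)

print(url[:60] + "...")
print(len(url))
```

The resulting URL could then be passed to curl -XDELETE in place of the wildcard form.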
I have a DataFrame in the following format:
+---+------------------------------------------------------+
|Id |DateInfos                                             |
+---+------------------------------------------------------+
|B  |[[3, 19/06/2012-02.42.01], [4, 17/06/2012-18.22.21]]  |
|A  |[[1, 15/06/2012-18.22.16], [2, 15/06/2012-09.22.35]]  |
|C  |[[5, 14/06/2012-05.20.01]]                            |
+---+------------------------------------------------------+
I would like to sort the elements of each DateInfos array by date, using the timestamp held in the second field of each element:
+---+------------------------------------------------------+
|Id |DateInfos                                             |
+---+------------------------------------------------------+
|B  |[[4, 17/06/2012-18.22.21], [3, 19/06/2012-02.42.01]]  |
|A  |[[2, 15/06/2012-09.22.35], [1, 15/06/2012-18.22.16]]  |
|C  |[[5, 14/06/2012-05.20.01]]                            |
+---+------------------------------------------------------+
My DataFrame's schema prints as follows:
root
 |-- C1: string (nullable = true)
 |-- C2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: integer (nullable = false)
 |    |    |-- _2: string …
I created a Hive table like this:
CREATE EXTERNAL TABLE table_df (v1 String, v2 String, v3 String, v4 String, v5 String, v6 String, v7 String, v8 String, v9 String, v10 String, v11 String, v12 String, v13 String, v14 String, v15 String, v16 String, v17 String, v18 String, v19 String, v20 String, v21 String, v22 String, v23 String, v24 String, v25 String, v26 String, v27 String, v28 String, v29 String, v30 String, v31 String, v32 Double, v33 Int, v34 Int, v35 Int)
STORED AS PARQUET LOCATION '/data/test/table_df.parquet'; …
As the title says, I would like to know how to remove the first character of a Spark string column in the following two cases:

val myDF1 = Seq(("£14326"),("£1258634"),("£15626"),("£163262")).toDF("A")
val myDF2 = Seq(("a14326"),("c1258634"),("t15626"),("f163262")).toDF("A")
myDF1.show
myDF2.show

+--------+
|       A|
+--------+
|£14326  |
|£1258634|
|£15626  |
|£163262 |
+--------+

+--------+
|       A|
+--------+
|a14326  |
|c1258634|
|t15626  |
|f163262 |
+--------+

I would like to obtain:

+--------+-------+
|       A|      B|
+--------+-------+
|£14326  |  14326|
|£1258634|1258634|
|£15626  |  15626|
|£163262 | 163262|
+--------+-------+

+--------+-------+
|       A|      B|
+--------+-------+
|a14326  |14326  |
|c1258634|1258634|
|t15626  |15626  |
|f163262 |163262 |
+--------+-------+

Do you have any idea?
I have some data held in an array of strings like the one below (for illustration only):
val myArray = Array("1499955986039", "1499955986051", "1499955986122")
I would like to map my list to an array of Timestamps so that I can create an RDD (myRdd) and then build a DataFrame like this:
val df = sqlContext.createDataFrame(myRdd, StructType(Seq(StructField("myTymeStamp", TimestampType, true))))
My question is not how to create the RDD, but how to replace the strings with millisecond timestamps. Do you have any idea? Thanks.
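Since the strings look like epoch milliseconds, the conversion amounts to parsing each one as an integer and building a timestamp from it; in Scala that would be something like myArray.map(s => new java.sql.Timestamp(s.toLong)). The Python sketch below shows the same idea with the standard library, assuming the values count milliseconds since 1970-01-01 UTC. The helper name to_timestamp is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Reference point for epoch-based values (assumption: the strings are UTC epoch millis).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp(millis_str):
    """Parse a string of epoch milliseconds into a timezone-aware datetime."""
    # Integer arithmetic via timedelta avoids float rounding of the milliseconds.
    return EPOCH + timedelta(milliseconds=int(millis_str))


my_array = ["1499955986039", "1499955986051", "1499955986122"]
timestamps = [to_timestamp(s) for s in my_array]

for ts in timestamps:
    print(ts.isoformat())
```

The parsing step is independent of how the RDD or DataFrame is built afterwards, so it can be applied inside whatever map produces the rows.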
I have a list generated by a function. When I run print on my list:
print preds_labels
I get:
[(0.,8.),(0.,13.),(0.,19.),(0.,19.),(0.,19.),(0.,19.),(0.,19.),(0.,20.),(0.,21.),(0.,23.)]
But when I try to create a DataFrame with this command:
df = sqlContext.createDataFrame(preds_labels, ["prediction", "label"])
I get an error message:
not supported type: <type 'numpy.float64'>
If I create the list by hand, I have no problem. Do you have any idea?
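The error suggests that the function returns numpy.float64 values, which the schema inference in createDataFrame does not recognize, whereas a hand-written list holds plain Python floats. A minimal sketch of the usual workaround is to convert every value with float() before building the DataFrame. So that the snippet runs without NumPy or Spark, Float64Like is a hypothetical stand-in for numpy.float64 (which is likewise a subclass of Python's float), and to_plain_floats is an illustrative helper name.

```python
class Float64Like(float):
    """Hypothetical stand-in for numpy.float64, itself a subclass of float."""


def to_plain_floats(pairs):
    """Convert every value in a list of tuples to a built-in Python float."""
    return [tuple(float(value) for value in pair) for pair in pairs]


# Two of the (prediction, label) pairs from the question, wrapped in the stand-in type.
preds_labels = [(Float64Like(0.0), Float64Like(8.0)),
                (Float64Like(0.0), Float64Like(13.0))]
clean = to_plain_floats(preds_labels)

print(clean)
print(all(type(value) is float for pair in clean for value in pair))
```

The cleaned list can then be passed to sqlContext.createDataFrame(clean, ["prediction", "label"]) as in the question.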
I have two arrays like this:
val l1 = Array((1,2,3), (6,2,-3), (6,2,-4))
val l2 = Array("a","b","c")
I would like to put each value of l2 at the same position in l1 and end up with a final array like this:
Array((1,2,3,"a"), (6,2,-3,"b"), (6,2,-4,"c"))
I was thinking of something like this:
val l3 = l1.map( code...)
But I don't know how to iterate over l2 inside the map on l1.
Do you have any idea?
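Pairing two arrays by position is what zip does; in Scala this would be something like l1.zip(l2).map { case ((a, b, c), s) => (a, b, c, s) }. The Python sketch below shows the same zip-then-extend idea on the data from the question.

```python
l1 = [(1, 2, 3), (6, 2, -3), (6, 2, -4)]
l2 = ["a", "b", "c"]

# zip pairs elements by position; each tuple is then extended with its partner.
l3 = [triple + (letter,) for triple, letter in zip(l1, l2)]

print(l3)  # [(1, 2, 3, 'a'), (6, 2, -3, 'b'), (6, 2, -4, 'c')]
```

Note that zip stops at the shorter input, so the two arrays should have the same length.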
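I just want to display a column in a select without it being truncated; the column holds an array or a Map that is very long.
I use Zeppelin to query a DataFrame registered as a temporary table: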
%livy.sql
select * from maTable
Do you have any idea?
Hi everyone, the class StreamingContext cannot be found in the code below.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object Exemple {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("Exemple")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(2)) // this line throws the error
  }
}
Here is the error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/StreamingContext
at Exemple$.main(Exemple.scala:16)
at Exemple.main(Exemple.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.StreamingContext
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 2 more
Process finished with exit code 1
I use the following build.sbt file:
name := "exemple"
version := "1.0.0"
scalaVersion := "2.11.11" …