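First of all, I apologize if my question is simple. I have spent a lot of time researching it.
I am trying to set up a scalar Pandas UDF in a PySpark script, as described here.
Here is my code: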
from pyspark import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import SQLContext
sc.install_pypi_package("pandas")
import pandas as pd
sc.install_pypi_package("PyArrow")
df = spark.createDataFrame(
    [("a", 1, 0), ("a", -1, 42), ("b", 3, -1), ("b", 10, -2)],
    ("key", "value1", "value2")
)
df.show()
@F.pandas_udf("double", F.PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return pd.Series(v + 1)
df.select(pandas_plus_one(df.value1)).show()
# Also fails
#df.select(pandas_plus_one(df["value1"])).show()
#df.select(pandas_plus_one("value1")).show()
#df.select(pandas_plus_one(F.col("value1"))).show()
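The script fails on the last statement:
An error occurred while calling o209.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 8.0 failed 4 times, most recent failure: stage …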
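One way to narrow this down is to run the UDF body on a plain pandas Series, outside Spark. The sketch below is only a local check, not a diagnosis: the error message above is truncated, so the actual cause is unknown. The name `plus_one` and the sample Series are made up for illustration (the values mirror the `value1` column). The `astype("float64")` cast is also an assumption: it makes the returned dtype agree with the declared `"double"` return type, since `v + 1` on integer input stays `int64`.

```python
import pandas as pd


def plus_one(v: pd.Series) -> pd.Series:
    # Same arithmetic as the UDF body. The explicit cast is an assumption:
    # it makes the result dtype match the declared "double" return type.
    return (v + 1).astype("float64")


# Hypothetical stand-in for the value1 column of the example DataFrame.
value1 = pd.Series([1, -1, 3, 10])

result = plus_one(value1)
print(result.tolist())  # [2.0, 0.0, 4.0, 11.0]
print(result.dtype)     # float64
```

If this runs cleanly, the arithmetic itself is fine and the failure is more likely to come from the Spark or Arrow side, though that cannot be confirmed from the truncated message.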
I have an external list of values:
list <- c("Google", "Yahoo", "Amazon")
For each value in the list, I want to keep the row with the first (oldest) timestamp recorded in a data frame like this:
dframe <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), name = c("Google",
"Google", "Yahoo", "Amazon", "Amazon", "Google", "Amazon"), date = c("2008-11-01",
"2008-11-02", "2008-11-01", "2008-11-04", "2008-11-01", "2008-11-02",
"2008-11-03")), class = "data.frame", row.names = c(NA, -7L))
The expected output is:
id name   date
1  Google 2008-11-01
1  Yahoo  2008-11-01
1  Amazon 2008-11-04
2  Amazon 2008-11-01
2  Google 2008-11-02
How can this be done?
The code below keeps only the first record for each id, rather than the first record of every single value in the list:
library(data.table)
setDT(dframe)
date_list_first = dframe[order(date)][!duplicated(id)]
I am trying to merge two data frames by a column named "team".
My merge statement:
merge(RB,LB,by.x ="team")
The error I get is:
Error in merge.data.frame(RB, LB, by.x = "team") : 'by.x' and 'by.y' specify different numbers of columns
# Create a data frame to store the set of right-backs
RB <- data.frame(
  team = c("Liverpool", "Manchester United", "Chelsea",
           "Atletico Madrid", "Juventus", "Real Madrid"),
  players = c("Trent-Alexandre Arnold", "Diogo Dalot", "Cesar Azpilicueta",
              "Keiran Trippier", "Danilo", "Carvajal"),
  stringsAsFactors = FALSE
)
# Create a data frame to store the set of left-backs
LB <- data.frame(
  team = c("Manchester United", "Real Madrid", "Liverpool",
           "Chelsea", "Juventus", "Atletico Madrid"),
  players = c("Luke Shaw", "Marcelo", "Andrew Robertson",
              "Marcos Alonso", "Alex Sandro", "Renan Lodi"),
  stringsAsFactors = FALSE
)