Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp

Question

ant*_*o12 12 pandas apache-spark apache-spark-sql pyspark databricks

I have a feature that lets me query Databricks Delta tables from a client application. This is the code I use for that:

df = spark.sql('SELECT * FROM EmployeeTerritories LIMIT 100')
dataframe = df.toPandas()
dataframe_json = dataframe.to_json(orient='records', force_ascii=False)

However, the second line throws this error:

Casting from timestamp[us, tz=Etc/UTC] to timestamp[ns] would result in out of bounds timestamp

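I understand what the error means: some of my date-type fields hold values that are out of range. I looked for solutions, but none of them fit my scenario.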
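For context, "out of range" here refers to the nanosecond representation: a timestamp[ns] value is a signed 64-bit count of nanoseconds since 1970-01-01 UTC, so a microsecond timestamp only fits if multiplying it by 1000 stays inside the int64 range. The following is a minimal standard-library sketch of that arithmetic, not what Arrow or pandas actually run; the names `fits_in_ns` and `latest` are illustrative.

```python
from datetime import datetime, timedelta, timezone

INT64_MIN = -2**63
INT64_MAX = 2**63 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def fits_in_ns(ts):
    """Return True if a tz-aware datetime survives the us -> ns conversion."""
    # Whole microseconds since the epoch (the timestamp[us] representation).
    micros = (ts - EPOCH) // timedelta(microseconds=1)
    # The ns representation is micros * 1000 and must fit in a signed 64-bit integer.
    return INT64_MIN <= micros * 1000 <= INT64_MAX

# Latest microsecond-precision instant that still fits in timestamp[ns].
latest = EPOCH + timedelta(microseconds=INT64_MAX // 1000)

print(latest)                                                    # 2262-04-11 23:47:16.854775+00:00
print(fits_in_ns(datetime(2024, 1, 1, tzinfo=timezone.utc)))     # True
print(fits_in_ns(datetime(9999, 12, 31, tzinfo=timezone.utc)))   # False
```

A placeholder date such as 9999-12-31, which is common in warehouse tables, is therefore enough to trigger the error.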

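The solutions I found deal with specific dataframe columns, but in my case the problem is global: I have a large number of Delta tables, and I do not know in advance which columns are date-typed, so I cannot cast them individually to avoid this.

Is it possible to find all Timestamp-type columns and convert them to string? Does that look like a good solution? Do you have any other ideas on how to achieve what I am trying to do?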

Answer 1

bla*_*hop 12

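Is it possible to find all Timestamp-type columns and convert them to string?

Yes, that's the way to go. You can loop over df.dtypes and cast every column whose type is "timestamp" to string before calling df.toPandas():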

import pyspark.sql.functions as F

df = df.select(*[
    F.col(c).cast("string").alias(c) if t == "timestamp" else F.col(c)
    for c, t in df.dtypes
])

dataframe = df.toPandas()

You can define this as a function that takes df as a parameter and use it with all your tables:

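from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def stringify_timestamps(df: DataFrame) -> DataFrame:
    # Cast every timestamp column to string; pass all other columns through unchanged.
    return df.select(*[
        F.col(c).cast("string").alias(c) if t == "timestamp" else F.col(c).alias(c)
        for c, t in df.dtypes
    ])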

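If you would rather keep the timestamp type, you can consider nulling out the timestamp values that are greater than pd.Timestamp.max, as shown in this post, instead of converting them to strings.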
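One possible way to sketch the nulling approach, which is not necessarily what the referenced post does, is to build expressions for DataFrame.selectExpr() that replace out-of-range timestamps with NULL. Everything below is an assumption for illustration: the helper name `null_out_of_range_exprs`, the constants `LOWER` and `UPPER`, and the choice of whole-day bounds kept about a day inside the nanosecond range (roughly 1677-09-21 to 2262-04-11) so that the session time zone does not matter.

```python
# Conservative, illustrative bounds that sit safely inside the timestamp[ns] range.
LOWER = "1677-09-22 00:00:00"
UPPER = "2262-04-11 00:00:00"

def null_out_of_range_exprs(dtypes):
    """Build selectExpr() strings from (name, type) pairs such as df.dtypes."""
    exprs = []
    for name, dtype in dtypes:
        # Backtick-quote the column name, escaping any embedded backticks.
        quoted = "`" + name.replace("`", "``") + "`"
        if dtype == "timestamp":
            # CASE without ELSE yields NULL for values outside the bounds.
            exprs.append(
                f"CASE WHEN {quoted} BETWEEN TIMESTAMP '{LOWER}' AND TIMESTAMP '{UPPER}' "
                f"THEN {quoted} END AS {quoted}"
            )
        else:
            exprs.append(quoted)
    return exprs

# Intended use with a Spark DataFrame `df` (not executed here):
#   df = df.selectExpr(*null_out_of_range_exprs(df.dtypes))
#   dataframe = df.toPandas()
```

Compared with casting to string, this keeps the columns as timestamps in pandas, at the cost of losing the out-of-range values, so it only makes sense when those values are placeholders rather than real data.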

Archived:	4 years ago
Views:	15,343
Last recorded:	4 years ago