I'm working in Python with Spark RDDs/DataFrames.
I tried isinstance(thing, RDD), but RDD was not recognized.
The reason I need this: I'm writing a function that can be passed either an RDD or a DataFrame, so when a DataFrame comes in I need to use input.rdd to get the underlying RDD.
isinstance works just fine:
from pyspark.sql import DataFrame
from pyspark.rdd import RDD

def foo(x):
    if isinstance(x, RDD):
        return "RDD"
    if isinstance(x, DataFrame):
        return "DataFrame"

foo(sc.parallelize([]))
## 'RDD'
foo(sc.parallelize([("foo", 1)]).toDF())
## 'DataFrame'
but single dispatch is a far more elegant approach:
from functools import singledispatch

@singledispatch
def bar(x):
    pass

@bar.register(RDD)
def _(arg):
    return "RDD"

@bar.register(DataFrame)
def _(arg):
    return "DataFrame"

bar(sc.parallelize([]))
## 'RDD'
bar(sc.parallelize([("foo", 1)]).toDF())
## 'DataFrame'
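A side note, assuming Python 3.7 or later (not part of the original answer): register can also infer the dispatch type from a parameter annotation, so the same registration can be written without the explicit argument:

@bar.register
def _(arg: RDD):
    # Registered for RDD via the annotation (Python 3.7+)
    return "RDD"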
If you don't mind an additional dependency, multipledispatch is also an interesting option:
from multipledispatch import dispatch

@dispatch(RDD)
def baz(x):
    return "RDD"

@dispatch(DataFrame)
def baz(x):
    return "DataFrame"

baz(sc.parallelize([]))
## 'RDD'
baz(sc.parallelize([("foo", 1)]).toDF())
## 'DataFrame'
Finally, the most Pythonic approach is to simply check the interface:
def foobar(x):
    if hasattr(x, "rdd"):
        ## It is a DataFrame
        return "DataFrame"
    else:
        ## It (probably) is an RDD
        return "RDD"
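Applied to the question's original goal, a minimal sketch of such a function (the name normalize_to_rdd is hypothetical, not from the original post):

def normalize_to_rdd(thing):
    # A DataFrame exposes its underlying RDD via the .rdd attribute;
    # anything without that attribute is assumed to already be an RDD
    return thing.rdd if hasattr(thing, "rdd") else thing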