Posts by mal*_*har

Getting "'NoneType' object is not iterable" error when using a udf on ArrayType in a PySpark DataFrame

I have a DataFrame with the following schema

hello.printSchema()
root
 |-- list_a: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- list_b: array (nullable = true)
 |    |-- element: integer (containsNull = true)


and the following sample data

hello.take(2)
[Row(list_a=[7, 11, 1, 14, 13, 15,999], list_b=[15, 13, 7, 11, 1, 14]),
 Row(list_a=[7, 11, 1, 14, 13, 15], list_b=[11, 1, 7, 14, 15, 13, 12])]


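Desired output

Sort list_a and list_b
Create a new column list_diff such that list_diff = list(set(list_a) - set(list_b)), or an empty ArrayType if no such difference exists.

The approach I tried is a UDF.

As mentioned in the question, I am trying to use the following UDFs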

sort_udf=udf(lambda x: sorted(x), ArrayType(IntegerType()))
differencer=udf(lambda x,y: …

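A minimal sketch of null-safe versions of the two helpers, written as plain Python functions so they can be checked without a Spark session. It assumes the error in the title comes from a null array reaching sorted(x), since the schema marks both columns as nullable = true. The names sort_list and list_diff are hypothetical and do not come from the original post.

```python
def sort_list(values):
    """Return a sorted copy of values, treating a null (None) array as empty."""
    if values is None:
        return []
    # The schema allows null elements (containsNull = true), so drop them
    # before sorting; None cannot be compared with integers in Python 3.
    return sorted(v for v in values if v is not None)


def list_diff(list_a, list_b):
    """Return the sorted elements of list_a that are absent from list_b."""
    a = {v for v in (list_a or []) if v is not None}
    b = {v for v in (list_b or []) if v is not None}
    return sorted(a - b)


# The two sample rows from the question
row1_diff = list_diff([7, 11, 1, 14, 13, 15, 999], [15, 13, 7, 11, 1, 14])
row2_diff = list_diff([7, 11, 1, 14, 13, 15], [11, 1, 7, 14, 15, 13, 12])
```

If this matches the data, each function could presumably be wrapped with pyspark.sql.functions.udf. Note that list_a holds long elements, so a return type of ArrayType(LongType()) may be a closer fit for it than ArrayType(IntegerType()).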

python apache-spark apache-spark-sql pyspark

mal*_*har

2018-08-09

Score: 5
Answers: 1
Views: 5397

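Do filter, map, and reduce in Python create a new copy of the list?

Using Python 2.7. Let's say we have list_of_nums = [1,2,2,3,4,5] and we want to remove all occurrences of 2. We can do this with list_of_nums[:] = filter(lambda x: x != 2, list_of_nums) or with list_of_nums = filter(lambda x: x != 2, list_of_nums).

Is this an "in-place" replacement? Also, are we creating a copy of the list when we use filter?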
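A small sketch of how the two forms differ. It is written for Python 3, where filter returns a lazy iterator rather than the list it returns in Python 2.7, so the rebinding form wraps it in list(). The variable names in_place, alias, rebound and old are illustrative only.

```python
# Form 1: slice assignment keeps the same list object and replaces its contents.
in_place = [1, 2, 2, 3, 4, 5]
alias = in_place  # second reference to the same list
in_place[:] = filter(lambda x: x != 2, in_place)

# Form 2: plain assignment rebinds the name to a brand-new list.
rebound = [1, 2, 2, 3, 4, 5]
old = rebound  # second reference to the original list
rebound = list(filter(lambda x: x != 2, rebound))

# alias sees the change made through in_place; old still holds the original values.
same_object_after_slice = alias is in_place
same_object_after_rebind = old is rebound
```

In both forms the filtered values are produced as a new sequence first. The slice assignment then copies them into the existing list object, so every other reference to that list sees the change, while the plain assignment leaves the original list untouched.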

python lambda higher-order-functions python-2.7

mal*_*har

2016-09-03

Score: 4
Answers: 1
Views: 1122

How to convert a string to an ArrayType of dictionaries (JSON) in PySpark

Trying to cast a StringType to an ArrayType of JSON for a DataFrame generated from CSV.

Using pyspark on Spark2

The CSV file I am working with is as follows:

date,attribute2,count,attribute3
2017-09-03,'attribute1_value1',2,'[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]'
2017-09-04,'attribute1_value2',2,'[{"key":"value","key2":20},{"key":"value","key2":25},{"key":"value","key2":27}]'


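As shown above, "attribute3" holds a literal string that is technically a list of dictionaries (JSON) with an exact length of 2. (This is the output of the distinct function.)

Excerpt from printSchema()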

attribute3: string (nullable = true)


I tried to cast "attribute3" to ArrayType as follows

temp = dataframe.withColumn(
    "attribute3_modified",
    dataframe["attribute3"].cast(ArrayType())
)


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() takes at least 2 arguments (1 given)


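Indeed, ArrayType expects a data type as an argument. I tried "json", but it did not work.

Desired output: in the end, I need to convert attribute3 to an ArrayType() or a plain Python list. (I am trying to avoid using eval) …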
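One way to avoid eval is to parse the string with the standard json module. The sketch below does this in plain Python, outside Spark, and assumes the value arrives wrapped in single quotes as in the sample CSV rows. The function name parse_attribute3 is hypothetical.

```python
import json


def parse_attribute3(raw):
    """Parse a JSON array string into a list of dicts; a null value yields []."""
    if raw is None:
        return []
    text = raw.strip()
    # Assumption: the CSV wraps the JSON in single quotes, so strip one pair.
    if len(text) >= 2 and text[0] == "'" and text[-1] == "'":
        text = text[1:-1]
    return json.loads(text)


# First attribute3 value from the sample CSV, including its single quotes
sample = '\'[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]\''
parsed = parse_attribute3(sample)
```

Inside Spark, a function like this could be wrapped in a UDF with an explicit return type. Depending on the Spark version, pyspark.sql.functions.from_json with an ArrayType of StructType schema may be a built-in alternative.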

python pyspark pyspark-sql

mal*_*har

2018-08-07

Score: 4
Answers: 2
Views: 9293

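Redirecting stdout to dev/null does not work with the subprocess module in python3.6

Using Python 3.6.7 on Ubuntu 18.04.2 LTS

I am trying to call a shell script from a python script and expect stdout to be empty, i.e. I do not want any console output.

Program snippet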

def command_execution(self, cmd, cwd=None):
    """ Execute the command cmd without console output
    and return the exitcode
    """
    FNULL = open(os.devnull, 'w') # Method1
    self.log.debug("Executing command " +  cmd)
    exec_cmd = subprocess.Popen(cmd, cwd=cwd, shell=True,  stdout=subprocess.DEVNULL)
    # Method1 call exec_cmd = subprocess.Popen(cmd, cwd=cwd, shell=True,  stdout=FNULL)

    (_,_) = exec_cmd.communicate()
    exitcode = exec_cmd.returncode
    self.log.debug("Executed command {0} with exitcode {1}".format(cmd, exitcode))
    return exitcode


As mentioned above, I tried both the FNULL and the subprocess.DEVNULL approaches. However, I still see output on the console.

Am I missing something here?
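One possible cause, offered as an assumption because the post does not show what the shell script prints, is that the remaining output is written to stderr, which the snippet leaves attached to the console. The sketch below silences both streams. The function name run_quietly is hypothetical, and it passes the command as an argument list instead of using shell=True so the example runs on any platform.

```python
import subprocess
import sys


def run_quietly(cmd, cwd=None):
    """Run cmd with stdout and stderr discarded and return its exit code."""
    completed = subprocess.run(
        cmd,
        cwd=cwd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,  # stderr is not redirected in the original snippet
    )
    return completed.returncode


# Stand-in for the shell script: a child process that writes to both streams.
child = [
    sys.executable,
    "-c",
    "import sys; print('out'); print('err', file=sys.stderr); sys.exit(3)",
]
exitcode = run_quietly(child)
```

With Popen the same two keyword arguments apply, so the original call could keep shell=True and add stderr=subprocess.DEVNULL.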

python subprocess python-3.x python-3.6

mal*_*har

lucky-day

Score: 3
Answers: 1
Views: 2829

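Find and remove a string that starts and ends with specific substrings in python

I have a string like "dasdasdsafs[image : image name : image]vvfd gvdfvg dfvgd". From this string, I want to remove the part that starts with [image : and ends with : image]. I tried to find the substring using the following code: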

result = re.search('%s(.*)%s' % (start, end), st).group(1)


But it does not give me the desired result. Please help me find the correct way to remove the substring from the string.
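A minimal sketch of one way to do the removal with re.sub. The post does not show the values of start and end, so the sketch assumes they are "[image :" and ": image]". With those values the unescaped "[" is a regex metacharacter, which may be why the original search misbehaves, so re.escape is applied to both markers. The function name remove_between is hypothetical.

```python
import re


def remove_between(text, start, end):
    """Remove every span that begins with start and ends with end, markers included."""
    # re.escape makes characters such as "[" and "]" match literally;
    # the lazy ".*?" stops at the first closing marker.
    pattern = re.escape(start) + r".*?" + re.escape(end)
    return re.sub(pattern, "", text)


st = "dasdasdsafs[image : image name : image]vvfd gvdfvg dfvgd"
cleaned = remove_between(st, "[image :", ": image]")
```

The lazy quantifier matters when a string contains several marked sections: a greedy ".*" would remove everything from the first opening marker to the last closing one.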

python regex python-2.7 python-3.x

n.i*_*imp

2015-08-15

Score: 2
Answers: 2
Views: about 20,000
