Posts by Use*_*345

Using when and otherwise while converting boolean values to strings in Pyspark

I have a data frame in Pyspark:

df.show()


+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
|  1| sam|   null|      null| null|  true|
|  2| Ram|      Y|      0.05|   10| false|
|  3| Ian|      N|      0.01|    1| false|
|  4| Jim|      N|       1.2|    3|  true|
+---+----+-------+----------+-----+------+

The schema is as follows:

DataFrame[id: int, name: string, testing: string, avg_result: string, score: string, active: boolean]

I want to convert Y to True, N to False, true to True, and false to False.

When I do the following:

for col in cols:
    df = df.withColumn(col, f.when(f.col(col) == 'N', 'False').when(f.col(col) == 'Y', 'True'). …


apache-spark pyspark

Use*_*345

2018 07-03

4
votes

1
answers

20k
views

Running a shell script in parallel in bash/Linux

I have a shell script, job.sh.

Its contents are as follows:

#!/bin/bash

table=$1

sqoop job --exec ${table}

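Now when I run ./job.sh table1, the script executes successfully.

I have the table names in a file, tables.txt.

Now I want to loop over the tables.txt file and run the job.sh script 10 times in parallel.

How can I do that?

Ideally, when I run the script, I want it to execute as follows: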

./job.sh table1
./job.sh table2
./job.sh table3
./job.sh table4
./job.sh table5
./job.sh table6
./job.sh table7
./job.sh table8
./job.sh table9
./job.sh table10

What options are available?
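
One option, sketched here under assumptions rather than taken from the post: feed tables.txt to xargs and let its -P flag cap the number of concurrent jobs. The function name run_parallel and its arguments are hypothetical; -P is supported by GNU and BSD xargs but is not part of POSIX.

```bash
#!/bin/bash
# Hypothetical helper: run a script once per line of a list file,
# keeping at most max_jobs invocations running at the same time.
run_parallel() {
  local list_file=$1       # one table name per line, e.g. tables.txt
  local script=$2          # script to run for each name, e.g. ./job.sh
  local max_jobs=${3:-10}  # concurrency limit, defaults to 10

  # -n 1 passes one table name per invocation; -P caps the concurrency.
  xargs -n 1 -P "$max_jobs" "$script" < "$list_file"
}

# Usage (assumes tables.txt and an executable job.sh in the current directory):
#   run_parallel tables.txt ./job.sh 10
```

xargs exits with a non-zero status if any invocation fails, so the caller can still detect a failed job.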

linux bash shell cron crontab

Use*_*345

2017 04-27

3
votes

1
answers

474
views

Populate a new column when a list value matches a substring of a column value in a Pyspark data frame

I have a Pyspark data frame as shown below:

df.show()

+---+----------------------+
| id|                   con|
+---+----------------------+
|  3|           mac,mac pro|
|  1|        iphone5,iphone|
|  1| android,android phone|
|  1|    windows,windows pc|
|  1| spy camera,spy camera|
|  2|               camera,|
|  3|             cctv,cctv|
|  2|   apple iphone,iphone|
|  3|           ,spy camera|
+---+----------------------+

I want to populate a new column based on some lists. The lists are as follows:

phone_list = ['iphone', 'android', 'nokia']
pc_list = ['windows', 'mac']


Condition:

If an element in a list matches a string/substring in a column, then flag the column to the value of that particular …

python apache-spark pyspark

Use*_*345

lucky-day

3
votes

1
answers

2514
views

NOT LIKE statement in Hive

In MySQL I use the following statement to find the tables in a database whose names are not like a pattern.

show tables where `Tables_in_db` not like '%_table'

I can use the statement below with like to find tables in Hive:

show tables like '*table'

But I am unable to use a not like statement:

show tables where `Tables_in_db` not like '*_table'

Is there an equivalent statement in Hive?

mysql hive

Use*_*345

lucky-day

2
votes

1
answers

957
views

Get records from the last 7 days in Hive

I have a table in Hive as shown below. I want to select customer_id from this table where insertdate is current_date - 7 days.

Original table:

+------------------------+--------------+
|       insertdate       | customer_id  |
+------------------------+--------------+
| 2018-04-21 04:00:00.0  | 39550695     |
| 2018-04-22 04:00:00.0  | 38841612     |
| 2018-04-23 03:59:00.0  | 23100419     |
| 2018-04-24 03:58:00.0  | 39550688     |
| 2018-04-25 03:58:00.0  | 39550691     |
| 2018-05-12 03:57:00.0  | 39550685     |
| 2018-05-13 03:57:00.0  | 39550687     |
| 2018-05-14 03:57:00.0  | 39550677     |
| 2018-05-14 03:56:00.0  | 30254216     |
| 2018-05-14 03:56:00.0  | 39550668     |
+------------------------+--------------+ …


hive hiveql

Use*_*345

2018 05-16

2
votes

1
answers

8225
views

Partitioning on a Timestamp column in Pyspark DataFrames

I have a DataFrame in PySpark in the following format:

Date        Id  Name    Hours   Dno Dname
12/11/2013  1   sam     8       102 It
12/10/2013  2   Ram     7       102 It
11/10/2013  3   Jack    8       103 Accounts
12/11/2013  4   Jim     9       101 Marketing

I want to partition by Dno and save it as a table in Hive using the Parquet format.

df.write.saveAsTable(
    'default.testing', mode='overwrite', partitionBy='Dno', format='parquet')

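The query works fine and creates the table in Hive with Parquet input.

Now I want to partition by the year and month of the date column. The timestamp is a Unix timestamp.

How can we achieve this in PySpark? I have done it in Hive but cannot do it in PySpark.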

timestamp dataframe partition apache-spark pyspark

Use*_*345

2017 08-09

1
votes

1
answers

5547
views

Comparing multiple strings to one string in bash

I have a shell script like the one below. The condition is: if $table contains test, then small.sh executes, else big.sh.

if [[ "$table" =~ "test" ]]
then 
  echo "events"
else
  echo "history"
fi

Now I want to check whether the table name contains test, _test_and_results and success in if [[ "$table" =~ "test" ]].

How can I do that?

I tried the following:

if [[ "$table" =~ "test" && "$table" =~ "success" ]];
then
    echo "events"
else
    echo "history"
fi

But when I pass the table name abc1_success, it prints history instead of events.

What am I doing wrong here?
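
A likely cause, offered as a reading of the snippet rather than a confirmed answer: && requires the name to contain both test and success, and abc1_success contains only success. A minimal sketch of an either-or check uses regex alternation instead; the function name classify_table and the pattern variable are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print "events" if the table name contains
# "test" or "success", otherwise print "history".
classify_table() {
  local table=$1
  # Alternation matches either word. "_test_and_results" already
  # contains "test", so it needs no separate branch.
  local pattern='test|success'

  # The pattern must stay unquoted on the right-hand side of =~,
  # otherwise bash compares it as a literal string.
  if [[ "$table" =~ $pattern ]]; then
    echo "events"
  else
    echo "history"
  fi
}

# Usage:
#   classify_table abc1_success
```

Writing the two original tests with || instead of && would behave the same way; the single pattern just keeps the list of words in one place.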

linux bash

Use*_*345

lucky-day

1
votes

1
answers

2499
views

Pyspark data frame: convert false and true to 0 and 1

I have a data frame in Pyspark:

df.show()


+-----+-----+
|test1|test2|
+-----+-----+
|false| true|
| true| true|
| true|false|
|false| true|
|false|false|
|false|false|
|false|false|
| true| true|
|false|false|
+-----+-----+

I want to convert all false values in the data frame to 0 and true to 1.

I am doing the following:

df1 = df.withColumn('test1', F.when(df.test1 == 'false', 0).otherwise(1)).withColumn('test2', F.when(df.test2 == 'false', 0).otherwise(1))

I get the result I want, but I think there may be a better way to do this.

python apache-spark pyspark

Use*_*345

2018 06-27

1
votes

1
answers

5661
views

Remove spaces and numbers from a line up to the first character in the line

I have a test_file in Linux:

ags;'s


dkfprper


sdkl;d;;'s

ip access

 100 200 remark
 50 deny pdldsl;l;sd;;l;d
 permit eyuopopqwopq
 10 permit eteioe
 200 200 200 deny abc

remark aiii[dsigdfidflkfk

1 deny

Now I want to extract the lines that contain ip or remark or deny or permit.

I did the following:

grep -E 'ip|remark|deny|permit' test_file >> string_check

The result is as follows:

ip access
 100 200 remark
 50 deny pdldsl;l;sd;;l;d
 permit eyuopopqwopq
 10 permit eteioe
 200 200 200 deny abc
remark aiii[dsigdfidflkfk
1 deny

Now I want to remove all the numbers and spaces from each line until I get to ip or remark or deny or …
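
A minimal sketch of one way to do this, assuming that "numbers and spaces" means any leading run of digits and whitespace: pipe the grep output through sed and delete that run at the start of each line. The function name strip_leading_numbers is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: keep only lines containing one of the keywords,
# then strip leading whitespace and digits from each kept line.
strip_leading_numbers() {
  local input_file=$1
  grep -E 'ip|remark|deny|permit' "$input_file" |
    # ^ anchors at the line start; the bracket expression matches any
    # mix of whitespace and digits, so only the leading run is removed.
    sed -E 's/^[[:space:][:digit:]]+//'
}

# Usage (test_file as in the post):
#   strip_leading_numbers test_file > string_check
```

With the sample above, " 100 200 remark" becomes "remark" and "1 deny" becomes "deny", while lines that already start with a keyword are left unchanged.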

linux bash grep sed

Use*_*345

lucky-day

1
votes

1
answers

59
views

Remove duplicate records based on another column in pyspark

I have a data frame in pyspark like the one below.

df.show()
+---+----+
| id|test|
+---+----+
|  1|   Y|
|  1|   N|
|  2|   Y|
|  3|   N|
+---+----+

I want to delete a record when its id is duplicated and test is N.

Now when I query new_df:

new_df.show()
+---+----+
| id|test|
+---+----+
|  1|   Y|
|  2|   Y|
|  3|   N|
+---+----+

I am unable to figure out how to do this.

I did a groupBy on id with a count, but it gives only the id column and the count.

I did the following:

grouped_df = new_df.groupBy("id").count()

How can I achieve my desired result?

Edit

I have a data frame like the one below.

+-------------+--------------------+--------------------+
|           sn|              device|           attribute|
+-------------+--------------------+--------------------+
|4MY16A5602E0A|       Android Phone|                   N|
|4MY16A5W02DE8|       Android Phone|                   N|
|4MY16A5W02DE8|       Android Phone| …


apache-spark pyspark

Use*_*345

2018 05-05

0
votes

1
answers

1003
views

A better way to select all columns and join pyspark data frames

I have two data frames in pyspark. Their schemas are as follows:

df1 
DataFrame[customer_id: int, email: string, city: string, state: string, postal_code: string, serial_number: string]

df2 
DataFrame[serial_number: string, model_name: string, mac_address: string]

Now I want to do a full outer join of these data frames using coalesce.

I did the following and got the expected result.

full_df = df1.join(df2, df1.serial_number == df2.serial_number, 'full_outer').select(df1.customer_id, df1.email, df1.city, df1.state, df1.postal_code,  f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number'), df2.model_name, df2.mac_address)

Now I want to do the above differently. Instead of writing out all the column names in the select after the join, I want to use something like * on the data frame. Basically I want something like this:

full_df = df1.join(df2, df1.serial_number == df2.serial_number, 'full_outer').select('df1.*', f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number1'), df2.model_name, df2.mac_address).drop('serial_number')

I got what I wanted. Is there a better way to do this kind of operation in pyspark?