How to split a pyspark dataframe into two row-wise

Dat*_*101 8 python pyspark spark-dataframe databricks

I am working in Databricks.

I have a dataframe with 500 rows, and I would like to create two dataframes: one containing 100 rows and another containing the remaining 400 rows.

+--------------------+----------+
|              userid| eventdate|
+--------------------+----------+
|00518b128fc9459d9...|2017-10-09|
|00976c0b7f2c4c2ca...|2017-12-16|
|00a60fb81aa74f35a...|2017-12-04|
|00f9f7234e2c4bf78...|2017-05-09|
|0146fe6ad7a243c3b...|2017-11-21|
|016567f169c145ddb...|2017-10-16|
|01ccd278777946cb8...|2017-07-05|

I tried the following but received an error:

df1 = df[:99]
df2 = df[100:499]


TypeError: unexpected item type: <type 'slice'>

Mic*_*l N 9

Spark dataframes cannot be indexed the way you wrote. You could use the head method to take the top n rows. This returns a list of Row() objects, not a dataframe, so you can convert them back into a dataframe and use subtract against the original dataframe to get the remaining rows.

#Take the top 100 rows and convert them to a dataframe
#You also need to provide the schema to avoid inference errors
df1 = sqlContext.createDataFrame(df.head(100), df.schema)

#Take the rest of the rows
df2 = df.subtract(df1)

If you are using Spark 2.0+, you can also use a SparkSession instead of the sqlContext. Also, if you are not interested specifically in the first 100 rows and want a random split, you can use randomSplit like this:

df1,df2 = df.randomSplit([0.20, 0.80],seed=1234)
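As a minimal sketch of the SparkSession variant mentioned above (assuming a session named spark, which Databricks provides by default), the head/subtract split looks the same, just via spark.createDataFrame:

# Same head/subtract split, but through the SparkSession entry point
df1 = spark.createDataFrame(df.head(100), df.schema)
df2 = df.subtract(df1)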


pau*_*ult 8

Initially I misunderstood and thought you wanted to slice the columns. If you want to select a subset of rows, one method is to create an index column using monotonically_increasing_id(). From the docs:

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.

You can use this ID to sort the dataframe and subset it using limit() to ensure you get exactly the rows you want.

For example:

import pyspark.sql.functions as f
import string

# create a dummy df with 500 rows and 2 columns
N = 500
numbers = [i%26 for i in range(N)]
letters = [string.ascii_uppercase[n] for n in numbers]

df = sqlCtx.createDataFrame(
    zip(numbers, letters),
    ('numbers', 'letters')
)

# add an index column
df = df.withColumn('index', f.monotonically_increasing_id())

# sort ascending and take first 100 rows for df1
df1 = df.sort('index').limit(100)

# sort descending and take 400 rows for df2
df2 = df.sort('index', ascending=False).limit(400)

Just to verify that this did what you wanted:

df1.count()
#100
df2.count()
#400

Also we can verify that the index column doesn't overlap:

df1.select(f.min('index').alias('min'), f.max('index').alias('max')).show()
#+---+---+
#|min|max|
#+---+---+
#|  0| 99|
#+---+---+

df2.select(f.min('index').alias('min'), f.max('index').alias('max')).show()
#+---+----------+
#|min|       max|
#+---+----------+
#|100|8589934841|
#+---+----------+
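The large max in df2 is expected, since the generated IDs are not consecutive. If consecutive positions are needed, one alternative (a sketch of mine, not part of the original answer) is to assign a row_number over a window ordered by that index and split on it:

import pyspark.sql.functions as f
from pyspark.sql import Window

# consecutive 1..N positions ordered by the monotonic index
w = Window.orderBy('index')
df_numbered = df.withColumn('row_num', f.row_number().over(w))

df1 = df_numbered.filter(f.col('row_num') <= 100).drop('row_num')
df2 = df_numbered.filter(f.col('row_num') > 100).drop('row_num')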


Bal*_*ala 5

If I don't mind having the same rows in both dataframes, then I can use sample. For example, I have a dataframe with 354 rows.

>>> df.count()
354

>>> df.sample(False,0.5,0).count()  # approx. 50%
179

>>> df.sample(False,0.1,0).count()  # approx. 10%
34

Or, if I want to split strictly with no duplicates present, I could do:

df1 = df.limit(100)     # 100 rows
df2 = df.subtract(df1)  # remaining rows
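A quick sanity check of the strict split, assuming the 354-row dataframe from above contains no duplicate rows (subtract behaves like EXCEPT DISTINCT, so duplicates would change the counts):

>>> df1.count()
100
>>> df2.count()
254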