I have a pandas DataFrame:
df12 = pd.DataFrame({'group_ids':[1,1,1,2,2,2],'dates':['2016-04-01','2016-04-20','2016-04-28','2016-04-05','2016-04-20','2016-04-29'],'event_today_in_group':[1,0,1,1,1,0]})
   group_ids       dates  event_today_in_group
0          1  2016-04-01                     1
1          1  2016-04-20                     0
2          1  2016-04-28                     1
3          2  2016-04-05                     1
4          2  2016-04-20                     1
5          2  2016-04-29                     0
I want to compute an additional column that contains, for each group_ids value, the number of days since event_today_in_group was last 1.
   group_ids       dates  event_today_in_group  days_since_last_event
0          1  2016-04-01                     1                      0
1          1  2016-04-20                     0                     19
2          1  2016-04-28                     1                     27
3          2  2016-04-05                     1                      0
4          2  2016-04-20                     1                     15
5          2  2016-04-29                     0                      9
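One possible way to get the column above is sketched below. It assumes one row per date within each group, that "last event" means the most recent earlier row with event_today_in_group equal to 1 (an event on the current row does not count), and that rows with no earlier event get 0, which is what the expected output suggests. The intermediate names event_dates, previous_event and last_event are only illustrative.

```python
import pandas as pd

df12 = pd.DataFrame({
    'group_ids': [1, 1, 1, 2, 2, 2],
    'dates': ['2016-04-01', '2016-04-20', '2016-04-28',
              '2016-04-05', '2016-04-20', '2016-04-29'],
    'event_today_in_group': [1, 0, 1, 1, 1, 0],
})

# Parse the date strings and make sure rows are ordered within each group.
df12['dates'] = pd.to_datetime(df12['dates'])
df12 = df12.sort_values(['group_ids', 'dates']).reset_index(drop=True)

# Keep the date only on rows where an event happened; other rows become NaT.
event_dates = df12['dates'].where(df12['event_today_in_group'] == 1)

# Shift by one row inside each group so the current row's own event is ignored,
# then carry the most recent earlier event date forward within the group.
previous_event = event_dates.groupby(df12['group_ids']).shift()
last_event = previous_event.groupby(df12['group_ids']).ffill()

# Rows with no earlier event have NaT here; the expected output shows 0 for them.
df12['days_since_last_event'] = (
    (df12['dates'] - last_event).dt.days.fillna(0).astype(int)
)

print(df12)
```

If a group can have several rows on the same date, the rows would need to be aggregated per date first, or the shift step would need a different rule; this sketch does not cover that case.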
What does the maxIter parameter of LogisticRegression from pyspark.ml.classification do?
mlor = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight",
family="multinomial")
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import gc
import pandas as pd
import datetime
import numpy as np
import sys
APP_NAME = "DataFrameToCSV"
spark = SparkSession\
.builder\
.appName(APP_NAME)\
.config("spark.sql.crossJoin.enabled","true")\
.getOrCreate()
group_ids = [1,1,1,1,1,1,1,2,2,2,2,2,2,2]
dates = ["2016-04-01","2016-04-01","2016-04-01","2016-04-20","2016-04-20","2016-04-28","2016-04-28","2016-04-05","2016-04-05","2016-04-05","2016-04-05","2016-04-20","2016-04-20","2016-04-29"]
#event = [0,1,0,0,0,0,1,1,0,0,0,0,1,0]
event = [0,1,1,0,1,0,1,0,0,1,0,0,0,0]
dataFrameArr = np.column_stack((group_ids,dates,event))
df = pd.DataFrame(dataFrameArr,columns = ["group_ids","dates","event"])
The Python code above will run on a Spark cluster on gcloud Dataproc. I want to save the pandas DataFrame as a CSV file in the gcloud storage bucket at gs://mybucket/csv_data/
How can I do this?
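A minimal sketch of one option, under stated assumptions: pandas can write directly to a gs:// path through DataFrame.to_csv when the optional gcsfs package is installed and the cluster's credentials allow writing to the bucket. The helper name save_dataframe_as_csv and the file name data.csv are hypothetical. The example writes to a local temporary file so it can be run anywhere; the bucket call is left as a comment.

```python
import os
import tempfile

import pandas as pd


def save_dataframe_as_csv(df, target_path):
    """Write df as CSV to target_path, without the index column."""
    df.to_csv(target_path, index=False)


# A small frame with the same columns as in the question.
df = pd.DataFrame({
    "group_ids": [1, 1, 2],
    "dates": ["2016-04-01", "2016-04-20", "2016-04-05"],
    "event": [0, 1, 0],
})

# Local run, to check the output format.
local_path = os.path.join(tempfile.mkdtemp(), "data.csv")
save_dataframe_as_csv(df, local_path)

# On the Dataproc cluster, assuming gcsfs is installed and the job has
# write access to the bucket (data.csv is a made-up file name):
# save_dataframe_as_csv(df, "gs://mybucket/csv_data/data.csv")
```

This writes from the driver only, so it suits frames that fit in driver memory. For larger data, converting to a Spark DataFrame and using its CSV writer may be the better fit.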