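I am trying to build a data frame for drawing a network graph with the igraph package. I have sample data "mydata_data" and want to create "expected_data".
Counting the customers who visited a particular store is easy, but how do I count the customers common to store x1 and store x2, and so on for every pair of stores?
I have more than 500 stores, so I don't want to create the columns by hand. Sample data for reproducibility: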
mydata_data<-data.frame(
Customer_Name=c("A","A","C","D","D","B"),
Store_Name=c("x1","x2","x2","x2","x3","x1"))
expected_data<-data.frame(
Store_Name=c("x1","x2","x3","x1_x2","x2_x3","x1_x3"),
Customers_Visited=c(2,3,1,1,1,0))
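One possible way to get both the per-store counts and the pairwise common-customer counts without writing a column per store is to build a customer-by-store incidence matrix and take its cross-product. This is a minimal base R sketch, not necessarily how igraph users would do it; the names `incidence`, `co`, `pairs`, and `result` are made up for illustration.

```r
mydata_data <- data.frame(
  Customer_Name = c("A", "A", "C", "D", "D", "B"),
  Store_Name    = c("x1", "x2", "x2", "x2", "x3", "x1"))

# Customer-by-store 0/1 matrix; unique() guards against repeated visits
incidence <- unclass(table(unique(mydata_data)))

# Store-by-store matrix: the diagonal holds customers per store,
# the off-diagonal cells hold customers shared by two stores
co <- crossprod(incidence)

stores <- colnames(co)
pairs  <- t(combn(stores, 2))  # every unordered pair of stores, one per row

result <- rbind(
  data.frame(Store_Name = stores,
             Customers_Visited = as.numeric(diag(co))),
  data.frame(Store_Name = paste(pairs[, 1], pairs[, 2], sep = "_"),
             Customers_Visited = as.numeric(co[pairs])))
print(result)
```

The rows come out in `combn()` order (x1_x2, x1_x3, x2_x3), which differs from the row order in expected_data. With 500 stores there are 124,750 pairs, so it may be worth dropping the rows where Customers_Visited is 0 before plotting.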
I have tried the code below, and variations of it, to read all the files in an S3 folder, but nothing seems to work. Sensitive information/code has been removed from the script below. There are 6 files, each 6.5 GB.
#Spark Connection
sc<-spark_connect(master = "local" , config=config)
rd_1<-spark_read_csv(sc,name = "Retail_1",path = "s3a://mybucket/xyzabc/Retail_Industry/*/*",header = F,delimiter = "|")
# This is the S3 bucket/folder for files [One of the file names Industry_Raw_Data_000]
s3://mybucket/xyzabc/Retail_Industry/Industry_Raw_Data_000
This is the error I get:
Error: org.apache.spark.sql.AnalysisException: Path does not exist: s3a://mybucket/xyzabc/Retail_Industry/*/*;
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:710)
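One likely cause, assuming the files sit directly under Retail_Industry/ as the listed file name suggests, is that the pattern `*/*` asks for files one directory level deeper than where they actually are. The sketch below reproduces that layout in a local temporary directory and uses `Sys.glob()` as a stand-in for the Hadoop path glob; the local directory and the six file names are made up for illustration, and the commented sparklyr call is an untested suggestion, not a confirmed fix.

```r
# Local stand-in for the bucket layout: files directly under Retail_Industry/
root <- file.path(tempfile("bucket_"), "xyzabc", "Retail_Industry")
dir.create(root, recursive = TRUE)
invisible(file.create(file.path(root, sprintf("Industry_Raw_Data_%03d", 0:5))))

# "*/*" needs a subdirectory between Retail_Industry/ and the files
two_levels <- Sys.glob(file.path(root, "*", "*"))
# "*" matches the files where they actually are
one_level  <- Sys.glob(file.path(root, "*"))

print(length(two_levels))
print(basename(one_level))

# With sparklyr the corresponding change would be a single-level glob.
# memory = FALSE skips caching the table, which may help with 6 x 6.5 GB.
# rd_1 <- spark_read_csv(sc, name = "Retail_1",
#                        path = "s3a://mybucket/xyzabc/Retail_Industry/*",
#                        header = FALSE, delimiter = "|", memory = FALSE)
```

If the single-level glob still fails, the S3A credentials and the hadoop-aws package configuration would be the next things to check.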
I have a data frame (df) with device ID and local date columns. I want to assign a user ID to the device IDs that always appear together on all local dates. An example is given below:
device_id <- c("x1", "x1", "x1", "x2", "x2", "x3", "x3", "x3", "x4", "x4", "x5",
"x5", "x5", "x5", "x5", "x5", "x5", "x6", "x6", "x7", "x7", "x8",
"x8", "x9", "x9", "x9")
local_date <- c("2019-01-13", "2019-01-14", "2019-01-15", "2019-01-03", "2019-01-04",
"2019-01-10", "2019-01-11", "2019-01-12", "2019-01-11", "2019-01-12",
"2019-01-03", "2019-01-05", "2019-01-06", "2019-01-07", "2019-01-08",
"2019-01-13", "2019-01-23", "2019-01-03", "2019-01-04", "2019-10-23",
"2019-10-28", "2019-10-23", "2019-10-28", "2019-01-13", "2019-01-14",
"2019-01-15")
df <- data.frame(device_id, local_date)
df$local_date <- as.Date(df$local_date)
This is the data frame I want to create:
expected_df <- data.frame(device_id=c("x1", "x9", "x2", "x6", "x3", "x4", "x5", "x7", "x8"),
user_id=c(1, 1, 2, 2, …
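A minimal sketch of one way to do this, assuming "always appear together" means two devices have exactly the same set of local dates. That reading matches the visible x1/x9 and x2/x6 pairs, but the truncated remainder of expected_df is not reconstructed here, so the IDs for the other devices are this sketch's output and not the asker's. The names `dates_by_device`, `signature`, and `result` are made up for illustration.

```r
device_id <- c("x1", "x1", "x1", "x2", "x2", "x3", "x3", "x3", "x4", "x4", "x5",
               "x5", "x5", "x5", "x5", "x5", "x5", "x6", "x6", "x7", "x7", "x8",
               "x8", "x9", "x9", "x9")
local_date <- c("2019-01-13", "2019-01-14", "2019-01-15", "2019-01-03", "2019-01-04",
                "2019-01-10", "2019-01-11", "2019-01-12", "2019-01-11", "2019-01-12",
                "2019-01-03", "2019-01-05", "2019-01-06", "2019-01-07", "2019-01-08",
                "2019-01-13", "2019-01-23", "2019-01-03", "2019-01-04", "2019-10-23",
                "2019-10-28", "2019-10-23", "2019-10-28", "2019-01-13", "2019-01-14",
                "2019-01-15")
df <- data.frame(device_id, local_date)

# One "signature" per device: its sorted, de-duplicated dates as a single string
dates_by_device <- split(as.character(df$local_date), df$device_id)
signature <- vapply(dates_by_device,
                    function(d) paste(sort(unique(d)), collapse = ","),
                    character(1))

# Devices with identical signatures get the same user_id,
# numbered in order of first appearance
result <- data.frame(device_id = names(signature),
                     user_id   = match(signature, unique(signature)))
print(result[order(result$user_id), ], row.names = FALSE)
```

Under this definition x3 and x4 land in different groups because x3 has one extra date (2019-01-10). If partial overlap should also count as "together", the grouping rule would need to change.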