I am trying to create a dummy logistics dataset so that I can run some analysis, and possibly some prediction, on the data.
Assumed variables are as follows:
VARIABLES                RANGES
awb                      random number, e.g. 235533
destination_city         random cities
product                  different products
product_category         different categories
origin_city              random metro cities
logistics_provider_id    IDs, e.g. 1, 20, 28, 27
dispatch_date            datetime between Mar 01, 2015 and Mar 15, 2015
final_delivery_status    created, delivered, returned
actual_delivery_date     datetime between Mar 16, 2015 and Mar 30, 2015
promised_delivery_date   datetime between Mar 25, 2015 and Apr 06, 2015
So, given the variables above, I want to create dummy data within the stated ranges. How can I create this dummy data using Python?
Expected output:
example_dummy_data:
   awb        destination_city  product               product_category
1  104842891  Byatarayanapura   Wrangler Denim Jeans  Men's Clothing
2  104842938  Bareilly          Sky Blue Denim        Men's Clothing
3  104842942  Saharanpur        puma shoes            Men's …
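One possible way to generate such rows is sketched below, using only the Python standard library. The value pools (`DESTINATION_CITIES`, `ORIGIN_CITIES`, `PRODUCTS`) and the helper names `random_datetime` and `make_dummy_rows` are assumptions made for illustration; the provider IDs, statuses, and date ranges are taken from the variable list above. Treat it as a starting point rather than a definitive implementation.

```python
import random
from datetime import datetime, timedelta

# Placeholder value pools (assumptions for illustration; replace with real lists).
DESTINATION_CITIES = ["Byatarayanapura", "Bareilly", "Saharanpur"]
ORIGIN_CITIES = ["Mumbai", "Delhi", "Bengaluru", "Chennai"]  # hypothetical metro cities
PRODUCTS = {  # product -> product_category
    "Wrangler Denim Jeans": "Men's Clothing",
    "Sky Blue Denim": "Men's Clothing",
    "Running Shoes": "Footwear",  # hypothetical entry
}
PROVIDER_IDS = [1, 20, 28, 27]
STATUSES = ["created", "delivered", "returned"]


def random_datetime(rng, start, end):
    """Return a uniformly random datetime in [start, end], at one-second resolution."""
    span_seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=rng.randint(0, span_seconds))


def make_dummy_rows(n, seed=None):
    """Build n dummy shipment records as a list of dicts."""
    rng = random.Random(seed)
    # Sampling without replacement keeps the six-digit awb numbers unique.
    awbs = rng.sample(range(100_000, 1_000_000), n)
    rows = []
    for awb in awbs:
        product = rng.choice(list(PRODUCTS))
        rows.append({
            "awb": awb,
            "destination_city": rng.choice(DESTINATION_CITIES),
            "product": product,
            "product_category": PRODUCTS[product],
            "origin_city": rng.choice(ORIGIN_CITIES),
            "logistics_provider_id": rng.choice(PROVIDER_IDS),
            "dispatch_date": random_datetime(
                rng, datetime(2015, 3, 1), datetime(2015, 3, 15)),
            "final_delivery_status": rng.choice(STATUSES),
            "actual_delivery_date": random_datetime(
                rng, datetime(2015, 3, 16), datetime(2015, 3, 30)),
            "promised_delivery_date": random_datetime(
                rng, datetime(2015, 3, 25), datetime(2015, 4, 6)),
        })
    return rows


rows = make_dummy_rows(5, seed=42)
for row in rows:
    print(row)
```

If pandas is available, `pandas.DataFrame(rows)` should turn the list of dicts into a table with one column per variable. Note that each end date here means midnight at the start of that day; widen the bounds if the whole final day should be included.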
I have a column with the values shown below. For each row, I want to draw a random value within plus or minus 20% of that row's value and assign it to another column.
Sample data
benchmark
1 100
2 200
3 250
4 400
5 150
6 1000
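Now I want to create a variable named value in the same data by adding, for each row, one random number between -20% and +20% of that row's benchmark value.
Expected output: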
benchmark value
1 100 87
2 200 213
3 250 255
4 400 320
5 150 180
6 1000 900
The snippet below shows my attempt; it works as expected, but it takes too long to run:
for (i in 1:nrow(sample_data)) {
  sample_data$value[i] = sample_data$benchmark[i] +
    runif(1, min = -0.2 * sample_data$benchmark[i], max = 0.2 * sample_data$benchmark[i])
}
How can I improve the performance of this code?
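The loop is slow mainly because it draws one random number and assigns one element per iteration. Adding a uniform draw from [-0.2·b, 0.2·b] to b is the same as multiplying b by a uniform factor in [0.8, 1.2], so the whole column can be computed in one pass. In R this likely comes down to a single `runif` call with vector `min` and `max` arguments instead of the loop. The sketch below shows the same arithmetic in Python with the standard library; the function name `add_noise` and its `pct` parameter are assumptions made for illustration.

```python
import random


def add_noise(benchmark, pct=0.2, seed=None):
    """Scale each benchmark by a uniform random factor in [1 - pct, 1 + pct]."""
    rng = random.Random(seed)
    # b * U(1 - pct, 1 + pct) has the same distribution as b + U(-pct * b, pct * b).
    return [b * rng.uniform(1 - pct, 1 + pct) for b in benchmark]


benchmark = [100, 200, 250, 400, 150, 1000]
value = add_noise(benchmark, seed=0)
for b, v in zip(benchmark, value):
    # Round only for display, to match the integer values in the expected output.
    print(b, round(v))
```

Every generated value falls within 20% of its benchmark, and the factor is drawn independently for each row.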