I am trying to do a left join between two data frames in Deedle. Examples of the two data frames are below:
open Deedle

let workOrders =
    Frame.ofColumns [
        "workOrderCode" =?> series [ (20050,20050); (20051,20051); (20060,20060) ]
        "workOrderDescription" =?> series [ (20050,"Door Repair"); (20051,"Lift Replacement"); (20060,"Window Cleaning") ]]

// This fails with a "Duplicate key" exception because of the duplicate work order codes
let workOrderScores =
    Frame.ofColumns [
        "workOrderCode" => series [ (20050,20050); (20050,20050); (20051,20051) ]
        "runTime" => series [ (20050,20100112); (20050,20100130); (20051,20100215) ]
        "score" => series [ (20050,100); (20050,120); (20051,80) ]]

Frame.join JoinKind.Outer workOrders workOrderScores
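The problem is that Deedle will not let me create a data frame with a non-unique index, and I get the following error: System.ArgumentException: Duplicate key '20050'. Duplicate keys are not allowed in the index.
Interestingly, in Python/Pandas I can do the following, which works perfectly. How can I reproduce this result in Deedle? I am thinking that I might have to flatten the second data frame to remove the duplicates, then join, and then unpivot/unstack it?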
import pandas as pd

workOrders = pd.DataFrame(
    {'workOrderCode': [20050, 20051, 20060],
     'workOrderDescription': ['Door Repair', 'Lift Replacement', 'Window Cleaning']})

workOrderScores = pd.DataFrame(
    {'workOrderCode': [20050, 20050, 20051],
     'runTime': [20100112, 20100130, 20100215],
     'score': [100, 120, 80]})

pd.merge(workOrders, workOrderScores, on='workOrderCode', how='left')

# Result:
#    workOrderCode  workOrderDescription   runTime  score
# 0          20050           Door Repair  20100112    100
# 1          20050           Door Repair  20100130    120
# 2          20051      Lift Replacement  20100215     80
# 3          20060       Window Cleaning       NaN    NaN
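This is a great question. I have to admit that there is currently no elegant way to do this with Deedle. Could you please submit an issue on GitHub so that we keep track of this and add some solution?
As you say, Deedle does not currently let you have duplicate values in the keys. Note, though, that your Pandas solution does not use duplicate keys either: you simply rely on the fact that Pandas lets you specify the column to use when joining (and I think this would be a great addition to Deedle).
Here is one way to do what you want, but it is not very nice. I think using pivoting would be another option (there is a nice pivot table function in the latest source code, not yet on NuGet).
I used Frame.groupRowsByInt and Frame.nest to turn your data frames into series grouped by workOrderCode (each item now contains a frame with all rows that have the same work order code):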
open Deedle

let workOrders =
    Frame.ofColumns [
        "workOrderCode" =?> Series.ofValues [ 20050; 20051; 20060 ]
        "workOrderDescription" =?> Series.ofValues [ "Door Repair"; "Lift Replacement"; "Window Cleaning" ]]
    |> Frame.groupRowsByInt "workOrderCode"
    |> Frame.nest

let workOrderScores =
    Frame.ofColumns [
        "workOrderCode" => Series.ofValues [ 20050; 20050; 20051 ]
        "runTime" => Series.ofValues [ 20100112; 20100130; 20100215 ]
        "score" => Series.ofValues [ 100; 120; 80 ]]
    |> Frame.groupRowsByInt "workOrderCode"
    |> Frame.nest
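To make the grouping step easier to picture, here is a rough sketch of the same idea in plain Python (the language of the Pandas example in the question). It is not Deedle code: `group_rows_by` is a hypothetical helper, and rows are modelled as dictionaries rather than frames. Each key ends up holding all rows that share one work order code, which is roughly what nesting a grouped frame gives you.

```python
from collections import defaultdict


def group_rows_by(rows, key):
    """Group a list of row dicts by the value stored in column `key`.

    Hypothetical helper: a plain-Python stand-in for grouping and nesting,
    not part of Deedle or pandas.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


work_order_scores = [
    {"workOrderCode": 20050, "runTime": 20100112, "score": 100},
    {"workOrderCode": 20050, "runTime": 20100130, "score": 120},
    {"workOrderCode": 20051, "runTime": 20100215, "score": 80},
]

# The duplicate code 20050 is no longer a problem: it is a single key
# whose value holds both of its rows.
scores_by_code = group_rows_by(work_order_scores, "workOrderCode")
for code, rows in scores_by_code.items():
    print(code, len(rows))
```

Once the data is in this shape, the keys are unique, so the two sides can be aligned by key.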
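Now we can join the two series (because their keys are the work order codes). However, you then get one or two data frames for each joined order code, and outer joining the rows of the two frames takes quite a lot of work: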
// Join the two series to align frames with the same work order code
Series.zip workOrders workOrderScores
|> Series.map(fun _ (orders, scores) ->
    match orders, scores with
    | OptionalValue.Present s1, OptionalValue.Present s2 ->
        // There is a frame with some rows with the specified code in both
        // work orders and work order scores - we return a cross product of their rows
        [ for r1 in s1.Rows.Values do
            for r2 in s2.Rows.Values do
              // Drop workOrderCode from one series (they are the same in both)
              // and append the two rows & return that as the result
              yield Series.append r1 (Series.filter (fun k _ -> k <> "workOrderCode") r2) ]
        |> Frame.ofRowsOrdinal
    // If left or right value is missing, we just return the columns
    // that are available (others will be filled with NaN)
    | OptionalValue.Present s, _
    | _, OptionalValue.Present s -> s)
|> Frame.unnest
|> Frame.indexRowsOrdinally
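The logic of that join (align by key, take the cross product of the rows when both sides have the key, otherwise keep whichever side is present) can be sketched in plain Python. This is only an illustration under stated assumptions, not Deedle's implementation: `outer_join_groups` is a hypothetical name, rows are dictionaries, the inputs are already grouped by work order code, and `None` stands in for the missing values that the frame would show as NaN.

```python
def outer_join_groups(left, right, key):
    """Outer join two {key value: [row dicts]} mappings.

    Hypothetical helper, not a Deedle or pandas API.
    """
    joined = []
    for code in sorted(left.keys() | right.keys()):
        left_rows = left.get(code)
        right_rows = right.get(code)
        if left_rows and right_rows:
            # Key present on both sides: cross product of the rows,
            # dropping the key column from the right-hand row.
            for l_row in left_rows:
                for r_row in right_rows:
                    merged = dict(l_row)
                    merged.update({k: v for k, v in r_row.items() if k != key})
                    joined.append(merged)
        else:
            # Key present on one side only: keep the rows that exist.
            joined.extend(dict(row) for row in (left_rows or right_rows))

    # Collect the column names in first-seen order and fill gaps with None.
    columns = []
    for row in joined:
        for name in row:
            if name not in columns:
                columns.append(name)
    return [{name: row.get(name) for name in columns} for row in joined]


# Inputs already grouped by work order code (illustrative literals).
orders_by_code = {
    20050: [{"workOrderCode": 20050, "workOrderDescription": "Door Repair"}],
    20051: [{"workOrderCode": 20051, "workOrderDescription": "Lift Replacement"}],
    20060: [{"workOrderCode": 20060, "workOrderDescription": "Window Cleaning"}],
}
scores_by_code = {
    20050: [
        {"workOrderCode": 20050, "runTime": 20100112, "score": 100},
        {"workOrderCode": 20050, "runTime": 20100130, "score": 120},
    ],
    20051: [{"workOrderCode": 20051, "runTime": 20100215, "score": 80}],
}

result = outer_join_groups(orders_by_code, scores_by_code, "workOrderCode")
for row in result:
    print(row)
```

With this sample data every code on the right also appears on the left, so the outer join produces the same four rows as the left join in the question.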
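This might be slow (especially in the NuGet version). If you work with more data, please try building the latest version of Deedle from source (and if that does not help, please submit an issue, as we should look into this!).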