Julia DataFrames: Replace entries in a DataFrame based on a comparison with another DataFrame

jn_*_*_br 5 comparison replace julia dataframes.jl

I have the following DataFrames:

using DataFrames

df1 = DataFrame(
    col_A = [1, 2, 3, 4, 5, 6, 7],
    col_B = ["A", "B", "C", "D", "E", "F", "G"],
    col_C = missing,
)

7×3 DataFrame
 Row │ col_A  col_B   col_C
     │ Int64  String  Missing
─────┼────────────────────────
   1 │     1  "A"     missing
   2 │     2  "B"     missing
   3 │     3  "C"     missing
   4 │     4  "D"     missing
   5 │     5  "E"     missing
   6 │     6  "F"     missing
   7 │     7  "G"     missing

df2 = DataFrame(
    col_X = [1, 2, 3, 4, 5, 5],
    col_Y = ["A", "nope", "C", "nope", "E", "E"],
    col_Z = ["First", "Second", "Third", "Fourth", "Fifth", "Duplicated"]
)

6×3 DataFrame
 Row │ col_X  col_Y   col_Z
     │ Int64  String  String
─────┼───────────────────────────
   1 │     1  "A"     "First"
   2 │     2  "nope"  "Second"
   3 │     3  "C"     "Third"
   4 │     4  "nope"  "Fourth"
   5 │     5  "E"     "Fifth"
   6 │     5  "E"     "Duplicated"

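I need to efficiently replace the values of df1.col_C with the values of df2.col_Z wherever the composite key formed by the first two columns matches in both DataFrames (for example, (1, "A") appears in both, but (2, "B") does not), and leave them unchanged otherwise. If a composite key is duplicated, take its last occurrence in df2.

So df1 would become: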

7×3 DataFrame
 Row │ col_A  col_B   col_C
     │ Int64  String  String?
─────┼───────────────────────────
   1 │     1  "A"     "First"
   2 │     2  "B"     missing
   3 │     3  "C"     "Third"
   4 │     4  "D"     missing
   5 │     5  "E"     "Duplicated"
   6 │     6  "F"     missing
   7 │     7  "G"     missing

小智 6

Using the InMemoryDatasets package:

using InMemoryDatasets

# Note: df2's columns are renamed here so that both datasets share the key
# columns col_A and col_B and the value column col_C.
df1 = Dataset(
    col_A = [1, 2, 3, 4, 5, 6, 7],
    col_B = ["A", "B", "C", "D", "E", "F", "G"],
    col_C = missings(String, 7),
)
df2 = Dataset(
    col_A = [1, 2, 3, 4, 5, 5],
    col_B = ["A", "nope", "C", "nope", "E", "E"],
    col_C = ["First", "Second", "Third", "Fourth", "Fifth", "Duplicated"]
)
update!(df1, df2, on = [:col_A, :col_B])


Dan*_*etz 0

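Does this do what you need? (It compares the two DataFrames row by row, so it assumes they have the same number of rows in the same order.)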

julia> df1.col_C .= ifelse.(df1.col_A .== df2.col_X .&& 
                            df1.col_B .== df2.col_Y, 
                            df2.col_Z, missing)
5-element Vector{Union{Missing, String}}:
 "First"
 missing
 missing
 missing
 "Fifth"

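Building on Bogumił's answer, I think:

# Map each composite key (col_X, col_Y) to its col_Z value; when a key is
# duplicated, the later pair overwrites the earlier one.
mapping = Dict(zip(df2.col_X, df2.col_Y) .=> df2.col_Z)
# Look up each df1 key, falling back to missing when there is no match.
df1.col_C = [get(mapping, k, missing) 
  for k in zip(df1.col_A, df1.col_B)]

would solve the misaligned-DataFrame problem you mentioned in your comment.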
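To make the "last occurrence wins" behaviour of the dictionary approach easier to see, here is a minimal sketch that uses plain vectors in place of the DataFrame columns, so it runs without any packages. The names `lookup` and `new_C` are only illustrative. Unlike the snippet above, it falls back to the existing `col_C` value rather than `missing`, which is one way to read the question's "leave unchanged otherwise" requirement when `col_C` already holds data.

```julia
# Plain vectors standing in for the columns of df1 and df2 from the question.
col_A = [1, 2, 3, 4, 5, 6, 7]
col_B = ["A", "B", "C", "D", "E", "F", "G"]
col_C = Vector{Union{Missing, String}}(missing, 7)

col_X = [1, 2, 3, 4, 5, 5]
col_Y = ["A", "nope", "C", "nope", "E", "E"]
col_Z = ["First", "Second", "Third", "Fourth", "Fifth", "Duplicated"]

# Build the lookup explicitly: assigning to an existing key overwrites it,
# so for a duplicated key the last row of df2 is the one that survives.
lookup = Dict{Tuple{Int, String}, String}()
for (x, y, z) in zip(col_X, col_Y, col_Z)
    lookup[(x, y)] = z
end

# Replace only where the composite key matches; otherwise keep the old value.
new_C = [get(lookup, key, old) for (key, old) in zip(zip(col_A, col_B), col_C)]
```

With real DataFrames the same comprehension can be assigned back with `df1.col_C = new_C`. Building the dictionary is a single pass over df2 and the lookup is a single pass over df1, so the row order and row counts of the two tables do not need to agree.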