jn_*_*_br 5 comparison replace julia dataframes.jl
I have the following data frames:

df1 = DataFrame(
    col_A = [1, 2, 3, 4, 5, 6, 7],
    col_B = ["A", "B", "C", "D", "E", "F", "G"],
    col_C = missing,
)

7×3 DataFrame
 Row │ col_A  col_B   col_C
     │ Int64  String  Missing
─────┼────────────────────────
   1 │     1  "A"     missing
   2 │     2  "B"     missing
   3 │     3  "C"     missing
   4 │     4  "D"     missing
   5 │     5  "E"     missing
   6 │     6  "F"     missing
   7 │     7  "G"     missing

df2 = DataFrame(
    col_X = [1, 2, 3, 4, 5, 5],
    col_Y = ["A", "nope", "C", "nope", "E", "E"],
    col_Z = ["First", "Second", "Third", "Fourth", "Fifth", "Duplicated"]
)

6×3 DataFrame
 Row │ col_X  col_Y   col_Z
     │ Int64  String  String
─────┼─────────────────────────────
   1 │     1  "A"     "First"
   2 │     2  "nope"  "Second"
   3 │     3  "C"     "Third"
   4 │     4  "nope"  "Fourth"
   5 │     5  "E"     "Fifth"
   6 │     5  "E"     "Duplicated"

I need to efficiently replace the values of df1.col_C with the values of df2.col_Z wherever there is a match on, say, a composite key made of the first two columns of each data frame (for example, (1, "A") occurs in both, but (2, "B") does not), and leave them unchanged otherwise. If a composite key is duplicated, take its last occurrence in df2.

So df1 would become:

7×3 DataFrame
 Row │ col_A  col_B   col_C
     │ Int64  String  String?
─────┼─────────────────────────────
   1 │     1  "A"     "First"
   2 │     2  "B"     missing
   3 │     3  "C"     "Third"
   4 │     4  "D"     missing
   5 │     5  "E"     "Duplicated"
   6 │     6  "F"     missing
   7 │     7  "G"     missing
小智 6
Using the InMemoryDatasets package:
using InMemoryDatasets
df1 = Dataset(
col_A = [1, 2, 3, 4, 5, 6, 7],
col_B = ["A", "B", "C", "D", "E", "F", "G"],
col_C = missings(String, 7),
)
df2 = Dataset(
col_A = [1, 2, 3, 4, 5, 5],
col_B = ["A", "nope", "C", "nope", "E", "E"],
col_C = ["First", "Second", "Third", "Fourth", "Fifth", "Duplicated"]
)
update!(df1, df2, on = [:col_A, :col_B])
Does this do what you need?
julia> df1.col_C .= ifelse.(df1.col_A .== df2.col_X .&&
df1.col_B .== df2.col_Y,
df2.col_Z, missing)
5-element Vector{Union{Missing, String}}:
"First"
missing
missing
missing
"Fifth"
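Note that the broadcast comparison above is positional: row i of df1 is only ever compared with row i of df2. A minimal sketch of what that implies, using plain vectors as stand-ins for the key columns (the variable names here are illustrative, not from the answer):

```julia
# Plain vectors standing in for the key columns in the question.
col_A = [1, 2, 3, 4, 5, 6, 7]
col_X = [1, 2, 3, 4, 5, 5]

# Broadcasting pairs elements by position, so vectors of length 7 and 6
# cannot be combined and the comparison throws.
mismatch = try
    col_A .== col_X
    false
catch err
    err isa DimensionMismatch
end

# Even with equal lengths, a key is only "found" when it sits in the same row.
left  = [1, 2, 3]
right = [3, 2, 1]              # same keys, different order
positional = left .== right    # only the middle row lines up
```

So this approach fits only when both data frames have the same number of rows in the same key order.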
Building on Bogumił's answer, I think:
mapping = Dict(zip(df2.col_X, df2.col_Y) .=> df2.col_Z)
df1.col_C = [get(mapping, k, missing)
for k in zip(df1.col_A, df1.col_B)]
This will solve the problem of the misaligned dfs that you mentioned in your comment.
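For reference, here is a fuller sketch of the same dictionary lookup run against the data from the question. It assumes DataFrames.jl is installed. It differs from the snippet above in one respect, which is my own variation rather than part of the answer: the fallback passed to get is the row's current col_C value instead of missing, so unmatched rows keep whatever they already hold.

```julia
using DataFrames

df1 = DataFrame(
    col_A = [1, 2, 3, 4, 5, 6, 7],
    col_B = ["A", "B", "C", "D", "E", "F", "G"],
    # Typed so the column can later hold strings as well as missing.
    col_C = Vector{Union{Missing, String}}(missing, 7),
)
df2 = DataFrame(
    col_X = [1, 2, 3, 4, 5, 5],
    col_Y = ["A", "nope", "C", "nope", "E", "E"],
    col_Z = ["First", "Second", "Third", "Fourth", "Fifth", "Duplicated"],
)

# Build the lookup keyed by (col_X, col_Y). Later pairs overwrite earlier
# ones, so the last occurrence of a duplicated key in df2 wins.
mapping = Dict(zip(df2.col_X, df2.col_Y) .=> df2.col_Z)

# Look up each (col_A, col_B) key; fall back to the current value when absent.
df1.col_C = [get(mapping, (a, b), c)
             for (a, b, c) in zip(df1.col_A, df1.col_B, df1.col_C)]
```

Because the lookup goes through a Dict, the two data frames do not need to have the same number of rows or the same row order.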