如何重新排列两个数据帧之间的匹配顺序

Question

如何重新排列两个数据帧之间的匹配顺序

自昨晚以来我一直忙于这个问题,我无法弄清楚如何去做.

我想要做的是将df1字符串与df2字符串匹配并获得类似的字符串

我做的就像这样

# a function to arrange the data to have IDs for each string 
    normalize <- function(x, delim) {
      x <- gsub(")", "", x, fixed=TRUE)
      x <- gsub("(", "", x, fixed=TRUE)
      idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
      names <- unlist(strsplit(as.character(x), delim))
      return(setNames(idx, names))
    }

# a function to arrange the second df  
lookup <- normalize(df2[,1], ",")

# a function to match them and give the IDs
process <- function(s) {
  lookup_try <- lookup[names(s)]
  found <- which(!is.na(lookup_try))
  pos <- lookup_try[names(s)[found]]
  return(paste(s[found], pos, sep="-"))
  #change the last line to "return(as.character(pos))" to get only the result as in the comment
}

Run Code Online (Sandbox Code Playgroud)

然后我得到这样的结果

res <- lapply(colnames(df1), function(x) process(normalize(df1[,x], ";")))

Run Code Online (Sandbox Code Playgroud)

这给了我df1中每个字符串的行号和匹配的df2的字符串的行号.所以这个数据的输出看起来像这样

> res
$s1
[1] "3-4" "4-1" "5-4"

$s2
[1] "2-4"  "3-15" "7-16"

Run Code Online (Sandbox Code Playgroud)

第一列ID是与df1中的字符串匹配的df2的行号.第二列No是匹配的次数第三列ID-col-n是df1中与该字符串匹配的字符串的行号+他们的列名称第四列是df1的第一列中与该字符串匹配的字符串第五列是与该字符串匹配的第二列的字符串,依此类推

Answer 1

Nic*_*icE 6

在这种情况下,我发现将数据切换到宽格式并将其合并到查找表之前更容易.

你可以尝试:

library(tidyr)
library(dplyr)
df1_tmp <- df1
df2_tmp <- df2
#add numerical id to df1_tmp to keep row information
df1_tmp$id <- seq_along(df1_tmp[,1])

#switch to wide and unnest rows with several strings
df1_tmp <- gather(df1_tmp,key="s_val",value="query_string",-id)
df1_tmp <- df1_tmp %>% 
        mutate(query_string = strsplit(as.character(query_string), ";")) %>% 
        unnest(query_string)


df2_tmp$IDs. <- gsub("[()]", "", df2_tmp$IDs.)

#add numerical id to df1_tmp to keep row information
df2_tmp$id <- seq_along(df2_tmp$IDs.)

#unnest rows with several strings
df2_tmp <- df2_tmp %>% 
        mutate(IDs. = strsplit(as.character(IDs.), ",")) %>% 
        unnest(IDs.)

res <- merge(df1_tmp,df2_tmp,by.x="query_string",by.y="IDs.")

res$ID_col_n <- paste(paste0(res$id.x,res$s_val))
res$total_id <- 1:nrow(res)
res <- spread(res,s_val,value=query_string,fill=NA)
res
#summarize to get required output 

res <- res %>% group_by(id.y) %>%
        mutate(No=n())  %>% group_by(id.y,No) %>%
        summarise_each(funs(paste(.[!is.na(.)],collapse=","))) %>% 
        select(-id.x,-total_id)

colnames(res)[colnames(res)=="id.y"]<-"IDs"

res$df1_colMatch_counts <- rowSums(res[,-(1:3)]!="")
df2_counts <- df2_tmp %>% group_by(id) %>% summarize(df2_string_counts=n())
res <- merge(res,df2_counts,by.x="IDs",by.y="id")
res


res

  IDs No    ID_col_n            s1     s2 df1_colMatch_counts df2_string_counts
1   1  1         4s1        P41182                          1                 2
2   2  1         4s1        P41182                          1                 2
3   3  1         4s1        P41182                          1                 2
4   4  3 2s2,3s1,5s1 Q9Y6Q9,Q09472 Q92831                   2                 4
5  15  1         3s2               P54612                   1                 5
6  16  1         7s2               O15143                   1                 7

Run Code Online (Sandbox Code Playgroud)

归档时间：	9 年，11 月前
查看次数：	731 次
最近记录：	9 年，7 月前