Loops and clustering

Sha*_*ver 5 algorithm r

I have to admit this is too hard for me. I need to analyze some data, and this step is crucial for me.

The data I want to analyze:

> dput(tbl_clustering)
structure(list(P1 = structure(c(14L, 14L, 6L, 6L, 6L, 19L, 15L, 
13L, 13L, 13L, 13L, 10L, 10L, 6L, 6L, 10L, 27L, 27L, 27L, 27L, 
27L, 22L, 22L, 22L, 21L, 21L, 21L, 27L, 27L, 27L, 27L, 21L, 21L, 
21L, 28L, 28L, 25L, 25L, 25L, 29L, 29L, 17L, 17L, 17L, 5L, 5L, 
5L, 5L, 20L, 20L, 23L, 23L, 23L, 23L, 7L, 26L, 26L, 24L, 24L, 
24L, 24L, 3L, 3L, 3L, 9L, 8L, 2L, 11L, 11L, 11L, 11L, 11L, 12L, 
12L, 4L, 4L, 4L, 1L, 1L, 1L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 
18L, 18L, 18L, 18L, 16L, 16L, 16L, 16L, 16L, 16L, 16L), .Label = c("AT1G09130", 
"AT1G09620", "AT1G10760", "AT1G14610", "AT1G43170", "AT1G58080", 
"AT2G27680", "AT2G27710", "AT3G03710", "AT3G05590", "AT3G11510", 
"AT3G56130", "AT3G58730", "AT3G61540", "AT4G03520", "AT4G22930", 
"AT4G33030", "AT5G01600", "AT5G04710", "AT5G17990", "AT5G19220", 
"AT5G43940", "AT5G63310", "ATCG00020", "ATCG00380", "ATCG00720", 
"ATCG00770", "ATCG00810", "ATCG00900"), class = "factor"), P2 = structure(c(55L, 
54L, 29L, 4L, 70L, 72L, 18L, 9L, 58L, 68L, 19L, 6L, 1L, 16L, 
34L, 32L, 77L, 12L, 61L, 41L, 71L, 73L, 50L, 11L, 69L, 22L, 60L, 
42L, 47L, 45L, 59L, 30L, 24L, 23L, 77L, 45L, 12L, 47L, 59L, 82L, 
75L, 40L, 26L, 83L, 81L, 47L, 36L, 45L, 2L, 65L, 11L, 38L, 13L, 
31L, 53L, 78L, 7L, 80L, 79L, 7L, 76L, 17L, 10L, 3L, 68L, 51L, 
48L, 62L, 58L, 64L, 68L, 74L, 63L, 14L, 57L, 33L, 56L, 39L, 52L, 
35L, 43L, 25L, 27L, 21L, 15L, 5L, 49L, 37L, 66L, 20L, 44L, 69L, 
22L, 67L, 57L, 8L, 46L, 28L), .Label = c("AT1G01090", "AT1G02150", 
"AT1G03870", "AT1G09795", "AT1G13060", "AT1G14320", "AT1G15820", 
"AT1G17745", "AT1G20630", "AT1G29880", "AT1G29990", "AT1G43170", 
"AT1G52340", "AT1G52670", "AT1G56450", "AT1G59900", "AT1G69830", 
"AT1G75330", "AT1G78570", "AT2G05840", "AT2G28000", "AT2G34590", 
"AT2G35040", "AT2G37020", "AT2G40300", "AT2G42910", "AT2G44050", 
"AT2G44350", "AT2G45440", "AT3G01500", "AT3G03980", "AT3G04840", 
"AT3G07770", "AT3G13235", "AT3G14415", "AT3G18740", "AT3G22110", 
"AT3G22480", "AT3G22960", "AT3G51840", "AT3G54210", "AT3G54400", 
"AT3G56090", "AT3G60820", "AT4G00100", "AT4G00570", "AT4G02770", 
"AT4G11010", "AT4G14800", "AT4G18480", "AT4G20760", "AT4G26530", 
"AT4G28750", "AT4G30910", "AT4G30920", "AT4G33760", "AT4G34200", 
"AT5G02500", "AT5G02960", "AT5G10920", "AT5G12250", "AT5G13120", 
"AT5G16390", "AT5G18380", "AT5G35360", "AT5G35590", "AT5G35630", 
"AT5G35790", "AT5G48300", "AT5G52100", "AT5G56030", "AT5G60160", 
"AT5G64300", "AT5G67360", "ATCG00160", "ATCG00270", "ATCG00380", 
"ATCG00540", "ATCG00580", "ATCG00680", "ATCG00750", "ATCG00820", 
"ATCG01110"), class = "factor"), No_Interactions = c(8L, 5L, 
5L, 9L, 7L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 5L, 8L, 6L, 
5L, 5L, 5L, 5L, 5L, 5L, 10L, 6L, 6L, 5L, 5L, 5L, 5L, 8L, 5L, 
5L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 
6L, 5L, 5L, 6L, 5L, 5L, 6L, 5L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 
5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 7L, 
8L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 7L, 5L, 5L, 
6L)), .Names = c("P1", "P2", "No_Interactions"), class = "data.frame", row.names = c(NA, 
-98L))

To explain better what I want to achieve, I'll paste a few rows here:

        P1        P2 No_Interactions
1  AT3G61540 AT4G30920               8
2  AT3G61540 AT4G30910               5
3  AT1G58080 AT2G45440               5
4  AT1G58080 AT1G09795               9
5  AT1G58080 AT5G52100               7
6  AT5G04710 AT5G60160               6
7  AT4G03520 AT1G75330               5
8  AT3G58730 AT1G20630               5
9  AT3G58730 AT5G02500               5
10 AT3G58730 AT5G35790               5

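First, a new column Cluster has to be created. After that we only look at the two columns P1 and P2. As you can see in the first row, we have two names, AT3G61540 and AT4G30920, and that is our starting point (I think a loop is necessary). We put the number 1 in the Cluster column. We take the name AT3G61540 and scan both columns P1 and P2; if we find this name again in a row other than the first, we put the number 1 in Cluster for that row as well. Next we take the second name from the first row, AT4G30920, and screen the whole data set in the same way.

The next step is to analyze the next row and do the same thing. In this case the name in P1 of the next row is exactly the same, which means we don't need to screen with it again, but the second name, AT4G30910, is different, so it would be great to screen with that name. The issue here is that this row should also belong to cluster 1. Cluster 2 starts at the third row, because there we have a completely new pair of names.

I know this is not an easy task and will probably have to be done in several steps.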
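The first step described above (create Cluster, take both names from row 1, mark every row that contains either of them) can be sketched in base R as follows. This is only an illustration on the ten rows printed above, typed in by hand; the names tbl, seed and hit are assumptions made for the example, and the single pass shown here does not yet follow chains of shared names.

```r
# First ten rows of the data shown above, entered by hand for illustration.
tbl <- data.frame(
  P1 = c("AT3G61540", "AT3G61540", "AT1G58080", "AT1G58080", "AT1G58080",
         "AT5G04710", "AT4G03520", "AT3G58730", "AT3G58730", "AT3G58730"),
  P2 = c("AT4G30920", "AT4G30910", "AT2G45440", "AT1G09795", "AT5G52100",
         "AT5G60160", "AT1G75330", "AT1G20630", "AT5G02500", "AT5G35790"),
  No_Interactions = c(8L, 5L, 5L, 9L, 7L, 6L, 5L, 5L, 5L, 5L),
  stringsAsFactors = FALSE
)

# New column; NA means "no cluster assigned yet".
tbl$Cluster <- NA_integer_

# Starting point: both names from the first row.
seed <- c(tbl$P1[1], tbl$P2[1])

# Rows in which either column contains one of the seed names.
hit <- tbl$P1 %in% seed | tbl$P2 %in% seed
tbl$Cluster[hit] <- 1L

which(hit)  # rows 1 and 2
```

Repeating this with each newly marked row's names, until no further rows are added, would extend cluster 1 before moving on to the next unassigned row.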

Edit: the output I would like to get:

       P1        P2 No_Interactions      Cluster
1  AT3G61540 AT4G30920               8      1
2  AT3G61540 AT4G30910               5      1
3  AT1G58080 AT2G45440               5      2
4  AT1G58080 AT1G09795               9      2
5  AT1G58080 AT5G52100               7      2
6  AT5G04710 AT5G60160               6      3
7  AT5G52100 AT1G75330               5      2 ### Cluster 2 because AT5G52100 was found in the row number 5 as a partner of AT1G58080
8  AT3G58730 AT1G20630               5      5
9  AT3G58730 AT5G02500               5      5
10 AT3G58730 AT3G61540               5      1 ## Cluster 1 because AT3G61540 was found in first row.

Col*_*vel 5

I corrected my initial answer and propose a functional-programming approach that uses Map and recursion to find the clusters:

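library(magrittr)

# TRUE if the two name vectors share at least one element
similar = function(u,v) length(intersect(u,v)) > 0

clusterify = function(df)
{
    clusters = df$cluster

    # stop once every row has a cluster (0 means "unassigned")
    if(!any(clusters==0)) return(df)

    # first unassigned row and its two names
    idx = pmatch(0, clusters)
    lst = Map(c, as.character(df[,1]), as.character(df[,2]))
    el  = c(as.character(df[idx, 1]), as.character(df[idx, 2]))

    # K is 1 for rows sharing a name with row idx, 0 otherwise
    K = lst %>%
        sapply(similar, v=el) %>%
        add(0)

    # rows that overlap with row idx and already have a cluster
    mask = clusters!=0 & K==1

    if(any(mask))
    {
        # reuse the smallest cluster number among them
        cl = min(clusters[mask])
        df[K==1,]$cluster = cl
    }
    else
    {
        # otherwise open a new cluster
        df[K==1,]$cluster = max(clusters) + 1
    }

    clusterify(df)
}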
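The overlap test that drives the function can be tried in isolation. The snippet below is a small sketch on made-up pairs (the letters are hypothetical, not from the question's data) showing how Map builds one name vector per row and how intersect flags the rows that share a name with the row being processed.

```r
# Hypothetical toy pairs, not taken from the question's data.
p1 <- c("A", "A", "C", "D")
p2 <- c("B", "E", "D", "F")

# One character vector c(P1, P2) per row.
pairs <- Map(c, p1, p2)

# Names of the row being processed (row 1 here).
el <- c(p1[1], p2[1])

# TRUE where a row shares at least one name with el.
overlap <- sapply(pairs, function(u) length(intersect(u, el)) > 0)

unname(overlap)  # TRUE TRUE FALSE FALSE
```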

You can use it like this: clusterify(transform(df, cluster=0))

For example, taking cluster 9 (you can check the other clusters), the clustering works correctly on your example:

subset(clusterify(transform(df, cluster=0)), cluster==9)
#          P1        P2 No_Interactions cluster
#25 AT5G19220 AT5G48300              10       9
#26 AT5G19220 AT2G34590               6       9
#27 AT5G19220 AT5G10920               6       9
#32 AT5G19220 AT3G01500               8       9
#33 AT5G19220 AT2G37020               5       9
#34 AT5G19220 AT2G35040               5       9
#92 AT4G22930 AT5G48300               5       9
#93 AT4G22930 AT2G34590               5       9
#94 AT4G22930 AT5G35630               5       9
#95 AT4G22930 AT4G34200               7       9
#96 AT4G22930 AT1G17745               5       9
#97 AT4G22930 AT4G00570               5       9
#98 AT4G22930 AT2G44350               6       9
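
For comparison, here is a sketch of a different technique, not the function above: treat each name as a node and each row as an edge, then propagate the smallest label along the edges until nothing changes, so that rows linked directly or through a chain of shared names end up with the same number. The function name cluster_rows and the toy letters are assumptions for illustration, and its numbering may differ from what clusterify returns.

```r
# Label rows by connected component: rows that share a name, directly or
# through other rows, receive the same cluster number.
cluster_rows <- function(p1, p2) {
  p1 <- as.character(p1)
  p2 <- as.character(p2)
  nodes <- unique(c(p1, p2))
  a <- match(p1, nodes)          # node index of each row's first name
  b <- match(p2, nodes)          # node index of each row's second name
  label <- seq_along(nodes)      # every name starts in its own group

  repeat {
    changed <- FALSE
    for (i in seq_along(a)) {
      lo <- min(label[a[i]], label[b[i]])
      if (label[a[i]] != lo || label[b[i]] != lo) {
        label[a[i]] <- lo
        label[b[i]] <- lo
        changed <- TRUE
      }
    }
    if (!changed) break          # labels are stable: components found
  }

  # Renumber as 1, 2, ... in order of first appearance by row.
  row_label <- label[a]
  match(row_label, unique(row_label))
}

# Hypothetical toy data: rows 1 and 3 share "B", rows 2 and 5 share "D".
p1 <- c("A", "C", "B", "F", "D")
p2 <- c("B", "D", "E", "G", "H")
cluster_rows(p1, p2)  # 1 2 1 3 2
```

On the question's data it could be called as cluster_rows(tbl_clustering$P1, tbl_clustering$P2) and the result stored in a new column.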