查找矩阵中最大的连续簇

Tav*_*nis 2 optimization r cluster-analysis depth-first-search

给定 R 编程语言中的以下矩阵:

set.seed(123)
matrix_1 <- matrix(rbinom(100, 1, 0.5), nrow = 10, ncol = 10)
Run Code Online (Sandbox Code Playgroud)

这是一个深度优先搜索 (DFS) 算法,用于识别该矩阵中 1 的簇。在此上下文中,“簇”是整数在矩阵上的连续映射,最小簇大小为 3,并假设 8 个连通性(即,包括对角线)。注意:我尝试对包使用基于图像的方法EBImage,但其执行速度对于我的目的来说太慢了。我有数千个 100X100 矩阵需要分析!

find_clusters <- function(matrix) {
  rows <- nrow(matrix)
  cols <- ncol(matrix)
  
  # Create a matrix of the same size to mark visited cells
  visited <- matrix(0, nrow = rows, ncol = cols)
  
  # Define all 8 possible movements from a cell (8-connectivity)
  row_nbr <- c(-1, -1, -1,  0, 0,  1, 1, 1)
  col_nbr <- c(-1,  0,  1, -1, 1, -1, 0, 1)
  
  # A function to check if a cell can be included in the DFS
  is_valid <- function(row, col) {
    row >= 1 && row <= rows && col >= 1 && col <= cols &&
      visited[row, col] == 0 && matrix[row, col] == 1
  }
  
  # A function to do a DFS of a 2D boolean matrix. It only considers
  # the 8 cells directly connected to a cell
  DFS <- function(matrix, row, col, visited, cluster) {
    row_stack <- c(row)
    col_stack <- c(col)
    
    while (length(row_stack) > 0) {
      r <- row_stack[length(row_stack)]
      c <- col_stack[length(col_stack)]
      row_stack <- row_stack[-length(row_stack)]
      col_stack <- col_stack[-length(col_stack)]
      
      if (visited[r, c] == 0) {
        visited[r, c] <- 1
        cluster <- rbind(cluster, c(r, c))
        
        for (k in 1:8) {
          if (is_valid(r + row_nbr[k], c + col_nbr[k])) {
            row_stack <- c(row_stack, r + row_nbr[k])
            col_stack <- c(col_stack, c + col_nbr[k])
          }
        }
      }
    }
    return(cluster)
  }
  
  # The main function that returns all clusters
  get_clusters <- function(matrix, visited) {
    clusters <- list()
    for (i in 1:rows) {
      for (j in 1:cols) {
        if (visited[i, j] == 0 && matrix[i, j] == 1) {
          new_cluster <- DFS(matrix, i, j, visited, matrix(, nrow = 0, ncol = 2))
          if (nrow(new_cluster) >= 3) {
            clusters[[length(clusters) + 1]] <- new_cluster
          }
        }
      }
    }
    return(clusters)
  }
  
  return(get_clusters(matrix, visited))
}
Run Code Online (Sandbox Code Playgroud)

效果很好而且速度很快。但是,此函数返回大小 > 3 的所有可能簇(总共 44 个簇),其中包括嵌套在较大簇内的较小簇。

矩阵作为二值图像:

my_palette <- c("white", "black")

# correct for how image() reads a matrix
rotate <- function(x) t(apply(x, 2, rev))

image(rotate(matrix_1),
      axes = FALSE,
      col = my_palette_2)
Run Code Online (Sandbox Code Playgroud)

我只看到三个大小 >= 3 的簇。如何修改我的函数以仅“看到”矩阵上最大的完整簇?

更新

谢谢@I_O!我有来自 MATLAB 模拟的 10000 个 100X100 矩阵,用于模拟细胞膜上钠通道的行为。以下函数实现您的建议并返回通道类型 1 和 2 的簇大小:

library(dplyr)
library(terra)

# M: matrix of integers 

find_clusters_2chan <- function(M) {
  
  # Consider only 1s
  ones <- M == 1
  
  # convert matrix to raster
  raster_ones <- ones |> rast()
  
  # find clusters (consider zeros as NA, i. e. discontinuation)
  
  clusters_ones <- patches(raster_ones,
                           directions = 8,
                           zeroAsNA = TRUE)
  
  # generate frequency table
  ones_freq <- the_clusters_ones |> freq()
  
  # return counts >=3
  ones_freq$count %>%
    .[. >= 3] -> ONES
  
  #-------------------------------------------------------------------------------
  
  # Consider only 2s
  twos <- M == 2
  
  # convert matrix to raster
  raster_twos <- twos |> rast()
  
  # find clusters (consider zeros as NA, i. e. discontinuation)
  
  clusters_twos <- patches(raster_twos,
                           directions = 8,
                           zeroAsNA = TRUE)
  
  # generate frequency table
  twos_freq <- clusters_twos |> freq()
  
  # return counts >=3
  twos_freq$count %>%
    .[. >= 3] -> TWOS
  
  clusters_list <- list(channel_1 = ONES,
                        channel_2 = TWOS)
  
  return(clusters_list)
  
}

start <- Sys.time()

clusters_big_list <- lapply(list_of_matrices, find_clusters_2chan)

end <- Sys.time()

end - start

# run time = 3.902859 minutes
Run Code Online (Sandbox Code Playgroud)

I_O*_*_O 6

如果您愿意使用像 {terra} 这样的专用包进行栅格分析,那么以下操作应该方便快捷编辑:包装函数以概括跨通道值和底部的最小簇大小

library(terra)

set.seed(123)
matrix_1 <- matrix(rbinom(100, 1, 0.5), nrow = 10, ncol = 10)
Run Code Online (Sandbox Code Playgroud)
  • 将矩阵转换为栅格:
raster_1 <- matrix_1 |> rast()
Run Code Online (Sandbox Code Playgroud)
  • 查找簇(将零视为 NA,即中止)
the_clusters <- patches(raster_1, directions = 8, zeroAsNA=TRUE)
Run Code Online (Sandbox Code Playgroud)
  • 检查已识别的集群:
the_clusters |> plot()
Run Code Online (Sandbox Code Playgroud)

集群

  • 列出簇大小(值:簇 ID,计数:簇大小)
the_clusters |> freq()
Run Code Online (Sandbox Code Playgroud)
  layer value count
1     1     1    24
2     1     2     8
3     1     3     1
4     1     4     2
5     1     5    12
Run Code Online (Sandbox Code Playgroud)

为任意数量的通道提取最小大小的簇的函数可能如下所示:

find_clusters_2_all_chans <- 
  function(M, channel_numbers, min_count) {
    channel_numbers |> 
      Map(f = \(i){
        M |> 
          rast() |>
          app(fun = \(cell) ifelse(cell == i, cell, NA)) |>
          patches(directions = 8, zeroAsNA = TRUE) |>
          freq()|>
          (\(m) m[m['count'] > min_count,])()
      }) |>
      setNames(paste0('channel_', channel_numbers))
  }
Run Code Online (Sandbox Code Playgroud)

示例:为通道 1 和 2 选择大于三个像素的簇

find_clusters_2_all_chans(M, 1:2, 3)
Run Code Online (Sandbox Code Playgroud)
> $channel_1
  layer value count
1     1     1     4
2     1     2    10
3     1     3     6
4     1     5     6
6     1     7     4

$channel_2
  layer value count
5     1     5    12
Run Code Online (Sandbox Code Playgroud)