Find the most repeated sequence of rows in an R data frame

use*_*768 6 algorithm r sequence

Suppose I have a data frame that looks like this:

    ITEM
  1  X
  2  A
  3  B
  4  C
  5  A
  6  F
  7  U
  8  A
  9  B
 10  C
 11  F
 12  U

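How can I get the most common sequence of rows? In this case, the most common sequence is A, B, C, because it appears in rows 2 to 4 and in rows 8 to 10.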

I have tried the `rle` function and some of the solutions found here, but had no luck. Could I get suggestions, hints, or package recommendations?

d.b*_*d.b 1

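I guess you want the longest repeated non-overlapping substring. There are some good explanations of the dynamic programming solution here.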

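x = c("X", "A", "B", "C", "A", "F", "U", "A", "B", "C", "F", "U")
n = length(x)
# m1[i, j] is 1 when x[i] == x[j], otherwise 0
m1 = sapply(x, function(i) sapply(x, function(j) as.integer(i == j)))
# keep only the strict upper triangle (i < j) so no element is matched with itself
diag(m1) = 0
m1[lower.tri(m1)] = 0
m1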
#   X A B C A F U A B C F U
# X 0 0 0 0 0 0 0 0 0 0 0 0
# A 0 0 0 0 1 0 0 1 0 0 0 0
# B 0 0 0 0 0 0 0 0 1 0 0 0
# C 0 0 0 0 0 0 0 0 0 1 0 0
# A 0 0 0 0 0 0 0 1 0 0 0 0
# F 0 0 0 0 0 0 0 0 0 0 1 0
# U 0 0 0 0 0 0 0 0 0 0 0 1
# A 0 0 0 0 0 0 0 0 0 0 0 0
# B 0 0 0 0 0 0 0 0 0 0 0 0
# C 0 0 0 0 0 0 0 0 0 0 0 0
# F 0 0 0 0 0 0 0 0 0 0 0 0
# U 0 0 0 0 0 0 0 0 0 0 0 0

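# m2 accumulates run lengths along each diagonal of m1
m2 = m1
for (i in 2:nrow(m1)){
    for (j in 2:nrow(m1)){
        # a match at (i, j) that continues a match at (i - 1, j - 1)
        if (m1[i-1, j-1] == 1 && m1[i, j] == 1){
            # extend only while the two copies do not overlap
            if (j - i > m2[i - 1, j - 1]){
                # move the running length forward to the end of the run
                m2[i, j] = m2[i - 1, j - 1] + m2[i, j]
                m2[i - 1, j - 1] = 0
            } else {
                m2[i, j] = 0
            }
        }
    }
}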
m2
#   X A B C A F U A B C F U
# X 0 0 0 0 0 0 0 0 0 0 0 0
# A 0 0 0 0 1 0 0 0 0 0 0 0
# B 0 0 0 0 0 0 0 0 0 0 0 0
# C 0 0 0 0 0 0 0 0 0 3 0 0
# A 0 0 0 0 0 0 0 1 0 0 0 0
# F 0 0 0 0 0 0 0 0 0 0 0 0
# U 0 0 0 0 0 0 0 0 0 0 0 2
# A 0 0 0 0 0 0 0 0 0 0 0 0
# B 0 0 0 0 0 0 0 0 0 0 0 0
# C 0 0 0 0 0 0 0 0 0 0 0 0
# F 0 0 0 0 0 0 0 0 0 0 0 0
# U 0 0 0 0 0 0 0 0 0 0 0 0

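# length of the longest repeat, and the column(s) where its second copy ends
ans_len = max(m2)
inds = c(which(m2 == ans_len, arr.ind = TRUE)[,2])
# slice the repeated run(s) back out of x
lapply(inds, function(ind) x[(ind - ans_len + 1):ind])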
# [[1]]
# [1] "A" "B" "C"
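For comparison, the textbook dynamic-programming recurrence for the longest repeated non-overlapping run can be written as a small function. This is a minimal sketch, not the answerer's code: the name `longest_repeat` is made up for illustration, it assumes `x` contains no `NA` values, and it returns only the first longest run it finds.

```r
# Hypothetical helper: longest run that occurs at least twice without overlap.
longest_repeat <- function(x) {
  n <- length(x)
  if (n < 2) return(x[0])
  # dp[i + 1, j + 1] = length of the common run ending at x[i] and x[j] (i < j);
  # the extra first row and column stand for "nothing matched yet"
  dp <- matrix(0L, n + 1, n + 1)
  best_len <- 0L
  best_end <- 0L
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      # extend the run only while the two copies stay disjoint
      if (x[i] == x[j] && dp[i, j] < j - i) {
        dp[i + 1, j + 1] <- dp[i, j] + 1L
        if (dp[i + 1, j + 1] > best_len) {
          best_len <- dp[i + 1, j + 1]
          best_end <- i
        }
      }
    }
  }
  if (best_len == 0L) return(x[0])
  x[(best_end - best_len + 1):best_end]
}

x <- c("X", "A", "B", "C", "A", "F", "U", "A", "B", "C", "F", "U")
longest_repeat(x)
```

On the example vector this returns `"A" "B" "C"`. Both time and memory are O(n^2) in the length of `x`.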

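  • This is nice, but it is O(2^N), where N = the total length of the input, so it does not scale. If we can assume an upper bound on the sequence length (would that be acceptable, user3276768?), it would stay scalable. Also, an empty `collapse = ""` string would make things visually more compact. (2 upvotes)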
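One way to read the comment's suggestion of an upper bound on the sequence length is to count every contiguous run up to that bound and pick the most frequent one. The sketch below is only an illustration under assumptions: the function name `most_frequent_run` and its parameters `min_len`, `max_len` and `sep` are hypothetical, items are assumed not to contain the separator or be empty strings, overlapping occurrences are counted, and ties in frequency are resolved in favour of the longest run.

```r
# Hypothetical helper: most frequent contiguous run with bounded length.
most_frequent_run <- function(x, min_len = 2L, max_len = 5L, sep = "\t") {
  x <- as.character(x)
  n <- length(x)
  max_len <- min(max_len, n)
  if (max_len < min_len) return(character(0))
  # build one string key per run of each length k in min_len..max_len
  keys <- unlist(lapply(min_len:max_len, function(k) {
    starts <- seq_len(n - k + 1L)
    vapply(starts,
           function(s) paste(x[s:(s + k - 1L)], collapse = sep),
           character(1))
  }))
  counts <- table(keys)
  top <- names(counts)[counts == max(counts)]
  # among the equally frequent runs, prefer the longest
  run_len <- lengths(strsplit(top, sep, fixed = TRUE))
  strsplit(top[which.max(run_len)], sep, fixed = TRUE)[[1]]
}

x <- c("X", "A", "B", "C", "A", "F", "U", "A", "B", "C", "F", "U")
most_frequent_run(x)
```

With the defaults this yields `"A" "B" "C"` for the example: the pairs A B, B C and F U also occur twice, but A B C is the longest run with that count. The number of runs examined grows as roughly n times `max_len`, so the bound keeps the work linear in the input length.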