Remove/collapse consecutive duplicate values in sequence

Ama*_*eet 16 loops r lag apply

I have the following data frame:

a a a b c c d e a a b b b e e d d

The desired result is:

a b c d e a b e d 

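That is, no two consecutive rows should have the same value. How can this be done without using a loop?

My data set is very large, so a loop takes a long time to run.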

The data frame is structured as follows:

a 1 
a 2
a 3
b 2
c 4
c 1
d 3
e 9
a 4
a 8
b 10
b 199
e 2
e 5
d 4
d 10

Result:

a 1 
b 2
c 4
d 3
e 9
a 4
b 10
e 2
d 4

The entire row should be removed.

A5C*_*2T1 21

One easy way is to use rle.

Here is your sample data:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items

rle returns a list with two components: the run lengths ("lengths") and the value that is repeated in each run ("values").

rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
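
To make both components of the rle result visible, here is a minimal sketch on the same x. The name r is only illustrative; inverse.rle() is the base R function that rebuilds a vector from the two components.

```r
x <- scan(what = character(), quiet = TRUE,
          text = "a a a b c c d e a a b b b e e d d")

r <- rle(x)
r$lengths                        # how many times each run repeats
r$values                         # the value carried by each run

sum(r$lengths) == length(x)      # run lengths add up to the input length
identical(inverse.rle(r), x)     # inverse.rle() reconstructs the original vector
```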

Update: for a data.frame

If you are working with a data.frame, try the following:

## Sample data
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e", 
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9, 
         4, 8, 10, 199, 2, 5, 4, 10)
)

## Use rle, as before
X <- rle(as.character(mydf$V1))  # as.character() guards against V1 being a factor (the default before R 4.0)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1]  1  4  5  7  8  9 11 13 15
mydf[Y, ]
#    V1 V2
# 1   a  1
# 4   b  2
# 5   c  4
# 7   d  3
# 8   e  9
# 9   a  4
# 11  b 10
# 13  e  2
# 15  d  4
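
The indices in Y are the positions where a run starts. As a cross-check, the same positions can be found by comparing each value with its predecessor. This is only a sketch: it assumes V1 is a character column with no NA values, and the names v, starts and Y2 are illustrative.

```r
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10),
  stringsAsFactors = FALSE
)

v <- mydf$V1
# TRUE where a value differs from the one before it;
# the first row always starts a run
starts <- c(TRUE, v[-1] != v[-length(v)])
Y2 <- which(starts)

mydf[Y2, ]   # first row of every run, whole rows kept
```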

Update 2

The "data.table" package has a function, rleid, that lets you do this quite easily. Using mydf from above, try:

library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
#    rleid V2
# 1:     1  1
# 2:     2  2
# 3:     3  4
# 4:     4  3
# 5:     5  9
# 6:     6  4
# 7:     7 10
# 8:     8  2
# 9:     9  4
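
The output above shows only rleid and V2, so V1 itself is not in the result. If the whole row is wanted, one possible variant (a sketch, not part of the original answer) filters on the run id instead of grouping by it. It assumes the "data.table" package is installed; dt and res are illustrative names.

```r
library(data.table)

mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10),
  stringsAsFactors = FALSE
)

dt <- as.data.table(mydf)
# rleid() numbers the runs 1, 2, 3, ...; keep only the first row of each run id
res <- dt[!duplicated(rleid(V1))]
res
```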


Kha*_*haa 8

library(dplyr)
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
x[x != lag(x, default = "1")]  # character default, so it matches the type of x
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

Edit: for a data.frame

mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10),
  stringsAsFactors = FALSE)

The dplyr solution is a one-liner:

mydf %>% filter(V1!= lag(V1, default="1"))
#  V1 V2
#1  a  1
#2  b  2
#3  c  4
#4  d  3
#5  e  9
#6  a  4
#7  b 10
#8  e  2
#9  d  4
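
The default = "1" sentinel works as long as the first value of V1 is not itself "1". A possible variant that avoids picking a sentinel is sketched below; it assumes dplyr is installed and that V1 contains no NA values, and res is an illustrative name.

```r
library(dplyr)

mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10),
  stringsAsFactors = FALSE
)

# lag(V1) is NA on the first row, so the first row is kept explicitly;
# every other row is kept only when it differs from the row before it
res <- mydf %>% filter(row_number() == 1 | V1 != lag(V1))
res
```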

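Postscript

lead(x, 1), suggested by @Carl Witthoft, iterates in the reverse order: each value is compared with the next one, so the last row of each run is kept instead of the first.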

leadit<-function(x) x!=lead(x, default="what")
rows <- leadit(mydf[ ,1])
mydf[rows, ]

#   V1  V2
#3   a   3
#4   b   2
#6   c   1
#7   d   3
#8   e   9
#10  a   8
#12  b 199
#14  e   5
#16  d  10
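
For comparison, the same "last row of each run" selection can be sketched in base R with rle: the cumulative sum of the run lengths is the index of the last element of each run. The name last_rows is illustrative.

```r
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10),
  stringsAsFactors = FALSE
)

X <- rle(as.character(mydf$V1))
# each run ends where the running total of run lengths lands
last_rows <- cumsum(X$lengths)

mydf[last_rows, ]   # last row of every run
```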


Col*_*vel 6

With base R, I like funny algorithmics:

x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")

x[x!=c(x[-1], FALSE)]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
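
One caveat with this trick: c(x[-1], FALSE) coerces FALSE to the string "FALSE", so the last run is lost if the vector happens to end in the literal value "FALSE". A sentinel-free variant might look like the sketch below; keep_last and y are illustrative names, and the input is assumed to have no NA values.

```r
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a",
       "a", "b", "b", "b", "e", "e", "d", "d")

# compare each element with the next one; the final element always ends a run
keep_last <- c(x[-length(x)] != x[-1], TRUE)
x[keep_last]

# edge case: a character vector that ends in the string "FALSE"
y <- c("TRUE", "FALSE", "FALSE")
y[y != c(y[-1], FALSE)]               # sentinel collides, the final run is dropped
y[c(y[-length(y)] != y[-1], TRUE)]    # sentinel-free version keeps it
```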

  • Similarly, indexing can be used instead of "tail", e.g. "x[x != c(x[-1], FALSE)]" (2 upvotes)