pan*_*man 5 intersection r list match
I have a list like this:
mylist <- list(PP = c("PP 1", "OMITTED"),
IN01 = c("DID NOT PARTICIPATE", "PARTICIPATED", "OMITTED"),
RD1 = c("YES", "NO", "NOT REACHED", "INVALID", "OMITTED"),
RD2 = c("YES", "NO", "NOT REACHED", "NOT AN OPTION", "OMITTED"),
LOS = c("LESS THAN 3", "3 TO 100", "100 TO 500", "MORE THAN 500", "LOGICALLY NOT APPLICABLE", "OMITTED"),
COM = c("BAN", "SBAN", "RAL"),
VR1 = c("WITHIN 30", "WITHIN 200", "NOT AVAILABLE", "OMITTED"),
INF = c("A LOT", "SOME", "LITTLE OR NO", "NOT APPLICABLE", "OMITTED"),
IST = c("FULL-TIME", "PART-TIME", "FULL STAFFED", "NOT STAFFED", "LOGICALLY NOT APPLICABLE", "OMITTED"),
CMP = c("ALL", "MOST", "SOME", "NONE", "LOGICALLY NOT APPLICABLE", "OMITTED"))
I also have another list like this:
matchlist <- list("INVALID", c("INVALID", "OMITTED OR INVALID"),
c("INVALID", "OMITTED"), "OMITTED", c("NOT REACHED", "INVALID", "OMITTED"),
c("LOGICALLY NOT APPLICABLE", "INVALID", "OMITTED"),
c("LOGICALLY NOT APPLICABLE", "INVALID", "OMITTED OR INVALID"),
c("Not applicable", "Not stated"), c("Not reached", "Not administered/missing by design", "Presented but not answered/invalid"),
c("Not administered/missing by design", "Presented but not answered/invalid"),
"OMITTED OR INVALID",
c("LOGICALLY NOT APPLICABLE", "OMITTED OR INVALID"),
c("NOT REACHED", "OMITTED"),
c("NOT APPLICABLE", "OMITTED"),
c("LOGICALLY NOT APPLICABLE", "OMITTED"),
c("LOGICALLY NOT APPLICABLE", "NOT REACHED", "OMITTED"),
"NOT EXCLUDED", c("Default", "Not applicable", "Not stated"), c("Valid Skip", "Not Reached", "Not Applicable", "Invalid", "No Response"),
c("Not administered", "Omitted"),
c("NOT REACHED", "INVALID RESPONSE", "OMITTED"),
c("INVALID RESPONSE", "OMITTED"))
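As you can see, some of the vectors in matchlist partially match vectors in mylist. In some cases, a vector in matchlist matches part of a vector in mylist in full. For example, the last values of RD1 in mylist match the vector in the fifth component of matchlist, but RD2 does not, although there are common values. The values in RD2 in mylist ("NOT REACHED", "NOT AN OPTION", "OMITTED"), taken together in this order, have no match in any of the vectors in matchlist. The same holds for the values of COM in mylist.
What I want to achieve is to compare the elements of each vector in mylist against each vector in matchlist, extract the common values that match the values in matchlist in the same order, and store them in another list. The desired result should look like this: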
$PP
[1] "OMITTED"
$IN01
[1] "OMITTED"
$RD1
[1] "NOT REACHED" "INVALID" "OMITTED"
$RD2
character(0)
$LOS
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"
$COM
character(0)
$VR1
[1] "OMITTED"
$INF
[1] "NOT APPLICABLE" "OMITTED"
$IST
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"
$CMP
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"
What I have tried so far:
Using intersect:
lapply(mylist, function(i) {
  intersect(i, lapply(matchlist, function(i) {i}))
})
It returns only the last value ("OMITTED") of each vector in matchlist.
Using match via %in%:
lapply(mylist, function(i) {
  i[which(i %in% matchlist)]
})
This returns only ("INVALID", "OMITTED") of the desired result for RD1; for the rest it returns only the last value ("OMITTED"), except for COM, which is correct.
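A likely reason for this result: when the table given to %in% is a list, match() converts it to a character vector, so only the single-string components of matchlist remain comparable, while longer components become deparsed text such as c("INVALID", "OMITTED"). A minimal sketch with a shortened, made-up table (tbl, tbl_chr, x and hits are hypothetical names):

```r
# Hypothetical, shortened stand-in for matchlist
tbl <- list("INVALID", c("INVALID", "OMITTED"), "OMITTED")

# Roughly what match() compares against after coercing the list
tbl_chr <- as.character(tbl)
print(tbl_chr)

# Only the length-one components can match an element of x
x <- c("YES", "NO", "NOT REACHED", "INVALID", "OMITTED")
hits <- x[x %in% tbl]
print(hits)
```

This reproduces the ("INVALID", "OMITTED") result reported for RD1: "NOT REACHED" never appears in matchlist as a component on its own, so it is dropped.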
Using mapply and intersect:
mapply(intersect, mylist, matchlist)
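This returns a long list containing a mix of almost everything, including combinations that should not be there, plus a warning about unequal lengths.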
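mapply walks its arguments in parallel, pairing the first element of one list with the first element of the other, and recycles the shorter list when the lengths differ. It therefore does not compare every vector in mylist with every vector in matchlist. A small sketch with made-up data (a, b and paired are hypothetical names):

```r
a <- list(x = c("YES", "OMITTED"), y = c("NO", "INVALID"))
b <- list("OMITTED", "INVALID", c("YES", "NO"), "NO")

# a is recycled to length 4, so the pairs are (x, b[[1]]), (y, b[[2]]),
# (x, b[[3]]), (y, b[[4]]); no other combination is ever compared
paired <- mapply(intersect, a, b, SIMPLIFY = FALSE, USE.NAMES = FALSE)
str(paired)
```

Here the lengths (2 and 4) divide evenly, so no warning appears; with 10 and 22 components, as in the question, the recycling is uneven and R warns about it.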
Can anyone help?
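There are some very simple and good answers, but they all seem to rely on unlist. I am assuming you need to preserve the grouping in matchlist, so unlisting it does not make sense. Here is a solution that works without unlist, using a double lapply as you had started to do: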
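out <- lapply(mylist, function(this) {
  # Intersect every matchlist vector with this vector (matchlist order is kept)
  mtch <- lapply(matchlist, intersect, this)
  # Index of the longest intersection
  wh <- which.max(lengths(mtch))
  if (length(wh)) mtch[[wh]] else character(0)
})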
str(out)
# List of 9
# $ PP : chr "OMITTED"
# $ IN01: chr "OMITTED"
# $ RD1 : chr [1:3] "NOT REACHED" "INVALID" "OMITTED"
# $ LOS : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ COM : chr(0)
# $ VR1 : chr "OMITTED"
# $ INF : chr [1:2] "NOT APPLICABLE" "OMITTED"
# $ IST : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ CMP : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
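It always returns the single vector with the most matches. If (somehow) several vectors tie for the most matches, I think it will preserve the natural order and return the first of those longest matches. (The question is: "does which.max preserve natural order?" I believe it does, but I have not verified it.)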
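On the tie question: the R help page for which.max describes it as returning the index of the first maximum, so earlier components of matchlist win ties. A quick check with made-up lengths (len and first_max are hypothetical names):

```r
# Two candidates tie for the longest match (length 3)
len <- c(1L, 3L, 3L, 2L)
first_max <- which.max(len)
print(first_max)  # 2, the first of the tied positions

# An all-zero vector still yields index 1; only an empty input
# gives a zero-length result
print(which.max(c(0L, 0L, 0L)))         # 1
print(length(which.max(integer(0))))    # 0
```

One consequence is that the length(wh) guard only matters when matchlist itself is empty. When nothing matches, which.max picks the first component, whose intersection is already character(0).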
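Update
The added constraint requires not only that the values of a matchlist vector be present and in order, but also that no other words come between them. For example, if, as suggested in a comment, mylist$RD1 contains "BLAH", then it no longer matches matchlist[[5]].
Checking whether one vector is a perfect ordered subset of another is a little awkward (so this is no code-golf champion), and it generally scales poorly because there is no easy way to determine such a subset. With that caveat, this implementation uses some nested *apply functions ...
(Note: a comment suggested that $RD1 should return character(0), but it does contain "INVALID", which matches one of the single-length components of matchlist, so it should match, just not the longer one.)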
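out <- lapply(mylist, function(this) {
  # For each matchlist vector, every position in `this` where its first value occurs
  ind <- lapply(matchlist, function(a) which(this == a[1]))
  # Score each matchlist vector: its length if it occurs as an unbroken run, else 0
  perfectmatches <- mapply(function(ml, allis, this) {
    # isTRUE() treats a window that runs past the end of `this` (NA) as no match
    hit <- vapply(allis, function(i) isTRUE(all(ml == this[i + seq_along(ml) - 1])), logical(1))
    length(ml) * any(hit)
  }, matchlist, ind, MoreArgs = list(this = this))
  if (any(perfectmatches > 0)) {
    # which.max() takes the first of the longest perfect matches
    matchlist[[which.max(perfectmatches)]]
  } else {
    character(0)
  }
})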
str(out)
# List of 9
# $ PP : chr "OMITTED"
# $ IN01: chr "OMITTED"
# $ RD1 : chr "INVALID"
# $ LOS : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ COM : chr(0)
# $ VR1 : chr "OMITTED"
# $ INF : chr [1:2] "NOT APPLICABLE" "OMITTED"
# $ IST : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ CMP : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
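The contiguity test that this approach depends on can also be written as a small stand-alone function. A minimal sketch, assuming the vectors contain no NA values; is_contiguous_run is a hypothetical name and is not part of the code above:

```r
# Hypothetical helper: TRUE if `needle` occurs in `haystack` as an
# uninterrupted run, in the same order
is_contiguous_run <- function(needle, haystack) {
  n <- length(needle)
  if (n == 0L || n > length(haystack)) return(FALSE)
  # Only start positions that leave room for the whole needle
  for (s in seq_len(length(haystack) - n + 1L)) {
    if (all(haystack[s:(s + n - 1L)] == needle)) return(TRUE)
  }
  FALSE
}

rd1 <- c("YES", "NO", "NOT REACHED", "INVALID", "OMITTED")
rd2 <- c("YES", "NO", "NOT REACHED", "NOT AN OPTION", "OMITTED")
target <- c("NOT REACHED", "INVALID", "OMITTED")

print(is_contiguous_run(target, rd1))                       # TRUE
print(is_contiguous_run(target, rd2))                       # FALSE
print(is_contiguous_run(c("NOT REACHED", "OMITTED"), rd2))  # FALSE
```

The last call shows the difference from a plain intersect: both values occur in rd2 in the right order, but "NOT AN OPTION" sits between them, so the run is broken.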