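I have run into this problem.
I have a data frame (dates) in which document IDs and dates are stored, with the dates held in a character vector: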
Doc Dates
1 12345 c("06/01/2000","08/09/2002")
2 23456 c("07/01/2000", "09/08/2003", "07/01/2000")
3 34567 c("09/06/2004", "09/06/2004", "12/30/2006")
4 45678 c("06/01/2000","08/09/2002")
I am trying to remove the duplicate elements within Dates to get this result:
Doc Dates
1 12345 c("06/01/2000","08/09/2002")
2 23456 c("07/01/2000", "09/08/2003")
3 34567 c("09/06/2004", "12/30/2006")
4 45678 c("06/01/2000","08/09/2002")
I tried:
R>unique(dates$dates)
But that removes whole rows whose Dates value duplicates another row's:
Doc Dates
1 12345 c("06/01/2000","08/09/2002")
2 23456 c("07/01/2000", "09/08/2003")
3 34567 c("09/06/2004", "12/30/2006")
Any help on how to remove only the duplicate elements within Dates, rather than removing rows with duplicate Dates?
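For illustration, the difference between the two behaviours can be sketched as follows. This is a minimal sketch, not a definitive answer: it assumes `Dates` is a list column holding one character vector per row (the printed `c(...)` values suggest this, but the actual storage is not shown), and the data frame below is a hypothetical reconstruction of the example.

```r
# Hypothetical reconstruction of the example; Dates is ASSUMED to be a list column
dates <- data.frame(Doc = c(12345, 23456, 34567, 45678))
dates$Dates <- list(
  c("06/01/2000", "08/09/2002"),
  c("07/01/2000", "09/08/2003", "07/01/2000"),
  c("09/06/2004", "09/06/2004", "12/30/2006"),
  c("06/01/2000", "08/09/2002")
)

# unique() on the whole column compares entire list elements with each other,
# so row 4 is treated as a duplicate of row 1 and dropped
n_distinct_rows <- length(unique(dates$Dates))

# Applying unique() to each element instead removes repeats within a row only,
# and keeps all four rows
dates$Dates <- lapply(dates$Dates, unique)
dates$Dates[[2]]
```

If the column actually holds plain strings that merely look like `c(...)`, for example after `as.character()` was applied to a list, the vectors would need to be recovered before this per-element step could work.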
**Updated data**
# Match some text string (dates) from some text:
df1$dates <- as.character(strapply(df1[[2]], "((\\D\\d{1,2}(/|-)\\d{1,2}(/|-)\\d{2,4})| ([^/]\\d{1,2}(/|-)\\d{2,4})|((JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV){1}[\\s|-]{0,2}\\d{1,4}(\\D[\\s|-]{0,}\\d{2,4}){0,}))"))
# Drop first 2 columns from dataframe
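df2 <- df1[-c(1, 2)] …
OK, maybe this is a better example. I am looking for guidance or reference material on how to reference a variable inside a regular expression, not on how to build a regex for this data.
How do I use the value in one variable to run a regex against the next variable?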
library(plyr)
library(tm)
library(stringr)
library(gsubfn)
Velocity dataset
d1$sub <- c("LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 50-55% (0-49)LESS THAN 50% COMMON:", "LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 60-70% (0-49)LESS THAN 50% COMMON:", "LEFT CAROTID STENOSIS: (40-60)APPROXIMATELY 40% INCOMPLETE SCAN SEE NOTES (40-50)LESS THAN 50% COMMON:")
d1$sub
[1] "LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 50-55% (0-49)LESS THAN 50% COMMON:"
[2] "LEFT CAROTID STENOSIS: (50-69)APPROXIMATELY 60-70% (0-49)LESS THAN 50% COMMON:"
[3] "LEFT CAROTID STENOSIS: (40-60)APPROXIMATELY 40% INCOMPLETE SCAN SEE NOTES (40-50)LESS THAN 50% COMMON:" …
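As a general illustration of referencing a variable inside a regex, one common approach is to paste the variable's value into the pattern string and then apply each row's pattern to that row's text. The sketch below uses base R only; the data frame `d`, its columns `key`, `text`, `pattern`, and `approx`, and the shortened strings are all hypothetical and are not taken from the question's data.

```r
# Hypothetical data: `key` holds the value to look for, `text` is the column to search
d <- data.frame(
  key  = c("50-69", "40-60"),
  text = c("STENOSIS: (50-69)APPROXIMATELY 50-55%",
           "STENOSIS: (40-60)APPROXIMATELY 40%"),
  stringsAsFactors = FALSE
)

# Build one pattern per row by pasting the variable's value into the regex.
# The parentheses are escaped because they are literal characters in the text.
d$pattern <- paste0("\\(", d$key, "\\)APPROXIMATELY ([0-9-]+)%")

# Apply each row's pattern to that row's text and keep only the captured group
d$approx <- mapply(
  function(p, x) sub(paste0(".*", p, ".*"), "\\1", x),
  d$pattern, d$text,
  USE.NAMES = FALSE
)
d$approx
```

The values in `key` here contain no regex metacharacters apart from a hyphen outside a bracket expression, which is literal. If a variable could contain characters such as `.`, `(`, or `+`, they would need escaping before being pasted into the pattern.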