My data looks like this:
duration obs another
1 1.801760 ID: 10 DAY: 6/10/13 S orange
2 1.868500 ID: 10 DAY: 6/10/13 S green
3 0.233562 ID: 10 DAY: 6/10/13 S yellow
4 5.538760 ID:96 DAY: 6/8/13 T yellow
5 3.436700 ID:96 DAY: 6/8/13 T blue
6 0.533856 ID:96 DAY: 6/8/13 T pink
7 2.302250 ID:96 DAY: 6/8/13 T orange
8 2.779420 ID:96 DAY: 6/8/13 T green
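I've only included three variables here, but my real data has many more. My problem is the ugly "obs" variable. I received this data from someone else, who entered the information inconsistently into the software they were using.
'obs' contains three pieces of information:
- the id (ID: 10, ID:96, etc.)
- the date (M/D/Y)
- an identifier (S or T)
I want to split this information and extract the ID number (10 or 96), the date (e.g. 6/8/13), and the identifier (S or T).
To do this, I tried using strsplit as follows: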
temp<-strsplit(as.character(df$obs), " ")
mat<-matrix(unlist(temp), ncol=5, byrow=TRUE)
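I thought this would work on my actual data, which has 130,000 observations, but I hadn't realized that some observations have no space between "ID:" and the number. In the data above, for example, "ID:96" has no space between the colon and the number. Predictably, I got this warning message: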
Warning message:
In matrix(unlist(temp), ncol = 5, byrow = TRUE) :
data length [796454] is not a sub-multiple or multiple of the number of rows [159291]
Clearly the strsplit output cannot be coerced into nice regular columns, because it comes in two forms:
[1] "ID:" "10" "DAY:" "6/10/13" "S" #when there is whitespace
[1] "ID:96" "DAY:" "6/8/13" "T" #when there isn't whitespace
To try to get around this, I did the following, thinking it would work if I introduced a space after every 'ID:':
df$obs <- gsub("ID:", "ID: ", df$obs)
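But this doesn't work when I then run strsplit, because rows that already had a space end up with a double space, which is treated as two separate places to split.
If anyone knows a solution, perhaps using multiple strsplits, whose result can be put back into the original df as separate columns for the ID number, date, and identifier, that would be great.
EDIT: Sorry, I forgot to add data for a reproducible example: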
df<-structure(list(duration = c(1.80176, 1.8685, 0.233562, 5.53876,
3.4367, 0.533856, 2.30225, 2.77942), obs = structure(c(1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L), .Label = c("ID: 10 DAY: 6/10/13 S",
"ID:96 DAY: 6/8/13 T"), class = "factor"), another = structure(c(3L,
2L, 5L, 5L, 1L, 4L, 3L, 2L), .Label = c("blue", "green", "orange",
"pink", "yellow"), class = "factor")), .Names = c("duration",
"obs", "another"), class = "data.frame", row.names = c(NA, -8L
))
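After you fire that data entry person, I'd probably consider using a regular expression here to capture the data. First, here is just the data from the "obs" column (with the additional value from the comments added):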
obs<-c("ID: 10 DAY: 6/10/13 S", "ID: 10 DAY: 6/10/13 S", "ID: 10 DAY: 6/10/13 S",
"ID:96 DAY: 6/8/13 T", "ID:96 DAY: 6/8/13 T", "ID:96 DAY: 6/8/13 T",
"ID:96 DAY: 6/8/13 T", "ID:96 DAY: 6/8/13 T", "ID: 84DAY: 6/8/13 T")
Next, I can capture the data:
m<-regexpr("ID:\\s*(\\d+) ?DAY: (\\d+/\\d+/\\d+) (S|T)", obs, perl=T)
Then I use a helper function, regcapturedmatches(), to extract the captured matches (it works like regmatches(), but for capture groups):
do.call(rbind, regcapturedmatches(obs,m))
# [,1] [,2] [,3]
# [1,] "10" "6/10/13" "S"
# [2,] "10" "6/10/13" "S"
# [3,] "10" "6/10/13" "S"
# [4,] "96" "6/8/13" "T"
# [5,] "96" "6/8/13" "T"
# [6,] "96" "6/8/13" "T"
# [7,] "96" "6/8/13" "T"
# [8,] "96" "6/8/13" "T"
# [9,] "84" "6/8/13" "T"
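This returns a character matrix of the captured values. You can then process these values however you like, for example by converting them to the proper classes and attaching them to your data.frame.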
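As a minimal sketch of that last step, assuming you would rather stay in base R than rely on the helper: regexec() combined with regmatches() also returns the capture groups. The column names id, day and type are made up for illustration, and the date format "%m/%d/%y" is an assumption based on the M/D/Y description in the question.

```r
obs <- c("ID: 10 DAY: 6/10/13 S", "ID:96 DAY: 6/8/13 T", "ID: 84DAY: 6/8/13 T")

pattern <- "ID:\\s*(\\d+) ?DAY: (\\d+/\\d+/\\d+) (S|T)"

# regexec() records the full match plus every capture group;
# regmatches() turns that into one character vector per observation
parts <- regmatches(obs, regexec(pattern, obs))

# this sketch assumes every row matches (full match + 3 groups = 4 pieces)
stopifnot(all(lengths(parts) == 4))

# drop the first column (the full match), keep the three groups
mat <- do.call(rbind, parts)[, -1, drop = FALSE]

# convert each column to a sensible class (names are illustrative)
result <- data.frame(
  id   = as.integer(mat[, 1]),
  day  = as.Date(mat[, 2], format = "%m/%d/%y"),  # assumed M/D/Y
  type = factor(mat[, 3], levels = c("S", "T")),
  stringsAsFactors = FALSE
)
```

Applied to as.character(df$obs) from the question instead of the short vector here, the resulting frame has one row per observation and could be attached with cbind(df, result).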
But if you really did want to use strsplit, you could split on either a ":" or on whitespace optionally preceded by a ":":
do.call(rbind, strsplit(obs, "(:|:?\\s+)"))
# [,1] [,2] [,3] [,4] [,5]
# [1,] "ID" "10" "DAY" "6/10/13" "S"
# [2,] "ID" "10" "DAY" "6/10/13" "S"
# [3,] "ID" "10" "DAY" "6/10/13" "S"
# [4,] "ID" "96" "DAY" "6/8/13" "T"
# [5,] "ID" "96" "DAY" "6/8/13" "T"
# [6,] "ID" "96" "DAY" "6/8/13" "T"
# [7,] "ID" "96" "DAY" "6/8/13" "T"
# [8,] "ID" "96" "DAY" "6/8/13" "T"
# [9,] "ID" "84DAY" "6/8/13" "T" "ID"
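This works right up until your newest kind of bad data: the last row has no space before "DAY:", so it splits into only four pieces and rbind recycles its first value to fill the gap.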
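One way to catch such rows before they are silently recycled into the matrix is to count how many pieces each observation produced. This is a rough sketch, assuming every well-formed row should yield exactly five pieces. The split pattern is a slightly more explicit variant (a colon with optional trailing whitespace, or bare whitespace), not the exact one used above, and the names pieces, n_pieces, bad and good are illustrative.

```r
obs <- c("ID: 10 DAY: 6/10/13 S", "ID:96 DAY: 6/8/13 T", "ID: 84DAY: 6/8/13 T")

# split on a colon plus any trailing whitespace, or on bare whitespace
pieces <- strsplit(obs, ":\\s*|\\s+")

# number of pieces per observation; well-formed rows are assumed to give 5
n_pieces <- lengths(pieces)

# indices of rows that need manual attention
bad <- which(n_pieces != 5)
obs[bad]

# bind only the well-formed rows into a matrix
good <- do.call(rbind, pieces[n_pieces == 5])
```

Inspecting obs[bad] on the full data should show which other input variants exist before you decide how to extend the pattern.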