I have a log dataset:
V1 duration id startpoint
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 7771 1 2012-05-07_12-29-51
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360????[=]C<=>360.cn 7771 1 2012-05-07_12-29-51
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 7771 1 2012-05-07_12-29-51
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 7771 1 2012-05-07_12-29-51 211
I am trying to extract information (time point, process, PID, URL, etc.) from the first column. At first I tried:
df$timepoint <- gsub("T<=>(.*)[=].*", "\\1", df$V1)
It returns something like 161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<, so I then tried:
df$timepoint <- gsub("T<=>([0-9]*).*", "\\1", df$V1)
That works, but it will not work for text fields such as the process name, so I searched for "regex minimal match" and found the term non-greedy. I tried again:
df$timepoint <- gsub("T<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$process <- gsub(".*P<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$pid <- gsub(".*I<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$url <- gsub(".*U<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$addr <- gsub(".*A<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$tab <- gsub(".*B<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$ver <- gsub(".*V<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$window <- gsub(".*W<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$name <- gsub(".*N<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$company <- gsub(".*C<=>(.*?)", "\\1", df$V1)
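The difference between the first attempt and the non-greedy version can be reproduced on a single string. In a regular expression, an unescaped `[=]` is a character class that matches one `=`, and `.*` is greedy, so the capture runs up to the last `=` in the line. A minimal sketch (the variable `x` is just an illustrative sample taken from the data above):

```r
x <- "T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512"

# "[=]" is a character class (a single "="), and ".*" is greedy,
# so the capture extends to the last "=" in the string.
greedy <- sub("T<=>(.*)[=].*", "\\1", x)

# Escaping the brackets and using a lazy quantifier stops at the first "[=]".
lazy <- sub("T<=>(.*?)\\[=\\].*", "\\1", x)

print(greedy)
print(lazy)
```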
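Not every row contains all of the fields, and that is where the problem occurs. If there is no software name or company name, R simply copies V1 into the new variable. If the software version is at the end of V1, the regex ".*V<=>(.*?)\\[=\\].*" also copies the whole string into the new variable: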
V1 duration id startpoint timepoint process pid url addr tab ver window name company
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 7771 1 2012-05-07_12-29-51 161 explorer.exe 1820 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 20094 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360????[=]C<=>360.cn 7771 1 2012-05-07_12-29-51 195 360Safe.exe 1732 T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360????[=]C<=>360.cn T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360????[=]C<=>360.cn T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360????[=]C<=>360.cn 7, 5, 0, 1501 1017e 360???? 360.cn
T<=>203[=]P<=>360chrome.exe[=]I<=>436[=]U<=>NULL[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804[=]N<=>360?????[=]C<=>360.cn 7771 1 2012-05-07_12-29-51 203 360chrome.exe 436 NULL 2027a 20290 5.2.0.804 T<=>203[=]P<=>360chrome.exe[=]I<=>436[=]U<=>NULL[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804[=]N<=>360?????[=]C<=>360.cn 360????? 360.cn
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 7771 1 2012-05-07_12-29-51 209 360Safe.exe 1732 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 1017e T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 7771 1 2012-05-07_12-29-51 211 360chrome.exe 436 www.hao123.com 2027a 20290 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804
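I thought that if R could not find 'C<=>' (for example), there would be nothing for the (.*?) after it, so the result would be an empty string, but the output is the whole string instead. Can someone help me fix this? Thanks!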
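The behaviour described above follows from how `sub` and `gsub` work: they rewrite only the part of a string that matches, so an element with no match is returned unchanged rather than as an empty string. A minimal sketch of that behaviour, with `grepl` used as a guard (the shortened sample vector `x` and the choice of `NA` for missing fields are assumptions made for illustration):

```r
x <- c("T<=>195[=]P<=>360Safe.exe[=]N<=>360[=]C<=>360.cn",
       "T<=>161[=]P<=>explorer.exe[=]V<=>6.00.2900.5512")

pattern <- ".*C<=>(.*)"

# There is no "C<=>" in the second element, so it comes back unchanged.
raw <- sub(pattern, "\\1", x)

# Guard with grepl(): keep the replacement only where the pattern matched.
company <- ifelse(grepl(pattern, x), raw, NA_character_)

print(raw)
print(company)
```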
Thanks to MrFlick's comment, I arrived at a solution based on the answer below:
Taking the extraction of the software name as an example,
ind1 <- grep(".*N<=>(.*?)\\[=\\].*", df$V1, value = FALSE) # rows where N<=> is followed by another field
ind2 <- grep(".*N<=>(.*?)", df$V1, value = FALSE) # rows where N<=> appears at all
df$name <- ""
df$name[ind2] <- gsub(".*N<=>(.*?)", "\\1", df$V1[ind2]) # strip everything up to and including N<=>
df$name[ind1] <- gsub(".*N<=>(.*?)\\[=\\].*", "\\1", df$V1[ind1]) # also strip the fields that follow
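But this snippet looks ugly, and if it is the final solution I would have to repeat it for every other field (process, PID, version, company, etc.)... Can someone help optimize it? Thanks!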
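One way to avoid repeating the two-step pattern for every field is to wrap the match in a small helper. The following is only a sketch under assumptions: the helper name `extract_field` and the sample vector are hypothetical, each value is assumed to end at the next `[=]` or at the end of the string, and missing fields are returned as `NA`.

```r
extract_field <- function(x, key) {
  # The key must start the string or follow "[=]"; the value runs up to
  # the next "[=]" or the end of the string.
  pattern <- paste0("(?:^|\\[=\\])", key, "<=>(.*?)(?:\\[=\\]|$)")
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Non-matching elements yield character(0); map those to NA.
  vapply(m, function(p) if (length(p) == 0) NA_character_ else p[2],
         character(1))
}

V1 <- c("T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512",
        "T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804")

# Map column names to the one-letter keys used in the log.
keys <- c(timepoint = "T", process = "P", pid = "I", url = "U", ver = "V")
fields <- as.data.frame(lapply(keys, function(k) extract_field(V1, k)),
                        stringsAsFactors = FALSE)
print(fields)
```

Because the value pattern accepts either `[=]` or the end of the string, a field that happens to be last in the line (such as the version here) needs no special case.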
Here is another strategy. We can use gregexpr to separate each part of the stacked data. Here is the data in a vector
V1<-c("T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512",
"T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360\xe5\xae\x89\xe5\x85\xa8\xe5\x8d\xab\xe5\xa3\xab[=]C<=>360.cn",
"T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501",
"T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804")
Now we can split out the pieces with
m <- gregexpr("(\\w)<=>(.*?)(?:\\[=\\]|$)", V1, perl=TRUE)
Getting the captured matches can be a pain, but I use the function regcapturedmatches to easily get all the matched data. I use it just like you would use the built-in regmatches
data <- regcapturedmatches(V1,m)
Then, if you inspect data, you can see that all the information is there. The only problem now is that we need to arrange it as columns rather than as rows, as it is now. For this I use reshape2
library(reshape2)

#combine list into one data.frame
sdata<-do.call(rbind, lapply(1:length(data),
    function(i) data.frame(data[[i]], S=i)))

#turn rows into columns
dcast(sdata, S~X1, value.var="X2")
This returns
  S    I             P   T              V     W      C                                                   N     A     B
1 1 1820  explorer.exe 161 6.00.2900.5512 20094   <NA>                                                <NA>  <NA>  <NA>
2 2 1732   360Safe.exe 195  7, 5, 0, 1501 1017e 360.cn 360\xe5\xae\x89\xe5\x85\xa8\xe5\x8d\xab\xe5\xa3\xab  <NA>  <NA>
3 3 1732   360Safe.exe 209  7, 5, 0, 1501 1017e   <NA>                                                <NA>  <NA>  <NA>
4 4  436 360chrome.exe 211      5.2.0.804  <NA>   <NA>                                                <NA> 2027a 20290
               U
1           <NA>
2           <NA>
3           <NA>
4 www.hao123.com
You can rename the columns and so on, but it really isn't that much code to do all of the conversions at once.
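Because the records are plain key/value pairs, the same reshaping can also be sketched in base R without `regcapturedmatches` or `reshape2`. This is an alternative under stated assumptions, not the method used above: it assumes `[=]` never occurs inside a value and that every piece has the form `key<=>value`; the helper name `parse_record` and the two-row sample are hypothetical.

```r
parse_record <- function(s) {
  # Split one log line into "key<=>value" pieces.
  parts <- strsplit(s, "[=]", fixed = TRUE)[[1]]
  keys <- sub("<=>.*$", "", parts)                # text before the first "<=>"
  vals <- sub("^.*?<=>", "", parts, perl = TRUE)  # text after the first "<=>"
  setNames(vals, keys)
}

V1 <- c("T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512",
        "T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804")

records <- lapply(V1, parse_record)

# Union of all keys, in order of first appearance.
all_keys <- unique(unlist(lapply(records, names)))

# Indexing by a missing name gives NA, which fills the gaps in each row.
wide <- as.data.frame(
  do.call(rbind, lapply(records, function(r) unname(r[all_keys]))),
  stringsAsFactors = FALSE
)
names(wide) <- all_keys
print(wide)
```

Every column comes back as character, so numeric fields such as the PID would still need an explicit conversion afterwards.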