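I have a character variable (companies) with observations that look like this:
I'm trying to split these strings into 3 parts: everything before the first ".", everything between the first "." and the next number (which is consistently formatted as #.##), and the #.## number itself. Taking the first observation as an example, I would want: "612","Grt.Am.CMt&Inv","5.01"
I tried defining a pattern with rebus and using str_match, but the code below only works for cases like obs #2 and #3. It does not account for all the variation in the middle part of the string, so it fails to capture the other observations.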
library(rebus)    # DGT, DOT, SPC, WRD, capture(), or(), one_or_more(), %R%
library(stringr)  # str_match()

pattern2 <- capture(one_or_more(DGT)) %R% DOT %R% SPC %R%
  capture(or(one_or_more(WRD), one_or_more(WRD) %R% SPC %R% one_or_more(WRD))) %R%
  SPC %R% capture(DGT %R% DOT %R% one_or_more(DGT))
str_match(companies, pattern = pattern2)
Is there a better way to split the strings into these 3 parts?
I'm not familiar with regex, but I've seen it suggested a lot (I'm new to R and Stack Overflow).
You should be able to debug the regular expression you wrote:
> as.regex(pattern2)
<regex> ([\d]+)\.\s((?:[\w]+|[\w]+\s[\w]+))\s(\d\.[\d]+)
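Plug it into regex101 and you will see that your strings do not always match. The explanation on the right tells you that only 1 or 2 whitespace-separated words are allowed between the dot and the number. Also, WRD (the [\w]+ pattern) does not match dots or any other character that is not a letter, digit, or _. Now you know that you need to match your strings with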
^(\d+)\.(.*?)\s*(\d\.\d{2})$
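If regex101 is not at hand, the same diagnosis can be sketched in base R. This is only an illustration: it reuses the sample strings from the check later in this answer (not necessarily the asker's real data), and the names old_rx, new_rx, old_ok and new_ok are made up for the example.

```r
# Sample strings taken from the check later in this answer.
companies <- c("612. Grt. Am. Mgt. & Inv. 7.33", "77. Wickes 4.61",
               "265. Wang Labs 8.75", "9. CrossLand Savings 6.32",
               "228. JPS Textile Group 2.00")

# Regex generated by the original rebus pattern (see the as.regex() output).
old_rx <- "([\\d]+)\\.\\s((?:[\\w]+|[\\w]+\\s[\\w]+))\\s(\\d\\.[\\d]+)"
# Corrected regex: a lazy "anything" between the first dot and the #.## number.
new_rx <- "^(\\d+)\\.(.*?)\\s*(\\d\\.\\d{2})$"

# perl = TRUE is needed for \d, \w, \s and the lazy quantifier.
old_ok <- grepl(old_rx, companies, perl = TRUE)
new_ok <- grepl(new_rx, companies, perl = TRUE)

print(data.frame(companies, old_ok, new_ok))
```

With these five strings, the old regex should miss the first one (dots and "&" in the name) and the last one (three words), while the corrected regex matches all of them.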
See this regex demo. Translated into rebus:
pattern2 <- START %R% # ^ - start of string
capture(one_or_more(DGT)) %R% # (\d+) - Group 1: one or more digits
DOT %R% # \. - a dot
"(.*?)" %R% # (.*?) - Group 2: any 0+ chars as few as possible
zero_or_more(SPC) %R% # \s* - 0+ whitespaces
capture(DGT %R% DOT %R% repeated(DGT, 2)) %R% # (\d\.\d{2}) - Group 3: #.## number
END # $ - end of string
Check:
> pattern2
<regex> ^([\d]+)\.(.*?)[\s]*(\d\.[\d]{2})$
> companies <- c("612. Grt. Am. Mgt. & Inv. 7.33","77. Wickes 4.61","265. Wang Labs 8.75","9. CrossLand Savings 6.32","228. JPS Textile Group 2.00")
> str_match(companies, pattern = pattern2)
[,1] [,2] [,3] [,4]
[1,] "612. Grt. Am. Mgt. & Inv. 7.33" "612" " Grt. Am. Mgt. & Inv." "7.33"
[2,] "77. Wickes 4.61" "77" " Wickes" "4.61"
[3,] "265. Wang Labs 8.75" "265" " Wang Labs" "8.75"
[4,] "9. CrossLand Savings 6.32" "9" " CrossLand Savings" "6.32"
[5,] "228. JPS Textile Group 2.00" "228" " JPS Textile Group" "2.00"
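Note that Group 2 keeps the space that follows the first dot. As a minimal sketch, assuming plain base R is acceptable in place of rebus and stringr, the three parts could be pulled into a data frame and cleaned up like this (the column names id, name and value are invented for the example):

```r
companies <- c("612. Grt. Am. Mgt. & Inv. 7.33", "77. Wickes 4.61",
               "265. Wang Labs 8.75", "9. CrossLand Savings 6.32",
               "228. JPS Textile Group 2.00")

rx <- "^(\\d+)\\.(.*?)\\s*(\\d\\.\\d{2})$"

# regexec() returns the full match plus one entry per capture group;
# regmatches() turns those positions into strings.
m <- regmatches(companies, regexec(rx, companies, perl = TRUE))

# Assumes every string matches; a non-matching string yields character(0)
# and would have to be handled before rbind().
parts <- do.call(rbind, m)

result <- data.frame(
  id    = as.integer(parts[, 2]),
  name  = trimws(parts[, 3]),   # drop the leading space kept by Group 2
  value = as.numeric(parts[, 4]),
  stringsAsFactors = FALSE
)
print(result)
```

Trimming after the match keeps the regex identical to the one shown above. Adding \s* right after the dot in the pattern would be another way to get the same result.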
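A caveat: capture(lazy(zero_or_more(ANY_CHAR))) returns the ([.]*?) pattern, which matches 0 or more dots, as few as possible, rather than any 0+ characters, because rebus has a bug: it wraps every repeated (one_or_more or zero_or_more) character in [ and ] (a character class). That is why (.*?) is added "manually" above.
This can be solved or worked around with a common construct such as [\w\W], [\s\S], or [\d\D]: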
pattern2 <- START %R% # ^ - start of string
capture(one_or_more(DGT)) %R% # (\d+) - Group 1: one or more digits
DOT %R% # \. - a dot
capture( # Group 2 start:
lazy(zero_or_more(char_class(WRD, NOT_WRD))) # - [\w\W] - any 0+ chars as few as possible
) %R% # End of Group 2
zero_or_more(SPC) %R% # \s* - 0+ whitespaces
capture(DGT %R% DOT %R% repeated(DGT, 2)) %R% # (\d\.\d{2}) - Group 3: #.## number
END
Check:
> as.regex(pattern2)
<regex> ^([\d]+)\.([\w\W]*?)[\s]*(\d\.[\d]{2})$
See the regex demo.
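To see why the character-class wrapping matters, here is a small base R sketch that compares [.]*? with [\w\W]*? on one of the sample strings. The helper group2 and the variable names are hypothetical and exist only for this illustration.

```r
x <- "612. Grt. Am. Mgt. & Inv. 7.33"

# Hypothetical helper: return capture group 2 for a given regex, or NA
# when the regex does not match (indexing character(0) yields NA).
group2 <- function(rx, s) {
  regmatches(s, regexec(rx, s, perl = TRUE))[[1]][3]
}

# Inside [...] the dot is a literal dot, so [.]*? can only consume dots;
# the letters between "612." and "7.33" make the whole match fail.
broken <- group2("^(\\d+)\\.([.]*?)\\s*(\\d\\.\\d{2})$", x)

# [\w\W] matches any character, so the lazy group can span the whole name.
working <- group2("^(\\d+)\\.([\\w\\W]*?)\\s*(\\d\\.\\d{2})$", x)

print(broken)
print(working)
```

Here broken should come back as NA, while working holds the middle part of the string, including its leading space.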