Split and extract part of a string in R (between a "." and a number)

Cha*_*han 7 regex r stringr

I have a character variable (companies) whose observations look like this:

  1. "612. Grt. Am. Mgt. & Inv. 7.33"
  2. "77. Wickes 4.61"
  3. "265. Wang Labs 8.75"
  4. "9. CrossLand Savings 6.32"
  5. "228. JPS Textile Group 2.00"

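I am trying to split these strings into 3 parts:

  1. all the digits before the first ".",
  2. everything between the first "." and the next number (which has the consistent format #.##), and
  3. the final number itself (format #.##).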

Taking the first obs as an example, I would like: "612", "Grt. Am. Mgt. & Inv.", "7.33"

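I tried defining a pattern with rebus and using str_match, but the code below only works for cases like obs #2 and #3. It does not account for all the variation in the middle part of the string, so it fails to capture the other observations.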

library(rebus)    # DGT, DOT, SPC, WRD, capture(), or(), one_or_more(), %R%
library(stringr)  # str_match()

pattern2 <- capture(one_or_more(DGT)) %R% DOT %R% SPC %R% 
            capture(or(one_or_more(WRD), one_or_more(WRD) %R% SPC 
            %R% one_or_more(WRD))) %R% SPC %R% capture(DGT %R% DOT 
            %R% one_or_more(DGT))

str_match(companies, pattern = pattern2)

Is there a better way to split the strings into these 3 parts?

I am not familiar with regex, but I have seen it suggested a lot (I am new to R and Stack Overflow).

Wik*_*żew 1

You should be able to debug the regex you wrote.

> as.regex(pattern2)
<regex> ([\d]+)\.\s((?:[\w]+|[\w]+\s[\w]+))\s(\d\.[\d]+)

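Paste it into regex101 and you will see that it does not always match your strings. The explanation on the right shows that you only allow 1 or 2 whitespace-separated words between the dot and the number. Also, WRD (the [\w]+ pattern) does not match dots or any other characters that are not letters, digits or _. So now you know you need to match your strings with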

^(\d+)\.(.*?)\s*(\d\.\d{2})$
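The diagnosis above can also be checked locally rather than on regex101. The following is a minimal sketch, assuming base R's grepl() with perl = TRUE as a stand-in for the online tester; the old pattern string is copied from the as.regex() output, and the variable names old_ok and new_ok are illustrative only.

```r
# Sample data taken from the question.
companies <- c("612. Grt. Am. Mgt. & Inv. 7.33", "77. Wickes 4.61",
               "265. Wang Labs 8.75", "9. CrossLand Savings 6.32",
               "228. JPS Textile Group 2.00")

# Pattern produced by the original rebus code (from as.regex(pattern2)).
old_pattern <- "([\\d]+)\\.\\s((?:[\\w]+|[\\w]+\\s[\\w]+))\\s(\\d\\.[\\d]+)"
# Anchored pattern with a lazy "anything" group in the middle.
new_pattern <- "^(\\d+)\\.(.*?)\\s*(\\d\\.\\d{2})$"

# TRUE where the pattern finds a match in the string.
old_ok <- grepl(old_pattern, companies, perl = TRUE)
new_ok <- grepl(new_pattern, companies, perl = TRUE)

print(data.frame(companies, old_ok, new_ok))
```

With this data the old pattern should match only the names made of one or two plain words, while the new pattern should match every row.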

See this regex demo. Translated into rebus:

pattern2 <- START %R%            # ^ - start of string
 capture(one_or_more(DGT)) %R%   # (\d+) - Group 1: one or more digits
 DOT %R%                         # \. - a dot
 "(.*?)" %R%                     # (.*?) - Group 2: any 0+ chars as few as possible
 zero_or_more(SPC) %R%           # \s* - 0+ whitespaces 
 capture(DGT %R% DOT %R% repeated(DGT, 2)) %R% # (\d\.\d{2}) - Group 3: #.## number
END                              # $ - end of string

Check:

> pattern2
<regex> ^([\d]+)\.(.*?)[\s]*(\d\.[\d]{2})$

> companies <- c("612. Grt. Am. Mgt. & Inv. 7.33","77. Wickes 4.61","265. Wang Labs 8.75","9. CrossLand Savings 6.32","228. JPS Textile Group 2.00")
> str_match(companies, pattern = pattern2)
     [,1]                             [,2]  [,3]                    [,4]  
[1,] "612. Grt. Am. Mgt. & Inv. 7.33" "612" " Grt. Am. Mgt. & Inv." "7.33"
[2,] "77. Wickes 4.61"                "77"  " Wickes"               "4.61"
[3,] "265. Wang Labs 8.75"            "265" " Wang Labs"            "8.75"
[4,] "9. CrossLand Savings 6.32"      "9"   " CrossLand Savings"    "6.32"
[5,] "228. JPS Textile Group 2.00"    "228" " JPS Textile Group"    "2.00"
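Note that group 2 in the output above keeps the space that follows the first dot. One possible way to get clean columns is sketched below in base R (regexec() and regmatches() instead of str_match()). The helper name split_company, the column names index, name and value, and the extra \s* after the first dot are assumptions added for illustration, not part of the original pattern.

```r
# Sample data taken from the question.
companies <- c("612. Grt. Am. Mgt. & Inv. 7.33", "77. Wickes 4.61",
               "265. Wang Labs 8.75", "9. CrossLand Savings 6.32",
               "228. JPS Textile Group 2.00")

# Hypothetical helper: returns one row per string with the three parts.
split_company <- function(x) {
  # Same three groups as the pattern above, plus \s* after the first dot
  # so that group 2 does not start with whitespace.
  pattern <- "^(\\d+)\\.\\s*(.*?)\\s*(\\d\\.\\d{2})$"
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Strings that do not match yield character(0); pad those with NA.
  m <- lapply(m, function(g) if (length(g) == 4) g else rep(NA_character_, 4))
  parts <- do.call(rbind, m)  # columns: full match, group 1, group 2, group 3
  data.frame(index = as.integer(parts[, 2]),
             name  = parts[, 3],
             value = as.numeric(parts[, 4]),
             stringsAsFactors = FALSE)
}

result <- split_company(companies)
print(result)
```

Calling trimws() on column 3 of the str_match() result would be an alternative that leaves the pattern untouched.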

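A caveat: capture(lazy(zero_or_more(ANY_CHAR))) returns the ([.]*?) pattern, which matches 0 or more dots as few as possible rather than any 0+ chars, because rebus has a bug: it wraps all repeated (one_or_more or zero_or_more) chars in [ and ] (a character class). That is why (.*?) is added "manually".

The issue can be worked around with a common construct such as [\w\W], [\s\S] or [\d\D]: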

pattern2 <- START %R%                          # ^ - start of string
 capture(one_or_more(DGT)) %R%                 # (\d+) - Group 1: one or more digits
 DOT %R%                                       # \. - a dot
 capture(                                      # Group 2 start:
  lazy(zero_or_more(char_class(WRD, NOT_WRD))) #  - [\w\W] - any 0+ chars as few as possible
 ) %R%                                         # End of Group 2
 zero_or_more(SPC) %R%                         # \s* - 0+ whitespaces 
 capture(DGT %R% DOT %R% repeated(DGT, 2)) %R% # (\d\.\d{2}) - Group 3: #.## number
END

Check:

> as.regex(pattern2)
<regex> ^([\d]+)\.([\w\W]*?)[\s]*(\d\.[\d]{2})$

See the regex demo.
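The difference between a dot inside a character class and the [\w\W] construct can be seen directly. This is a small sketch, assuming base R's grepl() with perl = TRUE; the two pattern strings are written out by hand from the forms quoted in this answer, and the variable names are illustrative.

```r
# Sample data taken from the question.
companies <- c("612. Grt. Am. Mgt. & Inv. 7.33", "77. Wickes 4.61",
               "265. Wang Labs 8.75", "9. CrossLand Savings 6.32",
               "228. JPS Textile Group 2.00")

# Inside [...] the dot is a literal dot, so group 2 can only hold dots.
dots_only <- "^([\\d]+)\\.([.]*?)[\\s]*(\\d\\.[\\d]{2})$"
# [\w\W] is "word char or non-word char", i.e. any character at all.
any_char  <- "^([\\d]+)\\.([\\w\\W]*?)[\\s]*(\\d\\.[\\d]{2})$"

dots_ok <- grepl(dots_only, companies, perl = TRUE)
any_ok  <- grepl(any_char, companies, perl = TRUE)
print(data.frame(companies, dots_ok, any_ok))

# Smallest possible illustration of the same point.
only_dots_match    <- grepl("^[.]*$", "...", perl = TRUE)
letters_dont_match <- grepl("^[.]*$", "abc", perl = TRUE)
```

On this data the dots-only variant should match none of the strings, because every company name contains characters other than dots, whereas the [\w\W] variant should match all of them.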