Split a string at the first occurrence of an integer using R

NM_*_*NM_ 5 regex string split r data.table

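Note that I have already read "Split string at first occurrence of an integer in a string", but my request is different because I want to do this in R.

Suppose I have the following example data frame: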

> df = data.frame(name_and_address =
      c("Mr. Smith12 Some street",
        "Mr. Jones345 Another street",
        "Mr. Anderson6 A different street"))
> df
                  name_and_address
1          Mr. Smith12 Some street
2      Mr. Jones345 Another street
3 Mr. Anderson6 A different street

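I want to split each string at the first occurrence of an integer. Note that the integers vary in length.

The desired output could look like this: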

[[1]]
[1] "Mr. Smith"
[2] "12 Some street",

[[2]]
[1] "Mr. Jones"
[2] "345 Another street",

[[3]]
[1] "Mr. Anderson"
[2] "6 A different street"

I have tried the following, but I cannot get the regular expression right:

# Attempt 1 (Does not work)
library(data.table)
tstrsplit(df,'(?=\\d+)', perl=TRUE, type.convert=TRUE)

# Attempt 2 (Does not work)
library(stringr)
str_split(df, "\\d+")

Wik*_*żew 3

You can use tidyr::extract:

library(tidyr)
df <- df %>% 
    extract("name_and_address", c("name", "address"), "(\\D*)(\\d.*)")
## => df
##           name              address
## 1    Mr. Smith       12 Some street
## 2    Mr. Jones   345 Another street
## 3 Mr. Anderson 6 A different street

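The regular expression (\D*)(\d.*) matches the following:

  • (\D*) - Group 1: zero or more non-digit characters
  • (\d.*) - Group 2: a digit, followed by zero or more of any characters, as many as possible.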
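
To see how the two capture groups divide a string, here is a minimal base R sketch that applies the same pattern with regexec and regmatches to the sample strings from the question. The variable names x, m and parts are only illustrative and are not part of the original answer.

```r
# Sample strings from the question
x <- c("Mr. Smith12 Some street",
       "Mr. Jones345 Another street",
       "Mr. Anderson6 A different street")

# regexec() records where the full match and each capture group start;
# regmatches() then extracts the corresponding text
m <- regmatches(x, regexec("(\\D*)(\\d.*)", x, perl = TRUE))

# Each element is c(full match, group 1, group 2); drop the full match
parts <- lapply(m, function(p) p[-1])
parts[[1]]
# [1] "Mr. Smith"      "12 Some street"
```

Group 1 stops right before the first digit because \D cannot match a digit, so group 2 always starts at that digit.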

An alternative solution with stringr::str_split is also possible:

str_split(df$name_and_address, "(?=\\d)", n=2)
## => [[1]]
## [1] "Mr. Smith"      "12 Some street"

## [[2]]
## [1] "Mr. Jones"          "345 Another street"

## [[3]]
## [1] "Mr. Anderson"         "6 A different street"

The positive lookahead (?=\d) matches the position immediately before a digit, and n=2 tells stringr::str_split to split into at most 2 pieces.
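
The same idea (find the position of the first digit, then cut the string there) can also be sketched in base R without a lookahead. This is an illustrative alternative, not part of the original answer: split_at_first_digit is a hypothetical helper name, and leaving a string unsplit when it contains no digit is an assumption about the desired behaviour.

```r
# Hypothetical helper: split each string into at most two pieces
# at the position of its first digit
split_at_first_digit <- function(x) {
  x <- as.character(x)
  # regexpr() returns the 1-based position of the first digit, or -1 if none
  pos <- regexpr("[0-9]", x)
  lapply(seq_along(x), function(i) {
    if (pos[i] < 1) {
      return(x[i])                        # no digit: leave the string unsplit
    }
    c(substr(x[i], 1, pos[i] - 1),        # text before the first digit
      substr(x[i], pos[i], nchar(x[i])))  # from the first digit onwards
  })
}

split_at_first_digit(c("Mr. Smith12 Some street", "No digits, sorry."))
```

If the digit is the very first character, the first piece is an empty string.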

A base R approach that returns an empty string for both parts when the string contains no digit:

df = data.frame(name_and_address = c("Mr. Smith12 Some street", "Mr. Jones345 Another street", "Mr. Anderson6 A different street", "1 digit is at the start", "No digits, sorry."))

df$name <- sub("^(?:(\\D*)\\d.*|.+)", "\\1", df$name_and_address)
df$address <- sub("^\\D*(\\d.*)?", "\\1", df$name_and_address)
df$name
# => [1] "Mr. Smith"    "Mr. Jones"    "Mr. Anderson" ""             ""
df$address
# => [1] "12 Some street"          "345 Another street"
#    [3] "6 A different street"    "1 digit is at the start"
#    [5] ""

See the online R demo. This approach also handles the case where the first digit is the first character of the string.