Extract numbers of a specific length from a string in R using a regular expression

Question

Ank*_*ira 0 regex r extract string-length gsub

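This may look like a duplicate question, but the other answers did not help me. I am trying to extract any 8-digit number from a text. The number can be anywhere in the text. It can stand alone, or it can be preceded or followed by a string. Basically, I need to extract every occurrence of 8 consecutive digit characters from a string in R, using regular expressions only.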

Here is what I tried, without success:

> my_text <- "the number 5849 and 5555555555 shouldn't turn up. but12345654 and 99119911 should be. let's see if 1234567H also works. It shouldn't. both 12345678JE and RG10293847 should turn up as well."

> ## this doesn't work
> gsub('(\\d{8})', '\\1', my_text)
[1] "the number 5849 and 5555555555 shouldn't turn up. but12345654 and 99119911 should be. let's see if 1234567H also works. It shouldn't. both 12345678JE and RG10293847 should turn up as well."

My desired output should extract the following numbers:

12345654 99119911 12345678 10293847

I would also appreciate it if the answer included a second regular expression that extracts only the first occurrence of an 8-digit number:

12345654

EDIT: I have a very large table (about 200 million rows) and need to do this on one of its columns. What would be the most efficient solution?

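EDIT: I realized that my test text was missing a case. The text also contains some numbers longer than 8 digits, but I only want to extract numbers that are exactly 8 digits long.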

Answer 1

Ron*_*hah 6

We can use str_extract_all from stringr

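stringr::str_extract_all(my_text, "\\d{8}")[[1]]
#[1] "55555555" "12345654" "99119911" "12345678" "10293847"
# "55555555" is the first 8 digits of the 10-digit run 5555555555 in the updated my_text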

Similarly, in base R we can use gregexpr and regmatches

regmatches(my_text, gregexpr("\\d{8}", my_text))[[1]]

To get the last 8-digit number, we can use

sub('.*(\\d{8}).*', '\\1', my_text)
#[1] "10293847"

And for the first one, we can use

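sub('.*?(\\d{8}).*', '\\1', my_text)
#[1] "55555555"
# with the updated my_text this is the start of 5555555555; without that number the result is "12345654"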
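
The two calls differ only in the leading quantifier: a greedy .* consumes as much as it can, so the capture lands on the last run of 8 digits, while a lazy .*? consumes as little as it can, so the capture lands on the first. Below is a minimal sketch on a made-up string x (hypothetical, not from the question). Passing perl = TRUE is my addition so that the quantifier semantics are PCRE's; the calls above use R's default engine. Note also that sub returns its input unchanged when nothing matches, which may matter when this is applied to a whole column.

```r
# Made-up example string (hypothetical, not from the question)
x <- "ab11112222cd33334444ef"

# Greedy prefix: backtracks from the end, so the last 8-digit run is captured
last8 <- sub(".*(\\d{8}).*", "\\1", x, perl = TRUE)

# Lazy prefix: expands from the start, so the first 8-digit run is captured
first8 <- sub(".*?(\\d{8}).*", "\\1", x, perl = TRUE)

# No match at all: sub() hands back the input as is
unchanged <- sub(".*?(\\d{8}).*", "\\1", "no digits here", perl = TRUE)

print(c(last8, first8, unchanged))
```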

EDIT

For the updated case, where we want to match exactly 8 digits (and no more), we can use str_match_all with a negative lookbehind and a negative lookahead

stringr::str_match_all(my_text, "(?<!\\d)\\d{8}(?!\\d)")[[1]][, 1]
#[1] "12345654" "99119911" "12345678" "10293847"

Here, we get the 8-digit numbers that are neither preceded nor followed by another digit.
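
The same lookarounds can probably be used without stringr. The following is a sketch in base R, assuming perl = TRUE (which switches gregexpr and regexpr to PCRE, where lookbehind and lookahead are available). The names pattern, all_hits and first_hit are mine, chosen for illustration. Using regexpr instead of gregexpr returns only the first match, which is what the second part of the question asks for.

```r
my_text <- "the number 5849 and 5555555555 shouldn't turn up. but12345654 and 99119911 should be. let's see if 1234567H also works. It shouldn't. both 12345678JE and RG10293847 should turn up as well."

# Exactly 8 digits: no digit immediately before, no digit immediately after
pattern <- "(?<!\\d)\\d{8}(?!\\d)"

# All runs of exactly 8 digits (gregexpr finds every match)
all_hits <- regmatches(my_text, gregexpr(pattern, my_text, perl = TRUE))[[1]]

# Only the first such run (regexpr stops at the first match)
first_hit <- regmatches(my_text, regexpr(pattern, my_text, perl = TRUE))

print(all_hits)
print(first_hit)
```

With the my_text from the question, all_hits should hold the four numbers listed above and first_hit should be "12345654"; the 10-digit run 5555555555 is skipped because every 8-digit window inside it touches another digit.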

A simple alternative is to extract all runs of digits from the string and keep only those that are exactly 8 digits long

v1 <- regmatches(my_text, gregexpr("\\d+", my_text))[[1]]
v1[nchar(v1) == 8]
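
Since the question mentions running this over a table column, the same idea can be written for a whole character vector, because gregexpr and regmatches are vectorized over their input. This is only a sketch: texts is a made-up stand-in for the real column, exact8 and first8 are illustrative names, and I make no claim about how it performs on 200 million rows, which would need benchmarking.

```r
# Hypothetical column values standing in for the real table column
texts <- c("id 12345678 ok",
           "too long 123456789",
           "a1234567b and 87654321x00001111")

# One character vector of digit runs per element of texts
runs <- regmatches(texts, gregexpr("[0-9]+", texts))

# Keep only the runs that are exactly 8 characters long
exact8 <- lapply(runs, function(v) v[nchar(v) == 8])

# First exact-8 run per element, NA where there is none
first8 <- vapply(exact8,
                 function(v) if (length(v) > 0) v[1] else NA_character_,
                 character(1))

print(first8)
```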

Archived:	6 years, 3 months ago
Views:	2213
Last activity:	6 years, 3 months ago