Getting numbers out of strings using regular expressions in R

Can*_*ice 10 regex r

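Regular expressions are something I have always struggled with and never spent proper time learning. In this case, I have an R character vector of baseball data in the following format: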

hit_vector = c("", "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>.", 
"Ball was hit at <b>67 mph</b>.", "", "Ball left the bat at <b>107 mph</b> and traveled a distance of <b>412 feet</b>.", 
"Batted ball speed <b>71 mph</b>.", "Ball left the bat at <b>94 mph</b> and traveled a distance of <b>287 feet</b>.", 
"", "", "Batted ball speed <b>64 mph</b>.")  

> hit_vector
 [1] ""                                                                                                       
 [2] "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>."
 [3] "Ball was hit at <b>67 mph</b>."                                                                         
 [4] ""                                                                                                       
 [5] "Ball left the bat at <b>107 mph</b> and traveled a distance of <b>412 feet</b>."                        
 [6] "Batted ball speed <b>71 mph</b>."                                                                       
 [7] "Ball left the bat at <b>94 mph</b> and traveled a distance of <b>287 feet</b>."                         
 [8] ""                                                                                                       
 [9] ""                                                                                                       
[10] "Batted ball speed <b>64 mph</b>."  

I am trying to create a data frame with 10 rows that looks like this:

hit_dataframe
    speed   distance   degrees
1.     NA         NA        NA
2.    104        381        38
3.     67         NA        NA
4.     NA         NA        NA
5.    107        412        NA
6.     71         NA        NA
7.     94        287        NA
8.     NA         NA        NA
9.     NA         NA        NA
10.    64         NA        NA

The full hit_vector is much longer, but all of its elements seem to follow this format.

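Edit: The following looks helpful for identifying some of the information, but these lines do not work properly (the third line returns all FALSE, which is not right):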

grepl("[0-9]{1,3} mph", hit_vector)
grepl("[0-9]{1,3} feet", hit_vector)
grepl("[0-9]{1,3} degrees", hit_vector)

Edit 2: I am not sure how many digits each statistic will have. For example, mph can be over 100 (3 digits) or under 10 (1 digit).

Ony*_*mbu 11

Using base R:

read.table(text = gsub("\\D+", " ", hit_vector), fill = TRUE, blank.lines.skip = FALSE)

    V1  V2 V3
1   NA  NA NA
2  104 381 38
3   67  NA NA
4   NA  NA NA
5  107 412 NA
6   71  NA NA
7   94 287 NA
8   NA  NA NA
9   NA  NA NA
10  64  NA NA

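Here, we simply replace everything that is not a digit (i.e. \\D+) with a space, then read the result in as a table with fill=TRUE (so short rows are padded with NA) and blank.lines.skip=FALSE (so empty strings are kept as rows).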
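To make the intermediate step visible, here is a minimal sketch on a shortened sample. The vector `x` and the names `cleaned` and `parsed` are illustrative assumptions, not part of the original answer.

```r
# A shortened sample in the question's format (illustrative data only).
x <- c("",
       "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>.",
       "Ball was hit at <b>67 mph</b>.",
       "traveled a distance of <b>412 feet</b>.")

# Every run of non-digit characters collapses to a single space,
# so only the numbers (separated by spaces) remain.
cleaned <- gsub("\\D+", " ", x)
print(cleaned)

# read.table() fills columns strictly left to right; it knows nothing
# about which unit each number came from.
parsed <- read.table(text = cleaned, fill = TRUE, blank.lines.skip = FALSE)
print(parsed)
```

Note that the last string contains only a distance, yet its 412 ends up in the first column (V1). This positional approach is therefore only safe when every non-empty string lists its measurements in the same order, starting with the speed.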

To account for the comment below, we need to rearrange our data:

hit_vector1=c(hit_vector,"traveled a distance of <b>412 feet</b>.")

#Take the numbers together with their respective measurements.
a=gsub(".*?(\\d+).*?(mph|feet|degree).*?"," \\1 \\2",hit_vector1)

#Remove the </b>
b=sub("<[/]b>.","",a)

## Any element that does not contain the measurements, invoke an NA
fun=function(x){y=-grep(x,b);b<<-replace(b,y,paste(b[y],NA,x))}
invisible(sapply(c("mph","feet","degrees"),fun))

## Break the line after each measurement and read in a table format
e=gsub("([a-z])\\s","\\1\n",b)
unstack(read.table(text=e))
      degrees feet mph
1       NA   NA  NA
2       38  381 104
3       NA   NA  67
4       NA   NA  NA
5       NA  412 107
6       NA   NA  71
7       NA  287  94
8       NA   NA  NA
9       NA   NA  NA
10      NA   NA  64
11      NA  412  NA


cma*_*her 10

The str_extract function from the stringr package should be useful here:

library(stringr)

data.frame(
    speed=str_extract(hit_vector, "(\\d+)(?=\\s+mph)"),
    distance=str_extract(hit_vector, "(\\d+)(?=\\s+feet)"),
    degrees=str_extract(hit_vector, "(\\d+)(?=\\s+degrees)")
)

#    speed distance degrees
# 1   <NA>     <NA>    <NA>
# 2    104      381      38
# 3     67     <NA>    <NA>
# 4   <NA>     <NA>    <NA>
# 5    107      412    <NA>
# 6     71     <NA>    <NA>
# 7     94      287    <NA>
# 8   <NA>     <NA>    <NA>
# 9   <NA>     <NA>    <NA>
# 10    64     <NA>    <NA>

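\\d is the character class for digits, so \\d+ matches one or more digits. (?=) is the zero-width lookahead operator, so in this case the pattern matches the digits only when they are followed by one or more whitespace characters (\\s+) and then mph, feet, or degrees, without including those strings in the match. Note that str_extract returns character values, which is why the missing entries print as <NA>.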
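If stringr is not available, the same lookahead idea can be sketched in base R with regexpr(..., perl = TRUE) and regmatches. The helper name `extract_before_unit` and the sample vector `x` are hypothetical, introduced only for illustration; this version also returns integers rather than character strings.

```r
# Hypothetical helper: return the integer that directly precedes `unit`
# in each string, or NA when the unit does not occur.
extract_before_unit <- function(x, unit) {
  # Lookahead: digits followed by whitespace and the unit name.
  pattern <- paste0("\\d+(?=\\s+", unit, ")")
  m <- regexpr(pattern, x, perl = TRUE)    # -1 where there is no match
  out <- rep(NA_integer_, length(x))
  # regmatches() returns only the matched pieces, in order.
  out[m != -1] <- as.integer(regmatches(x, m))
  out
}

# Illustrative sample in the question's format.
x <- c("",
       "Batted ball speed <b>104 mph</b>; distance of <b>381 feet</b>; launch angle of <b>38 degrees</b>.",
       "traveled a distance of <b>412 feet</b>.")

hits <- data.frame(
  speed    = extract_before_unit(x, "mph"),
  distance = extract_before_unit(x, "feet"),
  degrees  = extract_before_unit(x, "degrees")
)
print(hits)
```

Because each column is keyed on its unit rather than on position, a string that reports only a distance still places its value in the distance column.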

  • You need to be careful: the second row has 104, 381, and 38 (2 upvotes)