Extracting a string between two known strings in R

Ray*_*S. 2 regex r stringr

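I want to extract a string that sits between two other strings. One of the delimiters is a carriage return, and the other is one of several variants of nearly the same characters: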

dput(head(decisions$Title))
c("Zinaida Shumilina et al. v. Belarus                    \r\n                    
CCPR/C/120/D/2142/2012", 
"K.E.R. vs. Canada                    \r\n                    
CCPR/C/120/D/2196/2012", 
"Lounis Khelifati v Algeria                    \r\n                    
CCPR/C/120/D/2267/2013", 
"Hibaq Said Hash v. Denmark                    \r\n                    
CCPR/C/120/D/2470/2014", 
"Anton Batanov v. Russian Federation                    \r\n                    
CCPR/C/120/D/2532/2015", 
"S. Z. v. Denmark                    \r\n                    
CCPR/C/120/D/2625/2015"
)

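I basically want to extract the country name between "v." and the carriage return \r. However, "v." is sometimes written as "v", "vs.", "vs", or "v:".

Based on the answers to a related SO question, I tried the following: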

res <- str_match(decisions$Title, "(v\\.|vs\\.|v)(.*?)\\r")
res[,3]

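Unfortunately, this does not catch all of the variants, and in some cases it returns data such as "ruz Tahirovich Nasyrlayev v. Turkmenistan" when trying to extract the country name from "Navruz Tahirovich Nasyrlayev v. Turkmenistan CCPR/C/117/D/2219/2012".

Is there another way to achieve this?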

Wik*_*żew 6

You can use

trimws(str_match(decisions$Title, "\\bv(?:s?\\.|:)?\\s*(.*)")[,2])

See the regex demo. Note that trimws will remove the redundant leading and trailing whitespace characters.

Pattern details

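  • \\b - a word boundary
  • v - a literal v
  • (?:s?\\.|:)? - an optional group that matches either an optional s followed by a . char, or a : char
  • \\s* - 0+ whitespace characters
  • (.*) - Group 1: any 0+ characters other than line break characters. You do not have to worry about whether . matches the CR symbol or not (in the TRE regex flavor used with sub, . also matches LF symbols), because trimws will trim the leading and trailing whitespace anyway.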
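
One caveat worth checking against your own data: in the pattern above the optional group only consumes an s when a . follows it, so a bare "vs" (one of the variants listed in the question) would leave the s inside the captured group. The sketch below is a possible variant, not part of the original answer. It uses base R with perl = TRUE so that it runs without stringr, requires at least one whitespace character after the separator (so a lowercase name particle such as "van" is not mistaken for it), and stops the capture explicitly at CR or LF. The last two sample titles and their document numbers are hypothetical placeholders, and extract_country is an illustrative helper name.

```r
# Sample titles: the first three follow the question's data (whitespace shortened),
# the last two are hypothetical rows added to cover the "vs" and "v:" variants.
titles <- c(
  "Zinaida Shumilina et al. v. Belarus    \r\n    CCPR/C/120/D/2142/2012",
  "K.E.R. vs. Canada    \r\n    CCPR/C/120/D/2196/2012",
  "Lounis Khelifati v Algeria    \r\n    CCPR/C/120/D/2267/2013",
  "A. B. vs Denmark    \r\n    CCPR/C/120/D/0000/0000",
  "C. D. v: Russian Federation    \r\n    CCPR/C/120/D/0000/0000"
)

# \b        word boundary, so a "v" inside a name is skipped
# vs?       "v" or "vs"
# [.:]?     optional "." or ":"
# \s+       at least one whitespace character
# ([^\r\n]*) Group 1: everything up to the first CR or LF
pattern <- "\\bvs?[.:]?\\s+([^\\r\\n]*)"

# Hypothetical helper: returns the trimmed Group 1, or NA when nothing matches
extract_country <- function(x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(
    m,
    function(g) if (length(g) >= 2) trimws(g[2]) else NA_character_,
    character(1),
    USE.NAMES = FALSE
  )
}

countries <- extract_country(titles)
print(countries)
```

Requiring whitespace after the separator is a design choice: it trades tolerance for titles like "v.Belarus" (no space) for protection against names that merely start with a lowercase v.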

Tested in R:

> df<-c("Zinaida Shumilina et al. v. Belarus                    \r\n                    
+ CCPR/C/120/D/2142/2012", 
+ "K.E.R. vs. Canada                    \r\n                    
+ CCPR/C/120/D/2196/2012", 
+ "Lounis Khelifati v Algeria                    \r\n                    
+ CCPR/C/120/D/2267/2013", 
+ "Hibaq Said Hash v. Denmark                    \r\n                    
+ CCPR/C/120/D/2470/2014", 
+ "Anton Batanov v. Russian Federation                    \r\n                    
+ CCPR/C/120/D/2532/2015", 
+ "S. Z. v. Denmark                    \r\n                    
+ CCPR/C/120/D/2625/2015"
+ )

> trimws(str_match(df, "\\bv(?:s?\\.|:)?\\s*(.*)")[,2])
[1] "Belarus"            "Canada"             "Algeria"           
[4] "Denmark"            "Russian Federation" "Denmark"           
> 
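
For comparison, here is a small sketch of why the attempt in the question returned "ruz Tahirovich Nasyrlayev v. Turkmenistan": its bare v alternative has no word boundary in front of it, so the leftmost match starts at the v inside "Navruz". The title string is reconstructed from the question, with the whitespace and \r\n layout assumed to follow the dput output; capture is a hypothetical helper, and base R with perl = TRUE stands in for stringr so the snippet has no package dependency.

```r
# Title reconstructed from the question; the spacing and "\r\n" are assumptions
title <- "Navruz Tahirovich Nasyrlayev v. Turkmenistan    \r\n    CCPR/C/117/D/2219/2012"

# Pattern from the question: the bare "v" branch can match in the middle of a word
loose <- "(v\\.|vs\\.|v)(.*?)\\r"

# Same pattern with a word boundary in front, so only a standalone token matches
anchored <- "\\b(v\\.|vs\\.|v)(.*?)\\r"

# Hypothetical helper: element 3 of the match is the second capture group
capture <- function(pattern, x) {
  trimws(regmatches(x, regexec(pattern, x, perl = TRUE))[[1]][3])
}

loose_result <- capture(loose, title)
anchored_result <- capture(anchored, title)

print(loose_result)
print(anchored_result)
```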