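I want to extract a string that sits between two other strings. One of the delimiters is a carriage return, while the other is one of several near-identical variants: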
dput(head(decisions$Title))
c("Zinaida Shumilina et al. v. Belarus \r\n
CCPR/C/120/D/2142/2012",
"K.E.R. vs. Canada \r\n
CCPR/C/120/D/2196/2012",
"Lounis Khelifati v Algeria \r\n
CCPR/C/120/D/2267/2013",
"Hibaq Said Hash v. Denmark \r\n
CCPR/C/120/D/2470/2014",
"Anton Batanov v. Russian Federation \r\n
CCPR/C/120/D/2532/2015",
"S. Z. v. Denmark \r\n
CCPR/C/120/D/2625/2015"
)
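I basically want to extract the country name between "v." and the carriage return \r. However, "v." sometimes appears as "v", "vs.", "vs", or "v:".
Based on answers to related SO questions, I tried the following: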
res <- str_match(decisions$Title, "(v\\.|vs\\.|v)(.*?)\\r")
res[,3]
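Unfortunately, this does not catch all of the variations, and in some cases it returns data such as "ruz Tahirovich Nasyrlayev v. Turkmenistan" when trying to extract the country name from "Navruz Tahirovich Nasyrlayev v. Turkmenistan CCPR/C/117/D/2219/2012".
Is there another way to achieve this?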
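The misfire described above can be reproduced in isolation. The sketch below uses base R (`regexec` with `perl = TRUE`) rather than `str_match`, and the sample title is a hypothetical string shaped like the entries in `decisions$Title`. It suggests the cause is the missing word boundary: the first "v" the pattern finds is the one inside "Navruz".

```r
# Hypothetical sample title, shaped like the entries in decisions$Title
x <- "Navruz Tahirovich Nasyrlayev v. Turkmenistan \r\n CCPR/C/117/D/2219/2012"

# Original attempt: nothing stops the bare "v" alternative from matching
# the "v" inside "Navruz", so the capture starts in the middle of the name
loose <- regmatches(x, regexec("(v\\.|vs\\.|v)(.*?)\\r", x, perl = TRUE))[[1]]
trimws(loose[3])
# "ruz Tahirovich Nasyrlayev v. Turkmenistan"

# Same pattern with a leading \b: the separator must start a word
anchored <- regmatches(x, regexec("\\b(v\\.|vs\\.|v)(.*?)\\r", x, perl = TRUE))[[1]]
trimws(anchored[3])
# "Turkmenistan"
```

Element 3 of each result is the second capture group, matching the `res[,3]` column used in the attempt.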
You can use
library(stringr)  # provides str_match
trimws(str_match(decisions$Title, "\\bv(?:s?\\.|:)?\\s*(.*)")[,2])
See the regex demo. Note that trimws removes any extra leading and trailing whitespace characters.
Pattern details
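\b - a word boundary
v - a v char
(?:s?\\.|:)? - an optional group that matches either an optional s followed by ., or a : char
\\s* - 0+ whitespace chars
(.*) - Group 1: any 0+ chars other than line break chars. You do not have to worry about whether . matches the CR symbol or not (in the TRE regex flavor used with sub, . also matches LF symbols), because trimws will trim the leading/trailing whitespace anyway.
Test in R: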
> df<-c("Zinaida Shumilina et al. v. Belarus \r\n
+ CCPR/C/120/D/2142/2012",
+ "K.E.R. vs. Canada \r\n
+ CCPR/C/120/D/2196/2012",
+ "Lounis Khelifati v Algeria \r\n
+ CCPR/C/120/D/2267/2013",
+ "Hibaq Said Hash v. Denmark \r\n
+ CCPR/C/120/D/2470/2014",
+ "Anton Batanov v. Russian Federation \r\n
+ CCPR/C/120/D/2532/2015",
+ "S. Z. v. Denmark \r\n
+ CCPR/C/120/D/2625/2015"
+ )
> trimws(str_match(df, "\\bv(?:s?\\.|:)?\\s*(.*)")[,2])
[1] "Belarus" "Canada" "Algeria"
[4] "Denmark" "Russian Federation" "Denmark"
>
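One variant from the question, a bare "vs" with no period, does not appear in the test data above. With the pattern as written, the optional group is skipped for that input and the "s" ends up in the capture. The sketch below is a possible stricter variant, not part of the original answer: it consumes an optional "s" and an optional "." or ":" and then requires at least one whitespace character. The titles, the helper `extract_country`, and the name `strict_pattern` are all hypothetical, and base R with `perl = TRUE` is used so the example runs without extra packages.

```r
# Hypothetical titles covering the separator variants listed in the question
titles <- c(
  "A. B. v Algeria \r\n DOC/1",
  "C. D. vs Canada \r\n DOC/2",
  "E. F. vs. Denmark \r\n DOC/3",
  "G. H. v: Belarus \r\n DOC/4",
  "I. J. v. Russian Federation \r\n DOC/5"
)

# Hypothetical helper: return capture group 1, trimmed; NA when nothing matches
extract_country <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  trimws(vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
                character(1)))
}

answer_pattern <- "\\bv(?:s?\\.|:)?\\s*(.*)"         # pattern from the answer
strict_pattern <- "\\bvs?[.:]?\\s+([^\\r\\n]*)"      # assumed stricter variant

extract_country(titles, answer_pattern)
# "Algeria" "s Canada" "Denmark" "Belarus" "Russian Federation"

extract_country(titles, strict_pattern)
# "Algeria" "Canada" "Denmark" "Belarus" "Russian Federation"
```

Requiring whitespace after the separator also keeps a word that merely starts with a lowercase "v" from being treated as the separator, and the explicit `[^\r\n]*` avoids depending on how a given regex flavor treats `.` and CR. The same pattern string should be usable with `str_match`, though that is untested here.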