Is \\L supported in regular expressions in R 3.5.0?

Question

I am having trouble with the Perl replacement \\L\\1 in R-dev (the 2017-06-06 and 2017-06-16 r72796 builds) under very specific circumstances:

bib <- readLines("https://raw.githubusercontent.com/HughParsonage/TeXCheckR/master/tests/testthat/lint_bib_in.bib", encoding = "UTF-8")

leading_spaces <- 2

is_field <- grepl("=", bib, fixed = TRUE)
field_width <- nchar(trimws(gsub("[=].*$", "", bib, perl = TRUE)))

widest_field <- max(field_width[is_field])

out <- bib

# Apply gsub to each line in turn:
for (line in seq_along(bib)){
  # Replace every field line with
  # two spaces + field name + spaces required for widest field + space
  if (is_field[line]){
    spaces_req <- widest_field - field_width[line]
    out[line] <-
      gsub("^\\s*(\\w+)\\s*[=]\\s*\\{",
           paste0(paste0(rep(" ", leading_spaces), collapse = ""),
                  "\\L\\1",
                  paste0(rep(" ", spaces_req), collapse = ""),
                  " = {"),
           bib[line],
           perl = TRUE)
  }
}

# Add commas: 
out[is_field] <- gsub("\\}$", "\\},", out[is_field], perl = TRUE)

out[9]
#> R-dev   "  author"
#> R 3.4.0 "  author      = {Tony Wood and Amélie Hunter and Michael O'Toole and Prasana Venkataraman and Lucy Carter},"

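To reproduce, it is necessary to:

read the lines from a file with readLines and specify the encoding (recreating the vector with dput does not reproduce the problem);
use \\L or \\U in a perl regular expression;
use a character vector;
have an element of the vector that requires UTF-8 (the é in Amélie above).

Is this a change in R 3.5.0, or have I been misusing \\L all along?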
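
For reference, here is a minimal sketch of what the case-conversion escapes do in a perl = TRUE replacement. The inputs are made-up, pure-ASCII strings, so, going by the conditions listed above, they should not trigger the problem described in the question.

```r
# \\U upper-cases the text that follows it in the replacement,
# \\L lower-cases it, and \\E ends the conversion.
# These escapes are only honoured when perl = TRUE.
upper <- gsub("^(\\w+)", "\\U\\1", "author = {x}", perl = TRUE)
lower <- gsub("^(\\w+)", "\\L\\1", "AUTHOR = {x}", perl = TRUE)
mixed <- gsub("(\\w+) (\\w+)", "\\U\\1\\E \\2", "hello world", perl = TRUE)

print(upper)  # "AUTHOR = {x}"
print(lower)  # "author = {x}"
print(mixed)  # "HELLO world"
```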

Answer 1

Wik*_*żew 10

UPDATE

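A patch correcting this behaviour was applied in r74274.

Original answer

There is clearly some unexpected behaviour here.

When the replacement refers only to \1, the output is as expected: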

[1] "  author      = {Tony Wood and Amélie Hunter and Michael O'Toole and Prasana Venkataraman and Lucy Carter},"

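However, as soon as \U or \L is used together with \1, the output is cut off right after the converted group, and a second backreference is dropped as well: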

"\\U\\1": [1] " AUTHOR"
"\\U\\1\\E\\2": [1] " AUTHOR"
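
A small self-contained check along these lines can show whether a given R build handles this case correctly. It is only a sketch: the input string is made up, and it assumes that a string created with a \u escape (which R marks as UTF-8) exercises the same code path as the readLines input from the question, which has not been verified on an affected build.

```r
# Made-up input; the \u00e9 escape yields a string marked as UTF-8.
x <- "author = {Am\u00e9lie Hunter}"
print(Encoding(x))

# Upper-case the field name and keep the rest via a second backreference.
res <- gsub("^(\\w+)(.*)$", "\\U\\1\\E\\2", x, perl = TRUE)

# TRUE means the whole line survived the replacement. On an affected
# build the result would presumably stop right after "AUTHOR".
print(identical(res, "AUTHOR = {Am\u00e9lie Hunter}"))
```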

A gsubfn solution still works (here, an example using toupper()):

library(gsubfn)
bib <- readLines("https://raw.githubusercontent.com/HughParsonage/TeXCheckR/master/tests/testthat/lint_bib_in.bib", encoding = "UTF-8")
leading_spaces <- 2
is_field <- grepl("=", bib, fixed = TRUE)
field_width <- nchar(trimws(gsub("[=].*$", "", bib, perl = TRUE)))
widest_field <- max(field_width[is_field])
out <- bib

# Apply gsubfn to each line in turn:
for (line in seq_along(bib)){
  # Replace every field line with
  # two spaces + field name + spaces required for widest field + space
  if (is_field[line]){
    spaces_req <- widest_field - field_width[line]
    out[line] <-
      gsubfn("^\\s*(\\w+)\\s*=\\s*\\{", 
             function(y) paste0(
                  paste0(rep(" ", leading_spaces), collapse = ""),
                  toupper(y),
                  paste0(rep(" ", spaces_req), collapse = ""),
                  " = {"
             ),
           bib[line], engine="R"
      )
  }
}
# Add commas: 
out[is_field] <- gsub("\\}$", "},", out[is_field], perl = TRUE)

out[9]

Output:

[1] "  AUTHOR      = {Tony Wood and Amélie Hunter and Michael O'Toole and Prasana Venkataraman and Lucy Carter},"
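
If adding a package dependency is undesirable, the same effect can probably be had in base R by doing the case conversion outside the regular expression, so that \\L and \\U are never used. The helper below is a hypothetical sketch, not part of the original answer: align_field and its arguments are made-up names, and it handles a single line at a time.

```r
# Hypothetical helper: re-indent one "name = {value" field line,
# lower-casing the field name with tolower() instead of \\L.
align_field <- function(line, widest_field, leading_spaces = 2) {
  # Capture the field name and everything after the opening brace.
  m <- regmatches(line,
                  regexec("^\\s*(\\w+)\\s*=\\s*\\{(.*)$", line, perl = TRUE))[[1]]
  if (length(m) == 0) return(line)  # not a field line: leave untouched
  name <- tolower(m[2])
  pad  <- max(0, widest_field - nchar(name))
  paste0(strrep(" ", leading_spaces), name, strrep(" ", pad), " = {", m[3])
}

aligned <- align_field("AUTHOR={Am\u00e9lie Hunter}", widest_field = 11)
print(aligned)
```

In the loop above, a call such as align_field(bib[line], widest_field, leading_spaces) would take the place of the gsubfn call; swap tolower() for toupper() to get upper-case field names.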

My sessionInfo details:

> sessionInfo()
R Under development (unstable) (2017-06-19 r72808)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gsubfn_0.6-6 proto_1.0.0 

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0    tcltk_3.5.0

Answer 2

Hug*_*ugh -3

Yes, it is supported. A patch correcting this behaviour was applied in r74274.

Date: Mon, 19 Feb 2018 14:56:11 +0000
Subject: [PATCH] Fix lower/upper case conversions in UTF-8 in gsub (related to
 72714).

git-svn-id: https://svn.r-project.org/R/trunk@74274 00db46b3-68df-0310-9c12-caf00c1e9a41
---
 src/main/grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/main/grep.c b/src/main/grep.c
index dd10b9d923e..68e63616c87 100644
--- a/src/main/grep.c
+++ b/src/main/grep.c
@@ -1592,7 +1592,7 @@ char *pcre_string_adj(char *target, const char *orig, const char *repl,
            for (j = 0; j < nc; j++) wc[j] = towctrans(wc[j], tr);
            nb = (int) wcstoutf8(NULL, wc, INT_MAX);
            wcstoutf8(xi, wc, nb);
-           for (j = 0; j < nb; j++) *t++ = *xi++;
+           for (j = 0; j < nb - 1; j++) *t++ = *xi++;
            }
        } else
            for (i = ovec[2*k] ; i < ovec[2*k+1] ; i++) {
