Regular expression to preserve case pattern, capitalization

Question

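Is there a regular expression that preserves the case pattern, in the vein of \U and \L?

In the example below, I want to convert "date" to "month" while keeping the capitalization used in the input: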

   from        to
  "date" ~~> "month"
  "Date" ~~> "Month"
  "DATE" ~~> "MONTH"

I currently use three nested calls to sub to accomplish this.

input <- c("date", "Date", "DATE")
expected.out <- c("month", "Month", "MONTH")

sub("date", "month", 
  sub("Date", "Month", 
    sub("DATE", "MONTH", input)
  )
)

The goal is to have a single pattern and a single replacement, such as

gsub("(date)", "\\Umonth", input, perl=TRUE)

that would produce the desired output.
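
For context, here is a minimal sketch of what \U and \L do in base R when perl = TRUE. It only assumes the `input` vector from the question; the names `upper` and `title` are illustrative. The modifiers force the text of a backreference into one fixed case, which is not the same as copying the case pattern of the match onto a new word.

```r
input <- c("date", "Date", "DATE")

# \\U upper-cases whatever the backreference captured
upper <- gsub("(date)", "\\U\\1", input, ignore.case = TRUE, perl = TRUE)

# Switching from \\U to \\L mid-replacement gives title case
title <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", input, perl = TRUE)

print(upper)
print(title)
```

Every element ends up in the same case regardless of how the input was written, which is why these modifiers alone do not answer the question.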

Answer 1

Tyl*_*ker 8

Here is a qdap approach. It is pretty straightforward, but not the fastest:

input <- rep("Here are a date, a Date, and a DATE",2)
pat <- c("date", "Date", "DATE")
ret <- c("month", "Month", "MONTH")


library(qdap)
mgsub(pat, ret, input)

## [1] "Here are a month, a Month, and a MONTH"
## [2] "Here are a month, a Month, and a MONTH"

Benchmarking:

input <- rep("Here are a date, a Date, and a DATE",1000)

library(microbenchmark)
library(gsubfn)  # provides gsubfn(); qdap (loaded above) provides mgsub()

(op <- microbenchmark( 
    GSUBFN = gsubfn('date', list('date'='month','Date'='Month','DATE'='MONTH'), 
             input, ignore.case=TRUE),
    QDAP = mgsub(pat, ret, input),
    REDUCE = Reduce(function(str, args) gsub(args[1], args[2], str), 
       Map(c, pat, ret), init = input),
    # A braced expression rather than a function definition,
    # so the loop itself is what gets timed
    FOR = {
       out <- input
       for(i in seq_along(pat)) { 
          out <- gsub(pat[i],ret[i],out) 
       }
       out
    },
    times=100L))

## Unit: milliseconds
##    expr        min         lq     median         uq        max neval
##  GSUBFN 682.549812 815.908385 847.361883 925.385557 1186.66743   100
##    QDAP  10.499195  12.217805  13.059149  13.912157   25.77868   100
##  REDUCE   4.267602   5.184986   5.482151   5.679251   28.57819   100
##     FOR   4.244743   5.148132   5.434801   5.870518   10.28833   100

Answer 2

the*_*ail 7

This is one of those occasions when I think a for loop is justified:

input <- rep("Here are a date, a Date, and a DATE",2)
pat <- c("date", "Date", "DATE")
ret <- c("month", "Month", "MONTH")

for(i in seq_along(pat)) { input <- gsub(pat[i],ret[i],input) }
input
#[1] "Here are a month, a Month, and a MONTH" 
#[2] "Here are a month, a Month, and a MONTH"

An alternative, courtesy of @flodel, implements the same logic as the loop using Reduce:

Reduce(function(str, args) gsub(args[1], args[2], str), 
       Map(c, pat, ret), init = input)
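
If the loop is needed in more than one place, it can be wrapped in a small helper. The sketch below is only an illustration: `replace_pairs` is a hypothetical name, not a base R or package function, and it forwards extra arguments such as fixed = TRUE to gsub.

```r
# Hypothetical helper: apply each pattern/replacement pair in turn
replace_pairs <- function(x, pat, ret, ...) {
  stopifnot(length(pat) == length(ret))
  for (i in seq_along(pat)) {
    x <- gsub(pat[i], ret[i], x, ...)
  }
  x
}

input <- rep("Here are a date, a Date, and a DATE", 2)
pat <- c("date", "Date", "DATE")
ret <- c("month", "Month", "MONTH")

out <- replace_pairs(input, pat, ret, fixed = TRUE)
print(out)
```

Because the pairs are applied one after another, a replacement that happens to contain a later pattern would be replaced again, so the order of the pairs can matter.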

See @TylerRinker's answer for some benchmarks of these options.

Answer 3

Ste*_*ers 6

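AFAIK there is no way to do what you are asking with pure regex and a single (*) find and replace. The problem is that the replacement part can only use capture group matches as they are; it cannot process them, derive information from them, or apply conditionals without involving a function. So even if you use something like \b(?:(d)|(D))(?:(a)|(A))(?:(t)|(T))(?:(e)|(E))\b in a case-sensitive find (so that even-numbered captures are upper case and odd-numbered captures are lower case; see "Match Information" in the right-hand pane of regex101), the replacement part would still need a function to act on the captured information.

(*) I am assuming you do not want to run a separate find and replace for every combination of upper and lower case!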
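
To illustrate the point that the replacement step needs a function, here is a rough base R sketch. Both `replace_with_fn` and `match_case` are hypothetical helpers written for this example, and `match_case` only distinguishes all-caps, capitalized and lower-case matches, which is a simplifying assumption.

```r
# Hypothetical helper: pass every match through fn and splice the results back in
replace_with_fn <- function(pattern, fn, x) {
  m <- gregexpr(pattern, x, ignore.case = TRUE)
  regmatches(x, m) <- lapply(
    regmatches(x, m),
    function(hits) vapply(hits, fn, character(1), USE.NAMES = FALSE)
  )
  x
}

# Hypothetical case rule: all caps, capitalized, or lower case
match_case <- function(hit, new = "month") {
  if (hit == toupper(hit)) return(toupper(new))
  first <- substr(hit, 1, 1)
  if (first == toupper(first)) {
    return(paste0(toupper(substr(new, 1, 1)), substr(new, 2, nchar(new))))
  }
  new
}

result <- replace_with_fn("\\bdate\\b", match_case,
                          "a date, a Date, and a DATE; dated")
print(result)
```

The regex only finds the matches; the decision about how to case the replacement lives entirely in the function.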

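Addendum

I could stop there, since you have made it clear you are not interested in other solutions... but just for fun, I thought I would try a JavaScript solution (which uses a function as part of the regex replace):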

const text = `This Date is a DATE that is daTe and date.
But dated should not be replaced, and nor should sedate.`;

const find = "date", replace = "month";
// For the general case, could apply a regex escaping function to `find` here.
// See https://stackoverflow.com/questions/3561493

const result = text.replace(new RegExp(`\\b${find}\\b`, "gi"), match => {
  let rep = "", pos = 0, upperCase = false;
  for (; pos < find.length && pos < replace.length; pos++) {
    const matchChar = match.charAt(pos);
    upperCase = matchChar.toUpperCase() === matchChar;
    const repChar = replace.charAt(pos);
    rep += upperCase ? repChar.toUpperCase() : repChar.toLowerCase();
  }
  const remaining = replace.substring(pos);
  rep += upperCase ? remaining.toUpperCase() : remaining.toLowerCase();
  return rep;
});

console.log(result);

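This is a great explanation of why you need some logic to do the replacement, and I like that you can do it fairly elegantly in JS. FYI, the snippet you posted does not seem to do the replacements (the first sentence of the output is still "This Date is a DATE that is daTe and date."). If you remove the word boundaries from the pattern it works (though it then obviously makes the unwanted replacements in the second sentence). (3 upvotes)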

Answer 4

hwn*_*wnd 5

Using the gsubfn package, you can avoid the nested sub calls and do this in a single call.

> library(gsubfn)
> x <- 'Here we have a date, a different Date, and a DATE'
> gsubfn('date', list('date'='month','Date'='Month','DATE'='MONTH'), x, ignore.case=T)
# [1] "Here we have a month, a different Month, and a MONTH"

Answer 5

Sam*_*amR 5

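You have to write some logic

You are not going to find a pure regex solution for this. Similar SO questions in C# and JS contain extensive logical flows to determine which characters are capital letters.

Furthermore, those questions have additional constraints that make them much simpler than yours:

The pattern and the replacement are the same length.
Each character in the pattern has a unique replacement character, e.g. "abcd" => "wxyz".

A response to a similar question on the Rust reddit states:

There are a lot of ways this could go wrong. For example, what happens if you try to replace with a different number of characters ("abc" -> "wxyz")? What if you have a mapping with multiple outgoing links ("aaa" -> "xyz")?

This is precisely what you are trying to do. When the pattern and the replacement are of different lengths, in general you want to map the index of each capital letter in the pattern to the same index in the replacement, e.g. "daTe" => "moNth". Sometimes, however, you do not: "DATE" should become "MONTH" and not "MONTh". Even if there were a regex flavour with some equivalent of \U that copied the case pattern (which is a nice idea), regex alone would not be enough to cope with patterns and replacements of different lengths.

Another complication is that the letters in the pattern or the replacement are not guaranteed to be unique: you want to be able to replace "WEEK" with "MONTH" and vice versa. This rules out a character hash map approach like the one in the Rust answer. The Perl response linked in the comments can handle replacements of different lengths. However, generalising it beyond the first letter would require a pattern for every possible combination of upper-case and lower-case letters. That is 2^n patterns, where n is the number of letters in the word being replaced. This gets you no further than doing the same thing in R or any other language.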
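
To make the 2^n figure concrete, here is a small sketch that enumerates every upper/lower-case variant of a word. `case_variants` is a hypothetical helper written for this illustration, and it assumes each character has distinct upper-case and lower-case forms.

```r
# Hypothetical helper: all upper/lower-case variants of a word
case_variants <- function(word) {
  chars <- strsplit(word, "")[[1]]
  # Two options per letter (one if the character has no case)
  options <- lapply(chars, function(ch) unique(c(tolower(ch), toupper(ch))))
  grid <- expand.grid(options, stringsAsFactors = FALSE)
  do.call(paste0, grid)
}

variants <- case_variants("date")
print(length(variants))  # 2^4 variants for a four-letter word
```

A four-letter word already needs 16 pattern/replacement pairs, and the count doubles with every extra letter.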

R solution

I have written a function, swap(), that does this for you with two strings (even if they have different numbers of letters):

x <- "This Date is a DATE that is daTe and date."
swap("date", "month", x)
# [1] "This Month is a MONTH that is moNth and month."

How it works

The swap() function uses Reduce() in much the same way as this answer:

swap <- function(old, new, str, preserve_boundaries = TRUE) {
    l <- create_replacement_pairs(old, new, str, preserve_boundaries)
    Reduce(\(x, l) gsub(l[1], l[2], x, fixed = TRUE), l, init = str)
}

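The workhorse function is create_replacement_pairs(). It builds a list of the pattern variants that actually appear in the string, e.g. c("daTe", "DATE"), and generates the correctly cased replacements, e.g. c("moNth", "MONTH"). The function logic is:

Find all matches in the string, e.g. "Date" "DATE" "daTe" "date".
Create a Boolean mask indicating whether each letter is a capital letter.
If all letters are capitals, the replacement should also be all capitals, e.g. "DATE" => "MONTH". Otherwise, make the letter at each index of the replacement a capital if the letter at the corresponding index of the pattern is a capital.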

create_replacement_pairs <- function(old = "date", new = "month", str, preserve_boundaries) {
    if (preserve_boundaries) {
        pattern <- paste0("\\b", old, "\\b")
    } else {
        pattern <- old
    }

    matches <- unique(unlist(
        regmatches(str, gregexpr(pattern, str, ignore.case = TRUE))
    )) # e.g. "Date" "DATE" "daTe" "date"

    capital_shift <- lapply(matches, \(x) {
        out_length <- nchar(new)
        # Boolean mask if <= capital Z
        capitals <- utf8ToInt(x) <= 90

        # If e.g. DATE, replacement should be
        # MONTH and not MONTh
        if (all(capitals)) {
            shift <- rep(32, out_length)
        } else {
            # If not all capitals replace corresponding
            # index with capital e.g. daTe => moNth

            # Pad with lower case if replacement is longer
            length_diff <- max(out_length - nchar(old), 0)
            shift <- c(
                ifelse(capitals, 32, 0),
                rep(0, length_diff)
            )[1:out_length] # truncate if replacement shorter than pattern
        }
    })

    replacements <- lapply(capital_shift, \(x) {
        paste(vapply(
            utf8ToInt(new) - x,
            intToUtf8,
            character(1)
        ), collapse = "")
    })

    replacement_list <- Map(\(x, y) c(old = x, new = y), matches, replacements)

    replacement_list
}
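
To see the capital-shift arithmetic in isolation, here is a minimal walk-through for a single match. The variable names `hit` and `shifted` are illustrative, and like the function above it assumes plain ASCII letters, where upper-case and lower-case code points differ by 32.

```r
hit <- "daTe"
new <- "month"

# TRUE where the matched letter is a capital (code point <= "Z")
capitals <- utf8ToInt(hit) <= 90

# Subtract 32 at capital positions; pad with 0 because "month" is longer
shift <- c(ifelse(capitals, 32, 0), rep(0, nchar(new) - nchar(hit)))

shifted <- intToUtf8(utf8ToInt(new) - shift)
print(shifted)
```

Only the third position is shifted, so the capital lands on the same index in the replacement as in the match.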

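Use cases

This approach is not subject to the same constraints as the Rust and C# answers linked at the start of this answer. We have already seen that it works when the replacement is longer than the pattern. The reverse is also true: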

swap("date", "day", x)
# [1] "This Day is a DAY that is daY and day."

Additionally, because it does not use a hash map, it works when the letters in the replacement are not unique.

swap("date", "week", x)
# [1] "This Week is a WEEK that is weEk and week."

It also works when the letters in the pattern are not unique:

swap("that", "which", x)
# [1] "This Date is a DATE which is daTe and date."

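Edit: Thanks to @shs for pointing out in the comments that this did not preserve word boundaries. It now does so by default, but you can disable this with preserve_boundaries = FALSE: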

swap("date", "week", "this dAte is dated", preserve_boundaries = FALSE)
# [1] "this wEek is weekd"
swap("date", "week", "this dAte is dated")
# [1] "this wEek is dated"
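
The effect of the \\b anchors can be checked independently of swap() with plain gsub(). This is only a sketch; the variable names are illustrative.

```r
x <- "this date is dated"

# With word boundaries, "dated" is left alone
with_boundary <- gsub("\\bdate\\b", "week", x)

# Without them, the "date" inside "dated" is replaced too
without_boundary <- gsub("date", "week", x)

print(with_boundary)
print(without_boundary)
```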

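Performance

Dynamically generating the matches from lower-case arguments in this way is not going to be as fast as hard-coding list(c("Date", "Month"), c("DATE", "MONTH"), c("daTe", "moNth"), c("date", "month")). However, a fair comparison should probably include the time it takes to type that list, and I doubt even the most devoted vim user could do that in the few thousandths of a second the function takes to return.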

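I benefited from seeing the benchmarks in Tyler Rinker's answer, and so used Reduce() and gsub(), which was the fastest of the replacement methods tested. Additionally, the approach in this answer generates exact match and replacement pairs, so we can set fixed = TRUE. With a five-character pattern, gsub() with fixed = TRUE takes about a quarter of the time it takes with fixed = FALSE.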
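
One way to check the fixed = TRUE claim on your own machine is sketched below with base R's system.time(). The input size and repetition count are arbitrary assumptions, and timings will vary, so none are shown here.

```r
x <- rep("This Date is a DATE that is daTe and date.", 1000)

# Both calls give the same result because "DATE" has no regex metacharacters
fixed_out <- gsub("DATE", "MONTH", x, fixed = TRUE)
regex_out <- gsub("DATE", "MONTH", x)
same <- identical(fixed_out, regex_out)

# Rough timing comparison over 20 repetitions each
t_fixed <- system.time(for (i in 1:20) gsub("DATE", "MONTH", x, fixed = TRUE))
t_regex <- system.time(for (i in 1:20) gsub("DATE", "MONTH", x))

print(same)
print(rbind(fixed = t_fixed, regex = t_regex)[, "elapsed"])
```

Note that fixed = TRUE treats the pattern as a literal string, so the two calls are only interchangeable when the pattern contains no regex syntax.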

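This does make several passes over the string, whereas some of the other answers make only one pass to find matches. However, those answers then apply logic each time a match is found, while this approach has a one-to-one mapping of matches to replacements and so needs no logic at that stage. I suspect which is faster depends on the data, specifically how many variants of the pattern you have, and on the language (in R it is generally faster to run a regex several times, since the regex engine is written in C, than to run the capital-shift logic, which is written in R).

Is this still a workaround? Yes. But since a pure regex solution does not exist, I like a solution that abstracts away the unseemly character-level iteration, so I can forget that it is a bit of a hack.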

Asked:	11 years, 4 months ago
Viewed:	213 times
Last active:	11 years, 4 months ago