Using .subst with partial regex matching

Question


my $book1 = "Don Quixote- Miguel de Cervantes";
my $book2 = "Les Misérables -Victor Hugo";
my $book3 = "War and Peace - Leo Tolstoy";

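I want to use .subst to change the "- " in $book1 to " - ", and the " -" in $book2 to " - ". The problem is that I can't find the right regex to use with .subst. I could use something other than a regex, but I want to use .subst. Using a different regex for each of the two strings is fine, but both should leave $book3 untouched.

Sorry for asking what is probably a basic question. I have been trying different things, but I always end up destroying part of the text.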

Answer 1

che*_*nyf 8

You can use the trans method:

my $book1 = "Don Quixote- Miguel de Cervantes";
my $book2 = "Les Misérables -Victor Hugo";
my $book3 = "War and Peace - Leo Tolstoy";

for ($book1, $book2, $book3) -> $b {
    say $b.trans([/<wb> '- '/, /' -' <wb>/] => [' - ']);
}

<wb> is a word boundary.

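If you want to make fuller use of what `trans` can do, you could change the `say` line to `say $b.trans( /<wb> '-'/ => ' -', /'-' <wb>/ => '- ');` (this expresses your intent more clearly, and *might* be slightly faster, since it avoids replacing a space that is already there)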

Answer 2

rai*_*iph 8

TL;DR Another option to consider is using the <( and )> capture markers to pick out just the bit you want to replace.

A "literal" interpretation of your Q

Matching strictly per your examples:

/   \C[space]   <(   '- '   |   ' -'   )>   \C[space]   /


The syntax \c[...] specifies one or more characters by using their Unicode names inside the square brackets (in this case the classic ASCII space character).¹

In this pattern I've used \C[...] (uppercase C, not lowercase c). There is a range of Raku "backslash" atoms and they all have lowercase and uppercase variants, where the uppercase variant matches any character except the one(s) matched by the lowercase variant. So \C[space] matches any character other than the ASCII space character. See \c / \C for more info.
The <( capture marker marks the start point of the regex's capture. Likewise )> marks the endpoint.

Without them, when the pattern matches, the whole match would be captured, which would include whatever non whitespace character matches the \C[space] atom. We don't want that. So we use these markers to restrict what we capture.

Btw, each marker is independent. The above pattern matches \C[space] '- ' or ' -' \C[space]. If the pattern to the left of the | matches, only the <( has an impact, omitting whatever matched \C[space], and capturing until the end of the match, which for this pattern stops at the |. If the pattern to the right matches, capturing starts immediately after the | and ends at the )>.
The | is Raku's parallel (aka "longest token match" -- LTM) pattern alternation operator, an alternative to the traditional sequential pattern alternation operator (which in Raku is written ||). In this case the set of substrings that the two operators will and won't match is the same, so it makes no difference which is used. But | is shorter than ||; when the match set is the same it's typically faster; and when the match sets are different it's often | that's desirable. So I use it by default unless I know I need the traditional sequential alternation logic (try pattern on left of || first; if that fails, try the pattern on the right of the ||).
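
As a sketch of how this pattern could be plugged into .subst (which is what the question asks for), the following wraps it in a small helper. The sub name normalize-literal and the replacement string ' - ' are assumptions made for this illustration, not part of the answer itself. Because of the capture markers, only the hyphen-plus-space portion should be replaced, and the third title should come through unchanged.

```raku
# Hypothetical helper for illustration: applies the "literal" pattern via .subst.
sub normalize-literal(Str $title --> Str) {
    # The neighbouring non-space characters take part in the match,
    # but only the text delimited by <( and )> is replaced.
    $title.subst(/ \C[space] <( '- ' | ' -' )> \C[space] /, ' - ')
}

my @books =
    "Don Quixote- Miguel de Cervantes",
    "Les Misérables -Victor Hugo",
    "War and Peace - Leo Tolstoy";

say normalize-literal($_) for @books;
```

Without an adverb, .subst replaces only the first match; adding :g to the call would replace every occurrence in the string.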

A "per its spirit?" interpretation of your Q

Matching more flexibly regarding whitespace:

/   \S   <(   '-' \s+   |   \s+ '-'   )>   \S   /


The \S atoms match any character that is not categorized by Unicode as being a whitespace character. (I use Raku, or tools such as this character property lookup web page, to explore what Unicode makes of a character.)

Comparing \C[space], \S, and <wb>:
- \C[space] matches any character, including whitespace characters, with the sole exception of an ASCII space. My guess is it'll be the fastest of the three.
- \S matches any non-whitespace. My guess is it'll be faster than <wb>.
- <wb> matches between characters. It also matches before the first character in a string and after the last. So @chenyf's pattern will match and change '- foo...' to ' - foo...' and '...bar -' to '...bar - ', whereas the patterns with \C[space] or \S will not match at the start/end of those strings.

The \s+ atom matches one or more whitespace characters.
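
To illustrate the more flexible pattern, here is a minimal sketch that feeds it inputs with extra or unusual whitespace. The sub name normalize-flexible, the replacement ' - ', and the sample strings with multiple spaces and a tab are all assumptions chosen for this example.

```raku
# Hypothetical helper for illustration: applies the whitespace-tolerant pattern.
sub normalize-flexible(Str $title --> Str) {
    # \s+ absorbs any run of whitespace next to the hyphen, so the
    # replaced span is the hyphen plus all of that whitespace.
    $title.subst(/ \S <( '-' \s+ | \s+ '-' )> \S /, ' - ')
}

say normalize-flexible("Don Quixote-   Miguel de Cervantes");        # several spaces after the hyphen
say normalize-flexible("Les Misérables\t-Victor Hugo");              # a tab before the hyphen
say normalize-flexible("War and Peace - Leo Tolstoy");               # already well-formed
say normalize-flexible("Moby-Dick; or, The Whale- Herman Melville"); # hyphenated word in the title
```

The hyphen inside "Moby-Dick" has no whitespace on either side, so neither alternative can match there and only the separator before the author name is rewritten.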

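Footnotes

¹ The naming is case-insensitive. Multiple characters are separated by commas. \c[...] also works in double-quoted strings (but \C[...] does not).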

Answer 3

jub*_*us1 5

for ($book1, $book2, $book3, $book4, $book5, $book6) -> $b
  { say $b
    .subst(/ \S <( (\-+) \h   )> \S /, {" $0 "}, :global)
    .subst(/ \S <(  \h  (\-+) )> \S /, {" $0 "}, :global)
    .subst(/ \S <( (\-)  \v   )> \S /,   {"$0"}, :global) #fixes hyphenated words w/embedded newlines
}

Sample input:

my $book1 = "Don Quixote- Miguel de Cervantes";
my $book2 = "Les Misérables -Victor Hugo";
my $book3 = "War and Peace - Leo Tolstoy";
my $book4 = "Moby-Dick; or, The Whale- Herman Melville";
my $book5 = "Winnie-the-Pooh --A. A. Milne";
my $book6 = "Slaughterhouse-\nFive- Kurt Vonnegut";

Sample output:

Don Quixote - Miguel de Cervantes
Les Misérables - Victor Hugo
War and Peace - Leo Tolstoy
Moby-Dick; or, The Whale - Herman Melville
Winnie-the-Pooh -- A. A. Milne
Slaughterhouse-Five - Kurt Vonnegut

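For a question like this, I would probably start by asking how these malformed entries got into the data at hand. Are they an artifact of concatenation? Or of informal (manual) entry? The first is fixable, and the second may be a prime application for the Raku programming language (i.e., making informal manual text entries more formal). This answer follows the excellent examples already posted, but (in contrast) uses $0 captures to reposition the " - " field separator. In brief: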

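- The first .subst(...) call globally captures one or more hyphens when followed by a single horizontal whitespace character, and places the same number of hyphens between title and author (hyphens surrounded by spaces).
- The second .subst(...) call globally captures one or more hyphens when preceded by a horizontal whitespace character, and places the same number of hyphens between title and author (hyphens surrounded by spaces).
- The third .subst(...) call globally captures a single hyphen followed by a single vertical whitespace character (e.g. a newline), and removes the vertical whitespace. Hyphens followed by horizontal whitespace are left unchanged. Note that for the third .subst(...) call, the replacement could simply be written as "-" (i.e. there is no need to use $0).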

Note: the first two .subst statements can be combined using the | alternation ("OR"):

.subst(/ \S <( (\-+) \h  | \h  (\-+) )> \S /, {" "~$0~" "}, :global)

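Why go to all this trouble? Well, the first reason is that a more "pedestrian" approach is more robust to complicated input (e.g. hyphenated words). In fact, some of the answers already posted may not handle hyphenated book titles and/or author names, which are handled gracefully here (above and below; note the alternate form of the replacement):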

~$ cat book_author.txt
Don Quixote- Miguel de Cervantes
Les Misérables -Victor Hugo
War and Peace - Leo Tolstoy
Moby-Dick; or, The Whale- Herman Melville
Winnie-the-Pooh --A. A. Milne
Slaughterhouse-
Five- Kurt Vonnegut
~$ cat book_author.txt | raku -e 'say lines.join("\n")
      .subst(/ \S <( (\-+) \h  )> \S /, {" "~$0~" "}, :global)
      .subst(/ \S <( \h  (\-+) )> \S /, {" "~$0~" "}, :global)
      .subst(/ \S <( \-   \v   )> \S /,  "-", :global);'
Don Quixote - Miguel de Cervantes
Les Misérables - Victor Hugo
War and Peace - Leo Tolstoy
Moby-Dick; or, The Whale - Herman Melville
Winnie-the-Pooh -- A. A. Milne
Slaughterhouse-Five - Kurt Vonnegut

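The second reason is that an answer like this can be adapted to text with other separators, such as Title | Author data, where the title is separated from the author by a vertical bar. The third reason is that captures (e.g. using $0) apply to a wide variety of problems, such as collapsing multiple identical separators (e.g. -- or ||) into a single-character separator (note yet another way of writing the replacement, this time adding .comb[0]):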

~$ cat book_bar_author.txt
Don Quixote| Miguel de Cervantes
Les Misérables |Victor Hugo
War and Peace | Leo Tolstoy
Moby-Dick; or, The Whale| Herman Melville
Winnie-the-Pooh ||A. A. Milne
Slaughterhouse-
Five| Kurt Vonnegut
~$ cat book_bar_author.txt | raku -e 'say lines.join("\n")
      .subst(/ \S <( (\|+) \h  )> \S /, {"",$0.comb[0],""}, :global)
      .subst(/ \S <( \h  (\|+) )> \S /, {"",$0.comb[0],""}, :global)
      .subst(/ \S <( \-   \v   )> \S /,  "-", :global);'
Don Quixote | Miguel de Cervantes
Les Misérables | Victor Hugo
War and Peace | Leo Tolstoy
Moby-Dick; or, The Whale | Herman Melville
Winnie-the-Pooh | A. A. Milne
Slaughterhouse-Five | Kurt Vonnegut
