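    my $book1 = "Don Quixote- Miguel de Cervantes";
    my $book2 = "Les Misérables -Victor Hugo";
    my $book3 = "War and Peace - Leo Tolstoy";

I want to use .subst to change "- " to " - " in $book1, and " -" to " - " in $book2. The problem is that I can't find the right regex to use with .subst. I could use something other than a regex, but I want to use .subst. I can use a different regex for each of the two strings, but both should leave $book3 alone.
Sorry for asking what is probably a basic question. I've been trying different things, but I always end up destroying part of the text.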

You can use the trans method:

    my $book1 = "Don Quixote- Miguel de Cervantes";
    my $book2 = "Les Misérables -Victor Hugo";
    my $book3 = "War and Peace - Leo Tolstoy";

    for ($book1, $book2, $book3) -> $b {
        say $b.trans([/<wb> '- '/, /' -' <wb>/] => [' - ']);
    }

<wb> is a word boundary.

TL;DR Another option to consider is using the <( and )> capture markers to pick out just the bit you want to replace.
Matching strictly per your examples:
/ \C[space] <( '- ' | ' -' )> \C[space] /
The syntax \c[...] specifies one or more characters by using their Unicode names inside the square brackets (in this case the classic ASCII space character). [1]
In this pattern I've used \C[...] (uppercase C, not lowercase c). There is a range of Raku "backslash" atoms and they all have lowercase and uppercase variants, where the uppercase variant matches any character except the one(s) matched by the lowercase variant. So \C[space] matches any character other than the ASCII space character. See \c / \C for more info.
The <( capture marker marks the start point of the regex's capture. Likewise )> marks the endpoint.
Without them, when the pattern matches, the whole match would be captured, which would include whatever non-space character matches the \C[space] atom. We don't want that. So we use these markers to restrict what we capture.
Btw, each marker is independent. The above pattern matches \C[space] '- ' or ' -' \C[space]. If the pattern to the left of the | matches, only the <( has an impact, omitting whatever matched \C[space], and capturing until the end of the match, which for this pattern stops at the |. If the pattern to the right matches, capturing starts immediately after the | and ends at the )>.
The | is Raku's parallel (aka "longest token match" -- LTM) pattern alternation operator, an alternative to the traditional sequential pattern alternation operator (which in Raku is written ||). In this case the set of substrings that the two operators will and won't match is the same, so it makes no difference which is used. But | is shorter than ||; when the match set is the same it's typically faster; and when the match sets are different it's often | that's desirable. So I use it by default unless I know I need the traditional sequential alternation logic (try pattern on left of || first; if that fails, try the pattern on the right of the ||).
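As a minimal sketch of how the strict pattern above might be plugged into .subst (the helper name fix-dash is hypothetical, not something from the question), only the text between the capture markers is replaced, so the neighbouring non-space character survives:

```raku
# Hypothetical helper: normalise the title/author dash with the strict pattern.
sub fix-dash(Str $s --> Str) {
    # .subst replaces only the span delimited by <( and )>;
    # the \C[space] atoms are matched but left untouched.
    $s.subst(/ \C[space] <( '- ' | ' -' )> \C[space] /, ' - ')
}

# The three strings from the question.
say fix-dash("Don Quixote- Miguel de Cervantes");
say fix-dash("Les Misérables -Victor Hugo");
say fix-dash("War and Peace - Leo Tolstoy");
```

The first two strings should come back with " - " between title and author, and the third should be returned unchanged because neither branch of the alternation matches it.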
Matching more flexibly regarding whitespace:
/ \S <( '-' \s+ | \s+ '-' )> \S /
The \S atoms match any character that is not categorized by Unicode as being a whitespace character. (I use Raku, or tools such as this character property lookup web page, to explore what Unicode makes of a character.)
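To illustrate what the more flexible pattern buys you, here is a small sketch under the assumption that the input may contain runs of spaces or a tab around the dash (the helper name fix-dash-flexible and the altered inputs are made up for this example):

```raku
# Hypothetical helper: same idea as before, but tolerant of any whitespace.
sub fix-dash-flexible(Str $s --> Str) {
    # \s+ swallows one or more whitespace characters next to the hyphen;
    # the surrounding \S atoms stay outside the <( ... )> capture.
    $s.subst(/ \S <( '-' \s+ | \s+ '-' )> \S /, ' - ')
}

say fix-dash-flexible("Don Quixote-   Miguel de Cervantes");  # three spaces after the hyphen
say fix-dash-flexible("Les Misérables\t-Victor Hugo");        # a tab before the hyphen
say fix-dash-flexible("War and Peace - Leo Tolstoy");         # already well formed
```

In each case the whole run of whitespace plus the hyphen is replaced by a single " - ", while a string that already has whitespace on both sides of the hyphen is not matched at all.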
Comparing \C[space], \S, and <wb>:
\C[space] matches any character, including whitespace characters, with the sole exception of an ASCII space. My guess is it'll be the fastest of the three.
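\S matches any non-whitespace. My guess is it'll be faster than <wb>.
<wb> matches between characters. It also matches before the first character and after the last character of a string. So @chenyf's pattern will match and change '- foo...' to ' - foo...' and '...bar -' to '...bar - ', whereas the patterns with \C[space] or \S will not match at the start/end of those strings.
The \s+ atom matches one or more whitespace characters.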
[1] Names are case-insensitive. Multiple characters are separated by commas. \c[...] also works in double-quoted strings (but \C[...] does not).
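A short sketch of \c[...] inside double-quoted strings, as mentioned in the footnote. The variable names are made up for illustration, and the Unicode names are written in uppercase here purely as a stylistic choice:

```raku
# A single named character interpolated into a string.
my $title = "Les Mis\c[LATIN SMALL LETTER E WITH ACUTE]rables";

# Several named characters, separated by commas, build the " - " separator.
my $sep = "\c[SPACE, HYPHEN-MINUS, SPACE]";

say $title ~ $sep ~ "Victor Hugo";
```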

    for ($book1, $book2, $book3, $book4, $book5, $book6) -> $b {
        say $b
            .subst(/ \S <( (\-+) \h )> \S /, {" $0 "}, :global)
            .subst(/ \S <( \h (\-+) )> \S /, {" $0 "}, :global)
            .subst(/ \S <( (\-) \v )> \S /, {"$0"}, :global)  # fixes hyphenated words w/ embedded newlines
    }

Sample input:

    my $book1 = "Don Quixote- Miguel de Cervantes";
    my $book2 = "Les Misérables -Victor Hugo";
    my $book3 = "War and Peace - Leo Tolstoy";
    my $book4 = "Moby-Dick; or, The Whale- Herman Melville";
    my $book5 = "Winnie-the-Pooh --A. A. Milne";
    my $book6 = "Slaughterhouse-\nFive- Kurt Vonnegut";

Sample output:
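
    Don Quixote - Miguel de Cervantes
    Les Misérables - Victor Hugo
    War and Peace - Leo Tolstoy
    Moby-Dick; or, The Whale - Herman Melville
    Winnie-the-Pooh -- A. A. Milne
    Slaughterhouse-Five - Kurt Vonnegut

For this problem, I would probably start by asking how these malformed entries got into the data at hand. Are they an artifact of concatenation? Or of informal (manual) entry? The first can be fixed at the source; the second may be a prime application for the Raku programming language (i.e. making informal, manually entered text more formal). This answer follows the excellent examples already posted, but (in contrast) uses the $0 capture to reposition the " - " field separator. Briefly:
The first .subst(...) call globally captures one or more hyphens when followed by a single horizontal whitespace character, and puts the same number of hyphens back between title and author (with the hyphens surrounded by spaces).
The second .subst(...) call globally captures one or more hyphens when preceded by a horizontal whitespace character, and puts the same number of hyphens back between title and author (with the hyphens surrounded by spaces).
The third .subst(...) call globally captures a single hyphen followed by a single vertical whitespace character (e.g. a newline), and removes the vertical whitespace. Hyphens followed by horizontal whitespace are left unchanged. Note that for the third .subst(...) call the replacement can be written simply as "-" (i.e. there is no need to use $0).
Note: the first two .subst statements can be combined with | (OR):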
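
    .subst(/ \S <( (\-+) \h | \h (\-+) )> \S /, {" "~$0~" "}, :global)

Why go to all this trouble? Well, the first reason is that a more "pedestrian" approach is more robust for complex input (e.g. hyphenated words). In fact, some of the answers already posted may not handle hyphenated book titles and/or author names, which are handled gracefully here (above and below; note the alternative form of the replacement):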

    ~$ cat book_author.txt
    Don Quixote- Miguel de Cervantes
    Les Misérables -Victor Hugo
    War and Peace - Leo Tolstoy
    Moby-Dick; or, The Whale- Herman Melville
    Winnie-the-Pooh --A. A. Milne
    Slaughterhouse-
    Five- Kurt Vonnegut
    ~$ cat book_author.txt | raku -e 'say lines.join("\n")
         .subst(/ \S <( (\-+) \h )> \S /, {" "~$0~" "}, :global)
         .subst(/ \S <( \h (\-+) )> \S /, {" "~$0~" "}, :global)
         .subst(/ \S <( \- \v )> \S /, "-", :global);'
    Don Quixote - Miguel de Cervantes
    Les Misérables - Victor Hugo
    War and Peace - Leo Tolstoy
    Moby-Dick; or, The Whale - Herman Melville
    Winnie-the-Pooh -- A. A. Milne
    Slaughterhouse-Five - Kurt Vonnegut

The second reason is that an answer like this can be adapted to text with other separators, e.g. Title | Author data, where the title and author are separated by a vertical bar. The third reason is that captures (e.g. using $0) apply to a wide variety of problems, such as collapsing multiple identical separators (e.g. -- or ||) into a single-character separator (note yet another way of writing the replacement, this time adding .comb[0]):

    ~$ cat book_bar_author.txt
    Don Quixote| Miguel de Cervantes
    Les Misérables |Victor Hugo
    War and Peace | Leo Tolstoy
    Moby-Dick; or, The Whale| Herman Melville
    Winnie-the-Pooh ||A. A. Milne
    Slaughterhouse-
    Five| Kurt Vonnegut
    ~$ cat book_bar_author.txt | raku -e 'say lines.join("\n")
         .subst(/ \S <( (\|+) \h )> \S /, {"",$0.comb[0],""}, :global)
         .subst(/ \S <( \h (\|+) )> \S /, {"",$0.comb[0],""}, :global)
         .subst(/ \S <( \- \v )> \S /, "-", :global);'
    Don Quixote | Miguel de Cervantes
    Les Misérables | Victor Hugo
    War and Peace | Leo Tolstoy
    Moby-Dick; or, The Whale | Herman Melville
    Winnie-the-Pooh | A. A. Milne
    Slaughterhouse-Five | Kurt Vonnegut