How do I extract all quotations from a text?

sec*_*ecr 5 perl grep sed quotations

I'm looking for a SimpleGrepSedPerlOrPythonOneLiner that outputs all quotations in a text.


Example 1:

echo “HAL,” noted Frank, “said that everything was going extremely well.” | SimpleGrepSedPerlOrPythonOneLiner

Standard output:

"HAL,"
"said that everything was going extremely well."

Example 2:

cat MicrosoftWindowsXPEula.txt | SimpleGrepSedPerlOrPythonOneLiner

Standard output:

"EULA"
"Software"
"Workstation Computer"
"Device"
"DRM"

and so on.

(Link to the corresponding text.)

Axe*_*man 7

I like this one:

perl -ne 'print "$_\n" foreach /"((?>[^"\\]|\\+[^"]|\\(?:\\\\)*")*)"/g;'

It's a bit verbose, but it handles escaped quotes and backtracking better than the simplest implementation. What it says is:

my $re = qr{
   "               # Begin it with literal quote
   ( 
     (?>           # prevent backtracking once the alternation has been
                   # satisfied. It either agrees or it does not. This expression
                   # only needs one direction, or we fail out of the branch

         [^"\\]    # a character that is not a dquote or a backslash
     |   \\+       # OR one or more backslashes followed by
         [^"]      # something that is not a quote
     |   \\        # OR again a backslash
         (?>\\\\)* # followed by any number of *pairs* of backslashes (as units)
         "         # and a quote
     )*            # any number of *set* qualifying phrases
  )                # all batched up together
  "                # Ended by a literal quote
}x;
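
For readers who would rather use Python, the same idea (a quotation may contain backslash-escaped characters, including an escaped quote) can be sketched as below. This is not a translation of the Perl expression above: it swaps in the simpler and widely used "non-special character, or backslash plus any character" pattern, and it assumes that a backslash always escapes the character that follows it. The names `QUOTED` and `extract_escaped_quotes` are made up for illustration.

```python
import re

# Assumption: inside a quotation, a backslash escapes the next character.
# The two alternatives can never match at the same position, so the
# pattern does not need an atomic group to avoid backtracking.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"', re.DOTALL)


def extract_escaped_quotes(text):
    """Return the contents of each double-quoted span, honoring backslash escapes."""
    return QUOTED.findall(text)


if __name__ == "__main__":
    sample = r'He typed "say \"hi\" now" and then "C:\\" twice.'
    for quote in extract_escaped_quotes(sample):
        print(quote)
```

An escaped quote stays inside the quotation, while an escaped backslash directly before a quote lets that quote close the quotation.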

If you don't need that much power (say, the text is likely to be just dialogue rather than structured quotations), then

/"([^"]*)"/ 

will probably work about as well as anything else.
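
As a rough Python counterpart of the simple pattern, here is a minimal sketch. One addition that goes beyond the answer: Example 1 in the question uses typographic quotes (“ and ”), which a pattern written for straight quotes does not match, so the sketch accepts both kinds. The names `SIMPLE_QUOTE` and `extract_simple_quotes` are hypothetical.

```python
import re

# Straight quotes "..." or, as an added assumption, curly quotes “...”.
# There is no capture group, so findall returns each match with its delimiters.
SIMPLE_QUOTE = re.compile(r'"[^"]*"|“[^”]*”')


def extract_simple_quotes(text):
    """Return every quoted span, delimiters included, in order of appearance."""
    return SIMPLE_QUOTE.findall(text)


if __name__ == "__main__":
    line = "“HAL,” noted Frank, “said that everything was going extremely well.”"
    for quote in extract_simple_quotes(line):
        print(quote)
```

Because `[^"]` also matches newlines, this version finds quotations that span several lines when the whole text is passed in at once, unlike a line-by-line `perl -n` filter.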


Vin*_*vic 5

If you have nested quotes, no regular-expression solution will work, but for your examples this approach works well:

$ echo \"HAL,\" noted Frank, \"said that everything was going extremely well\" \
 | perl -n -e 'while (m/(".*?")/g) { print $1."\n"; }'
"HAL,"
"said that everything was going extremely well"

$ cat eula.txt| perl -n -e 'while (m/(".*?")/g) { print $1."\n"; }'
"EULA"
"online"
"Software"
"Workstation Computer"
"Device"
"multiplexing"
"DRM"
"Secure Content"
"DRM Software"
"Secure Content Owners"
"DRM Upgrades"
"WMFSDK"
"Not For Resale"
"NFR,"
"Academic Edition"
"AE,"
"Qualified Educational User."
"Exclusion of Incidental, Consequential and Certain Other Damages"
"Restricted Rights"
"Exclusion des dommages accessoires, indirects et de certains autres dommages"
"Consumer rights"
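
On the nested-quotes caveat: nesting can only be detected when the opening and closing marks differ, as with typographic quotes. With straight quotes every mark looks the same, so nesting is ambiguous. Under that assumption, a small depth counter is one way to pull out the outermost quotations without a regular expression. This is only a sketch, and the function name `outermost_quotes` and its parameters are made up for illustration.

```python
def outermost_quotes(text, open_q="“", close_q="”"):
    """Return top-level quoted spans, keeping any nested quotes inside them."""
    spans = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == open_q:
            if depth == 0:
                start = i  # remember where the outermost quotation begins
            depth += 1
        elif ch == close_q and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i + 1])
    return spans


if __name__ == "__main__":
    for span in outermost_quotes("He said “she said “no” twice” and “left”."):
        print(span)
```

Stray closing marks are ignored, and a quotation that is never closed is dropped rather than guessed at.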