How to optimize a grep regular expression to match URLs

Question

Geo*_*ter 7 regex url terminal optimization grep

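Background:

I have a directory named "stuff" containing 26 files (2 .txt and 24 .rtf) on Mac OS 10.7.5.
I'm using grep (GNU v2.5.1) to find every string in these 26 files that matches the structure of a URL, and then print them to a new file (output.txt).
The regex below does work on a small scale. I ran it on a directory with 3 files (1 .rtf and 2 .txt) containing a bunch of dummy text and 30 URLs, and it executed successfully in less than 1 second.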

I am using the following regex:

1

grep -iIrPoh 'https?://.+?\s' . --include=*.txt --include=*.rtf > output.txt

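Issue

The current size of my directory "stuff" is 180 KB, with 26 files. In the terminal, I cd into this directory (stuff) and then run my regex. I waited about 15 minutes and decided to kill the process because it had NOT finished. When I looked at the output.txt file, it was a shocking 19.75 GB (screenshot).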

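Questions

What could be causing the output.txt file to be so much larger than the entire directory?
What else could I add to my regex to streamline the processing time?

Thank you in advance for any guidance you can provide. I have been working on many different variations of my regex for almost 16 hours and have read tons of posts online, but nothing seems to help. I am new to writing regex, but with a little hand-holding I think I will get it.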

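Additional Comments

I ran the following command to see what was being recorded in the output.txt (19.75 GB) file. It looks like the regex is finding the right strings, with the exception of what I think are odd characters, such as curly braces } { and a string like {\fldrslt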

    **TERMINAL**
    $ head -n 100 output.txt
    http://michacardenas.org/\
    http://culturelab.asc.upenn.edu/2013/03/06/calling-all-wearable-electronics-hackers-e-textile-makers-and-fashion-activists/\
    http://www.mumia-themovie.com/"}}{\fldrslt 
    http://www.mumia-themovie.com/}}\
    http://www.youtube.com/watch?v=Rvk2dAYkHW8\
    http://seniorfitnesssite.com/category/senior-fitness-exercises\
    http://www.giac.org/ 
    http://www.youtube.com/watch?v=deOCqGMFFBE"}}{\fldrslt 
    http://www.youtube.com/watch?v=deOCqGMFFBE}}
    https://angel.co/jason-a-hoffman\
    https://angel.co/joyent?save_req=mention_slugs"}}{\fldrslt 
    http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html"}}{\fldrslt 
    http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html}} 
    http://www.cooking-hacks.com/index.php/documentation/tutorials/ehealth-biometric-sensor-platform-arduino-raspberry-pi-medical"}}{\fldrslt 
    http://www.cooking-hacks.com/index.php/documentation

Catalog of regex commands tested so far

2

grep -iIrPoh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: Took 1 second to run / produced a blank file (output_2.txt)

3

grep -iIroh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: Took 1 second to run / produced a blank file (output_3.txt)

4

grep -iIrPoh 'https?://\S+\s' . --include=*.txt --include=*.rtf > sixth.txt
FAIL: Took 1 second to run / produced a blank file (output_4.txt)

5

grep -iIroh 'https?://' . --include=*.txt --include=*.rtf > output.txt
FAIL: Took 1 second to run / produced a blank file (output_5.txt)

6

grep -iIroh 'https?://\S' . --include=*.txt --include=*.rtf > output.txt
FAIL: Took 1 second to run / produced a blank file (output_6.txt)

7

grep -iIroh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: Took 1 second to run / produced a blank file (output_7.txt)

8

grep -iIrPoh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
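FAIL: Let it run for 10 minutes and manually killed the process / produced a 20.63 GB file (output_8.txt) / On the positive side, the strings this regex captured were accurate, in that they did NOT include any odd additional characters such as curly braces or the RTF file format syntax {\fldrslt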

9

find . -print | grep -iIPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_9.txt
FAIL: Took 1 second to run / produced a blank file (output_9.txt)

10

find . -print | grep -iIrPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_10.txt
FAIL: Took 1 second to run / produced a blank file (output_10.txt)

11

grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf

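Editor's note: This regex only worked properly when I output the strings to the terminal window. It did not work when I output to a file, output_11.txt.

NEAR SUCCESS: All URL strings were cleanly cut to remove whitespace before and after the string, and all special markup associated with the .RTF format was removed. Downside: of the sample URLs tested for accuracy, some were cut short and lost their structure at the end. I estimate that about 10% of the strings were improperly truncated.

Example of a truncated string:
URL structure before the regex: http://www.youtube.com/watch?v=deOCqGMFFBE
URL structure after the regex: http://www.youtube.com/watch?v=de

The questions now are:
1.) Is there a way to ensure we do not cut off part of the URL string, as in the example above?
2.) Would it help to define an escape command for the regex? (If that is even possible.)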

12

grep -iIroh 'https?:\/\/[\w~#%&_+=,.?\/-]+' . --include=*.txt --include=*.rtf > output_12.txt
FAIL: Took 1 second to run / produced a blank file (output_12.txt)

13

grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > tmp/output.txt

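FAIL: Let it run for 2 minutes and manually killed the process / produced a 1 GB file. The intention of this command was to isolate grep's output file (output.txt) in a subdirectory, to make sure we were not creating an infinite loop with grep reading back its own output. Solid idea, but no cigar (screenshot).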
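
A minimal sketch of how the feedback loop could be ruled out, assuming it really is caused by grep re-reading its own output: both `output.txt` and `tmp/output.txt` sit inside `.` and match `--include=*.txt`, so every URL written out becomes new input. Writing the results outside the searched tree avoids that. The temporary directory and sample files below are made up for illustration, and `-E` is used instead of the backslash-escaped operators of test 13.

```shell
# Build a small stand-in for the "stuff" directory (hypothetical sample data).
workdir=$(mktemp -d)
mkdir "$workdir/stuff"
printf 'see http://www.giac.org/ for details\n' > "$workdir/stuff/a.txt"
printf '{\\rtf1 https://angel.co/jason-a-hoffman\\\n}\n' > "$workdir/stuff/b.rtf"

# Search inside stuff/, but write the results next to it rather than inside it,
# so grep never sees its own output as an input file.
cd "$workdir/stuff" || exit 1
grep -IrohE --include='*.txt' --include='*.rtf' \
    'https?://[a-zA-Z0-9~#%&_+=,.?/-]+' . > ../output.txt

cat ../output.txt
```

With this layout the result file should contain one line per URL found and stop growing as soon as grep has read the two input files.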

14

grep -iIroh 'https\?://[a-z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
FAIL: Same result as #11. The command resulted in an infinite loop with truncated strings.

15

grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
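ALMOST WINNER: This captured the entire URL string. It did result in an infinite loop that created millions of strings in the terminal, but I can manually identify where the first loop starts and ends, so this should be fine. Great job @acheong87! Thank you!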

16

find . -print | grep -v output.txt | xargs grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' --include=*.txt --include=*.rtf > output.txt
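NEAR SUCCESS: I was able to grab the entire URL string, which is good. However, the command turned into an infinite loop. After about 5 seconds of output to the terminal, it had produced about 1 million URL strings, all of them duplicates. This would be a good expression if we could figure out how to escape it after a single loop.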
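
A possible reason the pipeline in test 16 still loops: `find . -print` also prints `.` itself, which does not contain the text `output.txt`, so it survives the `grep -v` filter, and `grep -r` then recurses over the whole directory again, output file included. Below is a sketch of a non-recursive variant, assuming the files carry the extensions described in the question. The sample files are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'http://www.giac.org/ \n' > notes.txt
printf 'http://www.youtube.com/watch?v=deOCqGMFFBE}}\n' > page.rtf

# -type f keeps directories (including ".") out of the file list, and
# ! -name 'output.txt' keeps the result file from being fed back in.
# grep runs without -r, so it only reads the files find hands to it.
find . -type f \( -name '*.txt' -o -name '*.rtf' \) ! -name 'output.txt' \
    -exec grep -IohE 'https?://[a-zA-Z0-9~#%&_+=,.?/-]+' {} + > output.txt
```

Because the output file is excluded by name, running the command a second time in the same directory should give the same result rather than a growing file.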

17

ls *.rtf *.txt | grep -v 'output.txt' | xargs -J {} grep -iIF 'http' {} grep -iIFo > output.txt

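NEAR SUCCESS: This command resulted in a single loop through all files in the directory, which is good b/c it solved the infinite loop problem. However, the structure of the URL strings was truncated, and the output included the name of the file each string came from.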

18

ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+'
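NEAR SUCCESS: This expression prevented the infinite loop, which is good. It created a new file in the directory it was querying, and that file was small, about 30 KB. It captured all the proper characters in the string plus a couple that are not needed. As Floris mentioned, in cases where the URL is NOT terminated with a space, for example http://www.mumia-themovie.com/"}}{\fldrslt, it captured the markup syntax.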

19

ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[a-z./?#=%_-,~&]+'
FAIL: This expression prevented the infinite loop, which is good, but it did NOT capture the entire URL string.

Answer 1

Flo*_*ris 10

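The expression I gave in the comments (your test 17) was intended to test two things:

1) can we make the infinite loop go away, and
2) can we loop cleanly over all files in the directory

I believe we achieved both. So now I am audacious enough to propose a "solution":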

ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+'

Breaking it down:

ls *.rtf *.txt         - list all .rtf and .txt files
grep -v 'output.txt'   - skip 'output.txt' (in case it was left from a previous attempt)
xargs                  - "take each line of the input in turn and substitute it
                         at the end of the following command"
                         (or use -J xxx to sub at place of xxx anywhere in command)
grep -i                - case insensitive
     -I                - skip binary (shouldn't have any since we only process .txt and .rtf...)
     -o                - print only the matched bit (not the entire line), i.e. just the URL
     -h                - don't include the name of the source file
     -E                - use extended regular expressions 

     'http             - match starts with http (there are many other URLs possible... but out of scope for this question)
      s?               - next character may be an s, or is not there
      ://              - literal characters that must be there
      [^[:space:]]+    - one or more "non space" characters (greedy... "as many as possible")

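This seems to work OK on a very simple set of files / URLs. I think that now the iterating problem is solved, the rest is easy. There are tons of "URL validation" regexes online. Pick any one of them... the above expression really just searches for "everything that follows http until a space". If you end up with odd or missing matches, let us know.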
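
To see what "everything that follows http until a space" means in practice, here is a small sketch that runs the proposed command on two made-up files in a temporary directory (the file names and contents are assumptions for illustration only).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Visit http://www.giac.org/ today\n' > plain.txt
printf '{\\field{HYPERLINK "http://www.mumia-themovie.com/"}}{\\fldrslt link}\n' > rich.rtf

# Same pipeline as in the answer; output.txt is filtered out of the file
# list, so it is never searched even though it lives in the same directory.
ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+' > output.txt
cat output.txt
```

The URL from the plain-text file should come out clean, while the one embedded in RTF field markup should drag the trailing `"}}{\fldrslt` along, since none of those characters is whitespace.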

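Looking at your earlier output again, I see that you have cases where the URL is not terminated by a space, for example `http://www.mumia-themovie.com/"}}{\fldrslt`, which you presumably want to end at the `"`. In that case you do not want `[^[:space:]]` but rather something like `grep -iIohE 'https?://[a-z./?#=%_-,~&]+'` (2 upvotes)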
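
A sketch of the allow-list idea from this comment, with two adjustments that are assumptions rather than part of the original: digits are added to the bracket expression, and the hyphen is moved to the end so that it is read as a literal character instead of forming the range `_-,`. The sample file is made up.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '{\\field{HYPERLINK "http://www.youtube.com/watch?v=deOCqGMFFBE"}}{\\fldrslt x}\n' > rich.rtf

# Only characters listed in the bracket expression can be part of a match,
# so the match stops at the closing quote instead of at the next space.
grep -iIohE 'https?://[a-z0-9./?#=%_,~&-]+' rich.rtf
```

On the grep 2.5.1 build from the question, tests 11 and 15 suggest that combining `-i` with `-o` can truncate matches; spelling out `a-zA-Z` in the class and dropping `-i`, as test 15 does, is an alternative worth trying there.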

Archived: 11 years, 9 months ago
Views: 5895
Last activity: 11 years, 9 months ago