wget web crawler retrieves unwanted index.html index files

Question

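I wrote a ~/.bashrc function to save some web directories to my local disk. It works well, except that it also fetches some unwanted index files that do not exist on the website. I use it like this: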

crwl http://ioccc.org/2013/cable3/

But it also retrieves files such as index.html?C=D;O=A index.html?C=D;O=D index.html?C=M;O=A index.html?C=M;O=D index.html?C=N;O=A index.html?C=N;O=D index.html?C=S;O=A index.html?C=S;O=D

Complete file listing:

 kenn@kenn:~/experiment/crwl/ioccc.org/2013/cable3$ ls
 bios        index.html?C=D;O=A  index.html?C=S;O=A           screenshot_flightsim4.png
 cable3.c    index.html?C=D;O=D  index.html?C=S;O=D           screenshot_lotus123.png
 fd.img      index.html?C=M;O=A  Makefile                     screenshot_qbasic.png
 hint.html   index.html?C=M;O=D  runme                        screenshot_simcity.png
 hint.text   index.html?C=N;O=A  sc-ioccc.terminal            screenshot_win3_on_macosx.png
 index.html  index.html?C=N;O=D  screenshot_autocad.png

I want to exclude these files when cloning the directory with wget. Is there any wget switch or trick to clone a web directory as is?

My shell function in .bashrc:

crwl() {
  wget --tries=inf --timestamping --recursive --level=inf --convert-links --page-requisites --no-parent "$@"
}

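Edit: I found two possible workarounds

1) Add the -R index.html?* flag

2) Add the -R =A,=D flag, which rejects the index.html?C=D;O=A files but not index.html

I don't know which one is appropriate, and neither of them seems safe.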

Answer 1

cla*_*123 7

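To exclude the index-sorting files (those with URLs like index.html?C=...) without excluding any other kind of index.html* file, a more precise specification is indeed possible. Try: -R '\?C='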
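
To see why the backslash matters, here is a minimal sketch using shell `case` patterns. It assumes wget's wildcard handling treats `?` and `\?` the way shell globbing does, which is an assumption about wget rather than something verified here. The function names and the file name topicC=notes.html are made up for illustration.

```shell
# Unescaped: ? is a wildcard that stands for any single character.
matches_unescaped() {
  case "$1" in
    *?C=*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Escaped: \? stands for a literal question mark only.
matches_escaped() {
  case "$1" in
    *\?C=*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Compare both patterns on a sort-index name and on an unrelated (hypothetical) name.
for name in 'index.html?C=D;O=A' 'topicC=notes.html'; do
  matches_unescaped "$name" && echo "unescaped pattern matches: $name"
  matches_escaped "$name" && echo "escaped pattern matches:   $name"
done
```

The unescaped pattern also catches the unrelated name, because `?` is free to match the letter before `C=`. The escaped pattern matches only names that really contain `?C=`.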

Quick demo

Set up a different, empty directory, for example:

$ mkdir ~/experiment2
$ cd ~/experiment2

Then run a shorter version of the command, without recursion and levels, for a quick one-page test:

$ wget --tries=inf --timestamping --convert-links --page-requisites --no-parent -R '\?C=' http://ioccc.org/2013/cable3/

After wget finishes, ~/experiment2 will contain no index.html?C=... files:

.
└── ioccc.org
    ├── 2013
    │   └── cable3
    │       └── index.html
    ├── icons
    │   ├── back.gif
    │   ├── blank.gif
    │   ├── image2.gif
    │   ├── text.gif
    │   └── unknown.gif
    └── robots.txt

4 directories, 7 files

So it did exclude those redundant index-sorting index.html?C=... files while keeping every other index.html file, which in this case is just index.html.
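
A quick way to confirm the result without reading the tree by eye is to search for leftovers. The following is only a sketch: `list_sort_indexes` is a made-up helper name, and it assumes a `find` whose `-name` patterns accept `\?` for a literal question mark, as GNU find does.

```shell
# List leftover sort-index files under a directory (default: current directory).
# In find's -name pattern, \? matches a literal question mark.
list_sort_indexes() {
  find "${1:-.}" -type f -name 'index.html\?C=*'
}

# Example (not run here): list_sort_indexes ~/experiment2
```

If the reject pattern worked, the command should print nothing for the demo directory.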

Implementation

So to put -R '\?C=' into practice, just update the shell function in ~/.bashrc:

crwl() {
  wget --tries=inf --timestamping --recursive --level=inf --convert-links --page-requisites --no-parent -R '\?C=' "$@"
}

Then remember to test in a new terminal, or re-source your ~/.bashrc so the change takes effect:

$ . ~/.bashrc

Then try it in a new directory, for comparison:

$ mkdir ~/experiment3
$ cd ~/experiment3
$ crwl http://ioccc.org/2013/cable3/

Warranty

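Only works with wget 1.14 and later. So if your wget -V says it is 1.13, this may not work, and you will need to delete those pesky index.html?C=... files yourself or try to get a more recent version of wget.
It works by specifying that you want to -R, or reject, a pattern. In this case the pattern is ?C=, which is typical of the index.html?C=... variants of index.html.
However, ? happens to be a wget wildcard, so to match a literal ? you need to escape it as \?
Do not interrupt wget. The way wget handles browsable web pages seems to be download first, delete later, as if it needs to check those pages for further links to crawl. So if you cancel midway, you will still end up with index.html?C= files. Only if you let wget finish will it follow your -R specification and delete any temporarily downloaded index.html?C=... files for you.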
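
For the cases above where manual cleanup is needed (an older wget, or an interrupted run), a rough sketch of a version check and a cleanup helper might look like this. All three function names are made up. It assumes GNU `sort -V` is available and that the first line of `wget -V` has the form "GNU Wget 1.21.3 built on linux-gnu.", with the version as the third word.

```shell
# True if version $1 is at least version $2 (relies on GNU sort -V).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the installed wget version, assumed to be the third word
# on the first line of `wget -V`.
wget_version() {
  wget -V | awk 'NR == 1 { print $3 }'
}

# Delete leftover sort-index files under a directory (default: current directory).
# \? in the -name pattern matches a literal question mark.
remove_sort_indexes() {
  find "${1:-.}" -type f -name 'index.html\?C=*' -exec rm -f -- {} +
}

# Example (not run here): clean up by hand only when wget is older than 1.14.
# version_at_least "$(wget_version)" 1.14 || remove_sort_indexes ~/experiment3
```

`remove_sort_indexes` can also be run on its own after an interrupted download. It only touches files whose names start with index.html?C=, so the plain index.html stays in place.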

Archived: 9 years, 10 months ago
Views: 10397
Last activity: 8 years, 2 months ago