Understanding wget -r output

Question

deo*_*oll 3 linux remote wget recursive directory-listing

This is the output of the tree command in a directory:

.
|-- asdf.txt
|-- asd.txt
|-- fabc
|   |-- fbca
|   `-- file1.txt
|-- fldr1
|-- fldr2
|   `-- index.html
|-- fldr3
|   |-- cap.txt
|   `-- f01
`-- out.txt

6 directories, 6 files

I started a local HTTP server in this directory. Then I ran the following command:

wget -r -nv --spider --no-parent http://localhost:3000 -o -

...and got the following output:

2017-01-02 20:07:24 URL:http://localhost:3000/ [1580] -> "localhost:3000/index.html" [1]
http://localhost:3000/robots.txt:
2017-01-02 20:07:24 ERROR 404: Not Found.
2017-01-02 20:07:24 URL:http://localhost:3000/fabc/ [897] -> "localhost:3000/fabc/index.html" [1]
2017-01-02 20:07:24 URL:http://localhost:3000/fldr1/ [536] -> "localhost:3000/fldr1/index.html" [1]
2017-01-02 20:07:24 URL:http://localhost:3000/fldr2/ [0/0] -> "localhost:3000/fldr2/index.html" [1]
2017-01-02 20:07:24 URL:http://localhost:3000/fldr3/ [896] -> "localhost:3000/fldr3/index.html" [1]
2017-01-02 20:07:24 URL: http://localhost:3000/asd.txt 200 OK
unlink: No such file or directory
2017-01-02 20:07:24 URL: http://localhost:3000/asdf.txt 200 OK
unlink: No such file or directory
2017-01-02 20:07:24 URL: http://localhost:3000/out.txt 200 OK
unlink: No such file or directory
2017-01-02 20:07:24 URL:http://localhost:3000/fabc/fbca/ [548] -> "localhost:3000/fabc/fbca/index.html" [1]
2017-01-02 20:07:24 URL: http://localhost:3000/fabc/file1.txt 200 OK
unlink: No such file or directory
2017-01-02 20:07:24 URL:http://localhost:3000/fldr3/f01/ [548] -> "localhost:3000/fldr3/f01/index.html" [1]
2017-01-02 20:07:24 URL: http://localhost:3000/fldr3/cap.txt 200 OK
unlink: No such file or directory
Found no broken links.

FINISHED --2017-01-02 20:07:24--
Total wall clock time: 0.3s
Downloaded: 7 files, 4.9K in 0s (43.4 MB/s)

Is wget written to always look for index.html? Can we disable that?
What are the numbers such as 1580, 536, and 0/0?
Why does it say unlink: No such file or directory?

Answer 1

Mar*_*ost 5

You can try the --reject option to skip files (it also accepts wildcards):

wget --reject index.html

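However, you don't want to do that. When wget is used with -r, it needs some way to get the list of files in a directory. So wget requests each directory's index page (saved locally as index.html, as the log shows) and parses its contents, hoping to find paths to the other files in that directory. When a folder has no index.html, the web server usually generates one for wget on the fly; this generated page contains the directory listing. Generation of this listing must be enabled on the web server; otherwise wget receives an HTTP 404 reply and the recursive download fails.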
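To make the link-discovery step concrete, here is a minimal sketch of pulling href targets out of a directory listing. The HTML in `listing` is a hypothetical example of what a server might generate (the actual markup from the asker's server is not shown in the question), and `extract_links` is a made-up helper name. wget itself uses a real HTML parser rather than grep, so treat this only as an illustration of the idea.

```shell
#!/bin/sh
# Hypothetical directory listing, loosely modeled on the tree in the question.
listing='<ul>
<li><a href="asd.txt">asd.txt</a></li>
<li><a href="asdf.txt">asdf.txt</a></li>
<li><a href="fabc/">fabc/</a></li>
</ul>'

# Print one link target per line by isolating each href="..." attribute
# and stripping the attribute name and the surrounding quotes.
extract_links() {
  grep -o 'href="[^"]*"' | sed 's/^href="//; s/"$//'
}

printf '%s\n' "$listing" | extract_links
```

Each printed entry is a URL a recursive crawler would visit next. Entries ending in a slash are directories, whose own listing would have to be fetched in turn.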

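The numbers are the file sizes in bytes.
The unlink message means that a file could not be deleted (probably because it was never created in the first place). Do you have write permission for the directory you are downloading into with wget?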
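As a rough check of the "size in bytes" reading, the bracketed numbers on the directory-index lines of the question's log can be summed and compared with the final Downloaded: 7 files, 4.9K line. The sketch below does this with awk. The function name `sum_sizes` is made up, and reading 0/0 as "bytes received / declared length" is an assumption about wget's log format, not something stated in the answer.

```shell
#!/bin/sh
# Directory-index lines copied from the wget output in the question.
log='2017-01-02 20:07:24 URL:http://localhost:3000/ [1580] -> "localhost:3000/index.html" [1]
2017-01-02 20:07:24 URL:http://localhost:3000/fabc/ [897] -> "localhost:3000/fabc/index.html" [1]
2017-01-02 20:07:24 URL:http://localhost:3000/fldr1/ [536] -> "localhost:3000/fldr1/index.html" [1]
2017-01-02 20:07:24 URL:http://localhost:3000/fldr2/ [0/0] -> "localhost:3000/fldr2/index.html" [1]
2017-01-02 20:07:24 URL:http://localhost:3000/fldr3/ [896] -> "localhost:3000/fldr3/index.html" [1]
2017-01-02 20:07:24 URL:http://localhost:3000/fabc/fbca/ [548] -> "localhost:3000/fabc/fbca/index.html" [1]
2017-01-02 20:07:24 URL:http://localhost:3000/fldr3/f01/ [548] -> "localhost:3000/fldr3/f01/index.html" [1]'

# Print "<line count> <total bytes>" for lines shaped like: URL:<url> [<size>] -> ...
sum_sizes() {
  awk '
    $3 ~ /^URL:/ && $4 ~ /^\[[0-9]+(\/[0-9]+)?\]$/ {
      size = $4
      gsub(/\[|\]/, "", size)   # strip the square brackets
      sub(/\/.*/, "", size)     # for "0/0", keep only the part before the slash
      total += size
      n++
    }
    END { printf "%d %d\n", n, total }
  '
}

printf '%s\n' "$log" | sum_sizes   # prints: 7 5005
```

5005 bytes divided by 1024 is about 4.9, which matches the 4.9K total and the count of 7 files in wget's summary.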

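Edit: After testing wget with --spider and --recursive, I reproduced your unlink error. It seems wget uses the response's content type to decide whether a file can contain links to other resources. If the content-type test fails and the file is not downloaded, wget still tries to delete the temporary file as if it had been downloaded (this is evident when re-running wget with --debug; it explicitly reports Removing file due to --spider in recursive_retrieve():). I guess you have found a bug in wget.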
