Using wget to recursively fetch a directory with arbitrary files in it

Question

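I have a web directory where I store some config files. I'd like to use wget to pull those files down and keep their current structure. For instance, the remote directory looks like this: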

http://mysite.com/configs/.vim/

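.vim holds multiple files and directories. I want to replicate it on the client using wget, but I can't seem to find the right combination of wget flags to get this done. Any ideas?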

Answer 1

Jer*_*ten 923

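You have to pass the -np/--no-parent option to wget (in addition to -r/--recursive, of course); otherwise it will follow the link in the directory index on my site up to the parent directory. So the command would look like this: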

wget --recursive --no-parent http://example.com/configs/.vim/

To avoid downloading the auto-generated index.html files, use the -R/--reject option:

wget -r -np -R "index.html*" http://example.com/configs/.vim/

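Add -nH (cuts out the hostname) and --cut-dirs=X (cuts out X directories). It's a bit annoying to have to count the directories for X by hand. (47 upvotes)
@matteo Because robots.txt may disallow crawling the site. Add -e robots=off to force crawling. (28 upvotes)
Why doesn't any of this work for http://www.w3.org/History/1991-WWW-NeXT/Implementation/? It only downloads robots.txt. (3 upvotes)
If you don't want to download the entire content, you can use -l1 to download just the directory (example.com in your case), -l2 to download the directory and all level-1 subfolders ('example.com/something' but not 'example.com/something/foo'), and so on. If you give no -l option, wget uses -l 5 automatically. With -l 0 the depth is unlimited, so wget follows every link it finds (by default only on the starting host). /sf/answers/1378660041/ (3 upvotes)
Why do I always get an index.html file instead of the directory? `wget -r --no-parent -e robots=off http://demo.inspiretheme.com/templates/headlines/images/` This command only fetches an index.html file. (3 upvotes)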

Answer 2

Sri*_*ram 117

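To download a directory recursively while rejecting index.html* files, and to save it without the hostname, the parent directories, or the full directory structure: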

wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" http://mysite.com/dir1/dir2/data
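
A minimal sketch of how the --cut-dirs value in a command like the one above could be derived instead of counted by hand. The helper name count_parent_dirs is hypothetical (it is not part of wget). It assumes the last path component of the URL is the directory you want to keep locally, and it counts only the components above it.

```shell
# count_parent_dirs URL
# Prints the number of path components above the last one.
# The body runs in a subshell, so its variables do not leak into the caller.
count_parent_dirs() (
    path="${1#*://}"              # drop the scheme, e.g. "http://"
    case "$path" in
        */*) path="${path#*/}" ;; # drop the host part
        *)   path="" ;;           # bare host, no path at all
    esac
    path="${path%%\?*}"           # drop a query string, if any
    path="${path%%#*}"            # drop a fragment, if any
    path="${path%/}"              # drop one trailing slash
    count=0
    while [ "${path#*/}" != "$path" ]; do   # loop while a slash remains
        path="${path#*/}"
        count=$((count + 1))
    done
    echo "$count"
)

# Possible use (not executed here):
#   url=http://mysite.com/dir1/dir2/data
#   wget -r -nH --cut-dirs="$(count_parent_dirs "$url")" --no-parent --reject="index.html*" "$url"
```

For the URL in this answer the helper prints 2, which matches the --cut-dirs=2 used above.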

@matteo Try adding: -e robots=off (33 upvotes)

Answer 3

小智 112

For anyone else having similar issues: wget honors robots.txt, which might not allow you to grab the site. No worries, you can turn that off:

wget -e robots=off http://www.example.com/

http://www.gnu.org/software/wget/manual/html_node/Robot-Exclusion.html

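When you ignore robots.txt, you should at least throttle your requests. The behaviour suggested in this answer is highly impolite. (2 upvotes)
@Nobody So what is the polite answer to this? (2 upvotes)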
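
One possible answer to the question above, as a sketch: keep -e robots=off but slow the crawl down. --wait, --random-wait and --limit-rate are documented wget options; the helper name polite_wget_args and the default values (2 seconds, 200k) are assumptions chosen for illustration, not recommendations from this thread.

```shell
# polite_wget_args [WAIT_SECONDS] [RATE]
# Prints throttling flags for wget, one per line.
polite_wget_args() (
    wait_s="${1:-2}"      # base delay between requests (assumed default)
    rate="${2:-200k}"     # bandwidth cap (assumed default)
    printf '%s\n' "--wait=$wait_s" "--random-wait" "--limit-rate=$rate"
)

# Possible use (not executed here; the substitution is left unquoted on
# purpose so that each flag becomes its own argument):
#   wget -r -np -e robots=off $(polite_wget_args 2 200k) http://www.example.com/
```

--random-wait only varies the delay set by --wait, so the two are given together.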

Answer 4

Sam*_*ody 37

You should use the -m (mirror) flag, as it takes care not to mess with timestamps and recurses without a depth limit.

wget -m http://example.com/configs/.vim/

If you add the points mentioned by others in this thread, it becomes:

wget -m -e robots=off --no-parent http://example.com/configs/.vim/
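
For reference, the wget manual describes -m/--mirror as currently equivalent to -r -N -l inf --no-remove-listing. The sketch below spells that expansion out. The wrapper name mirror_dir and the DRY_RUN variable are hypothetical conveniences so the command can be inspected before it touches the network.

```shell
# mirror_dir URL
# Mirrors one directory tree without ascending to its parent.
# Set DRY_RUN=1 to print the command instead of running it.
mirror_dir() {
    # -r: recurse, -N: timestamping, -l inf: no depth limit,
    # --no-remove-listing: keep FTP listings (the documented expansion of -m)
    set -- wget -r -N -l inf --no-remove-listing --no-parent "$1"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Possible use:
#   DRY_RUN=1 mirror_dir http://example.com/configs/.vim/
```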

Answer 5

Eri*_*ger 32

Here is the complete wget command that worked for me to download files from a server's directory (ignoring robots.txt):

wget -e robots=off --cut-dirs=3 --user-agent=Mozilla/5.0 --reject="index.html*" --no-parent --recursive --relative --level=1 --no-directories http://www.example.com/archive/example/5.3.0/

Answer 6

ber*_*kyi 17

First of all, thanks to everyone who posted an answer. Here is my "ultimate" wget script for downloading a website recursively:

wget --recursive ${comment# self-explanatory} \
  --no-parent ${comment# will not crawl links in folders above the base of the URL} \
  --convert-links ${comment# convert links with the domain name to relative and uncrawled to absolute} \
  --random-wait --wait 3 --no-http-keep-alive ${comment# do not get banned} \
  --no-host-directories ${comment# do not create folders with the domain name} \
  --execute robots=off --user-agent=Mozilla/5.0 ${comment# I AM A HUMAN!!!} \
  --level=inf  --accept '*' ${comment# do not limit to 5 levels or common file formats} \
  --reject="index.html*" ${comment# use this option if you need an exact mirror} \
  --cut-dirs=0 ${comment# replace 0 with the number of folders in the path, 0 for the whole domain} \
  "$URL"

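Afterwards, it may be necessary to strip the query parameters from URLs such as main.css?crc=12324567 and to run a local server (for example, python3 -m http.server in the directory you just wget'ed) so that the JS runs. Note that the --convert-links option only kicks in after the full crawl has completed.

Also, if you are trying to wget a website that may go down soon, you should get in touch with the ArchiveTeam and ask them to add your website to their ArchiveBot queue.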
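
A minimal sketch of the query-stripping step, assuming the downloaded files were saved with the query string in their names (for example main.css?crc=12324567, as wget typically does on Unix-like systems). The function name strip_query_suffixes is hypothetical. It only renames files; references inside the downloaded HTML are not rewritten, and file names containing newlines are not handled.

```shell
# strip_query_suffixes DIR
# Renames every file under DIR whose name contains "?" to the part before
# the "?". An existing target is never overwritten.
strip_query_suffixes() {
    find "$1" -type f -name '*[?]*' | while IFS= read -r f; do
        dir="${f%/*}"                    # directory part of the path
        base="${f##*/}"                  # file name only
        target="$dir/${base%%[?]*}"      # name with the query removed
        if [ -e "$target" ]; then
            echo "skip (already exists): $target" >&2
        else
            mv -- "$f" "$target"
        fi
    done
}

# Possible use, from the directory that contains the download:
#   strip_query_suffixes .
#   python3 -m http.server
```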

Answer 7

小智 7

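If --no-parent does not help, you can use the --include option (an abbreviation of --include-directories, short form -I).

Directory structure: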

http://<host>/downloads/good
http://<host>/downloads/bad

You want to download the downloads/good directory but not downloads/bad:

wget --include downloads/good --mirror --execute robots=off --no-host-directories --cut-dirs=1 --reject="index.html*" --continue http://<host>/downloads/good

Answer 8

Sen*_*esh 7

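It sounds like you're trying to get a mirror of your files. While wget has some interesting FTP uses, a simple mirror should work. Just a few considerations to make sure you're able to download the files properly.

Respect `robots.txt`

If you have a /robots.txt file in your public_html, www, or configs directory, make sure it does not prevent crawling. If it does, you need to tell wget to ignore it by adding the following option to your wget command: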

wget -e robots=off 'http://your-site.com/configs/.vim/'

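Convert remote links to local files.

Additionally, wget must be instructed to convert links so that they point at the downloaded files. If you've done everything above correctly, you should be fine here. The easiest way I've found to get all the files, provided nothing is hidden behind a non-public directory, is the mirror option.

Try this: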

wget -mpEk 'http://your-site.com/configs/.vim/'

# If robots.txt is present:

wget -mpEk -e robots=off 'http://your-site.com/configs/.vim/'

# Good practice: only deal with the highest-level directory you specify (instead of downloading all of `mysite.com`, you're just mirroring from `.vim`)

wget -mpEk -e robots=off --no-parent 'http://your-site.com/configs/.vim/'

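Using -m instead of -r is preferred because it has no maximum recursion depth and downloads all assets. Mirror is pretty good at determining the full depth of a site; however, if you have many external links you could end up downloading more than just your site, which is why we use -p -E -k. The output should be all the prerequisite files needed to render the page, with the directory structure preserved. -k converts links to point at local files. Since you should have a link set up, you should get your configs folder with the .vim directory inside it.

Mirror mode also works with a directory structure that is served over ftp://.

General rule of thumb:

Depending on the size of the site you are mirroring, you will be sending many requests to the server. To avoid being blacklisted or cut off, use the wait option to rate-limit your downloads.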

wget -mpEk --no-parent -e robots=off --wait=1 --random-wait 'http://your-site.com/configs/.vim/'

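But if you're simply downloading the ../config/.vim/ files, you shouldn't have to worry about it, since you're ignoring parent directories and downloading only a small set of files.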

Answer 9

Con*_*roe 5

wget -r http://mysite.com/configs/.vim/

Works for me.

Perhaps you have a .wgetrc that is interfering with it?

Answer 10

pra*_*upd 5

To fetch a directory recursively with a username and password, use the following command:

wget -r --user=(put username here) --password='(put password here)' --no-parent http://example.com/

Answer 11

rko*_*kok 5

This version downloads recursively and doesn't create the parent directories.

wgetod() {
    NSLASH="$(echo "$1" | perl -pe 's|.*://[^/]+(.*?)/?$|\1|' | grep -o / | wc -l)"
    NCUT=$((NSLASH > 0 ? NSLASH-1 : 0))
    wget -r -nH --user-agent=Mozilla/5.0 --cut-dirs=$NCUT --no-parent --reject="index.html*" "$1"
}

Usage:

Add the function to ~/.bashrc or paste it into a terminal, then run:
wgetod "http://example.com/x/"

Answer 12

pr-*_*pal 5

The following options seem to be the perfect combination when dealing with recursive downloads:

wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2

Relevant snippets from the man page, for convenience:

   -nd
   --no-directories
       Do not create a hierarchy of directories when retrieving recursively.  With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the
       filenames will get extensions .n).


   -np
   --no-parent
       Do not ever ascend to the parent directory when retrieving recursively.  This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.

Archived:	16 years, 11 months ago
Views:	663277
Last activity:	6 years, 1 month ago
