1 This is Info file wget.info, produced by Makeinfo version 1.68 from the
2 input file ./wget.texi.
4 INFO-DIR-SECTION Net Utilities
5 INFO-DIR-SECTION World Wide Web
7 * Wget: (wget). The non-interactive network downloader.
10 This file documents the GNU Wget utility for downloading network data.
13 Copyright (C) 1996, 1997, 1998, 2000 Free Software Foundation, Inc.
15 Permission is granted to make and distribute verbatim copies of this
16 manual provided the copyright notice and this permission notice are
17 preserved on all copies.
19 Permission is granted to copy and distribute modified versions of
20 this manual under the conditions for verbatim copying, provided also
21 that the sections entitled "Copying" and "GNU General Public License"
22 are included exactly as in the original, and provided that the entire
23 resulting derived work is distributed under the terms of a permission
24 notice identical to this one.
27 File: wget.info, Node: Top, Next: Overview, Prev: (dir), Up: (dir)
32 This manual documents version 1.5.3+dev of GNU Wget, the freely
33 available utility for network download.
35 Copyright (C) 1996, 1997, 1998 Free Software Foundation, Inc.
39 * Overview:: Features of Wget.
40 * Invoking:: Wget command-line arguments.
41 * Recursive Retrieval:: Description of recursive retrieval.
42 * Following Links:: The available methods of chasing links.
43 * Time-Stamping:: Mirroring according to time-stamps.
44 * Startup File:: Wget's initialization file.
45 * Examples:: Examples of usage.
46 * Various:: The stuff that doesn't fit anywhere else.
47 * Appendices:: Some useful references.
48 * Copying:: You may give out copies of Wget.
49 * Concept Index:: Topics covered by this manual.
52 File: wget.info, Node: Overview, Next: Invoking, Prev: Top, Up: Top
57 GNU Wget is a freely available network utility to retrieve files from
58 the World Wide Web, using HTTP (Hyper Text Transfer Protocol) and FTP
59 (File Transfer Protocol), the two most widely used Internet protocols.
60 It has many useful features to make downloading easier, some of them
63 * Wget is non-interactive, meaning that it can work in the
64 background, while the user is not logged on. This allows you to
65 start a retrieval and disconnect from the system, letting Wget
66 finish the work. By contrast, most Web browsers require the
67 user's constant presence, which can be a great hindrance when
68 transferring a lot of data.
70 * Wget is capable of descending recursively through the structure of
71 HTML documents and FTP directory trees, making a local copy of the
72 directory hierarchy similar to the one on the remote server. This
73 feature can be used to mirror archives and home pages, or traverse
74 the web in search of data, like a WWW robot (*Note Robots::). In
75 that spirit, Wget understands the `norobots' convention.
77 * File name wildcard matching and recursive mirroring of directories
78 are available when retrieving via FTP. Wget can read the
79 time-stamp information given by both HTTP and FTP servers, and
80 store it locally. Thus Wget can see if the remote file has
81 changed since last retrieval, and automatically retrieve the new
82 version if it has. This makes Wget suitable for mirroring of FTP
83 sites, as well as home pages.
85 * Wget works exceedingly well on slow or unstable connections,
86 retrying the document until it is fully retrieved, or until a
87 user-specified retry count is surpassed. It will try to resume the
88 download from the point of interruption, using `REST' with FTP and
89 `Range' with HTTP servers that support them.
91 * By default, Wget supports proxy servers, which can lighten the
92 network load, speed up retrieval and provide access behind
93 firewalls. However, if you are behind a firewall that requires
94 that you use a SOCKS-style gateway, you can get the SOCKS library
95 and build Wget with SOCKS support. Wget also supports
96 passive FTP downloading as an option.
98 * Built-in features offer mechanisms to tune which links you wish to
99 follow (*Note Following Links::).
101 * The retrieval is conveniently traced with printing dots, each dot
102 representing a fixed amount of data received (1KB by default).
103 These representations can be customized to your preferences.
105 * Most of the features are fully configurable, either through
106 command line options, or via the initialization file `.wgetrc'
107 (*Note Startup File::). Wget allows you to define "global"
108 startup files (`/usr/local/etc/wgetrc' by default) for site
111 * Finally, GNU Wget is free software. This means that everyone may
112 use it, redistribute it and/or modify it under the terms of the
113 GNU General Public License, as published by the Free Software
114 Foundation (*Note Copying::).
117 File: wget.info, Node: Invoking, Next: Recursive Retrieval, Prev: Overview, Up: Top
122 By default, Wget is very simple to invoke. The basic syntax is:
124 wget [OPTION]... [URL]...
126 Wget will simply download all the URLs specified on the command
127 line. URL is a "Uniform Resource Locator", as defined below.
129 However, you may wish to change some of the default parameters of
130 Wget. You can do it in two ways: permanently, by adding the appropriate
131 command to `.wgetrc' (*Note Startup File::), or on the command line.
138 * Basic Startup Options::
139 * Logging and Input File Options::
141 * Directory Options::
144 * Recursive Retrieval Options::
145 * Recursive Accept/Reject Options::
148 File: wget.info, Node: URL Format, Next: Option Syntax, Prev: Invoking, Up: Invoking
153 "URL" is an acronym for Uniform Resource Locator. A uniform
154 resource locator is a compact string representation for a resource
155 available via the Internet. Wget recognizes the URL syntax as per
156 RFC1738. This is the most widely used form (square brackets denote
the optional parts):
159 http://host[:port]/directory/file
160 ftp://host[:port]/directory/file
162 You can also encode your username and password within a URL:
164 ftp://user:password@host/path
165 http://user:password@host/path
167 Either USER or PASSWORD, or both, may be left out. If you leave out
168 either the HTTP username or password, no authentication will be sent.
169 If you leave out the FTP username, `anonymous' will be used. If you
170 leave out the FTP password, your email address will be supplied as a
default password.
173 You can encode unsafe characters in a URL as `%xy', `xy' being the
174 hexadecimal representation of the character's ASCII value. Some common
175 unsafe characters include `%' (quoted as `%25'), `:' (quoted as `%3A'),
176 and `@' (quoted as `%40'). Refer to RFC1738 for a comprehensive list
177 of unsafe characters.
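The `%xy' values are simply hexadecimal ASCII codes, which you can
verify from the shell. The snippet below is an illustration, not part
of Wget; it relies on POSIX `printf' treating a leading-quote argument
as the character's numeric code:

```shell
# Illustration only (not Wget): a leading quote makes POSIX printf
# print the ASCII value of the next character; %02X gives the xy part.
at=$(printf '%%%02X' "'@")      # the escaping of `@'
colon=$(printf '%%%02X' "':")   # the escaping of `:'
echo "$at $colon"               # prints: %40 %3A
```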
179 Wget also supports the `type' feature for FTP URLs. By default, FTP
180 documents are retrieved in the binary mode (type `i'), which means that
181 they are downloaded unchanged. Another useful mode is the `a'
182 ("ASCII") mode, which converts the line delimiters between the
183 different operating systems, and is thus useful for text files. Here
is an example:
186 ftp://host/directory/file;type=a
188 Two alternative variants of URL specification are also supported,
189 because of historical (hysterical?) reasons and their widespread use.
191 FTP-only syntax (supported by `NcFTP'):
194 HTTP-only syntax (introduced by `Netscape'):
197 These two alternative forms are deprecated, and may cease being
198 supported in the future.
200 If you do not understand the difference between these notations, or
201 do not know which one to use, just use the plain ordinary format you use
202 with your favorite browser, like `Lynx' or `Netscape'.
204 ---------- Footnotes ----------
206 (1) If you have a `.netrc' file in your home directory, password
207 will also be searched for there.
210 File: wget.info, Node: Option Syntax, Next: Basic Startup Options, Prev: URL Format, Up: Invoking
215 Since Wget uses GNU getopt to process its arguments, every option
216 has a short form and a long form. Long options are more convenient to
217 remember, but take time to type. You may freely mix different option
218 styles, or specify options after the command-line arguments. Thus you
221 wget -r --tries=10 http://fly.cc.fer.hr/ -o log
223 The space between the option accepting an argument and the argument
224 may be omitted. Instead of `-o log' you can write `-olog'.
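This attached-argument rule comes from standard getopt-style parsing.
A minimal sketch with POSIX `getopts' (Wget's actual parser is GNU
getopt, so take this only as an illustration of the rule):

```shell
# Sketch only: an option's argument may be attached or separate,
# exactly as with `-olog' versus `-o log'.
logfile_for() {
  OPTIND=1
  while getopts 'o:' opt "$@"; do
    [ "$opt" = o ] && printf '%s\n' "$OPTARG"
  done
}
logfile_for -olog     # prints: log
logfile_for -o log    # prints: log
```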
226 You may put several options that do not require arguments together,
231 This is completely equivalent to:
235 Since the options can be specified after the arguments, you may
236 terminate them with `--'. So the following will try to download URL
237 `-x', reporting failure to `log':
241 The options that accept comma-separated lists all respect the
242 convention that specifying an empty list clears its value. This can be
243 useful to clear the `.wgetrc' settings. For instance, if your `.wgetrc'
244 sets `exclude_directories' to `/cgi-bin', the following example will
245 first reset it, and then set it to exclude `/~nobody' and `/~somebody'.
246 You can also clear the lists in `.wgetrc' (*Note Wgetrc Syntax::).
248 wget -X '' -X /~nobody,/~somebody
251 File: wget.info, Node: Basic Startup Options, Next: Logging and Input File Options, Prev: Option Syntax, Up: Invoking
253 Basic Startup Options
254 =====================
258 Display the version of Wget.
262 Print a help message describing all of Wget's command-line options.
266 Go to background immediately after startup. If no output file is
267 specified via `-o', output is redirected to `wget-log'.
271 Execute COMMAND as if it were a part of `.wgetrc' (*Note Startup
272 File::). A command thus invoked will be executed *after* the
273 commands in `.wgetrc', thus taking precedence over them.
276 File: wget.info, Node: Logging and Input File Options, Next: Download Options, Prev: Basic Startup Options, Up: Invoking
278 Logging and Input File Options
279 ==============================
282 `--output-file=LOGFILE'
283 Log all messages to LOGFILE. The messages are normally reported
to standard error.
287 `--append-output=LOGFILE'
288 Append to LOGFILE. This is the same as `-o', only it appends to
289 LOGFILE instead of overwriting the old log file. If LOGFILE does
290 not exist, a new file is created.
294 Turn on debug output, meaning various information important to the
295 developers of Wget if it does not work properly. Your system
296 administrator may have chosen to compile Wget without debug
297 support, in which case `-d' will not work. Please note that
298 compiling with debug support is always safe--Wget compiled with
299 the debug support will *not* print any debug info unless requested
300 with `-d'. *Note Reporting Bugs:: for more information on how to
301 use `-d' for sending bug reports.
305 Turn off Wget's output.
309 Turn on verbose output, with all the available data. The default
314 Non-verbose output--turn off verbose without being completely quiet
315 (use `-q' for that), which means that error messages and basic
316 information still get printed.
320 Read URLs from FILE, in which case no URLs need to be on the
321 command line. If there are URLs both on the command line and in
322 an input file, those on the command line will be the first ones to
323 be retrieved. The FILE need not be an HTML document (but no harm
324 if it is)--it is enough if the URLs are just listed sequentially.
326 However, if you specify `--force-html', the document will be
327 regarded as `html'. In that case you may have problems with
328 relative links, which you can solve either by adding `<base
329 href="URL">' to the documents or by specifying `--base=URL' on the
334 When input is read from a file, force it to be treated as an HTML
335 file. This enables you to retrieve relative links from existing
336 HTML files on your local disk, by adding `<base href="URL">' to
337 HTML, or using the `--base' command-line option.
341 When used in conjunction with `-F', prepends URL to relative links
342 in the file specified by `-i'.
345 File: wget.info, Node: Download Options, Next: Directory Options, Prev: Logging and Input File Options, Up: Invoking
352 Set number of retries to NUMBER. Specify 0 or `inf' for infinite
retrying.
356 `--output-document=FILE'
357 The documents will not be written to the appropriate files, but
358 all will be concatenated together and written to FILE. If FILE
359 already exists, it will be overwritten. If the FILE is `-', the
360 documents will be written to standard output. Including this
361 option automatically sets the number of tries to 1.
365 If a file is downloaded more than once in the same directory,
366 wget's behavior depends on a few options, including `-nc'. In
367 certain cases, the local file will be "clobbered", or overwritten,
368 upon repeated download. In other cases it will be preserved.
370 When running wget without `-N', `-nc', or `-r', downloading the
371 same file in the same directory will result in the original copy
372 of `FILE' being preserved and the second copy being named
373 `FILE.1'. If that file is downloaded yet again, the third copy
374 will be named `FILE.2', and so on. When `-nc' is specified, this
375 behavior is suppressed, and wget will refuse to download newer
376 copies of `FILE'. Therefore, "no-clobber" is actually a misnomer
377 in this mode--it's not clobbering that's prevented (as the
378 numeric suffixes were already preventing clobbering), but rather
379 the multiple version saving that's prevented.
381 When running wget with `-r', but without `-N' or `-nc',
382 re-downloading a file will result in the new copy simply
383 overwriting the old. Adding `-nc' will prevent this behavior,
384 instead causing the original version to be preserved and any newer
385 copies on the server to be ignored.
387 When running wget with `-N', with or without `-r', the decision as
388 to whether or not to download a newer copy of a file depends on
389 the local and remote timestamp and size of the file (*Note
390 Time-Stamping::). `-nc' may not be specified at the same time as
393 Note that when `-nc' is specified, files with the suffixes `.html'
394 or (yuck) `.htm' will be loaded from the local disk and parsed as
395 if they had been retrieved from the Web.
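The numbered-suffix naming described above can be sketched in shell;
`next_name' below is a hypothetical helper illustrating the scheme,
not Wget's own code:

```shell
# Sketch (not Wget's code): pick the name a repeated download of
# FILE would get under the default numbered-suffix scheme.
next_name() {
  name=$1
  [ -e "$name" ] || { printf '%s\n' "$name"; return; }
  n=1
  while [ -e "$name.$n" ]; do n=$((n + 1)); done
  printf '%s.%s\n' "$name" "$n"
}
cd "$(mktemp -d)"
touch index.html index.html.1
next_name index.html    # prints: index.html.2
```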
399 Continue getting an existing file. This is useful when you want to
400 finish up the download started by another program, or a previous
401 instance of Wget. Thus you can write:
403 wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z
405 If there is a file named `ls-lR.Z' in the current directory, Wget
406 will assume that it is the first portion of the remote file, and
407 will require the server to continue the retrieval from an offset
408 equal to the length of the local file.
410 Note that you need not specify this option if all you want is Wget
411 to continue retrieving where it left off when the connection is
412 lost--Wget does this by default. You need this option only when
413 you want to continue retrieval of a file already halfway
414 retrieved, saved by another FTP client, or left by Wget being
interrupted.
417 Without `-c', the previous example would just begin to download the
418 remote file to `ls-lR.Z.1'. The `-c' option is also applicable
419 for HTTP servers that support the `Range' header.
421 Note that if you use `-c' on a file that's already downloaded
422 completely, `FILE' will not be changed, nor will a second `FILE.1'
copy be created.
426 Set the retrieval style to STYLE. Wget traces the retrieval of
427 each document by printing dots on the screen, each dot
428 representing a fixed amount of retrieved data. The dots may be
429 grouped into "clusters", to make counting easier. This
430 option allows you to choose one of the pre-defined styles,
431 determining the number of bytes represented by a dot, the number
432 of dots in a cluster, and the number of dots on the line.
434 With the `default' style each dot represents 1K, there are ten dots
435 in a cluster and 50 dots in a line. The `binary' style has a more
436 "computer"-like orientation--8K dots, 16-dots clusters and 48 dots
437 per line (which makes for 384K lines). The `mega' style is
438 suitable for downloading very large files--each dot represents 64K
439 retrieved, there are eight dots in a cluster, and 48 dots on each
440 line (so each line contains 3M). The `micro' style is exactly the
441 reverse; it is suitable for downloading small files, with 128-byte
442 dots, 8 dots per cluster, and 48 dots (6K) per line.
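The per-line totals quoted above follow from dot size times dots per
line; a quick arithmetic check (with K meaning 1024 bytes):

```shell
# Check the per-line totals for each dot style (values in KB,
# except `micro', which is in bytes).
default_line=$((1 * 50))    # 1K dots, 50 per line  -> 50K
binary_line=$((8 * 48))     # 8K dots, 48 per line  -> 384K
mega_line=$((64 * 48))      # 64K dots, 48 per line -> 3072K = 3M
micro_line=$((128 * 48))    # 128-byte dots, 48 per line -> 6144 = 6K
echo "${default_line}K ${binary_line}K ${mega_line}K $micro_line bytes"
```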
446 Turn on time-stamping. *Note Time-Stamping:: for details.
450 Print the headers sent by HTTP servers and responses sent by FTP
servers.
454 When invoked with this option, Wget will behave as a Web "spider",
455 which means that it will not download the pages, just check that
456 they are there. You can use it to check your bookmarks, e.g. with:
458 wget --spider --force-html -i bookmarks.html
460 This feature needs much more work for Wget to get close to the
461 functionality of real WWW spiders.
465 Set the read timeout to SECONDS seconds. Whenever a network read
466 is issued, the file descriptor is checked for a timeout, which
467 could otherwise leave a pending connection (uninterrupted read).
468 The default timeout is 900 seconds (fifteen minutes). Setting
469 timeout to 0 will disable checking for timeouts.
471 Please do not lower the default timeout value with this option
472 unless you know what you are doing.
476 Wait the specified number of seconds between the retrievals. Use
477 of this option is recommended, as it lightens the server load by
478 making the requests less frequent. Instead of in seconds, the
479 time can be specified in minutes using the `m' suffix, in hours
480 using the `h' suffix, or in days using the `d' suffix.
482 Specifying a large value for this option is useful if the network
483 or the destination host is down, so that Wget can wait long enough
484 to reasonably expect the network error to be fixed before the
retry.
487 `--waitretry=SECONDS'
488 If you don't want Wget to wait between *every* retrieval, but only
489 between retries of failed downloads, you can use this option.
490 Wget will use "linear backoff", waiting 1 second after the first
491 failure on a given file, then waiting 2 seconds after the second
492 failure on that file, up to the maximum number of SECONDS you
493 specify. Therefore, a value of 10 will actually make Wget wait up
494 to (1 + 2 + ... + 10) = 55 seconds per file.
496 Note that this option is turned on by default in the global
`wgetrc' file.
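The worst-case wait quoted above is just the sum 1 + 2 + ... +
SECONDS; checking it for `--waitretry=10':

```shell
# Linear backoff: total worst-case wait for --waitretry=10.
total=0
s=1
while [ "$s" -le 10 ]; do
  total=$((total + s))
  s=$((s + 1))
done
echo "$total"    # prints: 55
```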
501 Turn proxy support on or off. The proxy is on by default if the
502 appropriate environment variable is defined.
506 Specify download quota for automatic retrievals. The value can be
507 specified in bytes (default), kilobytes (with `k' suffix), or
508 megabytes (with `m' suffix).
510 Note that quota will never affect downloading a single file. So
511 if you specify `wget -Q10k ftp://wuarchive.wustl.edu/ls-lR.gz',
512 all of the `ls-lR.gz' will be downloaded. The same goes even when
513 several URLs are specified on the command-line. However, quota is
514 respected when retrieving either recursively, or from an input
515 file. Thus you may safely type `wget -Q2m -i sites'--download
516 will be aborted when the quota is exceeded.
518 Setting quota to 0 or to `inf' unlimits the download quota.
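The suffix arithmetic can be sketched as follows; `quota_bytes' is a
hypothetical helper, assuming binary (1024-based) units:

```shell
# Sketch: expand a quota spec like 10k or 2m into bytes,
# assuming 1024-based units.
quota_bytes() {
  case $1 in
    *k) echo $(( ${1%k} * 1024 )) ;;
    *m) echo $(( ${1%m} * 1024 * 1024 )) ;;
    *)  echo "$1" ;;
  esac
}
quota_bytes 10k   # prints: 10240
quota_bytes 2m    # prints: 2097152
```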
521 File: wget.info, Node: Directory Options, Next: HTTP Options, Prev: Download Options, Up: Invoking
528 Do not create a hierarchy of directories when retrieving
529 recursively. With this option turned on, all files will get saved
530 to the current directory, without clobbering (if a name shows up
531 more than once, the filenames will get extensions `.n').
534 `--force-directories'
535 The opposite of `-nd'--create a hierarchy of directories, even if
536 one would not have been created otherwise. E.g. `wget -x
537 http://fly.cc.fer.hr/robots.txt' will save the downloaded file to
538 `fly.cc.fer.hr/robots.txt'.
541 `--no-host-directories'
542 Disable generation of host-prefixed directories. By default,
543 invoking Wget with `-r http://fly.cc.fer.hr/' will create a
544 structure of directories beginning with `fly.cc.fer.hr/'. This
545 option disables such behavior.
548 Ignore NUMBER directory components. This is useful for fine-grained
549 control over the directory where the recursive retrieval is saved.
552 Take, for example, the directory at
553 `ftp://ftp.xemacs.org/pub/xemacs/'. If you retrieve it with `-r',
554 it will be saved locally under `ftp.xemacs.org/pub/xemacs/'.
555 While the `-nH' option can remove the `ftp.xemacs.org/' part, you
556 are still stuck with `pub/xemacs'. This is where `--cut-dirs'
557 comes in handy; it makes Wget not "see" NUMBER remote directory
558 components. Here are several examples of how the `--cut-dirs'
option works.
561 No options        -> ftp.xemacs.org/pub/xemacs/
562 -nH               -> pub/xemacs/
563 -nH --cut-dirs=1  -> xemacs/
564 -nH --cut-dirs=2  -> .
566 --cut-dirs=1      -> ftp.xemacs.org/xemacs/
569 If you just want to get rid of the directory structure, this
570 option is similar to a combination of `-nd' and `-P'. However,
571 unlike `-nd', `--cut-dirs' does not lose the subdirectories--for
572 instance, with `-nH --cut-dirs=1', a `beta/' subdirectory will be
573 placed in `xemacs/beta', as one would expect.
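The component stripping can be imitated with shell parameter
expansion; `cut_dirs' below illustrates the rule and is not Wget's
implementation:

```shell
# Sketch: drop NUMBER leading directory components from a remote
# path, as --cut-dirs does (after -nH has removed the host part).
cut_dirs() {
  path=$1 n=$2
  while [ "$n" -gt 0 ] && [ "${path#*/}" != "$path" ]; do
    path=${path#*/}
    n=$((n - 1))
  done
  printf '%s\n' "$path"
}
cut_dirs pub/xemacs/beta/foo.tar.gz 1   # prints: xemacs/beta/foo.tar.gz
cut_dirs pub/xemacs/beta/foo.tar.gz 2   # prints: beta/foo.tar.gz
```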
576 `--directory-prefix=PREFIX'
577 Set directory prefix to PREFIX. The "directory prefix" is the
578 directory where all other files and subdirectories will be saved
579 to, i.e. the top of the retrieval tree. The default is `.' (the
current directory).
583 File: wget.info, Node: HTTP Options, Next: FTP Options, Prev: Directory Options, Up: Invoking
590 If a file of type `text/html' is downloaded and the URL does not
591 end with the regexp "\.[Hh][Tt][Mm][Ll]?", this option will cause
592 the suffix `.html' to be appended to the local filename. This is
593 useful, for instance, when you're mirroring a remote site
594 that uses `.asp' pages, but you want the mirrored pages to be
595 viewable on your stock Apache server. Another good use for this
596 is when you're downloading the output of CGIs. A URL like
597 `http://site.com/article.cgi?25' will be saved as
598 `article.cgi?25.html'.
600 Note that filenames changed in this way will be re-downloaded
601 every time you re-mirror a site, because wget can't tell that the
602 local `X.html' file corresponds to remote URL `X' (since it
603 doesn't yet know that the URL produces output of type `text/html').
604 To prevent this re-downloading, you must use `-k' and `-K' so
605 that the original version of the file will be saved as `X.orig'
606 (*Note Recursive Retrieval Options::).
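The renaming rule can be sketched as follows; `html_name' is a
hypothetical helper mirroring the regexp quoted above:

```shell
# Sketch of the `-E' rule: append `.html' unless the name already
# ends in .htm or .html (case-insensitive, per "\.[Hh][Tt][Mm][Ll]?").
html_name() {
  case $1 in
    *.[Hh][Tt][Mm] | *.[Hh][Tt][Mm][Ll]) printf '%s\n' "$1" ;;
    *) printf '%s.html\n' "$1" ;;
  esac
}
html_name 'article.cgi?25'   # prints: article.cgi?25.html
html_name index.HTML         # prints: index.HTML
```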
609 `--http-passwd=PASSWORD'
610 Specify the username USER and password PASSWORD on an HTTP server.
611 According to the type of the challenge, Wget will encode them
612 using either the `basic' (insecure) or the `digest' authentication
scheme.
615 Another way to specify username and password is in the URL itself
616 (*Note URL Format::). For more information about security issues
617 with Wget, *Note Security Considerations::.
621 When set to off, disable server-side caching. In this case, Wget
622 will send the remote server an appropriate directive (`Pragma:
623 no-cache') to get the file from the remote service, rather than
624 returning the cached version. This is especially useful for
625 retrieving and flushing out-of-date documents on proxy servers.
627 Caching is allowed by default.
630 Unfortunately, some HTTP servers (CGI programs, to be more
631 precise) send out bogus `Content-Length' headers, which makes Wget
632 go wild, as it thinks not all the document was retrieved. You can
633 spot this syndrome if Wget retries getting the same document again
634 and again, each time claiming that the (otherwise normal)
635 connection has closed on the very same byte.
637 With this option, Wget will ignore the `Content-Length' header--as
if it never existed.
640 `--header=ADDITIONAL-HEADER'
641 Define an ADDITIONAL-HEADER to be passed to the HTTP servers.
642 Headers must contain a `:' preceded by one or more non-blank
643 characters, and must not contain newlines.
645 You may define more than one additional header by specifying
646 `--header' more than once.
648 wget --header='Accept-Charset: iso-8859-2' \
649 --header='Accept-Language: hr' \
650 http://fly.cc.fer.hr/
652 Specification of an empty string as the header value will clear all
653 previous user-defined headers.
656 `--proxy-passwd=PASSWORD'
657 Specify the username USER and password PASSWORD for authentication
658 on a proxy server. Wget will encode them using the `basic'
659 authentication scheme.
662 Include `Referer: URL' header in HTTP request. Useful for
663 retrieving documents with server-side processing that assumes they
664 are always being retrieved by interactive web browsers and only
665 come out properly when Referer is set to one of the pages that
point to them.
670 Save the headers sent by the HTTP server to the file, preceding the
671 actual contents, with an empty line as the separator.
674 `--user-agent=AGENT-STRING'
675 Identify as AGENT-STRING to the HTTP server.
677 The HTTP protocol allows the clients to identify themselves using a
678 `User-Agent' header field. This enables distinguishing the WWW
679 software, usually for statistical purposes or for tracing of
680 protocol violations. Wget normally identifies as `Wget/VERSION',
681 VERSION being the current version number of Wget.
683 However, some sites have been known to impose the policy of
684 tailoring the output according to the `User-Agent'-supplied
685 information. While conceptually this is not such a bad idea, it
686 has been abused by servers denying information to clients other
687 than `Mozilla' or Microsoft `Internet Explorer'. This option
688 allows you to change the `User-Agent' line issued by Wget. Use of
689 this option is discouraged, unless you really know what you are
doing.
692 *NOTE* that Netscape Communications Corp. has claimed that false
693 transmissions of `Mozilla' as the `User-Agent' are a copyright
694 infringement, which will be prosecuted. *DO NOT* misrepresent
Wget as Mozilla.
698 File: wget.info, Node: FTP Options, Next: Recursive Retrieval Options, Prev: HTTP Options, Up: Invoking
704 Usually, when retrieving FTP directories recursively and a symbolic
705 link is encountered, the linked-to file is not downloaded.
706 Instead, a matching symbolic link is created on the local
707 filesystem. The pointed-to file will not be downloaded unless
708 this recursive retrieval would have encountered it separately and
709 downloaded it anyway.
711 When `--retr-symlinks' is specified, however, symbolic links are
712 traversed and the pointed-to files are retrieved. At this time,
713 this option does not cause wget to traverse symlinks to
714 directories and recurse through them, but in the future it should
715 be enhanced to do this.
717 Note that when retrieving a file (not a directory) because it was
718 specified on the commandline, rather than because it was recursed
719 to, this option has no effect. Symbolic links are always
720 traversed in this case.
724 Turn FTP globbing on or off. Globbing means you may use the
725 shell-like special characters ("wildcards"), like `*', `?', `['
726 and `]' to retrieve more than one file from the same directory at
729 wget ftp://gnjilux.cc.fer.hr/*.msg
731 By default, globbing will be turned on if the URL contains a
732 globbing character. This option may be used to turn globbing on
or off permanently.
735 You may have to quote the URL to protect it from being expanded by
736 your shell. Globbing makes Wget look for a directory listing,
737 which is system-specific. This is why it currently works only
738 with Unix FTP servers (and the ones emulating Unix `ls' output).
741 Use the "passive" FTP retrieval scheme, in which the client
742 initiates the data connection. This is sometimes required for FTP
743 to work behind firewalls.
746 File: wget.info, Node: Recursive Retrieval Options, Next: Recursive Accept/Reject Options, Prev: FTP Options, Up: Invoking
748 Recursive Retrieval Options
749 ===========================
753 Turn on recursive retrieving. *Note Recursive Retrieval:: for more
details.
758 Specify recursion maximum depth level DEPTH (*Note Recursive
759 Retrieval::). The default maximum depth is 5.
762 This option tells Wget to delete every single file it downloads,
763 *after* having done so. It is useful for pre-fetching popular
764 pages through proxy, e.g.:
766 wget -r -nd --delete-after http://whatever.com/~popular/page/
768 The `-r' option is to retrieve recursively, and `-nd' not to
create directories.
773 Convert the non-relative links to relative ones locally. Only the
774 references to the documents actually downloaded will be converted;
775 the rest will be left unchanged.
777 Note that only at the end of the download can Wget know which
778 links have been downloaded. Because of that, much of the work
779 done by `-k' will be performed at the end of the downloads.
783 When converting a file, back up the original version with a `.orig'
784 suffix. Affects the behavior of `-N' (*Note HTTP Time-Stamping
Internals::).
789 Turn on options suitable for mirroring. This option turns on
790 recursion and time-stamping, sets infinite recursion depth and
791 keeps FTP directory listings. It is currently equivalent to `-r
-N -l inf -nr'.
795 `--dont-remove-listing'
796 Don't remove the temporary `.listing' files generated by FTP
797 retrievals. Normally, these files contain the raw directory
798 listings received from FTP servers. Not removing them can be
799 useful to access the full remote file list when running a mirror,
800 or for debugging purposes.
804 This option causes wget to download all the files that are
805 necessary to properly display a given HTML page. This includes
806 such things as inlined images, sounds, and referenced stylesheets.
808 Ordinarily, when downloading a single HTML page, any requisite
809 documents that may be needed to display it properly are not
810 downloaded. Using `-r' together with `-l' can help, but since
811 wget does not ordinarily distinguish between external and inlined
812 documents, one is generally left with "leaf documents" that are
813 missing their requisites.
815 For instance, say document `1.html' contains an `<IMG>' tag
816 referencing `1.gif' and an `<A>' tag pointing to external document
817 `2.html'. Say that `2.html' is the same but that its image is
818 `2.gif' and it links to `3.html'. Say this continues up to some
819 arbitrarily high number.
821 If one executes the command:
823 wget -r -l 2 http://SITE/1.html
825 then `1.html', `1.gif', `2.html', `2.gif', and `3.html' will be
826 downloaded. As you can see, `3.html' is without its requisite
827 `3.gif' because wget is simply counting the number of hops (up to
828 2) away from `1.html' in order to determine where to stop the
829 recursion. However, with this command:
831 wget -r -l 2 -p http://SITE/1.html
833 all the above files *and* `3.html''s requisite `3.gif' will be
834 downloaded. Similarly,
836 wget -r -l 1 -p http://SITE/1.html
838 will cause `1.html', `1.gif', `2.html', and `2.gif' to be
839 downloaded. One might think that:
841 wget -r -l 0 -p http://SITE/1.html
843 would download just `1.html' and `1.gif', but unfortunately this
844 is not the case, because `-l 0' is equivalent to `-l inf' - that
845 is, infinite recursion. To download a single HTML page (or a
846 handful of them, all specified on the commandline or in a `-i' URL
847 input file) and its requisites, simply leave off `-p' and `-l':
849 wget -p http://SITE/1.html
851 Note that wget will behave as if `-r' had been specified, but only
852 that single page and its requisites will be downloaded. Links
853 from that page to external documents will not be followed.
854 Actually, to download a single page and all its requisites (even
855 if they exist on separate websites), and make sure the lot
856 displays properly locally, this author likes to use a few options
in addition to `-p':
859 wget -H -k -K -nh -p http://SITE/DOCUMENT
861 To finish off this topic, it's worth knowing that wget's idea of an
862 external document link is any URL specified in an `<A>' tag, an
863 `<AREA>' tag, or a `<LINK>' tag other than `<LINK
REL="stylesheet">'.

File: wget.info, Node: Recursive Accept/Reject Options, Prev: Recursive Retrieval Options, Up: Invoking

Recursive Accept/Reject Options
===============================
`-A ACCLIST --accept ACCLIST'
`-R REJLIST --reject REJLIST'
     Specify comma-separated lists of file name suffixes or patterns to
     accept or reject (*Note Types of Files:: for more details).
`-D DOMAIN-LIST'
`--domains=DOMAIN-LIST'
     Set domains to be accepted and DNS looked-up, where DOMAIN-LIST is
     a comma-separated list.  Note that it does *not* turn on `-H'.
     This option speeds things up, even if only one host is spanned
     (*Note Domain Acceptance::).
`--exclude-domains DOMAIN-LIST'
     Exclude the domains given in a comma-separated DOMAIN-LIST from
     DNS-lookup (*Note Domain Acceptance::).
`--follow-ftp'
     Follow FTP links from HTML documents.  Without this option, Wget
     will ignore all the FTP links.
`--follow-tags=LIST'
     Wget has an internal table of HTML tag / attribute pairs that it
     considers when looking for linked documents during a recursive
     retrieval.  If a user wants only a subset of those tags to be
     considered, however, he or she should specify such tags in a
     comma-separated LIST with this option.
`-G LIST'
`--ignore-tags=LIST'
     This is the opposite of the `--follow-tags' option.  To skip
     certain HTML tags when recursively looking for documents to
     download, specify them in a comma-separated LIST.
     In the past, the `-G' option was the best bet for downloading a
     single page and its requisites, using a command line like:

          wget -Ga,area -H -k -K -nh -r http://SITE/DOCUMENT

     However, the author of this option came across a page with tags
     like `<LINK REL="home" HREF="/">' and came to the realization that
     `-G' was not enough.  One can't just tell Wget to ignore `<LINK>',
     because then stylesheets will not be downloaded.  Now the best bet
     for downloading a single page and its requisites is the dedicated
     `--page-requisites' option.
`-H'
`--span-hosts'
     Enable spanning across hosts when doing recursive retrieval
     (*Note All Hosts::).
`-L'
`--relative'
     Follow relative links only.  Useful for retrieving a specific home
     page without any distractions, not even those from the same host
     (*Note Relative Links::).
`-I LIST'
`--include-directories=LIST'
     Specify a comma-separated list of directories you wish to follow
     when downloading (*Note Directory-Based Limits:: for more
     details).  Elements of LIST may contain wildcards.
`-X LIST'
`--exclude-directories=LIST'
     Specify a comma-separated list of directories you wish to exclude
     from download (*Note Directory-Based Limits:: for more details).
     Elements of LIST may contain wildcards.
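     For instance, to mirror only two subtrees of a server while
     skipping everything else (the host and directory names here are
     only placeholders, in the manner of SITE above), one might
     combine `-r' with `-I':

          wget -r -I /pub/music,/pub/sounds ftp://SITE/

     Files outside the listed directories will not be retrieved, even
     if links point to them.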
`-nh'
     Disable the time-consuming DNS lookup of almost all hosts (*Note
     Host Checking::).
`-np'
`--no-parent'
     Do not ever ascend to the parent directory when retrieving
     recursively.  This is a useful option, since it guarantees that
     only the files *below* a certain hierarchy will be downloaded.
     *Note Directory-Based Limits:: for more details.
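     As a sketch, again using the manual's SITE placeholder: to
     retrieve everything under `/dir/' without ever wandering up to
     `/', one might run:

          wget -r -np http://SITE/dir/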

File: wget.info, Node: Recursive Retrieval, Next: Following Links, Prev: Invoking, Up: Top

Recursive Retrieval
*******************
GNU Wget is capable of traversing parts of the Web (or a single HTTP
or FTP server), depth-first following links and directory structure.
This is called "recursive" retrieving, or "recursion".

With HTTP URLs, Wget retrieves and parses the HTML document at the
given URL, retrieving the files the document refers to through markup
such as `href' or `src'.  If a freshly downloaded file is also of type
`text/html', it will be parsed and followed further.
The maximum "depth" to which the retrieval may descend is specified
with the `-l' option (the default maximum depth is five layers).  *Note
Recursive Retrieval::.
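For instance, to stop the recursion three hops away from the starting
page (SITE again being a placeholder host):

     wget -r -l 3 http://SITE/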
When retrieving an FTP URL recursively, Wget will retrieve all the
data from the given directory tree (including the subdirectories up to
the specified depth) on the remote server, creating its mirror image
locally.  FTP retrieval is also limited by the `depth' parameter.

By default, Wget will create a local directory tree corresponding to
the one found on the remote server.
Recursive retrieving can find a number of applications, the most
important of which is mirroring.  It is also useful for WWW
presentations, and any other opportunities where slow network
connections should be bypassed by storing the files locally.

You should be warned that invoking recursion may cause grave
overloading on your system, because of the fast exchange of data
through the network; all of this may hamper other users' work.  The
same stands for the foreign server you are mirroring--the more
requests it gets in a row, the greater its load.

Careless retrieving can also fill your file system uncontrollably,
which can grind the machine to a halt.
The load can be minimized by lowering the maximum recursion level
(`-l') and/or by lowering the number of retries (`-t').  You may also
consider using the `-w' option to slow down your requests to the remote
servers, as well as the numerous options to narrow the number of
followed links (*Note Following Links::).
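Putting those knobs together, a gentle mirroring run might look like
this (the specific values are only illustrative, and SITE is a
placeholder):

     wget -r -l 2 -t 3 -w 5 http://SITE/

This recurses at most two levels deep, retries each failed retrieval
at most three times, and waits five seconds between requests.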
Recursive retrieval is a good thing when used properly.  Please take
all precautions not to wreak havoc through carelessness.

File: wget.info, Node: Following Links, Next: Time-Stamping, Prev: Recursive Retrieval, Up: Top

Following Links
***************
When retrieving recursively, one does not wish to retrieve loads of
unnecessary data.  Most of the time the users bear in mind exactly what
they want to download, and want Wget to follow only specific links.

For example, if you wish to download the music archive from
`fly.cc.fer.hr', you will not want to download all the home pages that
happen to be referenced by an obscure part of the archive.

Wget possesses several mechanisms that allow you to fine-tune which
links it will follow.
* Menu:

* Relative Links::          Follow relative links only.
* Host Checking::           Follow links on the same host.
* Domain Acceptance::       Check on a list of domains.
* All Hosts::               No host restrictions.
* Types of Files::          Getting only certain files.
* Directory-Based Limits::  Getting only certain directories.
* FTP Links::               Following FTP links.

File: wget.info, Node: Relative Links, Next: Host Checking, Prev: Following Links, Up: Following Links

Relative Links
==============
When only relative links are followed (option `-L'), recursive
retrieving will never span hosts.  No time-expensive DNS-lookups will
be performed, and the process will be very fast, with minimum strain
on the network.  This will often suit your needs, especially when
mirroring the output of various `x2html' converters, since they
generally output relative links.

File: wget.info, Node: Host Checking, Next: Domain Acceptance, Prev: Relative Links, Up: Following Links

Host Checking
=============
The drawback of following only relative links is that humans often
tend to mix them with absolute links to the very same host, and the
very same page.  In this mode (which is the default mode for following
links) all URLs that refer to the same host will be retrieved.

The problem with this option is the aliasing of hosts and domains.
There is no way for Wget to know that `regoc.srce.hr' and
`www.srce.hr' are the same host, or that `fly.cc.fer.hr' is the same as
`fly.cc.etf.hr'.  Whenever an absolute link is encountered, the host is
DNS-looked-up with `gethostbyname' to check whether we are maybe
dealing with the same hosts.  Although the results of `gethostbyname'
are cached, it is still a great slowdown, e.g. when dealing with large
indices of home pages on different hosts (because each of the hosts
must be DNS-resolved to see whether it just *might* be an alias of the
starting host).
To avoid the overhead you may use `-nh', which will turn off
DNS-resolving and make Wget compare hosts literally.  This will make
things run much faster, but also much less reliably (e.g.
`www.srce.hr' and `regoc.srce.hr' will be flagged as different hosts).

Note that modern HTTP servers allow one IP address to host several
"virtual servers", each having its own directory hierarchy.  Such
"servers" are distinguished by their hostnames (all of which point to
the same IP address); for this to work, a client must send a `Host'
header, which is what Wget does.  However, in that case Wget *must not*
try to divine a host's "real" address, nor try to use the same hostname
for each access, i.e. `-nh' must be turned on.

In other words, the `-nh' option must be used to enable the
retrieval from virtual servers distinguished by their hostnames.  As
the number of such server setups grows, the behavior of `-nh' may
become the default in the future.

File: wget.info, Node: Domain Acceptance, Next: All Hosts, Prev: Host Checking, Up: Following Links

Domain Acceptance
=================
With the `-D' option you may specify the domains that will be
followed.  Hosts whose domain is not in this list will not be
DNS-resolved.  Thus you can specify `-Dmit.edu' just to make sure that
*nothing outside of MIT gets looked up*.  This is very important and
useful.  It also means that `-D' does *not* imply `-H' (span all
hosts), which must be specified explicitly.  Feel free to use this
option, since it will speed things up, with almost all the reliability
of checking for all hosts.  Thus you could invoke

     wget -r -D.hr http://fly.cc.fer.hr/

to make sure that only the hosts in the `.hr' domain get DNS-looked-up
for being equal to `fly.cc.fer.hr'.  So `fly.cc.etf.hr' will be checked
(only once!) and found equal, but `www.gnu.ai.mit.edu' will not even be
checked.
Of course, domain acceptance can be used to limit the retrieval to
particular domains with spanning of hosts in them, but then you must
specify `-H' explicitly.  E.g.:

     wget -r -H -Dmit.edu,stanford.edu http://www.mit.edu/

will start with `http://www.mit.edu/', following links across MIT and
Stanford.

If there are domains you want to exclude specifically, you can do it
with `--exclude-domains', which accepts the same type of arguments as
`-D', but will *exclude* all the listed domains.  For example, if you
want to download all the hosts from the `foo.edu' domain, with the
exception of `sunsite.foo.edu', you can do it like this:

     wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu http://www.foo.edu/

File: wget.info, Node: All Hosts, Next: Types of Files, Prev: Domain Acceptance, Up: Following Links

All Hosts
=========
When `-H' is specified without `-D', all hosts are freely spanned.
There are no restrictions whatsoever as to what part of the net Wget
will go to fetch documents, other than the maximum retrieval depth.
If a page references `www.yahoo.com', so be it.  Such an option is
rarely useful for itself.

File: wget.info, Node: Types of Files, Next: Directory-Based Limits, Prev: All Hosts, Up: Following Links

Types of Files
==============
When downloading material from the web, you will often want to
restrict the retrieval to only certain file types.  For example, if
you are interested in downloading GIFs, you will not be overjoyed to
get loads of PostScript documents, and vice versa.

Wget offers two options to deal with this problem.  Each option
description lists a short name, a long name, and the equivalent
command in `.wgetrc'.

`-A ACCLIST'
`--accept ACCLIST'
`accept = ACCLIST'
     The argument to the `--accept' option is a list of file suffixes
     or patterns that Wget will download during recursive retrieval.
     A suffix is the ending part of a file name, and consists of
     "normal" letters, e.g. `gif' or `.jpg'.  A matching pattern
     contains shell-like wildcards, e.g. `books*' or
     `zelazny*196[0-9]*'.

     So, specifying `wget -A gif,jpg' will make Wget download only the
     files ending with `gif' or `jpg', i.e. GIFs and JPEGs.  On the
     other hand, `wget -A "zelazny*196[0-9]*"' will download only files
     beginning with `zelazny' and containing numbers from 1960 to 1969
     anywhere within.  Look up the manual of your shell for a
     description of how pattern matching works.

     Of course, any number of suffixes and patterns can be combined
     into a comma-separated list, and given as an argument to `-A'.
`-R REJLIST'
`--reject REJLIST'
`reject = REJLIST'
     The `--reject' option works the same way as `--accept', only its
     logic is the reverse; Wget will download all files *except* the
     ones matching the suffixes (or patterns) in the list.

     So, if you want to download a whole page except for the cumbersome
     MPEGs and .AU files, you can use `wget -R mpg,mpeg,au'.
     Analogously, to download all files except the ones beginning with
     `bjork', use `wget -R "bjork*"'.  The quotes are to prevent
     expansion by the shell.
The `-A' and `-R' options may be combined to achieve even better
fine-tuning of which files to retrieve.  E.g. `wget -A "*zelazny*" -R
.ps' will download all the files having `zelazny' as a part of their
name, but *not* the PostScript files.

Note that these two options do not affect the downloading of HTML
files; Wget must load all the HTMLs to know where to go at
all--recursive retrieval would make no sense otherwise.