This is Info file wget.info, produced by Makeinfo version 1.67 from the
input file ./wget.texi.

INFO-DIR-SECTION Net Utilities
INFO-DIR-SECTION World Wide Web
START-INFO-DIR-ENTRY
* Wget: (wget).         The non-interactive network downloader.
END-INFO-DIR-ENTRY

   This file documents the GNU Wget utility for downloading network
data.

   Copyright (C) 1996, 1997, 1998 Free Software Foundation, Inc.

   Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided also
that the sections entitled "Copying" and "GNU General Public License"
are included exactly as in the original, and provided that the entire
resulting derived work is distributed under the terms of a permission
notice identical to this one.
File: wget.info, Node: Top, Next: Overview, Prev: (dir), Up: (dir)

Wget 1.5.3
**********

   This manual documents version 1.5.3 of GNU Wget, the freely
available utility for network download.

   Copyright (C) 1996, 1997, 1998 Free Software Foundation, Inc.

* Menu:

* Overview::            Features of Wget.
* Invoking::            Wget command-line arguments.
* Recursive Retrieval:: Description of recursive retrieval.
* Following Links::     The available methods of chasing links.
* Time-Stamping::       Mirroring according to time-stamps.
* Startup File::        Wget's initialization file.
* Examples::            Examples of usage.
* Various::             The stuff that doesn't fit anywhere else.
* Appendices::          Some useful references.
* Copying::             You may give out copies of Wget.
* Concept Index::       Topics covered by this manual.
File: wget.info, Node: Overview, Next: Invoking, Prev: Top, Up: Top

Overview
********

   GNU Wget is a freely available network utility to retrieve files from
the World Wide Web, using HTTP (Hyper Text Transfer Protocol) and FTP
(File Transfer Protocol), the two most widely used Internet protocols.
It has many useful features to make downloading easier, some of them
listed below.

   * Wget is non-interactive, meaning that it can work in the
     background while the user is not logged on.  This allows you to
     start a retrieval and disconnect from the system, letting Wget
     finish the work.  By contrast, most Web browsers require the
     user's constant presence, which can be a great hindrance when
     transferring a lot of data.

   * Wget is capable of descending recursively through the structure of
     HTML documents and FTP directory trees, making a local copy of the
     directory hierarchy similar to the one on the remote server.  This
     feature can be used to mirror archives and home pages, or traverse
     the web in search of data, like a WWW robot (*Note Robots::).  In
     that spirit, Wget understands the `norobots' convention.
   * File name wildcard matching and recursive mirroring of directories
     are available when retrieving via FTP.  Wget can read the
     time-stamp information given by both HTTP and FTP servers, and
     store it locally.  Thus Wget can see if the remote file has
     changed since the last retrieval, and automatically retrieve the
     new version if it has.  This makes Wget suitable for mirroring of
     FTP sites, as well as home pages.

   * Wget works exceedingly well on slow or unstable connections,
     retrying the document until it is fully retrieved, or until a
     user-specified retry count is surpassed.  It will try to resume the
     download from the point of interruption, using `REST' with FTP and
     `Range' with HTTP servers that support them.

   * By default, Wget supports proxy servers, which can lighten the
     network load, speed up retrieval and provide access behind
     firewalls.  However, if you are behind a firewall that requires
     a SOCKS-style gateway, you can get the SOCKS library and build
     Wget with SOCKS support.  Wget also supports passive FTP
     downloading as an option.

   * Built-in features offer mechanisms to tune which links you wish to
     follow (*Note Following Links::).

   * The retrieval is conveniently traced with printing dots, each dot
     representing a fixed amount of data received (1KB by default).
     These representations can be customized to your preferences.

   * Most of the features are fully configurable, either through
     command-line options, or via the initialization file `.wgetrc'
     (*Note Startup File::).  Wget allows you to define "global"
     startup files (`/usr/local/etc/wgetrc' by default) for site
     settings.

   * Finally, GNU Wget is free software.  This means that everyone may
     use it, redistribute it and/or modify it under the terms of the
     GNU General Public License, as published by the Free Software
     Foundation (*Note Copying::).
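
   As a sketch of the startup-file mechanism mentioned above, a global
or per-user `.wgetrc' might contain lines like these.  The command
names are real Wgetrc commands (*Note Startup File::), but the
particular values are only illustrative assumptions:

```
# Hypothetical /usr/local/etc/wgetrc or ~/.wgetrc -- values are examples only.
tries = 10
wait = 2
dot_style = mega
quota = 500m
```

Command-line options always override the corresponding `.wgetrc'
settings for a single run.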
File: wget.info, Node: Invoking, Next: Recursive Retrieval, Prev: Overview, Up: Top

Invoking
********

   By default, Wget is very simple to invoke.  The basic syntax is:

     wget [OPTION]... [URL]...

   Wget will simply download all the URLs specified on the command
line.  URL is a "Uniform Resource Locator", as defined below.

   However, you may wish to change some of the default parameters of
Wget.  You can do it in two ways: permanently, adding the appropriate
command to `.wgetrc' (*Note Startup File::), or specifying it on the
command line.

* Menu:

* URL Format::
* Option Syntax::
* Basic Startup Options::
* Logging and Input File Options::
* Download Options::
* Directory Options::
* HTTP Options::
* FTP Options::
* Recursive Retrieval Options::
* Recursive Accept/Reject Options::
File: wget.info, Node: URL Format, Next: Option Syntax, Prev: Invoking, Up: Invoking

URL Format
==========

   "URL" is an acronym for Uniform Resource Locator.  A uniform
resource locator is a compact string representation for a resource
available via the Internet.  Wget recognizes the URL syntax as per
RFC1738.  This is the most widely used form (square brackets denote
optional parts):

     http://host[:port]/directory/file
     ftp://host[:port]/directory/file

   You can also encode your username and password within a URL:

     ftp://user:password@host/path
     http://user:password@host/path

   Either USER or PASSWORD, or both, may be left out.  If you leave out
either the HTTP username or password, no authentication will be sent.
If you leave out the FTP username, `anonymous' will be used.  If you
leave out the FTP password, your email address will be supplied as a
default password.(1)

   You can encode unsafe characters in a URL as `%xy', `xy' being the
hexadecimal representation of the character's ASCII value.  Some common
unsafe characters include `%' (quoted as `%25'), `:' (quoted as `%3A'),
and `@' (quoted as `%40').  Refer to RFC1738 for a comprehensive list
of unsafe characters.
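
   As a sketch (the account `tom', the password `p@ss:word' and the
host `example.com' are hypothetical), quoting the unsafe `@' and `:'
inside the password yields a URL that can be parsed unambiguously:

```shell
# '@' is ASCII 0x40 and ':' is ASCII 0x3A, so inside the password
# they are quoted as %40 and %3A respectively.
user='tom'
pass='p%40ss%3Aword'    # the literal password is p@ss:word
url="ftp://${user}:${pass}@example.com/path"
echo "$url"
```

Only the final unquoted `@' now separates the credentials from the
host name.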
   Wget also supports the `type' feature for FTP URLs.  By default, FTP
documents are retrieved in the binary mode (type `i'), which means that
they are downloaded unchanged.  Another useful mode is the `a'
("ASCII") mode, which converts the line delimiters between the
different operating systems, and is thus useful for text files.  Here
is an example:

     ftp://host/directory/file;type=a

   Two alternative variants of URL specification are also supported,
because of historical (hysterical?) reasons and their widespread use.

   FTP-only syntax (supported by `NcFTP'):

     host:/dir/file

   HTTP-only syntax (introduced by `Netscape'):

     host/dir/file

   These two alternative forms are deprecated, and may cease being
supported in the future.

   If you do not understand the difference between these notations, or
do not know which one to use, just use the plain ordinary format you use
with your favorite browser, like `Lynx' or `Netscape'.

   ---------- Footnotes ----------

   (1)  If you have a `.netrc' file in your home directory, password
will also be searched for there.
File: wget.info, Node: Option Syntax, Next: Basic Startup Options, Prev: URL Format, Up: Invoking

Option Syntax
=============

   Since Wget uses GNU getopts to process its arguments, every option
has a short form and a long form.  Long options are more convenient to
remember, but take time to type.  You may freely mix different option
styles, or specify options after the command-line arguments.  Thus you
may write:

     wget -r --tries=10 http://fly.cc.fer.hr/ -o log

   The space between the option accepting an argument and the argument
may be omitted.  Instead of `-o log' you can write `-olog'.

   You may put several options that do not require arguments together,
like:

     wget -drc URL

   This is completely equivalent to:

     wget -d -r -c URL

   Since the options can be specified after the arguments, you may
terminate them with `--'.  So the following will try to download the
URL `-x', reporting failure to `log':

     wget -o log -- -x

   The options that accept comma-separated lists all respect the
convention that specifying an empty list clears its value.  This can be
useful to clear the `.wgetrc' settings.  For instance, if your `.wgetrc'
sets `exclude_directories' to `/cgi-bin', the following example will
first reset it, and then set it to exclude `/~nobody' and `/~somebody'.
You can also clear the lists in `.wgetrc' (*Note Wgetrc Syntax::):

     wget -X '' -X /~nobody,/~somebody
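
   The same reset can be sketched in `.wgetrc' itself;
`exclude_directories' is a real Wgetrc command (*Note Wgetrc
Syntax::), and assigning it an empty value clears the inherited list:

```
# Sketch: clear the inherited list, then set a new one.
exclude_directories =
exclude_directories = /~nobody,/~somebody
```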
File: wget.info, Node: Basic Startup Options, Next: Logging and Input File Options, Prev: Option Syntax, Up: Invoking

Basic Startup Options
=====================

`-V'
`--version'
     Display the version of Wget.

`-h'
`--help'
     Print a help message describing all of Wget's command-line options.

`-b'
`--background'
     Go to background immediately after startup.  If no output file is
     specified via `-o', output is redirected to `wget-log'.

`-e COMMAND'
`--execute COMMAND'
     Execute COMMAND as if it were a part of `.wgetrc' (*Note Startup
     File::).  A command thus invoked will be executed *after* the
     commands in `.wgetrc', thus taking precedence over them.
File: wget.info, Node: Logging and Input File Options, Next: Download Options, Prev: Basic Startup Options, Up: Invoking

Logging and Input File Options
==============================

`-o LOGFILE'
`--output-file=LOGFILE'
     Log all messages to LOGFILE.  The messages are normally reported
     to standard error.

`-a LOGFILE'
`--append-output=LOGFILE'
     Append to LOGFILE.  This is the same as `-o', only it appends to
     LOGFILE instead of overwriting the old log file.  If LOGFILE does
     not exist, a new file is created.

`-d'
`--debug'
     Turn on debug output, meaning various information important to the
     developers of Wget if it does not work properly.  Your system
     administrator may have chosen to compile Wget without debug
     support, in which case `-d' will not work.  Please note that
     compiling with debug support is always safe--Wget compiled with
     the debug support will *not* print any debug info unless requested
     with `-d'.  *Note Reporting Bugs:: for more information on how to
     use `-d' for sending bug reports.

`-q'
`--quiet'
     Turn off Wget's output.

`-v'
`--verbose'
     Turn on verbose output, with all the available data.  The default
     output is verbose.

`-nv'
`--non-verbose'
     Non-verbose output--turn off verbose without being completely quiet
     (use `-q' for that), which means that error messages and basic
     information still get printed.

`-i FILE'
`--input-file=FILE'
     Read URLs from FILE, in which case no URLs need to be on the
     command line.  If there are URLs both on the command line and in
     an input file, those on the command line will be the first ones to
     be retrieved.  The FILE need not be an HTML document (but no harm
     if it is)--it is enough if the URLs are just listed sequentially.

     However, if you specify `--force-html', the document will be
     regarded as `html'.  In that case you may have problems with
     relative links, which you can solve either by adding `<base
     href="URL">' to the documents or by specifying `--base=URL' on the
     command line.

`-F'
`--force-html'
     When input is read from a file, force it to be treated as an HTML
     file.  This enables you to retrieve relative links from existing
     HTML files on your local disk, by adding `<base href="URL">' to
     HTML, or using the `--base' command-line option.
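
   A minimal sketch of the `-F'/`-i' combination described above (the
`www.example.com' base URL and the file names are placeholders):
create an input file of relative links, anchor them with `<base>', and
hand it to Wget:

```shell
# Build an input file whose relative links will be resolved against <base>.
cat > links.html <<'EOF'
<base href="http://www.example.com/">
<a href="pics/photo.jpg">photo</a>
<a href="docs/readme.txt">readme</a>
EOF
grep -c '<a href' links.html    # prints 2 (two relative links)
# Then (not run here, as it would hit the network):
#     wget -F -i links.html
```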
File: wget.info, Node: Download Options, Next: Directory Options, Prev: Logging and Input File Options, Up: Invoking

Download Options
================

`-t NUMBER'
`--tries=NUMBER'
     Set number of retries to NUMBER.  Specify 0 or `inf' for infinite
     retrying.

`-O FILE'
`--output-document=FILE'
     The documents will not be written to the appropriate files, but
     all will be concatenated together and written to FILE.  If FILE
     already exists, it will be overwritten.  If the FILE is `-', the
     documents will be written to standard output.  Including this
     option automatically sets the number of tries to 1.

`-nc'
`--no-clobber'
     Do not clobber existing files when saving to a directory hierarchy
     within recursive retrieval of several files.  This option is
     *extremely* useful when you wish to continue where you left off
     with retrieval of many files.  If the files have the `.html' or
     (yuck) `.htm' suffix, they will be loaded from the local disk, and
     parsed as if they had been retrieved from the Web.

`-c'
`--continue'
     Continue getting an existing file.  This is useful when you want to
     finish up a download started by another program, or a previous
     instance of Wget.  Thus you can write:

          wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z

     If there is a file named `ls-lR.Z' in the current directory, Wget
     will assume that it is the first portion of the remote file, and
     will require the server to continue the retrieval from an offset
     equal to the length of the local file.

     Note that you need not specify this option if all you want is Wget
     to continue retrieving where it left off when the connection is
     lost--Wget does this by default.  You need this option only when
     you want to continue retrieval of a file already halfway
     retrieved, saved by another FTP client, or left by Wget being
     interrupted.

     Without `-c', the previous example would just begin to download the
     remote file to `ls-lR.Z.1'.  The `-c' option is also applicable
     for HTTP servers that support the `Range' header.
`--dot-style=STYLE'
     Set the retrieval style to STYLE.  Wget traces the retrieval of
     each document by printing dots on the screen, each dot
     representing a fixed amount of retrieved data.  Any number of dots
     may be separated in a "cluster", to make counting easier.  This
     option allows you to choose one of the pre-defined styles,
     determining the number of bytes represented by a dot, the number
     of dots in a cluster, and the number of dots on the line.

     With the `default' style each dot represents 1K, there are ten dots
     in a cluster and 50 dots in a line.  The `binary' style has a more
     "computer"-like orientation--8K dots, 16-dots clusters and 48 dots
     per line (which makes for 384K lines).  The `mega' style is
     suitable for downloading very large files--each dot represents 64K
     retrieved, there are eight dots in a cluster, and 48 dots on each
     line (so each line contains 3M).  The `micro' style is exactly the
     reverse; it is suitable for downloading small files, with 128-byte
     dots, 8 dots per cluster, and 48 dots (6K) per line.
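
     The per-line totals quoted above can be checked with a little
     shell arithmetic (dot size in bytes times dots per line):

```shell
# default: 1K dots, 50 per line;  binary: 8K dots, 48 per line;
# mega: 64K dots, 48 per line;    micro: 128-byte dots, 48 per line.
echo "default: $((1024 * 50)) bytes per line"    # 51200  = 50K
echo "binary:  $((8192 * 48)) bytes per line"    # 393216 = 384K
echo "mega:    $((65536 * 48)) bytes per line"   # 3145728 = 3M
echo "micro:   $((128 * 48)) bytes per line"     # 6144   = 6K
```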
`-N'
`--timestamping'
     Turn on time-stamping.  *Note Time-Stamping:: for details.

`-S'
`--server-response'
     Print the headers sent by HTTP servers and responses sent by FTP
     servers.

`--spider'
     When invoked with this option, Wget will behave as a Web "spider",
     which means that it will not download the pages, just check that
     they are there.  You can use it to check your bookmarks, e.g. with:

          wget --spider --force-html -i bookmarks.html

     This feature needs much more work for Wget to get close to the
     functionality of real WWW spiders.

`-T SECONDS'
`--timeout=SECONDS'
     Set the read timeout to SECONDS seconds.  Whenever a network read
     is issued, the file descriptor is checked for a timeout, which
     could otherwise leave a pending connection (uninterrupted read).
     The default timeout is 900 seconds (fifteen minutes).  Setting
     timeout to 0 will disable checking for timeouts.

     Please do not lower the default timeout value with this option
     unless you know what you are doing.
`-w SECONDS'
`--wait=SECONDS'
     Wait the specified number of seconds between retrievals.  Use
     of this option is recommended, as it lightens the server load by
     making the requests less frequent.  Instead of in seconds, the
     time can be specified in minutes using the `m' suffix, in hours
     using the `h' suffix, or in days using the `d' suffix.

     Specifying a large value for this option is useful if the network
     or the destination host is down, so that Wget can wait long enough
     to reasonably expect the network error to be fixed before the
     retry.

`-Y on/off'
`--proxy=on/off'
     Turn proxy support on or off.  The proxy is on by default if the
     appropriate environmental variable is defined.

`-Q QUOTA'
`--quota=QUOTA'
     Specify download quota for automatic retrievals.  The value can be
     specified in bytes (default), kilobytes (with the `k' suffix), or
     megabytes (with the `m' suffix).

     Note that quota will never affect downloading a single file.  So
     if you specify `wget -Q10k ftp://wuarchive.wustl.edu/ls-lR.gz',
     all of `ls-lR.gz' will be downloaded.  The same goes even when
     several URLs are specified on the command line.  However, quota is
     respected when retrieving either recursively, or from an input
     file.  Thus you may safely type `wget -Q2m -i sites'--the download
     will be aborted when the quota is exceeded.

     Setting quota to 0 or to `inf' makes the download quota unlimited.
File: wget.info, Node: Directory Options, Next: HTTP Options, Prev: Download Options, Up: Invoking

Directory Options
=================

`-nd'
`--no-directories'
     Do not create a hierarchy of directories when retrieving
     recursively.  With this option turned on, all files will get saved
     to the current directory, without clobbering (if a name shows up
     more than once, the filenames will get extensions `.n').

`-x'
`--force-directories'
     The opposite of `-nd'--create a hierarchy of directories, even if
     one would not have been created otherwise.  E.g. `wget -x
     http://fly.cc.fer.hr/robots.txt' will save the downloaded file to
     `fly.cc.fer.hr/robots.txt'.

`-nH'
`--no-host-directories'
     Disable generation of host-prefixed directories.  By default,
     invoking Wget with `-r http://fly.cc.fer.hr/' will create a
     structure of directories beginning with `fly.cc.fer.hr/'.  This
     option disables such behavior.

`--cut-dirs=NUMBER'
     Ignore NUMBER directory components.  This is useful for getting
     fine-grained control over the directory where recursive retrieval
     will be saved.

     Take, for example, the directory at
     `ftp://ftp.xemacs.org/pub/xemacs/'.  If you retrieve it with `-r',
     it will be saved locally under `ftp.xemacs.org/pub/xemacs/'.
     While the `-nH' option can remove the `ftp.xemacs.org/' part, you
     are still stuck with `pub/xemacs'.  This is where `--cut-dirs'
     comes in handy; it makes Wget not "see" NUMBER remote directory
     components.  Here are several examples of how the `--cut-dirs'
     option works:

          No options        -> ftp.xemacs.org/pub/xemacs/
          -nH               -> pub/xemacs/
          -nH --cut-dirs=1  -> xemacs/
          -nH --cut-dirs=2  -> .

          --cut-dirs=1      -> ftp.xemacs.org/xemacs/

     If you just want to get rid of the directory structure, this
     option is similar to a combination of `-nd' and `-P'.  However,
     unlike `-nd', `--cut-dirs' does not lose subdirectories--for
     instance, with `-nH --cut-dirs=1', a `beta/' subdirectory will be
     placed to `xemacs/beta', as one would expect.

`-P PREFIX'
`--directory-prefix=PREFIX'
     Set directory prefix to PREFIX.  The "directory prefix" is the
     directory where all other files and subdirectories will be saved
     to, i.e. the top of the retrieval tree.  The default is `.' (the
     current directory).
File: wget.info, Node: HTTP Options, Next: FTP Options, Prev: Directory Options, Up: Invoking

HTTP Options
============

`--http-user=USER'
`--http-passwd=PASSWORD'
     Specify the username USER and password PASSWORD on an HTTP server.
     According to the type of the challenge, Wget will encode them
     using either the `basic' (insecure) or the `digest' authentication
     scheme.

     Another way to specify username and password is in the URL itself
     (*Note URL Format::).  For more information about security issues
     with Wget, *Note Security Considerations::.

`-C on/off'
`--cache=on/off'
     When set to off, disable server-side cache.  In this case, Wget
     will send the remote server an appropriate directive (`Pragma:
     no-cache') to get the file from the remote service, rather than
     returning the cached version.  This is especially useful for
     retrieving and flushing out-of-date documents on proxy servers.

     Caching is allowed by default.

`--ignore-length'
     Unfortunately, some HTTP servers (CGI programs, to be more
     precise) send out bogus `Content-Length' headers, which makes Wget
     go wild, as it thinks not all the document was retrieved.  You can
     spot this syndrome if Wget retries getting the same document again
     and again, each time claiming that the (otherwise normal)
     connection has closed on the very same byte.

     With this option, Wget will ignore the `Content-Length' header--as
     if it never existed.

`--header=ADDITIONAL-HEADER'
     Define an ADDITIONAL-HEADER to be passed to the HTTP servers.
     Headers must contain a `:' preceded by one or more non-blank
     characters, and must not contain newlines.

     You may define more than one additional header by specifying
     `--header' more than once.

          wget --header='Accept-Charset: iso-8859-2' \
               --header='Accept-Language: hr'        \
                 http://fly.cc.fer.hr/

     Specification of an empty string as the header value will clear all
     previous user-defined headers.
`--proxy-user=USER'
`--proxy-passwd=PASSWORD'
     Specify the username USER and password PASSWORD for authentication
     on a proxy server.  Wget will encode them using the `basic'
     authentication scheme.

`-s'
`--save-headers'
     Save the headers sent by the HTTP server to the file, preceding the
     actual contents, with an empty line as the separator.

`-U AGENT-STRING'
`--user-agent=AGENT-STRING'
     Identify as AGENT-STRING to the HTTP server.

     The HTTP protocol allows the clients to identify themselves using a
     `User-Agent' header field.  This enables distinguishing the WWW
     software, usually for statistical purposes or for tracing of
     protocol violations.  Wget normally identifies as `Wget/VERSION',
     VERSION being the current version number of Wget.

     However, some sites have been known to impose the policy of
     tailoring the output according to the `User-Agent'-supplied
     information.  While conceptually this is not such a bad idea, it
     has been abused by servers denying information to clients other
     than `Mozilla' or Microsoft `Internet Explorer'.  This option
     allows you to change the `User-Agent' line issued by Wget.  Use of
     this option is discouraged, unless you really know what you are
     doing.

     *NOTE* that Netscape Communications Corp. has claimed that false
     transmissions of `Mozilla' as the `User-Agent' are a copyright
     infringement, which will be prosecuted.  *DO NOT* misrepresent
     Wget as Mozilla.
File: wget.info, Node: FTP Options, Next: Recursive Retrieval Options, Prev: HTTP Options, Up: Invoking

FTP Options
===========

`--retr-symlinks'
     Retrieve symbolic links on FTP sites as if they were plain files,
     i.e. don't just create links locally.

`-g on/off'
`--glob=on/off'
     Turn FTP globbing on or off.  Globbing means you may use the
     shell-like special characters ("wildcards"), like `*', `?', `['
     and `]' to retrieve more than one file from the same directory at
     once, like:

          wget ftp://gnjilux.cc.fer.hr/*.msg

     By default, globbing will be turned on if the URL contains a
     globbing character.  This option may be used to turn globbing on
     or off permanently.

     You may have to quote the URL to protect it from being expanded by
     your shell.  Globbing makes Wget look for a directory listing,
     which is system-specific.  This is why it currently works only
     with Unix FTP servers (and the ones emulating Unix `ls' output).

`--passive-ftp'
     Use the "passive" FTP retrieval scheme, in which the client
     initiates the data connection.  This is sometimes required for FTP
     to work behind firewalls.
File: wget.info, Node: Recursive Retrieval Options, Next: Recursive Accept/Reject Options, Prev: FTP Options, Up: Invoking

Recursive Retrieval Options
===========================

`-r'
`--recursive'
     Turn on recursive retrieving.  *Note Recursive Retrieval:: for more
     details.

`-l DEPTH'
`--level=DEPTH'
     Specify recursion maximum depth level DEPTH (*Note Recursive
     Retrieval::).  The default maximum depth is 5.

`--delete-after'
     This option tells Wget to delete every single file it downloads,
     *after* having done so.  It is useful for pre-fetching popular
     pages through a proxy, e.g.:

          wget -r -nd --delete-after http://whatever.com/~popular/page/

     The `-r' option is to retrieve recursively, and `-nd' not to
     create directories.

`-k'
`--convert-links'
     Convert the non-relative links to relative ones locally.  Only the
     references to the documents actually downloaded will be converted;
     the rest will be left unchanged.

     Note that only at the end of the download can Wget know which
     links have been downloaded.  Because of that, much of the work
     done by `-k' will be performed at the end of the downloads.

`-m'
`--mirror'
     Turn on options suitable for mirroring.  This option turns on
     recursion and time-stamping, sets infinite recursion depth and
     keeps FTP directory listings.  It is currently equivalent to
     `-r -N -l inf -nr'.

`-nr'
`--dont-remove-listing'
     Don't remove the temporary `.listing' files generated by FTP
     retrievals.  Normally, these files contain the raw directory
     listings received from FTP servers.  Not removing them can be
     useful to access the full remote file list when running a mirror,
     or for debugging purposes.
File: wget.info, Node: Recursive Accept/Reject Options, Prev: Recursive Retrieval Options, Up: Invoking

Recursive Accept/Reject Options
===============================

`-A ACCLIST --accept ACCLIST'
`-R REJLIST --reject REJLIST'
     Specify comma-separated lists of file name suffixes or patterns to
     accept or reject (*Note Types of Files:: for more details).

`-D DOMAIN-LIST'
`--domains=DOMAIN-LIST'
     Set domains to be accepted and DNS looked-up, where DOMAIN-LIST is
     a comma-separated list.  Note that it does *not* turn on `-H'.
     This option speeds things up, even if only one host is spanned
     (*Note Domain Acceptance::).

`--exclude-domains DOMAIN-LIST'
     Exclude the domains given in a comma-separated DOMAIN-LIST from
     DNS-lookup (*Note Domain Acceptance::).

`-L'
`--relative'
     Follow relative links only.  Useful for retrieving a specific home
     page without any distractions, not even those from the same host
     (*Note Relative Links::).

`--follow-ftp'
     Follow FTP links from HTML documents.  Without this option, Wget
     will ignore all the FTP links.

`-H'
`--span-hosts'
     Enable spanning across hosts when doing recursive retrieving
     (*Note All Hosts::).

`-I LIST'
`--include-directories=LIST'
     Specify a comma-separated list of directories you wish to follow
     when downloading (*Note Directory-Based Limits:: for more
     details).  Elements of LIST may contain wildcards.

`-X LIST'
`--exclude-directories=LIST'
     Specify a comma-separated list of directories you wish to exclude
     from download (*Note Directory-Based Limits:: for more details).
     Elements of LIST may contain wildcards.

`-nh'
     Disable the time-consuming DNS lookup of almost all hosts (*Note
     Host Checking::).

`-np'
`--no-parent'
     Do not ever ascend to the parent directory when retrieving
     recursively.  This is a useful option, since it guarantees that
     only the files *below* a certain hierarchy will be downloaded.
     *Note Directory-Based Limits:: for more details.
File: wget.info, Node: Recursive Retrieval, Next: Following Links, Prev: Invoking, Up: Top

Recursive Retrieval
*******************

   GNU Wget is capable of traversing parts of the Web (or a single HTTP
or FTP server), depth-first following links and directory structure.
This is called "recursive" retrieving, or "recursion".

   With HTTP URLs, Wget retrieves and parses the HTML from the given
URL, retrieving the files the HTML document refers to, through markup
like `href' or `src'.  If the freshly downloaded file is also of type
`text/html', it will be parsed and followed further.

   The maximum "depth" to which the retrieval may descend is specified
with the `-l' option (the default maximum depth is five layers).  *Note
Recursive Retrieval::.

   When retrieving an FTP URL recursively, Wget will retrieve all the
data from the given directory tree (including the subdirectories up to
the specified depth) on the remote server, creating its mirror image
locally.  FTP retrieval is also limited by the `depth' parameter.

   By default, Wget will create a local directory tree corresponding to
the one found on the remote server.

   Recursive retrieving can find a number of applications, the most
important of which is mirroring.  It is also useful for WWW
presentations, and any other opportunities where slow network
connections should be bypassed by storing the files locally.

   You should be warned that invoking recursion may cause grave
overloading on your system, because of the fast exchange of data
through the network; all of this may hamper other users' work.  The
same stands for the foreign server you are mirroring--the more requests
it gets in a row, the greater is its load.

   Careless retrieving can also fill your file system uncontrollably,
which can grind the machine to a halt.

   The load can be minimized by lowering the maximum recursion level
(`-l') and/or by lowering the number of retries (`-t').  You may also
consider using the `-w' option to slow down your requests to the remote
servers, as well as the numerous options to narrow the number of
followed links (*Note Following Links::).

   Recursive retrieval is a good thing when used properly.  Please take
all precautions not to wreak havoc through carelessness.
File: wget.info, Node: Following Links, Next: Time-Stamping, Prev: Recursive Retrieval, Up: Top

Following Links
***************

   When retrieving recursively, one does not wish to retrieve loads
of unnecessary data.  Most of the time the users bear in mind exactly
what they want to download, and want Wget to follow only specific links.

   For example, if you wish to download the music archive from
`fly.cc.fer.hr', you will not want to download all the home pages that
happen to be referenced by an obscure part of the archive.

   Wget possesses several mechanisms that allow you to fine-tune which
links it will follow.

* Menu:

* Relative Links::         Follow relative links only.
* Host Checking::          Follow links on the same host.
* Domain Acceptance::      Check on a list of domains.
* All Hosts::              No host restrictions.
* Types of Files::         Getting only certain files.
* Directory-Based Limits:: Getting only certain directories.
* FTP Links::              Following FTP links.
File: wget.info, Node: Relative Links, Next: Host Checking, Prev: Following Links, Up: Following Links

Relative Links
==============

   When only relative links are followed (option `-L'), recursive
retrieving will never span hosts.  No time-expensive DNS lookups will
be performed, and the process will be very fast, with minimum strain
on the network.  This will often suit your needs, especially when
mirroring the output of various `x2html' converters, since they
generally output relative links.
861 File: wget.info, Node: Host Checking, Next: Domain Acceptance, Prev: Relative Links, Up: Following Links
866 The drawback of following the relative links solely is that humans
867 often tend to mix them with absolute links to the very same host, and
868 the very same page. In this mode (which is the default mode for
869 following links) all URLs the that refer to the same host will be
872 The problem with this option are the aliases of the hosts and
873 domains. Thus there is no way for Wget to know that `regoc.srce.hr' and
874 `www.srce.hr' are the same host, or that `fly.cc.fer.hr' is the same as
875 `fly.cc.etf.hr'. Whenever an absolute link is encountered, the host is
876 DNS-looked-up with `gethostbyname' to check whether we are maybe
877 dealing with the same hosts. Although the results of `gethostbyname'
878 are cached, it is still a great slowdown, e.g. when dealing with large
879 indices of home pages on different hosts (because each of the hosts
880 must be and DNS-resolved to see whether it just *might* an alias of the
To avoid this overhead you may use `-nh', which will turn off DNS
resolution and make Wget compare hosts literally.  This will make
things run much faster, but also much less reliably (e.g. `www.srce.hr'
and `regoc.srce.hr' will be flagged as different hosts).
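The difference between the two modes can be sketched as follows; the
helper is a hypothetical illustration, not Wget source.  With `-nh'
hostnames are compared as plain strings, whereas by default each name
would first be resolved (as with `gethostbyname') and the addresses
compared instead:

```shell
# Hypothetical sketch of literal host comparison, as done under `-nh'.
# The default mode would resolve both names before comparing.
same_host_literal() {
  [ "$1" = "$2" ]
}

same_host_literal www.srce.hr www.srce.hr   && echo "same"
same_host_literal www.srce.hr regoc.srce.hr || echo "different"
```

The second call shows the caveat from the text: two aliases of one
machine are flagged as different hosts.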
Note that modern HTTP servers allow one IP address to host several
"virtual servers", each having its own directory hierarchy.  Such
"servers" are distinguished by their hostnames (all of which point to
the same IP address); for this to work, a client must send a `Host'
header, which is what Wget does.  However, in that case Wget *must not*
try to divine a host's "real" address, nor try to use the same
hostname for each access, i.e. `-nh' must be turned on.
In other words, the `-nh' option must be used to enable retrieval
from virtual servers distinguished by their hostnames.  As the number
of such server setups grows, the behavior of `-nh' may become the
default in the future.
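The request a client sends to a name-based virtual server looks
roughly like the sketch below; the `Host' header is what selects the
right site on a shared IP address.  Hostname and path here are
made-up examples:

```shell
# Sketch of a request to a name-based virtual server.  The Host
# header carries the name the link used; without it the server could
# not tell which of its virtual hosts is meant.
host=www.example.com   # hypothetical server name
path=/index.html
printf 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' "$path" "$host"
```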
File: wget.info, Node: Domain Acceptance, Next: All Hosts, Prev: Host Checking, Up: Following Links

Domain Acceptance
=================
With the `-D' option you may specify the domains that will be
followed.  Hosts whose domain is not on this list will not be
DNS-resolved.  Thus you can specify `-Dmit.edu' just to make sure that
*nothing outside of MIT gets looked up*.  This is very important and
useful.  It also means that `-D' does *not* imply `-H' (span all
hosts), which must be specified explicitly.  Feel free to use this
option, since it will speed things up with almost all the reliability
of checking for all hosts.  Thus you could invoke
     wget -r -D.hr http://fly.cc.fer.hr/
to make sure that only the hosts in the `.hr' domain get DNS-looked-up
for being equal to `fly.cc.fer.hr'.  So `fly.cc.etf.hr' will be checked
(only once!) and found equal, but `www.gnu.ai.mit.edu' will not even be
checked.
Of course, domain acceptance can be used to limit the retrieval to
particular domains with spanning of hosts within them, but then you
must specify `-H' explicitly.  E.g.:
     wget -r -H -Dmit.edu,stanford.edu http://www.mit.edu/
will start with `http://www.mit.edu/', following links across MIT
and Stanford.
If there are domains you want to exclude specifically, you can do it
with `--exclude-domains', which accepts the same type of arguments as
`-D', but will *exclude* all the listed domains.  For example, if you
want to download all the hosts from the `foo.edu' domain, with the
exception of `sunsite.foo.edu', you can do it like this:
     wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu http://www.foo.edu/
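Domain acceptance and exclusion amount to suffix tests on each
hostname.  The sketch below is a hypothetical illustration of that
logic, not Wget source, replaying the `foo.edu'/`sunsite.foo.edu'
example:

```shell
# Hypothetical sketch: domain acceptance (-D) and exclusion
# (--exclude-domains) as suffix tests on the hostname.
in_domain() {   # usage: in_domain HOST DOMAIN
  case "$1" in
    *"$2") return 0 ;;   # hostname ends with the domain: accepted
    *)     return 1 ;;
  esac
}

host=sunsite.foo.edu
if in_domain "$host" foo.edu && ! in_domain "$host" sunsite.foo.edu; then
  echo "follow $host"
else
  echo "skip $host"
fi
```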
File: wget.info, Node: All Hosts, Next: Types of Files, Prev: Domain Acceptance, Up: Following Links

All Hosts
=========
When `-H' is specified without `-D', all hosts are freely spanned.
There are no restrictions whatsoever as to what part of the net Wget
will go to fetch documents, other than the maximum retrieval depth.
If a page references `www.yahoo.com', so be it.  Such an option is
rarely useful by itself.
File: wget.info, Node: Types of Files, Next: Directory-Based Limits, Prev: All Hosts, Up: Following Links

Types of Files
==============
When downloading material from the web, you will often want to
restrict the retrieval to only certain file types.  For example, if
you are interested in downloading GIFs, you will not be overjoyed to
get loads of PostScript documents, and vice versa.
Wget offers two options to deal with this problem.  Each option
description lists a short name, a long name, and the equivalent
command in `.wgetrc'.
`-A ACCLIST'
`--accept ACCLIST'
`accept = ACCLIST'
The argument to the `--accept' option is a list of file suffixes or
patterns that Wget will download during recursive retrieval.  A
suffix is the ending part of a file name, and consists of "normal"
letters, e.g. `gif' or `.jpg'.  A matching pattern contains
shell-like wildcards, e.g. `books*' or `zelazny*196[0-9]*'.

So, specifying `wget -A gif,jpg' will make Wget download only the
files ending with `gif' or `jpg', i.e. GIFs and JPEGs.  On the other
hand, `wget -A "zelazny*196[0-9]*"' will download only files
beginning with `zelazny' and containing numbers from 1960 to 1969
anywhere within.  Look up the manual of your shell for a description
of how pattern matching works.

Of course, any number of suffixes and patterns can be combined into a
comma-separated list, and given as an argument to `-A'.
`-R REJLIST'
`--reject REJLIST'
`reject = REJLIST'
The `--reject' option works the same way as `--accept', only its
logic is the reverse: Wget will download all files *except* the ones
matching the suffixes (or patterns) in the list.

So, if you want to download a whole page except for the cumbersome
MPEGs and .AU files, you can use `wget -R mpg,mpeg,au'.  Analogously,
to download all files except the ones beginning with `bjork', use
`wget -R "bjork*"'.  The quotes are to prevent expansion by the shell.
The `-A' and `-R' options may be combined to achieve even better
fine-tuning of which files to retrieve.  E.g. `wget -A "*zelazny*" -R
.ps' will download all files having `zelazny' as a part of their
name, but *not* the PostScript files.
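Since the patterns are shell-like, the acceptance test itself can be
sketched with the shell's own wildcard matching; the helper below is
a hypothetical illustration of the kind of check `-A'/`-R' apply, not
Wget source:

```shell
# Hypothetical sketch of the suffix/pattern test behind -A and -R,
# using shell case-pattern matching (same wildcard syntax).
matches() {   # usage: matches FILENAME PATTERN
  case "$1" in
    $2) return 0 ;;   # unquoted $2 is interpreted as a pattern
    *)  return 1 ;;
  esac
}

matches lord-of-light.jpg "*.jpg"           && echo "accepted"
matches zelazny1967.tar "zelazny*196[0-9]*" && echo "accepted"
matches notes.ps "*.jpg"                    || echo "rejected"
```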
Note that these two options do not affect the downloading of HTML
files; Wget must load all the HTML files to know where to go at
all--recursive retrieval would make no sense otherwise.
File: wget.info, Node: Directory-Based Limits, Next: FTP Links, Prev: Types of Files, Up: Following Links
Directory-Based Limits
======================
Regardless of other link-following facilities, it is often useful to
restrict the files to retrieve based on the directories those files
are placed in.  There can be many reasons for this--the home pages
may be organized in a reasonable directory structure; or some
directories may contain useless information, e.g. `/cgi-bin' or
`/dev' directories.
Wget offers three different options to deal with this requirement.
Each option description lists a short name, a long name, and the
equivalent command in `.wgetrc'.
`-I LIST'
`--include-directories LIST'
`include_directories = LIST'
The `-I' option accepts a comma-separated list of directories to be
included in the retrieval.  Any other directories will simply be
ignored.  The directories are absolute paths.
So, if you wish to download from `http://host/people/bozo/',
following only links to bozo's colleagues in the `/people' directory
and the bogus scripts in `/cgi-bin', you can specify:
     wget -I /people,/cgi-bin http://host/people/bozo/
`-X LIST'
`--exclude-directories LIST'
`exclude_directories = LIST'
The `-X' option is exactly the reverse of `-I'--it is a list of
directories *excluded* from the download.  E.g. if you do not want
Wget to download things from the `/cgi-bin' directory, specify `-X
/cgi-bin' on the command line.
As with `-A'/`-R', these two options can be combined for finer tuning
of downloading subdirectories.  E.g. if you want to load all the
files from the `/pub' hierarchy except for `/pub/worthless', specify
`-I/pub -X/pub/worthless'.
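The combined `-I'/`-X' decision reduces to path-prefix tests on the
directory part of each URL.  The sketch below is a hypothetical
illustration of that rule, not Wget source, replaying the
`/pub'/`/pub/worthless' example:

```shell
# Hypothetical sketch: -I/-X as path-prefix tests.  A path is taken
# when it lies under an included directory and under no excluded one.
under() {   # usage: under PATH DIR
  case "$1" in
    "$2"/*|"$2") return 0 ;;
    *)           return 1 ;;
  esac
}

path=/pub/worthless/junk.txt
if under "$path" /pub && ! under "$path" /pub/worthless; then
  echo "download $path"
else
  echo "skip $path"
fi
```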
`-np'
`--no-parent'
`no_parent = on'
The simplest, and often very useful, way of limiting directories is
disallowing retrieval of links that refer to a hierarchy "above" the
beginning directory, i.e. disallowing ascent to the parent
directory/directories.
The `--no-parent' option (short `-np') is useful in this case.  Using
it guarantees that you will never leave the existing hierarchy.
Supposing you issue Wget with:
     wget -r --no-parent http://somehost/~luzer/my-archive/
You may rest assured that none of the references to
`/~his-girls-homepage/' or `/~luzer/all-my-mpegs/' will be followed.
Only the archive you are interested in will be downloaded.
Essentially, `--no-parent' is similar to `-I/~luzer/my-archive', only
it handles redirections in a more intelligent fashion.
File: wget.info, Node: FTP Links, Prev: Directory-Based Limits, Up: Following Links

Following FTP Links
===================
The rules for FTP are somewhat specific, as they necessarily have to
be.  FTP links in HTML documents are often included for purposes of
reference, and it is often inconvenient to download them by default.
To have FTP links followed from HTML documents, you need to specify
the `--follow-ftp' option.  Having done that, FTP links will span
hosts regardless of the `-H' setting.  This is logical, as FTP links
rarely point to the same host where the HTTP server resides.  For
similar reasons, the `-L' option has no effect on such downloads.  On
the other hand, domain acceptance (`-D') and suffix rules (`-A' and
`-R') apply normally.
Also note that followed links to FTP directories will not be
retrieved recursively further.
File: wget.info, Node: Time-Stamping, Next: Startup File, Prev: Following Links, Up: Top

Time-Stamping
*************
One of the most important aspects of mirroring information from the
Internet is updating your archives.
Downloading the whole archive again and again, just to replace a few
changed files, is expensive, both in terms of wasted bandwidth and
money, and in the time needed for the update.  This is why all the
mirroring tools offer the option of incremental updating.
Such an updating mechanism means that the remote server is scanned in
search of "new" files.  Only those new files will be downloaded in
place of the old ones.
A file is considered new if one of these two conditions is met:

  1. A file of that name does not already exist locally.

  2. A file of that name does exist, but the remote file was modified
     more recently than the local file.
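The two conditions above can be sketched as a single shell test.
This is a hypothetical illustration, not Wget source: temporary files
with faked modification times stand in for the local and remote
copies:

```shell
# Hypothetical sketch of the "is the file new?" decision, using
# touch -t to fake the modification times being compared.
local_file=$(mktemp); remote_file=$(mktemp)
touch -t 200001010000 "$local_file"    # old local copy
touch -t 200201010000 "$remote_file"   # more recently modified "remote"

# Condition 1: no local copy.  Condition 2: remote is newer (-nt).
if [ ! -e "$local_file" ] || [ "$remote_file" -nt "$local_file" ]; then
  echo "download"
else
  echo "skip"
fi
rm -f "$local_file" "$remote_file"
```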
To implement this, the program needs to be aware of the time of last
modification of both the remote and local files.  Such information is
called the "time-stamp".
Time-stamping in GNU Wget is turned on using the `--timestamping'
(`-N') option, or through the `timestamping = on' directive in
`.wgetrc'.  With this option, for each file it intends to download,
Wget will check whether a local file of the same name exists.  If it
does, and the remote file is older, Wget will not download it.
If the local file does not exist, or the sizes of the files do not
match, Wget will download the remote file no matter what the
time-stamps say.
* Time-Stamping Usage::
* HTTP Time-Stamping Internals::
* FTP Time-Stamping Internals::
File: wget.info, Node: Time-Stamping Usage, Next: HTTP Time-Stamping Internals, Prev: Time-Stamping, Up: Time-Stamping

Time-Stamping Usage
===================
The usage of time-stamping is simple.  Say you would like to download
a file so that it keeps its date of modification.

     wget -S http://www.gnu.ai.mit.edu/
A simple `ls -l' shows that the time stamp on the local file equals
the state of the `Last-Modified' header, as returned by the server.
As you can see, the time-stamping info is preserved locally, even
without `-N'.
Several days later, you would like Wget to check whether the remote
file has changed, and download it if it has.
     wget -N http://www.gnu.ai.mit.edu/
Wget will ask the server for the last-modified date.  If the local
file is newer, the remote file will not be re-fetched.  However, if
the remote file is more recent, Wget will proceed to fetch it
normally.
The same goes for FTP.  For example:
     wget ftp://ftp.ifi.uio.no/pub/emacs/gnus/*
`ls' will show that the time-stamps are set according to the state on
the remote server.  Reissuing the command with `-N' will make Wget
re-fetch *only* the files that have been modified.
In both HTTP and FTP retrieval Wget will time-stamp the local file
correctly (with or without `-N') if it gets the stamps, i.e. gets the
directory listing for FTP or the `Last-Modified' header for HTTP.
If you wished to mirror the GNU archive every week, you would use the
following command every week:
     wget --timestamping -r ftp://prep.ai.mit.edu/pub/gnu/
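To actually run such a mirror every week unattended, the command can
be put in a crontab; the entry below is a made-up example (the log
path and schedule are assumptions, and `-o' writes Wget's messages to
a log file):

```shell
# Sketch: print a crontab entry that runs the weekly mirror every
# Sunday at 03:00, logging to a hypothetical log file.
cat <<'EOF'
0 3 * * 0  wget --timestamping -r ftp://prep.ai.mit.edu/pub/gnu/ -o /var/log/gnu-mirror.log
EOF
```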
File: wget.info, Node: HTTP Time-Stamping Internals, Next: FTP Time-Stamping Internals, Prev: Time-Stamping Usage, Up: Time-Stamping
HTTP Time-Stamping Internals
============================
Time-stamping in HTTP is implemented by checking the `Last-Modified'
header.  If you wish to retrieve the file `foo.html' through HTTP,
Wget will check whether `foo.html' exists locally.  If it doesn't,
`foo.html' will be retrieved unconditionally.
If the file does exist locally, Wget will first check its local
time-stamp (similar to the way `ls -l' checks it), and then send a
`HEAD' request to the remote server, demanding the information on the
remote file.
The `Last-Modified' header is examined to find which file was
modified more recently (which makes it "newer").  If the remote file
is newer, it will be downloaded; if it is older, Wget will give up.(1)
Arguably, HTTP time-stamping should be implemented using the
`If-Modified-Since' request.
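The alternative mentioned above would fold the check into the
download itself: with `If-Modified-Since' a single GET suffices, and
the server returns the body only if the file changed, or a `304 Not
Modified' status otherwise.  A sketch of such a request (hostname and
date are made-up examples):

```shell
# Sketch of a conditional HTTP request.  The server compares the
# given date against the file's modification time on its side.
printf 'GET /foo.html HTTP/1.0\r\n'
printf 'Host: www.example.com\r\n'
printf 'If-Modified-Since: Sat, 29 Aug 1998 12:00:00 GMT\r\n\r\n'
```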
---------- Footnotes ----------

(1) As an additional check, Wget will look at the `Content-Length'
header, and compare the sizes; if they are not the same, the remote
file will be downloaded no matter what the time-stamp says.
File: wget.info, Node: FTP Time-Stamping Internals, Prev: HTTP Time-Stamping Internals, Up: Time-Stamping
FTP Time-Stamping Internals
===========================
In theory, FTP time-stamping works much the same as HTTP, only FTP
has no headers--time-stamps must be gleaned from the directory
listings.
For each directory that files are to be retrieved from, Wget will use
the `LIST' command to get the listing.  It will try to analyze the
listing, assuming that it is a Unix `ls -l' listing, and extract the
time-stamps.  The rest is exactly the same as for HTTP.
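Extracting the time-stamp from one Unix-style `LIST' line amounts to
picking the date fields out of the whitespace-separated columns.  The
sketch below is a hypothetical illustration (the listing line itself
is made up), not Wget's parser:

```shell
# Hypothetical sketch: split one Unix-style LIST line into fields.
# Columns: perms links owner group size month day year-or-time name.
line='-rw-r--r--   1 ftp      ftp         10240 Jan 12  1998 gnus.tar.gz'
set -- $line   # word-splitting on whitespace gives positional fields
echo "size=$5 month=$6 day=$7 year=$8 name=$9"
```

A real parser must also cope with the `HH:MM' form the year column
takes for recent files, which is one reason this analysis is fragile.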
The assumption that every directory listing is a Unix-style listing
may sound extremely constraining, but in practice it is not, as many
non-Unix FTP servers use the Unixoid listing format because most
(all?) of the clients understand it.  Bear in mind that RFC959
defines no standard way to get a file list, let alone the
time-stamps.  We can only hope that a future standard will define
this.
Another non-standard solution is the use of the `MDTM' command,
supported by some FTP servers (including the popular `wu-ftpd'),
which returns the exact time of the specified file.  Wget may support
this command in the future.
File: wget.info, Node: Startup File, Next: Examples, Prev: Time-Stamping, Up: Top

Startup File
************
Once you know how to change the default settings of Wget through
command-line arguments, you may wish to make some of those settings
permanent.  You can do that in a convenient way by creating the Wget
startup file--`.wgetrc'.
Besides `.wgetrc' being the "main" initialization file, it is
convenient to have a special facility for storing passwords.  Thus
Wget reads and interprets the contents of `$HOME/.netrc', if it finds
it.  The format of `.netrc' is described in your system manuals.
Wget reads `.wgetrc' upon startup, recognizing a limited set of
commands.
* Wgetrc Location::   Location of various wgetrc files.
* Wgetrc Syntax::     Syntax of wgetrc.
* Wgetrc Commands::   List of available commands.
* Sample Wgetrc::     A wgetrc example.
File: wget.info, Node: Wgetrc Location, Next: Wgetrc Syntax, Prev: Startup File, Up: Startup File

Wgetrc Location
===============
When initializing, Wget will look for a "global" startup file,
`/usr/local/etc/wgetrc' by default (or some prefix other than
`/usr/local', if Wget was not installed there), and read commands
from there, if it exists.
Then it will look for the user's file.  If the environment variable
`WGETRC' is set, Wget will try to load that file.  Failing that, no
further attempts will be made.
If `WGETRC' is not set, Wget will try to load `$HOME/.wgetrc'.
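The user-file part of this lookup order can be sketched as a small
shell helper; this is a hypothetical illustration of the rule just
described (it ignores the global file and readability checks), not
Wget source:

```shell
# Hypothetical sketch: which user startup file would be tried,
# per the order described above ($WGETRC wins over $HOME/.wgetrc).
user_wgetrc() {
  if [ -n "$WGETRC" ]; then
    echo "$WGETRC"
  else
    echo "$HOME/.wgetrc"
  fi
}

WGETRC=/tmp/my-wgetrc   # made-up example path
user_wgetrc
```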
The fact that the user's settings are loaded after the system-wide
ones means that in case of collision the user's wgetrc *overrides*
the system-wide wgetrc (in `/usr/local/etc/wgetrc' by default).
Fascist