HTML files on your local disk, by adding `<base href="URL">' to
HTML, or using the `--base' command-line option.
+`-B URL'
+`--base=URL'
+ When used in conjunction with `-F', prepends URL to relative links
+ in the file specified by `-i'.
+
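+ For example, assuming a hypothetical local file `LINKS.html' that
+ contains relative links, a command such as:
+
+ wget -F -i LINKS.html -B http://SITE/subdir/
+
+ treats `LINKS.html' as HTML (because of `-F') and resolves each
+ relative link found in it against `http://SITE/subdir/' before
+ retrieval.
+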
\1f
File: wget.info, Node: Download Options, Next: Directory Options, Prev: Logging and Input File Options, Up: Invoking
`-nc'
`--no-clobber'
- Do not clobber existing files when saving to directory hierarchy
- within recursive retrieval of several files. This option is
- *extremely* useful when you wish to continue where you left off
- with retrieval of many files. If the files have the `.html' or
- (yuck) `.htm' suffix, they will be loaded from the local disk, and
- parsed as if they have been retrieved from the Web.
+ If a file is downloaded more than once in the same directory,
+ Wget's behavior depends on a few options, including `-nc'. In
+ certain cases, the local file will be "clobbered", or overwritten,
+ upon repeated download. In other cases it will be preserved.
+
+ When running Wget without `-N', `-nc', or `-r', downloading the
+ same file in the same directory will result in the original copy
+ of `FILE' being preserved and the second copy being named
+ `FILE.1'. If that file is downloaded yet again, the third copy
+ will be named `FILE.2', and so on. When `-nc' is specified, this
+ behavior is suppressed, and Wget will refuse to download newer
+ copies of `FILE'. Therefore, "no-clobber" is actually a misnomer
+ in this mode: it is not clobbering that is prevented (the numeric
+ suffixes were already preventing clobbering), but rather the
+ saving of multiple versions.
+
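+ As a concrete sketch of this default numbering (with a
+ hypothetical URL), running the same command twice in the same
+ directory:
+
+ wget http://SITE/FILE
+ wget http://SITE/FILE
+
+ leaves the first download as `FILE' and saves the second as
+ `FILE.1'. Running the second command with `-nc' instead leaves
+ only `FILE' and skips the repeated download.
+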
+ When running Wget with `-r', but without `-N' or `-nc',
+ re-downloading a file will result in the new copy simply
+ overwriting the old. Adding `-nc' will prevent this behavior,
+ instead causing the original version to be preserved and any newer
+ copies on the server to be ignored.
+
+ When running Wget with `-N', with or without `-r', the decision as
+ to whether or not to download a newer copy of a file depends on
+ the local and remote timestamp and size of the file (*Note
+ Time-Stamping::). `-nc' may not be specified at the same time as
+ `-N'.
+
+ Note that when `-nc' is specified, files with the suffixes `.html'
+ or (yuck) `.htm' will be loaded from the local disk and parsed as
+ if they had been retrieved from the Web.
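+
+ For instance, resuming an interrupted recursive retrieval with a
+ command like the following (the URL is hypothetical):
+
+ wget -r -nc http://SITE/
+
+ will leave the already-downloaded files alone, parse the local
+ `.html' files to find further links, and fetch only the files that
+ are still missing.
+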
`-c'
`--continue'
remote file to `ls-lR.Z.1'. The `-c' option is also applicable
for HTTP servers that support the `Range' header.
+ Note that if you use `-c' on a file that's already downloaded
+ completely, `FILE' will not be changed, nor will a second `FILE.1'
+ copy be created.
+
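+ For instance, assuming a partial copy of `FILE' already sits in
+ the current directory, a command like:
+
+ wget -c http://SITE/FILE
+
+ asks the server for just the remaining portion of `FILE' and
+ appends it to the local copy, rather than starting over or
+ creating a `FILE.1'.
+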
`--dot-style=STYLE'
Set the retrieval style to STYLE. Wget traces the retrieval of
each document by printing dots on the screen, each dot
to reasonably expect the network error to be fixed before the
retry.
+`--waitretry=SECONDS'
+ If you don't want Wget to wait between *every* retrieval, but only
+ between retries of failed downloads, you can use this option.
+ Wget will use "linear backoff", waiting 1 second after the first
+ failure on a given file, then waiting 2 seconds after the second
+ failure on that file, up to the maximum number of SECONDS you
+ specify. Therefore, a value of 10 will actually make Wget wait up
+ to (1 + 2 + ... + 10) = 55 seconds per file.
+
+ Note that this option is turned on by default in the global
+ `wgetrc' file.
+
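+ For instance, the equivalent setting in a startup file is the
+ `waitretry' command; a line such as the following in `.wgetrc' (or
+ the global `wgetrc') has the same effect as `--waitretry=10':
+
+ waitretry = 10
+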
`-Y on/off'
`--proxy=on/off'
Turn proxy support on or off. The proxy is on by default if the
on a proxy server. Wget will encode them using the `basic'
authentication scheme.
+`--referer=URL'
+ Include a `Referer: URL' header in the HTTP request. This is
+ useful for retrieving documents with server-side processing that
+ assume they are always being retrieved by interactive web
+ browsers, and that only come out properly when Referer is set to
+ one of the pages that point to them.
+
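+ For example, to retrieve an image that the server only serves to
+ requests appearing to come from a page that links to it (both URLs
+ below are hypothetical):
+
+ wget --referer=http://SITE/gallery.html http://SITE/images/photo.jpg
+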
`-s'
`--save-headers'
Save the headers sent by the HTTP server to the file, preceding the
useful to access the full remote file list when running a mirror,
or for debugging purposes.
+`-p'
+`--page-requisites'
+ This option causes Wget to download all the files that are
+ necessary to properly display a given HTML page. This includes
+ such things as inlined images, sounds, and referenced stylesheets.
+
+ Ordinarily, when downloading a single HTML page, any requisite
+ documents that may be needed to display it properly are not
+ downloaded. Using `-r' together with `-l' can help, but since
+ Wget does not ordinarily distinguish between external and inlined
+ documents, one is generally left with "leaf documents" that are
+ missing their requisites.
+
+ For instance, say document `1.html' contains an `<IMG>' tag
+ referencing `1.gif' and an `<A>' tag pointing to external document
+ `2.html'. Say that `2.html' is the same but that its image is
+ `2.gif' and it links to `3.html'. Say this continues up to some
+ arbitrarily high number.
+
+ If one executes the command:
+
+ wget -r -l 2 http://SITE/1.html
+
+ then `1.html', `1.gif', `2.html', `2.gif', and `3.html' will be
+ downloaded. As you can see, `3.html' is without its requisite
+ `3.gif' because Wget is simply counting the number of hops (up to
+ 2) away from `1.html' in order to determine where to stop the
+ recursion. However, with this command:
+
+ wget -r -l 2 -p http://SITE/1.html
+
+ all the above files *and* `3.html''s requisite `3.gif' will be
+ downloaded. Similarly,
+
+ wget -r -l 1 -p http://SITE/1.html
+
+ will cause `1.html', `1.gif', `2.html', and `2.gif' to be
+ downloaded. One might think that:
+
+ wget -r -l 0 -p http://SITE/1.html
+
+ would download just `1.html' and `1.gif', but unfortunately this
+ is not the case, because `-l 0' is equivalent to `-l inf', that
+ is, infinite recursion. To download a single HTML page (or a
+ handful of them, all specified on the commandline or in a `-i' URL
+ input file) and its requisites, simply leave off `-r' and `-l':
+
+ wget -p http://SITE/1.html
+
+ Note that Wget will behave as if `-r' had been specified, but only
+ that single page and its requisites will be downloaded. Links
+ from that page to external documents will not be followed.
+ Actually, to download a single page and all its requisites (even
+ if they exist on separate websites), and make sure the lot
+ displays properly locally, this author likes to use a few options
+ in addition to `-p':
+
+ wget -H -k -K -nh -p http://SITE/DOCUMENT
+
+ To finish off this topic, it's worth knowing that Wget's idea of an
+ external document link is any URL specified in an `<A>' tag, an
+ `<AREA>' tag, or a `<LINK>' tag other than `<LINK
+ REL="stylesheet">'.
+
\1f
File: wget.info, Node: Recursive Accept/Reject Options, Prev: Recursive Retrieval Options, Up: Invoking
`--ignore-tags=LIST'
This is the opposite of the `--follow-tags' option. To skip
certain HTML tags when recursively looking for documents to
- download, specify them in a comma-separated LIST. The author of
- this option likes to use the following command to download a
- single HTML page and all documents necessary to display it
- properly:
+ download, specify them in a comma-separated LIST.
+
+ In the past, the `-G' option was the best bet for downloading a
+ single page and its requisites, using a commandline like:
wget -Ga,area -H -k -K -nh -r http://SITE/DOCUMENT
+ However, the author of this option came across a page with tags
+ like `<LINK REL="home" HREF="/">' and came to the realization that
+ `-G' was not enough. One can't just tell Wget to ignore `<LINK>',
+ because then stylesheets will not be downloaded. Now the best bet
+ for downloading a single page and its requisites is the dedicated
+ `--page-requisites' option.
+
`-H'
`--span-hosts'
Enable spanning across hosts when doing recursive retrieving
Also note that followed links to FTP directories will not be
retrieved recursively further.
-\1f
-File: wget.info, Node: Time-Stamping, Next: Startup File, Prev: Following Links, Up: Top
-
-Time-Stamping
-*************
-
- One of the most important aspects of mirroring information from the
-Internet is updating your archives.
-
- Downloading the whole archive again and again, just to replace a few
-changed files is expensive, both in terms of wasted bandwidth and money,
-and the time to do the update. This is why all the mirroring tools
-offer the option of incremental updating.
-
- Such an updating mechanism means that the remote server is scanned in
-search of "new" files. Only those new files will be downloaded in the
-place of the old ones.
-
- A file is considered new if one of these two conditions are met:
-
- 1. A file of that name does not already exist locally.
-
- 2. A file of that name does exist, but the remote file was modified
- more recently than the local file.
-
- To implement this, the program needs to be aware of the time of last
-modification of both remote and local files. Such information are
-called the "time-stamps".
-
- The time-stamping in GNU Wget is turned on using `--timestamping'
-(`-N') option, or through `timestamping = on' directive in `.wgetrc'.
-With this option, for each file it intends to download, Wget will check
-whether a local file of the same name exists. If it does, and the
-remote file is older, Wget will not download it.
-
- If the local file does not exist, or the sizes of the files do not
-match, Wget will download the remote file no matter what the time-stamps
-say.
-
-* Menu:
-
-* Time-Stamping Usage::
-* HTTP Time-Stamping Internals::
-* FTP Time-Stamping Internals::
-
-\1f
-File: wget.info, Node: Time-Stamping Usage, Next: HTTP Time-Stamping Internals, Prev: Time-Stamping, Up: Time-Stamping
-
-Time-Stamping Usage
-===================
-
- The usage of time-stamping is simple. Say you would like to
-download a file so that it keeps its date of modification.
-
- wget -S http://www.gnu.ai.mit.edu/
-
- A simple `ls -l' shows that the time stamp on the local file equals
-the state of the `Last-Modified' header, as returned by the server. As
-you can see, the time-stamping info is preserved locally, even without
-`-N'.
-
- Several days later, you would like Wget to check if the remote file
-has changed, and download it if it has.
-
- wget -N http://www.gnu.ai.mit.edu/
-
- Wget will ask the server for the last-modified date. If the local
-file is newer, the remote file will not be re-fetched. However, if the
-remote file is more recent, Wget will proceed fetching it normally.
-
- The same goes for FTP. For example:
-
- wget ftp://ftp.ifi.uio.no/pub/emacs/gnus/*
-
- `ls' will show that the timestamps are set according to the state on
-the remote server. Reissuing the command with `-N' will make Wget
-re-fetch *only* the files that have been modified.
-
- In both HTTP and FTP retrieval Wget will time-stamp the local file
-correctly (with or without `-N') if it gets the stamps, i.e. gets the
-directory listing for FTP or the `Last-Modified' header for HTTP.
-
- If you wished to mirror the GNU archive every week, you would use the
-following command every week:
-
- wget --timestamping -r ftp://prep.ai.mit.edu/pub/gnu/
-
-\1f
-File: wget.info, Node: HTTP Time-Stamping Internals, Next: FTP Time-Stamping Internals, Prev: Time-Stamping Usage, Up: Time-Stamping
-
-HTTP Time-Stamping Internals
-============================
-
- Time-stamping in HTTP is implemented by checking of the
-`Last-Modified' header. If you wish to retrieve the file `foo.html'
-through HTTP, Wget will check whether `foo.html' exists locally. If it
-doesn't, `foo.html' will be retrieved unconditionally.
-
- If the file does exist locally, Wget will first check its local
-time-stamp (similar to the way `ls -l' checks it), and then send a
-`HEAD' request to the remote server, demanding the information on the
-remote file.
-
- The `Last-Modified' header is examined to find which file was
-modified more recently (which makes it "newer"). If the remote file is
-newer, it will be downloaded; if it is older, Wget will give up.(1)
-
- When `--backup-converted' (`-K') is specified in conjunction with
-`-N', server file `X' is compared to local file `X.orig', if extant,
-rather than being compared to local file `X', which will always differ
-if it's been converted by `--convert-links' (`-k').
-
- Arguably, HTTP time-stamping should be implemented using the
-`If-Modified-Since' request.
-
- ---------- Footnotes ----------
-
- (1) As an additional check, Wget will look at the `Content-Length'
-header, and compare the sizes; if they are not the same, the remote
-file will be downloaded no matter what the time-stamp says.
-
-\1f
-File: wget.info, Node: FTP Time-Stamping Internals, Prev: HTTP Time-Stamping Internals, Up: Time-Stamping
-
-FTP Time-Stamping Internals
-===========================
-
- In theory, FTP time-stamping works much the same as HTTP, only FTP
-has no headers--time-stamps must be received from the directory
-listings.
-
- For each directory files must be retrieved from, Wget will use the
-`LIST' command to get the listing. It will try to analyze the listing,
-assuming that it is a Unix `ls -l' listing, and extract the
-time-stamps. The rest is exactly the same as for HTTP.
-
- Assumption that every directory listing is a Unix-style listing may
-sound extremely constraining, but in practice it is not, as many
-non-Unix FTP servers use the Unixoid listing format because most (all?)
-of the clients understand it. Bear in mind that RFC959 defines no
-standard way to get a file list, let alone the time-stamps. We can
-only hope that a future standard will define this.
-
- Another non-standard solution includes the use of `MDTM' command
-that is supported by some FTP servers (including the popular
-`wu-ftpd'), which returns the exact time of the specified file. Wget
-may support this command in the future.
-