HTML files on your local disk, by adding `<base href="URL">' to
HTML, or using the `--base' command-line option.
+`-B URL'
+`--base=URL'
+ When used in conjunction with `-F', prepends URL to relative links
+ in the file specified by `-i'.
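+
+ For example (a sketch; `links.html' is a hypothetical local file
+ containing relative links, and SITE a placeholder host):
+
+ wget -F -B http://SITE/subdir/ -i links.html
+ # Each relative link found in `links.html' is resolved
+ # against http://SITE/subdir/ before being downloaded.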
+
\1f
File: wget.info, Node: Download Options, Next: Directory Options, Prev: Logging and Input File Options, Up: Invoking
`-nc'
`--no-clobber'
- Do not clobber existing files when saving to directory hierarchy
- within recursive retrieval of several files. This option is
- *extremely* useful when you wish to continue where you left off
- with retrieval of many files. If the files have the `.html' or
- (yuck) `.htm' suffix, they will be loaded from the local disk, and
- parsed as if they have been retrieved from the Web.
+ If a file is downloaded more than once in the same directory,
+ Wget's behavior depends on a few options, including `-nc'. In
+ certain cases, the local file will be "clobbered", or overwritten,
+ upon repeated download. In other cases it will be preserved.
+
+ When running Wget without `-N', `-nc', or `-r', downloading the
+ same file in the same directory will result in the original copy
+ of `FILE' being preserved and the second copy being named
+ `FILE.1'. If that file is downloaded yet again, the third copy
+ will be named `FILE.2', and so on. When `-nc' is specified, this
+ behavior is suppressed, and Wget will refuse to download newer
+ copies of `FILE'. Therefore, "no-clobber" is actually a misnomer
+ in this mode--it's not clobbering that's prevented (the numeric
+ suffixes were already preventing clobbering), but rather the
+ saving of multiple versions.
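+
+ For example (an illustrative pair of runs; SITE and FILE are
+ placeholders, as elsewhere in this manual):
+
+ wget http://SITE/FILE
+ wget http://SITE/FILE
+ # Both `FILE' and `FILE.1' now exist locally.
+ wget -nc http://SITE/FILE
+ # Refused: `FILE' already exists, so nothing is downloaded.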
+
+ When running Wget with `-r', but without `-N' or `-nc',
+ re-downloading a file will result in the new copy simply
+ overwriting the old. Adding `-nc' will prevent this behavior,
+ instead causing the original version to be preserved and any newer
+ copies on the server to be ignored.
+
+ When running Wget with `-N', with or without `-r', the decision
+ whether to download a newer copy of a file depends on the local
+ and remote timestamp and size of the file (*Note
+ Time-Stamping::). `-nc' may not be specified at the same time as
+ `-N'.
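+
+ For example:
+
+ wget -N http://SITE/FILE
+ # `FILE' is re-fetched only if the remote copy is newer than
+ # the local one, or if the two differ in size.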
+
+ Note that when `-nc' is specified, files with the suffixes `.html'
+ or (yuck) `.htm' will be loaded from the local disk and parsed as
+ if they had been retrieved from the Web.
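+
+ For example, re-running a recursive retrieval with `-nc' (a
+ hypothetical second invocation over the same tree):
+
+ wget -r -nc http://SITE/
+ # Existing local files are kept; local `.html' files are
+ # parsed for links instead of being fetched again.
+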
`-c'
`--continue'
remote file to `ls-lR.Z.1'. The `-c' option is also applicable
for HTTP servers that support the `Range' header.
+ Note that if you use `-c' on a file that's already downloaded
+ completely, `FILE' will not be changed, nor will a second `FILE.1'
+ copy be created.
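+
+ For example, re-running the earlier `ls-lR.Z' retrieval (a
+ hypothetical repeat invocation):
+
+ wget -c ftp://SITE/ls-lR.Z
+ # If the local `ls-lR.Z' is already complete, it is left
+ # unchanged and no `ls-lR.Z.1' copy appears.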
+
`--dot-style=STYLE'
Set the retrieval style to STYLE. Wget traces the retrieval of
each document by printing dots on the screen, each dot
on a proxy server. Wget will encode them using the `basic'
authentication scheme.
+`--referer=URL'
+ Include a `Referer: URL' header in the HTTP request. Useful for
+ retrieving documents whose server-side processing assumes they
+ are always being retrieved by interactive web browsers, and which
+ only come out properly when Referer is set to one of the pages
+ that point to them.
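+
+ For example (a hypothetical page/image pair on SITE):
+
+ wget --referer=http://SITE/gallery.html http://SITE/IMAGE
+ # Sends `Referer: http://SITE/gallery.html' in the request
+ # headers, as if the image were fetched from that page.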
+
`-s'
`--save-headers'
Save the headers sent by the HTTP server to the file, preceding the
useful to access the full remote file list when running a mirror,
or for debugging purposes.
+`-p'
+`--page-requisites'
+ This option causes Wget to download all the files that are
+ necessary to properly display a given HTML page. This includes
+ such things as inlined images, sounds, and referenced stylesheets.
+
+ Ordinarily, when downloading a single HTML page, any requisite
+ documents that may be needed to display it properly are not
+ downloaded. Using `-r' together with `-l' can help, but since
+ Wget does not ordinarily distinguish between external and inlined
+ documents, one is generally left with "leaf documents" that are
+ missing their requisites.
+
+ For instance, say document `1.html' contains an `<IMG>' tag
+ referencing `1.gif' and an `<A>' tag pointing to external document
+ `2.html'. Say that `2.html' is the same but that its image is
+ `2.gif' and it links to `3.html'. Say this continues up to some
+ arbitrarily high number.
+
+ If one executes the command:
+
+ wget -r -l 2 http://SITE/1.html
+
+ then `1.html', `1.gif', `2.html', `2.gif', and `3.html' will be
+ downloaded. As you can see, `3.html' is without its requisite
+ `3.gif' because Wget is simply counting the number of hops (up to
+ 2) away from `1.html' in order to determine where to stop the
+ recursion. However, with this command:
+
+ wget -r -l 2 -p http://SITE/1.html
+
+ all the above files *and* `3.html''s requisite `3.gif' will be
+ downloaded. Similarly,
+
+ wget -r -l 1 -p http://SITE/1.html
+
+ will cause `1.html', `1.gif', `2.html', and `2.gif' to be
+ downloaded. One might think that:
+
+ wget -r -l 0 -p http://SITE/1.html
+
+ would download just `1.html' and `1.gif', but unfortunately this
+ is not the case, because `-l 0' is equivalent to `-l inf'--that
+ is, infinite recursion. To download a single HTML page (or a
+ handful of them, all specified on the commandline or in a `-i' URL
+ input file) and its requisites, simply leave off `-r' and `-l':
+
+ wget -p http://SITE/1.html
+
+ Note that Wget will behave as if `-r' had been specified, but only
+ that single page and its requisites will be downloaded. Links
+ from that page to external documents will not be followed.
+ Actually, to download a single page and all its requisites (even
+ if they exist on separate websites), and make sure the lot
+ displays properly locally, this author likes to use a few options
+ in addition to `-p':
+
+ wget -H -k -K -nh -p http://SITE/DOCUMENT
+
+ To finish off this topic, it's worth knowing that Wget's idea of an
+ external document link is any URL specified in an `<A>' tag, an
+ `<AREA>' tag, or a `<LINK>' tag other than `<LINK
+ REL="stylesheet">'.
+
\1f
File: wget.info, Node: Recursive Accept/Reject Options, Prev: Recursive Retrieval Options, Up: Invoking
`--ignore-tags=LIST'
This is the opposite of the `--follow-tags' option. To skip
certain HTML tags when recursively looking for documents to
- download, specify them in a comma-separated LIST. The author of
- this option likes to use the following command to download a
- single HTML page and all documents necessary to display it
- properly:
+ download, specify them in a comma-separated LIST.
+
+ In the past, the `-G' option was the best bet for downloading a
+ single page and its requisites, using a commandline like:
wget -Ga,area -H -k -K -nh -r http://SITE/DOCUMENT
+ However, the author of this option came across a page with tags
+ like `<LINK REL="home" HREF="/">' and realized that `-G' was not
+ enough. One can't just tell Wget to ignore `<LINK>', because then
+ stylesheets will not be downloaded. Now the best bet for
+ downloading a single page and its requisites is the dedicated
+ `--page-requisites' (`-p') option.
+
`-H'
`--span-hosts'
Enable spanning across hosts when doing recursive retrieving
Also note that followed links to FTP directories will not be
retrieved recursively further.
-\1f
-File: wget.info, Node: Time-Stamping, Next: Startup File, Prev: Following Links, Up: Top
-
-Time-Stamping
-*************
-
- One of the most important aspects of mirroring information from the
-Internet is updating your archives.
-
- Downloading the whole archive again and again, just to replace a few
-changed files is expensive, both in terms of wasted bandwidth and money,
-and the time to do the update. This is why all the mirroring tools
-offer the option of incremental updating.
-
- Such an updating mechanism means that the remote server is scanned in
-search of "new" files. Only those new files will be downloaded in the
-place of the old ones.
-
- A file is considered new if one of these two conditions are met:
-
- 1. A file of that name does not already exist locally.
-
- 2. A file of that name does exist, but the remote file was modified
- more recently than the local file.
-
- To implement this, the program needs to be aware of the time of last
-modification of both remote and local files. Such information are
-called the "time-stamps".
-
- The time-stamping in GNU Wget is turned on using `--timestamping'
-(`-N') option, or through `timestamping = on' directive in `.wgetrc'.
-With this option, for each file it intends to download, Wget will check
-whether a local file of the same name exists. If it does, and the
-remote file is older, Wget will not download it.
-
- If the local file does not exist, or the sizes of the files do not
-match, Wget will download the remote file no matter what the time-stamps
-say.
-
-* Menu:
-
-* Time-Stamping Usage::
-* HTTP Time-Stamping Internals::
-* FTP Time-Stamping Internals::
-
-\1f
-File: wget.info, Node: Time-Stamping Usage, Next: HTTP Time-Stamping Internals, Prev: Time-Stamping, Up: Time-Stamping
-
-Time-Stamping Usage
-===================
-
- The usage of time-stamping is simple. Say you would like to
-download a file so that it keeps its date of modification.
-
- wget -S http://www.gnu.ai.mit.edu/
-
- A simple `ls -l' shows that the time stamp on the local file equals
-the state of the `Last-Modified' header, as returned by the server. As
-you can see, the time-stamping info is preserved locally, even without
-`-N'.
-
- Several days later, you would like Wget to check if the remote file
-has changed, and download it if it has.
-
- wget -N http://www.gnu.ai.mit.edu/
-
- Wget will ask the server for the last-modified date. If the local
-file is newer, the remote file will not be re-fetched. However, if the
-remote file is more recent, Wget will proceed fetching it normally.
-
- The same goes for FTP. For example:
-
- wget ftp://ftp.ifi.uio.no/pub/emacs/gnus/*
-
- `ls' will show that the timestamps are set according to the state on
-the remote server. Reissuing the command with `-N' will make Wget
-re-fetch *only* the files that have been modified.
-
- In both HTTP and FTP retrieval Wget will time-stamp the local file
-correctly (with or without `-N') if it gets the stamps, i.e. gets the
-directory listing for FTP or the `Last-Modified' header for HTTP.
-
- If you wished to mirror the GNU archive every week, you would use the
-following command every week:
-
- wget --timestamping -r ftp://prep.ai.mit.edu/pub/gnu/
-
-\1f
-File: wget.info, Node: HTTP Time-Stamping Internals, Next: FTP Time-Stamping Internals, Prev: Time-Stamping Usage, Up: Time-Stamping
-
-HTTP Time-Stamping Internals
-============================
-
- Time-stamping in HTTP is implemented by checking of the
-`Last-Modified' header. If you wish to retrieve the file `foo.html'
-through HTTP, Wget will check whether `foo.html' exists locally. If it
-doesn't, `foo.html' will be retrieved unconditionally.
-
- If the file does exist locally, Wget will first check its local
-time-stamp (similar to the way `ls -l' checks it), and then send a
-`HEAD' request to the remote server, demanding the information on the
-remote file.
-
- The `Last-Modified' header is examined to find which file was
-modified more recently (which makes it "newer"). If the remote file is
-newer, it will be downloaded; if it is older, Wget will give up.(1)
-
- When `--backup-converted' (`-K') is specified in conjunction with
-`-N', server file `X' is compared to local file `X.orig', if extant,
-rather than being compared to local file `X', which will always differ
-if it's been converted by `--convert-links' (`-k').
-
- Arguably, HTTP time-stamping should be implemented using the
-`If-Modified-Since' request.
-
- ---------- Footnotes ----------
-
- (1) As an additional check, Wget will look at the `Content-Length'
-header, and compare the sizes; if they are not the same, the remote
-file will be downloaded no matter what the time-stamp says.
-
-\1f
-File: wget.info, Node: FTP Time-Stamping Internals, Prev: HTTP Time-Stamping Internals, Up: Time-Stamping
-
-FTP Time-Stamping Internals
-===========================
-
- In theory, FTP time-stamping works much the same as HTTP, only FTP
-has no headers--time-stamps must be received from the directory
-listings.
-
- For each directory files must be retrieved from, Wget will use the
-`LIST' command to get the listing. It will try to analyze the listing,
-assuming that it is a Unix `ls -l' listing, and extract the
-time-stamps. The rest is exactly the same as for HTTP.
-
- Assumption that every directory listing is a Unix-style listing may
-sound extremely constraining, but in practice it is not, as many
-non-Unix FTP servers use the Unixoid listing format because most (all?)
-of the clients understand it. Bear in mind that RFC959 defines no
-standard way to get a file list, let alone the time-stamps. We can
-only hope that a future standard will define this.
-
- Another non-standard solution includes the use of `MDTM' command
-that is supported by some FTP servers (including the popular
-`wu-ftpd'), which returns the exact time of the specified file. Wget
-may support this command in the future.
-