@end iftex
@c This should really be auto-generated!
-@set VERSION 1.7-dev
-@set UPDATED Jan 2001
+@set VERSION 1.7-pre1
+@set UPDATED May 2001
@dircategory Net Utilities
@dircategory World Wide Web
This manual documents version @value{VERSION} of GNU Wget, the freely
available utility for network download.
-Copyright @copyright{} 1996, 1997, 1998, 2000 Free Software Foundation, Inc.
+Copyright @copyright{} 1996, 1997, 1998, 2000, 2001 Free Software
+Foundation, Inc.
@menu
* Overview:: Features of Wget.
and parsed as if they had been retrieved from the Web.
@cindex continue retrieval
+@cindex incomplete downloads
@cindex resume download
@item -c
@itemx --continue
Without @samp{-c}, the previous example would just download the remote
file to @file{ls-lR.Z.1}, leaving the truncated @file{ls-lR.Z} file
-alone.
-
-Note that @samp{-c} only works with @sc{ftp} servers and with @sc{http}
-servers that support the @code{Range} header. Also note that Wget has
-no way of verifying that the local file really is a valid prefix of the
-remote file. If you use a lame @sc{http} proxy that inserts a ``transfer
-interrupted'' string, you'll get a corrupted file. In the future a
+alone.
+
+Beginning with Wget 1.7, if you use @samp{-c} on a non-empty file, and
+it turns out that the server does not support continued downloading,
+Wget will refuse to start the download from scratch, which would
+effectively ruin existing contents. If you really want the download to
+start from scratch, remove the file.
+
+Also beginning with Wget 1.7, if you use @samp{-c} on a file that is
+the same size as the one on the server, Wget will refuse to download the
+file and print an explanatory message. The same happens when the file
+is smaller on the server than locally (presumably because it was changed
+on the server since your last download attempt)---because ``continuing''
+is not meaningful, no download occurs.
+
+On the other side of the coin, while using @samp{-c}, any file that's
+bigger on the server than locally will be considered an incomplete
+download and only @code{(length(remote) - length(local))} bytes will be
+downloaded and tacked onto the end of the local file. This behavior can
+be desirable in certain cases---for instance, you can use @samp{wget -c}
+to download just the new portion that's been appended to a data
+collection or log file.
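+
+For example, a server log file that only ever grows could be kept up to
+date with a command like this (the URL is illustrative):
+
+@example
+wget -c http://@var{server}/logs/access.log
+@end example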
+
+However, if the file is bigger on the server because it's been
+@emph{changed}, as opposed to just @emph{appended} to, you'll end up
+with a garbled file. Wget has no way of verifying that the local file
+is really a valid prefix of the remote file. You need to be especially
+careful of this when using @samp{-c} in conjunction with @samp{-r},
+since every file will be considered an ``incomplete download'' candidate.
+
+Another instance where you'll get a garbled file if you try to use
+@samp{-c} is if you have a lame @sc{http} proxy that inserts a
+``transfer interrupted'' string into the local file. In the future a
``rollback'' option may be added to deal with this case.
-Currently, if you use @samp{-c} on a @file{file} that's already
-completely downloaded, it will be re-downloaded from scratch (to the
-same name, not to @file{file.1}). This will be fixed in a future
-version of Wget.
+Note that @samp{-c} only works with @sc{ftp} servers and with @sc{http}
+servers that support the @code{Range} header.
@cindex dot style
@cindex retrieval tracing style
Caching is allowed by default.
+@cindex cookies
+@item --cookies=on/off
+When set to off, disable the use of cookies. Cookies are a mechanism
+for maintaining server-side state. The server sends the client a cookie
+using the @code{Set-Cookie} header, and the client responds with the
+same cookie upon further requests. Since cookies allow the server
+owners to keep track of visitors and for sites to exchange this
+information, some consider them a breach of privacy. The default is to
+use cookies; however, @emph{storing} cookies is not on by default.
+
+@cindex loading cookies
+@cindex cookies, loading
+@item --load-cookies @var{file}
+Load cookies from @var{file} before the first HTTP retrieval. The
+format of @var{file} is the one used by Netscape and Mozilla, at least
+their Unix versions.
+
+@cindex saving cookies
+@cindex cookies, saving
+@item --save-cookies @var{file}
+Save cookies to @var{file} at the end of the session.  Cookies whose
+expiry time is not specified, or those that have already expired, are
+not saved.
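+
+For example, to carry cookies from a session-establishing page over to
+later retrievals, you might combine the two options like this (the file
+name and URLs are illustrative):
+
+@example
+wget --save-cookies cookies.txt http://@var{server}/login.html
+wget --load-cookies cookies.txt http://@var{server}/members.html
+@end example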
+
@cindex Content-Length, ignore
@cindex ignore length
@item --ignore-length
@section FTP Options
@table @samp
-@cindex symbolic links, retrieving
-@item --retr-symlinks
-Usually, when retrieving @sc{ftp} directories recursively and a symbolic
-link is encountered, the linked-to file is not downloaded. Instead, a
-matching symbolic link is created on the local filesystem. The
-pointed-to file will not be downloaded unless this recursive retrieval
-would have encountered it separately and downloaded it anyway.
-
-When @samp{--retr-symlinks} is specified, however, symbolic links are
-traversed and the pointed-to files are retrieved. At this time, this
-option does not cause Wget to traverse symlinks to directories and
-recurse through them, but in the future it should be enhanced to do
-this.
-
-Note that when retrieving a file (not a directory) because it was
-specified on the commandline, rather than because it was recursed to,
-this option has no effect. Symbolic links are always traversed in this
-case.
+@cindex .listing files, removing
+@item -nr
+@itemx --dont-remove-listing
+Don't remove the temporary @file{.listing} files generated by @sc{ftp}
+retrievals. Normally, these files contain the raw directory listings
+received from @sc{ftp} servers. Not removing them can be useful for
+debugging purposes, or when you want to be able to easily check on the
+contents of remote server directories (e.g. to verify that a mirror
+you're running is complete).
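+
+For example, to retrieve a directory tree recursively while keeping the
+listings around for later inspection (the URL is illustrative):
+
+@example
+wget -nr -r ftp://@var{server}/@var{dir}/
+@end example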
+
+Note that even though Wget writes to a known filename for this file,
+this is not a security hole in the scenario of a user making
+@file{.listing} a symbolic link to @file{/etc/passwd} or something and
+asking @code{root} to run Wget in his or her directory. Depending on
+the options used, either Wget will refuse to write to @file{.listing},
+making the globbing/recursion/time-stamping operation fail, or the
+symbolic link will be deleted and replaced with the actual
+@file{.listing} file, or the listing will be written to a
+@file{.listing.@var{number}} file.
+
+Even though this situation isn't a problem, @code{root} should
+never run Wget in a non-trusted user's directory. A user could do
+something as simple as linking @file{index.html} to @file{/etc/passwd}
+and asking @code{root} to run Wget with @samp{-N} or @samp{-r} so the file
+will be overwritten.
@cindex globbing, toggle
@item -g on/off
Use the @dfn{passive} @sc{ftp} retrieval scheme, in which the client
initiates the data connection. This is sometimes required for @sc{ftp}
to work behind firewalls.
+
+@cindex symbolic links, retrieving
+@item --retr-symlinks
+Usually, when retrieving @sc{ftp} directories recursively and a symbolic
+link is encountered, the linked-to file is not downloaded. Instead, a
+matching symbolic link is created on the local filesystem. The
+pointed-to file will not be downloaded unless this recursive retrieval
+would have encountered it separately and downloaded it anyway.
+
+When @samp{--retr-symlinks} is specified, however, symbolic links are
+traversed and the pointed-to files are retrieved. At this time, this
+option does not cause Wget to traverse symlinks to directories and
+recurse through them, but in the future it should be enhanced to do
+this.
+
+Note that when retrieving a file (not a directory) because it was
+specified on the commandline, rather than because it was recursed to,
+this option has no effect. Symbolic links are always traversed in this
+case.
@end table
@node Recursive Retrieval Options, Recursive Accept/Reject Options, FTP Options, Invoking
@cindex link conversion
@item -k
@itemx --convert-links
-Convert the non-relative links to relative ones locally. Only the
-references to the documents actually downloaded will be converted; the
-rest will be left unchanged.
+After the download is complete, convert the links in the document to
+make them suitable for local viewing. This affects not only the visible
+hyperlinks, but any part of the document that links to external content,
+such as embedded images, links to style sheets, hyperlinks to non-HTML
+content, etc.
+
+Each link will be changed in one of two ways:
+
+@itemize @bullet
+@item
+The links to files that have been downloaded by Wget will be changed to
+refer to the file they point to as a relative link.
+
+Example: if the downloaded file @file{/foo/doc.html} links to
+@file{/bar/img.gif}, also downloaded, then the link in @file{doc.html}
+will be modified to point to @samp{../bar/img.gif}. This kind of
+transformation works reliably for arbitrary combinations of directories.
+
+@item
+The links to files that have not been downloaded by Wget will be changed
+to include host name and absolute path of the location they point to.
+
+Example: if the downloaded file @file{/foo/doc.html} links to
+@file{/bar/img.gif} (or to @file{../bar/img.gif}), then the link in
+@file{doc.html} will be modified to point to
+@file{http://@var{hostname}/bar/img.gif}.
+@end itemize
+
+Because of this, local browsing works reliably: if a linked file was
+downloaded, the link will refer to its local name; if it was not
+downloaded, the link will refer to its full Internet address rather than
+presenting a broken link. The fact that the former links are converted
+to relative links ensures that you can move the downloaded hierarchy to
+another directory.
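+
+As a sketch, a typical invocation that combines recursion with link
+conversion might look like this (the URL is illustrative):
+
+@example
+wget -r -k http://@var{site}/
+@end example
+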
Note that only at the end of the download can Wget know which links have
-been downloaded. Because of that, much of the work done by @samp{-k}
-will be performed at the end of the downloads.
+been downloaded. Because of that, the work done by @samp{-k} will be
+performed at the end of all the downloads.
@cindex backing up converted files
@item -K
directory listings. It is currently equivalent to
@samp{-r -N -l inf -nr}.
-@item -nr
-@itemx --dont-remove-listing
-Don't remove the temporary @file{.listing} files generated by @sc{ftp}
-retrievals. Normally, these files contain the raw directory listings
-received from @sc{ftp} servers. Not removing them can be useful to
-access the full remote file list when running a mirror, or for debugging
-purposes.
-
@cindex page requisites
@cindex required images, downloading
@item -p
For instance, say document @file{1.html} contains an @code{<IMG>} tag
referencing @file{1.gif} and an @code{<A>} tag pointing to external
-document @file{2.html}. Say that @file{2.html} is the same but that its
+document @file{2.html}. Say that @file{2.html} is similar but that its
image is @file{2.gif} and it links to @file{3.html}. Say this
continues up to some arbitrarily high number.
this is not the case, because @samp{-l 0} is equivalent to
@samp{-l inf}---that is, infinite recursion. To download a single HTML
page (or a handful of them, all specified on the commandline or in a
-@samp{-i} @sc{url} input file) and its requisites, simply leave off
-@samp{-p} and @samp{-l}:
+@samp{-i} @sc{url} input file) and its (or their) requisites, simply leave off
+@samp{-r} and @samp{-l}:
@example
wget -p http://@var{site}/1.html
wget -E -H -k -K -nh -p http://@var{site}/@var{document}
@end example
+In one case you'll need to add a couple more options. If @var{document}
+is a @code{<FRAMESET>} page, the ``one more hop'' that @samp{-p} gives you
+won't be enough---you'll get the @code{<FRAME>} pages that are
+referenced, but you won't get @emph{their} requisites. Therefore, in
+this case you'll need to add @samp{-r -l1} to the commandline. The
+@samp{-r -l1} will recurse from the @code{<FRAMESET>} page to the
+@code{<FRAME>} pages, and the @samp{-p} will get their requisites. If
+you're already using a recursion level of 1 or more, you'll need to up
+it by one. In the future, @samp{-p} may be made smarter so that it'll
+do ``two more hops'' in the case of a @code{<FRAMESET>} page.
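+
+For instance, if @file{frames.html} is such a @code{<FRAMESET>} page, a
+sketch of the resulting command would be (the file name is
+illustrative):
+
+@example
+wget -p -r -l1 http://@var{site}/frames.html
+@end example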
+
To finish off this topic, it's worth knowing that Wget's idea of an
external document link is any URL specified in an @code{<A>} tag, an
@code{<AREA>} tag, or a @code{<LINK>} tag other than @code{<LINK
@end enumerate
To implement this, the program needs to be aware of the time of last
-modification of both remote and local files. Such information are
-called the @dfn{time-stamps}.
+modification of both local and remote files. We call this information the
+@dfn{time-stamp} of a file.
The time-stamping in GNU Wget is turned on using @samp{--timestamping}
(@samp{-N}) option, or through @code{timestamping = on} directive in
A simple @code{ls -l} shows that the time stamp on the local file equals
the state of the @code{Last-Modified} header, as returned by the server.
As you can see, the time-stamping info is preserved locally, even
-without @samp{-N}.
+without @samp{-N} (at least for @sc{http}).
Several days later, you would like Wget to check if the remote file has
changed, and download it if it has.
@end example
Wget will ask the server for the last-modified date. If the local file
-is newer, the remote file will not be re-fetched. However, if the remote
-file is more recent, Wget will proceed fetching it normally.
+has the same timestamp as the server, or a newer one, the remote file
+will not be re-fetched. However, if the remote file is more recent,
+Wget will proceed to fetch it.
The same goes for @sc{ftp}. For example:
@example
-wget ftp://ftp.ifi.uio.no/pub/emacs/gnus/*
+wget "ftp://ftp.ifi.uio.no/pub/emacs/gnus/*"
@end example
-@code{ls} will show that the timestamps are set according to the state
-on the remote server. Reissuing the command with @samp{-N} will make
-Wget re-fetch @emph{only} the files that have been modified.
+(The quotes around that URL are to prevent the shell from trying to
+interpret the @samp{*}.)
-In both @sc{http} and @sc{ftp} retrieval Wget will time-stamp the local
-file correctly (with or without @samp{-N}) if it gets the stamps,
-i.e. gets the directory listing for @sc{ftp} or the @code{Last-Modified}
-header for @sc{http}.
+After download, a local directory listing will show that the timestamps
+match those on the remote server. Reissuing the command with @samp{-N}
+will make Wget re-fetch @emph{only} the files that have been modified
+since the last download.
-If you wished to mirror the GNU archive every week, you would use the
-following command every week:
+If you wished to mirror the GNU archive every week, you would run a
+command like the following each week:
@example
-wget --timestamping -r ftp://prep.ai.mit.edu/pub/gnu/
+wget --timestamping -r ftp://ftp.gnu.org/pub/gnu/
@end example
+Note that time-stamping will only work for files for which the server
+gives a timestamp. For @sc{http}, this depends on getting a
+@code{Last-Modified} header. For @sc{ftp}, this depends on getting a
+directory listing with dates in a format that Wget can parse
+(@pxref{FTP Time-Stamping Internals}).
+
@node HTTP Time-Stamping Internals, FTP Time-Stamping Internals, Time-Stamping Usage, Time-Stamping
@section HTTP Time-Stamping Internals
@cindex http time-stamping
@cindex ftp time-stamping
In theory, @sc{ftp} time-stamping works much the same as @sc{http}, only
-@sc{ftp} has no headers---time-stamps must be received from the
-directory listings.
-
-For each directory files must be retrieved from, Wget will use the
-@code{LIST} command to get the listing. It will try to analyze the
-listing, assuming that it is a Unix @code{ls -l} listing, and extract
-the time-stamps. The rest is exactly the same as for @sc{http}.
+@sc{ftp} has no headers---time-stamps must be ferreted out of directory
+listings.
+
+If an @sc{ftp} download is recursive or uses globbing, Wget will use the
+@sc{ftp} @code{LIST} command to get a file listing for the directory
+containing the desired file(s). It will try to analyze the listing,
+treating it like Unix @code{ls -l} output, extracting the time-stamps.
+The rest is exactly the same as for @sc{http}. Note that when
+retrieving individual files from an @sc{ftp} server without using
+globbing or recursion, listing files will not be downloaded (and thus
+files will not be time-stamped) unless @samp{-N} is specified.
The assumption that every directory listing is a Unix-style listing may
sound extremely constraining, but in practice it is not, as many
Enable/disable host-prefixed file names. @samp{-nH} disables it.
@item continue = on/off
-Enable/disable continuation of the retrieval---the same as @samp{-c}
-(which enables it).
+If set to on, force continuation of preexistent partially retrieved
+files.  See the description of @samp{-c} before setting this directive.
@item background = on/off
Enable/disable going to background---the same as @samp{-b} (which
@item convert links = on/off
Convert non-relative links locally. The same as @samp{-k}.
+@item cookies = on/off
+When set to off, disallow cookies. See the @samp{--cookies} option.
+
+@item load_cookies = @var{file}
+Load cookies from @var{file}. See @samp{--load-cookies}.
+
+@item save_cookies = @var{file}
+Save cookies to @var{file}. See @samp{--save-cookies}.
+
@item cut_dirs = @var{n}
Ignore @var{n} remote directory components.