@c This should really be generated automatically, possibly by including
@c an auto-generated file.
-@set VERSION 1.8.1
-@set UPDATED December 2001
+@set VERSION 1.9-cvs
+@set UPDATED September 2003
@dircategory Net Utilities
@dircategory World Wide Web
Select the type of the progress indicator you wish to use. Legal
indicators are ``dot'' and ``bar''.
-The ``dot'' indicator is used by default. It traces the retrieval by
-printing dots on the screen, each dot representing a fixed amount of
-downloaded data.
+The ``bar'' indicator is used by default. It draws an ASCII progress
+bar (a.k.a. ``thermometer'' display) indicating the status of the
+retrieval. If the output is not a TTY, the ``dot'' indicator will be
+used by default.
+
+Use @samp{--progress=dot} to switch to the ``dot'' display. It traces
+the retrieval by printing dots on the screen, each dot representing a
+fixed amount of downloaded data.
When using the dotted retrieval, you may also set the @dfn{style} by
specifying the type as @samp{dot:@var{style}}. Different styles assign
different meaning to one dot. With the @code{default} style each dot
represents 1K, there are ten dots in a cluster and 50 dots in a line.
The @code{binary} style has a more ``computer''-like orientation---8K
dots, 16-dots clusters and 48 dots per line (which makes for 384K
lines). The @code{mega} style is suitable for downloading very large
files---each dot represents 64K retrieved, there are eight dots in a
cluster, and 48 dots on each line (so each line contains 3M).
-Specifying @samp{--progress=bar} will draw a nice ASCII progress bar
-graphics (a.k.a ``thermometer'' display) to indicate retrieval. If the
-output is not a TTY, this option will be ignored, and Wget will revert
-to the dot indicator. If you want to force the bar indicator, use
-@samp{--progress=bar:force}.
+Note that you can set the default style using the @code{progress}
+command in @file{.wgetrc}. That setting may be overridden from the
+command line. The exception is that, when the output is not a TTY, the
+``dot'' progress will be favored over ``bar''. To force the bar output,
+use @samp{--progress=bar:force}.
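+
+For example, to get the detailed ``mega'' dot display even when the
+output goes to a log file (the URL and file name here are illustrative):
+
+@example
+wget --progress=dot:mega -o log http://www.example.com/big-file.iso
+@end example
+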
@item -N
@itemx --timestamping
@cindex timeout
@item -T seconds
@itemx --timeout=@var{seconds}
-Set the read timeout to @var{seconds} seconds. Whenever a network read
-is issued, the file descriptor is checked for a timeout, which could
-otherwise leave a pending connection (uninterrupted read). The default
+Set the network timeout to @var{seconds} seconds. Whenever Wget
+connects to or reads from a remote host, it checks for a timeout and
+aborts the operation if the time expires. This prevents anomalous
+occurrences such as hanging reads or infinite connects. The default
timeout is 900 seconds (fifteen minutes). Setting timeout to 0 will
disable checking for timeouts.
Please do not lower the default timeout value with this option unless
you know what you are doing.
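+
+For example, to disable timeout checking entirely for a single run (the
+URL is illustrative):
+
+@example
+wget --timeout=0 http://www.example.com/
+@end example
+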
+@cindex bandwidth, limit
+@cindex rate, limit
+@cindex limit bandwidth
+@item --limit-rate=@var{amount}
+Limit the download speed to @var{amount} bytes per second. Amount may
+be expressed in bytes, kilobytes with the @samp{k} suffix, or megabytes
+with the @samp{m} suffix. For example, @samp{--limit-rate=20k} will
+limit the retrieval rate to 20KB/s. This option is useful when, for
+whatever reason, you don't want Wget to consume the entire available
+bandwidth.
+
+Note that Wget implements the limiting by sleeping the appropriate
+amount of time after a network read that took less time than specified
+by the rate. Eventually this strategy causes the TCP transfer to slow
+down to approximately the specified rate. However, it takes some time
+for this balance to be achieved, so don't be surprised if limiting the
+rate doesn't work with very small files. Also, the ``sleeping''
+strategy will misfire when an extremely small bandwidth, say less than
+1.5KB/s, is specified.
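+
+As an illustration, a polite recursive download might combine rate
+limiting with a delay between retrievals (the URL is illustrative):
+
+@example
+wget -r --wait=1 --limit-rate=20k http://www.example.com/
+@end example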
+
@cindex pause
@cindex wait
@item -w @var{seconds}
aborted when the quota is exceeded.
Setting quota to 0 or to @samp{inf} unlimits the download quota.
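+
+For example, to cap a recursive retrieval at 10 megabytes (the site is
+illustrative):
+
+@example
+wget -Q10m -r http://www.example.com/
+@end example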
+
+@cindex DNS cache
+@cindex caching of DNS lookups
+@item --dns-cache=off
+Turn off caching of DNS lookups. Normally, Wget remembers the addresses
+it looked up from DNS so it doesn't have to repeatedly contact the DNS
+server for the same (typically small) set of addresses it retrieves
+from. This cache exists in memory only; a new Wget run will contact DNS
+again.
+
+However, in some cases it is not desirable to cache host names, even for
+the duration of a short-running application like Wget. For example,
+some HTTP servers are hosted on machines with dynamically allocated IP
+addresses that change from time to time. Their DNS entries are updated
+along with each change. When Wget's download from such a host gets
+interrupted by an IP address change, Wget retries the download, but (due
+to DNS caching) it contacts the old address. With the DNS cache turned
+off, Wget will repeat the DNS lookup for every connect and will thus get
+the correct dynamic address every time---at the cost of additional DNS
+lookups where they're probably not needed.
+
+If you don't understand the above description, you probably won't need
+this option.
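+
+For instance, to resume an interrupted download from such a host with
+caching disabled (the URL is illustrative):
+
+@example
+wget --dns-cache=off -c http://www.example.com/big.tar.gz
+@end example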
@end table
@node Directory Options, HTTP Options, Download Options, Invoking
Turning dirstruct on or off---the same as @samp{-x} or @samp{-nd},
respectively.
+@item dns_cache = on/off
+Turn DNS caching on/off. Since DNS caching is on by default, this
+option is normally used to turn it off. Same as @samp{--dns-cache}.
+
@item domains = @var{string}
Same as @samp{-D} (@pxref{Spanning Hosts}).
as there is, provided there is more than or equal to the value in
@code{Content-Length}.
+@item limit_rate = @var{rate}
+Limit the download speed to no more than @var{rate} bytes per second.
+The same as @samp{--limit-rate}.
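+
+A sample @file{.wgetrc} line might thus read (the value is only an
+example):
+
+@example
+limit_rate = 20k
+@end example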
+
@item logfile = @var{string}
Set logfile---the same as @samp{-o}.
same as @samp{--retr-symlinks}.
@item robots = on/off
-Use (or not) @file{/robots.txt} file (@pxref{Robots}). Be sure to know
-what you are doing before changing the default (which is @samp{on}).
+Specify whether the norobots convention is respected by Wget, ``on'' by
+default. This switch controls both the @file{/robots.txt} and the
+@samp{nofollow} aspect of the spec. @xref{Robot Exclusion}, for more
+details about this. Be sure you know what you are doing before turning
+this off.
@item server_response = on/off
Choose whether or not to print the @sc{http} and @sc{ftp} server
This chapter contains some references I consider useful.
@menu
-* Robots:: Wget as a WWW robot.
+* Robot Exclusion:: Wget's support for RES.
* Security Considerations:: Security with Wget.
* Contributors:: People who helped.
@end menu
-@node Robots, Security Considerations, Appendices, Appendices
-@section Robots
-@cindex robots
+@node Robot Exclusion, Security Considerations, Appendices, Appendices
+@section Robot Exclusion
+@cindex robot exclusion
@cindex robots.txt
@cindex server maintenance
sucking all the available data in progress. @samp{wget -r @var{site}},
and you're set. Great? Not for the server admin.
-While Wget is retrieving static pages, there's not much of a problem.
-But for Wget, there is no real difference between a static page and the
-most demanding CGI. For instance, a site I know has a section handled
-by an, uh, @dfn{bitchin'} CGI script that converts all the Info files to
-HTML. The script can and does bring the machine to its knees without
-providing anything useful to the downloader.
-
-For such and similar cases various robot exclusion schemes have been
-devised as a means for the server administrators and document authors to
-protect chosen portions of their sites from the wandering of robots.
-
-The more popular mechanism is the @dfn{Robots Exclusion Standard}, or
-@sc{res}, written by Martijn Koster et al. in 1994. It specifies the
-format of a text file containing directives that instruct the robots
-which URL paths to avoid. To be found by the robots, the specifications
-must be placed in @file{/robots.txt} in the server root, which the
-robots are supposed to download and parse.
-
-Wget supports @sc{res} when downloading recursively. So, when you
-issue:
+As long as Wget is only retrieving static pages, and doing it at a
+reasonable rate (see the @samp{--wait} option), there's not much of a
+problem. The trouble is that Wget can't tell the difference between the
+smallest static page and the most demanding CGI. A site I know has a
+section handled by an, uh, @dfn{bitchin'} CGI Perl script that converts
+Info files to HTML on the fly. The script is slow, but works well
+enough for human users viewing an occasional Info file. However, when
+someone's recursive Wget download stumbles upon the index page that
+links to all the Info files through the script, the system is brought to
+its knees without providing anything useful to the downloader.
+
+To avoid this kind of accident, as well as to preserve privacy for
+documents that need to be protected from well-behaved robots, the
+concept of @dfn{robot exclusion} has been invented. The idea is that
+the server administrators and document authors can specify which
+portions of the site they wish to protect from the robots.
+
+The most popular mechanism, and the de facto standard supported by all
+the major robots, is the ``Robots Exclusion Standard'' (RES) written by
+Martijn Koster et al. in 1994. It specifies the format of a text file
+containing directives that instruct the robots which URL paths to avoid.
+To be found by the robots, the specifications must be placed in
+@file{/robots.txt} in the server root, which the robots are supposed to
+download and parse.
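+
+For illustration, a @file{/robots.txt} asking all robots to stay out of
+a (hypothetical) script directory could look like this:
+
+@example
+User-agent: *
+Disallow: /cgi-bin/
+@end example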
+
+Although Wget is not a web robot in the strictest sense of the word, it
+can download large parts of the site without the user's intervention to
+download an individual page. Because of that, Wget honors RES when
+downloading recursively. For instance, when you issue:
@example
wget -r http://www.server.com/
method of robot exclusion in addition to the usual @file{/robots.txt}
exclusion.
-@node Security Considerations, Contributors, Robots, Appendices
+If you know what you are doing and really really wish to turn off the
+robot exclusion, set the @code{robots} variable to @samp{off} in your
+@file{.wgetrc}. You can achieve the same effect from the command line
+using the @code{-e} switch, e.g. @samp{wget -e robots=off @var{url}...}.
+
+@node Security Considerations, Contributors, Robot Exclusion, Appendices
@section Security Considerations
@cindex security