@afourpaper
@end iftex
-@c This should really be auto-generated!
-@set VERSION 1.8
-@set UPDATED December 2001
+@c This should really be generated automatically, possibly by including
+@c an auto-generated file.
+@set VERSION 1.9-cvs
+@set UPDATED September 2003
@dircategory Net Utilities
@dircategory World Wide Web
@file{.netrc} file in your home directory, the password will also be
searched for there.}
+@strong{Important Note}: if you specify a password-containing @sc{url}
+on the command line, the username and password will be plainly visible
+to all users on the system, by way of @code{ps}. On multi-user systems,
+this is a big security risk. To work around it, use @code{wget -i -}
+and feed the @sc{url}s to Wget's standard input, each on a separate
+line, terminated by @kbd{C-d}.
+
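+For example, with a made-up host and credentials, a session might look
+like this (type @kbd{C-d} after the last @sc{url}):
+
+@example
+wget -i -
+http://user:password@@example.com/restricted/file.html
+@end example
+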
You can encode unsafe characters in a @sc{url} as @samp{%xy}, @code{xy}
being the hexadecimal representation of the character's @sc{ascii}
value. Some common unsafe characters include @samp{%} (quoted as
@samp{%25}), @samp{:} (quoted as @samp{%3A}), and @samp{@@} (quoted as
@samp{%40}).
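+
+For example, to retrieve a document whose name contains a space from a
+made-up host, you would quote the space as @samp{%20}:
+
+@example
+wget http://example.com/My%20Document.html
+@end example
+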
Select the type of the progress indicator you wish to use. Legal
indicators are ``dot'' and ``bar''.
-The ``dot'' indicator is used by default. It traces the retrieval by
-printing dots on the screen, each dot representing a fixed amount of
-downloaded data.
+The ``bar'' indicator is used by default. It draws an @sc{ascii}
+progress bar (a.k.a. ``thermometer'' display) indicating the status of
+retrieval. If the output is not a TTY, the ``dot'' indicator will be
+used by default.
+
+Use @samp{--progress=dot} to switch to the ``dot'' display. It traces
+the retrieval by printing dots on the screen, each dot representing a
+fixed amount of downloaded data.
When using the dotted retrieval, you may also set the @dfn{style} by
specifying the type as @samp{dot:@var{style}}. Different styles assign
different meanings to one dot. For instance, the @code{mega} style is
suitable for downloading very large files---each dot represents 64K
retrieved, there are eight dots in a cluster, and 48 dots on each line
(so each line contains 3M).
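+
+For example, a download of a large file from a made-up host, using the
+@code{mega} style:
+
+@example
+wget --progress=dot:mega http://example.com/big-file.iso
+@end example
+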
-Specifying @samp{--progress=bar} will draw a nice ASCII progress bar
-graphics (a.k.a ``thermometer'' display) to indicate retrieval. If the
-output is not a TTY, this option will be ignored, and Wget will revert
-to the dot indicator. If you want to force the bar indicator, use
-@samp{--progress=bar:force}.
+Note that you can set the default style using the @code{progress}
+command in @file{.wgetrc}. That setting may be overridden from the
+command line. The exception is that, when the output is not a TTY, the
+``dot'' progress will be favored over ``bar''. To force the bar output,
+use @samp{--progress=bar:force}.
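+
+For example, to keep the bar display even though the output goes to a
+log file rather than a TTY (the @sc{url} is made up):
+
+@example
+wget --progress=bar:force -o log http://example.com/file.zip
+@end example
+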
@item -N
@itemx --timestamping
@cindex timeout
@item -T @var{seconds}
@itemx --timeout=@var{seconds}
-Set the read timeout to @var{seconds} seconds. Whenever a network read
-is issued, the file descriptor is checked for a timeout, which could
-otherwise leave a pending connection (uninterrupted read). The default
+Set the network timeout to @var{seconds} seconds. Whenever Wget
+connects to or reads from a remote host, it checks for a timeout and
+aborts the operation if the time expires. This prevents anomalous
+occurrences such as hanging reads or infinite connects. The default
timeout is 900 seconds (fifteen minutes). Setting timeout to 0 will
disable checking for timeouts.
Please do not lower the default timeout value with this option unless
you know what you are doing.
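+
+For example, to disable timeout checking altogether when downloading
+from a very slow (made-up) host:
+
+@example
+wget --timeout=0 http://slow.example.com/big-file.tar.gz
+@end example
+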
+@cindex bandwidth, limit
+@cindex rate, limit
+@cindex limit bandwidth
+@item --limit-rate=@var{amount}
+Limit the download speed to @var{amount} bytes per second. Amount may
+be expressed in bytes, kilobytes with the @samp{k} suffix, or megabytes
+with the @samp{m} suffix. For example, @samp{--limit-rate=20k} will
+limit the retrieval rate to 20KB/s. This kind of thing is useful when,
+for whatever reason, you don't want Wget to consume the entire available
+bandwidth.
+
+Note that Wget implements the limiting by sleeping the appropriate
+amount of time after a network read that took less time than specified
+by the rate. Eventually this strategy causes the TCP transfer to slow
+down to approximately the specified rate. However, it takes some time
+for this balance to be achieved, so don't be surprised if limiting the
+rate doesn't work with very small files. Also, the ``sleeping''
+strategy will misfire when an extremely small bandwidth, say less than
+1.5KB/s, is specified.
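+
+For example, a complete command using the 20KB/s cap mentioned above
+(the @sc{url} is made up):
+
+@example
+wget --limit-rate=20k http://example.com/big-file.zip
+@end example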
+
@cindex pause
@cindex wait
@item -w @var{seconds}
aborted when the quota is exceeded.
Setting quota to 0 or to @samp{inf} unlimits the download quota.
+
+@cindex DNS cache
+@cindex caching of DNS lookups
+@item --dns-cache=off
+Turn off caching of DNS lookups. Normally, Wget remembers the addresses
+it looked up from DNS so it doesn't have to repeatedly contact the DNS
+server for the same (typically small) set of addresses it retrieves
+from. This cache exists in memory only; a new Wget run will contact DNS
+again.
+
+However, in some cases it is not desirable to cache host names, even for
+the duration of a short-running application like Wget. For example,
+some HTTP servers are hosted on machines with dynamically allocated IP
+addresses that change from time to time. Their DNS entries are updated
+along with each change. When Wget's download from such a host gets
+interrupted by IP address change, Wget retries the download, but (due to
+DNS caching) it contacts the old address. With the DNS cache turned
+off, Wget will repeat the DNS lookup for every connect and will thus get
+the correct dynamic address every time---at the cost of additional DNS
+lookups where they're probably not needed.
+
+If you don't understand the above description, you probably won't need
+this option.
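+
+For example (the host name is made up):
+
+@example
+wget --dns-cache=off http://dynamic.example.com/large-file.bin
+@end example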
@end table
@node Directory Options, HTTP Options, Download Options, Invoking
@code{digest} authentication scheme.
Another way to specify username and password is in the @sc{url} itself
-(@pxref{URL Format}). For more information about security issues with
-Wget, @xref{Security Considerations}.
+(@pxref{URL Format}). Either method reveals your password to anyone who
+bothers to run @code{ps}. To prevent the passwords from being seen,
+store them in @file{.wgetrc} or @file{.netrc}, and make sure to protect
+those files from other users with @code{chmod}. If the passwords are
+really important, do not leave them lying in those files either---edit
+the files and delete them after Wget has started the download.
+
+@xref{Security Considerations}, for more information about security
+issues with Wget.
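+
+For example, to make sure no other user can read your @file{.netrc}:
+
+@example
+chmod 600 ~/.netrc
+@end example
+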
@cindex proxy
@cindex cache
@cindex loading cookies
@cindex cookies, loading
@item --load-cookies @var{file}
-Load cookies from @var{file} before the first HTTP retrieval. The
-format of @var{file} is one used by Netscape and Mozilla, at least their
-Unix version.
+Load cookies from @var{file} before the first HTTP retrieval.
+@var{file} is a textual file in the format originally used by Netscape's
+@file{cookies.txt} file.
You will typically use this option when mirroring sites that require
that you be logged in to access some or all of their content. The login
browser sends when communicating with the site. This is achieved by
@samp{--load-cookies}---simply point Wget to the location of the
@file{cookies.txt} file, and it will send the same cookies your browser
-would send in the same situation. If you're using Netscape navigator,
-specify @samp{--load-cookies ~/.netscape/cookies.txt}. Mozilla keeps
-the cookies file somewhere under @file{~/.mozilla}, in the profile
-directory, but it's also named @file{cookies.txt}, the full path usually
-being @file{~/.mozilla/default/@var{some-weird-string}/cookies.txt}.
+would send in the same situation. Different browsers keep textual
+cookie files in different locations:
-If you're using Wget under a non-Unix OS, or if you are using a
-different browser, @samp{--load-cookies} will not work.
+@table @asis
+@item Netscape 4.x.
+The cookies are in @file{~/.netscape/cookies.txt}.
+
+@item Mozilla and Netscape 6.x.
+Mozilla's cookie file is also named @file{cookies.txt}, located
+somewhere under @file{~/.mozilla}, in the directory of your profile.
+The full path usually ends up looking somewhat like
+@file{~/.mozilla/default/@var{some-weird-string}/cookies.txt}.
+
+@item Internet Explorer.
+You can produce a cookie file Wget can use by using the File menu,
+Import and Export, Export Cookies. This has been tested with Internet
+Explorer 5; it is not guaranteed to work with earlier versions.
+
+@item Other browsers.
+If you are using a different browser to create your cookies,
+@samp{--load-cookies} will only work if you can locate or produce a
+cookie file in the Netscape format that Wget expects.
+@end table
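+
+For example, using the Netscape 4.x location from the table above to
+mirror a made-up members-only site:
+
+@example
+wget --load-cookies ~/.netscape/cookies.txt http://example.com/members/
+@end example
+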
-In that case you can view the cookies using the cookie manager provided
-by your browser, and write down the name and value of the cookie needed
-for the site. Then you can bypass the ``official'' cookie code and
-simply tell Wget to use that one cookie, like this:
+If you cannot use @samp{--load-cookies}, there might still be an
+alternative. If your browser supports a ``cookie manager'', you can use
+it to view the cookies used when accessing the site you're mirroring.
+Write down the name and value of the cookie, and manually instruct Wget
+to send those cookies, bypassing the ``official'' cookie support:
@example
wget --cookies=off --header "Cookie: @var{name}=@var{value}"
authentication on a proxy server. Wget will encode them using the
@code{basic} authentication scheme.
+Security considerations similar to those with @samp{--http-passwd}
+pertain here as well.
+
@cindex http referer
@cindex referer, http
@item --referer=@var{url}
Turning dirstruct on or off---the same as @samp{-x} or @samp{-nd},
respectively.
+@item dns_cache = on/off
+Turn DNS caching on/off. Since DNS caching is on by default, this
+option is normally used to turn it off. Same as @samp{--dns-cache}.
+
@item domains = @var{string}
Same as @samp{-D} (@pxref{Spanning Hosts}).
as there is, provided there is at least as much as the value in
@code{Content-Length}.
+@item limit_rate = @var{rate}
+Limit the download speed to no more than @var{rate} bytes per second.
+The same as @samp{--limit-rate}.
+
@item logfile = @var{string}
Set logfile---the same as @samp{-o}.
same as @samp{--retr-symlinks}.
@item robots = on/off
-Use (or not) @file{/robots.txt} file (@pxref{Robots}). Be sure to know
-what you are doing before changing the default (which is @samp{on}).
+Specify whether the norobots convention is respected by Wget, ``on'' by
+default. This switch controls both the @file{/robots.txt} and the
+@samp{nofollow} aspect of the spec. @xref{Robot Exclusion}, for more
+details about this. Be sure you know what you are doing before turning
+this off.
@item server_response = on/off
Choose whether or not to print the @sc{http} and @sc{ftp} server
responses---the same as @samp{-S}.
-@item simple_host_check = on/off
-Same as @samp{-nh} (@pxref{Spanning Hosts}).
-
@item span_hosts = on/off
Same as @samp{-H}.
wget ftp://hniksic:mypassword@@unix.server.com/.emacs
@end example
+Note, however, that this usage is not advisable on multi-user systems
+because it reveals your password to anyone who looks at the output of
+@code{ps}.
+
@cindex redirecting output
@item
You would like the output documents to go to standard output instead of
This chapter contains some references I consider useful.
@menu
-* Robots:: Wget as a WWW robot.
+* Robot Exclusion:: Wget's support for RES.
* Security Considerations:: Security with Wget.
* Contributors:: People who helped.
@end menu
-@node Robots, Security Considerations, Appendices, Appendices
-@section Robots
-@cindex robots
+@node Robot Exclusion, Security Considerations, Appendices, Appendices
+@section Robot Exclusion
+@cindex robot exclusion
@cindex robots.txt
@cindex server maintenance
sucking all the available data in progress. @samp{wget -r @var{site}},
and you're set. Great? Not for the server admin.
-While Wget is retrieving static pages, there's not much of a problem.
-But for Wget, there is no real difference between a static page and the
-most demanding CGI. For instance, a site I know has a section handled
-by an, uh, @dfn{bitchin'} CGI script that converts all the Info files to
-HTML. The script can and does bring the machine to its knees without
-providing anything useful to the downloader.
-
-For such and similar cases various robot exclusion schemes have been
-devised as a means for the server administrators and document authors to
-protect chosen portions of their sites from the wandering of robots.
-
-The more popular mechanism is the @dfn{Robots Exclusion Standard}, or
-@sc{res}, written by Martijn Koster et al. in 1994. It specifies the
-format of a text file containing directives that instruct the robots
-which URL paths to avoid. To be found by the robots, the specifications
-must be placed in @file{/robots.txt} in the server root, which the
-robots are supposed to download and parse.
-
-Wget supports @sc{res} when downloading recursively. So, when you
-issue:
+As long as Wget is only retrieving static pages, and doing it at a
+reasonable rate (see the @samp{--wait} option), there's not much of a
+problem. The trouble is that Wget can't tell the difference between the
+smallest static page and the most demanding CGI. A site I know has a
+section handled by an, uh, @dfn{bitchin'} CGI Perl script that converts
+Info files to HTML on the fly. The script is slow, but works well
+enough for human users viewing an occasional Info file. However, when
+someone's recursive Wget download stumbles upon the index page that
+links to all the Info files through the script, the system is brought to
+its knees without providing anything useful to the downloader.
+
+To avoid this kind of accident, as well as to preserve privacy for
+documents that need to be protected from well-behaved robots, the
+concept of @dfn{robot exclusion} has been invented. The idea is that
+the server administrators and document authors can specify which
+portions of the site they wish to protect from the robots.
+
+The most popular mechanism, and the de facto standard supported by all
+the major robots, is the ``Robots Exclusion Standard'' (RES) written by
+Martijn Koster et al. in 1994. It specifies the format of a text file
+containing directives that instruct the robots which URL paths to avoid.
+To be found by the robots, the specifications must be placed in
+@file{/robots.txt} in the server root, which the robots are supposed to
+download and parse.
+
+Although Wget is not a web robot in the strictest sense of the word, it
+can download large parts of the site without the user intervening for
+each individual page. Because of that, Wget honors RES when downloading
+recursively. For instance, when you issue:
@example
wget -r http://www.server.com/
Until version 1.8, Wget supported the first version of the standard,
written by Martijn Koster in 1994 and available at
-@url{http://info.webcrawler.com/mak/projects/robots/norobots.html}. As
-of version 1.8, Wget has supported the additional directives specified
-in the internet draft @samp{<draft-koster-robots-00.txt>} titled ``A
-Method for Web Robots Control''. The draft, which has as far as I know
-never made to an @sc{rfc}, is available at
-@url{http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html}.
+@url{http://www.robotstxt.org/wc/norobots.html}. As of version 1.8,
+Wget has supported the additional directives specified in the internet
+draft @samp{<draft-koster-robots-00.txt>} titled ``A Method for Web
+Robots Control''. The draft, which as far as I know has never made it
+to an @sc{rfc}, is available at
+@url{http://www.robotstxt.org/wc/norobots-rfc.txt}.
This manual no longer includes the text of the Robot Exclusion Standard.
@end example
This is explained in some detail at
-@url{http://info.webcrawler.com/mak/projects/robots/meta-user.html}.
-Wget supports this method of robot exclusion in addition to the usual
-@file{/robots.txt} exclusion.
+@url{http://www.robotstxt.org/wc/meta-user.html}. Wget supports this
+method of robot exclusion in addition to the usual @file{/robots.txt}
+exclusion.
-@node Security Considerations, Contributors, Robots, Appendices
+If you know what you are doing and really really wish to turn off the
+robot exclusion, set the @code{robots} variable to @samp{off} in your
+@file{.wgetrc}. You can achieve the same effect from the command line
+using the @code{-e} switch, e.g. @samp{wget -e robots=off @var{url}...}.
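+
+In @file{.wgetrc}, that would be a line like this:
+
+@example
+robots = off
+@end example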
+
+@node Security Considerations, Contributors, Robot Exclusion, Appendices
@section Security Considerations
@cindex security
main issues, and some solutions.
@enumerate
-@item
-The passwords on the command line are visible using @code{ps}. If this
-is a problem, avoid putting passwords from the command line---e.g. you
-can use @file{.netrc} for this.
+@item The passwords on the command line are visible using @code{ps}.
+The best way around it is to use @code{wget -i -} and feed the @sc{url}s
+to Wget's standard input, each on a separate line, terminated by
+@kbd{C-d}. Another workaround is to use @file{.netrc} to store
+passwords; however, storing unencrypted passwords is also considered a
+security risk.
@item
Using the insecure @dfn{basic} authentication scheme, unencrypted