@afourpaper
@end iftex
-@c This should really be auto-generated!
-@set VERSION 1.8-dev
-@set UPDATED November 2001
+@c This should really be generated automatically, possibly by including
+@c an auto-generated file.
+@set VERSION 1.9-cvs
+@set UPDATED September 2003
@dircategory Net Utilities
@dircategory World Wide Web
@cindex features
@c man begin DESCRIPTION
-GNU Wget is a freely available network utility to retrieve files from
-the World Wide Web, using @sc{http} (Hyper Text Transfer Protocol) and
-@sc{ftp} (File Transfer Protocol), the two most widely used Internet
-protocols. It has many useful features to make downloading easier, some
-of them being:
+GNU Wget is a free utility for non-interactive download of files from
+the Web. It supports @sc{http}, @sc{https}, and @sc{ftp} protocols, as
+well as retrieval through @sc{http} proxies.
+
+@c man end
+This chapter is a partial overview of Wget's features.
@itemize @bullet
@item
+@c man begin DESCRIPTION
Wget is non-interactive, meaning that it can work in the background,
while the user is not logged on.  This allows you to start a retrieval
and disconnect from the system, letting Wget finish the work.  By
contrast, most of the Web browsers require constant user's presence,
which can be a great hindrance when transferring a lot of data.
@c man end
@sp 1
-@c man begin DESCRIPTION
@item
-Wget is capable of descending recursively through the structure of
-@sc{html} documents and @sc{ftp} directory trees, making a local copy of
-the directory hierarchy similar to the one on the remote server. This
-feature can be used to mirror archives and home pages, or traverse the
-web in search of data, like a @sc{www} robot (@pxref{Robots}). In that
-spirit, Wget understands the @code{norobots} convention.
+@ignore
+@c man begin DESCRIPTION
+
+@c man end
+@end ignore
+@c man begin DESCRIPTION
+Wget can follow links in @sc{html} pages and create local versions of
+remote web sites, fully recreating the directory structure of the
+original site. This is sometimes referred to as ``recursive
+downloading.'' While doing that, Wget respects the Robot Exclusion
+Standard (@file{/robots.txt}). Wget can be instructed to convert the
+links in downloaded @sc{html} files to the local files for offline
+viewing.
@c man end
@sp 1
-@c man begin DESCRIPTION
@item
File name wildcard matching and recursive mirroring of directories are
available when retrieving via @sc{ftp}.  Wget can read the time-stamp
information given by both @sc{http} and @sc{ftp} servers, and store it
locally.  Thus Wget can see if the remote file has changed since last
retrieval, and automatically retrieve the new version if it has.  This
makes Wget suitable for mirroring of @sc{ftp} sites, as well as home
pages.
-@c man end
@sp 1
-@c man begin DESCRIPTION
@item
-Wget works exceedingly well on slow or unstable connections,
-retrying the document until it is fully retrieved, or until a
-user-specified retry count is surpassed. It will try to resume the
-download from the point of interruption, using @code{REST} with @sc{ftp}
-and @code{Range} with @sc{http} servers that support them.
+@ignore
+@c man begin DESCRIPTION
+
+@c man end
+@end ignore
+@c man begin DESCRIPTION
+Wget has been designed for robustness over slow or unstable network
+connections; if a download fails due to a network problem, it will
+keep retrying until the whole file has been retrieved. If the server
+supports regetting, it will instruct the server to continue the
+download from where it left off.
@c man end
@sp 1
-@c man begin DESCRIPTION
@item
-By default, Wget supports proxy servers, which can lighten the network
-load, speed up retrieval and provide access behind firewalls. However,
-if you are behind a firewall that requires that you use a socks style
-gateway, you can get the socks library and build Wget with support for
-socks. Wget also supports the passive @sc{ftp} downloading as an
-option.
-@c man end
+Wget supports proxy servers, which can lighten the network load, speed
+up retrieval and provide access behind firewalls.  However, if you are
+behind a firewall that requires a SOCKS-style gateway, you can get the
+SOCKS library and build Wget with support for SOCKS.  Wget also
+supports passive @sc{ftp} downloading as an option.
@sp 1
-@c man begin DESCRIPTION
@item
Built-in features offer mechanisms to tune which links you wish to
follow (@pxref{Following Links}).
-@c man end
@sp 1
-@c man begin DESCRIPTION
@item
The retrieval is conveniently traced with printing dots, each dot
representing a fixed amount of data received (1KB by default). These
representations can be customized to your preferences.
-@c man end
@sp 1
-@c man begin DESCRIPTION
@item
Most of the features are fully configurable, either through command line
options, or via the initialization file @file{.wgetrc} (@pxref{Startup
File}). Wget allows you to define @dfn{global} startup files
(@file{/usr/local/etc/wgetrc} by default) for site settings.
-@c man end
@ignore
@c man begin FILES
@end ignore
@sp 1
-@c man begin DESCRIPTION
@item
Finally, GNU Wget is free software. This means that everyone may use
it, redistribute it and/or modify it under the terms of the GNU General
Public License, as published by the Free Software Foundation
(@pxref{Copying}).
@end itemize
-@c man end
@node Invoking, Recursive Retrieval, Overview, Top
@chapter Invoking
@file{.netrc} file in your home directory, password will also be
searched for there.}
+@strong{Important Note}: if you specify a password-containing @sc{url}
+on the command line, the username and password will be plainly visible
+to all users on the system, by way of @code{ps}. On multi-user systems,
+this is a big security risk. To work around it, use @code{wget -i -}
+and feed the @sc{url}s to Wget's standard input, each on a separate
+line, terminated by @kbd{C-d}.
+
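That workaround can be sketched as follows; the host name, credentials,
and the helper function are invented for illustration:

```shell
# Hypothetical helper: print password-bearing URLs on standard output.
# Piped into `wget -i -`, the URLs never appear in `ps` output, because
# they are never part of any command line.
urls() {
  printf '%s\n' 'ftp://user:secret@ftp.example.com/one.txt'
  printf '%s\n' 'ftp://user:secret@ftp.example.com/two.txt'
}

urls    # in actual use:  urls | wget -i -
```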
You can encode unsafe characters in a @sc{url} as @samp{%xy}, @code{xy}
being the hexadecimal representation of the character's @sc{ascii}
value. Some common unsafe characters include @samp{%} (quoted as
Select the type of the progress indicator you wish to use. Legal
indicators are ``dot'' and ``bar''.
-The ``dot'' indicator is used by default. It traces the retrieval by
-printing dots on the screen, each dot representing a fixed amount of
-downloaded data.
+The ``bar'' indicator is used by default.  It draws an ASCII progress
+bar (also known as the ``thermometer'' display) indicating the status
+of the retrieval.  If the output is not a TTY, the ``dot'' indicator
+will be used by default.
+
+Use @samp{--progress=dot} to switch to the ``dot'' display. It traces
+the retrieval by printing dots on the screen, each dot representing a
+fixed amount of downloaded data.
When using the dotted retrieval, you may also set the @dfn{style} by
specifying the type as @samp{dot:@var{style}}.  Different styles assign
different meaning to one dot.  The @code{mega} style, for instance, is
suitable for downloading very large files---each dot represents 64K
retrieved, there are eight dots in a
cluster, and 48 dots on each line (so each line contains 3M).
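The per-line arithmetic is easy to verify in the shell:

```shell
# 48 dots per line, 64 KB per dot:
echo "$((48 * 64)) KB per line"    # prints "3072 KB per line", i.e. 3 MB
```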
-Specifying @samp{--progress=bar} will draw a nice ASCII progress bar
-graphics (a.k.a ``thermometer'' display) to indicate retrieval. If the
-output is not a TTY, this option will be ignored, and Wget will revert
-to the dot indicator. If you want to force the bar indicator, use
-@samp{--progress=bar:force}.
+Note that you can set the default style using the @code{progress}
+command in @file{.wgetrc}. That setting may be overridden from the
+command line. The exception is that, when the output is not a TTY, the
+``dot'' progress will be favored over ``bar''. To force the bar output,
+use @samp{--progress=bar:force}.
@item -N
@itemx --timestamping
@cindex timeout
@item -T seconds
@itemx --timeout=@var{seconds}
-Set the read timeout to @var{seconds} seconds. Whenever a network read
-is issued, the file descriptor is checked for a timeout, which could
-otherwise leave a pending connection (uninterrupted read). The default
+Set the network timeout to @var{seconds} seconds. Whenever Wget
+connects to or reads from a remote host, it checks for a timeout and
+aborts the operation if the time expires. This prevents anomalous
+occurrences such as hanging reads or infinite connects. The default
timeout is 900 seconds (fifteen minutes). Setting timeout to 0 will
disable checking for timeouts.
Please do not lower the default timeout value with this option unless
you know what you are doing.
+@cindex bandwidth, limit
+@cindex rate, limit
+@cindex limit bandwidth
+@item --limit-rate=@var{amount}
+Limit the download speed to @var{amount} bytes per second. Amount may
+be expressed in bytes, kilobytes with the @samp{k} suffix, or megabytes
+with the @samp{m} suffix. For example, @samp{--limit-rate=20k} will
+limit the retrieval rate to 20KB/s.  This is useful when, for whatever
+reason, you don't want Wget to consume the entire available bandwidth.
+
+Note that Wget implements the limiting by sleeping the appropriate
+amount of time after a network read that took less time than specified
+by the rate.  Eventually this strategy causes the TCP transfer to slow
+down to approximately the specified rate.  However, it takes some time
+for this balance to be achieved, so don't be surprised if limiting the
+rate doesn't work with very small files.  Also, the ``sleeping''
+strategy will misfire when an extremely small bandwidth, say less than
+1.5KB/s, is specified.
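The sleeping strategy can be sketched in shell arithmetic; the chunk
size, read time, and limit below are invented numbers, not Wget's
internals:

```shell
# After reading `chunk` bytes in `elapsed_ms` milliseconds, sleep long
# enough that the average rate stays at or below `limit` bytes/second.
limit=20480        # --limit-rate=20k
chunk=8192         # bytes returned by one network read
elapsed_ms=100     # how long that read actually took

needed_ms=$(( chunk * 1000 / limit ))   # how long it should have taken
sleep_ms=$(( needed_ms - elapsed_ms ))
if [ "$sleep_ms" -gt 0 ]; then
  echo "sleep for ${sleep_ms} ms"      # prints "sleep for 300 ms"
fi
```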
+
@cindex pause
@cindex wait
@item -w @var{seconds}
aborted when the quota is exceeded.
Setting quota to 0 or to @samp{inf} unlimits the download quota.
+
+@cindex DNS cache
+@cindex caching of DNS lookups
+@item --dns-cache=off
+Turn off caching of DNS lookups. Normally, Wget remembers the addresses
+it looked up from DNS so it doesn't have to repeatedly contact the DNS
+server for the same (typically small) set of addresses it retrieves
+from. This cache exists in memory only; a new Wget run will contact DNS
+again.
+
+However, in some cases it is not desirable to cache host names, even for
+the duration of a short-running application like Wget. For example,
+some HTTP servers are hosted on machines with dynamically allocated IP
+addresses that change from time to time. Their DNS entries are updated
+along with each change. When Wget's download from such a host gets
+interrupted by IP address change, Wget retries the download, but (due to
+DNS caching) it contacts the old address. With the DNS cache turned
+off, Wget will repeat the DNS lookup for every connect and will thus get
+the correct dynamic address every time---at the cost of additional DNS
+lookups where they're probably not needed.
+
+If you don't understand the above description, you probably won't need
+this option.
@end table
@node Directory Options, HTTP Options, Download Options, Invoking
@code{digest} authentication scheme.
Another way to specify username and password is in the @sc{url} itself
-(@pxref{URL Format}). For more information about security issues with
-Wget, @xref{Security Considerations}.
+(@pxref{URL Format}). Either method reveals your password to anyone who
+bothers to run @code{ps}. To prevent the passwords from being seen,
+store them in @file{.wgetrc} or @file{.netrc}, and make sure to protect
+those files from other users with @code{chmod}. If the passwords are
+really important, do not leave them lying in those files either---edit
+the files and delete them after Wget has started the download.
+
+@xref{Security Considerations}, for more information about security
+issues with Wget.
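As a sketch of that precaution, using a scratch file in the current
directory to stand in for @file{~/.netrc} (the machine name and
credentials are invented):

```shell
# Create a stand-in for ~/.netrc and make it unreadable to other users.
netrc=./netrc.demo
cat > "$netrc" <<'EOF'
machine ftp.example.com login user password secret
EOF
chmod 600 "$netrc"
ls -l "$netrc" | cut -c1-10    # prints "-rw-------"
```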
@cindex proxy
@cindex cache
@cindex loading cookies
@cindex cookies, loading
@item --load-cookies @var{file}
-Load cookies from @var{file} before the first HTTP retrieval. The
-format of @var{file} is one used by Netscape and Mozilla, at least their
-Unix version.
+Load cookies from @var{file} before the first HTTP retrieval.
+@var{file} is a textual file in the format originally used by Netscape's
+@file{cookies.txt} file.
+
+You will typically use this option when mirroring sites that require
+that you be logged in to access some or all of their content. The login
+process typically works by the web server issuing an @sc{http} cookie
+upon receiving and verifying your credentials. The cookie is then
+resent by the browser when accessing that part of the site, and so
+proves your identity.
+
+Mirroring such a site requires Wget to send the same cookies your
+browser sends when communicating with the site. This is achieved by
+@samp{--load-cookies}---simply point Wget to the location of the
+@file{cookies.txt} file, and it will send the same cookies your browser
+would send in the same situation. Different browsers keep textual
+cookie files in different locations:
+
+@table @asis
+@item Netscape 4.x.
+The cookies are in @file{~/.netscape/cookies.txt}.
+
+@item Mozilla and Netscape 6.x.
+Mozilla's cookie file is also named @file{cookies.txt}, located
+somewhere under @file{~/.mozilla}, in the directory of your profile.
+The full path usually ends up looking somewhat like
+@file{~/.mozilla/default/@var{some-weird-string}/cookies.txt}.
+
+@item Internet Explorer.
+You can produce a cookie file Wget can use by using the File menu,
+Import and Export, Export Cookies. This has been tested with Internet
+Explorer 5; it is not guaranteed to work with earlier versions.
+
+@item Other browsers.
+If you are using a different browser to create your cookies,
+@samp{--load-cookies} will only work if you can locate or produce a
+cookie file in the Netscape format that Wget expects.
+@end table
+
+If you cannot use @samp{--load-cookies}, there might still be an
+alternative. If your browser supports a ``cookie manager'', you can use
+it to view the cookies used when accessing the site you're mirroring.
+Write down the name and value of the cookie, and manually instruct Wget
+to send those cookies, bypassing the ``official'' cookie support:
+
+@example
+wget --cookies=off --header "Cookie: @var{name}=@var{value}"
+@end example
@cindex saving cookies
@cindex cookies, saving
authentication on a proxy server. Wget will encode them using the
@code{basic} authentication scheme.
+Security considerations similar to those with @samp{--http-passwd}
+pertain here as well.
+
@cindex http referer
@cindex referer, http
@item --referer=@var{url}
wget -E -H -k -K -p http://@var{site}/@var{document}
@end example
-In one case you'll need to add a couple more options. If @var{document}
-is a @code{<FRAMESET>} page, the "one more hop" that @samp{-p} gives you
-won't be enough---you'll get the @code{<FRAME>} pages that are
-referenced, but you won't get @emph{their} requisites. Therefore, in
-this case you'll need to add @samp{-r -l1} to the commandline. The
-@samp{-r -l1} will recurse from the @code{<FRAMESET>} page to to the
-@code{<FRAME>} pages, and the @samp{-p} will get their requisites. If
-you're already using a recursion level of 1 or more, you'll need to up
-it by one. In the future, @samp{-p} may be made smarter so that it'll
-do "two more hops" in the case of a @code{<FRAMESET>} page.
-
To finish off this topic, it's worth knowing that Wget's idea of an
external document link is any URL specified in an @code{<A>} tag, an
@code{<AREA>} tag, or a @code{<LINK>} tag other than @code{<LINK
Turning dirstruct on or off---the same as @samp{-x} or @samp{-nd},
respectively.
+@item dns_cache = on/off
+Turn DNS caching on/off. Since DNS caching is on by default, this
+option is normally used to turn it off. Same as @samp{--dns-cache}.
+
@item domains = @var{string}
Same as @samp{-D} (@pxref{Spanning Hosts}).
as there is, provided there is more than or equal to the value in
@code{Content-Length}.
+@item limit_rate = @var{rate}
+Limit the download speed to no more than @var{rate} bytes per second.
+The same as @samp{--limit-rate}.
+
@item logfile = @var{string}
Set logfile---the same as @samp{-o}.
same as @samp{--retr-symlinks}.
@item robots = on/off
-Use (or not) @file{/robots.txt} file (@pxref{Robots}). Be sure to know
-what you are doing before changing the default (which is @samp{on}).
+Specify whether the norobots convention is respected by Wget, ``on'' by
+default. This switch controls both the @file{/robots.txt} and the
+@samp{nofollow} aspect of the spec. @xref{Robot Exclusion}, for more
+details about this. Be sure you know what you are doing before turning
+this off.
@item server_response = on/off
Choose whether or not to print the @sc{http} and @sc{ftp} server
responses---the same as @samp{-S}.
-@item simple_host_check = on/off
-Same as @samp{-nh} (@pxref{Spanning Hosts}).
-
@item span_hosts = on/off
Same as @samp{-H}.
@chapter Examples
@cindex examples
-The examples are classified into three sections, because of clarity.
-The first section is a tutorial for beginners. The second section
-explains some of the more complex program features. The third section
-contains advice for mirror administrators, as well as even more complex
-features (that some would call perverted).
+@c man begin EXAMPLES
+The examples are divided into three sections loosely based on their
+complexity.
@menu
-* Simple Usage:: Simple, basic usage of the program.
-* Advanced Usage:: Advanced techniques of usage.
-* Guru Usage:: Mirroring and the hairy stuff.
+* Simple Usage:: Simple, basic usage of the program.
+* Advanced Usage:: Advanced tips.
+* Very Advanced Usage:: The hairy stuff.
@end menu
@node Simple Usage, Advanced Usage, Examples, Examples
wget http://fly.srk.fer.hr/
@end example
-The response will be something like:
-
-@example
-@group
---13:30:45-- http://fly.srk.fer.hr:80/en/
- => `index.html'
-Connecting to fly.srk.fer.hr:80... connected!
-HTTP request sent, awaiting response... 200 OK
-Length: 4,694 [text/html]
-
- 0K -> .... [100%]
-
-13:30:46 (23.75 KB/s) - `index.html' saved [4694/4694]
-@end group
-@end example
-
@item
But what will happen if the connection is slow, and the file is lengthy?
The connection will probably fail before the whole file is retrieved,
password.
@example
-@group
-$ wget ftp://gnjilux.srk.fer.hr/welcome.msg
---10:08:47-- ftp://gnjilux.srk.fer.hr:21/welcome.msg
- => `welcome.msg'
-Connecting to gnjilux.srk.fer.hr:21... connected!
-Logging in as anonymous ... Logged in!
-==> TYPE I ... done. ==> CWD not needed.
-==> PORT ... done. ==> RETR welcome.msg ... done.
-Length: 1,340 (unauthoritative)
-
- 0K -> . [100%]
-
-10:08:48 (1.28 MB/s) - `welcome.msg' saved [1340]
-@end group
+wget ftp://gnjilux.srk.fer.hr/welcome.msg
@end example
@item
@example
wget ftp://prep.ai.mit.edu/pub/gnu/
-lynx index.html
+links index.html
@end example
@end itemize
-@node Advanced Usage, Guru Usage, Simple Usage, Examples
+@node Advanced Usage, Very Advanced Usage, Simple Usage, Examples
@section Advanced Usage
@itemize @bullet
@item
-You would like to read the list of @sc{url}s from a file? Not a problem
-with that:
+You have a file that contains the URLs you want to download? Use the
+@samp{-i} switch:
@example
-wget -i file
+wget -i @var{file}
@end example
If you specify @samp{-} as file name, the @sc{url}s will be read from
standard input.
@item
-Create a mirror image of GNU @sc{www} site (with the same directory structure
-the original has) with only one try per document, saving the log of the
-activities to @file{gnulog}:
+Create a five levels deep mirror image of the GNU web site, with the
+same directory structure the original has, saving the log of the
+activities to @file{gnulog}:
@example
-wget -r -t1 http://www.gnu.ai.mit.edu/ -o gnulog
+wget -r http://www.gnu.org/ -o gnulog
@end example
@item
-Retrieve the first layer of yahoo links:
+The same as the above, but convert the links in the @sc{html} files to
+point to local files, so you can view the documents off-line:
@example
-wget -r -l1 http://www.yahoo.com/
+wget --convert-links -r http://www.gnu.org/ -o gnulog
+@end example
+
+@item
+Retrieve only one HTML page, but make sure that all the elements needed
+for the page to be displayed, such as inline images and external style
+sheets, are also downloaded.  Also make sure the downloaded page
+references the downloaded files rather than their remote counterparts.
+
+@example
+wget -p --convert-links http://www.server.com/dir/page.html
+@end example
+
+The HTML page will be saved to @file{www.server.com/dir/page.html}, and
+the images, stylesheets, etc., somewhere under @file{www.server.com/},
+depending on where they were on the remote server.
+
+@item
+The same as the above, but without the @file{www.server.com/} directory.
+In fact, I don't want to have all those random server directories
+anyway---just save @emph{all} those files under a @file{download/}
+subdirectory of the current directory.
+
+@example
+wget -p --convert-links -nH -nd -Pdownload \
+ http://www.server.com/dir/page.html
@end example
@item
@end example
@item
-Save the server headers with the file:
+Save the server headers with the file, perhaps for post-processing.
+
@example
wget -s http://www.lycos.com/
more index.html
@item
Retrieve the first two levels of @samp{wuarchive.wustl.edu}, saving them
-to /tmp.
+to @file{/tmp}.
@example
-wget -P/tmp -l2 ftp://wuarchive.wustl.edu/
+wget -r -l2 -P/tmp ftp://wuarchive.wustl.edu/
@end example
@item
-You want to download all the @sc{gif}s from an @sc{http} directory.
-@samp{wget http://host/dir/*.gif} doesn't work, since @sc{http}
-retrieval does not support globbing. In that case, use:
+You want to download all the @sc{gif}s from a directory on an @sc{http}
+server. You tried @samp{wget http://www.server.com/dir/*.gif}, but that
+didn't work because @sc{http} retrieval does not support globbing. In
+that case, use:
@example
-wget -r -l1 --no-parent -A.gif http://host/dir/
+wget -r -l1 --no-parent -A.gif http://www.server.com/dir/
@end example
-It is a bit of a kludge, but it works. @samp{-r -l1} means to retrieve
-recursively (@pxref{Recursive Retrieval}), with maximum depth of 1.
-@samp{--no-parent} means that references to the parent directory are
-ignored (@pxref{Directory-Based Limits}), and @samp{-A.gif} means to
+More verbose, but the effect is the same. @samp{-r -l1} means to
+retrieve recursively (@pxref{Recursive Retrieval}), with maximum depth
+of 1. @samp{--no-parent} means that references to the parent directory
+are ignored (@pxref{Directory-Based Limits}), and @samp{-A.gif} means to
download only the @sc{gif} files. @samp{-A "*.gif"} would have worked
too.
It would be:
@example
-wget -nc -r http://www.gnu.ai.mit.edu/
+wget -nc -r http://www.gnu.org/
@end example
@item
@sc{ftp}, use the appropriate @sc{url} syntax (@pxref{URL Format}).
@example
-wget ftp://hniksic:mypassword@@jagor.srce.hr/.emacs
+wget ftp://hniksic:mypassword@@unix.server.com/.emacs
@end example
+Note, however, that this usage is not advisable on multi-user systems
+because it reveals your password to anyone who looks at the output of
+@code{ps}.
+
+@cindex redirecting output
@item
-If you do not like the default retrieval visualization (1K dots with 10
-dots per cluster and 50 dots per line), you can customize it through dot
-settings (@pxref{Wgetrc Commands}). For example, many people like the
-``binary'' style of retrieval, with 8K dots and 512K lines:
+You would like the output documents to go to standard output instead of
+to files?
@example
-wget --dot-style=binary ftp://prep.ai.mit.edu/pub/gnu/README
+wget -O - http://jagor.srce.hr/ http://www.srce.hr/
@end example
-You can experiment with other styles, like:
+You can also combine the two options and make pipelines to retrieve the
+documents from remote hotlists:
@example
-wget --dot-style=mega ftp://ftp.xemacs.org/pub/xemacs/xemacs-20.4/xemacs-20.4.tar.gz
-wget --dot-style=micro http://fly.srk.fer.hr/
+wget -O - http://cool.list.com/ | wget --force-html -i -
@end example
-
-To make these settings permanent, put them in your @file{.wgetrc}, as
-described before (@pxref{Sample Wgetrc}).
@end itemize
-@node Guru Usage, , Advanced Usage, Examples
-@section Guru Usage
+@node Very Advanced Usage, , Advanced Usage, Examples
+@section Very Advanced Usage
@cindex mirroring
@itemize @bullet
@item
If you wish Wget to keep a mirror of a page (or @sc{ftp}
subdirectories), use @samp{--mirror} (@samp{-m}), which is the shorthand
-for @samp{-r -N}. You can put Wget in the crontab file asking it to
-recheck a site each Sunday:
+for @samp{-r -l inf -N}. You can put Wget in the crontab file asking it
+to recheck a site each Sunday:
@example
crontab
-0 0 * * 0 wget --mirror ftp://ftp.xemacs.org/pub/xemacs/ -o /home/me/weeklog
+0 0 * * 0 wget --mirror http://www.gnu.org/ -o /home/me/weeklog
@end example
@item
-You may wish to do the same with someone's home page. But you do not
-want to download all those images---you're only interested in @sc{html}.
+In addition to the above, you want the links to be converted for local
+viewing. But, after having read this manual, you know that link
+conversion doesn't play well with timestamping, so you also want Wget to
+back up the original HTML files before the conversion. Wget invocation
+would look like this:
@example
-wget --mirror -A.html http://www.w3.org/
+wget --mirror --convert-links --backup-converted \
+ http://www.gnu.org/ -o /home/me/weeklog
@end example
@item
-You have a presentation and would like the dumb absolute links to be
-converted to relative? Use @samp{-k}:
+But you've also noticed that local viewing doesn't work all that well
+when HTML files are saved under extensions other than @samp{.html},
+perhaps because they were served as @file{index.cgi}. So you'd like
+Wget to rename all the files served with content-type @samp{text/html}
+to @file{@var{name}.html}.
@example
-wget -k -r @var{URL}
+wget --mirror --convert-links --backup-converted \
+ --html-extension -o /home/me/weeklog \
+ http://www.gnu.org/
@end example
-@cindex redirecting output
-@item
-You would like the output documents to go to standard output instead of
-to files? OK, but Wget will automatically shut up (turn on
-@samp{--quiet}) to prevent mixing of Wget output and the retrieved
-documents.
+Or, with less typing:
@example
-wget -O - http://jagor.srce.hr/ http://www.srce.hr/
-@end example
-
-You can also combine the two options and make weird pipelines to
-retrieve the documents from remote hotlists:
-
-@example
-wget -O - http://cool.list.com/ | wget --force-html -i -
+wget -m -k -K -E http://www.gnu.org/ -o /home/me/weeklog
@end example
@end itemize
+@c man end
@node Various, Appendices, Examples, Top
@chapter Various
@item
Try to repeat the bug in as simple circumstances as possible. E.g. if
-Wget crashes on @samp{wget -rLl0 -t5 -Y0 http://yoyodyne.com -o
-/tmp/log}, you should try to see if it will crash with a simpler set of
-options.
+Wget crashes on @samp{wget -rl0 -kKE -t5 -Y0
+http://yoyodyne.com -o /tmp/log}, you should try to see if the crash is
+repeatable, and if it will occur with a simpler set of options.  You
+might even try to start the download at the page where the crash
+occurred to see if that page somehow triggered the crash.
Also, while I will probably be interested to know the contents of your
@file{.wgetrc} file, just dumping it into the debug message is probably
a bad idea. Instead, you should first try to see if the bug repeats
with @file{.wgetrc} moved out of the way. Only if it turns out that
-@file{.wgetrc} settings affect the bug, should you mail me the relevant
-parts of the file.
+@file{.wgetrc} settings affect the bug, mail me the relevant parts of
+the file.
@item
Please start Wget with @samp{-d} option and send the log (or the
@item
If Wget has crashed, try to run it in a debugger, e.g. @code{gdb `which
wget` core} and type @code{where} to get the backtrace.
-
-@item
-Find where the bug is, fix it and send me the patches. :-)
@end enumerate
@c man end
This chapter contains some references I consider useful.
@menu
-* Robots:: Wget as a WWW robot.
+* Robot Exclusion:: Wget's support for RES.
* Security Considerations:: Security with Wget.
* Contributors:: People who helped.
@end menu
-@node Robots, Security Considerations, Appendices, Appendices
-@section Robots
-@cindex robots
+@node Robot Exclusion, Security Considerations, Appendices, Appendices
+@section Robot Exclusion
+@cindex robot exclusion
@cindex robots.txt
@cindex server maintenance
sucking all the available data in progress. @samp{wget -r @var{site}},
and you're set. Great? Not for the server admin.
-While Wget is retrieving static pages, there's not much of a problem.
-But for Wget, there is no real difference between a static page and the
-most demanding CGI. For instance, a site I know has a section handled
-by an, uh, @dfn{bitchin'} CGI script that converts all the Info files to
-HTML. The script can and does bring the machine to its knees without
-providing anything useful to the downloader.
-
-For such and similar cases various robot exclusion schemes have been
-devised as a means for the server administrators and document authors to
-protect chosen portions of their sites from the wandering of robots.
-
-The more popular mechanism is the @dfn{Robots Exclusion Standard}, or
-@sc{res}, written by Martijn Koster et al. in 1994. It specifies the
-format of a text file containing directives that instruct the robots
-which URL paths to avoid. To be found by the robots, the specifications
-must be placed in @file{/robots.txt} in the server root, which the
-robots are supposed to download and parse.
-
-Wget supports @sc{res} when downloading recursively. So, when you
-issue:
+As long as Wget is only retrieving static pages, and doing it at a
+reasonable rate (see the @samp{--wait} option), there's not much of a
+problem. The trouble is that Wget can't tell the difference between the
+smallest static page and the most demanding CGI. A site I know has a
+section handled by an, uh, @dfn{bitchin'} CGI Perl script that converts
+Info files to HTML on the fly. The script is slow, but works well
+enough for human users viewing an occasional Info file. However, when
+someone's recursive Wget download stumbles upon the index page that
+links to all the Info files through the script, the system is brought to
+its knees without providing anything useful to the downloader.
+
+To avoid this kind of accident, as well as to preserve privacy for
+documents that need to be protected from well-behaved robots, the
+concept of @dfn{robot exclusion} has been invented. The idea is that
+the server administrators and document authors can specify which
+portions of the site they wish to protect from the robots.
+
+The most popular mechanism, and the de facto standard supported by all
+the major robots, is the ``Robots Exclusion Standard'' (RES) written by
+Martijn Koster et al. in 1994. It specifies the format of a text file
+containing directives that instruct the robots which URL paths to avoid.
+To be found by the robots, the specifications must be placed in
+@file{/robots.txt} in the server root, which the robots are supposed to
+download and parse.
+
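For illustration, a minimal @file{/robots.txt} in the format described
above might look like this (the disallowed paths are invented):

```
# Applies to every robot; the listed path prefixes must not be visited.
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
```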
+Although Wget is not a web robot in the strictest sense of the word, it
+can download large parts of the site without the user's intervention to
+download an individual page.  Because of that, Wget honors RES when
+downloading recursively.  For instance, when you issue:
@example
wget -r http://www.server.com/
Until version 1.8, Wget supported the first version of the standard,
written by Martijn Koster in 1994 and available at
-@url{http://info.webcrawler.com/mak/projects/robots/norobots.html}. As
-of version 1.8, Wget has supported the additional directives specified
-in the internet draft @samp{<draft-koster-robots-00.txt>} titled ``A
-Method for Web Robots Control''. The draft, which has as far as I know
-never made to an @sc{rfc}, is available at
-@url{http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html}.
+@url{http://www.robotstxt.org/wc/norobots.html}. As of version 1.8,
+Wget has supported the additional directives specified in the internet
+draft @samp{<draft-koster-robots-00.txt>} titled ``A Method for Web
+Robots Control''.  The draft, which has as far as I know never made it
+to an @sc{rfc}, is available at
+@url{http://www.robotstxt.org/wc/norobots-rfc.txt}.
This manual no longer includes the text of the Robot Exclusion Standard.
@end example
This is explained in some detail at
-@url{http://info.webcrawler.com/mak/projects/robots/meta-user.html}.
-Wget supports this method of robot exclusion in addition to the usual
-@file{/robots.txt} exclusion.
+@url{http://www.robotstxt.org/wc/meta-user.html}. Wget supports this
+method of robot exclusion in addition to the usual @file{/robots.txt}
+exclusion.
-@node Security Considerations, Contributors, Robots, Appendices
+If you know what you are doing and really really wish to turn off the
+robot exclusion, set the @code{robots} variable to @samp{off} in your
+@file{.wgetrc}. You can achieve the same effect from the command line
+using the @code{-e} switch, e.g. @samp{wget -e robots=off @var{url}...}.
+
+@node Security Considerations, Contributors, Robot Exclusion, Appendices
@section Security Considerations
@cindex security
main issues, and some solutions.
@enumerate
-@item
-The passwords on the command line are visible using @code{ps}. If this
-is a problem, avoid putting passwords from the command line---e.g. you
-can use @file{.netrc} for this.
+@item The passwords on the command line are visible using @code{ps}.
+The best way around it is to use @code{wget -i -} and feed the @sc{url}s
+to Wget's standard input, each on a separate line, terminated by
+@kbd{C-d}. Another workaround is to use @file{.netrc} to store
+passwords; however, storing unencrypted passwords is also considered a
+security risk.
@item
Using the insecure @dfn{basic} authentication scheme, unencrypted