diff --git a/doc/wget.texi b/doc/wget.texi
index 81ccaa15..78bc20dc 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -14,9 +14,15 @@
@afourpaper
@end iftex
-@c This should really be auto-generated!
-@set VERSION 1.7.1
-@set UPDATED June 2001
+@c Title for man page. The weird way texi2pod.pl is written requires
+@c the preceding @set.
+@set Wget Wget
+@c man title Wget The non-interactive network downloader.
+
+@c This should really be generated automatically, possibly by including
+@c an auto-generated file.
+@set VERSION 1.9-cvs
+@set UPDATED September 2003
@dircategory Net Utilities
@dircategory World Wide Web
@@ -29,8 +35,8 @@ This file documents the the GNU Wget utility for downloading network
data.
@c man begin COPYRIGHT
-Copyright @copyright{} 1996, 1997, 1998, 2000, 2001 Free Software
-Foundation, Inc.
+Copyright @copyright{} 1996, 1997, 1998, 2000, 2001, 2002, 2003 Free
+Software Foundation, Inc.
Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
@@ -112,14 +118,16 @@ Foundation, Inc.
@cindex features
@c man begin DESCRIPTION
-GNU Wget is a freely available network utility to retrieve files from
-the World Wide Web, using @sc{http} (Hyper Text Transfer Protocol) and
-@sc{ftp} (File Transfer Protocol), the two most widely used Internet
-protocols. It has many useful features to make downloading easier, some
-of them being:
+GNU Wget is a free utility for non-interactive download of files from
+the Web. It supports @sc{http}, @sc{https}, and @sc{ftp} protocols, as
+well as retrieval through @sc{http} proxies.
+
+@c man end
+This chapter is a partial overview of Wget's features.
@itemize @bullet
@item
+@c man begin DESCRIPTION
Wget is non-interactive, meaning that it can work in the background,
while the user is not logged on. This allows you to start a retrieval
and disconnect from the system, letting Wget finish the work. By
@@ -128,18 +136,23 @@ which can be a great hindrance when transferring a lot of data.
@c man end
@sp 1
-@c man begin DESCRIPTION
@item
-Wget is capable of descending recursively through the structure of
-@sc{html} documents and @sc{ftp} directory trees, making a local copy of
-the directory hierarchy similar to the one on the remote server. This
-feature can be used to mirror archives and home pages, or traverse the
-web in search of data, like a @sc{www} robot (@pxref{Robots}). In that
-spirit, Wget understands the @code{norobots} convention.
+@ignore
+@c man begin DESCRIPTION
+
+@c man end
+@end ignore
+@c man begin DESCRIPTION
+Wget can follow links in @sc{html} pages and create local versions of
+remote web sites, fully recreating the directory structure of the
+original site. This is sometimes referred to as ``recursive
+downloading.'' While doing that, Wget respects the Robot Exclusion
+Standard (@file{/robots.txt}). Wget can be instructed to convert the
+links in downloaded @sc{html} files to the local files for offline
+viewing.
@c man end
@sp 1
-@c man begin DESCRIPTION
@item
File name wildcard matching and recursive mirroring of directories are
available when retrieving via @sc{ftp}. Wget can read the time-stamp
@@ -148,52 +161,47 @@ locally. Thus Wget can see if the remote file has changed since last
retrieval, and automatically retrieve the new version if it has. This
makes Wget suitable for mirroring of @sc{ftp} sites, as well as home
pages.
-@c man end
@sp 1
-@c man begin DESCRIPTION
@item
-Wget works exceedingly well on slow or unstable connections,
-retrying the document until it is fully retrieved, or until a
-user-specified retry count is surpassed. It will try to resume the
-download from the point of interruption, using @code{REST} with @sc{ftp}
-and @code{Range} with @sc{http} servers that support them.
+@ignore
+@c man begin DESCRIPTION
+
+@c man end
+@end ignore
+@c man begin DESCRIPTION
+Wget has been designed for robustness over slow or unstable network
+connections; if a download fails due to a network problem, it will
+keep retrying until the whole file has been retrieved. If the server
+supports regetting, it will instruct the server to continue the
+download from where it left off.
@c man end
@sp 1
-@c man begin DESCRIPTION
@item
-By default, Wget supports proxy servers, which can lighten the network
-load, speed up retrieval and provide access behind firewalls. However,
-if you are behind a firewall that requires that you use a socks style
-gateway, you can get the socks library and build Wget with support for
-socks. Wget also supports the passive @sc{ftp} downloading as an
-option.
-@c man end
+Wget supports proxy servers, which can lighten the network load, speed
+up retrieval and provide access behind firewalls. However, if you are
+behind a firewall that requires that you use a socks style gateway, you
+can get the socks library and build Wget with support for socks. Wget
+also supports the passive @sc{ftp} downloading as an option.
@sp 1
-@c man begin DESCRIPTION
@item
Builtin features offer mechanisms to tune which links you wish to follow
(@pxref{Following Links}).
-@c man end
@sp 1
-@c man begin DESCRIPTION
@item
The retrieval is conveniently traced with printing dots, each dot
representing a fixed amount of data received (1KB by default). These
representations can be customized to your preferences.
-@c man end
@sp 1
-@c man begin DESCRIPTION
@item
Most of the features are fully configurable, either through command line
options, or via the initialization file @file{.wgetrc} (@pxref{Startup
File}). Wget allows you to define @dfn{global} startup files
(@file{/usr/local/etc/wgetrc} by default) for site settings.
-@c man end
@ignore
@c man begin FILES
@@ -208,14 +216,12 @@ User startup file.
@end ignore
@sp 1
-@c man begin DESCRIPTION
@item
Finally, GNU Wget is free software. This means that everyone may use
it, redistribute it and/or modify it under the terms of the GNU General
Public License, as published by the Free Software Foundation
(@pxref{Copying}).
@end itemize
-@c man end
@node Invoking, Recursive Retrieval, Overview, Top
@chapter Invoking
@@ -284,6 +290,13 @@ address will be supplied as a default password.@footnote{If you have a
@file{.netrc} file in your home directory, password will also be
searched for there.}
+@strong{Important Note}: if you specify a password-containing @sc{url}
+on the command line, the username and password will be plainly visible
+to all users on the system, by way of @code{ps}. On multi-user systems,
+this is a big security risk. To work around it, use @code{wget -i -}
+and feed the @sc{url}s to Wget's standard input, each on a separate
+line, terminated by @kbd{C-d}.
+
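A minimal sketch of the stdin workaround described above, using a hypothetical host and credentials. The URL never appears as a command-line argument, so it cannot be read out of the process table with @code{ps}:

```shell
# Hypothetical credentials and host. Feeding the URL on standard input
# ("wget -i -") keeps it out of the argument list that ps exposes.
url='ftp://user:secret@ftp.example.com/file.txt'
printf '%s\n' "$url"            # this is exactly what "wget -i -" reads
# printf '%s\n' "$url" | wget -i -
```

The actual download line is left commented out here since it needs a live server; uncomment it to run the transfer.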
You can encode unsafe characters in a @sc{url} as @samp{%xy}, @code{xy}
being the hexadecimal representation of the character's @sc{ascii}
value. Some common unsafe characters include @samp{%} (quoted as
@@ -499,7 +512,9 @@ IPs.
@item -t @var{number}
@itemx --tries=@var{number}
Set number of retries to @var{number}. Specify 0 or @samp{inf} for
-infinite retrying.
+infinite retrying. The default is to retry 20 times, with the exception
+of fatal errors like ``connection refused'' or ``not found'' (404),
+which are not retried.
@item -O @var{file}
@itemx --output-document=@var{file}
@@ -611,26 +626,36 @@ Another instance where you'll get a garbled file if you try to use
Note that @samp{-c} only works with @sc{ftp} servers and with @sc{http}
servers that support the @code{Range} header.
+@cindex progress indicator
@cindex dot style
-@cindex retrieval tracing style
-@item --dot-style=@var{style}
-Set the retrieval style to @var{style}. Wget traces the retrieval of
-each document by printing dots on the screen, each dot representing a
-fixed amount of retrieved data. Any number of dots may be separated in
-a @dfn{cluster}, to make counting easier. This option allows you to
-choose one of the pre-defined styles, determining the number of bytes
-represented by a dot, the number of dots in a cluster, and the number of
-dots on the line.
-
-With the @code{default} style each dot represents 1K, there are ten dots
-in a cluster and 50 dots in a line. The @code{binary} style has a more
-``computer''-like orientation---8K dots, 16-dots clusters and 48 dots
-per line (which makes for 384K lines). The @code{mega} style is
-suitable for downloading very large files---each dot represents 64K
-retrieved, there are eight dots in a cluster, and 48 dots on each line
-(so each line contains 3M). The @code{micro} style is exactly the
-reverse; it is suitable for downloading small files, with 128-byte dots,
-8 dots per cluster, and 48 dots (6K) per line.
+@item --progress=@var{type}
+Select the type of the progress indicator you wish to use. Legal
+indicators are ``dot'' and ``bar''.
+
+The ``bar'' indicator is used by default. It draws an @sc{ascii}
+progress bar graphic (also known as a ``thermometer'' display)
+indicating the status of retrieval. If the output is not a TTY, the
+``dot'' indicator will be used by default.
+
+Use @samp{--progress=dot} to switch to the ``dot'' display. It traces
+the retrieval by printing dots on the screen, each dot representing a
+fixed amount of downloaded data.
+
+When using the dot indicator, you may also set the @dfn{style} by
+specifying the type as @samp{dot:@var{style}}. Different styles assign
+different meaning to one dot. With the @code{default} style each dot
+represents 1K, there are ten dots in a cluster and 50 dots in a line.
+The @code{binary} style has a more ``computer''-like orientation---8K
+dots, 16-dots clusters and 48 dots per line (which makes for 384K
+lines). The @code{mega} style is suitable for downloading very large
+files---each dot represents 64K retrieved, there are eight dots in a
+cluster, and 48 dots on each line (so each line contains 3M).
+
+Note that you can set the default style using the @code{progress}
+command in @file{.wgetrc}. That setting may be overridden from the
+command line. The exception is that, when the output is not a TTY, the
+``dot'' progress will be favored over ``bar''. To force the bar output,
+use @samp{--progress=bar:force}.
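The per-line totals quoted for each dot style follow directly from the dot size and dots-per-line figures; a quick arithmetic check:

```shell
# Per-line totals implied by each dot style described above:
# default: 1K dots, 50 per line; binary: 8K dots, 48 per line;
# mega: 64K dots, 48 per line (3072K = 3M).
echo "default: $((1 * 50))K per line"
echo "binary:  $((8 * 48))K per line"
echo "mega:    $((64 * 48))K per line"
```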
@item -N
@itemx --timestamping
@@ -659,15 +684,34 @@ functionality of real @sc{www} spiders.
@cindex timeout
@item -T seconds
@itemx --timeout=@var{seconds}
-Set the read timeout to @var{seconds} seconds. Whenever a network read
-is issued, the file descriptor is checked for a timeout, which could
-otherwise leave a pending connection (uninterrupted read). The default
+Set the network timeout to @var{seconds} seconds. Whenever Wget
+connects to or reads from a remote host, it checks for a timeout and
+aborts the operation if the time expires. This prevents anomalous
+occurrences such as hanging reads or infinite connects. The default
timeout is 900 seconds (fifteen minutes). Setting timeout to 0 will
disable checking for timeouts.
Please do not lower the default timeout value with this option unless
you know what you are doing.
+@cindex bandwidth, limit
+@cindex rate, limit
+@cindex limit bandwidth
+@item --limit-rate=@var{amount}
+Limit the download speed to @var{amount} bytes per second. Amount may
+be expressed in bytes, kilobytes with the @samp{k} suffix, or megabytes
+with the @samp{m} suffix. For example, @samp{--limit-rate=20k} will
+limit the retrieval rate to 20KB/s. This is useful when, for whatever
+reason, you don't want Wget to consume the entire available bandwidth.
+
+Note that Wget implements the limiting by sleeping the appropriate
+amount of time after a network read that took less time than specified
+by the rate. Eventually this strategy causes the TCP transfer to slow
+down to approximately the specified rate. However, it takes some time
+for this balance to be achieved, so don't be surprised if limiting the
+rate doesn't work well with very small files.
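Assuming the @samp{k} and @samp{m} suffixes denote binary multiples (the common convention for these options; verify against your Wget version), the caps work out as follows:

```shell
# Byte-per-second caps implied by the suffixes, assuming binary
# multiples (k = 1024, m = 1024*1024):
echo "$((20 * 1024)) bytes/s"        # --limit-rate=20k
echo "$((2 * 1024 * 1024)) bytes/s"  # --limit-rate=2m
```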
+
@cindex pause
@cindex wait
@item -w @var{seconds}
@@ -696,11 +740,33 @@ seconds per file.
Note that this option is turned on by default in the global
@file{wgetrc} file.
+@cindex wait, random
+@cindex random wait
+@item --random-wait
+Some web sites may perform log analysis to identify retrieval programs
+such as Wget by looking for statistically significant similarities in
+the time between requests. This option causes the time between requests
+to vary between 0 and 2 * @var{wait} seconds, where @var{wait} was
+specified using the @samp{-w} or @samp{--wait} options, in order to mask
+Wget's presence from such analysis.
+
+A recent article in a publication devoted to development on a popular
+consumer platform provided code to perform this analysis on the fly.
+Its author suggested blocking at the class C address level to ensure
+automated retrieval programs were blocked despite changing DHCP-supplied
+addresses.
+
+The @samp{--random-wait} option was inspired by this ill-advised
+recommendation to block many unrelated users from a web site due to the
+actions of one.
+
@cindex proxy
@item -Y on/off
@itemx --proxy=on/off
Turn proxy support on or off. The proxy is on by default if the
-appropriate environmental variable is defined.
+appropriate environment variable is defined.
+
+For more information about the use of proxies with Wget, see
+@ref{Proxies}.
@cindex quota
@item -Q @var{quota}
@@ -718,6 +784,65 @@ Thus you may safely type @samp{wget -Q2m -i sites}---download will be
aborted when the quota is exceeded.
Setting quota to 0 or to @samp{inf} unlimits the download quota.
+
+@cindex DNS cache
+@cindex caching of DNS lookups
+@item --dns-cache=off
+Turn off caching of DNS lookups. Normally, Wget remembers the addresses
+it looked up from DNS so it doesn't have to repeatedly contact the DNS
+server for the same (typically small) set of addresses it retrieves
+from. This cache exists in memory only; a new Wget run will contact DNS
+again.
+
+However, in some cases it is not desirable to cache host names, even for
+the duration of a short-running application like Wget. For example,
+some HTTP servers are hosted on machines with dynamically allocated IP
+addresses that change from time to time. Their DNS entries are updated
+along with each change. When Wget's download from such a host gets
+interrupted by IP address change, Wget retries the download, but (due to
+DNS caching) it contacts the old address. With the DNS cache turned
+off, Wget will repeat the DNS lookup for every connect and will thus get
+the correct dynamic address every time---at the cost of additional DNS
+lookups where they're probably not needed.
+
+If you don't understand the above description, you probably won't need
+this option.
+
+@cindex file names, restrict
+@cindex Windows file names
+@item --restrict-file-names=@var{mode}
+Change which characters found in remote URLs may show up in local file
+names generated from those URLs. Characters that are @dfn{restricted}
+by this option are escaped, i.e. replaced with @samp{%HH}, where
+@samp{HH} is the hexadecimal number that corresponds to the restricted
+character.
+
+By default, Wget escapes the characters that are not valid as part of
+file names on your operating system, as well as control characters that
+are typically unprintable. This option is useful for changing these
+defaults, either because you are downloading to a non-native partition,
+or because you want to disable escaping of the control characters.
+
+When mode is set to ``unix'', Wget escapes the character @samp{/} and
+the control characters in the ranges 0--31 and 128--159. This is the
+default on Unix-like operating systems.
+
+When mode is set to ``windows'', Wget escapes the characters @samp{\},
+@samp{|}, @samp{/}, @samp{:}, @samp{?}, @samp{"}, @samp{*}, @samp{<},
+@samp{>}, and the control characters in the ranges 0--31 and 128--159.
+In addition to this, Wget in Windows mode uses @samp{+} instead of
+@samp{:} to separate host and port in local file names, and uses
+@samp{@@} instead of @samp{?} to separate the query portion of the file
+name from the rest. Therefore, a URL that would be saved as
+@samp{www.xemacs.org:4300/search.pl?input=blah} in Unix mode would be
+saved as @samp{www.xemacs.org+4300/search.pl@@input=blah} in Windows
+mode. This mode is the default on Windows.
+
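The separator rewriting in Windows mode can be sketched with @code{sed}; note this reproduces only the @samp{:} to @samp{+} and @samp{?} to @samp{@@} substitutions described above, while real Wget additionally percent-escapes the other restricted characters:

```shell
# Sketch of only the separator substitutions (':' -> '+', '?' -> '@');
# Wget itself also escapes the restricted characters as %HH.
printf '%s\n' 'www.xemacs.org:4300/search.pl?input=blah' \
  | sed -e 's/:/+/' -e 's/?/@/'
```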
+If you append @samp{,nocontrol} to the mode, as in
+@samp{unix,nocontrol}, escaping of the control characters is also
+switched off. You can use @samp{--restrict-file-names=nocontrol} to
+turn off escaping of control characters without affecting the choice of
+the OS to use as file name restriction mode.
@end table
@node Directory Options, HTTP Options, Download Options, Invoking
@@ -823,8 +948,15 @@ encode them using either the @code{basic} (insecure) or the
@code{digest} authentication scheme.
Another way to specify username and password is in the @sc{url} itself
-(@pxref{URL Format}). For more information about security issues with
-Wget, @xref{Security Considerations}.
+(@pxref{URL Format}). Either method reveals your password to anyone who
+bothers to run @code{ps}. To prevent the passwords from being seen,
+store them in @file{.wgetrc} or @file{.netrc}, and make sure to protect
+those files from other users with @code{chmod}. If the passwords are
+really important, do not leave them lying in those files either---edit
+the files and delete them after Wget has started the download.
+
+For more information about security issues with Wget, see
+@ref{Security Considerations}.
@cindex proxy
@cindex cache
@@ -851,9 +983,54 @@ use cookies; however, @emph{storing} cookies is not on by default.
@cindex loading cookies
@cindex cookies, loading
@item --load-cookies @var{file}
-Load cookies from @var{file} before the first HTTP retrieval. The
-format of @var{file} is one used by Netscape and Mozilla, at least their
-Unix version.
+Load cookies from @var{file} before the first HTTP retrieval.
+@var{file} is a textual file in the format originally used by Netscape's
+@file{cookies.txt} file.
+
+You will typically use this option when mirroring sites that require
+that you be logged in to access some or all of their content. The login
+process typically works by the web server issuing an @sc{http} cookie
+upon receiving and verifying your credentials. The cookie is then
+resent by the browser when accessing that part of the site, and so
+proves your identity.
+
+Mirroring such a site requires Wget to send the same cookies your
+browser sends when communicating with the site. This is achieved by
+@samp{--load-cookies}---simply point Wget to the location of the
+@file{cookies.txt} file, and it will send the same cookies your browser
+would send in the same situation. Different browsers keep textual
+cookie files in different locations:
+
+@table @asis
+@item Netscape 4.x.
+The cookies are in @file{~/.netscape/cookies.txt}.
+
+@item Mozilla and Netscape 6.x.
+Mozilla's cookie file is also named @file{cookies.txt}, located
+somewhere under @file{~/.mozilla}, in the directory of your profile.
+The full path usually ends up looking somewhat like
+@file{~/.mozilla/default/@var{some-weird-string}/cookies.txt}.
+
+@item Internet Explorer.
+You can produce a cookie file Wget can use by using the File menu,
+Import and Export, Export Cookies. This has been tested with Internet
+Explorer 5; it is not guaranteed to work with earlier versions.
+
+@item Other browsers.
+If you are using a different browser to create your cookies,
+@samp{--load-cookies} will only work if you can locate or produce a
+cookie file in the Netscape format that Wget expects.
+@end table
+
+If you cannot use @samp{--load-cookies}, there might still be an
+alternative. If your browser supports a ``cookie manager'', you can use
+it to view the cookies used when accessing the site you're mirroring.
+Write down the name and value of the cookie, and manually instruct Wget
+to send those cookies, bypassing the ``official'' cookie support:
+
+@example
+wget --cookies=off --header "Cookie: @var{name}=@var{value}"
+@end example
@cindex saving cookies
@cindex cookies, saving
@@ -904,6 +1081,9 @@ Specify the username @var{user} and password @var{password} for
authentication on a proxy server. Wget will encode them using the
@code{basic} authentication scheme.
+Security considerations similar to those with @samp{--http-passwd}
+pertain here as well.
+
@cindex http referer
@cindex referer, http
@item --referer=@var{url}
@@ -1178,24 +1358,47 @@ websites), and make sure the lot displays properly locally, this author
likes to use a few options in addition to @samp{-p}:
@example
-wget -E -H -k -K -nh -p http://@var{site}/@var{document}
+wget -E -H -k -K -p http://@var{site}/@var{document}
@end example
-In one case you'll need to add a couple more options. If @var{document}
-is a @code{