[svn] Improved --restrict-file-names to accept ",nocontrol".

diff --git a/doc/wget.texi b/doc/wget.texi

index e498d541c8ba9f7b8390174e4f93898bbe3c84d8..f0c66e691ea43ad41b73915b4af686774490ef00 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -14,9 +14,15 @@
  @afourpaper
  @end iftex
  
-@c This should really be auto-generated!
-@set VERSION 1.8
-@set UPDATED December 2001
+@c Title for man page.  The weird way texi2pod.pl is written requires
+@c the preceding @set.
+@set Wget Wget
+@c man title Wget The non-interactive network downloader.
+
+@c This should really be generated automatically, possibly by including
+@c an auto-generated file.
+@set VERSION 1.9-cvs
+@set UPDATED September 2003
  
  @dircategory Net Utilities
  @dircategory World Wide Web
@@ -29,8 +35,8 @@ This file documents the the GNU Wget utility for downloading network
  data.
  
  @c man begin COPYRIGHT
-Copyright @copyright{} 1996, 1997, 1998, 2000, 2001 Free Software
-Foundation, Inc.
+Copyright @copyright{} 1996, 1997, 1998, 2000, 2001, 2002, 2003 Free
+Software Foundation, Inc.
  
  Permission is granted to make and distribute verbatim copies of
  this manual provided the copyright notice and this permission notice
@@ -284,6 +290,13 @@ address will be supplied as a default password.@footnote{If you have a
  @file{.netrc} file in your home directory, password will also be
  searched for there.}
  
+@strong{Important Note}: if you specify a password-containing @sc{url}
+on the command line, the username and password will be plainly visible
+to all users on the system, by way of @code{ps}.  On multi-user systems,
+this is a big security risk.  To work around it, use @code{wget -i -}
+and feed the @sc{url}s to Wget's standard input, each on a separate
+line, terminated by @kbd{C-d}.
+
  You can encode unsafe characters in a @sc{url} as @samp{%xy}, @code{xy}
  being the hexadecimal representation of the character's @sc{ascii}
  value.  Some common unsafe characters include @samp{%} (quoted as
@@ -617,9 +630,14 @@ servers that support the @code{Range} header.
  Select the type of the progress indicator you wish to use.  Legal
  indicators are ``dot'' and ``bar''.
  
-The ``dot'' indicator is used by default.  It traces the retrieval by
-printing dots on the screen, each dot representing a fixed amount of
-downloaded data.
+The ``bar'' indicator is used by default.  It draws an ASCII progress
+bar (a.k.a. ``thermometer'' display) indicating the status of
+retrieval.  If the output is not a TTY, the ``dot'' indicator will be
+used by default.
+
+Use @samp{--progress=dot} to switch to the ``dot'' display.  It traces
+the retrieval by printing dots on the screen, each dot representing a
+fixed amount of downloaded data.
  
  When using the dotted retrieval, you may also set the @dfn{style} by
  specifying the type as @samp{dot:@var{style}}.  Different styles assign
@@ -631,11 +649,11 @@ lines).  The @code{mega} style is suitable for downloading very large
  files---each dot represents 64K retrieved, there are eight dots in a
  cluster, and 48 dots on each line (so each line contains 3M).
  
-Specifying @samp{--progress=bar} will draw a nice ASCII progress bar
-graphics (a.k.a ``thermometer'' display) to indicate retrieval.  If the
-output is not a TTY, this option will be ignored, and Wget will revert
-to the dot indicator.  If you want to force the bar indicator, use
-@samp{--progress=bar:force}.
+Note that you can set the default style using the @code{progress}
+command in @file{.wgetrc}.  That setting may be overridden from the
+command line.  The exception is that, when the output is not a TTY, the
+``dot'' progress will be favored over ``bar''.  To force the bar output,
+use @samp{--progress=bar:force}.
  
  @item -N
  @itemx --timestamping
@@ -664,15 +682,34 @@ functionality of real @sc{www} spiders.
  @cindex timeout
  @item -T seconds
  @itemx --timeout=@var{seconds}
-Set the read timeout to @var{seconds} seconds.  Whenever a network read
-is issued, the file descriptor is checked for a timeout, which could
-otherwise leave a pending connection (uninterrupted read).  The default
+Set the network timeout to @var{seconds} seconds.  Whenever Wget
+connects to or reads from a remote host, it checks for a timeout and
+aborts the operation if the time expires.  This prevents anomalous
+occurrences such as hanging reads or infinite connects.  The default
  timeout is 900 seconds (fifteen minutes).  Setting timeout to 0 will
  disable checking for timeouts.
  
  Please do not lower the default timeout value with this option unless
  you know what you are doing.
  
+@cindex bandwidth, limit
+@cindex rate, limit
+@cindex limit bandwidth
+@item --limit-rate=@var{amount}
+Limit the download speed to @var{amount} bytes per second.  Amount may
+be expressed in bytes, kilobytes with the @samp{k} suffix, or megabytes
+with the @samp{m} suffix.  For example, @samp{--limit-rate=20k} will
+limit the retrieval rate to 20KB/s.  This kind of thing is useful when,
+for whatever reason, you don't want Wget to consume the entire available
+bandwidth.
+
+Note that Wget implements the limiting by sleeping the appropriate
+amount of time after a network read that took less time than specified
+by the rate.  Eventually this strategy causes the TCP transfer to slow
+down to approximately the specified rate.  However, it takes some time
+for this balance to be achieved, so don't be surprised if limiting the
+rate doesn't work well with very small files.
+
  @cindex pause
  @cindex wait
  @item -w @var{seconds}
@@ -725,7 +762,9 @@ actions of one.
  @item -Y on/off
  @itemx --proxy=on/off
  Turn proxy support on or off.  The proxy is on by default if the
-appropriate environmental variable is defined.
+appropriate environment variable is defined.
+
+@xref{Proxies}, for more information about the use of proxies with Wget.
  
  @cindex quota
  @item -Q @var{quota}
@@ -743,6 +782,65 @@ Thus you may safely type @samp{wget -Q2m -i sites}---download will be
  aborted when the quota is exceeded.
  
  Setting quota to 0 or to @samp{inf} unlimits the download quota.
+
+@cindex DNS cache
+@cindex caching of DNS lookups
+@item --dns-cache=off
+Turn off caching of DNS lookups.  Normally, Wget remembers the addresses
+it looked up from DNS so it doesn't have to repeatedly contact the DNS
+server for the same (typically small) set of addresses it retrieves
+from.  This cache exists in memory only; a new Wget run will contact DNS
+again.
+
+However, in some cases it is not desirable to cache host names, even for
+the duration of a short-running application like Wget.  For example,
+some HTTP servers are hosted on machines with dynamically allocated IP
+addresses that change from time to time.  Their DNS entries are updated
+along with each change.  When Wget's download from such a host gets
+interrupted by IP address change, Wget retries the download, but (due to
+DNS caching) it contacts the old address.  With the DNS cache turned
+off, Wget will repeat the DNS lookup for every connect and will thus get
+the correct dynamic address every time---at the cost of additional DNS
+lookups where they're probably not needed.
+
+If you don't understand the above description, you probably won't need
+this option.
+
+@cindex file names, restrict
+@cindex Windows file names
+@item --restrict-file-names=@var{mode}
+Change which characters found in remote URLs may show up in local file
+names generated from those URLs.  Characters that are @dfn{restricted}
+by this option are escaped, i.e. replaced with @samp{%HH}, where
+@samp{HH} is the hexadecimal number that corresponds to the restricted
+character.
+
+By default, Wget escapes the characters that are not valid as part of
+file names on your operating system, as well as control characters that
+are typically unprintable.  This option is useful for changing these
+defaults, either because you are downloading to a non-native partition,
+or because you want to disable escaping of the control characters.
+
+When mode is set to ``unix'', Wget escapes the character @samp{/} and
+the control characters in the ranges 0--31 and 128--159.  This is the
+default on Unix-like OSes.
+
+When mode is set to ``windows'', Wget escapes the characters @samp{\},
+@samp{|}, @samp{/}, @samp{:}, @samp{?}, @samp{"}, @samp{*}, @samp{<},
+@samp{>}, and the control characters in the ranges 0--31 and 128--159.
+In addition to this, Wget in Windows mode uses @samp{+} instead of
+@samp{:} to separate host and port in local file names, and uses
+@samp{@@} instead of @samp{?} to separate the query portion of the file
+name from the rest.  Therefore, a URL that would be saved as
+@samp{www.xemacs.org:4300/search.pl?input=blah} in Unix mode would be
+saved as @samp{www.xemacs.org+4300/search.pl@@input=blah} in Windows
+mode.  This mode is the default on Windows.
+
+If you append @samp{,nocontrol} to the mode, as in
+@samp{unix,nocontrol}, escaping of the control characters is also
+switched off.  You can use @samp{--restrict-file-names=nocontrol} to
+turn off escaping of control characters without affecting the choice of
+the OS to use as file name restriction mode.
  @end table
  
  @node Directory Options, HTTP Options, Download Options, Invoking
@@ -848,8 +946,15 @@ encode them using either the @code{basic} (insecure) or the
  @code{digest} authentication scheme.
  
  Another way to specify username and password is in the @sc{url} itself
-(@pxref{URL Format}).  For more information about security issues with
-Wget, @xref{Security Considerations}.
+(@pxref{URL Format}).  Either method reveals your password to anyone who
+bothers to run @code{ps}.  To prevent the passwords from being seen,
+store them in @file{.wgetrc} or @file{.netrc}, and make sure to protect
+those files from other users with @code{chmod}.  If the passwords are
+really important, do not leave them lying in those files either---edit
+the files and delete them after Wget has started the download.
+
+@xref{Security Considerations}, for more information about security
+issues with Wget.
  
  @cindex proxy
  @cindex cache
@@ -974,6 +1079,9 @@ Specify the username @var{user} and password @var{password} for
  authentication on a proxy server.  Wget will encode them using the
  @code{basic} authentication scheme.
  
+Security considerations similar to those with @samp{--http-passwd}
+pertain here as well.
+
  @cindex http referer
  @cindex referer, http
  @item --referer=@var{url}
@@ -2002,6 +2110,10 @@ Top of directory tree---the same as @samp{-P}.
  Turning dirstruct on or off---the same as @samp{-x} or @samp{-nd},
  respectively.
  
+@item dns_cache = on/off
+Turn DNS caching on/off.  Since DNS caching is on by default, this
+option is normally used to turn it off.  Same as @samp{--dns-cache}.
+
  @item domains = @var{string}
  Same as @samp{-D} (@pxref{Spanning Hosts}).
  
@@ -2084,6 +2196,10 @@ Consider data longer than specified in content-length header as invalid
  as there is, provided there is more than or equal to the value in
  @code{Content-Length}.
  
+@item limit_rate = @var{rate}
+Limit the download speed to no more than @var{rate} bytes per second.
+The same as @samp{--limit-rate}.
+
  @item logfile = @var{string}
  Set logfile---the same as @samp{-o}.
  
@@ -2166,21 +2282,25 @@ Links}).
  If set to on, remove @sc{ftp} listings downloaded by Wget.  Setting it
  to off is the same as @samp{-nr}.
  
+@item restrict_file_names = unix/windows
+Restrict the file names generated by Wget from URLs.  See
+@samp{--restrict-file-names} for a more detailed description.
+
  @item retr_symlinks = on/off
  When set to on, retrieve symbolic links as if they were plain files; the
  same as @samp{--retr-symlinks}.
  
  @item robots = on/off
-Use (or not) @file{/robots.txt} file (@pxref{Robots}).  Be sure to know
-what you are doing before changing the default (which is @samp{on}).
+Specify whether the norobots convention is respected by Wget, ``on'' by
+default.  This switch controls both the @file{/robots.txt} and the
+@samp{nofollow} aspect of the spec.  @xref{Robot Exclusion}, for more
+details about this.  Be sure you know what you are doing before turning
+this off.
  
  @item server_response = on/off
  Choose whether or not to print the @sc{http} and @sc{ftp} server
  responses---the same as @samp{-S}.
  
-@item simple_host_check = on/off
-Same as @samp{-nh} (@pxref{Spanning Hosts}).
-
  @item span_hosts = on/off
  Same as @samp{-H}.
  
@@ -2411,6 +2531,10 @@ If you want to encode your own username and password to @sc{http} or
  wget ftp://hniksic:mypassword@@unix.server.com/.emacs
  @end example
  
+Note, however, that this usage is not advisable on multi-user systems
+because it reveals your password to anyone who looks at the output of
+@code{ps}.
+
  @cindex redirecting output
  @item
  You would like the output documents to go to standard output instead of
@@ -2695,14 +2819,14 @@ Other than that, Wget will not try to interfere with signals in any way.
  This chapter contains some references I consider useful.
  
  @menu
-* Robots::                  Wget as a WWW robot.
+* Robot Exclusion::         Wget's support for RES.
  * Security Considerations:: Security with Wget.
  * Contributors::            People who helped.
  @end menu
  
-@node Robots, Security Considerations, Appendices, Appendices
-@section Robots
-@cindex robots
+@node Robot Exclusion, Security Considerations, Appendices, Appendices
+@section Robot Exclusion
+@cindex robot exclusion
  @cindex robots.txt
  @cindex server maintenance
  
@@ -2710,26 +2834,35 @@ It is extremely easy to make Wget wander aimlessly around a web site,
  sucking all the available data in progress.  @samp{wget -r @var{site}},
  and you're set.  Great?  Not for the server admin.
  
-While Wget is retrieving static pages, there's not much of a problem.
-But for Wget, there is no real difference between a static page and the
-most demanding CGI.  For instance, a site I know has a section handled
-by an, uh, @dfn{bitchin'} CGI script that converts all the Info files to
-HTML.  The script can and does bring the machine to its knees without
-providing anything useful to the downloader.
-
-For such and similar cases various robot exclusion schemes have been
-devised as a means for the server administrators and document authors to
-protect chosen portions of their sites from the wandering of robots.
-
-The more popular mechanism is the @dfn{Robots Exclusion Standard}, or
-@sc{res}, written by Martijn Koster et al. in 1994.  It specifies the
-format of a text file containing directives that instruct the robots
-which URL paths to avoid.  To be found by the robots, the specifications
-must be placed in @file{/robots.txt} in the server root, which the
-robots are supposed to download and parse.
-
-Wget supports @sc{res} when downloading recursively.  So, when you
-issue:
+As long as Wget is only retrieving static pages, and doing it at a
+reasonable rate (see the @samp{--wait} option), there's not much of a
+problem.  The trouble is that Wget can't tell the difference between the
+smallest static page and the most demanding CGI.  A site I know has a
+section handled by an, uh, @dfn{bitchin'} CGI Perl script that converts
+Info files to HTML on the fly.  The script is slow, but works well
+enough for human users viewing an occasional Info file.  However, when
+someone's recursive Wget download stumbles upon the index page that
+links to all the Info files through the script, the system is brought to
+its knees without providing anything useful to the downloader.
+
+To avoid this kind of accident, as well as to preserve privacy for
+documents that need to be protected from well-behaved robots, the
+concept of @dfn{robot exclusion} has been invented.  The idea is that
+the server administrators and document authors can specify which
+portions of the site they wish to protect from the robots.
+
+The most popular mechanism, and the de facto standard supported by all
+the major robots, is the ``Robots Exclusion Standard'' (RES) written by
+Martijn Koster et al. in 1994.  It specifies the format of a text file
+containing directives that instruct the robots which URL paths to avoid.
+To be found by the robots, the specifications must be placed in
+@file{/robots.txt} in the server root, which the robots are supposed to
+download and parse.
+
+Although Wget is not a web robot in the strictest sense of the word, it
+can download large parts of the site without the user's intervention to
+download an individual page.  Because of that, Wget honors RES when
+downloading recursively.  For instance, when you issue:
  
  @example
  wget -r http://www.server.com/
@@ -2743,12 +2876,12 @@ server.
  
  Until version 1.8, Wget supported the first version of the standard,
  written by Martijn Koster in 1994 and available at
-@url{http://info.webcrawler.com/mak/projects/robots/norobots.html}.  As
-of version 1.8, Wget has supported the additional directives specified
-in the internet draft @samp{<draft-koster-robots-00.txt>} titled ``A
-Method for Web Robots Control''.  The draft, which has as far as I know
-never made to an @sc{rfc}, is available at
-@url{http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html}.
+@url{http://www.robotstxt.org/wc/norobots.html}.  As of version 1.8,
+Wget has supported the additional directives specified in the internet
+draft @samp{<draft-koster-robots-00.txt>} titled ``A Method for Web
+Robots Control''.  The draft, which has as far as I know never made it
+to an @sc{rfc}, is available at
+@url{http://www.robotstxt.org/wc/norobots-rfc.txt}.
  
  This manual no longer includes the text of the Robot Exclusion Standard.
  
@@ -2762,11 +2895,16 @@ this:
  @end example
  
  This is explained in some detail at
-@url{http://info.webcrawler.com/mak/projects/robots/meta-user.html}.
-Wget supports this method of robot exclusion in addition to the usual
-@file{/robots.txt} exclusion.
+@url{http://www.robotstxt.org/wc/meta-user.html}.  Wget supports this
+method of robot exclusion in addition to the usual @file{/robots.txt}
+exclusion.
  
-@node Security Considerations, Contributors, Robots, Appendices
+If you know what you are doing and really really wish to turn off the
+robot exclusion, set the @code{robots} variable to @samp{off} in your
+@file{.wgetrc}.  You can achieve the same effect from the command line
+using the @code{-e} switch, e.g. @samp{wget -e robots=off @var{url}...}.
+
+@node Security Considerations, Contributors, Robot Exclusion, Appendices
  @section Security Considerations
  @cindex security
  
@@ -2775,10 +2913,12 @@ through the network, which may present a security problem.  Here are the
  main issues, and some solutions.
  
  @enumerate
-@item
-The passwords on the command line are visible using @code{ps}.  If this
-is a problem, avoid putting passwords from the command line---e.g. you
-can use @file{.netrc} for this.
+@item The passwords on the command line are visible using @code{ps}.
+The best way around it is to use @code{wget -i -} and feed the @sc{url}s
+to Wget's standard input, each on a separate line, terminated by
+@kbd{C-d}.  Another workaround is to use @file{.netrc} to store
+passwords; however, storing unencrypted passwords is also considered a
+security risk.
  
  @item
  Using the insecure @dfn{basic} authentication scheme, unencrypted