[svn] Introduce non-strict comment parsing.
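
For readers of the documentation change below, here is a minimal Python sketch of the two comment-termination rules that the new @samp{--strict-comments} text describes. It is an illustration only, not Wget's actual C implementation; the function names `naive_comment_end` and `strict_comment_end` are hypothetical and were chosen for this example.

```python
from typing import Optional


def naive_comment_end(html: str, start: int) -> Optional[int]:
    """Non-strict rule: a comment opened by '<!--' ends at the first '-->'.

    Returns the index just past the comment, or None if no terminator exists.
    """
    if not html.startswith("<!--", start):
        return None
    end = html.find("-->", start + 4)
    return None if end == -1 else end + 3


def strict_comment_end(html: str, start: int) -> Optional[int]:
    """Strict (SGML-style) rule: '<!' followed by zero or more '--...--'
    comments, optionally separated by whitespace, then '>'.

    Returns the index just past the closing '>', or None if the markup is
    not a well-formed comment declaration.
    """
    if not html.startswith("<!--", start):
        return None
    pos = start + 2  # position just past '<!'
    while html.startswith("--", pos):
        close = html.find("--", pos + 2)  # matching '--' that closes this comment
        if close == -1:
            return None  # comment opened but never closed
        pos = close + 2
        # Whitespace may follow a comment inside the declaration.
        while pos < len(html) and html[pos].isspace():
            pos += 1
    if pos < len(html) and html[pos] == ">":
        return pos + 1
    return None  # non-comment text inside the declaration


if __name__ == "__main__":
    samples = ["<!--foo-->", "<!--one-- --two-->", "<!--1--2-->", "<!------------>"]
    for sample in samples:
        print(repr(sample), naive_comment_end(sample, 0), strict_comment_end(sample, 0))
```

Under these assumptions, the two rules agree on `<!--foo-->` and `<!--one-- --two-->`, while `<!--1--2-->` is accepted only by the naive rule. With ten dashes instead of twelve, the strict rule keeps scanning for the next `--` and can swallow markup (including links) that the naive rule would leave visible.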

diff --git a/doc/wget.texi b/doc/wget.texi

index dda34bcccde4a4ded5d607c9c6549eabe1916f1c..78bc20dca3ac3055b03f63453bd22aded4cc5012 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -14,10 +14,15 @@
  @afourpaper
  @end iftex
  
+@c Title for man page.  The weird way texi2pod.pl is written requires
+@c the preceding @set.
+@set Wget Wget
+@c man title Wget The non-interactive network downloader.
+
  @c This should really be generated automatically, possibly by including
  @c an auto-generated file.
-@set VERSION 1.8.1+cvs
-@set UPDATED December 2001
+@set VERSION 1.9-cvs
+@set UPDATED September 2003
  
  @dircategory Net Utilities
  @dircategory World Wide Web
@@ -30,8 +35,8 @@ This file documents the the GNU Wget utility for downloading network
  data.
  
  @c man begin COPYRIGHT
-Copyright @copyright{} 1996, 1997, 1998, 2000, 2001 Free Software
-Foundation, Inc.
+Copyright @copyright{} 1996, 1997, 1998, 2000, 2001, 2002, 2003 Free
+Software Foundation, Inc.
  
  Permission is granted to make and distribute verbatim copies of
  this manual provided the copyright notice and this permission notice
@@ -507,7 +512,9 @@ IPs.
  @item -t @var{number}
  @itemx --tries=@var{number}
  Set number of retries to @var{number}.  Specify 0 or @samp{inf} for
-infinite retrying.
+infinite retrying.  The default is to retry 20 times, with the exception
+of fatal errors like ``connection refused'' or ``not found'' (404),
+which are not retried.
  
  @item -O @var{file}
  @itemx --output-document=@var{file}
@@ -695,17 +702,15 @@ Limit the download speed to @var{amount} bytes per second.  Amount may
  be expressed in bytes, kilobytes with the @samp{k} suffix, or megabytes
  with the @samp{m} suffix.  For example, @samp{--limit-rate=20k} will
  limit the retrieval rate to 20KB/s.  This kind of thing is useful when,
-for whatever reason, you don't want Wget to consume the entire evailable
+for whatever reason, you don't want Wget to consume the entire available
  bandwidth.
  
-Note that Wget implementeds the limiting by sleeping the appropriate
+Note that Wget implements the limiting by sleeping the appropriate
  amount of time after a network read that took less time than specified
  by the rate.  Eventually this strategy causes the TCP transfer to slow
  down to approximately the specified rate.  However, it takes some time
  for this balance to be achieved, so don't be surprised if limiting the
-rate doesn't work with very small files.  Also, the "sleeping" strategy
-will misfire when an extremely small bandwidth, say less than 1.5KB/s,
-is specified.
+rate doesn't work well with very small files.
  
  @cindex pause
  @cindex wait
@@ -737,7 +742,7 @@ Note that this option is turned on by default in the global
  
  @cindex wait, random
  @cindex random wait
-@itemx --random-wait
+@item --random-wait
  Some web sites may perform log analysis to identify retrieval programs
  such as Wget by looking for statistically significant similarities in
  the time between requests. This option causes the time between requests
@@ -759,7 +764,9 @@ actions of one.
  @item -Y on/off
  @itemx --proxy=on/off
  Turn proxy support on or off.  The proxy is on by default if the
-appropriate environmental variable is defined.
+appropriate environment variable is defined.
+
+For more information about the use of proxies with Wget, see @ref{Proxies}.
  
  @cindex quota
  @item -Q @var{quota}
@@ -777,6 +784,65 @@ Thus you may safely type @samp{wget -Q2m -i sites}---download will be
  aborted when the quota is exceeded.
  
  Setting quota to 0 or to @samp{inf} unlimits the download quota.
+
+@cindex DNS cache
+@cindex caching of DNS lookups
+@item --dns-cache=off
+Turn off caching of DNS lookups.  Normally, Wget remembers the addresses
+it looked up from DNS so it doesn't have to repeatedly contact the DNS
+server for the same (typically small) set of addresses it retrieves
+from.  This cache exists in memory only; a new Wget run will contact DNS
+again.
+
+However, in some cases it is not desirable to cache host names, even for
+the duration of a short-running application like Wget.  For example,
+some HTTP servers are hosted on machines with dynamically allocated IP
+addresses that change from time to time.  Their DNS entries are updated
+along with each change.  When Wget's download from such a host gets
+interrupted by IP address change, Wget retries the download, but (due to
+DNS caching) it contacts the old address.  With the DNS cache turned
+off, Wget will repeat the DNS lookup for every connect and will thus get
+the correct dynamic address every time---at the cost of additional DNS
+lookups where they're probably not needed.
+
+If you don't understand the above description, you probably won't need
+this option.
+
+@cindex file names, restrict
+@cindex Windows file names
+@item --restrict-file-names=@var{mode}
+Change which characters found in remote URLs may show up in local file
+names generated from those URLs.  Characters that are @dfn{restricted}
+by this option are escaped, i.e. replaced with @samp{%HH}, where
+@samp{HH} is the hexadecimal number that corresponds to the restricted
+character.
+
+By default, Wget escapes the characters that are not valid as part of
+file names on your operating system, as well as control characters that
+are typically unprintable.  This option is useful for changing these
+defaults, either because you are downloading to a non-native partition,
+or because you want to disable escaping of the control characters.
+
+When mode is set to ``unix'', Wget escapes the character @samp{/} and
+the control characters in the ranges 0--31 and 128--159.  This is the
+default on Unix-like operating systems.
+
+When mode is set to ``windows'', Wget escapes the characters @samp{\},
+@samp{|}, @samp{/}, @samp{:}, @samp{?}, @samp{"}, @samp{*}, @samp{<},
+@samp{>}, and the control characters in the ranges 0--31 and 128--159.
+In addition to this, Wget in Windows mode uses @samp{+} instead of
+@samp{:} to separate host and port in local file names, and uses
+@samp{@@} instead of @samp{?} to separate the query portion of the file
+name from the rest.  Therefore, a URL that would be saved as
+@samp{www.xemacs.org:4300/search.pl?input=blah} in Unix mode would be
+saved as @samp{www.xemacs.org+4300/search.pl@@input=blah} in Windows
+mode.  This mode is the default on Windows.
+
+If you append @samp{,nocontrol} to the mode, as in
+@samp{unix,nocontrol}, escaping of the control characters is also
+switched off.  You can use @samp{--restrict-file-names=nocontrol} to
+turn off escaping of control characters without affecting the choice of
+the OS to use as file name restriction mode.
  @end table
  
  @node Directory Options, HTTP Options, Download Options, Invoking
@@ -1299,6 +1365,40 @@ To finish off this topic, it's worth knowing that Wget's idea of an
  external document link is any URL specified in an @code{<A>} tag, an
  @code{<AREA>} tag, or a @code{<LINK>} tag other than @code{<LINK
  REL="stylesheet">}.
+
+@cindex HTML comments
+@cindex comments, HTML
+@item --strict-comments
+Turn on strict parsing of HTML comments.  The default is to terminate
+comments at the first occurrence of @samp{-->}.
+
+According to specifications, HTML comments are expressed as SGML
+@dfn{declarations}.  A declaration is special markup that begins with
+@samp{<!} and ends with @samp{>}, such as @samp{<!DOCTYPE ...>}, that
+may contain comments between a pair of @samp{--} delimiters.  HTML
+comments are ``empty declarations'', SGML declarations without any
+non-comment text.  Therefore, @samp{<!--foo-->} is a valid comment, and
+so is @samp{<!--one-- --two-->}, but @samp{<!--1--2-->} is not.
+
+On the other hand, most HTML writers don't perceive comments as anything
+other than text delimited with @samp{<!--} and @samp{-->}, which is not
+quite the same.  For example, something like @samp{<!------------>}
+works as a valid comment as long as the number of dashes is a multiple
+of four (!).  If not, the comment technically lasts until the next
+@samp{--}, which may be at the other end of the document.  Because of
+this, many popular browsers completely ignore the specification and
+implement what users have come to expect: comments delimited with
+@samp{<!--} and @samp{-->}.
+
+Until version 1.9, Wget interpreted comments strictly, which resulted in
+missing links in many web pages that displayed fine in browsers, but had
+the misfortune of containing non-compliant comments.  Beginning with
+version 1.9, Wget has joined the ranks of clients that implement
+``naive'' comments, terminating each comment at the first occurrence of
+@samp{-->}.
+
+If, for whatever reason, you want strict comment parsing, use this
+option to turn it on.
  @end table
  
  @node Recursive Accept/Reject Options,  , Recursive Retrieval Options, Invoking
@@ -2046,6 +2146,10 @@ Top of directory tree---the same as @samp{-P}.
  Turning dirstruct on or off---the same as @samp{-x} or @samp{-nd},
  respectively.
  
+@item dns_cache = on/off
+Turn DNS caching on/off.  Since DNS caching is on by default, this
+option is normally used to turn it off.  Same as @samp{--dns-cache}.
+
  @item domains = @var{string}
  Same as @samp{-D} (@pxref{Spanning Hosts}).
  
@@ -2214,13 +2318,20 @@ Links}).
  If set to on, remove @sc{ftp} listings downloaded by Wget.  Setting it
  to off is the same as @samp{-nr}.
  
+@item restrict_file_names = unix/windows
+Restrict the file names generated by Wget from URLs.  See
+@samp{--restrict-file-names} for a more detailed description.
+
  @item retr_symlinks = on/off
  When set to on, retrieve symbolic links as if they were plain files; the
  same as @samp{--retr-symlinks}.
  
  @item robots = on/off
-Use (or not) @file{/robots.txt} file (@pxref{Robots}).  Be sure to know
-what you are doing before changing the default (which is @samp{on}).
+Specify whether the norobots convention is respected by Wget, ``on'' by
+default.  This switch controls both the @file{/robots.txt} and the
+@samp{nofollow} aspect of the spec.  @xref{Robot Exclusion}, for more
+details about this.  Be sure you know what you are doing before turning
+this off.
  
  @item server_response = on/off
  Choose whether or not to print the @sc{http} and @sc{ftp} server
@@ -2229,6 +2340,9 @@ responses---the same as @samp{-S}.
  @item span_hosts = on/off
  Same as @samp{-H}.
  
+@item strict_comments = on/off
+Same as @samp{--strict-comments}.
+
  @item timeout = @var{n}
  Set timeout value---the same as @samp{-T}.
  
@@ -2744,14 +2858,14 @@ Other than that, Wget will not try to interfere with signals in any way.
  This chapter contains some references I consider useful.
  
  @menu
-* Robots::                  Wget as a WWW robot.
+* Robot Exclusion::         Wget's support for RES.
  * Security Considerations:: Security with Wget.
  * Contributors::            People who helped.
  @end menu
  
-@node Robots, Security Considerations, Appendices, Appendices
-@section Robots
-@cindex robots
+@node Robot Exclusion, Security Considerations, Appendices, Appendices
+@section Robot Exclusion
+@cindex robot exclusion
  @cindex robots.txt
  @cindex server maintenance
  
@@ -2759,26 +2873,35 @@ It is extremely easy to make Wget wander aimlessly around a web site,
  sucking all the available data in progress.  @samp{wget -r @var{site}},
  and you're set.  Great?  Not for the server admin.
  
-While Wget is retrieving static pages, there's not much of a problem.
-But for Wget, there is no real difference between a static page and the
-most demanding CGI.  For instance, a site I know has a section handled
-by an, uh, @dfn{bitchin'} CGI script that converts all the Info files to
-HTML.  The script can and does bring the machine to its knees without
-providing anything useful to the downloader.
-
-For such and similar cases various robot exclusion schemes have been
-devised as a means for the server administrators and document authors to
-protect chosen portions of their sites from the wandering of robots.
-
-The more popular mechanism is the @dfn{Robots Exclusion Standard}, or
-@sc{res}, written by Martijn Koster et al. in 1994.  It specifies the
-format of a text file containing directives that instruct the robots
-which URL paths to avoid.  To be found by the robots, the specifications
-must be placed in @file{/robots.txt} in the server root, which the
-robots are supposed to download and parse.
-
-Wget supports @sc{res} when downloading recursively.  So, when you
-issue:
+As long as Wget is only retrieving static pages, and doing it at a
+reasonable rate (see the @samp{--wait} option), there's not much of a
+problem.  The trouble is that Wget can't tell the difference between the
+smallest static page and the most demanding CGI.  A site I know has a
+section handled by an, uh, @dfn{bitchin'} CGI Perl script that converts
+Info files to HTML on the fly.  The script is slow, but works well
+enough for human users viewing an occasional Info file.  However, when
+someone's recursive Wget download stumbles upon the index page that
+links to all the Info files through the script, the system is brought to
+its knees without providing anything useful to the downloader.
+
+To avoid this kind of accident, as well as to preserve privacy for
+documents that need to be protected from well-behaved robots, the
+concept of @dfn{robot exclusion} has been invented.  The idea is that
+the server administrators and document authors can specify which
+portions of the site they wish to protect from the robots.
+
+The most popular mechanism, and the de facto standard supported by all
+the major robots, is the ``Robots Exclusion Standard'' (RES) written by
+Martijn Koster et al. in 1994.  It specifies the format of a text file
+containing directives that instruct the robots which URL paths to avoid.
+To be found by the robots, the specifications must be placed in
+@file{/robots.txt} in the server root, which the robots are supposed to
+download and parse.
+
+Although Wget is not a web robot in the strictest sense of the word, it
+can download large parts of a site without the user intervening to
+request each individual page.  Because of that, Wget honors RES when
+downloading recursively.  For instance, when you issue:
  
  @example
  wget -r http://www.server.com/
@@ -2815,7 +2938,12 @@ This is explained in some detail at
  method of robot exclusion in addition to the usual @file{/robots.txt}
  exclusion.
  
-@node Security Considerations, Contributors, Robots, Appendices
+If you know what you are doing and really really wish to turn off the
+robot exclusion, set the @code{robots} variable to @samp{off} in your
+@file{.wgetrc}.  You can achieve the same effect from the command line
+using the @code{-e} switch, e.g. @samp{wget -e robots=off @var{url}...}.
+
+@node Security Considerations, Contributors, Robot Exclusion, Appendices
  @section Security Considerations
  @cindex security