@sc{html} files on your local disk, by adding @code{<base
href="@var{url}">} to @sc{html}, or using the @samp{--base} command-line
option.
+
+@cindex base for relative links in input file
+@item -B @var{URL}
+@itemx --base=@var{URL}
+When used in conjunction with @samp{-F}, prepends @var{URL} to relative
+links in the file specified by @samp{-i}.
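+
+For example, assuming a hypothetical local file @file{links.html} full
+of relative links, they could be resolved against a base like this:
+
+@example
+wget -i links.html -F -B http://www.example.com/docs/
+@end example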
@end table
@node Download Options, Directory Options, Logging and Input File Options, Invoking
@section Download Options
@table @samp
+@cindex bind() address
+@cindex client IP address
+@cindex IP address, client
+@item --bind-address=@var{ADDRESS}
+When making client TCP/IP connections, @code{bind()} to @var{ADDRESS} on
+the local machine. @var{ADDRESS} may be specified as a hostname or IP
+address. This option can be useful if your machine is bound to multiple
+IPs.
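+
+For example, on a machine bound to several addresses, one might select
+a particular outgoing address (the address below is only illustrative):
+
+@example
+wget --bind-address=192.168.1.5 http://www.example.com/
+@end example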
+
@cindex retries
@cindex tries
@cindex number of retries
the documents will be written to standard output. Including this option
automatically sets the number of tries to 1.
+@cindex clobbering, file
+@cindex downloading multiple times
@cindex no-clobber
@item -nc
@itemx --no-clobber
-Do not clobber existing files when saving to directory hierarchy within
-recursive retrieval of several files. This option is @emph{extremely}
-useful when you wish to continue where you left off with retrieval of
-many files. If the files have the @samp{.html} or (yuck) @samp{.htm}
-suffix, they will be loaded from the local disk, and parsed as if they
-have been retrieved from the Web.
+If a file is downloaded more than once in the same directory, Wget's
+behavior depends on a few options, including @samp{-nc}. In certain
+cases, the local file will be ``clobbered'', or overwritten, upon
+repeated download. In other cases it will be preserved.
+
+When running Wget without @samp{-N}, @samp{-nc}, or @samp{-r},
+downloading the same file in the same directory will result in the
+original copy of @samp{@var{file}} being preserved and the second copy
+being named @samp{@var{file}.1}. If that file is downloaded yet again,
+the third copy will be named @samp{@var{file}.2}, and so on. When
+@samp{-nc} is specified, this behavior is suppressed, and Wget will
+refuse to download newer copies of @samp{@var{file}}. Therefore,
+``no-clobber'' is actually a misnomer in this mode -- it's not clobbering
+that's prevented (as the numeric suffixes were already preventing
+clobbering), but rather the multiple version saving that's prevented.
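+
+For illustration (using a hypothetical @sc{url}), the default and
+@samp{-nc} behaviors look like this:
+
+@example
+wget http://www.example.com/index.html      # saved as index.html
+wget http://www.example.com/index.html      # saved as index.html.1
+wget -nc http://www.example.com/index.html  # not downloaded
+@end example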
+
+When running Wget with @samp{-r}, but without @samp{-N} or @samp{-nc},
+re-downloading a file will result in the new copy simply overwriting the
+old. Adding @samp{-nc} will prevent this behavior, instead causing the
+original version to be preserved and any newer copies on the server to
+be ignored.
+
+When running Wget with @samp{-N}, with or without @samp{-r}, the
+decision as to whether or not to download a newer copy of a file depends
+on the local and remote timestamp and size of the file
+(@xref{Time-Stamping}). @samp{-nc} may not be specified at the same
+time as @samp{-N}.
+
+Note that when @samp{-nc} is specified, files with the suffixes
+@samp{.html} or (yuck) @samp{.htm} will be loaded from the local disk
+and parsed as if they had been retrieved from the Web.
@cindex continue retrieval
@item -c
remote file to @file{ls-lR.Z.1}. The @samp{-c} option is also
applicable for @sc{http} servers that support the @code{Range} header.
+Note that if you use @samp{-c} on a file that's already downloaded
+completely, @samp{@var{file}} will not be changed, nor will a second
+@samp{@var{file}.1} copy be created.
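+
+For instance, a download of the hypothetical listing file above could
+be resumed after an interruption like this:
+
+@example
+wget -c ftp://ftp.example.com/ls-lR.Z
+@end example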
+
@cindex dot style
@cindex retrieval tracing style
@item --dot-style=@var{style}
@section HTTP Options
@table @samp
+@cindex .html extension
+@item -E
+@itemx --html-extension
+If a file of type @samp{text/html} is downloaded and the URL does not
+end with the regexp @samp{\.[Hh][Tt][Mm][Ll]?}, this option will cause the
+suffix @samp{.html} to be appended to the local filename. This is
+useful, for instance, when you're mirroring a remote site that uses
+@samp{.asp} pages, but you want the mirrored pages to be viewable on
+your stock Apache server. Another good use for this is when you're
+downloading the output of CGIs. A URL like
+@samp{http://site.com/article.cgi?25} will be saved as
+@file{article.cgi?25.html}.
+
+Note that filenames changed in this way will be re-downloaded every time
+you re-mirror a site, because Wget can't tell that the local
+@file{@var{X}.html} file corresponds to remote URL @samp{@var{X}} (since
+it doesn't yet know that the URL produces output of type
+@samp{text/html}). To prevent this re-downloading, you must use
+@samp{-k} and @samp{-K} so that the original version of the file will be
+saved as @file{@var{X}.orig} (@xref{Recursive Retrieval Options}).
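+
+To mirror such a site and keep subsequent re-mirrors efficient, one
+might combine these options (the @sc{url} is hypothetical):
+
+@example
+wget -r -E -k -K http://www.example.com/
+@end example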
+
@cindex http user
@cindex http password
@cindex authentication
Microsoft @code{Internet Explorer}. This option allows you to change
the @code{User-Agent} line issued by Wget. Use of this option is
discouraged, unless you really know what you are doing.
-
-@strong{NOTE} that Netscape Communications Corp. has claimed that false
-transmissions of @samp{Mozilla} as the @code{User-Agent} are a copyright
-infringement, which will be prosecuted. @strong{DO NOT} misrepresent
-Wget as Mozilla.
@end table
@node FTP Options, Recursive Retrieval Options, HTTP Options, Invoking
@section FTP Options
@table @samp
-@cindex retrieve symbolic links
+@cindex symbolic links, retrieving
@item --retr-symlinks
-Retrieve symbolic links on @sc{ftp} sites as if they were plain files,
-i.e. don't just create links locally.
+Usually, when retrieving @sc{ftp} directories recursively and a symbolic
+link is encountered, the linked-to file is not downloaded. Instead, a
+matching symbolic link is created on the local filesystem. The
+pointed-to file will not be downloaded unless this recursive retrieval
+would have encountered it separately and downloaded it anyway.
+
+When @samp{--retr-symlinks} is specified, however, symbolic links are
+traversed and the pointed-to files are retrieved. At this time, this
+option does not cause Wget to traverse symlinks to directories and
+recurse through them, but in the future it should be enhanced to do
+this.
+
+Note that when retrieving a file (not a directory) because it was
+specified on the commandline, rather than because it was recursed to,
+this option has no effect. Symbolic links are always traversed in this
+case.
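+
+For example, to retrieve the files that symbolic links point to during
+a recursive @sc{ftp} retrieval (hypothetical site):
+
+@example
+wget -r --retr-symlinks ftp://ftp.example.com/pub/
+@end example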
@cindex globbing, toggle
@item -g on/off
@item --delete-after
This option tells Wget to delete every single file it downloads,
@emph{after} having done so. It is useful for pre-fetching popular
-pages through proxy, e.g.:
+pages through a proxy, e.g.:
@example
wget -r -nd --delete-after http://whatever.com/~popular/page/
@end example
-The @samp{-r} option is to retrieve recursively, and @samp{-nd} not to
-create directories.
+The @samp{-r} option is to retrieve recursively, and @samp{-nd} to not
+create directories.
+
+Note that @samp{--delete-after} deletes files on the local machine. It
+does not issue the @samp{DELE} command to remote FTP sites, for
+instance. Also note that when @samp{--delete-after} is specified,
+@samp{--convert-links} is ignored, so @samp{.orig} files are simply not
+created in the first place.
@cindex conversion of links
@cindex link conversion
received from @sc{ftp} servers. Not removing them can be useful to
access the full remote file list when running a mirror, or for debugging
purposes.
+
+@cindex page requisites
+@cindex required images, downloading
+@item -p
+@itemx --page-requisites
+This option causes Wget to download all the files that are necessary to
+properly display a given HTML page. This includes such things as
+inlined images, sounds, and referenced stylesheets.
+
+Ordinarily, when downloading a single HTML page, any requisite documents
+that may be needed to display it properly are not downloaded. Using
+@samp{-r} together with @samp{-l} can help, but since Wget does not
+ordinarily distinguish between external and inlined documents, one is
+generally left with ``leaf documents'' that are missing their requisites.
+
+For instance, say document @file{1.html} contains an @code{<IMG>} tag
+referencing @file{1.gif} and an @code{<A>} tag pointing to external
+document @file{2.html}. Say that @file{2.html} is the same but that its
+image is @file{2.gif} and it links to @file{3.html}. Say this
+continues up to some arbitrarily high number.
+
+If one executes the command:
+
+@example
+wget -r -l 2 http://@var{site}/1.html
+@end example
+
+then @file{1.html}, @file{1.gif}, @file{2.html}, @file{2.gif}, and
+@file{3.html} will be downloaded. As you can see, @file{3.html} is
+without its requisite @file{3.gif} because Wget is simply counting the
+number of hops (up to 2) away from @file{1.html} in order to determine
+where to stop the recursion. However, with this command:
+
+@example
+wget -r -l 2 -p http://@var{site}/1.html
+@end example
+
+all the above files @emph{and} @file{3.html}'s requisite @file{3.gif}
+will be downloaded. Similarly,
+
+@example
+wget -r -l 1 -p http://@var{site}/1.html
+@end example
+
+will cause @file{1.html}, @file{1.gif}, @file{2.html}, and @file{2.gif}
+to be downloaded. One might think that:
+
+@example
+wget -r -l 0 -p http://@var{site}/1.html
+@end example
+
+would download just @file{1.html} and @file{1.gif}, but unfortunately
+this is not the case, because @samp{-l 0} is equivalent to @samp{-l inf}
+-- that is, infinite recursion. To download a single HTML page (or a
+handful of them, all specified on the commandline or in a @samp{-i} @sc{url}
+input file) and its requisites, simply leave off @samp{-p} and @samp{-l}:
+
+@example
+wget -p http://@var{site}/1.html
+@end example
+
+Note that Wget will behave as if @samp{-r} had been specified, but only
+that single page and its requisites will be downloaded. Links from that
+page to external documents will not be followed. Actually, to download
+a single page and all its requisites (even if they exist on separate
+websites), and make sure the lot displays properly locally, this author
+likes to use a few options in addition to @samp{-p}:
+
+@example
+wget -E -H -k -K -nh -p http://@var{site}/@var{document}
+@end example
+
+To finish off this topic, it's worth knowing that Wget's idea of an
+external document link is any URL specified in an @code{<A>} tag, an
+@code{<AREA>} tag, or a @code{<LINK>} tag other than @code{<LINK
+REL="stylesheet">}.
@end table
@node Recursive Accept/Reject Options, , Recursive Retrieval Options, Invoking
@itemx --ignore-tags=@var{list}
This is the opposite of the @samp{--follow-tags} option. To skip
certain HTML tags when recursively looking for documents to download,
-specify them in a comma-separated @var{list}. The author of this option
-likes to use the following command to download a single HTML page and
-all documents necessary to display it properly:
+specify them in a comma-separated @var{list}.
+
+In the past, the @samp{-G} option was the best bet for downloading a
+single page and its requisites, using a commandline like:
@example
wget -Ga,area -H -k -K -nh -r http://@var{site}/@var{document}
@end example
+However, the author of this option came across a page with tags like
+@code{<LINK REL="home" HREF="/">} and came to the realization that
+@samp{-G} was not enough. One can't just tell Wget to ignore
+@code{<LINK>}, because then stylesheets will not be downloaded. Now the
+best bet for downloading a single page and its requisites is the
+dedicated @samp{--page-requisites} option.
+
@item -H
@itemx --span-hosts
Enable spanning across hosts when doing recursive retrieving (@xref{All
@section Wgetrc Commands
@cindex wgetrc commands
-The complete set of commands is listed below, the letter after @samp{=}
-denoting the value the command takes. It is @samp{on/off} for @samp{on}
-or @samp{off} (which can also be @samp{1} or @samp{0}), @var{string} for
-any non-empty string or @var{n} for a positive integer. For example,
-you may specify @samp{use_proxy = off} to disable use of proxy servers
-by default. You may use @samp{inf} for infinite values, where
-appropriate.
+The complete set of commands is listed below. Legal values are listed
+after the @samp{=}. Simple Boolean values can be set or unset using
+@samp{on} and @samp{off} or @samp{1} and @samp{0}. A fancier kind of
+Boolean allowed in some cases is the ``lockable'' Boolean, which may be
+set to @samp{on}, @samp{off}, @samp{always}, or @samp{never}. If an
+option is set to @samp{always} or @samp{never}, that value will be
+locked in for the duration of the Wget invocation -- commandline options
+will not override.
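+
+For example, a @file{.wgetrc} fragment using a lockable Boolean might
+look like this:
+
+@example
+# Never use passive FTP, no matter what the commandline says.
+passive_ftp = never
+@end example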
-Most of the commands have their equivalent command-line option
-(@xref{Invoking}), except some more obscure or rarely used ones.
+Some commands take pseudo-arbitrary values. @var{address} values can be
+hostnames or dotted-quad IP addresses. @var{n} can be any positive
+integer, or @samp{inf} for infinity, where appropriate. @var{string}
+values can be any non-empty string.
+
+Most of these commands have commandline equivalents (@xref{Invoking}),
+though some of the more obscure or rarely used ones do not.
@table @asis
@item accept/reject = @var{string}
Enable/disable host-prefixed file names. @samp{-nH} disables it.
@item continue = on/off
-Enable/disable continuation of the retrieval, the same as @samp{-c}
+Enable/disable continuation of the retrieval -- the same as @samp{-c}
(which enables it).
@item background = on/off
-Enable/disable going to background, the same as @samp{-b} (which enables
+Enable/disable going to background -- the same as @samp{-b} (which enables
it).
@item backup_converted = on/off
@c @item backups = @var{number}
@c #### Document me!
+@c
@item base = @var{string}
-Set base for relative @sc{url}s, the same as @samp{-B}.
+When @sc{url} input files are forced to be interpreted as @sc{html},
+consider relative @sc{url}s in them as being relative to @var{string}
+-- the same as @samp{-B}.
+
+@item bind_address = @var{address}
+Bind to @var{address}, like the @samp{--bind-address} option.
@item cache = on/off
When set to off, disallow server-caching. See the @samp{-C} option.
Debug mode, same as @samp{-d}.
@item delete_after = on/off
-Delete after download, the same as @samp{--delete-after}.
+Delete after download -- the same as @samp{--delete-after}.
@item dir_prefix = @var{string}
-Top of directory tree, the same as @samp{-P}.
+Top of directory tree -- the same as @samp{-P}.
@item dirstruct = on/off
-Turning dirstruct on or off, the same as @samp{-x} or @samp{-nd},
+Turning dirstruct on or off -- the same as @samp{-x} or @samp{-nd},
respectively.
@item domains = @var{string}
@item exclude_directories = @var{string}
Specify a comma-separated list of directories you wish to exclude from
-download, the same as @samp{-X} (@xref{Directory-Based Limits}).
+download -- the same as @samp{-X} (@xref{Directory-Based Limits}).
@item exclude_domains = @var{string}
Same as @samp{--exclude-domains} (@xref{Domain Acceptance}).
@item follow_ftp = on/off
-Follow @sc{ftp} links from @sc{html} documents, the same as @samp{-f}.
+Follow @sc{ftp} links from @sc{html} documents -- the same as @samp{-f}.
@item follow_tags = @var{string}
Only follow certain HTML tags when doing a recursive retrieval, just like
@item force_html = on/off
If set to on, force the input filename to be regarded as an @sc{html}
-document, the same as @samp{-F}.
+document -- the same as @samp{-F}.
@item ftp_proxy = @var{string}
Use @var{string} as @sc{ftp} proxy, instead of the one specified in
environment.
@item glob = on/off
-Turn globbing on/off, the same as @samp{-g}.
+Turn globbing on/off -- the same as @samp{-g}.
@item header = @var{string}
Define an additional header, like @samp{--header}.
+@item html_extension = on/off
+Add a @samp{.html} extension to @samp{text/html} files without it, like
+@samp{-E}.
+
@item http_passwd = @var{string}
Set @sc{http} password.
@item include_directories = @var{string}
Specify a comma-separated list of directories you wish to follow when
-downloading, the same as @samp{-I}.
+downloading -- the same as @samp{-I}.
@item input = @var{string}
Read the @sc{url}s from @var{string}, like @samp{-i}.
to the value in @code{Content-Length}.
@item logfile = @var{string}
-Set logfile, the same as @samp{-o}.
+Set logfile -- the same as @samp{-o}.
@item login = @var{string}
Your user name on the remote machine, for @sc{ftp}. Defaults to
proxy loading, instead of the one specified in environment.
@item output_document = @var{string}
-Set the output filename, the same as @samp{-O}.
+Set the output filename -- the same as @samp{-O}.
-@item passive_ftp = on/off
-Set passive @sc{ftp}, the same as @samp{--passive-ftp}.
+@item page_requisites = on/off
+Download all ancillary documents necessary for a single HTML page to
+display properly -- the same as @samp{-p}.
+
+@item passive_ftp = on/off/always/never
+Set passive @sc{ftp} -- the same as @samp{--passive-ftp}. Some scripts
+and @samp{.pm} (Perl module) files invoke Wget with @samp{wget
+--passive-ftp}. If your firewall does not allow passive @sc{ftp}, you
+can set @samp{passive_ftp = never} to override the commandline.
@item passwd = @var{string}
Set your @sc{ftp} password to @var{string}. Without this setting, the
password defaults to @samp{username@@hostname.domainname}.
@item proxy_user = @var{string}
-Set proxy authentication user name to @var{string}, like
-@samp{--proxy-user}.
+Set proxy authentication user name to @var{string}, like @samp{--proxy-user}.
@item proxy_passwd = @var{string}
-Set proxy authentication password to @var{string}, like
-@samp{--proxy-passwd}.
+Set proxy authentication password to @var{string}, like @samp{--proxy-passwd}.
+
+@item referer = @var{string}
+Set HTTP @samp{Referer:} header just like @samp{--referer}. (Note it
+was the folks who wrote the @sc{http} spec who got the spelling of
+``referrer'' wrong.)
@item quiet = on/off
-Quiet mode, the same as @samp{-q}.
+Quiet mode -- the same as @samp{-q}.
@item quota = @var{quota}
Specify the download quota, which is useful to put in the global
mbytes. Note that the user's startup file overrides system settings.
@item reclevel = @var{n}
-Recursion level, the same as @samp{-l}.
+Recursion level -- the same as @samp{-l}.
@item recursive = on/off
-Recursive on/off, the same as @samp{-r}.
+Recursive on/off -- the same as @samp{-r}.
@item relative_only = on/off
-Follow only relative links, the same as @samp{-L} (@xref{Relative
+Follow only relative links -- the same as @samp{-L} (@xref{Relative
Links}).
@item remove_listing = on/off
@item server_response = on/off
Choose whether or not to print the @sc{http} and @sc{ftp} server
-responses, the same as @samp{-S}.
+responses -- the same as @samp{-S}.
@item simple_host_check = on/off
Same as @samp{-nh} (@xref{Host Checking}).
Same as @samp{-H}.
@item timeout = @var{n}
-Set timeout value, the same as @samp{-T}.
+Set timeout value -- the same as @samp{-T}.
@item timestamping = on/off
Turn timestamping on/off. The same as @samp{-N} (@xref{Time-Stamping}).
@item tries = @var{n}
-Set number of retries per @sc{url}, the same as @samp{-t}.
+Set number of retries per @sc{url} -- the same as @samp{-t}.
@item use_proxy = on/off
Turn proxy support on/off. The same as @samp{-Y}.
@item verbose = on/off
-Turn verbose on/off, the same as @samp{-v}/@samp{-nv}.
+Turn verbose on/off -- the same as @samp{-v}/@samp{-nv}.
@item wait = @var{n}
-Wait @var{n} seconds between retrievals, the same as @samp{-w}.
+Wait @var{n} seconds between retrievals -- the same as @samp{-w}.
@item waitretry = @var{n}
Wait up to @var{n} seconds between retries of failed retrievals only --
@cindex contributors
@iftex
-GNU Wget was written by Hrvoje Nik@v{s}i@'{c} @email{hniksic@@iskon.hr}.
+GNU Wget was written by Hrvoje Nik@v{s}i@'{c} @email{hniksic@@arsdigita.com}.
@end iftex
@ifinfo
-GNU Wget was written by Hrvoje Niksic @email{hniksic@@iskon.hr}.
+GNU Wget was written by Hrvoje Niksic @email{hniksic@@arsdigita.com}.
@end ifinfo
However, its development could never have gone as far as it has, were it
not for the help of many people, either with bug reports, feature