@c man end
@end ignore
@c man begin DESCRIPTION
-Wget can follow links in @sc{html} pages and create local versions of
-remote web sites, fully recreating the directory structure of the
-original site. This is sometimes referred to as ``recursive
+Wget can follow links in @sc{html} and @sc{xhtml} pages and create local
+versions of remote web sites, fully recreating the directory structure of
+the original site. This is sometimes referred to as ``recursive
downloading.'' While doing that, Wget respects the Robot Exclusion
Standard (@file{/robots.txt}). Wget can be instructed to convert the
links in downloaded @sc{html} files to the local files for offline
@cindex timeout
@item -T seconds
@itemx --timeout=@var{seconds}
-Set the network timeout to @var{seconds} seconds. Whenever Wget
-connects to or reads from a remote host, it checks for a timeout and
-aborts the operation if the time expires. This prevents anomalous
-occurrences such as hanging reads or infinite connects. The default
-timeout is 900 seconds (fifteen minutes). Setting timeout to 0 will
-disable checking for timeouts.
-
-Please do not lower the default timeout value with this option unless
-you know what you are doing.
+Set the network timeouts to @var{seconds} seconds. This is equivalent
+to specifying @samp{--dns-timeout}, @samp{--connect-timeout}, and
+@samp{--read-timeout}, all at the same time.
+
+Whenever Wget connects to or reads from a remote host, it checks for a
+timeout and aborts the operation if the time expires. This prevents
+anomalous occurrences such as hanging reads or infinite connects. The
+only timeout enabled by default is a 900-second timeout for reading.
+Setting the timeout to 0 disables checking for timeouts.
+
+Unless you know what you are doing, it is best not to set any of the
+timeout-related options.
+
+@cindex DNS timeout
+@cindex timeout, DNS
+@item --dns-timeout=@var{seconds}
+Set the DNS lookup timeout to @var{seconds} seconds. DNS lookups that
+don't complete within the specified time will fail. By default, there
+is no timeout on DNS lookups, other than that implemented by system
+libraries.
+
+@cindex connect timeout
+@cindex timeout, connect
+@item --connect-timeout=@var{seconds}
+Set the connect timeout to @var{seconds} seconds. TCP connections that
+take longer to establish will be aborted. By default, there is no
+connect timeout, other than that implemented by system libraries.
+
+@cindex read timeout
+@cindex timeout, read
+@item --read-timeout=@var{seconds}
+Set the read (and write) timeout to @var{seconds} seconds. Reads that
+take longer will fail. The default value for read timeout is 900
+seconds.
@cindex bandwidth, limit
@cindex rate, limit
Note that Wget implements the limiting by sleeping the appropriate
amount of time after a network read that took less time than specified
by the rate. Eventually this strategy causes the TCP transfer to slow
-down to approximately the specified rate. However, it takes some time
-for this balance to be achieved, so don't be surprised if limiting the
-rate doesn't work well with very small files.
+down to approximately the specified rate. However, it may take some
+time for this balance to be achieved, so don't be surprised if limiting
+the rate doesn't work well with very small files.
@cindex pause
@cindex wait
@cindex .html extension
@item -E
@itemx --html-extension
-If a file of type @samp{text/html} is downloaded and the URL does not
-end with the regexp @samp{\.[Hh][Tt][Mm][Ll]?}, this option will cause
-the suffix @samp{.html} to be appended to the local filename. This is
-useful, for instance, when you're mirroring a remote site that uses
-@samp{.asp} pages, but you want the mirrored pages to be viewable on
-your stock Apache server. Another good use for this is when you're
-downloading the output of CGIs. A URL like
-@samp{http://site.com/article.cgi?25} will be saved as
+If a file of type @samp{application/xhtml+xml} or @samp{text/html} is
+downloaded and the URL does not end with the regexp
+@samp{\.[Hh][Tt][Mm][Ll]?}, this option will cause the suffix @samp{.html}
+to be appended to the local filename. This is useful, for instance, when
+you're mirroring a remote site that uses @samp{.asp} pages, but you want
+the mirrored pages to be viewable on your stock Apache server. Another
+good use for this is when you're downloading the output of CGIs. A URL
+like @samp{http://site.com/article.cgi?25} will be saved as
@file{article.cgi?25.html}.
Note that filenames changed in this way will be re-downloaded every time
you re-mirror a site, because Wget can't tell that the local
@file{@var{X}.html} file corresponds to remote URL @samp{@var{X}} (since
it doesn't yet know that the URL produces output of type
-@samp{text/html}. To prevent this re-downloading, you must use
-@samp{-k} and @samp{-K} so that the original version of the file will be
-saved as @file{@var{X}.orig} (@pxref{Recursive Retrieval Options}).
+@samp{text/html} or @samp{application/xhtml+xml}). To prevent this
+re-downloading, you must use @samp{-k} and @samp{-K} so that the original
+version of the file will be saved as @file{@var{X}.orig} (@pxref{Recursive
+Retrieval Options}).
@cindex http user
@cindex http password
external document link is any URL specified in an @code{<A>} tag, an
@code{<AREA>} tag, or a @code{<LINK>} tag other than @code{<LINK
REL="stylesheet">}.
+
+@cindex HTML comments
+@cindex comments, HTML
+@item --strict-comments
+Turn on strict parsing of HTML comments. The default is to terminate
+comments at the first occurrence of @samp{-->}.
+
+According to specifications, HTML comments are expressed as SGML
+@dfn{declarations}. A declaration is special markup that begins with
+@samp{<!} and ends with @samp{>}, such as @samp{<!DOCTYPE ...>}, that
+may contain comments between a pair of @samp{--} delimiters. HTML
+comments are ``empty declarations'', SGML declarations without any
+non-comment text. Therefore, @samp{<!--foo-->} is a valid comment, and
+so is @samp{<!--one-- --two-->}, but @samp{<!--1--2-->} is not.
+
+On the other hand, most HTML writers don't perceive comments as anything
+other than text delimited with @samp{<!--} and @samp{-->}, which is not
+quite the same. For example, something like @samp{<!------------>}
+works as a valid comment as long as the number of dashes is a multiple
+of four (!). If not, the comment technically lasts until the next
+@samp{--}, which may be at the other end of the document. Because of
+this, many popular browsers completely ignore the specification and
+implement what users have come to expect: comments delimited with
+@samp{<!--} and @samp{-->}.
+
+Until version 1.9, Wget interpreted comments strictly, which resulted in
+missing links in many web pages that displayed fine in browsers, but had
+the misfortune of containing non-compliant comments. Beginning with
+version 1.9, Wget has joined the ranks of clients that implement
+``naive'' comments, terminating each comment at the first occurrence of
+@samp{-->}.
+
+If, for whatever reason, you want strict comment parsing, use this
+option to turn it on.
@end table
@node Recursive Accept/Reject Options, , Recursive Retrieval Options, Invoking
the given @sc{url}, documents, retrieving the files the @sc{html}
document was referring to, through markups like @code{href}, or
@code{src}. If the freshly downloaded file is also of type
-@code{text/html}, it will be parsed and followed further.
+@code{text/html} or @code{application/xhtml+xml}, it will be parsed and
+followed further.
Recursive retrieval of @sc{http} and @sc{html} content is
@dfn{breadth-first}. This means that Wget first downloads the requested
@item save_cookies = @var{file}
Save cookies to @var{file}. See @samp{--save-cookies}.
+@item connect_timeout = @var{n}
+Set the connect timeout---the same as @samp{--connect-timeout}.
+
@item cut_dirs = @var{n}
Ignore @var{n} remote directory components.
Turn DNS caching on/off. Since DNS caching is on by default, this
option is normally used to turn it off. Same as @samp{--dns-cache}.
+@item dns_timeout = @var{n}
+Set the DNS timeout---the same as @samp{--dns-timeout}.
+
@item domains = @var{string}
Same as @samp{-D} (@pxref{Spanning Hosts}).
Define an additional header, like @samp{--header}.
@item html_extension = on/off
-Add a @samp{.html} extension to @samp{text/html} files without it, like
+Add a @samp{.html} extension to @samp{text/html} or
+@samp{application/xhtml+xml} files without it, like
@samp{-E}.
@item http_passwd = @var{string}
to 5 mbytes. Note that the user's startup file overrides system
settings.
+@item read_timeout = @var{n}
+Set the read (and write) timeout---the same as @samp{--read-timeout}.
+
@item reclevel = @var{n}
Recursion level---the same as @samp{-l}.
@item span_hosts = on/off
Same as @samp{-H}.
+@item strict_comments = on/off
+Same as @samp{--strict-comments}.
+
@item timeout = @var{n}
Set timeout value---the same as @samp{-T}.
when HTML files are saved under extensions other than @samp{.html},
perhaps because they were served as @file{index.cgi}. So you'd like
Wget to rename all the files served with content-type @samp{text/html}
-to @file{@var{name}.html}.
+or @samp{application/xhtml+xml} to @file{@var{name}.html}.
@example
wget --mirror --convert-links --backup-converted \