diff --git a/doc/wget.texi b/doc/wget.texi
index 86e23cf8..0310cf63 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -443,6 +443,15 @@ links in the file specified by @samp{-i}.
@section Download Options
@table @samp

+@cindex bind() address
+@cindex client IP address
+@cindex IP address, client
+@item --bind-address=@var{ADDRESS}
+When making client TCP/IP connections, @code{bind()} to @var{ADDRESS} on
+the local machine. @var{ADDRESS} may be specified as a hostname or IP
+address. This option can be useful if your machine is bound to multiple
+IPs.
+
@cindex retries
@cindex tries
@cindex number of retries
@@ -706,6 +715,27 @@ current directory).
@section HTTP Options
@table @samp

+@cindex .html extension
+@item -E
+@itemx --html-extension
+If a file of type @samp{text/html} is downloaded and the URL does not
+end with the regexp "\.[Hh][Tt][Mm][Ll]?", this option will cause the
+suffix @samp{.html} to be appended to the local filename. This is
+useful, for instance, when you're mirroring a remote site that uses
+@samp{.asp} pages, but you want the mirrored pages to be viewable on
+your stock Apache server. Another good use for this is when you're
+downloading the output of CGIs. A URL like
+@samp{http://site.com/article.cgi?25} will be saved as
+@file{article.cgi?25.html}.
+
+Note that filenames changed in this way will be re-downloaded every time
+you re-mirror a site, because wget can't tell that the local
+@file{@var{X}.html} file corresponds to remote URL @samp{@var{X}} (since
+it doesn't yet know that the URL produces output of type
+@samp{text/html}). To prevent this re-downloading, you must use
+@samp{-k} and @samp{-K} so that the original version of the file will be
+saved as @file{@var{X}.orig} (@xref{Recursive Retrieval Options}).
+
@cindex http user
@cindex http password
@cindex authentication
@@ -807,21 +837,30 @@ servers denying information to clients other than @code{Mozilla} or
Microsoft @code{Internet Explorer}. This option allows you to change
the @code{User-Agent} line issued by Wget. Use of this option is
discouraged, unless you really know what you are doing.
-
-@strong{NOTE} that Netscape Communications Corp. has claimed that false
-transmissions of @samp{Mozilla} as the @code{User-Agent} are a copyright
-infringement, which will be prosecuted. @strong{DO NOT} misrepresent
-Wget as Mozilla.
@end table

@node FTP Options, Recursive Retrieval Options, HTTP Options, Invoking
@section FTP Options
@table @samp

-@cindex retrieve symbolic links
+@cindex symbolic links, retrieving
@item --retr-symlinks
-Retrieve symbolic links on @sc{ftp} sites as if they were plain files,
-i.e. don't just create links locally.
+Usually, when retrieving @sc{ftp} directories recursively and a symbolic
+link is encountered, the linked-to file is not downloaded. Instead, a
+matching symbolic link is created on the local filesystem. The
+pointed-to file will not be downloaded unless this recursive retrieval
+would have encountered it separately and downloaded it anyway.
+
+When @samp{--retr-symlinks} is specified, however, symbolic links are
+traversed and the pointed-to files are retrieved. At this time, this
+option does not cause wget to traverse symlinks to directories and
+recurse through them, but in the future it should be enhanced to do
+this.
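+
+For illustration, an invocation along these lines (the host and path
+are only placeholders) retrieves the files that symbolic links point
+to, instead of recreating the links locally:
+
+@example
+wget -r --retr-symlinks ftp://ftp.example.com/pub/sources/
+@end example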
+
+Note that when retrieving a file (not a directory) because it was
+specified on the commandline, rather than because it was recursed to,
+this option has no effect. Symbolic links are always traversed in this
+case.

@cindex globbing, toggle
@item -g on/off
@@ -871,14 +910,20 @@ Retrieval}). The default maximum depth is 5.

@item --delete-after
This option tells Wget to delete every single file it downloads,
@emph{after} having done so. It is useful for pre-fetching popular
-pages through proxy, e.g.:
+pages through a proxy, e.g.:

@example
wget -r -nd --delete-after http://whatever.com/~popular/page/
@end example

-The @samp{-r} option is to retrieve recursively, and @samp{-nd} not to
-create directories.
+The @samp{-r} option is to retrieve recursively, and @samp{-nd} to not
+create directories.
+
+Note that @samp{--delete-after} deletes files on the local machine. It
+does not issue the @samp{DELE} command to remote FTP sites, for
+instance. Also note that when @samp{--delete-after} is specified,
+@samp{--convert-links} is ignored, so @samp{.orig} files are simply not
+created in the first place.

@cindex conversion of links
@cindex link conversion
@@ -913,6 +958,82 @@ retrievals. Normally, these files contain the raw directory listings
received from @sc{ftp} servers. Not removing them can be useful to
access the full remote file list when running a mirror, or for
debugging purposes.
+
+@cindex page requisites
+@cindex required images, downloading
+@item -p
+@itemx --page-requisites
+This option causes wget to download all the files that are necessary to
+properly display a given HTML page. This includes such things as
+inlined images, sounds, and referenced stylesheets.
+
+Ordinarily, when downloading a single HTML page, any requisite documents
+that may be needed to display it properly are not downloaded. Using
+@samp{-r} together with @samp{-l} can help, but since wget does not
+ordinarily distinguish between external and inlined documents, one is
+generally left with "leaf documents" that are missing their requisites.
+
+For instance, say document @file{1.html} contains an @code{<IMG>} tag
+referencing @file{1.gif} and an @code{<A>} tag pointing to external
+document @file{2.html}. Say that @file{2.html} is the same but that its
+image is @file{2.gif} and it links to @file{3.html}. Say this
+continues up to some arbitrarily high number.
+
+If one executes the command:
+
+@example
+wget -r -l 2 http://@var{site}/1.html
+@end example
+
+then @file{1.html}, @file{1.gif}, @file{2.html}, @file{2.gif}, and
+@file{3.html} will be downloaded. As you can see, @file{3.html} is
+without its requisite @file{3.gif} because wget is simply counting the
+number of hops (up to 2) away from @file{1.html} in order to determine
+where to stop the recursion. However, with this command:
+
+@example
+wget -r -l 2 -p http://@var{site}/1.html
+@end example
+
+all the above files @emph{and} @file{3.html}'s requisite @file{3.gif}
+will be downloaded. Similarly,
+
+@example
+wget -r -l 1 -p http://@var{site}/1.html
+@end example
+
+will cause @file{1.html}, @file{1.gif}, @file{2.html}, and @file{2.gif}
+to be downloaded. One might think that:
+
+@example
+wget -r -l 0 -p http://@var{site}/1.html
+@end example
+
+would download just @file{1.html} and @file{1.gif}, but unfortunately
+this is not the case, because @samp{-l 0} is equivalent to @samp{-l inf}
+-- that is, infinite recursion.
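+
+In other words, the following two commands should behave identically
+(@var{site} being a placeholder):
+
+@example
+wget -r -l 0 -p http://@var{site}/1.html
+wget -r -l inf -p http://@var{site}/1.html
+@end example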
+
+To download a single HTML page (or a handful of them, all specified on
+the commandline or in a @samp{-i} @sc{url} input file) and its
+requisites, simply leave off @samp{-r} and @samp{-l}:
+
+@example
+wget -p http://@var{site}/1.html
+@end example
+
+Note that wget will behave as if @samp{-r} had been specified, but only
+that single page and its requisites will be downloaded. Links from that
+page to external documents will not be followed. Actually, to download
+a single page and all its requisites (even if they exist on separate
+websites), and make sure the lot displays properly locally, this author
+likes to use a few options in addition to @samp{-p}:
+
+@example
+wget -E -H -k -K -nh -p http://@var{site}/@var{document}
+@end example
+
+To finish off this topic, it's worth knowing that wget's idea of an
+external document link is any URL specified in an @code{<A>} tag, an
+@code{<AREA>} tag, or a @code{<LINK>} tag other than
+@code{<LINK REL="stylesheet">}.
@end table

@node Recursive Accept/Reject Options, , Recursive Retrieval Options, Invoking
@@ -952,15 +1073,22 @@ comma-separated @var{list} with this option.
@itemx --ignore-tags=@var{list}
This is the opposite of the @samp{--follow-tags} option. To skip
certain HTML tags when recursively looking for documents to download,
-specify them in a comma-separated @var{list}. The author of this option
-likes to use the following command to download a single HTML page and
-all files (e.g. images, sounds, and stylesheets) necessary to display it
-properly:
+specify them in a comma-separated @var{list}.
+
+In the past, the @samp{-G} option was the best bet for downloading a
+single page and its requisites, using a commandline like:

@example
wget -Ga,area -H -k -K -nh -r http://@var{site}/@var{document}
@end example

+However, the author of this option came across a page with tags like
+@code{<LINK REL="home" HREF="/">} and came to the realization that
+@samp{-G} was not enough. One can't just tell wget to ignore
+@code{<LINK>}, because then stylesheets will not be downloaded. Now the
+best bet for downloading a single page and its requisites is the
+dedicated @samp{--page-requisites} option.
+
@item -H
@itemx --span-hosts
Enable spanning across hosts when doing recursive retrieving (@xref{All
@@ -1579,16 +1707,22 @@ reject =
@section Wgetrc Commands
@cindex wgetrc commands

-The complete set of commands is listed below, the letter after @samp{=}
-denoting the value the command takes. It is @samp{on/off} for @samp{on}
-or @samp{off} (which can also be @samp{1} or @samp{0}), @var{string} for
-any non-empty string or @var{n} for a positive integer. For example,
-you may specify @samp{use_proxy = off} to disable use of proxy servers
-by default. You may use @samp{inf} for infinite values, where
-appropriate.
+The complete set of commands is listed below. Legal values are listed
+after the @samp{=}. Simple Boolean values can be set or unset using
+@samp{on} and @samp{off} or @samp{1} and @samp{0}. A fancier kind of
+Boolean allowed in some cases is the "lockable" Boolean, which may be
+set to @samp{on}, @samp{off}, @samp{always}, or @samp{never}. If an
+option is set to @samp{always} or @samp{never}, that value will be
+locked in for the duration of the wget invocation -- commandline options
+will not override.
+
+Some commands take pseudo-arbitrary values. @var{address} values can be
+hostnames or dotted-quad IP addresses. @var{n} can be any positive
+integer, or @samp{inf} for infinity, where appropriate. @var{string}
+values can be any non-empty string.
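+
+For illustration, a @file{.wgetrc} fragment along these lines exercises
+each kind of value (the particular settings are only examples):
+
+@example
+# lockable Boolean: on, off, always, or never
+passive_ftp = never
+# positive integer, or inf where appropriate
+tries = inf
+# address: a hostname or a dotted-quad IP address
+bind_address = 192.168.0.15
+# non-empty string
+header = Accept-Language: en
+@end example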
-Most of the commands have their equivalent command-line option
-(@xref{Invoking}), except some more obscure or rarely used ones.
+Most of these commands have commandline equivalents (@xref{Invoking}),
+though some of the more obscure or rarely used ones do not.

@table @asis
@item accept/reject = @var{string}
@@ -1617,6 +1751,9 @@ Consider relative @sc{url}s in @sc{url} input files forced to be
interpreted as @sc{html} as being relative to @var{string} -- the same
as @samp{-B}.

+@item bind_address = @var{address}
+Bind to @var{address}, like the @samp{--bind-address} option.
+
@item cache = on/off
When set to off, disallow server-caching. See the @samp{-C} option.

@@ -1688,6 +1825,10 @@ Turn globbing on/off -- the same as @samp{-g}.

@item header = @var{string}
Define an additional header, like @samp{--header}.

+@item html_extension = on/off
+Add a @samp{.html} extension to @samp{text/html} files without it, like
+@samp{-E}.
+
@item http_passwd = @var{string}
Set @sc{http} password.

@@ -1746,20 +1887,30 @@ proxy loading, instead of the one specified in environment.

@item output_document = @var{string}
Set the output filename -- the same as @samp{-O}.

-@item passive_ftp = on/off
-Set passive @sc{ftp} -- the same as @samp{--passive-ftp}.
+@item page_requisites = on/off
+Download all ancillary documents necessary for a single HTML page to
+display properly -- the same as @samp{-p}.
+
+@item passive_ftp = on/off/always/never
+Set passive @sc{ftp} -- the same as @samp{--passive-ftp}. Some scripts
+and @samp{.pm} (Perl module) files download files using @samp{wget
+--passive-ftp}. If your firewall does not allow this, you can set
+@samp{passive_ftp = never} to override the commandline.

@item passwd = @var{string}
Set your @sc{ftp} password to @var{string}. Without this setting, the
password defaults to @samp{username@@hostname.domainname}.

@item proxy_user = @var{string}
-Set proxy authentication user name to @var{string}, like
-@samp{--proxy-user}.
+Set proxy authentication user name to @var{string}, like @samp{--proxy-user}.

@item proxy_passwd = @var{string}
-Set proxy authentication password to @var{string}, like
-@samp{--proxy-passwd}.
+Set proxy authentication password to @var{string}, like @samp{--proxy-passwd}.
+
+@item referer = @var{string}
+Set HTTP @samp{Referer:} header just like @samp{--referer}. (Note it
+was the folks who wrote the @sc{http} spec who got the spelling of
+"referrer" wrong.)

@item quiet = on/off
Quiet mode -- the same as @samp{-q}.

@@ -2551,10 +2702,10 @@ me).

@cindex contributors
@iftex
-GNU Wget was written by Hrvoje Nik@v{s}i@'{c} @email{hniksic@@iskon.hr}.
+GNU Wget was written by Hrvoje Nik@v{s}i@'{c} @email{hniksic@@arsdigita.com}.
@end iftex
@ifinfo
-GNU Wget was written by Hrvoje Niksic @email{hniksic@@iskon.hr}.
+GNU Wget was written by Hrvoje Niksic @email{hniksic@@arsdigita.com}.
@end ifinfo
However, its development could never have gone as far as it has, were it
not for the help of many people, either with bug reports, feature