X-Git-Url: http://sjero.net/git/?a=blobdiff_plain;f=doc%2Fwget.info-1;h=75f49368b3e453c2bee026642784d7e0927d0286;hb=1396b300550d9b5ca3b4b3e4e3776a1456024f73;hp=2f6af01f87ddf456bde35cdd2245d9a156bb605b;hpb=4454f6ce0a4ffde97887adf2abb36833924124fe;p=wget

diff --git a/doc/wget.info-1 b/doc/wget.info-1
index 2f6af01f..75f49368 100644
--- a/doc/wget.info-1
+++ b/doc/wget.info-1
@@ -336,12 +336,23 @@ Logging and Input File Options
     HTML files on your local disk, by adding `<base href="URL">' to
     HTML, or using the `--base' command-line option.
 
+`-B URL'
+`--base=URL'
+     When used in conjunction with `-F', prepends URL to relative links
+     in the file specified by `-i'.
+
 
 File: wget.info,  Node: Download Options,  Next: Directory Options,  Prev: Logging and Input File Options,  Up: Invoking
 
 Download Options
 ================
 
+`--bind-address=ADDRESS'
+     When making client TCP/IP connections, `bind()' to ADDRESS on the
+     local machine.  ADDRESS may be specified as a hostname or IP
+     address.  This option can be useful if your machine is bound to
+     multiple IPs.
+
 `-t NUMBER'
 `--tries=NUMBER'
     Set number of retries to NUMBER.  Specify 0 or `inf' for infinite
     retrying.
@@ -357,12 +368,37 @@ Download Options
 
 `-nc'
 `--no-clobber'
-     Do not clobber existing files when saving to directory hierarchy
-     within recursive retrieval of several files.  This option is
-     *extremely* useful when you wish to continue where you left off
-     with retrieval of many files.  If the files have the `.html' or
-     (yuck) `.htm' suffix, they will be loaded from the local disk, and
-     parsed as if they have been retrieved from the Web.
+     If a file is downloaded more than once in the same directory,
+     wget's behavior depends on a few options, including `-nc'.  In
+     certain cases, the local file will be "clobbered", or overwritten,
+     upon repeated download.  In other cases it will be preserved.
+
+     When running wget without `-N', `-nc', or `-r', downloading the
+     same file in the same directory will result in the original copy
+     of `FILE' being preserved and the second copy being named
+     `FILE.1'.  If that file is downloaded yet again, the third copy
+     will be named `FILE.2', and so on.  When `-nc' is specified, this
+     behavior is suppressed, and wget will refuse to download newer
+     copies of `FILE'.  Therefore, "no-clobber" is actually a misnomer
+     in this mode - it's not clobbering that's prevented (as the
+     numeric suffixes were already preventing clobbering), but rather
+     the multiple version saving that's prevented.
+
+     When running wget with `-r', but without `-N' or `-nc',
+     re-downloading a file will result in the new copy simply
+     overwriting the old.  Adding `-nc' will prevent this behavior,
+     instead causing the original version to be preserved and any newer
+     copies on the server to be ignored.
+
+     When running wget with `-N', with or without `-r', the decision as
+     to whether or not to download a newer copy of a file depends on
+     the local and remote timestamp and size of the file (*Note
+     Time-Stamping::).  `-nc' may not be specified at the same time as
+     `-N'.
+
+     Note that when `-nc' is specified, files with the suffixes `.html'
+     or (yuck) `.htm' will be loaded from the local disk and parsed as
+     if they had been retrieved from the Web.
 
 `-c'
 `--continue'
@@ -388,6 +424,10 @@ Download Options
     remote file to `ls-lR.Z.1'.  The `-c' option is also applicable
     for HTTP servers that support the `Range' header.
 
+     Note that if you use `-c' on a file that's already downloaded
+     completely, `FILE' will not be changed, nor will a second `FILE.1'
+     copy be created.
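     To make the interplay of these options concrete, here is a small
     sketch (the host name is a placeholder, not taken from this
     manual): resuming a partially transferred `ls-lR.Z' while allowing
     unlimited retries could be done with

          wget -c -t 0 ftp://HOST/ls-lR.Z

     and, per the note above, re-running the same command once the file
     is already complete leaves the local copy untouched.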
+ `--dot-style=STYLE' Set the retrieval style to STYLE. Wget traces the retrieval of each document by printing dots on the screen, each dot @@ -450,6 +490,18 @@ Download Options to reasonably expect the network error to be fixed before the retry. +`--waitretry=SECONDS' + If you don't want Wget to wait between *every* retrieval, but only + between retries of failed downloads, you can use this option. + Wget will use "linear backoff", waiting 1 second after the first + failure on a given file, then waiting 2 seconds after the second + failure on that file, up to the maximum number of SECONDS you + specify. Therefore, a value of 10 will actually make Wget wait up + to (1 + 2 + ... + 10) = 55 seconds per file. + + Note that this option is turned on by default in the global + `wgetrc' file. + `-Y on/off' `--proxy=on/off' Turn proxy support on or off. The proxy is on by default if the @@ -539,6 +591,26 @@ File: wget.info, Node: HTTP Options, Next: FTP Options, Prev: Directory Optio HTTP Options ============ +`-E' +`--html-extension' + If a file of type `text/html' is downloaded and the URL does not + end with the regexp "\.[Hh][Tt][Mm][Ll]?", this option will cause + the suffix `.html' to be appended to the local filename. This is + useful, for instance, when you're mirroring a remote site that uses + `.asp' pages, but you want the mirrored pages to be viewable on + your stock Apache server. Another good use for this is when you're + downloading the output of CGIs. A URL like + `http://site.com/article.cgi?25' will be saved as + `article.cgi?25.html'. + + Note that filenames changed in this way will be re-downloaded + every time you re-mirror a site, because wget can't tell that the + local `X.html' file corresponds to remote URL `X' (since it + doesn't yet know that the URL produces output of type `text/html'. + To prevent this re-downloading, you must use `-k' and `-K' so + that the original version of the file will be saved as `X.orig' + (*Note Recursive Retrieval Options::). + `--http-user=USER' `--http-passwd=PASSWORD' Specify the username USER and password PASSWORD on an HTTP server. @@ -592,6 +664,13 @@ HTTP Options on a proxy server. Wget will encode them using the `basic' authentication scheme. +`--referer=URL' + Include `Referer: URL' header in HTTP request. Useful for + retrieving documents with server-side processing that assume they + are always being retrieved by interactive web browsers and only + come out properly when Referer is set to one of the pages that + point to them. + `-s' `--save-headers' Save the headers sent by the HTTP server to the file, preceding the @@ -616,11 +695,6 @@ HTTP Options this option is discouraged, unless you really know what you are doing. - *NOTE* that Netscape Communications Corp. has claimed that false - transmissions of `Mozilla' as the `User-Agent' are a copyright - infringement, which will be prosecuted. *DO NOT* misrepresent - Wget as Mozilla. -  File: wget.info, Node: FTP Options, Next: Recursive Retrieval Options, Prev: HTTP Options, Up: Invoking @@ -628,8 +702,23 @@ FTP Options =========== `--retr-symlinks' - Retrieve symbolic links on FTP sites as if they were plain files, - i.e. don't just create links locally. + Usually, when retrieving FTP directories recursively and a symbolic + link is encountered, the linked-to file is not downloaded. + Instead, a matching symbolic link is created on the local + filesystem. 
The pointed-to file will not be downloaded unless
+     this recursive retrieval would have encountered it separately and
+     downloaded it anyway.
+
+     When `--retr-symlinks' is specified, however, symbolic links are
+     traversed and the pointed-to files are retrieved.  At this time,
+     this option does not cause wget to traverse symlinks to
+     directories and recurse through them, but in the future it should
+     be enhanced to do this.
+
+     Note that when retrieving a file (not a directory) because it was
+     specified on the commandline, rather than because it was recursed
+     to, this option has no effect.  Symbolic links are always
+     traversed in this case.
 
 `-g on/off'
 `--glob=on/off'
@@ -673,13 +762,19 @@ Recursive Retrieval Options
 `--delete-after'
     This option tells Wget to delete every single file it downloads,
     *after* having done so.  It is useful for pre-fetching popular
-     pages through proxy, e.g.:
+     pages through a proxy, e.g.:
 
          wget -r -nd --delete-after http://whatever.com/~popular/page/
 
-     The `-r' option is to retrieve recursively, and `-nd' not to
+     The `-r' option is to retrieve recursively, and `-nd' to not
     create directories.
 
+     Note that `--delete-after' deletes files on the local machine.  It
+     does not issue the `DELE' command to remote FTP sites, for
+     instance.  Also note that when `--delete-after' is specified,
+     `--convert-links' is ignored, so `.orig' files are simply not
+     created in the first place.
+
 `-k'
 `--convert-links'
    Convert the non-relative links to relative ones locally.  Only the
@@ -711,6 +806,70 @@ Recursive Retrieval Options
    useful to access the full remote file list when running a mirror,
    or for debugging purposes.
 
+`-p'
+`--page-requisites'
+     This option causes wget to download all the files that are
+     necessary to properly display a given HTML page.  This includes
+     such things as inlined images, sounds, and referenced stylesheets.
+
+     Ordinarily, when downloading a single HTML page, any requisite
+     documents that may be needed to display it properly are not
+     downloaded.  Using `-r' together with `-l' can help, but since
+     wget does not ordinarily distinguish between external and inlined
+     documents, one is generally left with "leaf documents" that are
+     missing their requisites.
+
+     For instance, say document `1.html' contains an `<IMG>' tag
+     referencing `1.gif' and an `<A HREF>' tag pointing to external
+     document `2.html'.  Say that `2.html' is the same but that its
+     image is `2.gif' and it links to `3.html'.  Say this continues up
+     to some arbitrarily high number.
+
+     If one executes the command:
+
+          wget -r -l 2 http://SITE/1.html
+
+     then `1.html', `1.gif', `2.html', `2.gif', and `3.html' will be
+     downloaded.  As you can see, `3.html' is without its requisite
+     `3.gif' because wget is simply counting the number of hops (up to
+     2) away from `1.html' in order to determine where to stop the
+     recursion.  However, with this command:
+
+          wget -r -l 2 -p http://SITE/1.html
+
+     all the above files *and* `3.html''s requisite `3.gif' will be
+     downloaded.  Similarly,
+
+          wget -r -l 1 -p http://SITE/1.html
+
+     will cause `1.html', `1.gif', `2.html', and `2.gif' to be
+     downloaded.  One might think that:
+
+          wget -r -l 0 -p http://SITE/1.html
+
+     would download just `1.html' and `1.gif', but unfortunately this
+     is not the case, because `-l 0' is equivalent to `-l inf' - that
+     is, infinite recursion.
+     To download a single HTML page (or a handful of them, all
+     specified on the commandline or in a `-i' URL input file) and its
+     requisites, simply leave off `-r' and `-l':
+
+          wget -p http://SITE/1.html
+
+     Note that wget will behave as if `-r' had been specified, but only
+     that single page and its requisites will be downloaded.  Links
+     from that page to external documents will not be followed.
+     Actually, to download a single page and all its requisites (even
+     if they exist on separate websites), and make sure the lot
+     displays properly locally, this author likes to use a few options
+     in addition to `-p':
+
+          wget -E -H -k -K -nh -p http://SITE/DOCUMENT
+
+     To finish off this topic, it's worth knowing that wget's idea of
+     an external document link is any URL specified in an `<A>' tag, an
+     `<AREA>' tag, or a `<LINK>' tag other than
+     `<LINK REL="stylesheet">'.
+
 
 File: wget.info,  Node: Recursive Accept/Reject Options,  Prev: Recursive Retrieval Options,  Up: Invoking
 
@@ -748,13 +907,20 @@ Recursive Accept/Reject Options
 `--ignore-tags=LIST'
    This is the opposite of the `--follow-tags' option.  To skip
    certain HTML tags when recursively looking for documents to
-     download, specify them in a comma-separated LIST.  The author of
-     this option likes to use the following command to download a
-     single HTML page and all documents necessary to display it
-     properly:
+     download, specify them in a comma-separated LIST.
+
+     In the past, the `-G' option was the best bet for downloading a
+     single page and its requisites, using a commandline like:
 
          wget -Ga,area -H -k -K -nh -r http://SITE/DOCUMENT
 
+     However, the author of this option came across a page with tags
+     like `<LINK REL="home" HREF="/">' and came to the realization that
+     `-G' was not enough.  One can't just tell wget to ignore `<LINK>',
+     because then stylesheets will not be downloaded.  Now the best bet
+     for downloading a single page and its requisites is the dedicated
+     `--page-requisites' option.
+
 `-H'
 `--span-hosts'
    Enable spanning across hosts when doing recursive retrieving
@@ -1029,236 +1195,3 @@ name, but *not* the PostScript files.
 files; Wget must load all the HTMLs to know where to go at
 all--recursive retrieval would make no sense otherwise.
 
-
-File: wget.info,  Node: Directory-Based Limits,  Next: FTP Links,  Prev: Types of Files,  Up: Following Links
-
-Directory-Based Limits
-======================
-
-   Regardless of other link-following facilities, it is often useful to
-place the restriction of what files to retrieve based on the directories
-those files are placed in.  There can be many reasons for this--the
-home pages may be organized in a reasonable directory structure; or some
-directories may contain useless information, e.g. `/cgi-bin' or `/dev'
-directories.
-
-   Wget offers three different options to deal with this requirement.
-Each option description lists a short name, a long name, and the
-equivalent command in `.wgetrc'.
-
-`-I LIST'
-`--include LIST'
-`include_directories = LIST'
-     `-I' option accepts a comma-separated list of directories included
-     in the retrieval.  Any other directories will simply be ignored.
-     The directories are absolute paths.
-
-     So, if you wish to download from `http://host/people/bozo/'
-     following only links to bozo's colleagues in the `/people'
-     directory and the bogus scripts in `/cgi-bin', you can specify:
-
-          wget -I /people,/cgi-bin http://host/people/bozo/
-
-`-X LIST'
-`--exclude LIST'
-`exclude_directories = LIST'
-     `-X' option is exactly the reverse of `-I'--this is a list of
-     directories *excluded* from the download.  E.g.
if you do not want - Wget to download things from `/cgi-bin' directory, specify `-X - /cgi-bin' on the command line. - - The same as with `-A'/`-R', these two options can be combined to - get a better fine-tuning of downloading subdirectories. E.g. if - you want to load all the files from `/pub' hierarchy except for - `/pub/worthless', specify `-I/pub -X/pub/worthless'. - -`-np' -`--no-parent' -`no_parent = on' - The simplest, and often very useful way of limiting directories is - disallowing retrieval of the links that refer to the hierarchy - "above" than the beginning directory, i.e. disallowing ascent to - the parent directory/directories. - - The `--no-parent' option (short `-np') is useful in this case. - Using it guarantees that you will never leave the existing - hierarchy. Supposing you issue Wget with: - - wget -r --no-parent http://somehost/~luzer/my-archive/ - - You may rest assured that none of the references to - `/~his-girls-homepage/' or `/~luzer/all-my-mpegs/' will be - followed. Only the archive you are interested in will be - downloaded. Essentially, `--no-parent' is similar to - `-I/~luzer/my-archive', only it handles redirections in a more - intelligent fashion. - - -File: wget.info, Node: FTP Links, Prev: Directory-Based Limits, Up: Following Links - -Following FTP Links -=================== - - The rules for FTP are somewhat specific, as it is necessary for them -to be. FTP links in HTML documents are often included for purposes of -reference, and it is often inconvenient to download them by default. - - To have FTP links followed from HTML documents, you need to specify -the `--follow-ftp' option. Having done that, FTP links will span hosts -regardless of `-H' setting. This is logical, as FTP links rarely point -to the same host where the HTTP server resides. For similar reasons, -the `-L' options has no effect on such downloads. On the other hand, -domain acceptance (`-D') and suffix rules (`-A' and `-R') apply -normally. - - Also note that followed links to FTP directories will not be -retrieved recursively further. - - -File: wget.info, Node: Time-Stamping, Next: Startup File, Prev: Following Links, Up: Top - -Time-Stamping -************* - - One of the most important aspects of mirroring information from the -Internet is updating your archives. - - Downloading the whole archive again and again, just to replace a few -changed files is expensive, both in terms of wasted bandwidth and money, -and the time to do the update. This is why all the mirroring tools -offer the option of incremental updating. - - Such an updating mechanism means that the remote server is scanned in -search of "new" files. Only those new files will be downloaded in the -place of the old ones. - - A file is considered new if one of these two conditions are met: - - 1. A file of that name does not already exist locally. - - 2. A file of that name does exist, but the remote file was modified - more recently than the local file. - - To implement this, the program needs to be aware of the time of last -modification of both remote and local files. Such information are -called the "time-stamps". - - The time-stamping in GNU Wget is turned on using `--timestamping' -(`-N') option, or through `timestamping = on' directive in `.wgetrc'. -With this option, for each file it intends to download, Wget will check -whether a local file of the same name exists. If it does, and the -remote file is older, Wget will not download it. 
- - If the local file does not exist, or the sizes of the files do not -match, Wget will download the remote file no matter what the time-stamps -say. - -* Menu: - -* Time-Stamping Usage:: -* HTTP Time-Stamping Internals:: -* FTP Time-Stamping Internals:: - - -File: wget.info, Node: Time-Stamping Usage, Next: HTTP Time-Stamping Internals, Prev: Time-Stamping, Up: Time-Stamping - -Time-Stamping Usage -=================== - - The usage of time-stamping is simple. Say you would like to -download a file so that it keeps its date of modification. - - wget -S http://www.gnu.ai.mit.edu/ - - A simple `ls -l' shows that the time stamp on the local file equals -the state of the `Last-Modified' header, as returned by the server. As -you can see, the time-stamping info is preserved locally, even without -`-N'. - - Several days later, you would like Wget to check if the remote file -has changed, and download it if it has. - - wget -N http://www.gnu.ai.mit.edu/ - - Wget will ask the server for the last-modified date. If the local -file is newer, the remote file will not be re-fetched. However, if the -remote file is more recent, Wget will proceed fetching it normally. - - The same goes for FTP. For example: - - wget ftp://ftp.ifi.uio.no/pub/emacs/gnus/* - - `ls' will show that the timestamps are set according to the state on -the remote server. Reissuing the command with `-N' will make Wget -re-fetch *only* the files that have been modified. - - In both HTTP and FTP retrieval Wget will time-stamp the local file -correctly (with or without `-N') if it gets the stamps, i.e. gets the -directory listing for FTP or the `Last-Modified' header for HTTP. - - If you wished to mirror the GNU archive every week, you would use the -following command every week: - - wget --timestamping -r ftp://prep.ai.mit.edu/pub/gnu/ - - -File: wget.info, Node: HTTP Time-Stamping Internals, Next: FTP Time-Stamping Internals, Prev: Time-Stamping Usage, Up: Time-Stamping - -HTTP Time-Stamping Internals -============================ - - Time-stamping in HTTP is implemented by checking of the -`Last-Modified' header. If you wish to retrieve the file `foo.html' -through HTTP, Wget will check whether `foo.html' exists locally. If it -doesn't, `foo.html' will be retrieved unconditionally. - - If the file does exist locally, Wget will first check its local -time-stamp (similar to the way `ls -l' checks it), and then send a -`HEAD' request to the remote server, demanding the information on the -remote file. - - The `Last-Modified' header is examined to find which file was -modified more recently (which makes it "newer"). If the remote file is -newer, it will be downloaded; if it is older, Wget will give up.(1) - - When `--backup-converted' (`-K') is specified in conjunction with -`-N', server file `X' is compared to local file `X.orig', if extant, -rather than being compared to local file `X', which will always differ -if it's been converted by `--convert-links' (`-k'). - - Arguably, HTTP time-stamping should be implemented using the -`If-Modified-Since' request. - - ---------- Footnotes ---------- - - (1) As an additional check, Wget will look at the `Content-Length' -header, and compare the sizes; if they are not the same, the remote -file will be downloaded no matter what the time-stamp says. 
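   As a sketch of how the options described above combine (HOST is a
placeholder, not taken from this manual), a mirror that keeps
time-stamping meaningful while still converting links locally might be
run as

     wget -N -r -k -K http://HOST/

where `-K' preserves the pristine `X.orig' copies, so the `-N'
comparison against the server is not defeated by the local `-k' link
conversion.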
- - -File: wget.info, Node: FTP Time-Stamping Internals, Prev: HTTP Time-Stamping Internals, Up: Time-Stamping - -FTP Time-Stamping Internals -=========================== - - In theory, FTP time-stamping works much the same as HTTP, only FTP -has no headers--time-stamps must be received from the directory -listings. - - For each directory files must be retrieved from, Wget will use the -`LIST' command to get the listing. It will try to analyze the listing, -assuming that it is a Unix `ls -l' listing, and extract the -time-stamps. The rest is exactly the same as for HTTP. - - Assumption that every directory listing is a Unix-style listing may -sound extremely constraining, but in practice it is not, as many -non-Unix FTP servers use the Unixoid listing format because most (all?) -of the clients understand it. Bear in mind that RFC959 defines no -standard way to get a file list, let alone the time-stamps. We can -only hope that a future standard will define this. - - Another non-standard solution includes the use of `MDTM' command -that is supported by some FTP servers (including the popular -`wu-ftpd'), which returns the exact time of the specified file. Wget -may support this command in the future. -
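   For reference, the kind of listing line Wget tries to parse looks
like this hypothetical entry (not taken from any real server):

     -rw-r--r--   1 ftp      ftp        346721 Jan 12  1999 SOMEFILE.tar.gz

The month, day, and year (or time-of-day) fields are what supply the
time-stamp used in the comparison described above.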