\1f
File: wget.info, Node: Directory-Based Limits, Next: FTP Links, Prev: Types of Files, Up: Following Links

Directory-Based Limits
======================

   Regardless of other link-following facilities, it is often useful to
restrict the files to be retrieved based on the directories those files
are placed in.  There can be many reasons for this--the home pages may
be organized in a reasonable directory structure, or some directories
may contain useless information, e.g. `/cgi-bin' or `/dev' directories.

   Wget offers three different options to deal with this requirement.
Each option description lists a short name, a long name, and the
equivalent command in `.wgetrc'.

`-I LIST'
`--include LIST'
`include_directories = LIST'
     The `-I' option accepts a comma-separated list of directories to
     be included in the retrieval.  Any other directories will simply
     be ignored.  The directories are absolute paths.

     So, if you wish to download from `http://host/people/bozo/',
     following only links to bozo's colleagues in the `/people'
     directory and the bogus scripts in `/cgi-bin', you can specify:

          wget -I /people,/cgi-bin http://host/people/bozo/

`-X LIST'
`--exclude LIST'
`exclude_directories = LIST'
     The `-X' option is exactly the reverse of `-I'--it is a list of
     directories *excluded* from the download.  E.g. if you do not
     want Wget to download things from the `/cgi-bin' directory,
     specify `-X /cgi-bin' on the command line.

     As with `-A'/`-R', these two options can be combined to fine-tune
     the downloading of subdirectories.  E.g. if you want to load all
     the files from the `/pub' hierarchy except for `/pub/worthless',
     specify `-I/pub -X/pub/worthless'.

`-np'
`--no-parent'
`no_parent = on'
     The simplest, and often very useful, way of limiting directories
     is disallowing retrieval of links that refer to the hierarchy
     "above" the beginning directory, i.e. disallowing ascent to the
     parent directory or directories.

     The `--no-parent' option (short `-np') is useful in this case.
     Using it guarantees that you will never leave the existing
     hierarchy.  Supposing you issue Wget with:

          wget -r --no-parent http://somehost/~luzer/my-archive/

     You may rest assured that none of the references to
     `/~his-girls-homepage/' or `/~luzer/all-my-mpegs/' will be
     followed.  Only the archive you are interested in will be
     downloaded.  Essentially, `--no-parent' is similar to
     `-I/~luzer/my-archive', only it handles redirections in a more
     intelligent fashion.

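   The three directory-based limits above can be sketched together as
follows.  This is a hypothetical illustration, not Wget's actual C
implementation; the function name and the prefix-matching rules are
simplified assumptions (real Wget, for instance, resolves overlapping
`-I'/`-X' entries more carefully than this sketch does):

```python
# Hypothetical sketch of -I, -X and --no-parent path filtering
# (assumed names and simplified rules, not Wget's actual code).

def accept_path(path, include=None, exclude=None, no_parent_root=None):
    """Return True if a file at absolute `path' should be retrieved.

    include/exclude model the -I/-X lists of absolute directory
    prefixes; no_parent_root models --no-parent as the start directory.
    """
    def under(prefix):
        prefix = prefix.rstrip("/")
        return path == prefix or path.startswith(prefix + "/")

    if no_parent_root and not under(no_parent_root):
        return False                     # would ascend above the start
    if include and not any(under(p) for p in include):
        return False                     # not inside any -I directory
    if exclude and any(under(p) for p in exclude):
        return False                     # inside an -X directory
    return True

# -I/pub -X/pub/worthless: keep /pub except /pub/worthless
print(accept_path("/pub/gnu/wget.tar.gz",
                  include=["/pub"], exclude=["/pub/worthless"]))   # True
print(accept_path("/pub/worthless/junk",
                  include=["/pub"], exclude=["/pub/worthless"]))   # False
# --no-parent starting at /~luzer/my-archive/
print(accept_path("/~luzer/all-my-mpegs/x.mpg",
                  no_parent_root="/~luzer/my-archive/"))           # False
```
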
\1f
File: wget.info, Node: FTP Links, Prev: Directory-Based Limits, Up: Following Links

Following FTP Links
===================

   The rules for FTP are somewhat specific, as they necessarily have
to be.  FTP links in HTML documents are often included for purposes of
reference, and it is often inconvenient to download them by default.

   To have FTP links followed from HTML documents, you need to specify
the `--follow-ftp' option.  Having done that, FTP links will span
hosts regardless of the `-H' setting.  This is logical, as FTP links
rarely point to the same host where the HTTP server resides.  For
similar reasons, the `-L' option has no effect on such downloads.  On
the other hand, domain acceptance (`-D') and suffix rules (`-A' and
`-R') apply normally.

   Also note that followed links to FTP directories will not be
retrieved recursively further.

\1f
File: wget.info, Node: Time-Stamping, Next: Startup File, Prev: Following Links, Up: Top

Time-Stamping
*************

   One of the most important aspects of mirroring information from the
Internet is updating your archives.

   Downloading the whole archive again and again, just to replace a
few changed files, is expensive, both in terms of wasted bandwidth and
money, and in terms of the time needed to do the update.  This is why
all the mirroring tools offer the option of incremental updating.

   Such an updating mechanism means that the remote server is scanned
in search of "new" files.  Only those new files will be downloaded in
place of the old ones.

   A file is considered new if one of these two conditions is met:

  1. A file of that name does not already exist locally.

  2. A file of that name does exist, but the remote file was modified
     more recently than the local file.

   To implement this, the program needs to be aware of the time of
last modification of both remote and local files.  Such information is
called the "time-stamps".

   Time-stamping in GNU Wget is turned on using the `--timestamping'
(`-N') option, or through the `timestamping = on' directive in
`.wgetrc'.  With this option, for each file it intends to download,
Wget will check whether a local file of the same name exists.  If it
does, and the remote file is older, Wget will not download it.

   If the local file does not exist, or the sizes of the files do not
match, Wget will download the remote file no matter what the
time-stamps say.

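   The decision rule described above can be sketched as follows.  This
is a hedged Python illustration of the stated behavior; the function
name and argument handling are assumptions, not Wget's actual C code:

```python
# Sketch of the -N "is the remote file new?" decision (assumed names,
# not Wget internals): fetch when the local file is missing, when the
# sizes differ, or when the remote time-stamp is strictly newer.

import os

def should_download(local_path, remote_mtime, remote_size):
    """Decide whether to fetch the remote file when -N is in effect.

    remote_mtime and remote_size come from the Last-Modified /
    Content-Length headers (HTTP) or the directory listing (FTP).
    """
    if not os.path.exists(local_path):
        return True                      # no local copy at all
    st = os.stat(local_path)
    if remote_size is not None and st.st_size != remote_size:
        return True                      # sizes differ: re-fetch anyway
    return remote_mtime > st.st_mtime    # remote is strictly newer
```
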
* Menu:

* Time-Stamping Usage::
* HTTP Time-Stamping Internals::
* FTP Time-Stamping Internals::

\1f
File: wget.info, Node: Time-Stamping Usage, Next: HTTP Time-Stamping Internals, Prev: Time-Stamping, Up: Time-Stamping

Time-Stamping Usage
===================

   The usage of time-stamping is simple.  Say you would like to
download a file so that it keeps its date of modification.

     wget -S http://www.gnu.ai.mit.edu/

   A simple `ls -l' shows that the time stamp on the local file equals
the state of the `Last-Modified' header, as returned by the server.
As you can see, the time-stamping info is preserved locally, even
without `-N'.

   Several days later, you would like Wget to check whether the remote
file has changed, and download it if it has.

     wget -N http://www.gnu.ai.mit.edu/

   Wget will ask the server for the last-modified date.  If the local
file is newer, the remote file will not be re-fetched.  However, if
the remote file is more recent, Wget will proceed to fetch it normally.

   The same goes for FTP.  For example:

     wget "ftp://ftp.ifi.uio.no/pub/emacs/gnus/*"

   `ls' will show that the time-stamps are set according to the state
on the remote server.  Reissuing the command with `-N' will make Wget
re-fetch *only* the files that have been modified.

   In both HTTP and FTP retrieval Wget will time-stamp the local file
correctly (with or without `-N') if it gets the stamps, i.e. gets the
directory listing for FTP or the `Last-Modified' header for HTTP.

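   Applying the server's stamp to the local file amounts to parsing
the `Last-Modified' date and setting the file's modification time.  A
minimal Python sketch of that step (an illustration only; Wget's own
implementation is in C, and the function name here is assumed):

```python
# Sketch of preserving a server time-stamp locally: parse the HTTP
# Last-Modified header and apply it to the file with os.utime.
# (Hypothetical helper, not part of Wget.)

import os
from email.utils import parsedate_to_datetime

def stamp_local_file(path, last_modified):
    """Set the local file's mtime from an HTTP Last-Modified value."""
    mtime = parsedate_to_datetime(last_modified).timestamp()
    os.utime(path, (mtime, mtime))       # (atime, mtime)
    return mtime

# Example header value as a server might return it:
# stamp_local_file("index.html", "Wed, 21 Oct 2015 07:28:00 GMT")
```

After this, `ls -l' on the file shows the server's date rather than
the download date, which is exactly the behavior described above.
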
   If you wished to mirror the GNU archive every week, you would use
the following command every week:

     wget --timestamping -r ftp://prep.ai.mit.edu/pub/gnu/

\1f
File: wget.info, Node: HTTP Time-Stamping Internals, Next: FTP Time-Stamping Internals, Prev: Time-Stamping Usage, Up: Time-Stamping

HTTP Time-Stamping Internals
============================

   Time-stamping in HTTP is implemented by checking the
`Last-Modified' header.  If you wish to retrieve the file `foo.html'
through HTTP, Wget will check whether `foo.html' exists locally.  If
it doesn't, `foo.html' will be retrieved unconditionally.

   If the file does exist locally, Wget will first check its local
time-stamp (similar to the way `ls -l' checks it), and then send a
`HEAD' request to the remote server, demanding the information on the
remote file.

   The `Last-Modified' header is examined to find which file was
modified more recently (which makes it "newer").  If the remote file
is newer, it will be downloaded; if it is older, Wget will give up.(1)

   When `--backup-converted' (`-K') is specified in conjunction with
`-N', the server file `X' is compared to the local file `X.orig', if
extant, rather than being compared to the local file `X', which will
always differ if it has been converted by `--convert-links' (`-k').

   Arguably, HTTP time-stamping should be implemented using the
`If-Modified-Since' request.

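   With `If-Modified-Since', the client sends its local time-stamp
with the `GET' itself and the server answers `304 Not Modified' when
the file is unchanged, saving the separate `HEAD' round trip.  A
hedged sketch of that approach using Python's urllib (an illustration
of the idea, not anything Wget does):

```python
# Sketch of a conditional GET via If-Modified-Since (hypothetical
# illustration; Wget uses HEAD + Last-Modified instead, as described).

import urllib.request
from urllib.error import HTTPError

def fetch_if_modified(url, since):
    """Return the body if modified after `since', else None on 304.

    `since' is an HTTP date string, e.g. the local file's stamp.
    """
    req = urllib.request.Request(url)
    req.add_header("If-Modified-Since", since)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()           # changed: new body returned
    except HTTPError as e:
        if e.code == 304:                # Not Modified: keep local copy
            return None
        raise
```
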
   ---------- Footnotes ----------

   (1) As an additional check, Wget will look at the `Content-Length'
header, and compare the sizes; if they are not the same, the remote
file will be downloaded no matter what the time-stamp says.

\1f
File: wget.info, Node: FTP Time-Stamping Internals, Prev: HTTP Time-Stamping Internals, Up: Time-Stamping

FTP Time-Stamping Internals
===========================

   In theory, FTP time-stamping works much the same as HTTP, only FTP
has no headers--time-stamps must be obtained from the directory
listings.

   For each directory from which files are to be retrieved, Wget will
use the `LIST' command to get the listing.  It will try to analyze
the listing, assuming that it is a Unix `ls -l' listing, and extract
the time-stamps.  The rest is exactly the same as for HTTP.

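   The core of that analysis can be sketched as follows.  This is a
deliberately simplified illustration (an assumption about the shape of
the problem, not Wget's actual parser, which handles many more listing
variants and guesses the year for recent files more carefully):

```python
# Simplified sketch of pulling a time-stamp out of one line of a
# Unix-style `ls -l' FTP listing (hypothetical, not Wget's parser).

import time

def listing_mtime(line, year=1998):
    """Parse the 'month day time-or-year' fields of an ls -l line.

    Files less than six months old show "Mon DD HH:MM"; older files
    show "Mon DD  YYYY".  `year' supplies the year in the former case.
    """
    fields = line.split()
    month, day, third = fields[5], fields[6], fields[7]
    if ":" in third:                     # recent file: HH:MM, assumed year
        stamp, fmt = f"{month} {day} {year} {third}", "%b %d %Y %H:%M"
    else:                                # older file: explicit year
        stamp, fmt = f"{month} {day} {third}", "%b %d %Y"
    return time.mktime(time.strptime(stamp, fmt))

line = "-rw-r--r--  1 ftp  ftp  14536 Jul  4  1997 gnus-5.4.tar.gz"
print(time.localtime(listing_mtime(line))[:3])   # (1997, 7, 4)
```

The extracted stamp then feeds the same newer-than comparison used for
HTTP.
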
   The assumption that every directory listing is a Unix-style listing
may sound extremely constraining, but in practice it is not, as many
non-Unix FTP servers use the Unixoid listing format because most (all?)
of the clients understand it.  Bear in mind that RFC 959 defines no
standard way to get a file list, let alone the time-stamps.  We can
only hope that a future standard will define this.

   Another non-standard solution is the use of the `MDTM' command,
which is supported by some FTP servers (including the popular
`wu-ftpd') and returns the exact time of the specified file.  Wget may
support this command in the future.