Published in <sxsherjczw2.fsf@florida.arsdigita.de>.
+2001-11-25 Hrvoje Niksic <hniksic@arsdigita.com>
+
+ * TODO: Ditto.
+
+ * NEWS: Updated with the latest stuff.
+
2001-11-23 Hrvoje Niksic <hniksic@arsdigita.com>
* po/hr.po: A major overhaul.
\f
* Changes in Wget 1.8.
+** "Recursive retrieval" now uses a breadth-first algorithm.
+Recursive downloads are faster and consume *significantly* less memory
+than before.
+
** A new progress indicator is now available. Try it with
--progress=bar or using `progress = bar' in `.wgetrc'.
+** Host directories now contain port information if the URL is at a
+non-standard port.
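+For example, a recursive download from http://host:8000/ is now
+saved under `host:8000/' rather than `host/'.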
+
+** Wget now supports the robots.txt directives specified in
+<http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html>.
+
** URL parser has been fixed, especially the infamous overzealous
quoting bug. Wget no longer dequotes reserved characters, e.g. `%3F'
is no longer translated to `?', nor `%2B' to `+'. Unsafe characters
file, though forcibly disconnecting from the server at the desired endpoint
might be workable).
-* RFC 1738 says that if logging on to an FTP server puts you in a directory
- other than '/', the way to specify a file relative to '/' in a URL (let's use
- "/bin/ls" in this example) is "ftp://host/%2Fbin/ls". Wget needs to support
- this (and ideally not consider "ftp://host//bin/ls" to be equivalent, as that
- would equate to the command "CWD " rather than "CWD /"). To accomodate people
- used to broken FTP clients like Internet Explorer and Netscape, if
- "ftp://host/bin/ls" doesn't exist, Wget should try again (perhaps under
- control of an option), acting as if the user had typed "ftp://host/%2Fbin/ls".
-
* If multiple FTP URLs are specified that are on the same host, Wget should
re-use the connection rather than opening a new one for each file.
* Limit the number of successive redirections to a maximum of 20 or so.
-* If -c used on a file that's already completely downloaded, don't re-download
- it (unless normal --timestamping processing would cause you to do so).
-
* If -c is used with -N, check to make sure a file hasn't changed on the server
before "continuing" to download it (preventing a bogus hybrid file).
-* Take a look at
- <http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html>
- and support the new directives.
-
* Generalize --html-extension to something like --mime-extensions and have it
look at mime.types/mimecap file for preferred extension. Non-HTML files with
filenames changed this way would be re-downloaded each time despite -N unless
turning it off. Get rid of `--foo=no' stuff. Short options would
be handled as `-x' vs. `-nx'.
-* Implement "thermometer" display (not all that hard; use an
- alternative show_progress() if the output goes to a terminal.)
-
* Add option to only list wildcard matches without doing the download.
* Add case-insensitivity as an option.
* Allow time-stamping by arbitrary date.
-* Fix Unix directory parser to allow for spaces in file names.
-
* Allow size limit to files (perhaps with an option to download oversize files
up through the limit or not at all, to get more functionality than [u]limit).
-* Implement breadth-first retrieval.
-
* Download to .in* when mirroring.
* Add an option to delete or move no-longer-existent files when mirroring.
-* Implement a switch to avoid downloading multiple files (e.g. x and x.gz).
-
* Implement uploading (--upload URL?) in FTP and HTTP.
* Rewrite FTP code to allow for easy addition of new commands. It
* Implement a concept of "packages" a la mirror.
-* Implement correct RFC1808 URL parsing.
-
-* Implement more HTTP/1.1 bells and whistles (ETag, Content-MD5 etc.)
-
-* Add a "rollback" option to have --continue throw away a configurable number of
- bytes at the end of a file before resuming download. Apparently, some stupid
- proxies insert a "transfer interrupted" string we need to get rid of.
+* Add a "rollback" option to have continued retrieval throw away a
+ configurable number of bytes at the end of a file before resuming
+ download. Apparently, some stupid proxies insert a "transfer
+ interrupted" string we need to get rid of.
* When using --accept and --reject, you can end up with empty directories. Have
Wget delete any such at the end.
+2001-11-25 Hrvoje Niksic <hniksic@arsdigita.com>
+
+ * url.c (reencode_string): Use unsigned char, not char --
+ otherwise the hex digits come out wrong for 8-bit chars such as
+ nbsp.
+ (lowercase_str): New function.
+ (url_parse): Canonicalize u->url if needed.
+ (get_urls_file): Parse each URL, and return only the valid ones.
+ (free_urlpos): Call url_free.
+ (mkstruct): Add :port if the port is non-standard.
+ (mkstruct): Append the query string to the file name, if any.
+ (urlpath_length): Use strpbrk_or_eos.
+ (uri_merge_1): Handle the cases where LINK is an empty string,
+ where LINK consists only of query, and where LINK consists only of
+ fragment.
+ (convert_links): Count and report both kinds of conversion.
+ (downloaded_file): Use a hash table, not a list.
+ (downloaded_files_free): Free the hash table.
+
+ * retr.c (retrieve_from_file): Ditto.
+
+ * main.c (main): Call either retrieve_url or retrieve_tree
+ for each URL, not both.
+
+ * retr.c (register_all_redirections): New function.
+ (register_redirections_mapper): Ditto.
+ (retrieve_url): Register the redirections.
+ (retrieve_url): Make the string "Error parsing proxy ..."
+ translatable.
+
+ * res.c (add_path): Strip leading slash from robots.txt paths so
+ that the path representations are "compatible".
+ (free_specs): Free each individual path, too.
+ (res_cleanup): New function.
+ (cleanup_hash_table_mapper): Ditto.
+
+ * recur.c (url_queue_new): New function.
+ (url_queue_delete): Ditto.
+ (url_enqueue): Ditto.
+ (url_dequeue): Ditto.
+ (retrieve_tree): New function, replacement for recursive_retrieve.
+ (descend_url_p): New function.
+ (register_redirection): New function.
+
+ * progress.c (create_image): Cosmetic changes.
+
+ * init.c (cleanup): Do all those complex cleanups only if
+ DEBUG_MALLOC is defined.
+
+ * main.c: Removed --simple-check and the corresponding
+ simple_host_check in init.c.
+
+ * html-url.c (handle_link): Parse the URL here, and propagate the
+ parsed URL to the caller, who would otherwise have to parse it
+ again.
+
+ * host.c (xstrdup_lower): Moved to utils.c.
+ (realhost): Removed.
+ (same_host): Ditto.
+
+2001-11-24 Hrvoje Niksic <hniksic@arsdigita.com>
+
+	* utils.c (path_simplify): Preserve the (non-)existence of
+ leading slash. Return non-zero if changes were made.
+
2001-11-24 Hrvoje Niksic <hniksic@arsdigita.com>
* progress.c (bar_update): Don't modify bp->total_length if it is
gnu-md5$o: wget.h gnu-md5.h
mswindows$o: wget.h url.h
netrc$o: wget.h utils.h netrc.h init.h
+progress$o: wget.h progress.h utils.h retr.h
rbuf$o: wget.h rbuf.h connect.h
recur$o: wget.h url.h recur.h utils.h retr.h ftp.h fnmatch.h host.h hash.h
+res$o: wget.h utils.h hash.h url.h retr.h res.h
retr$o: wget.h utils.h retr.h url.h recur.h ftp.h host.h connect.h hash.h
snprintf$o:
safe-ctype$o: safe-ctype.h
#endif
/* Mapping between all known hosts to their addresses (n.n.n.n). */
+
+/* #### We should map to *lists* of IP addresses. */
+
struct hash_table *host_name_address_map;
+/* The following two tables are obsolete, since we no longer do host
+ canonicalization. */
+
/* Mapping between all known addresses (n.n.n.n) to their hosts. This
is the inverse of host_name_address_map. These two tables share
the strdup'ed strings. */
/* Mapping between auxiliary (slave) and master host names. */
struct hash_table *host_slave_master_map;
-/* Utility function: like xstrdup(), but also lowercases S. */
-
-static char *
-xstrdup_lower (const char *s)
-{
- char *copy = xstrdup (s);
- char *p = copy;
- for (; *p; p++)
- *p = TOLOWER (*p);
- return copy;
-}
-
/* The same as gethostbyname, but supports internet addresses of the
form `N.N.N.N'. On some systems gethostbyname() knows how to do
this automatically. */
return 1;
}
-/* Determine the "real" name of HOST, as perceived by Wget. If HOST
- is referenced by more than one name, "real" name is considered to
- be the first one encountered in the past. */
-char *
-realhost (const char *host)
-{
- struct in_addr in;
- struct hostent *hptr;
- char *master_name;
-
- DEBUGP (("Checking for %s in host_name_address_map.\n", host));
- if (hash_table_contains (host_name_address_map, host))
- {
- DEBUGP (("Found; %s was already used, by that name.\n", host));
- return xstrdup_lower (host);
- }
-
- DEBUGP (("Checking for %s in host_slave_master_map.\n", host));
- master_name = hash_table_get (host_slave_master_map, host);
- if (master_name)
- {
- has_master:
- DEBUGP (("Found; %s was already used, by the name %s.\n",
- host, master_name));
- return xstrdup (master_name);
- }
-
- DEBUGP (("First time I hear about %s by that name; looking it up.\n",
- host));
- hptr = ngethostbyname (host);
- if (hptr)
- {
- char *inet_s;
- /* Originally, we copied to in.s_addr, but it appears to be
- missing on some systems. */
- memcpy (&in, *hptr->h_addr_list, sizeof (in));
- inet_s = inet_ntoa (in);
-
- add_host_to_cache (host, inet_s);
-
- /* add_host_to_cache() can establish a slave-master mapping. */
- DEBUGP (("Checking again for %s in host_slave_master_map.\n", host));
- master_name = hash_table_get (host_slave_master_map, host);
- if (master_name)
- goto has_master;
- }
-
- return xstrdup_lower (host);
-}
-
-/* Compare two hostnames (out of URL-s if the arguments are URL-s),
- taking care of aliases. It uses realhost() to determine a unique
- hostname for each of two hosts. If simple_check is non-zero, only
- strcmp() is used for comparison. */
-int
-same_host (const char *u1, const char *u2)
-{
- const char *s;
- char *p1, *p2;
- char *real1, *real2;
-
- /* Skip protocol, if present. */
- u1 += url_skip_scheme (u1);
- u2 += url_skip_scheme (u2);
-
- /* Skip username ans password, if present. */
- u1 += url_skip_uname (u1);
- u2 += url_skip_uname (u2);
-
- for (s = u1; *u1 && *u1 != '/' && *u1 != ':'; u1++);
- p1 = strdupdelim (s, u1);
- for (s = u2; *u2 && *u2 != '/' && *u2 != ':'; u2++);
- p2 = strdupdelim (s, u2);
- DEBUGP (("Comparing hosts %s and %s...\n", p1, p2));
- if (strcasecmp (p1, p2) == 0)
- {
- xfree (p1);
- xfree (p2);
- DEBUGP (("They are quite alike.\n"));
- return 1;
- }
- else if (opt.simple_check)
- {
- xfree (p1);
- xfree (p2);
- DEBUGP (("Since checking is simple, I'd say they are not the same.\n"));
- return 0;
- }
- real1 = realhost (p1);
- real2 = realhost (p2);
- xfree (p1);
- xfree (p2);
- if (strcasecmp (real1, real2) == 0)
- {
- DEBUGP (("They are alike, after realhost()->%s.\n", real1));
- xfree (real1);
- xfree (real2);
- return 1;
- }
- else
- {
- DEBUGP (("They are not the same (%s, %s).\n", real1, real2));
- xfree (real1);
- xfree (real2);
- return 0;
- }
-}
-
/* Determine whether a URL is acceptable to be followed, according to
a list of domains to accept. */
int
}
void
-clean_hosts (void)
+host_cleanup (void)
{
/* host_name_address_map and host_address_name_map share the
strings. Because of that, calling free_keys_and_values once
struct hostent *ngethostbyname PARAMS ((const char *));
int store_hostaddress PARAMS ((unsigned char *, const char *));
-void clean_hosts PARAMS ((void));
+void host_cleanup PARAMS ((void));
-char *realhost PARAMS ((const char *));
-int same_host PARAMS ((const char *, const char *));
int accept_domain PARAMS ((struct url *));
int sufmatch PARAMS ((const char **, const char *));
-char *ftp_getaddress PARAMS ((void));
-
char *herrmsg PARAMS ((int));
#endif /* HOST_H */
char *text; /* HTML text. */
char *base; /* Base URI of the document, possibly
changed through <base href=...>. */
- urlpos *head, *tail; /* List of URLs */
+ struct urlpos *head, *tail; /* List of URLs */
const char *parent_base; /* Base of the current document. */
const char *document_file; /* File name of this document. */
int dash_p_leaf_HTML; /* Whether -p is specified, and this
handle_link (struct collect_urls_closure *closure, const char *link_uri,
struct taginfo *tag, int attrid)
{
- int no_scheme = !url_has_scheme (link_uri);
- urlpos *newel;
-
+ int link_has_scheme = url_has_scheme (link_uri);
+ struct urlpos *newel;
const char *base = closure->base ? closure->base : closure->parent_base;
- char *complete_uri;
-
- char *fragment = strrchr (link_uri, '#');
-
- if (fragment)
- {
- /* Nullify the fragment identifier, i.e. everything after the
- last occurrence of `#', inclusive. This copying is
- relatively inefficient, but it doesn't matter because
- fragment identifiers don't come up all that often. */
- int hashlen = fragment - link_uri;
- char *p = alloca (hashlen + 1);
- memcpy (p, link_uri, hashlen);
- p[hashlen] = '\0';
- link_uri = p;
- }
+ struct url *url;
if (!base)
{
- if (no_scheme)
+ DEBUGP (("%s: no base, merge will use \"%s\".\n",
+ closure->document_file, link_uri));
+
+ if (!link_has_scheme)
{
/* We have no base, and the link does not have a host
attached to it. Nothing we can do. */
/* #### Should we print a warning here? Wget 1.5.x used to. */
return;
}
- else
- complete_uri = xstrdup (link_uri);
+
+ url = url_parse (link_uri, NULL);
+ if (!url)
+ {
+ DEBUGP (("%s: link \"%s\" doesn't parse.\n",
+ closure->document_file, link_uri));
+ return;
+ }
}
else
- complete_uri = uri_merge (base, link_uri);
+ {
+      /* Merge BASE with LINK_URI, but also make sure the result is
+      canonicalized, i.e. that "../" have been resolved.
+      (url_parse will do that for us.) */
+
+ char *complete_uri = uri_merge (base, link_uri);
+
+ DEBUGP (("%s: merge(\"%s\", \"%s\") -> %s\n",
+ closure->document_file, base, link_uri, complete_uri));
- DEBUGP (("%s: merge(\"%s\", \"%s\") -> %s\n",
- closure->document_file, base ? base : "(null)",
- link_uri, complete_uri));
+ url = url_parse (complete_uri, NULL);
+ if (!url)
+ {
+ DEBUGP (("%s: merged link \"%s\" doesn't parse.\n",
+ closure->document_file, complete_uri));
+ xfree (complete_uri);
+ return;
+ }
+ xfree (complete_uri);
+ }
- newel = (urlpos *)xmalloc (sizeof (urlpos));
+ newel = (struct urlpos *)xmalloc (sizeof (struct urlpos));
memset (newel, 0, sizeof (*newel));
newel->next = NULL;
- newel->url = complete_uri;
+ newel->url = url;
newel->pos = tag->attrs[attrid].value_raw_beginning - closure->text;
newel->size = tag->attrs[attrid].value_raw_size;
/* A URL is relative if the host is not named, and the name does not
start with `/'. */
- if (no_scheme && *link_uri != '/')
+ if (!link_has_scheme && *link_uri != '/')
newel->link_relative_p = 1;
- else if (!no_scheme)
+ else if (link_has_scheme)
newel->link_complete_p = 1;
if (closure->tail)
If dash_p_leaf_HTML is non-zero, only the elements needed to render
FILE ("non-external" links) will be returned. */
-urlpos *
+struct urlpos *
get_urls_html (const char *file, const char *this_url, int dash_p_leaf_HTML,
int *meta_disallow_follow)
{
if (((suf = suffix (*hstat.local_file)) != NULL)
&& (!strcmp (suf, "html") || !strcmp (suf, "htm")))
*dt |= TEXTHTML;
- xfree (suf);
+ FREE_MAYBE (suf);
FREE_MAYBE (dummy);
return RETROK;
}
{ "savecookies", &opt.cookies_output, cmd_file },
{ "saveheaders", &opt.save_headers, cmd_boolean },
{ "serverresponse", &opt.server_response, cmd_boolean },
- { "simplehostcheck", &opt.simple_check, cmd_boolean },
{ "spanhosts", &opt.spanhost, cmd_boolean },
{ "spider", &opt.spider, cmd_boolean },
#ifdef HAVE_SSL
}
\f
void cleanup_html_url PARAMS ((void));
+void res_cleanup PARAMS ((void));
void downloaded_files_free PARAMS ((void));
void
cleanup (void)
{
- extern acc_t *netrc_list;
+ /* Free external resources, close files, etc. */
- recursive_cleanup ();
- clean_hosts ();
- free_netrc (netrc_list);
if (opt.dfp)
fclose (opt.dfp);
+
+ /* We're exiting anyway so there's no real need to call free()
+ hundreds of times. Skipping the frees will make Wget exit
+ faster.
+
+ However, when detecting leaks, it's crucial to free() everything
+ because then you can find the real leaks, i.e. the allocated
+ memory which grows with the size of the program. */
+
+#ifdef DEBUG_MALLOC
+ recursive_cleanup ();
+ res_cleanup ();
+ host_cleanup ();
+ {
+ extern acc_t *netrc_list;
+ free_netrc (netrc_list);
+ }
cleanup_html_url ();
downloaded_files_free ();
cookies_cleanup ();
free_vec (opt.domains);
free_vec (opt.follow_tags);
free_vec (opt.ignore_tags);
+ FREE_MAYBE (opt.progress_type);
xfree (opt.ftp_acc);
FREE_MAYBE (opt.ftp_pass);
FREE_MAYBE (opt.ftp_proxy);
FREE_MAYBE (opt.bind_address);
FREE_MAYBE (opt.cookies_input);
FREE_MAYBE (opt.cookies_output);
+#endif
}
case 149:
setval ("removelisting", "off");
break;
- case 150:
- setval ("simplehostcheck", "on");
- break;
case 155:
setval ("bindaddress", optarg);
break;
break;
case 'n':
{
- /* #### The n? options are utter crock! */
+ /* #### What we really want here is --no-foo. */
char *p;
for (p = optarg; *p; p++)
case 'v':
setval ("verbose", "off");
break;
- case 'h':
- setval ("simplehostcheck", "on");
- break;
case 'H':
setval ("addhostdir", "off");
break;
#endif /* HAVE_SIGNAL */
status = RETROK; /* initialize it, just-in-case */
- recursive_reset ();
+ /*recursive_reset ();*/
/* Retrieve the URLs from argument list. */
for (t = url; *t; t++)
{
- char *filename, *redirected_URL;
+ char *filename = NULL, *redirected_URL = NULL;
int dt;
- status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);
- if (opt.recursive && status == RETROK && (dt & TEXTHTML))
- status = recursive_retrieve (filename,
- redirected_URL ? redirected_URL : *t);
+ if (opt.recursive && url_scheme (*t) != SCHEME_FTP)
+ status = retrieve_tree (*t);
+ else
+ status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);
if (opt.delete_after && file_exists_p(filename))
{
int relative_only; /* Follow only relative links. */
int no_parent; /* Restrict access to the parent
directory. */
- int simple_check; /* Should we use simple checking
- (strcmp) or do we create a host
- hash and call gethostbyname? */
int reclevel; /* Maximum level of recursion */
int dirstruct; /* Do we build the directory structure
as we go along? */
# include <strings.h>
#endif /* HAVE_STRING_H */
#include <assert.h>
+#ifdef HAVE_UNISTD_H
+# include <unistd.h>
+#endif
#include "wget.h"
#include "progress.h"
Calculate its geometry:
"xxx% " - percentage - 5 chars
- "| ... | " - progress bar decorations - 3 chars
+ "| ... |" - progress bar decorations - 2 chars
"1012.56 K/s " - dl rate - 12 chars
"nnnn " - downloaded bytes - 11 chars
"ETA: xx:xx:xx" - ETA - 13 chars
"=====>..." - progress bar content - the rest
*/
- int progress_len = screen_width - (5 + 3 + 12 + 11 + 13);
+ int progress_len = screen_width - (5 + 2 + 12 + 11 + 13);
if (progress_len < 7)
progress_len = 0;
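
(Sanity check of the geometry arithmetic above, assuming the usual
80-column screen: 80 - (5 + 2 + 12 + 11 + 13) = 80 - 43 = 37 columns
remain for the "=====>..." bar content.)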
}
else
{
- strcpy (p, "----.-- K/s ");
+      strcpy (p, "  --.-- K/s ");
p += 12;
}
/* Handling of recursive HTTP retrieving.
- Copyright (C) 1995, 1996, 1997, 2000 Free Software Foundation, Inc.
+ Copyright (C) 1995, 1996, 1997, 2000, 2001 Free Software Foundation, Inc.
This file is part of GNU Wget.
static struct hash_table *dl_url_file_map;
/* List of HTML files downloaded in this Wget run. Used for link
- conversion after Wget is done. */
+ conversion after Wget is done. This list should only be traversed
+ in order. If you need to check whether a file has been downloaded,
+ use a hash table, e.g. dl_file_url_map. */
static slist *downloaded_html_files;
+\f
+/* Functions for maintaining the URL queue. */
-/* List of undesirable-to-load URLs. */
-static struct hash_table *undesirable_urls;
+struct queue_element {
+ const char *url;
+ const char *referer;
+ int depth;
+ struct queue_element *next;
+};
-/* Current recursion depth. */
-static int depth;
+struct url_queue {
+ struct queue_element *head;
+ struct queue_element *tail;
+ int count, maxcount;
+};
-/* Base directory we're recursing from (used by no_parent). */
-static char *base_dir;
+/* Create a URL queue. */
-static int first_time = 1;
+static struct url_queue *
+url_queue_new (void)
+{
+ struct url_queue *queue = xmalloc (sizeof (*queue));
+ memset (queue, '\0', sizeof (*queue));
+ return queue;
+}
+/* Delete a URL queue. */
-/* Cleanup the data structures associated with recursive retrieving
- (the variables above). */
-void
-recursive_cleanup (void)
+static void
+url_queue_delete (struct url_queue *queue)
{
- if (undesirable_urls)
- {
- string_set_free (undesirable_urls);
- undesirable_urls = NULL;
- }
- if (dl_file_url_map)
- {
- free_keys_and_values (dl_file_url_map);
- hash_table_destroy (dl_file_url_map);
- dl_file_url_map = NULL;
- }
- if (dl_url_file_map)
- {
- free_keys_and_values (dl_url_file_map);
- hash_table_destroy (dl_url_file_map);
- dl_url_file_map = NULL;
- }
- undesirable_urls = NULL;
- slist_free (downloaded_html_files);
- downloaded_html_files = NULL;
- FREE_MAYBE (base_dir);
- first_time = 1;
+ xfree (queue);
}
-/* Reset FIRST_TIME to 1, so that some action can be taken in
- recursive_retrieve(). */
-void
-recursive_reset (void)
+/* Enqueue a URL in the queue. The queue is FIFO: the items will be
+ retrieved ("dequeued") from the queue in the order they were placed
+ into it. */
+
+static void
+url_enqueue (struct url_queue *queue,
+ const char *url, const char *referer, int depth)
{
- first_time = 1;
+ struct queue_element *qel = xmalloc (sizeof (*qel));
+ qel->url = url;
+ qel->referer = referer;
+ qel->depth = depth;
+ qel->next = NULL;
+
+ ++queue->count;
+ if (queue->count > queue->maxcount)
+ queue->maxcount = queue->count;
+
+ DEBUGP (("Enqueuing %s at depth %d\n", url, depth));
+ DEBUGP (("Queue count %d, maxcount %d.\n", queue->count, queue->maxcount));
+
+ if (queue->tail)
+ queue->tail->next = qel;
+ queue->tail = qel;
+
+ if (!queue->head)
+ queue->head = queue->tail;
}
-/* The core of recursive retrieving. Endless recursion is avoided by
- having all URLs stored to a linked list of URLs, which is checked
- before loading any URL. That way no URL can get loaded twice.
+/* Take a URL out of the queue. Return 1 if this operation succeeded,
+ or 0 if the queue is empty. */
+
+static int
+url_dequeue (struct url_queue *queue,
+ const char **url, const char **referer, int *depth)
+{
+ struct queue_element *qel = queue->head;
+
+ if (!qel)
+ return 0;
+
+ queue->head = queue->head->next;
+ if (!queue->head)
+ queue->tail = NULL;
+
+ *url = qel->url;
+ *referer = qel->referer;
+ *depth = qel->depth;
+
+ --queue->count;
+
+ DEBUGP (("Dequeuing %s at depth %d\n", qel->url, qel->depth));
+ DEBUGP (("Queue count %d, maxcount %d.\n", queue->count, queue->maxcount));
+
+ xfree (qel);
+ return 1;
+}
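
To illustrate the FIFO discipline that makes the new traversal
breadth-first, here is a minimal, self-contained sketch of the same
queue shape (plain libc in place of Wget's xmalloc/DEBUGP machinery;
the referer field is omitted and the driver in main() is hypothetical,
not part of the patch):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct queue_element {
  const char *url;
  int depth;
  struct queue_element *next;
};

struct url_queue {
  struct queue_element *head, *tail;
};

/* Append an element at the tail. */
static void
url_enqueue (struct url_queue *queue, const char *url, int depth)
{
  struct queue_element *qel = malloc (sizeof (*qel));
  qel->url = strdup (url);
  qel->depth = depth;
  qel->next = NULL;
  if (queue->tail)
    queue->tail->next = qel;
  queue->tail = qel;
  if (!queue->head)
    queue->head = queue->tail;
}

/* Remove an element from the head; return 0 when the queue is empty. */
static int
url_dequeue (struct url_queue *queue, const char **url, int *depth)
{
  struct queue_element *qel = queue->head;
  if (!qel)
    return 0;
  queue->head = qel->next;
  if (!queue->head)
    queue->tail = NULL;
  *url = qel->url;
  *depth = qel->depth;
  free (qel);
  return 1;
}

int
main (void)
{
  struct url_queue q = { NULL, NULL };
  const char *url;
  int depth;

  url_enqueue (&q, "http://host/", 0);
  url_enqueue (&q, "http://host/a.html", 1);
  url_enqueue (&q, "http://host/b.html", 1);

  /* FIFO order: everything at depth 0 comes out before anything at
     depth 1, which is exactly what makes the traversal breadth-first. */
  while (url_dequeue (&q, &url, &depth))
    {
      printf ("depth %d: %s\n", depth, url);
      free ((char *) url);
    }
  return 0;
}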
+
+static int descend_url_p PARAMS ((const struct urlpos *, struct url *, int,
+ struct url *, struct hash_table *));
+
+/* Retrieve a part of the web beginning with START_URL. This used to
+ be called "recursive retrieval", because the old function was
+ recursive and implemented depth-first search. retrieve_tree on the
+   other hand implements breadth-first traversal of the tree, which
+ results in much nicer ordering of downloads.
+
+ The algorithm this function uses is simple:
+
+ 1. put START_URL in the queue.
+ 2. while there are URLs in the queue:
+
+ 3. get next URL from the queue.
+ 4. download it.
+ 5. if the URL is HTML and its depth does not exceed maximum depth,
+ get the list of URLs embedded therein.
+ 6. for each of those URLs do the following:
+
+ 7. if the URL is not one of those downloaded before, and if it
+ satisfies the criteria specified by the various command-line
+ options, add it to the queue. */
- The function also supports specification of maximum recursion depth
- and a number of other goodies. */
uerr_t
-recursive_retrieve (const char *file, const char *this_url)
+retrieve_tree (const char *start_url)
{
- char *constr, *filename, *newloc;
- char *canon_this_url = NULL;
- int dt, inl, dash_p_leaf_HTML = FALSE;
- int meta_disallow_follow;
- int this_url_ftp; /* See below the explanation */
- urlpos *url_list, *cur_url;
- struct url *u;
-
- assert (this_url != NULL);
- assert (file != NULL);
- /* If quota was exceeded earlier, bail out. */
- if (downloaded_exceeds_quota ())
- return QUOTEXC;
- /* Cache the current URL in the list. */
- if (first_time)
- {
- /* These three operations need to be done only once per Wget
- run. They should probably be at a different location. */
- if (!undesirable_urls)
- undesirable_urls = make_string_hash_table (0);
-
- hash_table_clear (undesirable_urls);
- string_set_add (undesirable_urls, this_url);
- /* Enter this_url to the hash table, in original and "enhanced" form. */
- u = url_parse (this_url, NULL);
- if (u)
- {
- string_set_add (undesirable_urls, u->url);
- if (opt.no_parent)
- base_dir = xstrdup (u->dir); /* Set the base dir. */
- /* Set the canonical this_url to be sent as referer. This
- problem exists only when running the first time. */
- canon_this_url = xstrdup (u->url);
- }
- else
- {
- DEBUGP (("Double yuck! The *base* URL is broken.\n"));
- base_dir = NULL;
- }
- url_free (u);
- depth = 1;
- first_time = 0;
- }
- else
- ++depth;
+ uerr_t status = RETROK;
- if (opt.reclevel != INFINITE_RECURSION && depth > opt.reclevel)
- /* We've exceeded the maximum recursion depth specified by the user. */
- {
- if (opt.page_requisites && depth <= opt.reclevel + 1)
- /* When -p is specified, we can do one more partial recursion from the
- "leaf nodes" on the HTML document tree. The recursion is partial in
- that we won't traverse any <A> or <AREA> tags, nor any <LINK> tags
- except for <LINK REL="stylesheet">. */
- dash_p_leaf_HTML = TRUE;
- else
- /* Either -p wasn't specified or it was and we've already gone the one
- extra (pseudo-)level that it affords us, so we need to bail out. */
- {
- DEBUGP (("Recursion depth %d exceeded max. depth %d.\n",
- depth, opt.reclevel));
- --depth;
- return RECLEVELEXC;
- }
- }
+ /* The queue of URLs we need to load. */
+ struct url_queue *queue = url_queue_new ();
- /* Determine whether this_url is an FTP URL. If it is, it means
- that the retrieval is done through proxy. In that case, FTP
- links will be followed by default and recursion will not be
- turned off when following them. */
- this_url_ftp = (url_scheme (this_url) == SCHEME_FTP);
+ /* The URLs we decided we don't want to load. */
+ struct hash_table *blacklist = make_string_hash_table (0);
- /* Get the URL-s from an HTML file: */
- url_list = get_urls_html (file, canon_this_url ? canon_this_url : this_url,
- dash_p_leaf_HTML, &meta_disallow_follow);
+ /* We'll need various components of this, so better get it over with
+ now. */
+ struct url *start_url_parsed = url_parse (start_url, NULL);
- if (opt.use_robots && meta_disallow_follow)
- {
- /* The META tag says we are not to follow this file. Respect
- that. */
- free_urlpos (url_list);
- url_list = NULL;
- }
+ url_enqueue (queue, xstrdup (start_url), NULL, 0);
+ string_set_add (blacklist, start_url);
- /* Decide what to do with each of the URLs. A URL will be loaded if
- it meets several requirements, discussed later. */
- for (cur_url = url_list; cur_url; cur_url = cur_url->next)
+ while (1)
{
- /* If quota was exceeded earlier, bail out. */
+ int descend = 0;
+ char *url, *referer, *file = NULL;
+ int depth;
+      int dash_p_leaf_HTML = FALSE;
+
if (downloaded_exceeds_quota ())
break;
- /* Parse the URL for convenient use in other functions, as well
- as to get the optimized form. It also checks URL integrity. */
- u = url_parse (cur_url->url, NULL);
- if (!u)
- {
- DEBUGP (("Yuck! A bad URL.\n"));
- continue;
- }
- assert (u->url != NULL);
- constr = xstrdup (u->url);
-
- /* Several checkings whether a file is acceptable to load:
- 1. check if URL is ftp, and we don't load it
- 2. check for relative links (if relative_only is set)
- 3. check for domain
- 4. check for no-parent
- 5. check for excludes && includes
- 6. check for suffix
- 7. check for same host (if spanhost is unset), with possible
- gethostbyname baggage
- 8. check for robots.txt
-
- Addendum: If the URL is FTP, and it is to be loaded, only the
- domain and suffix settings are "stronger".
-
- Note that .html and (yuck) .htm will get loaded regardless of
- suffix rules (but that is remedied later with unlink) unless
- the depth equals the maximum depth.
-
- More time- and memory- consuming tests should be put later on
- the list. */
-
- /* inl is set if the URL we are working on (constr) is stored in
- undesirable_urls. Using it is crucial to avoid unnecessary
- repeated continuous hits to the hash table. */
- inl = string_set_contains (undesirable_urls, constr);
-
- /* If it is FTP, and FTP is not followed, chuck it out. */
- if (!inl)
- if (u->scheme == SCHEME_FTP && !opt.follow_ftp && !this_url_ftp)
- {
- DEBUGP (("Uh, it is FTP but i'm not in the mood to follow FTP.\n"));
- string_set_add (undesirable_urls, constr);
- inl = 1;
- }
- /* If it is absolute link and they are not followed, chuck it
- out. */
- if (!inl && u->scheme != SCHEME_FTP)
- if (opt.relative_only && !cur_url->link_relative_p)
- {
- DEBUGP (("It doesn't really look like a relative link.\n"));
- string_set_add (undesirable_urls, constr);
- inl = 1;
- }
- /* If its domain is not to be accepted/looked-up, chuck it out. */
- if (!inl)
- if (!accept_domain (u))
+
+ if (status == FWRITEERR)
+ break;
+
+ /* Get the next URL from the queue. */
+
+ if (!url_dequeue (queue,
+ (const char **)&url, (const char **)&referer,
+ &depth))
+ break;
+
+ /* And download it. */
+
+ {
+ int dt = 0;
+ char *redirected = NULL;
+ int oldrec = opt.recursive;
+
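+	/* Turn off recursion inside retrieve_url: retrieve_tree itself
+	   drives the traversal, one dequeued URL at a time.  */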
+ opt.recursive = 0;
+ status = retrieve_url (url, &file, &redirected, NULL, &dt);
+ opt.recursive = oldrec;
+
+ if (redirected)
{
- DEBUGP (("I don't like the smell of that domain.\n"));
- string_set_add (undesirable_urls, constr);
- inl = 1;
+ xfree (url);
+ url = redirected;
}
- /* Check for parent directory. */
- if (!inl && opt.no_parent
- /* If the new URL is FTP and the old was not, ignore
- opt.no_parent. */
- && !(!this_url_ftp && u->scheme == SCHEME_FTP))
+ if (file && status == RETROK
+ && (dt & RETROKF) && (dt & TEXTHTML))
+ descend = 1;
+ }
+
+ if (descend
+ && depth >= opt.reclevel && opt.reclevel != INFINITE_RECURSION)
{
- /* Check for base_dir first. */
- if (!(base_dir && frontcmp (base_dir, u->dir)))
+ if (opt.page_requisites && depth == opt.reclevel)
+ /* When -p is specified, we can do one more partial
+ recursion from the "leaf nodes" on the HTML document
+ tree. The recursion is partial in that we won't
+ traverse any <A> or <AREA> tags, nor any <LINK> tags
+ except for <LINK REL="stylesheet">. */
+ /* #### This would be the place to implement the TODO
+ entry saying that -p should do two more hops on
+ framesets. */
+ dash_p_leaf_HTML = TRUE;
+ else
{
- /* Failing that, check for parent dir. */
- struct url *ut = url_parse (this_url, NULL);
- if (!ut)
- DEBUGP (("Double yuck! The *base* URL is broken.\n"));
- else if (!frontcmp (ut->dir, u->dir))
- {
- /* Failing that too, kill the URL. */
- DEBUGP (("Trying to escape parental guidance with no_parent on.\n"));
- string_set_add (undesirable_urls, constr);
- inl = 1;
- }
- url_free (ut);
+ /* Either -p wasn't specified or it was and we've
+ already gone the one extra (pseudo-)level that it
+ affords us, so we need to bail out. */
+ DEBUGP (("Not descending further; at depth %d, max. %d.\n",
+ depth, opt.reclevel));
+ descend = 0;
}
}
- /* If the file does not match the acceptance list, or is on the
- rejection list, chuck it out. The same goes for the
- directory exclude- and include- lists. */
- if (!inl && (opt.includes || opt.excludes))
+
+ /* If the downloaded document was HTML, parse it and enqueue the
+ links it contains. */
+
+ if (descend)
{
- if (!accdir (u->dir, ALLABS))
+ int meta_disallow_follow = 0;
+ struct urlpos *children = get_urls_html (file, url, dash_p_leaf_HTML,
+ &meta_disallow_follow);
+
+ if (opt.use_robots && meta_disallow_follow)
{
- DEBUGP (("%s (%s) is excluded/not-included.\n", constr, u->dir));
- string_set_add (undesirable_urls, constr);
- inl = 1;
+ free_urlpos (children);
+ children = NULL;
}
- }
- if (!inl)
- {
- char *suf = NULL;
- /* We check for acceptance/rejection rules only for non-HTML
- documents. Since we don't know whether they really are
- HTML, it will be deduced from (an OR-ed list):
-
- 1) u->file is "" (meaning it is a directory)
- 2) suffix exists, AND:
- a) it is "html", OR
- b) it is "htm"
-
- If the file *is* supposed to be HTML, it will *not* be
- subject to acc/rej rules, unless a finite maximum depth has
- been specified and the current depth is the maximum depth. */
- if (!
- (!*u->file
- || (((suf = suffix (constr)) != NULL)
- && ((!strcmp (suf, "html") || !strcmp (suf, "htm"))
- && ((opt.reclevel != INFINITE_RECURSION) &&
- (depth != opt.reclevel))))))
+
+ if (children)
{
- if (!acceptable (u->file))
+ struct urlpos *child = children;
+	  struct url *url_parsed = url_parse (url, NULL);
+ assert (url_parsed != NULL);
+
+ for (; child; child = child->next)
{
- DEBUGP (("%s (%s) does not match acc/rej rules.\n",
- constr, u->file));
- string_set_add (undesirable_urls, constr);
- inl = 1;
+ if (descend_url_p (child, url_parsed, depth, start_url_parsed,
+ blacklist))
+ {
+ url_enqueue (queue, xstrdup (child->url->url),
+ xstrdup (url), depth + 1);
+ /* We blacklist the URL we have enqueued, because we
+ don't want to enqueue (and hence download) the
+ same URL twice. */
+ string_set_add (blacklist, child->url->url);
+ }
}
+
+ url_free (url_parsed);
+ free_urlpos (children);
}
- FREE_MAYBE (suf);
}
- /* Optimize the URL (which includes possible DNS lookup) only
- after all other possibilities have been exhausted. */
- if (!inl)
+
+ if (opt.delete_after || (file && !acceptable (file)))
{
- if (!opt.simple_check)
- {
- /* Find the "true" host. */
- char *host = realhost (u->host);
- xfree (u->host);
- u->host = host;
-
- /* Refresh the printed representation of the URL. */
- xfree (u->url);
- u->url = url_string (u, 0);
- }
- else
- {
- char *p;
- /* Just lowercase the hostname. */
- for (p = u->host; *p; p++)
- *p = TOLOWER (*p);
- xfree (u->url);
- u->url = url_string (u, 0);
- }
- xfree (constr);
- constr = xstrdup (u->url);
- /* After we have canonicalized the URL, check if we have it
- on the black list. */
- if (string_set_contains (undesirable_urls, constr))
- inl = 1;
- /* This line is bogus. */
- /*string_set_add (undesirable_urls, constr);*/
-
- if (!inl && !((u->scheme == SCHEME_FTP) && !this_url_ftp))
- if (!opt.spanhost && this_url && !same_host (this_url, constr))
- {
- DEBUGP (("This is not the same hostname as the parent's.\n"));
- string_set_add (undesirable_urls, constr);
- inl = 1;
- }
+ /* Either --delete-after was specified, or we loaded this
+ otherwise rejected (e.g. by -R) HTML file just so we
+ could harvest its hyperlinks -- in either case, delete
+ the local file. */
+	  DEBUGP (("Removing file due to %s in retrieve_tree():\n",
+ opt.delete_after ? "--delete-after" :
+ "recursive rejection criteria"));
+ logprintf (LOG_VERBOSE,
+ (opt.delete_after ? _("Removing %s.\n")
+ : _("Removing %s since it should be rejected.\n")),
+ file);
+ if (unlink (file))
+ logprintf (LOG_NOTQUIET, "unlink: %s\n", strerror (errno));
}
- /* What about robots.txt? */
- if (!inl && opt.use_robots && u->scheme == SCHEME_HTTP)
+
+ xfree (url);
+ FREE_MAYBE (referer);
+ FREE_MAYBE (file);
+ }
+
+  /* If anything is left in the queue due to a premature exit, free it
+ now. */
+ {
+ char *d1, *d2;
+ int d3;
+ while (url_dequeue (queue, (const char **)&d1, (const char **)&d2, &d3))
+ {
+ xfree (d1);
+ FREE_MAYBE (d2);
+ }
+ }
+ url_queue_delete (queue);
+
+ if (start_url_parsed)
+ url_free (start_url_parsed);
+ string_set_free (blacklist);
+
+ if (downloaded_exceeds_quota ())
+ return QUOTEXC;
+ else if (status == FWRITEERR)
+ return FWRITEERR;
+ else
+ return RETROK;
+}
+
+/* Based on the context provided by retrieve_tree, decide whether a
+ URL is to be descended to. This is only ever called from
+ retrieve_tree, but is in a separate function for clarity. */
+
+static int
+descend_url_p (const struct urlpos *upos, struct url *parent, int depth,
+ struct url *start_url_parsed, struct hash_table *blacklist)
+{
+ struct url *u = upos->url;
+ const char *url = u->url;
+
+ DEBUGP (("Deciding whether to enqueue \"%s\".\n", url));
+
+ if (string_set_contains (blacklist, url))
+ {
+ DEBUGP (("Already on the black list.\n"));
+ goto out;
+ }
+
+ /* Several things to check for:
+     1. check for scheme: only HTTP is followed (and FTP, with --follow-ftp)
+ 2. check for relative links (if relative_only is set)
+ 3. check for domain
+ 4. check for no-parent
+ 5. check for excludes && includes
+ 6. check for suffix
+     7. check for same host (if spanhost is unset)
+ 8. check for robots.txt
+
+ Addendum: If the URL is FTP, and it is to be loaded, only the
+ domain and suffix settings are "stronger".
+
+ Note that .html files will get loaded regardless of suffix rules
+ (but that is remedied later with unlink) unless the depth equals
+ the maximum depth.
+
+ More time- and memory- consuming tests should be put later on
+ the list. */
+
+ /* 1. Schemes other than HTTP are normally not recursed into. */
+ if (u->scheme != SCHEME_HTTP
+ && !(u->scheme == SCHEME_FTP && opt.follow_ftp))
+ {
+ DEBUGP (("Not following non-HTTP schemes.\n"));
+ goto blacklist;
+ }
+
+ /* 2. If it is an absolute link and they are not followed, throw it
+ out. */
+ if (u->scheme == SCHEME_HTTP)
+ if (opt.relative_only && !upos->link_relative_p)
+ {
+ DEBUGP (("It doesn't really look like a relative link.\n"));
+ goto blacklist;
+ }
+
+ /* 3. If its domain is not to be accepted/looked-up, chuck it
+ out. */
+ if (!accept_domain (u))
+ {
+ DEBUGP (("The domain was not accepted.\n"));
+ goto blacklist;
+ }
+
+ /* 4. Check for parent directory.
+
+ If we descended to a different host or changed the scheme, ignore
+ opt.no_parent. Also ignore it for -p leaf retrievals. */
+ if (opt.no_parent
+ && u->scheme == parent->scheme
+ && 0 == strcasecmp (u->host, parent->host)
+ && u->port == parent->port)
+ {
+ if (!frontcmp (parent->dir, u->dir))
{
- struct robot_specs *specs = res_get_specs (u->host, u->port);
- if (!specs)
- {
- char *rfile;
- if (res_retrieve_file (constr, &rfile))
- {
- specs = res_parse_from_file (rfile);
- xfree (rfile);
- }
- else
- {
- /* If we cannot get real specs, at least produce
- dummy ones so that we can register them and stop
- trying to retrieve them. */
- specs = res_parse ("", 0);
- }
- res_register_specs (u->host, u->port, specs);
- }
+ DEBUGP (("Trying to escape the root directory with no_parent in effect.\n"));
+ goto blacklist;
+ }
+ }
- /* Now that we have (or don't have) robots.txt specs, we can
- check what they say. */
- if (!res_match_path (specs, u->path))
- {
- DEBUGP (("Not following %s because robots.txt forbids it.\n",
- constr));
- string_set_add (undesirable_urls, constr);
- inl = 1;
- }
+ /* 5. If the file does not match the acceptance list, or is on the
+ rejection list, chuck it out. The same goes for the directory
+ exclusion and inclusion lists. */
+ if (opt.includes || opt.excludes)
+ {
+ if (!accdir (u->dir, ALLABS))
+ {
+ DEBUGP (("%s (%s) is excluded/not-included.\n", url, u->dir));
+ goto blacklist;
}
+ }
- filename = NULL;
- /* If it wasn't chucked out, do something with it. */
- if (!inl)
+ /* 6. */
+ {
+ char *suf = NULL;
+ /* Check for acceptance/rejection rules. We ignore these rules
+ for HTML documents because they might lead to other files which
+ need to be downloaded. Of course, we don't know which
+ documents are HTML before downloading them, so we guess.
+
+ A file is subject to acceptance/rejection rules if:
+
+ * u->file is not "" (i.e. it is not a directory)
+ and either:
+ + there is no file suffix,
+       + or there is a suffix, but it is not "html" or "htm",
+ + both:
+ - recursion is not infinite,
+ - and we are at its very end. */
+
+ if (u->file[0] != '\0'
+ && ((suf = suffix (url)) == NULL
+ || (0 != strcmp (suf, "html") && 0 != strcmp (suf, "htm"))
+	    || (opt.reclevel != INFINITE_RECURSION && depth >= opt.reclevel)))
+ {
+ if (!acceptable (u->file))
+ {
+ DEBUGP (("%s (%s) does not match acc/rej rules.\n",
+ url, u->file));
+ FREE_MAYBE (suf);
+ goto blacklist;
+ }
+ }
+ FREE_MAYBE (suf);
+ }
+
+ /* 7. */
+ if (u->scheme == parent->scheme)
+ if (!opt.spanhost && 0 != strcasecmp (parent->host, u->host))
+ {
+ DEBUGP (("This is not the same hostname as the parent's (%s and %s).\n",
+ u->host, parent->host));
+ goto blacklist;
+ }
+
+ /* 8. */
+ if (opt.use_robots && u->scheme == SCHEME_HTTP)
+ {
+ struct robot_specs *specs = res_get_specs (u->host, u->port);
+ if (!specs)
{
- DEBUGP (("I've decided to load it -> "));
- /* Add it to the list of already-loaded URL-s. */
- string_set_add (undesirable_urls, constr);
- /* Automatically followed FTPs will *not* be downloaded
- recursively. */
- if (u->scheme == SCHEME_FTP)
+ char *rfile;
+ if (res_retrieve_file (url, &rfile))
{
- /* Don't you adore side-effects? */
- opt.recursive = 0;
- }
- /* Reset its type. */
- dt = 0;
- /* Retrieve it. */
- retrieve_url (constr, &filename, &newloc,
- canon_this_url ? canon_this_url : this_url, &dt);
- if (u->scheme == SCHEME_FTP)
- {
- /* Restore... */
- opt.recursive = 1;
- }
- if (newloc)
- {
- xfree (constr);
- constr = newloc;
- }
- /* If there was no error, and the type is text/html, parse
- it recursively. */
- if (dt & TEXTHTML)
- {
- if (dt & RETROKF)
- recursive_retrieve (filename, constr);
+ specs = res_parse_from_file (rfile);
+ xfree (rfile);
}
else
- DEBUGP (("%s is not text/html so we don't chase.\n",
- filename ? filename: "(null)"));
-
- if (opt.delete_after || (filename && !acceptable (filename)))
- /* Either --delete-after was specified, or we loaded this otherwise
- rejected (e.g. by -R) HTML file just so we could harvest its
- hyperlinks -- in either case, delete the local file. */
{
- DEBUGP (("Removing file due to %s in recursive_retrieve():\n",
- opt.delete_after ? "--delete-after" :
- "recursive rejection criteria"));
- logprintf (LOG_VERBOSE,
- (opt.delete_after ? _("Removing %s.\n")
- : _("Removing %s since it should be rejected.\n")),
- filename);
- if (unlink (filename))
- logprintf (LOG_NOTQUIET, "unlink: %s\n", strerror (errno));
- dt &= ~RETROKF;
+ /* If we cannot get real specs, at least produce
+ dummy ones so that we can register them and stop
+ trying to retrieve them. */
+ specs = res_parse ("", 0);
}
+ res_register_specs (u->host, u->port, specs);
+ }
- /* If everything was OK, and links are to be converted, let's
- store the local filename. */
- if (opt.convert_links && (dt & RETROKF) && (filename != NULL))
- {
- cur_url->convert = CO_CONVERT_TO_RELATIVE;
- cur_url->local_name = xstrdup (filename);
- }
+ /* Now that we have (or don't have) robots.txt specs, we can
+ check what they say. */
+ if (!res_match_path (specs, u->path))
+ {
+ DEBUGP (("Not following %s because robots.txt forbids it.\n", url));
+ goto blacklist;
}
- else
- DEBUGP (("%s already in list, so we don't load.\n", constr));
- /* Free filename and constr. */
- FREE_MAYBE (filename);
- FREE_MAYBE (constr);
- url_free (u);
- /* Increment the pbuf for the appropriate size. */
}
- if (opt.convert_links && !opt.delete_after)
- /* This is merely the first pass: the links that have been
- successfully downloaded are converted. In the second pass,
- convert_all_links() will also convert those links that have NOT
- been downloaded to their canonical form. */
- convert_links (file, url_list);
- /* Free the linked list of URL-s. */
- free_urlpos (url_list);
- /* Free the canonical this_url. */
- FREE_MAYBE (canon_this_url);
- /* Decrement the recursion depth. */
- --depth;
- if (downloaded_exceeds_quota ())
- return QUOTEXC;
- else
- return RETROK;
+
+ /* The URL has passed all the tests. It can be placed in the
+ download queue. */
+ DEBUGP (("Decided to load it.\n"));
+
+ return 1;
+
+ blacklist:
+ string_set_add (blacklist, url);
+
+ out:
+ DEBUGP (("Decided NOT to load it.\n"));
+
+ return 0;
}
\f
+/* Register that URL has been successfully downloaded to FILE. */
+
void
register_download (const char *url, const char *file)
{
return;
if (!dl_file_url_map)
dl_file_url_map = make_string_hash_table (0);
- hash_table_put (dl_file_url_map, xstrdup (file), xstrdup (url));
if (!dl_url_file_map)
dl_url_file_map = make_string_hash_table (0);
- hash_table_put (dl_url_file_map, xstrdup (url), xstrdup (file));
+
+ if (!hash_table_contains (dl_file_url_map, file))
+ hash_table_put (dl_file_url_map, xstrdup (file), xstrdup (url));
+ if (!hash_table_contains (dl_url_file_map, url))
+ hash_table_put (dl_url_file_map, xstrdup (url), xstrdup (file));
}
+/* Register that FROM has been redirected to TO. This assumes that TO
+ is successfully downloaded and already registered using
+ register_download() above. */
+
+void
+register_redirection (const char *from, const char *to)
+{
+ char *file;
+
+ if (!opt.convert_links)
+ return;
+
+ file = hash_table_get (dl_url_file_map, to);
+ assert (file != NULL);
+ if (!hash_table_contains (dl_url_file_map, from))
+ hash_table_put (dl_url_file_map, xstrdup (from), xstrdup (file));
+}
+
+/* Register that URL corresponds to the HTML file FILE. */
+
void
register_html (const char *url, const char *file)
{
for (html = downloaded_html_files; html; html = html->next)
{
- urlpos *urls, *cur_url;
+ struct urlpos *urls, *cur_url;
char *url;
DEBUGP (("Rescanning %s\n", html->string));
+
/* Determine the URL of the HTML file. get_urls_html will need
it. */
url = hash_table_get (dl_file_url_map, html->string);
DEBUGP (("It should correspond to %s.\n", url));
else
DEBUGP (("I cannot find the corresponding URL.\n"));
+
/* Parse the HTML file... */
urls = get_urls_html (html->string, url, FALSE, NULL);
+
/* We don't respect meta_disallow_follow here because, even if
the file is not followed, we might still want to convert the
links that have been followed from other files. */
+
for (cur_url = urls; cur_url; cur_url = cur_url->next)
{
char *local_name;
+ struct url *u = cur_url->url;
- /* The URL must be in canonical form to be compared. */
- struct url *u = url_parse (cur_url->url, NULL);
- if (!u)
- continue;
/* We decide the direction of conversion according to whether
a URL was downloaded. Downloaded URLs will be converted
ABS2REL, whereas non-downloaded will be converted REL2ABS. */
if (local_name)
DEBUGP (("%s marked for conversion, local %s\n",
u->url, local_name));
+
/* Decide on the conversion direction. */
if (local_name)
{
cur_url->convert = CO_CONVERT_TO_COMPLETE;
cur_url->local_name = NULL;
}
- url_free (u);
}
/* Convert the links in the file. */
convert_links (html->string, urls);
free_urlpos (urls);
}
}
+
+/* Cleanup the data structures associated with recursive retrieving
+ (the variables above). */
+void
+recursive_cleanup (void)
+{
+ if (dl_file_url_map)
+ {
+ free_keys_and_values (dl_file_url_map);
+ hash_table_destroy (dl_file_url_map);
+ dl_file_url_map = NULL;
+ }
+ if (dl_url_file_map)
+ {
+ free_keys_and_values (dl_url_file_map);
+ hash_table_destroy (dl_url_file_map);
+ dl_url_file_map = NULL;
+ }
+ slist_free (downloaded_html_files);
+ downloaded_html_files = NULL;
+}
#define RECUR_H
void recursive_cleanup PARAMS ((void));
-void recursive_reset PARAMS ((void));
-uerr_t recursive_retrieve PARAMS ((const char *, const char *));
+uerr_t retrieve_tree PARAMS ((const char *));
void register_download PARAMS ((const char *, const char *));
+void register_redirection PARAMS ((const char *, const char *));
void register_html PARAMS ((const char *, const char *));
void convert_all_links PARAMS ((void));
int allowedp, int exactp)
{
struct path_info pp;
+ if (path_b < path_e && *path_b == '/')
+ /* Our path representation doesn't use a leading slash, so remove
+ one from theirs. */
+ ++path_b;
pp.path = strdupdelim (path_b, path_e);
pp.allowedp = allowedp;
pp.user_agent_exact_p = exactp;
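
(For instance, a hypothetical `Disallow: /cgi-bin/' line is now stored
as `cgi-bin/', so it compares cleanly against the slash-less path
representation used on the Wget side.)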
static void
free_specs (struct robot_specs *specs)
{
+ int i;
+ for (i = 0; i < specs->count; i++)
+ xfree (specs->paths[i].path);
FREE_MAYBE (specs->paths);
xfree (specs);
}
}
return err == RETROK;
}
+\f
+static int
+cleanup_hash_table_mapper (void *key, void *value, void *arg_ignored)
+{
+ xfree (key);
+ free_specs (value);
+ return 0;
+}
+
+void
+res_cleanup (void)
+{
+ if (registered_specs)
+ {
+ hash_table_map (registered_specs, cleanup_hash_table_mapper, NULL);
+ hash_table_destroy (registered_specs);
+ registered_specs = NULL;
+ }
+}
int res_retrieve_file PARAMS ((const char *, char **));
+void res_cleanup PARAMS ((void));
return res;
}
\f
+static int
+register_redirections_mapper (void *key, void *value, void *arg)
+{
+ const char *redirected_from = (const char *)key;
+ const char *redirected_to = (const char *)arg;
+ if (0 != strcmp (redirected_from, redirected_to))
+ register_redirection (redirected_from, redirected_to);
+ return 0;
+}
+
+/* Register the redirections that lead to the successful download of
+ this URL. This is necessary so that the link converter can convert
+ redirected URLs to the local file. */
+
+static void
+register_all_redirections (struct hash_table *redirections, const char *final)
+{
+ hash_table_map (redirections, register_redirections_mapper, (void *)final);
+}
+
#define USE_PROXY_P(u) (opt.use_proxy && getproxy((u)->scheme) \
&& no_proxy_match((u)->host, \
(const char **)opt.no_proxy))
proxy_url = url_parse (proxy, &up_error_code);
if (!proxy_url)
{
- logprintf (LOG_NOTQUIET, "Error parsing proxy URL %s: %s.\n",
+ logprintf (LOG_NOTQUIET, _("Error parsing proxy URL %s: %s.\n"),
proxy, url_error (up_error_code));
if (redirections)
string_set_free (redirections);
if (location_changed)
{
char *construced_newloc;
- struct url *newloc_struct;
+ struct url *newloc_parsed;
assert (mynewloc != NULL);
mynewloc = construced_newloc;
/* Now, see if this new location makes sense. */
- newloc_struct = url_parse (mynewloc, &up_error_code);
- if (!newloc_struct)
+ newloc_parsed = url_parse (mynewloc, &up_error_code);
+ if (!newloc_parsed)
{
logprintf (LOG_NOTQUIET, "%s: %s.\n", mynewloc,
url_error (up_error_code));
- url_free (newloc_struct);
url_free (u);
if (redirections)
string_set_free (redirections);
return result;
}
- /* Now mynewloc will become newloc_struct->url, because if the
+ /* Now mynewloc will become newloc_parsed->url, because if the
Location contained relative paths like .././something, we
don't want that propagating as url. */
xfree (mynewloc);
- mynewloc = xstrdup (newloc_struct->url);
+ mynewloc = xstrdup (newloc_parsed->url);
if (!redirections)
{
/* The new location is OK. Check for redirection cycle by
peeking through the history of redirections. */
- if (string_set_contains (redirections, newloc_struct->url))
+ if (string_set_contains (redirections, newloc_parsed->url))
{
logprintf (LOG_NOTQUIET, _("%s: Redirection cycle detected.\n"),
mynewloc);
- url_free (newloc_struct);
+ url_free (newloc_parsed);
url_free (u);
if (redirections)
string_set_free (redirections);
xfree (mynewloc);
return WRONGCODE;
}
- string_set_add (redirections, newloc_struct->url);
+ string_set_add (redirections, newloc_parsed->url);
xfree (url);
url = mynewloc;
url_free (u);
- u = newloc_struct;
+ u = newloc_parsed;
goto redirected;
}
if (*dt & RETROKF)
{
register_download (url, local_file);
+ if (redirections)
+ register_all_redirections (redirections, url);
if (*dt & TEXTHTML)
register_html (url, local_file);
}
retrieve_from_file (const char *file, int html, int *count)
{
uerr_t status;
- urlpos *url_list, *cur_url;
+ struct urlpos *url_list, *cur_url;
url_list = (html ? get_urls_html (file, NULL, FALSE, NULL)
: get_urls_file (file));
status = RETROK; /* Suppose everything is OK. */
*count = 0; /* Reset the URL count. */
- recursive_reset ();
+
for (cur_url = url_list; cur_url; cur_url = cur_url->next, ++*count)
{
- char *filename, *new_file;
+ char *filename = NULL, *new_file;
int dt;
      if (downloaded_exceeds_quota ())
	{
	  status = QUOTEXC;
	  break;
	}
- status = retrieve_url (cur_url->url, &filename, &new_file, NULL, &dt);
- if (opt.recursive && status == RETROK && (dt & TEXTHTML))
- status = recursive_retrieve (filename, new_file ? new_file
- : cur_url->url);
+ if (opt.recursive && cur_url->url->scheme != SCHEME_FTP)
+ status = retrieve_tree (cur_url->url->url);
+ else
+ status = retrieve_url (cur_url->url->url, &filename, &new_file, NULL, &dt);
if (filename && opt.delete_after && file_exists_p (filename))
{
#include "utils.h"
#include "url.h"
#include "host.h"
+#include "hash.h"
#ifndef errno
extern int errno;
{
if (UNSAFE_CHAR (*p1))
{
- const unsigned char c = *p1++;
+ unsigned char c = *p1++;
*p2++ = '%';
*p2++ = XDIGIT_TO_XCHAR (c >> 4);
*p2++ = XDIGIT_TO_XCHAR (c & 0xf);
{
case CM_ENCODE:
{
- char c = *p1++;
+ unsigned char c = *p1++;
*p2++ = '%';
*p2++ = XDIGIT_TO_XCHAR (c >> 4);
*p2++ = XDIGIT_TO_XCHAR (c & 0xf);
return p;
}
+/* Turn STR into lowercase; return non-zero if a character was
+ actually changed. */
+
+static int
+lowercase_str (char *str)
+{
+ int change = 0;
+ for (; *str; str++)
+    if (ISUPPER (*str))
+ {
+ change = 1;
+ *str = TOLOWER (*str);
+ }
+ return change;
+}
+
static char *parse_errors[] = {
#define PE_NO_ERROR 0
"No error",
{
struct url *u;
const char *p;
+ int path_modified, host_modified;
enum url_scheme scheme;
int port;
char *user = NULL, *passwd = NULL;
- const char *url_orig = url;
-
- p = url = reencode_string (url);
+ char *url_encoded;
scheme = url_scheme (url);
if (scheme == SCHEME_INVALID)
return NULL;
}
+ url_encoded = reencode_string (url);
+ p = url_encoded;
+
p += strlen (supported_schemes[scheme].leading_string);
uname_b = p;
p += url_skip_uname (p);
u = (struct url *)xmalloc (sizeof (struct url));
memset (u, 0, sizeof (*u));
- if (url == url_orig)
- u->url = xstrdup (url);
- else
- u->url = (char *)url;
-
u->scheme = scheme;
u->host = strdupdelim (host_b, host_e);
u->port = port;
u->passwd = passwd;
u->path = strdupdelim (path_b, path_e);
- path_simplify (u->path);
+ path_modified = path_simplify (u->path);
+ parse_path (u->path, &u->dir, &u->file);
+
+ host_modified = lowercase_str (u->host);
if (params_b)
u->params = strdupdelim (params_b, params_e);
if (fragment_b)
u->fragment = strdupdelim (fragment_b, fragment_e);
- parse_path (u->path, &u->dir, &u->file);
+
+ if (path_modified || u->fragment || host_modified)
+ {
+ /* If path_simplify modified the path, or if a fragment is
+ present, or if the original host name had caps in it, make
+ sure that u->url is equivalent to what would be printed by
+ url_string. */
+ u->url = url_string (u, 0);
+
+ if (url_encoded != url)
+ xfree ((char *) url_encoded);
+ }
+ else
+ {
+ if (url_encoded == url)
+ u->url = xstrdup (url);
+ else
+ u->url = url_encoded;
+ }
+ url_encoded = NULL;
return u;
}
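
(For instance, given the hypothetical input "http://Host.COM/a/../b#sec":
lowercase_str reports the host modified ("host.com"), path_simplify
reduces "/a/../b" to "/b", and u->fragment is non-NULL -- any one of
these forces u->url to be re-rendered through url_string, yielding the
canonical "http://host.com/b", with the fragment kept only in
u->fragment.)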
FREE_MAYBE (url->fragment);
FREE_MAYBE (url->user);
FREE_MAYBE (url->passwd);
- FREE_MAYBE (url->dir);
- FREE_MAYBE (url->file);
+
+ xfree (url->dir);
+ xfree (url->file);
xfree (url);
}
\f
-urlpos *
+struct urlpos *
get_urls_file (const char *file)
{
struct file_memory *fm;
- urlpos *head, *tail;
+ struct urlpos *head, *tail;
const char *text, *text_end;
/* Load the file. */
--line_end;
if (line_end > line_beg)
{
- urlpos *entry = (urlpos *)xmalloc (sizeof (urlpos));
+ int up_error_code;
+ char *url_text;
+ struct urlpos *entry;
+ struct url *url;
+
+ /* We must copy the URL to a zero-terminated string. *sigh*. */
+ url_text = strdupdelim (line_beg, line_end);
+ url = url_parse (url_text, &up_error_code);
+ if (!url)
+ {
+ logprintf (LOG_NOTQUIET, "%s: Invalid URL %s: %s\n",
+ file, url_text, url_error (up_error_code));
+ xfree (url_text);
+ continue;
+ }
+ xfree (url_text);
+
+ entry = (struct urlpos *)xmalloc (sizeof (struct urlpos));
memset (entry, 0, sizeof (*entry));
entry->next = NULL;
- entry->url = strdupdelim (line_beg, line_end);
+ entry->url = url;
+
if (!head)
head = entry;
else
\f
/* Free the linked list of urlpos. */
void
-free_urlpos (urlpos *l)
+free_urlpos (struct urlpos *l)
{
while (l)
{
- urlpos *next = l->next;
- xfree (l->url);
+ struct urlpos *next = l->next;
+ if (l->url)
+ url_free (l->url);
FREE_MAYBE (l->local_name);
xfree (l);
l = next;
static char *
mkstruct (const struct url *u)
{
- char *host, *dir, *file, *res, *dirpref;
+ char *dir, *dir_preencoding;
+ char *file, *res, *dirpref;
+ char *query = u->query && *u->query ? u->query : NULL;
int l;
if (opt.cut_dirs)
else
dir = u->dir + (*u->dir == '/');
- host = xstrdup (u->host);
/* Check for the true name (or at least a consistent name for saving
to directory) of HOST, reusing the hlist if possible. */
- if (opt.add_hostdir && !opt.simple_check)
- {
- char *nhost = realhost (host);
- xfree (host);
- host = nhost;
- }
- /* Add dir_prefix and hostname (if required) to the beginning of
- dir. */
if (opt.add_hostdir)
{
+ /* Add dir_prefix and hostname (if required) to the beginning of
+ dir. */
+ dirpref = (char *)alloca (strlen (opt.dir_prefix) + 1
+ + strlen (u->host)
+ + 1 + numdigit (u->port)
+ + 1);
if (!DOTP (opt.dir_prefix))
+ sprintf (dirpref, "%s/%s", opt.dir_prefix, u->host);
+ else
+ strcpy (dirpref, u->host);
+
+ if (u->port != scheme_default_port (u->scheme))
{
- dirpref = (char *)alloca (strlen (opt.dir_prefix) + 1
- + strlen (host) + 1);
- sprintf (dirpref, "%s/%s", opt.dir_prefix, host);
+ int len = strlen (dirpref);
+ dirpref[len] = ':';
+ long_to_string (dirpref + len + 1, u->port);
}
- else
- STRDUP_ALLOCA (dirpref, host);
}
- else /* not add_hostdir */
+ else /* not add_hostdir */
{
if (!DOTP (opt.dir_prefix))
dirpref = opt.dir_prefix;
else
dirpref = "";
}
- xfree (host);
/* If there is a prefix, prepend it. */
if (*dirpref)
sprintf (newdir, "%s%s%s", dirpref, *dir == '/' ? "" : "/", dir);
dir = newdir;
}
- dir = encode_string (dir);
+
+ dir_preencoding = dir;
+ dir = reencode_string (dir_preencoding);
+
l = strlen (dir);
if (l && dir[l - 1] == '/')
dir[l - 1] = '\0';
file = u->file;
/* Finally, construct the full name. */
- res = (char *)xmalloc (strlen (dir) + 1 + strlen (file) + 1);
+ res = (char *)xmalloc (strlen (dir) + 1 + strlen (file)
+ + (query ? (1 + strlen (query)) : 0)
+ + 1);
sprintf (res, "%s%s%s", dir, *dir ? "/" : "", file);
- xfree (dir);
+ if (query)
+ {
+ strcat (res, "?");
+ strcat (res, query);
+ }
+ if (dir != dir_preencoding)
+ xfree (dir);
return res;
}
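
Editorial illustration, not part of the patch: with the changes above, the host
directory gains a ":port" suffix for non-default ports and a non-empty query
string is kept in the local file name.  A stand-alone sketch of that
construction (hypothetical helper name; plain sprintf ()/%d stands in for
Wget's numdigit () and long_to_string ()):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *
sketch_local_name (const char *host, int port, int default_port,
                   const char *dir, const char *file, const char *query)
{
  char *res = malloc (strlen (host) + strlen (dir) + strlen (file)
                      + (query ? strlen (query) : 0) + 32);
  int n = sprintf (res, "%s", host);
  if (port != default_port)
    n += sprintf (res + n, ":%d", port);      /* host directory gets a port */
  n += sprintf (res + n, "/%s%s%s", dir, *dir ? "/" : "", file);
  if (query && *query)
    sprintf (res + n, "?%s", query);          /* query kept in the file name */
  return res;
}

int
main (void)
{
  char *name = sketch_local_name ("example.com", 8000, 80,
                                  "pub", "file.txt", "id=3");
  printf ("%s\n", name);   /* example.com:8000/pub/file.txt?id=3 */
  free (name);
  return 0;
}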
{
if (UNSAFE_CHAR (*from))
{
- const unsigned char c = *from++;
+ unsigned char c = *from++;
*to++ = '%';
*to++ = XDIGIT_TO_XCHAR (c >> 4);
*to++ = XDIGIT_TO_XCHAR (c & 0xf);
static int
urlpath_length (const char *url)
{
- const char *q = strchr (url, '?');
- if (q)
- return q - url;
- return strlen (url);
+ const char *q = strpbrk_or_eos (url, "?;#");
+ return q - url;
}
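
Editorial note, not part of the patch: the path length is now cut off at ';'
and '#' as well as '?'.  A tiny stand-alone check of the expected results,
using strcspn () as a stand-in for Wget's strpbrk_or_eos ():

#include <assert.h>
#include <string.h>

static int
sketch_urlpath_length (const char *url)
{
  /* strcspn () returns the length of the initial span containing none
     of '?', ';' or '#' -- i.e. the length of the path proper.  */
  return (int) strcspn (url, "?;#");
}

int
main (void)
{
  assert (sketch_urlpath_length ("dir/file") == 8);
  assert (sketch_urlpath_length ("dir/file?q=1") == 8);
  assert (sketch_urlpath_length ("dir/file;type=a#frag") == 8);
  return 0;
}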
/* Find the last occurrence of character C in the range [b, e), or
{
const char *end = base + urlpath_length (base);
- if (*link != '/')
+ if (!*link)
{
- /* LINK is a relative URL: we need to replace everything
- after last slash (possibly empty) with LINK.
-
- So, if BASE is "whatever/foo/bar", and LINK is "qux/xyzzy",
- our result should be "whatever/foo/qux/xyzzy". */
- int need_explicit_slash = 0;
- int span;
- const char *start_insert;
- const char *last_slash = find_last_char (base, end, '/');
- if (!last_slash)
- {
- /* No slash found at all. Append LINK to what we have,
- but we'll need a slash as a separator.
-
- Example: if base == "foo" and link == "qux/xyzzy", then
- we cannot just append link to base, because we'd get
- "fooqux/xyzzy", whereas what we want is
- "foo/qux/xyzzy".
-
- To make sure the / gets inserted, we set
- need_explicit_slash to 1. We also set start_insert
- to end + 1, so that the length calculations work out
- correctly for one more (slash) character. Accessing
- that character is fine, since it will be the
- delimiter, '\0' or '?'. */
- /* example: "foo?..." */
- /* ^ ('?' gets changed to '/') */
- start_insert = end + 1;
- need_explicit_slash = 1;
- }
- else if (last_slash && last_slash != base && *(last_slash - 1) == '/')
- {
- /* example: http://host" */
- /* ^ */
- start_insert = end + 1;
- need_explicit_slash = 1;
- }
- else
- {
- /* example: "whatever/foo/bar" */
- /* ^ */
- start_insert = last_slash + 1;
- }
-
- span = start_insert - base;
- constr = (char *)xmalloc (span + linklength + 1);
- if (span)
- memcpy (constr, base, span);
- if (need_explicit_slash)
- constr[span - 1] = '/';
- if (linklength)
- memcpy (constr + span, link, linklength);
- constr[span + linklength] = '\0';
+ /* Empty LINK points back to BASE, query string and all. */
+ constr = xstrdup (base);
+ }
+ else if (*link == '?')
+ {
+ /* LINK points to the same location, but changes the query
+ string. Examples: */
+ /* uri_merge("path", "?new") -> "path?new" */
+ /* uri_merge("path?foo", "?new") -> "path?new" */
+ /* uri_merge("path?foo#bar", "?new") -> "path?new" */
+ /* uri_merge("path#foo", "?new") -> "path?new" */
+ int baselength = end - base;
+ constr = xmalloc (baselength + linklength + 1);
+ memcpy (constr, base, baselength);
+ memcpy (constr + baselength, link, linklength);
+ constr[baselength + linklength] = '\0';
}
- else /* *link == `/' */
+ else if (*link == '#')
+ {
+ /* uri_merge("path", "#new") -> "path#new" */
+ /* uri_merge("path#foo", "#new") -> "path#new" */
+ /* uri_merge("path?foo", "#new") -> "path?foo#new" */
+ /* uri_merge("path?foo#bar", "#new") -> "path?foo#new" */
+ int baselength;
+ const char *end1 = strchr (base, '#');
+ if (!end1)
+ end1 = base + strlen (base);
+ baselength = end1 - base;
+ constr = xmalloc (baselength + linklength + 1);
+ memcpy (constr, base, baselength);
+ memcpy (constr + baselength, link, linklength);
+ constr[baselength + linklength] = '\0';
+ }
+ else if (*link == '/')
{
/* LINK is an absolute path: we need to replace everything
after (and including) the FIRST slash with LINK.
memcpy (constr + span, link, linklength);
constr[span + linklength] = '\0';
}
+ else
+ {
+ /* LINK is a relative URL: we need to replace everything
+ after last slash (possibly empty) with LINK.
+
+ So, if BASE is "whatever/foo/bar", and LINK is "qux/xyzzy",
+ our result should be "whatever/foo/qux/xyzzy". */
+ int need_explicit_slash = 0;
+ int span;
+ const char *start_insert;
+ const char *last_slash = find_last_char (base, end, '/');
+ if (!last_slash)
+ {
+ /* No slash found at all. Append LINK to what we have,
+ but we'll need a slash as a separator.
+
+ Example: if base == "foo" and link == "qux/xyzzy", then
+ we cannot just append link to base, because we'd get
+ "fooqux/xyzzy", whereas what we want is
+ "foo/qux/xyzzy".
+
+ To make sure the / gets inserted, we set
+ need_explicit_slash to 1. We also set start_insert
+ to end + 1, so that the length calculations work out
+ correctly for one more (slash) character. Accessing
+ that character is fine, since it will be the
+ delimiter, '\0' or '?'. */
+ /* example: "foo?..." */
+ /* ^ ('?' gets changed to '/') */
+ start_insert = end + 1;
+ need_explicit_slash = 1;
+ }
+ else if (last_slash && last_slash != base && *(last_slash - 1) == '/')
+ {
+ /* example: http://host" */
+ /* ^ */
+ start_insert = end + 1;
+ need_explicit_slash = 1;
+ }
+ else
+ {
+ /* example: "whatever/foo/bar" */
+ /* ^ */
+ start_insert = last_slash + 1;
+ }
+
+ span = start_insert - base;
+ constr = (char *)xmalloc (span + linklength + 1);
+ if (span)
+ memcpy (constr, base, span);
+ if (need_explicit_slash)
+ constr[span - 1] = '/';
+ if (linklength)
+ memcpy (constr + span, link, linklength);
+ constr[span + linklength] = '\0';
+ }
}
else /* !no_scheme */
{
/* Change the links in an HTML document. Accepts a structure that
defines the positions of all the links. */
void
-convert_links (const char *file, urlpos *l)
+convert_links (const char *file, struct urlpos *l)
{
struct file_memory *fm;
FILE *fp;
const char *p;
downloaded_file_t downloaded_file_return;
+ int to_url_count = 0, to_file_count = 0;
logprintf (LOG_VERBOSE, _("Converting %s... "), file);
/* First we do a "dry run": go through the list L and see whether
any URL needs to be converted in the first place. If not, just
leave the file alone. */
- int count = 0;
- urlpos *dry = l;
+ int dry_count = 0;
+ struct urlpos *dry = l;
for (dry = l; dry; dry = dry->next)
if (dry->convert != CO_NOCONVERT)
- ++count;
- if (!count)
+ ++dry_count;
+ if (!dry_count)
{
logputs (LOG_VERBOSE, _("nothing to do.\n"));
return;
/* If the URL is not to be converted, skip it. */
if (l->convert == CO_NOCONVERT)
{
- DEBUGP (("Skipping %s at position %d.\n", l->url, l->pos));
+ DEBUGP (("Skipping %s at position %d.\n", l->url->url, l->pos));
continue;
}
char *quoted_newname = html_quote_string (newname);
replace_attr (&p, l->size, fp, quoted_newname);
DEBUGP (("TO_RELATIVE: %s to %s at position %d in %s.\n",
- l->url, newname, l->pos, file));
+ l->url->url, newname, l->pos, file));
xfree (newname);
xfree (quoted_newname);
+ ++to_file_count;
}
else if (l->convert == CO_CONVERT_TO_COMPLETE)
{
/* Convert the link to absolute URL. */
- char *newlink = l->url;
+ char *newlink = l->url->url;
char *quoted_newlink = html_quote_string (newlink);
replace_attr (&p, l->size, fp, quoted_newlink);
DEBUGP (("TO_COMPLETE: <something> to %s at position %d in %s.\n",
newlink, l->pos, file));
xfree (quoted_newlink);
+ ++to_url_count;
}
}
/* Output the rest of the file. */
fwrite (p, 1, fm->length - (p - fm->content), fp);
fclose (fp);
read_file_free (fm);
- logputs (LOG_VERBOSE, _("done.\n"));
+ logprintf (LOG_VERBOSE,
+ _("%d-%d\n"), to_file_count, to_url_count);
}
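
Editorial note, not part of the patch: with the two counters added above, the
verbose "Converting FILE... " line now ends with a pair of numbers rather than
"done.".  For example, "Converting index.html... 12-3" would mean twelve links
rewritten to relative local names and three rewritten to complete URLs.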
/* Construct and return a malloced copy of the relative link from two
return res;
}
\f
-/* Add URL to the head of the list L. */
-urlpos *
-add_url (urlpos *l, const char *url, const char *file)
-{
- urlpos *t;
-
- t = (urlpos *)xmalloc (sizeof (urlpos));
- memset (t, 0, sizeof (*t));
- t->url = xstrdup (url);
- t->local_name = xstrdup (file);
- t->next = l;
- return t;
-}
-
static void
write_backup_file (const char *file, downloaded_file_t downloaded_file_return)
{
-- Dan Harkless <wget@harkless.org>
This [adding a field to the urlpos structure] didn't work
- because convert_file() is called twice: once after all its
- sublinks have been retrieved in recursive_retrieve(), and
- once at the end of the day in convert_all_links(). The
- original linked list collected in recursive_retrieve() is
- lost after the first invocation of convert_links(), and
- convert_all_links() makes a new one (it calls get_urls_html()
- for each file it covers.) That's why your first approach didn't
- work. The way to make it work is perhaps to make this flag a
- field in the `urls_html' list.
+ because convert_file() is called from convert_all_links at
+ the end of the retrieval with a freshly built new urlpos
+ list.
-- Hrvoje Niksic <hniksic@arsdigita.com>
*/
converted_file_ptr = xmalloc(sizeof(*converted_file_ptr));
return 0;
}
-typedef struct _downloaded_file_list {
- char* file;
- downloaded_file_t download_type;
- struct _downloaded_file_list* next;
-} downloaded_file_list;
+/* We're storing "modes" of type downloaded_file_t in the hash table.
+ However, our hash tables only accept pointers for keys and values.
+ So when we need a pointer, we use the address of a
+ downloaded_file_t variable of static storage. */
+
+static downloaded_file_t *
+downloaded_mode_to_ptr (downloaded_file_t mode)
+{
+ static downloaded_file_t
+ v1 = FILE_NOT_ALREADY_DOWNLOADED,
+ v2 = FILE_DOWNLOADED_NORMALLY,
+ v3 = FILE_DOWNLOADED_AND_HTML_EXTENSION_ADDED,
+ v4 = CHECK_FOR_FILE;
+
+ switch (mode)
+ {
+ case FILE_NOT_ALREADY_DOWNLOADED:
+ return &v1;
+ case FILE_DOWNLOADED_NORMALLY:
+ return &v2;
+ case FILE_DOWNLOADED_AND_HTML_EXTENSION_ADDED:
+ return &v3;
+ case CHECK_FOR_FILE:
+ return &v4;
+ }
+ return NULL;
+}
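
Editorial aside, not part of the patch: the idiom above works because each mode
maps to the address of a distinct object with static storage duration, so the
pointer placed in the hash table stays valid after the function returns.  A
minimal stand-alone illustration of the same idiom:

#include <stdio.h>

enum color { RED, GREEN };

static enum color *
color_to_ptr (enum color c)
{
  /* Addresses of statics, never of locals, so they outlive the call.  */
  static enum color red = RED, green = GREEN;
  return c == RED ? &red : &green;
}

int
main (void)
{
  void *stored = color_to_ptr (GREEN);       /* what a table would hold */
  printf ("%d\n", *(enum color *) stored);   /* prints 1, still valid */
  return 0;
}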
+
+/* This should really be merged with dl_file_url_map and
+ downloaded_html_files in recur.c. This was originally a list, but
+   I changed it to a hash table because it was actually taking a lot of
+ time to find things in it. */
-static downloaded_file_list *downloaded_files;
+static struct hash_table *downloaded_files_hash;
/* Remembers which files have been downloaded. In the standard case, should be
called with mode == FILE_DOWNLOADED_NORMALLY for each file we actually
it, call with mode == CHECK_FOR_FILE. Please be sure to call this function
with local filenames, not remote URLs. */
downloaded_file_t
-downloaded_file (downloaded_file_t mode, const char* file)
+downloaded_file (downloaded_file_t mode, const char *file)
{
- boolean found_file = FALSE;
- downloaded_file_list* rover = downloaded_files;
+ downloaded_file_t *ptr;
- while (rover != NULL)
- if (strcmp(rover->file, file) == 0)
- {
- found_file = TRUE;
- break;
- }
- else
- rover = rover->next;
-
- if (found_file)
- return rover->download_type; /* file had already been downloaded */
- else
+ if (mode == CHECK_FOR_FILE)
{
- if (mode != CHECK_FOR_FILE)
- {
- rover = xmalloc(sizeof(*rover));
- rover->file = xstrdup(file); /* use xstrdup() so die on out-of-mem. */
- rover->download_type = mode;
- rover->next = downloaded_files;
- downloaded_files = rover;
- }
-
- return FILE_NOT_ALREADY_DOWNLOADED;
+ if (!downloaded_files_hash)
+ return FILE_NOT_ALREADY_DOWNLOADED;
+ ptr = hash_table_get (downloaded_files_hash, file);
+ if (!ptr)
+ return FILE_NOT_ALREADY_DOWNLOADED;
+ return *ptr;
}
+
+ if (!downloaded_files_hash)
+ downloaded_files_hash = make_string_hash_table (0);
+
+ ptr = hash_table_get (downloaded_files_hash, file);
+ if (ptr)
+ return *ptr;
+
+ ptr = downloaded_mode_to_ptr (mode);
+  hash_table_put (downloaded_files_hash, xstrdup (file), ptr);
+
+ return FILE_NOT_ALREADY_DOWNLOADED;
+}
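
Editorial illustration, not part of the patch, derived from the control flow
above: the first call for a local file records its mode and reports the file as
new; later CHECK_FOR_FILE queries return the recorded mode.

  downloaded_file_t r;
  r = downloaded_file (FILE_DOWNLOADED_NORMALLY, "index.html");
  /* r == FILE_NOT_ALREADY_DOWNLOADED; the file and its mode are recorded.  */
  r = downloaded_file (CHECK_FOR_FILE, "index.html");
  /* r == FILE_DOWNLOADED_NORMALLY; the recorded mode is returned.  */
  r = downloaded_file (CHECK_FOR_FILE, "other.html");
  /* r == FILE_NOT_ALREADY_DOWNLOADED; nothing is recorded for that name.  */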
+
+static int
+df_free_mapper (void *key, void *value, void *ignored)
+{
+ xfree (key);
+ return 0;
}
void
downloaded_files_free (void)
{
- downloaded_file_list* rover = downloaded_files;
- while (rover)
+ if (downloaded_files_hash)
{
- downloaded_file_list *next = rover->next;
- xfree (rover->file);
- xfree (rover);
- rover = next;
+ hash_table_map (downloaded_files_hash, df_free_mapper, NULL);
+ hash_table_destroy (downloaded_files_hash);
+ downloaded_files_hash = NULL;
}
}
/* A structure that defines the whereabouts of a URL, i.e. its
position in an HTML document, etc. */
-typedef struct _urlpos
-{
- char *url; /* linked URL, after it has been
- merged with the base */
- char *local_name; /* Local file to which it was saved */
+struct urlpos {
+ struct url *url; /* the URL of the link, after it has
+ been merged with the base */
+ char *local_name; /* local file to which it was saved
+ (used by convert_links) */
/* Information about the original link: */
int link_relative_p; /* was the link relative? */
/* URL's position in the buffer. */
int pos, size;
- struct _urlpos *next; /* Next struct in list */
-} urlpos;
+ struct urlpos *next; /* next list element */
+};
/* downloaded_file() takes a parameter of this type and returns this type. */
typedef enum
char *url_string PARAMS ((const struct url *, int));
-urlpos *get_urls_file PARAMS ((const char *));
-urlpos *get_urls_html PARAMS ((const char *, const char *, int, int *));
-void free_urlpos PARAMS ((urlpos *));
+struct urlpos *get_urls_file PARAMS ((const char *));
+struct urlpos *get_urls_html PARAMS ((const char *, const char *, int, int *));
+void free_urlpos PARAMS ((struct urlpos *));
char *uri_merge PARAMS ((const char *, const char *));
int mkalldirs PARAMS ((const char *));
char *url_filename PARAMS ((const struct url *));
-char *getproxy PARAMS ((uerr_t));
+char *getproxy PARAMS ((enum url_scheme));
int no_proxy_match PARAMS ((const char *, const char **));
-void convert_links PARAMS ((const char *, urlpos *));
-urlpos *add_url PARAMS ((urlpos *, const char *, const char *));
+void convert_links PARAMS ((const char *, struct urlpos *));
downloaded_file_t downloaded_file PARAMS ((downloaded_file_t, const char *));
#endif /* DEBUG_MALLOC */
\f
+/* Utility function: like xstrdup(), but also lowercases S. */
+
+char *
+xstrdup_lower (const char *s)
+{
+ char *copy = xstrdup (s);
+ char *p = copy;
+ for (; *p; p++)
+ *p = TOLOWER (*p);
+ return copy;
+}
+
/* Copy the string formed by two pointers (one on the beginning, other
on the char after the last char) to a new, malloc-ed location.
0-terminate it. */
}
#endif /* not WINDOWS */
\f
+#if 0
+/* debug */
char *
ps (char *orig)
{
path_simplify (r);
return r;
}
+#endif
/* Canonicalize PATH, and return a new path. The new path differs from PATH
in that:
Change the original string instead of strdup-ing.
React correctly when beginning with `./' and `../'.
Don't zip out trailing slashes. */
-void
+int
path_simplify (char *path)
{
- register int i, start, ddot;
+ register int i, start;
+ int changes = 0;
char stub_char;
if (!*path)
- return;
+ return 0;
- /*stub_char = (*path == '/') ? '/' : '.';*/
stub_char = '/';
- /* Addition: Remove all `./'-s preceding the string. If `../'-s
- precede, put `/' in front and remove them too. */
- i = 0;
- ddot = 0;
- while (1)
- {
- if (path[i] == '.' && path[i + 1] == '/')
- i += 2;
- else if (path[i] == '.' && path[i + 1] == '.' && path[i + 2] == '/')
- {
- i += 3;
- ddot = 1;
- }
- else
- break;
- }
- if (i)
- strcpy (path, path + i - ddot);
+ if (path[0] == '/')
+ /* Preserve initial '/'. */
+ ++path;
- /* Replace single `.' or `..' with `/'. */
+  /* Nix out leading `.' or `..' with '\0'.  */
if ((path[0] == '.' && path[1] == '\0')
|| (path[0] == '.' && path[1] == '.' && path[2] == '\0'))
{
- path[0] = stub_char;
- path[1] = '\0';
- return;
+ path[0] = '\0';
+ changes = 1;
+ return changes;
}
+
/* Walk along PATH looking for things to compact. */
i = 0;
while (1)
{
strcpy (path + start + 1, path + i);
i = start + 1;
+ changes = 1;
}
/* Check for `../', `./' or trailing `.' by itself. */
if (!path[i + 1])
{
path[--i] = '\0';
+ changes = 1;
break;
}
{
strcpy (path + i, path + i + 1);
i = (start < 0) ? 0 : start;
+ changes = 1;
continue;
}
(path[i + 2] == '/' || !path[i + 2]))
{
while (--start > -1 && path[start] != '/');
- strcpy (path + start + 1, path + i + 2);
+ strcpy (path + start + 1, path + i + 2 + (start == -1 && path[i + 2]));
i = (start < 0) ? 0 : start;
+ changes = 1;
continue;
}
} /* path == '.' */
} /* while */
+
+ /* Addition: Remove all `./'-s and `../'-s preceding the string. */
+ i = 0;
+ while (1)
+ {
+ if (path[i] == '.' && path[i + 1] == '/')
+ i += 2;
+ else if (path[i] == '.' && path[i + 1] == '.' && path[i + 2] == '/')
+ i += 3;
+ else
+ break;
+ }
+ if (i)
+ {
+      strcpy (path, path + i);
+ changes = 1;
+ }
+
+ return changes;
}
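
Editorial sketch, not part of the patch, assuming it is compiled inside the
Wget source tree: path_simplify () still edits PATH in place, but its return
value now tells callers such as url_parse () whether anything changed.  A
hypothetical demo helper:

#include <stdio.h>
#include <string.h>
#include "wget.h"
#include "utils.h"

static void
demo_simplify (const char *p)
{
  char buf[256];
  int changed;

  strcpy (buf, p);
  changed = path_simplify (buf);
  printf ("%-12s -> %-12s (changed: %d)\n", p, buf, changed);
}

/* demo_simplify ("a/b/../c") is expected to print "a/c" with changed == 1,
   while demo_simplify ("a/b/c") should leave the path alone and report 0.  */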
\f
/* "Touch" FILE, i.e. make its atime and mtime equal to the time
void print_malloc_debug_stats ();
#endif
+char *xstrdup_lower PARAMS ((const char *));
char *strdupdelim PARAMS ((const char *, const char *));
char **sepstring PARAMS ((const char *));
int frontcmp PARAMS ((const char *, const char *));
char *pwd_cuserid PARAMS ((char *));
void fork_to_background PARAMS ((void));
-void path_simplify PARAMS ((char *));
+int path_simplify PARAMS ((char *));
void touch PARAMS ((const char *, time_t));
int remove_link PARAMS ((const char *));
char *html_quote_string PARAMS ((const char *));
+int determine_screen_width PARAMS ((void));
+
#endif /* UTILS_H */
# define NDEBUG /* To kill off assertions */
#endif /* not DEBUG */
+/* Define this if you want primitive but extensive malloc debugging.
+ It will make Wget extremely slow, so only do it in development
+ builds. */
+#undef DEBUG_MALLOC
+
#ifndef PARAMS
# if PROTOTYPES
# define PARAMS(args) args
3) Finally, the debug messages are meant to be a clue for me to
debug problems with Wget. If I get them in a language I don't
- understand, debugging will become a new challenge of its own! :-) */
+ understand, debugging will become a new challenge of its own! */
/* Include these, so random files need not include them. */