From: hniksic Date: Sun, 25 Nov 2001 03:10:34 +0000 (-0800) Subject: [svn] Implemented breadth-first retrieval. X-Git-Tag: v1.13~2035 X-Git-Url: http://sjero.net/git/?p=wget;a=commitdiff_plain;h=222e9465b7cef12a75e5ce0d6cc7df60c934566f [svn] Implemented breadth-first retrieval. Published in . --- diff --git a/ChangeLog b/ChangeLog index 38a6ddaf..87c2c298 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,9 @@ +2001-11-25 Hrvoje Niksic + + * TODO: Ditto. + + * NEWS: Updated with the latest stuff. + 2001-11-23 Hrvoje Niksic * po/hr.po: A major overhaul. diff --git a/NEWS b/NEWS index fb9a769e..8a3b0b70 100644 --- a/NEWS +++ b/NEWS @@ -7,9 +7,19 @@ Please send GNU Wget bug reports to . * Changes in Wget 1.8. +** "Recursive retrieval" now uses a breadth-first algorithm. +Recursive downloads are faster and consume *significantly* less memory +than before. + ** A new progress indicator is now available. Try it with --progress=bar or using `progress = bar' in `.wgetrc'. +** Host directories now contain port information if the URL is at a +non-standard port. + +** Wget now supports the robots.txt directives specified in +. + ** URL parser has been fixed, especially the infamous overzealous quoting bug. Wget no longer dequotes reserved characters, e.g. `%3F' is no longer translated to `?', nor `%2B' to `+'. Unsafe characters diff --git a/TODO b/TODO index fd5e00ec..9632add7 100644 --- a/TODO +++ b/TODO @@ -20,15 +20,6 @@ changes. file, though forcibly disconnecting from the server at the desired endpoint might be workable). -* RFC 1738 says that if logging on to an FTP server puts you in a directory - other than '/', the way to specify a file relative to '/' in a URL (let's use - "/bin/ls" in this example) is "ftp://host/%2Fbin/ls". Wget needs to support - this (and ideally not consider "ftp://host//bin/ls" to be equivalent, as that - would equate to the command "CWD " rather than "CWD /"). To accomodate people - used to broken FTP clients like Internet Explorer and Netscape, if - "ftp://host/bin/ls" doesn't exist, Wget should try again (perhaps under - control of an option), acting as if the user had typed "ftp://host/%2Fbin/ls". - * If multiple FTP URLs are specified that are on the same host, Wget should re-use the connection rather than opening a new one for each file. @@ -37,16 +28,9 @@ changes. * Limit the number of successive redirection to max. 20 or so. -* If -c used on a file that's already completely downloaded, don't re-download - it (unless normal --timestamping processing would cause you to do so). - * If -c used with -N, check to make sure a file hasn't changed on the server before "continuing" to download it (preventing a bogus hybrid file). -* Take a look at - - and support the new directives. - * Generalize --html-extension to something like --mime-extensions and have it look at mime.types/mimecap file for preferred extension. Non-HTML files with filenames changed this way would be re-downloaded each time despite -N unless @@ -87,9 +71,6 @@ changes. turning it off. Get rid of `--foo=no' stuff. Short options would be handled as `-x' vs. `-nx'. -* Implement "thermometer" display (not all that hard; use an - alternative show_progress() if the output goes to a terminal.) - * Add option to only list wildcard matches without doing the download. * Add case-insensitivity as an option. @@ -102,19 +83,13 @@ changes. * Allow time-stamping by arbitrary date. -* Fix Unix directory parser to allow for spaces in file names. 
- * Allow size limit to files (perhaps with an option to download oversize files up through the limit or not at all, to get more functionality than [u]limit. -* Implement breadth-first retrieval. - * Download to .in* when mirroring. * Add an option to delete or move no-longer-existent files when mirroring. -* Implement a switch to avoid downloading multiple files (e.g. x and x.gz). - * Implement uploading (--upload URL?) in FTP and HTTP. * Rewrite FTP code to allow for easy addition of new commands. It @@ -129,13 +104,10 @@ changes. * Implement a concept of "packages" a la mirror. -* Implement correct RFC1808 URL parsing. - -* Implement more HTTP/1.1 bells and whistles (ETag, Content-MD5 etc.) - -* Add a "rollback" option to have --continue throw away a configurable number of - bytes at the end of a file before resuming download. Apparently, some stupid - proxies insert a "transfer interrupted" string we need to get rid of. +* Add a "rollback" option to have continued retrieval throw away a + configurable number of bytes at the end of a file before resuming + download. Apparently, some stupid proxies insert a "transfer + interrupted" string we need to get rid of. * When using --accept and --reject, you can end up with empty directories. Have Wget any such at the end. diff --git a/src/ChangeLog b/src/ChangeLog index 2c58a116..68e13dfe 100644 --- a/src/ChangeLog +++ b/src/ChangeLog @@ -1,3 +1,68 @@ +2001-11-25 Hrvoje Niksic + + * url.c (reencode_string): Use unsigned char, not char -- + otherwise the hex digits come out wrong for 8-bit chars such as + nbsp. + (lowercase_str): New function. + (url_parse): Canonicalize u->url if needed. + (get_urls_file): Parse each URL, and return only the valid ones. + (free_urlpos): Call url_free. + (mkstruct): Add :port if the port is non-standard. + (mkstruct): Append the query string to the file name, if any. + (urlpath_length): Use strpbrk_or_eos. + (uri_merge_1): Handle the cases where LINK is an empty string, + where LINK consists only of query, and where LINK consists only of + fragment. + (convert_links): Count and report both kinds of conversion. + (downloaded_file): Use a hash table, not a list. + (downloaded_files_free): Free the hash table. + + * retr.c (retrieve_from_file): Ditto. + + * main.c (main): Call either retrieve_url or retrieve_tree + for each URL, not both. + + * retr.c (register_all_redirections): New function. + (register_redirections_mapper): Ditto. + (retrieve_url): Register the redirections. + (retrieve_url): Make the string "Error parsing proxy ..." + translatable. + + * res.c (add_path): Strip leading slash from robots.txt paths so + that the path representations are "compatible". + (free_specs): Free each individual path, too. + (res_cleanup): New function. + (cleanup_hash_table_mapper): Ditto. + + * recur.c (url_queue_new): New function. + (url_queue_delete): Ditto. + (url_enqueue): Ditto. + (url_dequeue): Ditto. + (retrieve_tree): New function, replacement for recursive_retrieve. + (descend_url_p): New function. + (register_redirection): New function. + + * progress.c (create_image): Cosmetic changes. + + * init.c (cleanup): Do all those complex cleanups only if + DEBUG_MALLOC is defined. + + * main.c: Removed --simple-check and the corresponding + simple_host_check in init.c. + + * html-url.c (handle_link): Parse the URL here, and propagate the + parsed URL to the caller, who would otherwise have to parse it + again. + + * host.c (xstrdup_lower): Moved to utils.c. + (realhost): Removed. + (same_host): Ditto. 
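The recur.c entries above are the core of this change: the old recursion is replaced by a FIFO queue of pending URLs (url_queue_new / url_enqueue / url_dequeue) drained by retrieve_tree, with a blacklist so nothing is enqueued twice. As a rough, self-contained sketch of that traversal pattern -- simplified stand-ins, not the committed Wget code; extract_links() is a hypothetical placeholder for the real HTML parsing, and a plain array stands in for the blacklist hash table:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One pending URL and its distance from the start URL. */
struct qnode {
  char *url;
  int depth;
  struct qnode *next;
};

/* FIFO queue: enqueue at the tail, dequeue from the head, so URLs are
   downloaded in the order they were discovered (breadth first). */
struct queue { struct qnode *head, *tail; };

static void
enqueue (struct queue *q, const char *url, int depth)
{
  struct qnode *n = malloc (sizeof *n);
  n->url = strdup (url);
  n->depth = depth;
  n->next = NULL;
  if (q->tail)
    q->tail->next = n;
  else
    q->head = n;
  q->tail = n;
}

/* Return 1 and hand ownership of *URL to the caller, or 0 if empty. */
static int
dequeue (struct queue *q, char **url, int *depth)
{
  struct qnode *n = q->head;
  if (!n)
    return 0;
  q->head = n->next;
  if (!q->head)
    q->tail = NULL;
  *url = n->url;
  *depth = n->depth;
  free (n);
  return 1;
}

/* Hypothetical placeholder: the real code parses the downloaded HTML. */
static const char **
extract_links (const char *url)
{
  static const char *none[] = { NULL };
  (void) url;
  return none;
}

int
main (void)
{
  struct queue q = { NULL, NULL };
  const char *seen[1024];          /* stand-in for the blacklist hash */
  int nseen = 0, max_depth = 5, depth, i, j;
  char *url;

  enqueue (&q, "http://www.example.com/", 0);
  seen[nseen++] = "http://www.example.com/";

  while (dequeue (&q, &url, &depth))
    {
      printf ("downloading %s at depth %d\n", url, depth);
      if (depth < max_depth)
        {
          const char **links = extract_links (url);
          for (i = 0; links[i]; i++)
            {
              for (j = 0; j < nseen; j++)
                if (strcmp (seen[j], links[i]) == 0)
                  break;
              if (j == nseen && nseen < 1024)
                {
                  seen[nseen++] = links[i];
                  enqueue (&q, links[i], depth + 1);
                }
            }
        }
      free (url);
    }
  return 0;
}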
+ +2001-11-24 Hrvoje Niksic + + * utils.c (path_simplify): Preserver the (non-)existence of + leading slash. Return non-zero if changes were made. + 2001-11-24 Hrvoje Niksic * progress.c (bar_update): Don't modify bp->total_length if it is diff --git a/src/Makefile.in b/src/Makefile.in index 942e0e1e..4d60cc1f 100644 --- a/src/Makefile.in +++ b/src/Makefile.in @@ -162,8 +162,10 @@ main$o: wget.h utils.h init.h retr.h recur.h host.h cookies.h gnu-md5$o: wget.h gnu-md5.h mswindows$o: wget.h url.h netrc$o: wget.h utils.h netrc.h init.h +progress$o: wget.h progress.h utils.h retr.h rbuf$o: wget.h rbuf.h connect.h recur$o: wget.h url.h recur.h utils.h retr.h ftp.h fnmatch.h host.h hash.h +res$o: wget.h utils.h hash.h url.h retr.h res.h retr$o: wget.h utils.h retr.h url.h recur.h ftp.h host.h connect.h hash.h snprintf$o: safe-ctype$o: safe-ctype.h diff --git a/src/host.c b/src/host.c index 82725d8b..4fb1197b 100644 --- a/src/host.c +++ b/src/host.c @@ -60,8 +60,14 @@ extern int errno; #endif /* Mapping between all known hosts to their addresses (n.n.n.n). */ + +/* #### We should map to *lists* of IP addresses. */ + struct hash_table *host_name_address_map; +/* The following two tables are obsolete, since we no longer do host + canonicalization. */ + /* Mapping between all known addresses (n.n.n.n) to their hosts. This is the inverse of host_name_address_map. These two tables share the strdup'ed strings. */ @@ -70,18 +76,6 @@ struct hash_table *host_address_name_map; /* Mapping between auxilliary (slave) and master host names. */ struct hash_table *host_slave_master_map; -/* Utility function: like xstrdup(), but also lowercases S. */ - -static char * -xstrdup_lower (const char *s) -{ - char *copy = xstrdup (s); - char *p = copy; - for (; *p; p++) - *p = TOLOWER (*p); - return copy; -} - /* The same as gethostbyname, but supports internet addresses of the form `N.N.N.N'. On some systems gethostbyname() knows how to do this automatically. */ @@ -216,114 +210,6 @@ store_hostaddress (unsigned char *where, const char *hostname) return 1; } -/* Determine the "real" name of HOST, as perceived by Wget. If HOST - is referenced by more than one name, "real" name is considered to - be the first one encountered in the past. */ -char * -realhost (const char *host) -{ - struct in_addr in; - struct hostent *hptr; - char *master_name; - - DEBUGP (("Checking for %s in host_name_address_map.\n", host)); - if (hash_table_contains (host_name_address_map, host)) - { - DEBUGP (("Found; %s was already used, by that name.\n", host)); - return xstrdup_lower (host); - } - - DEBUGP (("Checking for %s in host_slave_master_map.\n", host)); - master_name = hash_table_get (host_slave_master_map, host); - if (master_name) - { - has_master: - DEBUGP (("Found; %s was already used, by the name %s.\n", - host, master_name)); - return xstrdup (master_name); - } - - DEBUGP (("First time I hear about %s by that name; looking it up.\n", - host)); - hptr = ngethostbyname (host); - if (hptr) - { - char *inet_s; - /* Originally, we copied to in.s_addr, but it appears to be - missing on some systems. */ - memcpy (&in, *hptr->h_addr_list, sizeof (in)); - inet_s = inet_ntoa (in); - - add_host_to_cache (host, inet_s); - - /* add_host_to_cache() can establish a slave-master mapping. 
*/ - DEBUGP (("Checking again for %s in host_slave_master_map.\n", host)); - master_name = hash_table_get (host_slave_master_map, host); - if (master_name) - goto has_master; - } - - return xstrdup_lower (host); -} - -/* Compare two hostnames (out of URL-s if the arguments are URL-s), - taking care of aliases. It uses realhost() to determine a unique - hostname for each of two hosts. If simple_check is non-zero, only - strcmp() is used for comparison. */ -int -same_host (const char *u1, const char *u2) -{ - const char *s; - char *p1, *p2; - char *real1, *real2; - - /* Skip protocol, if present. */ - u1 += url_skip_scheme (u1); - u2 += url_skip_scheme (u2); - - /* Skip username ans password, if present. */ - u1 += url_skip_uname (u1); - u2 += url_skip_uname (u2); - - for (s = u1; *u1 && *u1 != '/' && *u1 != ':'; u1++); - p1 = strdupdelim (s, u1); - for (s = u2; *u2 && *u2 != '/' && *u2 != ':'; u2++); - p2 = strdupdelim (s, u2); - DEBUGP (("Comparing hosts %s and %s...\n", p1, p2)); - if (strcasecmp (p1, p2) == 0) - { - xfree (p1); - xfree (p2); - DEBUGP (("They are quite alike.\n")); - return 1; - } - else if (opt.simple_check) - { - xfree (p1); - xfree (p2); - DEBUGP (("Since checking is simple, I'd say they are not the same.\n")); - return 0; - } - real1 = realhost (p1); - real2 = realhost (p2); - xfree (p1); - xfree (p2); - if (strcasecmp (real1, real2) == 0) - { - DEBUGP (("They are alike, after realhost()->%s.\n", real1)); - xfree (real1); - xfree (real2); - return 1; - } - else - { - DEBUGP (("They are not the same (%s, %s).\n", real1, real2)); - xfree (real1); - xfree (real2); - return 0; - } -} - /* Determine whether a URL is acceptable to be followed, according to a list of domains to accept. */ int @@ -383,7 +269,7 @@ herrmsg (int error) } void -clean_hosts (void) +host_cleanup (void) { /* host_name_address_map and host_address_name_map share the strings. Because of that, calling free_keys_and_values once diff --git a/src/host.h b/src/host.h index 090ef168..95566c58 100644 --- a/src/host.h +++ b/src/host.h @@ -27,15 +27,11 @@ struct url; struct hostent *ngethostbyname PARAMS ((const char *)); int store_hostaddress PARAMS ((unsigned char *, const char *)); -void clean_hosts PARAMS ((void)); +void host_cleanup PARAMS ((void)); -char *realhost PARAMS ((const char *)); -int same_host PARAMS ((const char *, const char *)); int accept_domain PARAMS ((struct url *)); int sufmatch PARAMS ((const char **, const char *)); -char *ftp_getaddress PARAMS ((void)); - char *herrmsg PARAMS ((int)); #endif /* HOST_H */ diff --git a/src/html-url.c b/src/html-url.c index 6ab54a9f..91877820 100644 --- a/src/html-url.c +++ b/src/html-url.c @@ -284,7 +284,7 @@ struct collect_urls_closure { char *text; /* HTML text. */ char *base; /* Base URI of the document, possibly changed through . */ - urlpos *head, *tail; /* List of URLs */ + struct urlpos *head, *tail; /* List of URLs */ const char *parent_base; /* Base of the current document. */ const char *document_file; /* File name of this document. */ int dash_p_leaf_HTML; /* Whether -p is specified, and this @@ -301,59 +301,67 @@ static void handle_link (struct collect_urls_closure *closure, const char *link_uri, struct taginfo *tag, int attrid) { - int no_scheme = !url_has_scheme (link_uri); - urlpos *newel; - + int link_has_scheme = url_has_scheme (link_uri); + struct urlpos *newel; const char *base = closure->base ? 
closure->base : closure->parent_base; - char *complete_uri; - - char *fragment = strrchr (link_uri, '#'); - - if (fragment) - { - /* Nullify the fragment identifier, i.e. everything after the - last occurrence of `#', inclusive. This copying is - relatively inefficient, but it doesn't matter because - fragment identifiers don't come up all that often. */ - int hashlen = fragment - link_uri; - char *p = alloca (hashlen + 1); - memcpy (p, link_uri, hashlen); - p[hashlen] = '\0'; - link_uri = p; - } + struct url *url; if (!base) { - if (no_scheme) + DEBUGP (("%s: no base, merge will use \"%s\".\n", + closure->document_file, link_uri)); + + if (!link_has_scheme) { /* We have no base, and the link does not have a host attached to it. Nothing we can do. */ /* #### Should we print a warning here? Wget 1.5.x used to. */ return; } - else - complete_uri = xstrdup (link_uri); + + url = url_parse (link_uri, NULL); + if (!url) + { + DEBUGP (("%s: link \"%s\" doesn't parse.\n", + closure->document_file, link_uri)); + return; + } } else - complete_uri = uri_merge (base, link_uri); + { + /* Merge BASE with LINK_URI, but also make sure the result is + canonicalized, i.e. that "../" have been resolved. + (parse_url will do that for us.) */ + + char *complete_uri = uri_merge (base, link_uri); + + DEBUGP (("%s: merge(\"%s\", \"%s\") -> %s\n", + closure->document_file, base, link_uri, complete_uri)); - DEBUGP (("%s: merge(\"%s\", \"%s\") -> %s\n", - closure->document_file, base ? base : "(null)", - link_uri, complete_uri)); + url = url_parse (complete_uri, NULL); + if (!url) + { + DEBUGP (("%s: merged link \"%s\" doesn't parse.\n", + closure->document_file, complete_uri)); + xfree (complete_uri); + return; + } + xfree (complete_uri); + } - newel = (urlpos *)xmalloc (sizeof (urlpos)); + newel = (struct urlpos *)xmalloc (sizeof (struct urlpos)); memset (newel, 0, sizeof (*newel)); newel->next = NULL; - newel->url = complete_uri; + newel->url = url; newel->pos = tag->attrs[attrid].value_raw_beginning - closure->text; newel->size = tag->attrs[attrid].value_raw_size; /* A URL is relative if the host is not named, and the name does not start with `/'. */ - if (no_scheme && *link_uri != '/') + if (!link_has_scheme && *link_uri != '/') newel->link_relative_p = 1; - else if (!no_scheme) + else if (link_has_scheme) newel->link_complete_p = 1; if (closure->tail) @@ -542,7 +550,7 @@ collect_tags_mapper (struct taginfo *tag, void *arg) If dash_p_leaf_HTML is non-zero, only the elements needed to render FILE ("non-external" links) will be returned. 
*/ -urlpos * +struct urlpos * get_urls_html (const char *file, const char *this_url, int dash_p_leaf_HTML, int *meta_disallow_follow) { diff --git a/src/http.c b/src/http.c index 02331fa9..8d79de51 100644 --- a/src/http.c +++ b/src/http.c @@ -1452,8 +1452,8 @@ File `%s' already there, will not retrieve.\n"), *hstat.local_file); if (((suf = suffix (*hstat.local_file)) != NULL) && (!strcmp (suf, "html") || !strcmp (suf, "htm"))) *dt |= TEXTHTML; - xfree (suf); + FREE_MAYBE (suf); FREE_MAYBE (dummy); return RETROK; } diff --git a/src/init.c b/src/init.c index 09b86ea9..83cf070d 100644 --- a/src/init.c +++ b/src/init.c @@ -171,7 +171,6 @@ static struct { { "savecookies", &opt.cookies_output, cmd_file }, { "saveheaders", &opt.save_headers, cmd_boolean }, { "serverresponse", &opt.server_response, cmd_boolean }, - { "simplehostcheck", &opt.simple_check, cmd_boolean }, { "spanhosts", &opt.spanhost, cmd_boolean }, { "spider", &opt.spider, cmd_boolean }, #ifdef HAVE_SSL @@ -1009,6 +1008,7 @@ check_user_specified_header (const char *s) } void cleanup_html_url PARAMS ((void)); +void res_cleanup PARAMS ((void)); void downloaded_files_free PARAMS ((void)); @@ -1016,13 +1016,27 @@ void downloaded_files_free PARAMS ((void)); void cleanup (void) { - extern acc_t *netrc_list; + /* Free external resources, close files, etc. */ - recursive_cleanup (); - clean_hosts (); - free_netrc (netrc_list); if (opt.dfp) fclose (opt.dfp); + + /* We're exiting anyway so there's no real need to call free() + hundreds of times. Skipping the frees will make Wget exit + faster. + + However, when detecting leaks, it's crucial to free() everything + because then you can find the real leaks, i.e. the allocated + memory which grows with the size of the program. */ + +#ifdef DEBUG_MALLOC + recursive_cleanup (); + res_cleanup (); + host_cleanup (); + { + extern acc_t *netrc_list; + free_netrc (netrc_list); + } cleanup_html_url (); downloaded_files_free (); cookies_cleanup (); @@ -1037,6 +1051,7 @@ cleanup (void) free_vec (opt.domains); free_vec (opt.follow_tags); free_vec (opt.ignore_tags); + FREE_MAYBE (opt.progress_type); xfree (opt.ftp_acc); FREE_MAYBE (opt.ftp_pass); FREE_MAYBE (opt.ftp_proxy); @@ -1055,4 +1070,5 @@ cleanup (void) FREE_MAYBE (opt.bind_address); FREE_MAYBE (opt.cookies_input); FREE_MAYBE (opt.cookies_output); +#endif } diff --git a/src/main.c b/src/main.c index 54d3f318..79e17af0 100644 --- a/src/main.c +++ b/src/main.c @@ -402,9 +402,6 @@ hpVqvdkKsxmNWrHSLcFbEY:G:g:T:U:O:l:n:i:o:a:t:D:A:R:P:B:e:Q:X:I:w:C:", case 149: setval ("removelisting", "off"); break; - case 150: - setval ("simplehostcheck", "on"); - break; case 155: setval ("bindaddress", optarg); break; @@ -604,7 +601,7 @@ GNU General Public License for more details.\n")); break; case 'n': { - /* #### The n? options are utter crock! */ + /* #### What we really want here is --no-foo. */ char *p; for (p = optarg; *p; p++) @@ -613,9 +610,6 @@ GNU General Public License for more details.\n")); case 'v': setval ("verbose", "off"); break; - case 'h': - setval ("simplehostcheck", "on"); - break; case 'H': setval ("addhostdir", "off"); break; @@ -806,17 +800,17 @@ Can't timestamp and not clobber old files at the same time.\n")); #endif /* HAVE_SIGNAL */ status = RETROK; /* initialize it, just-in-case */ - recursive_reset (); + /*recursive_reset ();*/ /* Retrieve the URLs from argument list. 
*/ for (t = url; *t; t++) { - char *filename, *redirected_URL; + char *filename = NULL, *redirected_URL = NULL; int dt; - status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt); - if (opt.recursive && status == RETROK && (dt & TEXTHTML)) - status = recursive_retrieve (filename, - redirected_URL ? redirected_URL : *t); + if (opt.recursive && url_scheme (*t) != SCHEME_FTP) + status = retrieve_tree (*t); + else + status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt); if (opt.delete_after && file_exists_p(filename)) { diff --git a/src/options.h b/src/options.h index 70746b15..4f75c8b9 100644 --- a/src/options.h +++ b/src/options.h @@ -36,9 +36,6 @@ struct options int relative_only; /* Follow only relative links. */ int no_parent; /* Restrict access to the parent directory. */ - int simple_check; /* Should we use simple checking - (strcmp) or do we create a host - hash and call gethostbyname? */ int reclevel; /* Maximum level of recursion */ int dirstruct; /* Do we build the directory structure as we go along? */ diff --git a/src/progress.c b/src/progress.c index 9c981bd8..e167b7bf 100644 --- a/src/progress.c +++ b/src/progress.c @@ -27,6 +27,9 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ # include #endif /* HAVE_STRING_H */ #include +#ifdef HAVE_UNISTD_H +# include +#endif #include "wget.h" #include "progress.h" @@ -470,14 +473,14 @@ create_image (struct bar_progress *bp, long dltime) Calculate its geometry: "xxx% " - percentage - 5 chars - "| ... | " - progress bar decorations - 3 chars + "| ... |" - progress bar decorations - 2 chars "1012.56 K/s " - dl rate - 12 chars "nnnn " - downloaded bytes - 11 chars "ETA: xx:xx:xx" - ETA - 13 chars "=====>..." - progress bar content - the rest */ - int progress_len = screen_width - (5 + 3 + 12 + 11 + 13); + int progress_len = screen_width - (5 + 2 + 12 + 11 + 13); if (progress_len < 7) progress_len = 0; @@ -530,7 +533,7 @@ create_image (struct bar_progress *bp, long dltime) } else { - strcpy (p, "----.-- K/s "); + strcpy (p, " --.-- K/s "); p += 12; } diff --git a/src/recur.c b/src/recur.c index 019f3926..2c261579 100644 --- a/src/recur.c +++ b/src/recur.c @@ -1,5 +1,5 @@ /* Handling of recursive HTTP retrieving. - Copyright (C) 1995, 1996, 1997, 2000 Free Software Foundation, Inc. + Copyright (C) 1995, 1996, 1997, 2000, 2001 Free Software Foundation, Inc. This file is part of GNU Wget. @@ -54,452 +54,480 @@ static struct hash_table *dl_file_url_map; static struct hash_table *dl_url_file_map; /* List of HTML files downloaded in this Wget run. Used for link - conversion after Wget is done. */ + conversion after Wget is done. This list should only be traversed + in order. If you need to check whether a file has been downloaded, + use a hash table, e.g. dl_file_url_map. */ static slist *downloaded_html_files; + +/* Functions for maintaining the URL queue. */ -/* List of undesirable-to-load URLs. */ -static struct hash_table *undesirable_urls; +struct queue_element { + const char *url; + const char *referer; + int depth; + struct queue_element *next; +}; -/* Current recursion depth. */ -static int depth; +struct url_queue { + struct queue_element *head; + struct queue_element *tail; + int count, maxcount; +}; -/* Base directory we're recursing from (used by no_parent). */ -static char *base_dir; +/* Create a URL queue. 
*/ -static int first_time = 1; +static struct url_queue * +url_queue_new (void) +{ + struct url_queue *queue = xmalloc (sizeof (*queue)); + memset (queue, '\0', sizeof (*queue)); + return queue; +} +/* Delete a URL queue. */ -/* Cleanup the data structures associated with recursive retrieving - (the variables above). */ -void -recursive_cleanup (void) +static void +url_queue_delete (struct url_queue *queue) { - if (undesirable_urls) - { - string_set_free (undesirable_urls); - undesirable_urls = NULL; - } - if (dl_file_url_map) - { - free_keys_and_values (dl_file_url_map); - hash_table_destroy (dl_file_url_map); - dl_file_url_map = NULL; - } - if (dl_url_file_map) - { - free_keys_and_values (dl_url_file_map); - hash_table_destroy (dl_url_file_map); - dl_url_file_map = NULL; - } - undesirable_urls = NULL; - slist_free (downloaded_html_files); - downloaded_html_files = NULL; - FREE_MAYBE (base_dir); - first_time = 1; + xfree (queue); } -/* Reset FIRST_TIME to 1, so that some action can be taken in - recursive_retrieve(). */ -void -recursive_reset (void) +/* Enqueue a URL in the queue. The queue is FIFO: the items will be + retrieved ("dequeued") from the queue in the order they were placed + into it. */ + +static void +url_enqueue (struct url_queue *queue, + const char *url, const char *referer, int depth) { - first_time = 1; + struct queue_element *qel = xmalloc (sizeof (*qel)); + qel->url = url; + qel->referer = referer; + qel->depth = depth; + qel->next = NULL; + + ++queue->count; + if (queue->count > queue->maxcount) + queue->maxcount = queue->count; + + DEBUGP (("Enqueuing %s at depth %d\n", url, depth)); + DEBUGP (("Queue count %d, maxcount %d.\n", queue->count, queue->maxcount)); + + if (queue->tail) + queue->tail->next = qel; + queue->tail = qel; + + if (!queue->head) + queue->head = queue->tail; } -/* The core of recursive retrieving. Endless recursion is avoided by - having all URLs stored to a linked list of URLs, which is checked - before loading any URL. That way no URL can get loaded twice. +/* Take a URL out of the queue. Return 1 if this operation succeeded, + or 0 if the queue is empty. */ + +static int +url_dequeue (struct url_queue *queue, + const char **url, const char **referer, int *depth) +{ + struct queue_element *qel = queue->head; + + if (!qel) + return 0; + + queue->head = queue->head->next; + if (!queue->head) + queue->tail = NULL; + + *url = qel->url; + *referer = qel->referer; + *depth = qel->depth; + + --queue->count; + + DEBUGP (("Dequeuing %s at depth %d\n", qel->url, qel->depth)); + DEBUGP (("Queue count %d, maxcount %d.\n", queue->count, queue->maxcount)); + + xfree (qel); + return 1; +} + +static int descend_url_p PARAMS ((const struct urlpos *, struct url *, int, + struct url *, struct hash_table *)); + +/* Retrieve a part of the web beginning with START_URL. This used to + be called "recursive retrieval", because the old function was + recursive and implemented depth-first search. retrieve_tree on the + other hand implements breadth-search traversal of the tree, which + results in much nicer ordering of downloads. + + The algorithm this function uses is simple: + + 1. put START_URL in the queue. + 2. while there are URLs in the queue: + + 3. get next URL from the queue. + 4. download it. + 5. if the URL is HTML and its depth does not exceed maximum depth, + get the list of URLs embedded therein. + 6. for each of those URLs do the following: + + 7. 
if the URL is not one of those downloaded before, and if it + satisfies the criteria specified by the various command-line + options, add it to the queue. */ - The function also supports specification of maximum recursion depth - and a number of other goodies. */ uerr_t -recursive_retrieve (const char *file, const char *this_url) +retrieve_tree (const char *start_url) { - char *constr, *filename, *newloc; - char *canon_this_url = NULL; - int dt, inl, dash_p_leaf_HTML = FALSE; - int meta_disallow_follow; - int this_url_ftp; /* See below the explanation */ - urlpos *url_list, *cur_url; - struct url *u; - - assert (this_url != NULL); - assert (file != NULL); - /* If quota was exceeded earlier, bail out. */ - if (downloaded_exceeds_quota ()) - return QUOTEXC; - /* Cache the current URL in the list. */ - if (first_time) - { - /* These three operations need to be done only once per Wget - run. They should probably be at a different location. */ - if (!undesirable_urls) - undesirable_urls = make_string_hash_table (0); - - hash_table_clear (undesirable_urls); - string_set_add (undesirable_urls, this_url); - /* Enter this_url to the hash table, in original and "enhanced" form. */ - u = url_parse (this_url, NULL); - if (u) - { - string_set_add (undesirable_urls, u->url); - if (opt.no_parent) - base_dir = xstrdup (u->dir); /* Set the base dir. */ - /* Set the canonical this_url to be sent as referer. This - problem exists only when running the first time. */ - canon_this_url = xstrdup (u->url); - } - else - { - DEBUGP (("Double yuck! The *base* URL is broken.\n")); - base_dir = NULL; - } - url_free (u); - depth = 1; - first_time = 0; - } - else - ++depth; + uerr_t status = RETROK; - if (opt.reclevel != INFINITE_RECURSION && depth > opt.reclevel) - /* We've exceeded the maximum recursion depth specified by the user. */ - { - if (opt.page_requisites && depth <= opt.reclevel + 1) - /* When -p is specified, we can do one more partial recursion from the - "leaf nodes" on the HTML document tree. The recursion is partial in - that we won't traverse any or tags, nor any tags - except for . */ - dash_p_leaf_HTML = TRUE; - else - /* Either -p wasn't specified or it was and we've already gone the one - extra (pseudo-)level that it affords us, so we need to bail out. */ - { - DEBUGP (("Recursion depth %d exceeded max. depth %d.\n", - depth, opt.reclevel)); - --depth; - return RECLEVELEXC; - } - } + /* The queue of URLs we need to load. */ + struct url_queue *queue = url_queue_new (); - /* Determine whether this_url is an FTP URL. If it is, it means - that the retrieval is done through proxy. In that case, FTP - links will be followed by default and recursion will not be - turned off when following them. */ - this_url_ftp = (url_scheme (this_url) == SCHEME_FTP); + /* The URLs we decided we don't want to load. */ + struct hash_table *blacklist = make_string_hash_table (0); - /* Get the URL-s from an HTML file: */ - url_list = get_urls_html (file, canon_this_url ? canon_this_url : this_url, - dash_p_leaf_HTML, &meta_disallow_follow); + /* We'll need various components of this, so better get it over with + now. */ + struct url *start_url_parsed = url_parse (start_url, NULL); - if (opt.use_robots && meta_disallow_follow) - { - /* The META tag says we are not to follow this file. Respect - that. */ - free_urlpos (url_list); - url_list = NULL; - } + url_enqueue (queue, xstrdup (start_url), NULL, 0); + string_set_add (blacklist, start_url); - /* Decide what to do with each of the URLs. 
A URL will be loaded if - it meets several requirements, discussed later. */ - for (cur_url = url_list; cur_url; cur_url = cur_url->next) + while (1) { - /* If quota was exceeded earlier, bail out. */ + int descend = 0; + char *url, *referer, *file = NULL; + int depth; + boolean dash_p_leaf_HTML = FALSE; + if (downloaded_exceeds_quota ()) break; - /* Parse the URL for convenient use in other functions, as well - as to get the optimized form. It also checks URL integrity. */ - u = url_parse (cur_url->url, NULL); - if (!u) - { - DEBUGP (("Yuck! A bad URL.\n")); - continue; - } - assert (u->url != NULL); - constr = xstrdup (u->url); - - /* Several checkings whether a file is acceptable to load: - 1. check if URL is ftp, and we don't load it - 2. check for relative links (if relative_only is set) - 3. check for domain - 4. check for no-parent - 5. check for excludes && includes - 6. check for suffix - 7. check for same host (if spanhost is unset), with possible - gethostbyname baggage - 8. check for robots.txt - - Addendum: If the URL is FTP, and it is to be loaded, only the - domain and suffix settings are "stronger". - - Note that .html and (yuck) .htm will get loaded regardless of - suffix rules (but that is remedied later with unlink) unless - the depth equals the maximum depth. - - More time- and memory- consuming tests should be put later on - the list. */ - - /* inl is set if the URL we are working on (constr) is stored in - undesirable_urls. Using it is crucial to avoid unnecessary - repeated continuous hits to the hash table. */ - inl = string_set_contains (undesirable_urls, constr); - - /* If it is FTP, and FTP is not followed, chuck it out. */ - if (!inl) - if (u->scheme == SCHEME_FTP && !opt.follow_ftp && !this_url_ftp) - { - DEBUGP (("Uh, it is FTP but i'm not in the mood to follow FTP.\n")); - string_set_add (undesirable_urls, constr); - inl = 1; - } - /* If it is absolute link and they are not followed, chuck it - out. */ - if (!inl && u->scheme != SCHEME_FTP) - if (opt.relative_only && !cur_url->link_relative_p) - { - DEBUGP (("It doesn't really look like a relative link.\n")); - string_set_add (undesirable_urls, constr); - inl = 1; - } - /* If its domain is not to be accepted/looked-up, chuck it out. */ - if (!inl) - if (!accept_domain (u)) + + if (status == FWRITEERR) + break; + + /* Get the next URL from the queue. */ + + if (!url_dequeue (queue, + (const char **)&url, (const char **)&referer, + &depth)) + break; + + /* And download it. */ + + { + int dt = 0; + char *redirected = NULL; + int oldrec = opt.recursive; + + opt.recursive = 0; + status = retrieve_url (url, &file, &redirected, NULL, &dt); + opt.recursive = oldrec; + + if (redirected) { - DEBUGP (("I don't like the smell of that domain.\n")); - string_set_add (undesirable_urls, constr); - inl = 1; + xfree (url); + url = redirected; } - /* Check for parent directory. */ - if (!inl && opt.no_parent - /* If the new URL is FTP and the old was not, ignore - opt.no_parent. */ - && !(!this_url_ftp && u->scheme == SCHEME_FTP)) + if (file && status == RETROK + && (dt & RETROKF) && (dt & TEXTHTML)) + descend = 1; + } + + if (descend + && depth >= opt.reclevel && opt.reclevel != INFINITE_RECURSION) { - /* Check for base_dir first. */ - if (!(base_dir && frontcmp (base_dir, u->dir))) + if (opt.page_requisites && depth == opt.reclevel) + /* When -p is specified, we can do one more partial + recursion from the "leaf nodes" on the HTML document + tree. 
The recursion is partial in that we won't + traverse any or tags, nor any tags + except for . */ + /* #### This would be the place to implement the TODO + entry saying that -p should do two more hops on + framesets. */ + dash_p_leaf_HTML = TRUE; + else { - /* Failing that, check for parent dir. */ - struct url *ut = url_parse (this_url, NULL); - if (!ut) - DEBUGP (("Double yuck! The *base* URL is broken.\n")); - else if (!frontcmp (ut->dir, u->dir)) - { - /* Failing that too, kill the URL. */ - DEBUGP (("Trying to escape parental guidance with no_parent on.\n")); - string_set_add (undesirable_urls, constr); - inl = 1; - } - url_free (ut); + /* Either -p wasn't specified or it was and we've + already gone the one extra (pseudo-)level that it + affords us, so we need to bail out. */ + DEBUGP (("Not descending further; at depth %d, max. %d.\n", + depth, opt.reclevel)); + descend = 0; } } - /* If the file does not match the acceptance list, or is on the - rejection list, chuck it out. The same goes for the - directory exclude- and include- lists. */ - if (!inl && (opt.includes || opt.excludes)) + + /* If the downloaded document was HTML, parse it and enqueue the + links it contains. */ + + if (descend) { - if (!accdir (u->dir, ALLABS)) + int meta_disallow_follow = 0; + struct urlpos *children = get_urls_html (file, url, dash_p_leaf_HTML, + &meta_disallow_follow); + + if (opt.use_robots && meta_disallow_follow) { - DEBUGP (("%s (%s) is excluded/not-included.\n", constr, u->dir)); - string_set_add (undesirable_urls, constr); - inl = 1; + free_urlpos (children); + children = NULL; } - } - if (!inl) - { - char *suf = NULL; - /* We check for acceptance/rejection rules only for non-HTML - documents. Since we don't know whether they really are - HTML, it will be deduced from (an OR-ed list): - - 1) u->file is "" (meaning it is a directory) - 2) suffix exists, AND: - a) it is "html", OR - b) it is "htm" - - If the file *is* supposed to be HTML, it will *not* be - subject to acc/rej rules, unless a finite maximum depth has - been specified and the current depth is the maximum depth. */ - if (! - (!*u->file - || (((suf = suffix (constr)) != NULL) - && ((!strcmp (suf, "html") || !strcmp (suf, "htm")) - && ((opt.reclevel != INFINITE_RECURSION) && - (depth != opt.reclevel)))))) + + if (children) { - if (!acceptable (u->file)) + struct urlpos *child = children; + struct url *url_parsed = url_parsed = url_parse (url, NULL); + assert (url_parsed != NULL); + + for (; child; child = child->next) { - DEBUGP (("%s (%s) does not match acc/rej rules.\n", - constr, u->file)); - string_set_add (undesirable_urls, constr); - inl = 1; + if (descend_url_p (child, url_parsed, depth, start_url_parsed, + blacklist)) + { + url_enqueue (queue, xstrdup (child->url->url), + xstrdup (url), depth + 1); + /* We blacklist the URL we have enqueued, because we + don't want to enqueue (and hence download) the + same URL twice. */ + string_set_add (blacklist, child->url->url); + } } + + url_free (url_parsed); + free_urlpos (children); } - FREE_MAYBE (suf); } - /* Optimize the URL (which includes possible DNS lookup) only - after all other possibilities have been exhausted. */ - if (!inl) + + if (opt.delete_after || (file && !acceptable (file))) { - if (!opt.simple_check) - { - /* Find the "true" host. */ - char *host = realhost (u->host); - xfree (u->host); - u->host = host; - - /* Refresh the printed representation of the URL. 
*/ - xfree (u->url); - u->url = url_string (u, 0); - } - else - { - char *p; - /* Just lowercase the hostname. */ - for (p = u->host; *p; p++) - *p = TOLOWER (*p); - xfree (u->url); - u->url = url_string (u, 0); - } - xfree (constr); - constr = xstrdup (u->url); - /* After we have canonicalized the URL, check if we have it - on the black list. */ - if (string_set_contains (undesirable_urls, constr)) - inl = 1; - /* This line is bogus. */ - /*string_set_add (undesirable_urls, constr);*/ - - if (!inl && !((u->scheme == SCHEME_FTP) && !this_url_ftp)) - if (!opt.spanhost && this_url && !same_host (this_url, constr)) - { - DEBUGP (("This is not the same hostname as the parent's.\n")); - string_set_add (undesirable_urls, constr); - inl = 1; - } + /* Either --delete-after was specified, or we loaded this + otherwise rejected (e.g. by -R) HTML file just so we + could harvest its hyperlinks -- in either case, delete + the local file. */ + DEBUGP (("Removing file due to %s in recursive_retrieve():\n", + opt.delete_after ? "--delete-after" : + "recursive rejection criteria")); + logprintf (LOG_VERBOSE, + (opt.delete_after ? _("Removing %s.\n") + : _("Removing %s since it should be rejected.\n")), + file); + if (unlink (file)) + logprintf (LOG_NOTQUIET, "unlink: %s\n", strerror (errno)); } - /* What about robots.txt? */ - if (!inl && opt.use_robots && u->scheme == SCHEME_HTTP) + + xfree (url); + FREE_MAYBE (referer); + FREE_MAYBE (file); + } + + /* If anything is left of the queue due to a premature exit, free it + now. */ + { + char *d1, *d2; + int d3; + while (url_dequeue (queue, (const char **)&d1, (const char **)&d2, &d3)) + { + xfree (d1); + FREE_MAYBE (d2); + } + } + url_queue_delete (queue); + + if (start_url_parsed) + url_free (start_url_parsed); + string_set_free (blacklist); + + if (downloaded_exceeds_quota ()) + return QUOTEXC; + else if (status == FWRITEERR) + return FWRITEERR; + else + return RETROK; +} + +/* Based on the context provided by retrieve_tree, decide whether a + URL is to be descended to. This is only ever called from + retrieve_tree, but is in a separate function for clarity. */ + +static int +descend_url_p (const struct urlpos *upos, struct url *parent, int depth, + struct url *start_url_parsed, struct hash_table *blacklist) +{ + struct url *u = upos->url; + const char *url = u->url; + + DEBUGP (("Deciding whether to enqueue \"%s\".\n", url)); + + if (string_set_contains (blacklist, url)) + { + DEBUGP (("Already on the black list.\n")); + goto out; + } + + /* Several things to check for: + 1. if scheme is not http, and we don't load it + 2. check for relative links (if relative_only is set) + 3. check for domain + 4. check for no-parent + 5. check for excludes && includes + 6. check for suffix + 7. check for same host (if spanhost is unset), with possible + gethostbyname baggage + 8. check for robots.txt + + Addendum: If the URL is FTP, and it is to be loaded, only the + domain and suffix settings are "stronger". + + Note that .html files will get loaded regardless of suffix rules + (but that is remedied later with unlink) unless the depth equals + the maximum depth. + + More time- and memory- consuming tests should be put later on + the list. */ + + /* 1. Schemes other than HTTP are normally not recursed into. */ + if (u->scheme != SCHEME_HTTP + && !(u->scheme == SCHEME_FTP && opt.follow_ftp)) + { + DEBUGP (("Not following non-HTTP schemes.\n")); + goto blacklist; + } + + /* 2. If it is an absolute link and they are not followed, throw it + out. 
*/ + if (u->scheme == SCHEME_HTTP) + if (opt.relative_only && !upos->link_relative_p) + { + DEBUGP (("It doesn't really look like a relative link.\n")); + goto blacklist; + } + + /* 3. If its domain is not to be accepted/looked-up, chuck it + out. */ + if (!accept_domain (u)) + { + DEBUGP (("The domain was not accepted.\n")); + goto blacklist; + } + + /* 4. Check for parent directory. + + If we descended to a different host or changed the scheme, ignore + opt.no_parent. Also ignore it for -p leaf retrievals. */ + if (opt.no_parent + && u->scheme == parent->scheme + && 0 == strcasecmp (u->host, parent->host) + && u->port == parent->port) + { + if (!frontcmp (parent->dir, u->dir)) { - struct robot_specs *specs = res_get_specs (u->host, u->port); - if (!specs) - { - char *rfile; - if (res_retrieve_file (constr, &rfile)) - { - specs = res_parse_from_file (rfile); - xfree (rfile); - } - else - { - /* If we cannot get real specs, at least produce - dummy ones so that we can register them and stop - trying to retrieve them. */ - specs = res_parse ("", 0); - } - res_register_specs (u->host, u->port, specs); - } + DEBUGP (("Trying to escape the root directory with no_parent in effect.\n")); + goto blacklist; + } + } - /* Now that we have (or don't have) robots.txt specs, we can - check what they say. */ - if (!res_match_path (specs, u->path)) - { - DEBUGP (("Not following %s because robots.txt forbids it.\n", - constr)); - string_set_add (undesirable_urls, constr); - inl = 1; - } + /* 5. If the file does not match the acceptance list, or is on the + rejection list, chuck it out. The same goes for the directory + exclusion and inclusion lists. */ + if (opt.includes || opt.excludes) + { + if (!accdir (u->dir, ALLABS)) + { + DEBUGP (("%s (%s) is excluded/not-included.\n", url, u->dir)); + goto blacklist; } + } - filename = NULL; - /* If it wasn't chucked out, do something with it. */ - if (!inl) + /* 6. */ + { + char *suf = NULL; + /* Check for acceptance/rejection rules. We ignore these rules + for HTML documents because they might lead to other files which + need to be downloaded. Of course, we don't know which + documents are HTML before downloading them, so we guess. + + A file is subject to acceptance/rejection rules if: + + * u->file is not "" (i.e. it is not a directory) + and either: + + there is no file suffix, + + or there is a suffix, but is not "html" or "htm", + + both: + - recursion is not infinite, + - and we are at its very end. */ + + if (u->file[0] != '\0' + && ((suf = suffix (url)) == NULL + || (0 != strcmp (suf, "html") && 0 != strcmp (suf, "htm")) + || (opt.reclevel == INFINITE_RECURSION && depth >= opt.reclevel))) + { + if (!acceptable (u->file)) + { + DEBUGP (("%s (%s) does not match acc/rej rules.\n", + url, u->file)); + FREE_MAYBE (suf); + goto blacklist; + } + } + FREE_MAYBE (suf); + } + + /* 7. */ + if (u->scheme == parent->scheme) + if (!opt.spanhost && 0 != strcasecmp (parent->host, u->host)) + { + DEBUGP (("This is not the same hostname as the parent's (%s and %s).\n", + u->host, parent->host)); + goto blacklist; + } + + /* 8. */ + if (opt.use_robots && u->scheme == SCHEME_HTTP) + { + struct robot_specs *specs = res_get_specs (u->host, u->port); + if (!specs) { - DEBUGP (("I've decided to load it -> ")); - /* Add it to the list of already-loaded URL-s. */ - string_set_add (undesirable_urls, constr); - /* Automatically followed FTPs will *not* be downloaded - recursively. 
*/ - if (u->scheme == SCHEME_FTP) + char *rfile; + if (res_retrieve_file (url, &rfile)) { - /* Don't you adore side-effects? */ - opt.recursive = 0; - } - /* Reset its type. */ - dt = 0; - /* Retrieve it. */ - retrieve_url (constr, &filename, &newloc, - canon_this_url ? canon_this_url : this_url, &dt); - if (u->scheme == SCHEME_FTP) - { - /* Restore... */ - opt.recursive = 1; - } - if (newloc) - { - xfree (constr); - constr = newloc; - } - /* If there was no error, and the type is text/html, parse - it recursively. */ - if (dt & TEXTHTML) - { - if (dt & RETROKF) - recursive_retrieve (filename, constr); + specs = res_parse_from_file (rfile); + xfree (rfile); } else - DEBUGP (("%s is not text/html so we don't chase.\n", - filename ? filename: "(null)")); - - if (opt.delete_after || (filename && !acceptable (filename))) - /* Either --delete-after was specified, or we loaded this otherwise - rejected (e.g. by -R) HTML file just so we could harvest its - hyperlinks -- in either case, delete the local file. */ { - DEBUGP (("Removing file due to %s in recursive_retrieve():\n", - opt.delete_after ? "--delete-after" : - "recursive rejection criteria")); - logprintf (LOG_VERBOSE, - (opt.delete_after ? _("Removing %s.\n") - : _("Removing %s since it should be rejected.\n")), - filename); - if (unlink (filename)) - logprintf (LOG_NOTQUIET, "unlink: %s\n", strerror (errno)); - dt &= ~RETROKF; + /* If we cannot get real specs, at least produce + dummy ones so that we can register them and stop + trying to retrieve them. */ + specs = res_parse ("", 0); } + res_register_specs (u->host, u->port, specs); + } - /* If everything was OK, and links are to be converted, let's - store the local filename. */ - if (opt.convert_links && (dt & RETROKF) && (filename != NULL)) - { - cur_url->convert = CO_CONVERT_TO_RELATIVE; - cur_url->local_name = xstrdup (filename); - } + /* Now that we have (or don't have) robots.txt specs, we can + check what they say. */ + if (!res_match_path (specs, u->path)) + { + DEBUGP (("Not following %s because robots.txt forbids it.\n", url)); + goto blacklist; } - else - DEBUGP (("%s already in list, so we don't load.\n", constr)); - /* Free filename and constr. */ - FREE_MAYBE (filename); - FREE_MAYBE (constr); - url_free (u); - /* Increment the pbuf for the appropriate size. */ } - if (opt.convert_links && !opt.delete_after) - /* This is merely the first pass: the links that have been - successfully downloaded are converted. In the second pass, - convert_all_links() will also convert those links that have NOT - been downloaded to their canonical form. */ - convert_links (file, url_list); - /* Free the linked list of URL-s. */ - free_urlpos (url_list); - /* Free the canonical this_url. */ - FREE_MAYBE (canon_this_url); - /* Decrement the recursion depth. */ - --depth; - if (downloaded_exceeds_quota ()) - return QUOTEXC; - else - return RETROK; + + /* The URL has passed all the tests. It can be placed in the + download queue. */ + DEBUGP (("Decided to load it.\n")); + + return 1; + + blacklist: + string_set_add (blacklist, url); + + out: + DEBUGP (("Decided NOT to load it.\n")); + + return 0; } +/* Register that URL has been successfully downloaded to FILE. 
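descend_url_p above chains the per-link tests (scheme, relative-only, domain, no-parent, include/exclude directories, suffix, host spanning, robots.txt) as early returns over an already-parsed URL. The following self-contained sketch shows the same shape with only three of those tests; struct link, should_descend and the field names are illustrative, not Wget's types:

#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strcasecmp */

struct link {
  const char *scheme;   /* "http", "ftp", ... */
  const char *host;
  const char *file;     /* last path component, "" for a directory */
};

/* Decide whether LINK, found on a page served by PARENT_HOST, should be
   queued.  Only three of the real tests are shown: scheme, host
   spanning, and a suffix acceptance check. */
static int
should_descend (const struct link *link, const char *parent_host,
                int span_hosts, const char *accept_suffix)
{
  /* 1. Only follow HTTP links.  */
  if (strcmp (link->scheme, "http") != 0)
    return 0;

  /* 2. Unless host spanning is allowed, stay on the same host.  */
  if (!span_hosts && strcasecmp (link->host, parent_host) != 0)
    return 0;

  /* 3. Non-HTML files must match the acceptance suffix, if one is set.  */
  if (accept_suffix && *link->file)
    {
      const char *dot = strrchr (link->file, '.');
      const char *suf = dot ? dot + 1 : "";
      if (strcmp (suf, "html") != 0 && strcmp (suf, "htm") != 0
          && strcmp (suf, accept_suffix) != 0)
        return 0;
    }

  return 1;
}

int
main (void)
{
  struct link a = { "http", "www.example.com", "index.html" };
  struct link b = { "ftp",  "www.example.com", "file.tar.gz" };
  printf ("%d %d\n",
          should_descend (&a, "www.example.com", 0, "pdf"),
          should_descend (&b, "www.example.com", 0, "pdf"));
  return 0;
}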
*/ + void register_download (const char *url, const char *file) { @@ -507,12 +535,35 @@ register_download (const char *url, const char *file) return; if (!dl_file_url_map) dl_file_url_map = make_string_hash_table (0); - hash_table_put (dl_file_url_map, xstrdup (file), xstrdup (url)); if (!dl_url_file_map) dl_url_file_map = make_string_hash_table (0); - hash_table_put (dl_url_file_map, xstrdup (url), xstrdup (file)); + + if (!hash_table_contains (dl_file_url_map, file)) + hash_table_put (dl_file_url_map, xstrdup (file), xstrdup (url)); + if (!hash_table_contains (dl_url_file_map, url)) + hash_table_put (dl_url_file_map, xstrdup (url), xstrdup (file)); } +/* Register that FROM has been redirected to TO. This assumes that TO + is successfully downloaded and already registered using + register_download() above. */ + +void +register_redirection (const char *from, const char *to) +{ + char *file; + + if (!opt.convert_links) + return; + + file = hash_table_get (dl_url_file_map, to); + assert (file != NULL); + if (!hash_table_contains (dl_url_file_map, from)) + hash_table_put (dl_url_file_map, xstrdup (from), xstrdup (file)); +} + +/* Register that URL corresponds to the HTML file FILE. */ + void register_html (const char *url, const char *file) { @@ -558,10 +609,11 @@ convert_all_links (void) for (html = downloaded_html_files; html; html = html->next) { - urlpos *urls, *cur_url; + struct urlpos *urls, *cur_url; char *url; DEBUGP (("Rescanning %s\n", html->string)); + /* Determine the URL of the HTML file. get_urls_html will need it. */ url = hash_table_get (dl_file_url_map, html->string); @@ -569,19 +621,19 @@ convert_all_links (void) DEBUGP (("It should correspond to %s.\n", url)); else DEBUGP (("I cannot find the corresponding URL.\n")); + /* Parse the HTML file... */ urls = get_urls_html (html->string, url, FALSE, NULL); + /* We don't respect meta_disallow_follow here because, even if the file is not followed, we might still want to convert the links that have been followed from other files. */ + for (cur_url = urls; cur_url; cur_url = cur_url->next) { char *local_name; + struct url *u = cur_url->url; - /* The URL must be in canonical form to be compared. */ - struct url *u = url_parse (cur_url->url, NULL); - if (!u) - continue; /* We decide the direction of conversion according to whether a URL was downloaded. Downloaded URLs will be converted ABS2REL, whereas non-downloaded will be converted REL2ABS. */ @@ -589,6 +641,7 @@ convert_all_links (void) if (local_name) DEBUGP (("%s marked for conversion, local %s\n", u->url, local_name)); + /* Decide on the conversion direction. */ if (local_name) { @@ -610,7 +663,6 @@ convert_all_links (void) cur_url->convert = CO_CONVERT_TO_COMPLETE; cur_url->local_name = NULL; } - url_free (u); } /* Convert the links in the file. */ convert_links (html->string, urls); @@ -618,3 +670,24 @@ convert_all_links (void) free_urlpos (urls); } } + +/* Cleanup the data structures associated with recursive retrieving + (the variables above). 
*/ +void +recursive_cleanup (void) +{ + if (dl_file_url_map) + { + free_keys_and_values (dl_file_url_map); + hash_table_destroy (dl_file_url_map); + dl_file_url_map = NULL; + } + if (dl_url_file_map) + { + free_keys_and_values (dl_url_file_map); + hash_table_destroy (dl_url_file_map); + dl_url_file_map = NULL; + } + slist_free (downloaded_html_files); + downloaded_html_files = NULL; +} diff --git a/src/recur.h b/src/recur.h index de3c6295..4654de83 100644 --- a/src/recur.h +++ b/src/recur.h @@ -21,10 +21,10 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ #define RECUR_H void recursive_cleanup PARAMS ((void)); -void recursive_reset PARAMS ((void)); -uerr_t recursive_retrieve PARAMS ((const char *, const char *)); +uerr_t retrieve_tree PARAMS ((const char *)); void register_download PARAMS ((const char *, const char *)); +void register_redirection PARAMS ((const char *, const char *)); void register_html PARAMS ((const char *, const char *)); void convert_all_links PARAMS ((void)); diff --git a/src/res.c b/src/res.c index 9cd15db6..32b53b1b 100644 --- a/src/res.c +++ b/src/res.c @@ -125,6 +125,10 @@ add_path (struct robot_specs *specs, const char *path_b, const char *path_e, int allowedp, int exactp) { struct path_info pp; + if (path_b < path_e && *path_b == '/') + /* Our path representation doesn't use a leading slash, so remove + one from theirs. */ + ++path_b; pp.path = strdupdelim (path_b, path_e); pp.allowedp = allowedp; pp.user_agent_exact_p = exactp; @@ -390,6 +394,9 @@ res_parse_from_file (const char *filename) static void free_specs (struct robot_specs *specs) { + int i; + for (i = 0; i < specs->count; i++) + xfree (specs->paths[i].path); FREE_MAYBE (specs->paths); xfree (specs); } @@ -546,3 +553,22 @@ res_retrieve_file (const char *url, char **file) } return err == RETROK; } + +static int +cleanup_hash_table_mapper (void *key, void *value, void *arg_ignored) +{ + xfree (key); + free_specs (value); + return 0; +} + +void +res_cleanup (void) +{ + if (registered_specs) + { + hash_table_map (registered_specs, cleanup_hash_table_mapper, NULL); + hash_table_destroy (registered_specs); + registered_specs = NULL; + } +} diff --git a/src/res.h b/src/res.h index 97b89778..3b8b7647 100644 --- a/src/res.h +++ b/src/res.h @@ -29,3 +29,4 @@ struct robot_specs *res_get_specs PARAMS ((const char *, int)); int res_retrieve_file PARAMS ((const char *, char **)); +void res_cleanup PARAMS ((void)); diff --git a/src/retr.c b/src/retr.c index 32c58a7f..dcd81226 100644 --- a/src/retr.c +++ b/src/retr.c @@ -184,6 +184,26 @@ rate (long bytes, long msecs, int pad) return res; } +static int +register_redirections_mapper (void *key, void *value, void *arg) +{ + const char *redirected_from = (const char *)key; + const char *redirected_to = (const char *)arg; + if (0 != strcmp (redirected_from, redirected_to)) + register_redirection (redirected_from, redirected_to); + return 0; +} + +/* Register the redirections that lead to the successful download of + this URL. This is necessary so that the link converter can convert + redirected URLs to the local file. 
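register_download and register_redirection above maintain a URL-to-local-file map so the link converter can later rewrite a link even when the URL it saw was only a redirection to the page that actually got saved. A minimal stand-alone version of that idea, using a flat array where Wget uses hash tables:

#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 32

/* url -> local file map (flat-array stand-in for dl_url_file_map). */
static const char *map_url[MAX_ENTRIES];
static const char *map_file[MAX_ENTRIES];
static int map_count;

static const char *
lookup_file (const char *url)
{
  int i;
  for (i = 0; i < map_count; i++)
    if (strcmp (map_url[i], url) == 0)
      return map_file[i];
  return NULL;
}

/* Record that URL was saved as FILE (no-op if already present). */
static void
record_download (const char *url, const char *file)
{
  if (!lookup_file (url) && map_count < MAX_ENTRIES)
    {
      map_url[map_count] = url;
      map_file[map_count] = file;
      map_count++;
    }
}

/* Record that FROM redirected to TO; TO is already registered, so FROM
   simply inherits TO's local file. */
static void
record_redirection (const char *from, const char *to)
{
  const char *file = lookup_file (to);
  if (file)
    record_download (from, file);
}

int
main (void)
{
  record_download ("http://b.example/page", "b.example/page.html");
  record_redirection ("http://a.example/old", "http://b.example/page");
  printf ("%s\n", lookup_file ("http://a.example/old"));
  return 0;
}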
*/ + +static void +register_all_redirections (struct hash_table *redirections, const char *final) +{ + hash_table_map (redirections, register_redirections_mapper, (void *)final); +} + #define USE_PROXY_P(u) (opt.use_proxy && getproxy((u)->scheme) \ && no_proxy_match((u)->host, \ (const char **)opt.no_proxy)) @@ -254,7 +274,7 @@ retrieve_url (const char *origurl, char **file, char **newloc, proxy_url = url_parse (proxy, &up_error_code); if (!proxy_url) { - logprintf (LOG_NOTQUIET, "Error parsing proxy URL %s: %s.\n", + logprintf (LOG_NOTQUIET, _("Error parsing proxy URL %s: %s.\n"), proxy, url_error (up_error_code)); if (redirections) string_set_free (redirections); @@ -310,7 +330,7 @@ retrieve_url (const char *origurl, char **file, char **newloc, if (location_changed) { char *construced_newloc; - struct url *newloc_struct; + struct url *newloc_parsed; assert (mynewloc != NULL); @@ -326,12 +346,11 @@ retrieve_url (const char *origurl, char **file, char **newloc, mynewloc = construced_newloc; /* Now, see if this new location makes sense. */ - newloc_struct = url_parse (mynewloc, &up_error_code); - if (!newloc_struct) + newloc_parsed = url_parse (mynewloc, &up_error_code); + if (!newloc_parsed) { logprintf (LOG_NOTQUIET, "%s: %s.\n", mynewloc, url_error (up_error_code)); - url_free (newloc_struct); url_free (u); if (redirections) string_set_free (redirections); @@ -340,11 +359,11 @@ retrieve_url (const char *origurl, char **file, char **newloc, return result; } - /* Now mynewloc will become newloc_struct->url, because if the + /* Now mynewloc will become newloc_parsed->url, because if the Location contained relative paths like .././something, we don't want that propagating as url. */ xfree (mynewloc); - mynewloc = xstrdup (newloc_struct->url); + mynewloc = xstrdup (newloc_parsed->url); if (!redirections) { @@ -356,11 +375,11 @@ retrieve_url (const char *origurl, char **file, char **newloc, /* The new location is OK. Check for redirection cycle by peeking through the history of redirections. */ - if (string_set_contains (redirections, newloc_struct->url)) + if (string_set_contains (redirections, newloc_parsed->url)) { logprintf (LOG_NOTQUIET, _("%s: Redirection cycle detected.\n"), mynewloc); - url_free (newloc_struct); + url_free (newloc_parsed); url_free (u); if (redirections) string_set_free (redirections); @@ -368,12 +387,12 @@ retrieve_url (const char *origurl, char **file, char **newloc, xfree (mynewloc); return WRONGCODE; } - string_set_add (redirections, newloc_struct->url); + string_set_add (redirections, newloc_parsed->url); xfree (url); url = mynewloc; url_free (u); - u = newloc_struct; + u = newloc_parsed; goto redirected; } @@ -382,6 +401,8 @@ retrieve_url (const char *origurl, char **file, char **newloc, if (*dt & RETROKF) { register_download (url, local_file); + if (redirections) + register_all_redirections (redirections, url); if (*dt & TEXTHTML) register_html (url, local_file); } @@ -415,16 +436,16 @@ uerr_t retrieve_from_file (const char *file, int html, int *count) { uerr_t status; - urlpos *url_list, *cur_url; + struct urlpos *url_list, *cur_url; url_list = (html ? get_urls_html (file, NULL, FALSE, NULL) : get_urls_file (file)); status = RETROK; /* Suppose everything is OK. */ *count = 0; /* Reset the URL count. 
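retrieve_url above breaks redirection loops by remembering every Location value it has already followed in a string set and giving up with a "Redirection cycle detected" error when one repeats. A compact, self-contained illustration of that guard -- next_location() is a made-up stand-in for the server's responses, and a plain array replaces the string set:

#include <stdio.h>
#include <string.h>

#define MAX_REDIRECTS 20

/* Hypothetical server behaviour: each URL's Location header, or NULL. */
static const char *
next_location (const char *url)
{
  if (strcmp (url, "http://example.com/a") == 0)
    return "http://example.com/b";
  if (strcmp (url, "http://example.com/b") == 0)
    return "http://example.com/a";
  return NULL;
}

int
main (void)
{
  const char *seen[MAX_REDIRECTS];
  int nseen = 0;
  const char *url = "http://example.com/a";

  seen[nseen++] = url;          /* remember the starting URL too */
  for (;;)
    {
      const char *loc = next_location (url);
      int i;
      if (!loc)
        {
          printf ("final URL: %s\n", url);
          break;
        }
      for (i = 0; i < nseen; i++)
        if (strcmp (seen[i], loc) == 0)
          {
            printf ("%s: Redirection cycle detected.\n", loc);
            return 1;
          }
      if (nseen == MAX_REDIRECTS)
        {
          printf ("too many redirections\n");
          return 1;
        }
      seen[nseen++] = loc;
      url = loc;
    }
  return 0;
}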
*/ - recursive_reset (); + for (cur_url = url_list; cur_url; cur_url = cur_url->next, ++*count) { - char *filename, *new_file; + char *filename = NULL, *new_file; int dt; if (downloaded_exceeds_quota ()) @@ -432,10 +453,10 @@ retrieve_from_file (const char *file, int html, int *count) status = QUOTEXC; break; } - status = retrieve_url (cur_url->url, &filename, &new_file, NULL, &dt); - if (opt.recursive && status == RETROK && (dt & TEXTHTML)) - status = recursive_retrieve (filename, new_file ? new_file - : cur_url->url); + if (opt.recursive && cur_url->url->scheme != SCHEME_FTP) + status = retrieve_tree (cur_url->url->url); + else + status = retrieve_url (cur_url->url->url, &filename, &new_file, NULL, &dt); if (filename && opt.delete_after && file_exists_p (filename)) { diff --git a/src/url.c b/src/url.c index e1685ab9..74547d90 100644 --- a/src/url.c +++ b/src/url.c @@ -37,6 +37,7 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ #include "utils.h" #include "url.h" #include "host.h" +#include "hash.h" #ifndef errno extern int errno; @@ -182,7 +183,7 @@ encode_string_maybe (const char *s) { if (UNSAFE_CHAR (*p1)) { - const unsigned char c = *p1++; + unsigned char c = *p1++; *p2++ = '%'; *p2++ = XDIGIT_TO_XCHAR (c >> 4); *p2++ = XDIGIT_TO_XCHAR (c & 0xf); @@ -378,7 +379,7 @@ reencode_string (const char *s) { case CM_ENCODE: { - char c = *p1++; + unsigned char c = *p1++; *p2++ = '%'; *p2++ = XDIGIT_TO_XCHAR (c >> 4); *p2++ = XDIGIT_TO_XCHAR (c & 0xf); @@ -586,6 +587,22 @@ strpbrk_or_eos (const char *s, const char *accept) return p; } +/* Turn STR into lowercase; return non-zero if a character was + actually changed. */ + +static int +lowercase_str (char *str) +{ + int change = 0; + for (; *str; str++) + if (!ISLOWER (*str)) + { + change = 1; + *str = TOLOWER (*str); + } + return change; +} + static char *parse_errors[] = { #define PE_NO_ERROR 0 "No error", @@ -614,6 +631,7 @@ url_parse (const char *url, int *error) { struct url *u; const char *p; + int path_modified, host_modified; enum url_scheme scheme; @@ -627,9 +645,7 @@ url_parse (const char *url, int *error) int port; char *user = NULL, *passwd = NULL; - const char *url_orig = url; - - p = url = reencode_string (url); + char *url_encoded; scheme = url_scheme (url); if (scheme == SCHEME_INVALID) @@ -638,6 +654,9 @@ url_parse (const char *url, int *error) return NULL; } + url_encoded = reencode_string (url); + p = url_encoded; + p += strlen (supported_schemes[scheme].leading_string); uname_b = p; p += url_skip_uname (p); @@ -749,11 +768,6 @@ url_parse (const char *url, int *error) u = (struct url *)xmalloc (sizeof (struct url)); memset (u, 0, sizeof (*u)); - if (url == url_orig) - u->url = xstrdup (url); - else - u->url = (char *)url; - u->scheme = scheme; u->host = strdupdelim (host_b, host_e); u->port = port; @@ -761,7 +775,10 @@ url_parse (const char *url, int *error) u->passwd = passwd; u->path = strdupdelim (path_b, path_e); - path_simplify (u->path); + path_modified = path_simplify (u->path); + parse_path (u->path, &u->dir, &u->file); + + host_modified = lowercase_str (u->host); if (params_b) u->params = strdupdelim (params_b, params_e); @@ -770,7 +787,26 @@ url_parse (const char *url, int *error) if (fragment_b) u->fragment = strdupdelim (fragment_b, fragment_e); - parse_path (u->path, &u->dir, &u->file); + + if (path_modified || u->fragment || host_modified) + { + /* If path_simplify modified the path, or if a fragment is + present, or if the original host name had caps in it, make + sure that u->url is equivalent to 
what would be printed by + url_string. */ + u->url = url_string (u, 0); + + if (url_encoded != url) + xfree ((char *) url_encoded); + } + else + { + if (url_encoded == url) + u->url = xstrdup (url); + else + u->url = url_encoded; + } + url_encoded = NULL; return u; } @@ -927,17 +963,18 @@ url_free (struct url *url) FREE_MAYBE (url->fragment); FREE_MAYBE (url->user); FREE_MAYBE (url->passwd); - FREE_MAYBE (url->dir); - FREE_MAYBE (url->file); + + xfree (url->dir); + xfree (url->file); xfree (url); } -urlpos * +struct urlpos * get_urls_file (const char *file) { struct file_memory *fm; - urlpos *head, *tail; + struct urlpos *head, *tail; const char *text, *text_end; /* Load the file. */ @@ -968,10 +1005,28 @@ get_urls_file (const char *file) --line_end; if (line_end > line_beg) { - urlpos *entry = (urlpos *)xmalloc (sizeof (urlpos)); + int up_error_code; + char *url_text; + struct urlpos *entry; + struct url *url; + + /* We must copy the URL to a zero-terminated string. *sigh*. */ + url_text = strdupdelim (line_beg, line_end); + url = url_parse (url_text, &up_error_code); + if (!url) + { + logprintf (LOG_NOTQUIET, "%s: Invalid URL %s: %s\n", + file, url_text, url_error (up_error_code)); + xfree (url_text); + continue; + } + xfree (url_text); + + entry = (struct urlpos *)xmalloc (sizeof (struct urlpos)); memset (entry, 0, sizeof (*entry)); entry->next = NULL; - entry->url = strdupdelim (line_beg, line_end); + entry->url = url; + if (!head) head = entry; else @@ -985,12 +1040,13 @@ get_urls_file (const char *file) /* Free the linked list of urlpos. */ void -free_urlpos (urlpos *l) +free_urlpos (struct urlpos *l) { while (l) { - urlpos *next = l->next; - xfree (l->url); + struct urlpos *next = l->next; + if (l->url) + url_free (l->url); FREE_MAYBE (l->local_name); xfree (l); l = next; @@ -1088,7 +1144,9 @@ count_slashes (const char *s) static char * mkstruct (const struct url *u) { - char *host, *dir, *file, *res, *dirpref; + char *dir, *dir_preencoding; + char *file, *res, *dirpref; + char *query = u->query && *u->query ? u->query : NULL; int l; if (opt.cut_dirs) @@ -1104,36 +1162,35 @@ mkstruct (const struct url *u) else dir = u->dir + (*u->dir == '/'); - host = xstrdup (u->host); /* Check for the true name (or at least a consistent name for saving to directory) of HOST, reusing the hlist if possible. */ - if (opt.add_hostdir && !opt.simple_check) - { - char *nhost = realhost (host); - xfree (host); - host = nhost; - } - /* Add dir_prefix and hostname (if required) to the beginning of - dir. */ if (opt.add_hostdir) { + /* Add dir_prefix and hostname (if required) to the beginning of + dir. */ + dirpref = (char *)alloca (strlen (opt.dir_prefix) + 1 + + strlen (u->host) + + 1 + numdigit (u->port) + + 1); if (!DOTP (opt.dir_prefix)) + sprintf (dirpref, "%s/%s", opt.dir_prefix, u->host); + else + strcpy (dirpref, u->host); + + if (u->port != scheme_default_port (u->scheme)) { - dirpref = (char *)alloca (strlen (opt.dir_prefix) + 1 - + strlen (host) + 1); - sprintf (dirpref, "%s/%s", opt.dir_prefix, host); + int len = strlen (dirpref); + dirpref[len] = ':'; + long_to_string (dirpref + len + 1, u->port); } - else - STRDUP_ALLOCA (dirpref, host); } - else /* not add_hostdir */ + else /* not add_hostdir */ { if (!DOTP (opt.dir_prefix)) dirpref = opt.dir_prefix; else dirpref = ""; } - xfree (host); /* If there is a prefix, prepend it. */ if (*dirpref) @@ -1142,7 +1199,10 @@ mkstruct (const struct url *u) sprintf (newdir, "%s%s%s", dirpref, *dir == '/' ? 
"" : "/", dir); dir = newdir; } - dir = encode_string (dir); + + dir_preencoding = dir; + dir = reencode_string (dir_preencoding); + l = strlen (dir); if (l && dir[l - 1] == '/') dir[l - 1] = '\0'; @@ -1153,9 +1213,17 @@ mkstruct (const struct url *u) file = u->file; /* Finally, construct the full name. */ - res = (char *)xmalloc (strlen (dir) + 1 + strlen (file) + 1); + res = (char *)xmalloc (strlen (dir) + 1 + strlen (file) + + (query ? (1 + strlen (query)) : 0) + + 1); sprintf (res, "%s%s%s", dir, *dir ? "/" : "", file); - xfree (dir); + if (query) + { + strcat (res, "?"); + strcat (res, query); + } + if (dir != dir_preencoding) + xfree (dir); return res; } @@ -1177,7 +1245,7 @@ compose_file_name (char *base, char *query) { if (UNSAFE_CHAR (*from)) { - const unsigned char c = *from++; + unsigned char c = *from++; *to++ = '%'; *to++ = XDIGIT_TO_XCHAR (c >> 4); *to++ = XDIGIT_TO_XCHAR (c & 0xf); @@ -1282,10 +1350,8 @@ url_filename (const struct url *u) static int urlpath_length (const char *url) { - const char *q = strchr (url, '?'); - if (q) - return q - url; - return strlen (url); + const char *q = strpbrk_or_eos (url, "?;#"); + return q - url; } /* Find the last occurrence of character C in the range [b, e), or @@ -1323,63 +1389,42 @@ uri_merge_1 (const char *base, const char *link, int linklength, int no_scheme) { const char *end = base + urlpath_length (base); - if (*link != '/') + if (!*link) { - /* LINK is a relative URL: we need to replace everything - after last slash (possibly empty) with LINK. - - So, if BASE is "whatever/foo/bar", and LINK is "qux/xyzzy", - our result should be "whatever/foo/qux/xyzzy". */ - int need_explicit_slash = 0; - int span; - const char *start_insert; - const char *last_slash = find_last_char (base, end, '/'); - if (!last_slash) - { - /* No slash found at all. Append LINK to what we have, - but we'll need a slash as a separator. - - Example: if base == "foo" and link == "qux/xyzzy", then - we cannot just append link to base, because we'd get - "fooqux/xyzzy", whereas what we want is - "foo/qux/xyzzy". - - To make sure the / gets inserted, we set - need_explicit_slash to 1. We also set start_insert - to end + 1, so that the length calculations work out - correctly for one more (slash) character. Accessing - that character is fine, since it will be the - delimiter, '\0' or '?'. */ - /* example: "foo?..." */ - /* ^ ('?' gets changed to '/') */ - start_insert = end + 1; - need_explicit_slash = 1; - } - else if (last_slash && last_slash != base && *(last_slash - 1) == '/') - { - /* example: http://host" */ - /* ^ */ - start_insert = end + 1; - need_explicit_slash = 1; - } - else - { - /* example: "whatever/foo/bar" */ - /* ^ */ - start_insert = last_slash + 1; - } - - span = start_insert - base; - constr = (char *)xmalloc (span + linklength + 1); - if (span) - memcpy (constr, base, span); - if (need_explicit_slash) - constr[span - 1] = '/'; - if (linklength) - memcpy (constr + span, link, linklength); - constr[span + linklength] = '\0'; + /* Empty LINK points back to BASE, query string and all. */ + constr = xstrdup (base); + } + else if (*link == '?') + { + /* LINK points to the same location, but changes the query + string. 
Examples: */ + /* uri_merge("path", "?new") -> "path?new" */ + /* uri_merge("path?foo", "?new") -> "path?new" */ + /* uri_merge("path?foo#bar", "?new") -> "path?new" */ + /* uri_merge("path#foo", "?new") -> "path?new" */ + int baselength = end - base; + constr = xmalloc (baselength + linklength + 1); + memcpy (constr, base, baselength); + memcpy (constr + baselength, link, linklength); + constr[baselength + linklength] = '\0'; } - else /* *link == `/' */ + else if (*link == '#') + { + /* uri_merge("path", "#new") -> "path#new" */ + /* uri_merge("path#foo", "#new") -> "path#new" */ + /* uri_merge("path?foo", "#new") -> "path?foo#new" */ + /* uri_merge("path?foo#bar", "#new") -> "path?foo#new" */ + int baselength; + const char *end1 = strchr (base, '#'); + if (!end1) + end1 = base + strlen (base); + baselength = end1 - base; + constr = xmalloc (baselength + linklength + 1); + memcpy (constr, base, baselength); + memcpy (constr + baselength, link, linklength); + constr[baselength + linklength] = '\0'; + } + else if (*link == '/') { /* LINK is an absolute path: we need to replace everything after (and including) the FIRST slash with LINK. @@ -1435,6 +1480,62 @@ uri_merge_1 (const char *base, const char *link, int linklength, int no_scheme) memcpy (constr + span, link, linklength); constr[span + linklength] = '\0'; } + else + { + /* LINK is a relative URL: we need to replace everything + after last slash (possibly empty) with LINK. + + So, if BASE is "whatever/foo/bar", and LINK is "qux/xyzzy", + our result should be "whatever/foo/qux/xyzzy". */ + int need_explicit_slash = 0; + int span; + const char *start_insert; + const char *last_slash = find_last_char (base, end, '/'); + if (!last_slash) + { + /* No slash found at all. Append LINK to what we have, + but we'll need a slash as a separator. + + Example: if base == "foo" and link == "qux/xyzzy", then + we cannot just append link to base, because we'd get + "fooqux/xyzzy", whereas what we want is + "foo/qux/xyzzy". + + To make sure the / gets inserted, we set + need_explicit_slash to 1. We also set start_insert + to end + 1, so that the length calculations work out + correctly for one more (slash) character. Accessing + that character is fine, since it will be the + delimiter, '\0' or '?'. */ + /* example: "foo?..." */ + /* ^ ('?' gets changed to '/') */ + start_insert = end + 1; + need_explicit_slash = 1; + } + else if (last_slash && last_slash != base && *(last_slash - 1) == '/') + { + /* example: http://host" */ + /* ^ */ + start_insert = end + 1; + need_explicit_slash = 1; + } + else + { + /* example: "whatever/foo/bar" */ + /* ^ */ + start_insert = last_slash + 1; + } + + span = start_insert - base; + constr = (char *)xmalloc (span + linklength + 1); + if (span) + memcpy (constr, base, span); + if (need_explicit_slash) + constr[span - 1] = '/'; + if (linklength) + memcpy (constr + span, link, linklength); + constr[span + linklength] = '\0'; + } } else /* !no_scheme */ { @@ -1602,12 +1703,13 @@ static void replace_attr PARAMS ((const char **, int, FILE *, const char *)); /* Change the links in an HTML document. Accepts a structure that defines the positions of all the links. */ void -convert_links (const char *file, urlpos *l) +convert_links (const char *file, struct urlpos *l) { struct file_memory *fm; FILE *fp; const char *p; downloaded_file_t downloaded_file_return; + int to_url_count = 0, to_file_count = 0; logprintf (LOG_VERBOSE, _("Converting %s... 
"), file); @@ -1615,12 +1717,12 @@ convert_links (const char *file, urlpos *l) /* First we do a "dry run": go through the list L and see whether any URL needs to be converted in the first place. If not, just leave the file alone. */ - int count = 0; - urlpos *dry = l; + int dry_count = 0; + struct urlpos *dry = l; for (dry = l; dry; dry = dry->next) if (dry->convert != CO_NOCONVERT) - ++count; - if (!count) + ++dry_count; + if (!dry_count) { logputs (LOG_VERBOSE, _("nothing to do.\n")); return; @@ -1674,7 +1776,7 @@ convert_links (const char *file, urlpos *l) /* If the URL is not to be converted, skip it. */ if (l->convert == CO_NOCONVERT) { - DEBUGP (("Skipping %s at position %d.\n", l->url, l->pos)); + DEBUGP (("Skipping %s at position %d.\n", l->url->url, l->pos)); continue; } @@ -1689,19 +1791,21 @@ convert_links (const char *file, urlpos *l) char *quoted_newname = html_quote_string (newname); replace_attr (&p, l->size, fp, quoted_newname); DEBUGP (("TO_RELATIVE: %s to %s at position %d in %s.\n", - l->url, newname, l->pos, file)); + l->url->url, newname, l->pos, file)); xfree (newname); xfree (quoted_newname); + ++to_file_count; } else if (l->convert == CO_CONVERT_TO_COMPLETE) { /* Convert the link to absolute URL. */ - char *newlink = l->url; + char *newlink = l->url->url; char *quoted_newlink = html_quote_string (newlink); replace_attr (&p, l->size, fp, quoted_newlink); DEBUGP (("TO_COMPLETE: to %s at position %d in %s.\n", newlink, l->pos, file)); xfree (quoted_newlink); + ++to_url_count; } } /* Output the rest of the file. */ @@ -1709,7 +1813,8 @@ convert_links (const char *file, urlpos *l) fwrite (p, 1, fm->length - (p - fm->content), fp); fclose (fp); read_file_free (fm); - logputs (LOG_VERBOSE, _("done.\n")); + logprintf (LOG_VERBOSE, + _("%d-%d\n"), to_file_count, to_url_count); } /* Construct and return a malloced copy of the relative link from two @@ -1766,20 +1871,6 @@ construct_relative (const char *s1, const char *s2) return res; } -/* Add URL to the head of the list L. */ -urlpos * -add_url (urlpos *l, const char *url, const char *file) -{ - urlpos *t; - - t = (urlpos *)xmalloc (sizeof (urlpos)); - memset (t, 0, sizeof (*t)); - t->url = xstrdup (url); - t->local_name = xstrdup (file); - t->next = l; - return t; -} - static void write_backup_file (const char *file, downloaded_file_t downloaded_file_return) { @@ -1850,15 +1941,9 @@ write_backup_file (const char *file, downloaded_file_t downloaded_file_return) -- Dan Harkless This [adding a field to the urlpos structure] didn't work - because convert_file() is called twice: once after all its - sublinks have been retrieved in recursive_retrieve(), and - once at the end of the day in convert_all_links(). The - original linked list collected in recursive_retrieve() is - lost after the first invocation of convert_links(), and - convert_all_links() makes a new one (it calls get_urls_html() - for each file it covers.) That's why your first approach didn't - work. The way to make it work is perhaps to make this flag a - field in the `urls_html' list. + because convert_file() is called from convert_all_links at + the end of the retrieval with a freshly built new urlpos + list. 
-- Hrvoje Niksic */ converted_file_ptr = xmalloc(sizeof(*converted_file_ptr)); @@ -1941,13 +2026,40 @@ find_fragment (const char *beg, int size, const char **bp, const char **ep) return 0; } -typedef struct _downloaded_file_list { - char* file; - downloaded_file_t download_type; - struct _downloaded_file_list* next; -} downloaded_file_list; +/* We're storing "modes" of type downloaded_file_t in the hash table. + However, our hash tables only accept pointers for keys and values. + So when we need a pointer, we use the address of a + downloaded_file_t variable of static storage. */ + +static downloaded_file_t * +downloaded_mode_to_ptr (downloaded_file_t mode) +{ + static downloaded_file_t + v1 = FILE_NOT_ALREADY_DOWNLOADED, + v2 = FILE_DOWNLOADED_NORMALLY, + v3 = FILE_DOWNLOADED_AND_HTML_EXTENSION_ADDED, + v4 = CHECK_FOR_FILE; + + switch (mode) + { + case FILE_NOT_ALREADY_DOWNLOADED: + return &v1; + case FILE_DOWNLOADED_NORMALLY: + return &v2; + case FILE_DOWNLOADED_AND_HTML_EXTENSION_ADDED: + return &v3; + case CHECK_FOR_FILE: + return &v4; + } + return NULL; +} + +/* This should really be merged with dl_file_url_map and + downloaded_html_files in recur.c. This was originally a list, but + I changed it to a hash table beause it was actually taking a lot of + time to find things in it. */ -static downloaded_file_list *downloaded_files; +static struct hash_table *downloaded_files_hash; /* Remembers which files have been downloaded. In the standard case, should be called with mode == FILE_DOWNLOADED_NORMALLY for each file we actually @@ -1962,46 +2074,47 @@ static downloaded_file_list *downloaded_files; it, call with mode == CHECK_FOR_FILE. Please be sure to call this function with local filenames, not remote URLs. */ downloaded_file_t -downloaded_file (downloaded_file_t mode, const char* file) +downloaded_file (downloaded_file_t mode, const char *file) { - boolean found_file = FALSE; - downloaded_file_list* rover = downloaded_files; + downloaded_file_t *ptr; - while (rover != NULL) - if (strcmp(rover->file, file) == 0) - { - found_file = TRUE; - break; - } - else - rover = rover->next; - - if (found_file) - return rover->download_type; /* file had already been downloaded */ - else + if (mode == CHECK_FOR_FILE) { - if (mode != CHECK_FOR_FILE) - { - rover = xmalloc(sizeof(*rover)); - rover->file = xstrdup(file); /* use xstrdup() so die on out-of-mem. 
*/ - rover->download_type = mode; - rover->next = downloaded_files; - downloaded_files = rover; - } - - return FILE_NOT_ALREADY_DOWNLOADED; + if (!downloaded_files_hash) + return FILE_NOT_ALREADY_DOWNLOADED; + ptr = hash_table_get (downloaded_files_hash, file); + if (!ptr) + return FILE_NOT_ALREADY_DOWNLOADED; + return *ptr; } + + if (!downloaded_files_hash) + downloaded_files_hash = make_string_hash_table (0); + + ptr = hash_table_get (downloaded_files_hash, file); + if (ptr) + return *ptr; + + ptr = downloaded_mode_to_ptr (mode); + hash_table_put (downloaded_files_hash, xstrdup (file), &ptr); + + return FILE_NOT_ALREADY_DOWNLOADED; +} + +static int +df_free_mapper (void *key, void *value, void *ignored) +{ + xfree (key); + return 0; } void downloaded_files_free (void) { - downloaded_file_list* rover = downloaded_files; - while (rover) + if (downloaded_files_hash) { - downloaded_file_list *next = rover->next; - xfree (rover->file); - xfree (rover); - rover = next; + hash_table_map (downloaded_files_hash, df_free_mapper, NULL); + hash_table_destroy (downloaded_files_hash); + downloaded_files_hash = NULL; } } diff --git a/src/url.h b/src/url.h index 9808b540..aed9bc38 100644 --- a/src/url.h +++ b/src/url.h @@ -72,11 +72,11 @@ enum convert_options { /* A structure that defines the whereabouts of a URL, i.e. its position in an HTML document, etc. */ -typedef struct _urlpos -{ - char *url; /* linked URL, after it has been - merged with the base */ - char *local_name; /* Local file to which it was saved */ +struct urlpos { + struct url *url; /* the URL of the link, after it has + been merged with the base */ + char *local_name; /* local file to which it was saved + (used by convert_links) */ /* Information about the original link: */ int link_relative_p; /* was the link relative? */ @@ -89,8 +89,8 @@ typedef struct _urlpos /* URL's position in the buffer. */ int pos, size; - struct _urlpos *next; /* Next struct in list */ -} urlpos; + struct urlpos *next; /* next list element */ +}; /* downloaded_file() takes a parameter of this type and returns this type. */ typedef enum @@ -126,9 +126,9 @@ int url_skip_uname PARAMS ((const char *)); char *url_string PARAMS ((const struct url *, int)); -urlpos *get_urls_file PARAMS ((const char *)); -urlpos *get_urls_html PARAMS ((const char *, const char *, int, int *)); -void free_urlpos PARAMS ((urlpos *)); +struct urlpos *get_urls_file PARAMS ((const char *)); +struct urlpos *get_urls_html PARAMS ((const char *, const char *, int, int *)); +void free_urlpos PARAMS ((struct urlpos *)); char *uri_merge PARAMS ((const char *, const char *)); @@ -136,11 +136,10 @@ void rotate_backups PARAMS ((const char *)); int mkalldirs PARAMS ((const char *)); char *url_filename PARAMS ((const struct url *)); -char *getproxy PARAMS ((uerr_t)); +char *getproxy PARAMS ((enum url_scheme)); int no_proxy_match PARAMS ((const char *, const char **)); -void convert_links PARAMS ((const char *, urlpos *)); -urlpos *add_url PARAMS ((urlpos *, const char *, const char *)); +void convert_links PARAMS ((const char *, struct urlpos *)); downloaded_file_t downloaded_file PARAMS ((downloaded_file_t, const char *)); diff --git a/src/utils.c b/src/utils.c index fca15800..444455a5 100644 --- a/src/utils.c +++ b/src/utils.c @@ -307,6 +307,18 @@ xstrdup_debug (const char *s, const char *source_file, int source_line) #endif /* DEBUG_MALLOC */ +/* Utility function: like xstrdup(), but also lowercases S. 
*/ + +char * +xstrdup_lower (const char *s) +{ + char *copy = xstrdup (s); + char *p = copy; + for (; *p; p++) + *p = TOLOWER (*p); + return copy; +} + /* Copy the string formed by two pointers (one on the beginning, other on the char after the last char) to a new, malloc-ed location. 0-terminate it. */ @@ -443,6 +455,8 @@ fork_to_background (void) } #endif /* not WINDOWS */ +#if 0 +/* debug */ char * ps (char *orig) { @@ -450,6 +464,7 @@ ps (char *orig) path_simplify (r); return r; } +#endif /* Canonicalize PATH, and return a new path. The new path differs from PATH in that: @@ -468,45 +483,31 @@ ps (char *orig) Change the original string instead of strdup-ing. React correctly when beginning with `./' and `../'. Don't zip out trailing slashes. */ -void +int path_simplify (char *path) { - register int i, start, ddot; + register int i, start; + int changes = 0; char stub_char; if (!*path) - return; + return 0; - /*stub_char = (*path == '/') ? '/' : '.';*/ stub_char = '/'; - /* Addition: Remove all `./'-s preceding the string. If `../'-s - precede, put `/' in front and remove them too. */ - i = 0; - ddot = 0; - while (1) - { - if (path[i] == '.' && path[i + 1] == '/') - i += 2; - else if (path[i] == '.' && path[i + 1] == '.' && path[i + 2] == '/') - { - i += 3; - ddot = 1; - } - else - break; - } - if (i) - strcpy (path, path + i - ddot); + if (path[0] == '/') + /* Preserve initial '/'. */ + ++path; - /* Replace single `.' or `..' with `/'. */ + /* Nix out leading `.' or `..' with. */ if ((path[0] == '.' && path[1] == '\0') || (path[0] == '.' && path[1] == '.' && path[2] == '\0')) { - path[0] = stub_char; - path[1] = '\0'; - return; + path[0] = '\0'; + changes = 1; + return changes; } + /* Walk along PATH looking for things to compact. */ i = 0; while (1) @@ -531,6 +532,7 @@ path_simplify (char *path) { strcpy (path + start + 1, path + i); i = start + 1; + changes = 1; } /* Check for `../', `./' or trailing `.' by itself. */ @@ -540,6 +542,7 @@ path_simplify (char *path) if (!path[i + 1]) { path[--i] = '\0'; + changes = 1; break; } @@ -548,6 +551,7 @@ path_simplify (char *path) { strcpy (path + i, path + i + 1); i = (start < 0) ? 0 : start; + changes = 1; continue; } @@ -556,12 +560,32 @@ path_simplify (char *path) (path[i + 2] == '/' || !path[i + 2])) { while (--start > -1 && path[start] != '/'); - strcpy (path + start + 1, path + i + 2); + strcpy (path + start + 1, path + i + 2 + (start == -1 && path[i + 2])); i = (start < 0) ? 0 : start; + changes = 1; continue; } } /* path == '.' */ } /* while */ + + /* Addition: Remove all `./'-s and `../'-s preceding the string. */ + i = 0; + while (1) + { + if (path[i] == '.' && path[i + 1] == '/') + i += 2; + else if (path[i] == '.' && path[i + 1] == '.' && path[i + 2] == '/') + i += 3; + else + break; + } + if (i) + { + strcpy (path, path + i - 0); + changes = 1; + } + + return changes; } /* "Touch" FILE, i.e. 
make its atime and mtime equal to the time diff --git a/src/utils.h b/src/utils.h index fedd2b33..ae7f47a3 100644 --- a/src/utils.h +++ b/src/utils.h @@ -48,12 +48,13 @@ char *datetime_str PARAMS ((time_t *)); void print_malloc_debug_stats (); #endif +char *xstrdup_lower PARAMS ((const char *)); char *strdupdelim PARAMS ((const char *, const char *)); char **sepstring PARAMS ((const char *)); int frontcmp PARAMS ((const char *, const char *)); char *pwd_cuserid PARAMS ((char *)); void fork_to_background PARAMS ((void)); -void path_simplify PARAMS ((char *)); +int path_simplify PARAMS ((char *)); void touch PARAMS ((const char *, time_t)); int remove_link PARAMS ((const char *)); @@ -98,4 +99,6 @@ long wtimer_granularity PARAMS ((void)); char *html_quote_string PARAMS ((const char *)); +int determine_screen_width PARAMS ((void)); + #endif /* UTILS_H */ diff --git a/src/wget.h b/src/wget.h index f0cabd16..a1f737a4 100644 --- a/src/wget.h +++ b/src/wget.h @@ -28,6 +28,11 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ # define NDEBUG /* To kill off assertions */ #endif /* not DEBUG */ +/* Define this if you want primitive but extensive malloc debugging. + It will make Wget extremely slow, so only do it in development + builds. */ +#undef DEBUG_MALLOC + #ifndef PARAMS # if PROTOTYPES # define PARAMS(args) args @@ -60,7 +65,7 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ 3) Finally, the debug messages are meant to be a clue for me to debug problems with Wget. If I get them in a language I don't - understand, debugging will become a new challenge of its own! :-) */ + understand, debugging will become a new challenge of its own! */ /* Include these, so random files need not include them. */
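
The '?' and '#' branches added to uri_merge_1() in src/url.c above can be tried out in isolation.  The sketch below is not part of the patch: merge_ref() and the little test harness are made up for illustration, and merely re-implement the two new rules over plain strings, checked against the examples quoted in the uri_merge_1 comments.  Compile and run it; it prints "ok" when every merge matches the documented result.

/* Illustrative sketch only -- not from the Wget sources.  A link of
   the form "?new" replaces everything from the first '?', ';' or '#'
   of the base onward; a link of the form "#new" replaces only the
   fragment and keeps any query string.  */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *
merge_ref (const char *base, const char *link)
{
  size_t baselength;
  char *constr;

  if (*link == '?')
    /* Keep only the path part of BASE, up to the first '?', ';' or '#'. */
    baselength = strcspn (base, "?;#");
  else if (*link == '#')
    /* Keep path and query of BASE; drop any old fragment. */
    baselength = strcspn (base, "#");
  else
    return NULL;                /* other link forms are handled elsewhere */

  constr = (char *) malloc (baselength + strlen (link) + 1);
  assert (constr != NULL);
  memcpy (constr, base, baselength);
  strcpy (constr + baselength, link);
  return constr;
}

int
main (void)
{
  /* Expected results are the examples quoted in the uri_merge_1
     comments above.  (The throwaway strings are never freed.)  */
  assert (!strcmp (merge_ref ("path",         "?new"), "path?new"));
  assert (!strcmp (merge_ref ("path?foo",     "?new"), "path?new"));
  assert (!strcmp (merge_ref ("path?foo#bar", "?new"), "path?new"));
  assert (!strcmp (merge_ref ("path#foo",     "?new"), "path?new"));
  assert (!strcmp (merge_ref ("path",         "#new"), "path#new"));
  assert (!strcmp (merge_ref ("path#foo",     "#new"), "path#new"));
  assert (!strcmp (merge_ref ("path?foo",     "#new"), "path?foo#new"));
  assert (!strcmp (merge_ref ("path?foo#bar", "#new"), "path?foo#new"));
  puts ("ok");
  return 0;
}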
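
The rewrite of downloaded_file() in src/url.c above stores enum values in a pointer-valued hash table by handing out the address of a static downloaded_file_t per mode.  The sketch below shows that trick on its own.  The tiny string map (map_put/map_get) is a stand-in invented for the example, not Wget's hash.h API, and the value stored is the mode pointer itself, so that dereferencing whatever the lookup returns yields the recorded mode.

/* Illustrative sketch only -- not from the Wget sources.  A map that
   can hold nothing but void* values still records an enum per key if
   every enum value is represented by the address of a static variable
   holding it.  */

#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum
{
  FILE_NOT_ALREADY_DOWNLOADED,
  FILE_DOWNLOADED_NORMALLY,
  FILE_DOWNLOADED_AND_HTML_EXTENSION_ADDED
} downloaded_file_t;

/* Toy string->pointer map, a stand-in for the real hash table. */
#define MAP_MAX 64
static struct { const char *key; void *value; } map[MAP_MAX];
static int map_len;

static void
map_put (const char *key, void *value)
{
  assert (map_len < MAP_MAX);
  map[map_len].key = key;
  map[map_len].value = value;
  ++map_len;
}

static void *
map_get (const char *key)
{
  int i;
  for (i = 0; i < map_len; i++)
    if (!strcmp (map[i].key, key))
      return map[i].value;
  return NULL;
}

/* The address-of-static trick: one static object per enum value. */
static downloaded_file_t *
mode_to_ptr (downloaded_file_t mode)
{
  static downloaded_file_t
    v1 = FILE_NOT_ALREADY_DOWNLOADED,
    v2 = FILE_DOWNLOADED_NORMALLY,
    v3 = FILE_DOWNLOADED_AND_HTML_EXTENSION_ADDED;

  switch (mode)
    {
    case FILE_NOT_ALREADY_DOWNLOADED:
      return &v1;
    case FILE_DOWNLOADED_NORMALLY:
      return &v2;
    case FILE_DOWNLOADED_AND_HTML_EXTENSION_ADDED:
      return &v3;
    }
  return NULL;
}

int
main (void)
{
  downloaded_file_t *ptr;

  /* Record that "index.html" was downloaded normally.  The stored
     value is the mode pointer itself, so a reader simply dereferences
     what map_get returns.  */
  map_put ("index.html", mode_to_ptr (FILE_DOWNLOADED_NORMALLY));

  ptr = map_get ("index.html");
  assert (ptr && *ptr == FILE_DOWNLOADED_NORMALLY);
  assert (map_get ("other.html") == NULL);

  puts ("ok");
  return 0;
}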