Published in <sxsherjczw2.fsf@florida.arsdigita.de>.
+2001-11-25 Hrvoje Niksic <hniksic@arsdigita.com>
+
+ * TODO: Ditto.
+
+ * NEWS: Updated with the latest stuff.
+
2001-11-23 Hrvoje Niksic <hniksic@arsdigita.com>
* po/hr.po: A major overhaul.
\f
* Changes in Wget 1.8.
+** "Recursive retrieval" now uses a breadth-first algorithm.
+Recursive downloads are faster and consume *significantly* less memory
+than before.
+
** A new progress indicator is now available. Try it with
--progress=bar or using `progress = bar' in `.wgetrc'.
+** Host directories now contain port information if the URL is at a
+non-standard port.
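+For example, a recursive download from http://host:8000/ is now
+saved under `host:8000/' rather than `host/'.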
+
+** Wget now supports the robots.txt directives specified in
+<http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html>.
+
** URL parser has been fixed, especially the infamous overzealous
quoting bug. Wget no longer dequotes reserved characters, e.g. `%3F'
is no longer translated to `?', nor `%2B' to `+'. Unsafe characters
file, though forcibly disconnecting from the server at the desired endpoint
might be workable).
-* RFC 1738 says that if logging on to an FTP server puts you in a directory
- other than '/', the way to specify a file relative to '/' in a URL (let's use
- "/bin/ls" in this example) is "ftp://host/%2Fbin/ls". Wget needs to support
- this (and ideally not consider "ftp://host//bin/ls" to be equivalent, as that
- would equate to the command "CWD " rather than "CWD /"). To accomodate people
- used to broken FTP clients like Internet Explorer and Netscape, if
- "ftp://host/bin/ls" doesn't exist, Wget should try again (perhaps under
- control of an option), acting as if the user had typed "ftp://host/%2Fbin/ls".
-
* If multiple FTP URLs are specified that are on the same host, Wget should
re-use the connection rather than opening a new one for each file.
* Limit the number of successive redirections to a maximum of 20 or so.
-* If -c used on a file that's already completely downloaded, don't re-download
- it (unless normal --timestamping processing would cause you to do so).
-
* If -c is used with -N, check to make sure a file hasn't changed on the server
before "continuing" to download it (preventing a bogus hybrid file).
-* Take a look at
- <http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html>
- and support the new directives.
-
* Generalize --html-extension to something like --mime-extensions and have it
look at mime.types/mimecap file for preferred extension. Non-HTML files with
filenames changed this way would be re-downloaded each time despite -N unless
turning it off. Get rid of `--foo=no' stuff. Short options would
be handled as `-x' vs. `-nx'.
-* Implement "thermometer" display (not all that hard; use an
- alternative show_progress() if the output goes to a terminal.)
-
* Add option to only list wildcard matches without doing the download.
* Add case-insensitivity as an option.
* Allow time-stamping by arbitrary date.
-* Fix Unix directory parser to allow for spaces in file names.
-
* Allow size limit to files (perhaps with an option to download oversize files
up through the limit or not at all, to get more functionality than [u]limit).
-* Implement breadth-first retrieval.
-
* Download to .in* when mirroring.
* Add an option to delete or move no-longer-existent files when mirroring.
-* Implement a switch to avoid downloading multiple files (e.g. x and x.gz).
-
* Implement uploading (--upload URL?) in FTP and HTTP.
* Rewrite FTP code to allow for easy addition of new commands. It
* Implement a concept of "packages" a la mirror.
-* Implement correct RFC1808 URL parsing.
-
-* Implement more HTTP/1.1 bells and whistles (ETag, Content-MD5 etc.)
-
-* Add a "rollback" option to have --continue throw away a configurable number of
- bytes at the end of a file before resuming download. Apparently, some stupid
- proxies insert a "transfer interrupted" string we need to get rid of.
+* Add a "rollback" option to have continued retrieval throw away a
+ configurable number of bytes at the end of a file before resuming
+ download. Apparently, some stupid proxies insert a "transfer
+ interrupted" string we need to get rid of.
* When using --accept and --reject, you can end up with empty directories. Have
Wget delete any such at the end.
+2001-11-25 Hrvoje Niksic <hniksic@arsdigita.com>
+
+ * url.c (reencode_string): Use unsigned char, not char --
+ otherwise the hex digits come out wrong for 8-bit chars such as
+ nbsp.
+ (lowercase_str): New function.
+ (url_parse): Canonicalize u->url if needed.
+ (get_urls_file): Parse each URL, and return only the valid ones.
+ (free_urlpos): Call url_free.
+ (mkstruct): Add :port if the port is non-standard.
+ (mkstruct): Append the query string to the file name, if any.
+ (urlpath_length): Use strpbrk_or_eos.
+ (uri_merge_1): Handle the cases where LINK is an empty string,
+ where LINK consists only of query, and where LINK consists only of
+ fragment.
+ (convert_links): Count and report both kinds of conversion.
+ (downloaded_file): Use a hash table, not a list.
+ (downloaded_files_free): Free the hash table.
+
+ * retr.c (retrieve_from_file): Ditto.
+
+ * main.c (main): Call either retrieve_url or retrieve_tree
+ for each URL, not both.
+
+ * retr.c (register_all_redirections): New function.
+ (register_redirections_mapper): Ditto.
+ (retrieve_url): Register the redirections.
+ (retrieve_url): Make the string "Error parsing proxy ..."
+ translatable.
+
+ * res.c (add_path): Strip leading slash from robots.txt paths so
+ that the path representations are "compatible".
+ (free_specs): Free each individual path, too.
+ (res_cleanup): New function.
+ (cleanup_hash_table_mapper): Ditto.
+
+ * recur.c (url_queue_new): New function.
+ (url_queue_delete): Ditto.
+ (url_enqueue): Ditto.
+ (url_dequeue): Ditto.
+ (retrieve_tree): New function, replacement for recursive_retrieve.
+ (descend_url_p): New function.
+ (register_redirection): New function.
+
+ * progress.c (create_image): Cosmetic changes.
+
+ * init.c (cleanup): Do all those complex cleanups only if
+ DEBUG_MALLOC is defined.
+
+ * main.c: Removed --simple-check and the corresponding
+ simple_host_check in init.c.
+
+ * html-url.c (handle_link): Parse the URL here, and propagate the
+ parsed URL to the caller, who would otherwise have to parse it
+ again.
+
+ * host.c (xstrdup_lower): Moved to utils.c.
+ (realhost): Removed.
+ (same_host): Ditto.
+
+2001-11-24 Hrvoje Niksic <hniksic@arsdigita.com>
+
+	* utils.c (path_simplify): Preserve the (non-)existence of
+ leading slash. Return non-zero if changes were made.
+
2001-11-24 Hrvoje Niksic <hniksic@arsdigita.com>
* progress.c (bar_update): Don't modify bp->total_length if it is
gnu-md5$o: wget.h gnu-md5.h
mswindows$o: wget.h url.h
netrc$o: wget.h utils.h netrc.h init.h
+progress$o: wget.h progress.h utils.h retr.h
rbuf$o: wget.h rbuf.h connect.h
recur$o: wget.h url.h recur.h utils.h retr.h ftp.h fnmatch.h host.h hash.h
+res$o: wget.h utils.h hash.h url.h retr.h res.h
retr$o: wget.h utils.h retr.h url.h recur.h ftp.h host.h connect.h hash.h
snprintf$o:
safe-ctype$o: safe-ctype.h
#endif
/* Mapping between all known hosts to their addresses (n.n.n.n). */
+
+/* #### We should map to *lists* of IP addresses. */
+
struct hash_table *host_name_address_map;
+/* The following two tables are obsolete, since we no longer do host
+ canonicalization. */
+
/* Mapping between all known addresses (n.n.n.n) to their hosts. This
is the inverse of host_name_address_map. These two tables share
the strdup'ed strings. */
/* Mapping between auxiliary (slave) and master host names. */
struct hash_table *host_slave_master_map;
-/* Utility function: like xstrdup(), but also lowercases S. */
-
-static char *
-xstrdup_lower (const char *s)
-{
- char *copy = xstrdup (s);
- char *p = copy;
- for (; *p; p++)
- *p = TOLOWER (*p);
- return copy;
-}
-
/* The same as gethostbyname, but supports internet addresses of the
form `N.N.N.N'. On some systems gethostbyname() knows how to do
this automatically. */
return 1;
}
-/* Determine the "real" name of HOST, as perceived by Wget. If HOST
- is referenced by more than one name, "real" name is considered to
- be the first one encountered in the past. */
-char *
-realhost (const char *host)
-{
- struct in_addr in;
- struct hostent *hptr;
- char *master_name;
-
- DEBUGP (("Checking for %s in host_name_address_map.\n", host));
- if (hash_table_contains (host_name_address_map, host))
- {
- DEBUGP (("Found; %s was already used, by that name.\n", host));
- return xstrdup_lower (host);
- }
-
- DEBUGP (("Checking for %s in host_slave_master_map.\n", host));
- master_name = hash_table_get (host_slave_master_map, host);
- if (master_name)
- {
- has_master:
- DEBUGP (("Found; %s was already used, by the name %s.\n",
- host, master_name));
- return xstrdup (master_name);
- }
-
- DEBUGP (("First time I hear about %s by that name; looking it up.\n",
- host));
- hptr = ngethostbyname (host);
- if (hptr)
- {
- char *inet_s;
- /* Originally, we copied to in.s_addr, but it appears to be
- missing on some systems. */
- memcpy (&in, *hptr->h_addr_list, sizeof (in));
- inet_s = inet_ntoa (in);
-
- add_host_to_cache (host, inet_s);
-
- /* add_host_to_cache() can establish a slave-master mapping. */
- DEBUGP (("Checking again for %s in host_slave_master_map.\n", host));
- master_name = hash_table_get (host_slave_master_map, host);
- if (master_name)
- goto has_master;
- }
-
- return xstrdup_lower (host);
-}
-
-/* Compare two hostnames (out of URL-s if the arguments are URL-s),
- taking care of aliases. It uses realhost() to determine a unique
- hostname for each of two hosts. If simple_check is non-zero, only
- strcmp() is used for comparison. */
-int
-same_host (const char *u1, const char *u2)
-{
- const char *s;
- char *p1, *p2;
- char *real1, *real2;
-
- /* Skip protocol, if present. */
- u1 += url_skip_scheme (u1);
- u2 += url_skip_scheme (u2);
-
- /* Skip username ans password, if present. */
- u1 += url_skip_uname (u1);
- u2 += url_skip_uname (u2);
-
- for (s = u1; *u1 && *u1 != '/' && *u1 != ':'; u1++);
- p1 = strdupdelim (s, u1);
- for (s = u2; *u2 && *u2 != '/' && *u2 != ':'; u2++);
- p2 = strdupdelim (s, u2);
- DEBUGP (("Comparing hosts %s and %s...\n", p1, p2));
- if (strcasecmp (p1, p2) == 0)
- {
- xfree (p1);
- xfree (p2);
- DEBUGP (("They are quite alike.\n"));
- return 1;
- }
- else if (opt.simple_check)
- {
- xfree (p1);
- xfree (p2);
- DEBUGP (("Since checking is simple, I'd say they are not the same.\n"));
- return 0;
- }
- real1 = realhost (p1);
- real2 = realhost (p2);
- xfree (p1);
- xfree (p2);
- if (strcasecmp (real1, real2) == 0)
- {
- DEBUGP (("They are alike, after realhost()->%s.\n", real1));
- xfree (real1);
- xfree (real2);
- return 1;
- }
- else
- {
- DEBUGP (("They are not the same (%s, %s).\n", real1, real2));
- xfree (real1);
- xfree (real2);
- return 0;
- }
-}
-
/* Determine whether a URL is acceptable to be followed, according to
a list of domains to accept. */
int
}
void
-clean_hosts (void)
+host_cleanup (void)
{
/* host_name_address_map and host_address_name_map share the
strings. Because of that, calling free_keys_and_values once
struct hostent *ngethostbyname PARAMS ((const char *));
int store_hostaddress PARAMS ((unsigned char *, const char *));
-void clean_hosts PARAMS ((void));
+void host_cleanup PARAMS ((void));
-char *realhost PARAMS ((const char *));
-int same_host PARAMS ((const char *, const char *));
int accept_domain PARAMS ((struct url *));
int sufmatch PARAMS ((const char **, const char *));
-char *ftp_getaddress PARAMS ((void));
-
char *herrmsg PARAMS ((int));
#endif /* HOST_H */
char *text; /* HTML text. */
char *base; /* Base URI of the document, possibly
changed through <base href=...>. */
- urlpos *head, *tail; /* List of URLs */
+ struct urlpos *head, *tail; /* List of URLs */
const char *parent_base; /* Base of the current document. */
const char *document_file; /* File name of this document. */
int dash_p_leaf_HTML; /* Whether -p is specified, and this
handle_link (struct collect_urls_closure *closure, const char *link_uri,
struct taginfo *tag, int attrid)
{
- int no_scheme = !url_has_scheme (link_uri);
- urlpos *newel;
-
+ int link_has_scheme = url_has_scheme (link_uri);
+ struct urlpos *newel;
const char *base = closure->base ? closure->base : closure->parent_base;
- char *complete_uri;
-
- char *fragment = strrchr (link_uri, '#');
-
- if (fragment)
- {
- /* Nullify the fragment identifier, i.e. everything after the
- last occurrence of `#', inclusive. This copying is
- relatively inefficient, but it doesn't matter because
- fragment identifiers don't come up all that often. */
- int hashlen = fragment - link_uri;
- char *p = alloca (hashlen + 1);
- memcpy (p, link_uri, hashlen);
- p[hashlen] = '\0';
- link_uri = p;
- }
+ struct url *url;
if (!base)
{
- if (no_scheme)
+ DEBUGP (("%s: no base, merge will use \"%s\".\n",
+ closure->document_file, link_uri));
+
+ if (!link_has_scheme)
{
/* We have no base, and the link does not have a host
attached to it. Nothing we can do. */
/* #### Should we print a warning here? Wget 1.5.x used to. */
return;
}
- else
- complete_uri = xstrdup (link_uri);
+
+ url = url_parse (link_uri, NULL);
+ if (!url)
+ {
+ DEBUGP (("%s: link \"%s\" doesn't parse.\n",
+ closure->document_file, link_uri));
+ return;
+ }
}
else
- complete_uri = uri_merge (base, link_uri);
+ {
+      /* Merge BASE with LINK_URI, but also make sure the result is
+      canonicalized, i.e. that "../" have been resolved.
+      (url_parse will do that for us.) */
+
+ char *complete_uri = uri_merge (base, link_uri);
+
+ DEBUGP (("%s: merge(\"%s\", \"%s\") -> %s\n",
+ closure->document_file, base, link_uri, complete_uri));
- DEBUGP (("%s: merge(\"%s\", \"%s\") -> %s\n",
- closure->document_file, base ? base : "(null)",
- link_uri, complete_uri));
+ url = url_parse (complete_uri, NULL);
+ if (!url)
+ {
+ DEBUGP (("%s: merged link \"%s\" doesn't parse.\n",
+ closure->document_file, complete_uri));
+ xfree (complete_uri);
+ return;
+ }
+ xfree (complete_uri);
+ }
- newel = (urlpos *)xmalloc (sizeof (urlpos));
+ newel = (struct urlpos *)xmalloc (sizeof (struct urlpos));
memset (newel, 0, sizeof (*newel));
newel->next = NULL;
- newel->url = complete_uri;
+ newel->url = url;
newel->pos = tag->attrs[attrid].value_raw_beginning - closure->text;
newel->size = tag->attrs[attrid].value_raw_size;
/* A URL is relative if the host is not named, and the name does not
start with `/'. */
- if (no_scheme && *link_uri != '/')
+ if (!link_has_scheme && *link_uri != '/')
newel->link_relative_p = 1;
- else if (!no_scheme)
+ else if (link_has_scheme)
newel->link_complete_p = 1;
if (closure->tail)
If dash_p_leaf_HTML is non-zero, only the elements needed to render
FILE ("non-external" links) will be returned. */
-urlpos *
+struct urlpos *
get_urls_html (const char *file, const char *this_url, int dash_p_leaf_HTML,
int *meta_disallow_follow)
{
if (((suf = suffix (*hstat.local_file)) != NULL)
&& (!strcmp (suf, "html") || !strcmp (suf, "htm")))
*dt |= TEXTHTML;
- xfree (suf);
+ FREE_MAYBE (suf);
FREE_MAYBE (dummy);
return RETROK;
}
{ "savecookies", &opt.cookies_output, cmd_file },
{ "saveheaders", &opt.save_headers, cmd_boolean },
{ "serverresponse", &opt.server_response, cmd_boolean },
- { "simplehostcheck", &opt.simple_check, cmd_boolean },
{ "spanhosts", &opt.spanhost, cmd_boolean },
{ "spider", &opt.spider, cmd_boolean },
#ifdef HAVE_SSL
}
\f
void cleanup_html_url PARAMS ((void));
+void res_cleanup PARAMS ((void));
void downloaded_files_free PARAMS ((void));
void
cleanup (void)
{
- extern acc_t *netrc_list;
+ /* Free external resources, close files, etc. */
- recursive_cleanup ();
- clean_hosts ();
- free_netrc (netrc_list);
if (opt.dfp)
fclose (opt.dfp);
+
+ /* We're exiting anyway so there's no real need to call free()
+ hundreds of times. Skipping the frees will make Wget exit
+ faster.
+
+ However, when detecting leaks, it's crucial to free() everything
+ because then you can find the real leaks, i.e. the allocated
+ memory which grows with the size of the program. */
+
+#ifdef DEBUG_MALLOC
+ recursive_cleanup ();
+ res_cleanup ();
+ host_cleanup ();
+ {
+ extern acc_t *netrc_list;
+ free_netrc (netrc_list);
+ }
cleanup_html_url ();
downloaded_files_free ();
cookies_cleanup ();
free_vec (opt.domains);
free_vec (opt.follow_tags);
free_vec (opt.ignore_tags);
+ FREE_MAYBE (opt.progress_type);
xfree (opt.ftp_acc);
FREE_MAYBE (opt.ftp_pass);
FREE_MAYBE (opt.ftp_proxy);
FREE_MAYBE (opt.bind_address);
FREE_MAYBE (opt.cookies_input);
FREE_MAYBE (opt.cookies_output);
+#endif
}
case 149:
setval ("removelisting", "off");
break;
- case 150:
- setval ("simplehostcheck", "on");
- break;
case 155:
setval ("bindaddress", optarg);
break;
break;
case 'n':
{
- /* #### The n? options are utter crock! */
+ /* #### What we really want here is --no-foo. */
char *p;
for (p = optarg; *p; p++)
case 'v':
setval ("verbose", "off");
break;
- case 'h':
- setval ("simplehostcheck", "on");
- break;
case 'H':
setval ("addhostdir", "off");
break;
#endif /* HAVE_SIGNAL */
status = RETROK; /* initialize it, just-in-case */
- recursive_reset ();
+ /*recursive_reset ();*/
/* Retrieve the URLs from argument list. */
for (t = url; *t; t++)
{
- char *filename, *redirected_URL;
+ char *filename = NULL, *redirected_URL = NULL;
int dt;
- status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);
- if (opt.recursive && status == RETROK && (dt & TEXTHTML))
- status = recursive_retrieve (filename,
- redirected_URL ? redirected_URL : *t);
+ if (opt.recursive && url_scheme (*t) != SCHEME_FTP)
+ status = retrieve_tree (*t);
+ else
+ status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);
if (opt.delete_after && file_exists_p(filename))
{
int relative_only; /* Follow only relative links. */
int no_parent; /* Restrict access to the parent
directory. */
- int simple_check; /* Should we use simple checking
- (strcmp) or do we create a host
- hash and call gethostbyname? */
int reclevel; /* Maximum level of recursion */
int dirstruct; /* Do we build the directory structure
as we go along? */
# include <strings.h>
#endif /* HAVE_STRING_H */
#include <assert.h>
+#ifdef HAVE_UNISTD_H
+# include <unistd.h>
+#endif
#include "wget.h"
#include "progress.h"
Calculate its geometry:
"xxx% " - percentage - 5 chars
- "| ... | " - progress bar decorations - 3 chars
+ "| ... |" - progress bar decorations - 2 chars
"1012.56 K/s " - dl rate - 12 chars
"nnnn " - downloaded bytes - 11 chars
"ETA: xx:xx:xx" - ETA - 13 chars
"=====>..." - progress bar content - the rest
*/
- int progress_len = screen_width - (5 + 3 + 12 + 11 + 13);
+ int progress_len = screen_width - (5 + 2 + 12 + 11 + 13);
if (progress_len < 7)
progress_len = 0;
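
(Sanity check of the geometry arithmetic above, assuming the usual
80-column screen: 80 - (5 + 2 + 12 + 11 + 13) = 80 - 43 = 37 columns
remain for the "=====>..." bar content.)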
}
else
{
- strcpy (p, "----.-- K/s ");
+      strcpy (p, "  --.-- K/s ");
p += 12;
}
/* Handling of recursive HTTP retrieving.
- Copyright (C) 1995, 1996, 1997, 2000 Free Software Foundation, Inc.
+ Copyright (C) 1995, 1996, 1997, 2000, 2001 Free Software Foundation, Inc.
This file is part of GNU Wget.
static struct hash_table *dl_url_file_map;
/* List of HTML files downloaded in this Wget run. Used for link
- conversion after Wget is done. */
+ conversion after Wget is done. This list should only be traversed
+ in order. If you need to check whether a file has been downloaded,
+ use a hash table, e.g. dl_file_url_map. */
static slist *downloaded_html_files;
+\f
+/* Functions for maintaining the URL queue. */
-/* List of undesirable-to-load URLs. */
-static struct hash_table *undesirable_urls;
+struct queue_element {
+ const char *url;
+ const char *referer;
+ int depth;
+ struct queue_element *next;
+};
-/* Current recursion depth. */
-static int depth;
+struct url_queue {
+ struct queue_element *head;
+ struct queue_element *tail;
+ int count, maxcount;
+};
-/* Base directory we're recursing from (used by no_parent). */
-static char *base_dir;
+/* Create a URL queue. */
-static int first_time = 1;
+static struct url_queue *
+url_queue_new (void)
+{
+ struct url_queue *queue = xmalloc (sizeof (*queue));
+ memset (queue, '\0', sizeof (*queue));
+ return queue;
+}
+/* Delete a URL queue. */
-/* Cleanup the data structures associated with recursive retrieving
- (the variables above). */
-void
-recursive_cleanup (void)
+static void
+url_queue_delete (struct url_queue *queue)
{
- if (undesirable_urls)
- {
- string_set_free (undesirable_urls);
- undesirable_urls = NULL;
- }
- if (dl_file_url_map)
- {
- free_keys_and_values (dl_file_url_map);
- hash_table_destroy (dl_file_url_map);
- dl_file_url_map = NULL;
- }
- if (dl_url_file_map)
- {
- free_keys_and_values (dl_url_file_map);
- hash_table_destroy (dl_url_file_map);
- dl_url_file_map = NULL;
- }
- undesirable_urls = NULL;
- slist_free (downloaded_html_files);
- downloaded_html_files = NULL;
- FREE_MAYBE (base_dir);
- first_time = 1;
+ xfree (queue);
}
-/* Reset FIRST_TIME to 1, so that some action can be taken in
- recursive_retrieve(). */
-void
-recursive_reset (void)
+/* Enqueue a URL in the queue. The queue is FIFO: the items will be
+ retrieved ("dequeued") from the queue in the order they were placed
+ into it. */
+
+static void
+url_enqueue (struct url_queue *queue,
+ const char *url, const char *referer, int depth)
{
- first_time = 1;
+ struct queue_element *qel = xmalloc (sizeof (*qel));
+ qel->url = url;
+ qel->referer = referer;
+ qel->depth = depth;
+ qel->next = NULL;
+
+ ++queue->count;
+ if (queue->count > queue->maxcount)
+ queue->maxcount = queue->count;
+
+ DEBUGP (("Enqueuing %s at depth %d\n", url, depth));
+ DEBUGP (("Queue count %d, maxcount %d.\n", queue->count, queue->maxcount));
+
+ if (queue->tail)
+ queue->tail->next = qel;
+ queue->tail = qel;
+
+ if (!queue->head)
+ queue->head = queue->tail;
}
-/* The core of recursive retrieving. Endless recursion is avoided by
- having all URLs stored to a linked list of URLs, which is checked
- before loading any URL. That way no URL can get loaded twice.
+/* Take a URL out of the queue. Return 1 if this operation succeeded,
+ or 0 if the queue is empty. */
+
+static int
+url_dequeue (struct url_queue *queue,
+ const char **url, const char **referer, int *depth)
+{
+ struct queue_element *qel = queue->head;
+
+ if (!qel)
+ return 0;
+
+ queue->head = queue->head->next;
+ if (!queue->head)
+ queue->tail = NULL;
+
+ *url = qel->url;
+ *referer = qel->referer;
+ *depth = qel->depth;
+
+ --queue->count;
+
+ DEBUGP (("Dequeuing %s at depth %d\n", qel->url, qel->depth));
+ DEBUGP (("Queue count %d, maxcount %d.\n", queue->count, queue->maxcount));
+
+ xfree (qel);
+ return 1;
+}
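
To illustrate the FIFO discipline that makes the new traversal
breadth-first, here is a minimal, self-contained sketch of the same
queue shape (plain libc in place of Wget's xmalloc/DEBUGP machinery;
the referer field is omitted and the driver in main() is hypothetical,
not part of the patch):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct queue_element {
  const char *url;
  int depth;
  struct queue_element *next;
};

struct url_queue {
  struct queue_element *head, *tail;
};

/* Append an element at the tail. */
static void
url_enqueue (struct url_queue *queue, const char *url, int depth)
{
  struct queue_element *qel = malloc (sizeof (*qel));
  qel->url = strdup (url);
  qel->depth = depth;
  qel->next = NULL;
  if (queue->tail)
    queue->tail->next = qel;
  queue->tail = qel;
  if (!queue->head)
    queue->head = queue->tail;
}

/* Remove an element from the head; return 0 when the queue is empty. */
static int
url_dequeue (struct url_queue *queue, const char **url, int *depth)
{
  struct queue_element *qel = queue->head;
  if (!qel)
    return 0;
  queue->head = qel->next;
  if (!queue->head)
    queue->tail = NULL;
  *url = qel->url;
  *depth = qel->depth;
  free (qel);
  return 1;
}

int
main (void)
{
  struct url_queue q = { NULL, NULL };
  const char *url;
  int depth;

  url_enqueue (&q, "http://host/", 0);
  url_enqueue (&q, "http://host/a.html", 1);
  url_enqueue (&q, "http://host/b.html", 1);

  /* FIFO order: everything at depth 0 comes out before anything at
     depth 1, which is exactly what makes the traversal breadth-first. */
  while (url_dequeue (&q, &url, &depth))
    {
      printf ("depth %d: %s\n", depth, url);
      free ((char *) url);
    }
  return 0;
}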
+
+static int descend_url_p PARAMS ((const struct urlpos *, struct url *, int,
+ struct url *, struct hash_table *));
+
+/* Retrieve a part of the web beginning with START_URL. This used to
+ be called "recursive retrieval", because the old function was
+ recursive and implemented depth-first search. retrieve_tree on the
+   other hand implements breadth-first traversal of the tree, which
+ results in much nicer ordering of downloads.
+
+ The algorithm this function uses is simple:
+
+ 1. put START_URL in the queue.
+ 2. while there are URLs in the queue:
+
+ 3. get next URL from the queue.
+ 4. download it.
+ 5. if the URL is HTML and its depth does not exceed maximum depth,
+ get the list of URLs embedded therein.
+ 6. for each of those URLs do the following:
+
+ 7. if the URL is not one of those downloaded before, and if it
+ satisfies the criteria specified by the various command-line
+ options, add it to the queue. */
- The function also supports specification of maximum recursion depth
- and a number of other goodies. */
uerr_t
-recursive_retrieve (const char *file, const char *this_url)
+retrieve_tree (const char *start_url)
{
- char *constr, *filename, *newloc;
- char *canon_this_url = NULL;
- int dt, inl, dash_p_leaf_HTML = FALSE;
- int meta_disallow_follow;
- int this_url_ftp; /* See below the explanation */
- urlpos *url_list, *cur_url;
- struct url *u;
-
- assert (this_url != NULL);
- assert (file != NULL);
- /* If quota was exceeded earlier, bail out. */
- if (downloaded_exceeds_quota ())
- return QUOTEXC;
- /* Cache the current URL in the list. */
- if (first_time)
- {
- /* These three operations need to be done only once per Wget
- run. They should probably be at a different location. */
- if (!undesirable_urls)
- undesirable_urls = make_string_hash_table (0);
-
- hash_table_clear (undesirable_urls);
- string_set_add (undesirable_urls, this_url);
- /* Enter this_url to the hash table, in original and "enhanced" form. */
- u = url_parse (this_url, NULL);
- if (u)
- {
- string_set_add (undesirable_urls, u->url);
- if (opt.no_parent)
- base_dir = xstrdup (u->dir); /* Set the base dir. */
- /* Set the canonical this_url to be sent as referer. This
- problem exists only when running the first time. */
- canon_this_url = xstrdup (u->url);
- }
- else
- {
- DEBUGP (("Double yuck! The *base* URL is broken.\n"));
- base_dir = NULL;
- }
- url_free (u);
- depth = 1;
- first_time = 0;
- }
- else
- ++depth;
+ uerr_t status = RETROK;
- if (opt.reclevel != INFINITE_RECURSION && depth > opt.reclevel)
- /* We've exceeded the maximum recursion depth specified by the user. */
- {
- if (opt.page_requisites && depth <= opt.reclevel + 1)
- /* When -p is specified, we can do one more partial recursion from the
- "leaf nodes" on the HTML document tree. The recursion is partial in
- that we won't traverse any <A> or <AREA> tags, nor any <LINK> tags
- except for <LINK REL="stylesheet">. */
- dash_p_leaf_HTML = TRUE;
- else
- /* Either -p wasn't specified or it was and we've already gone the one
- extra (pseudo-)level that it affords us, so we need to bail out. */
- {
- DEBUGP (("Recursion depth %d exceeded max. depth %d.\n",
- depth, opt.reclevel));
- --depth;
- return RECLEVELEXC;
- }
- }
+ /* The queue of URLs we need to load. */
+ struct url_queue *queue = url_queue_new ();
- /* Determine whether this_url is an FTP URL. If it is, it means
- that the retrieval is done through proxy. In that case, FTP
- links will be followed by default and recursion will not be
- turned off when following them. */
- this_url_ftp = (url_scheme (this_url) == SCHEME_FTP);
+ /* The URLs we decided we don't want to load. */
+ struct hash_table *blacklist = make_string_hash_table (0);
- /* Get the URL-s from an HTML file: */
- url_list = get_urls_html (file, canon_this_url ? canon_this_url : this_url,
- dash_p_leaf_HTML, &meta_disallow_follow);
+ /* We'll need various components of this, so better get it over with
+ now. */
+ struct url *start_url_parsed = url_parse (start_url, NULL);
- if (opt.use_robots && meta_disallow_follow)
- {
- /* The META tag says we are not to follow this file. Respect
- that. */
- free_urlpos (url_list);
- url_list = NULL;
- }
+ url_enqueue (queue, xstrdup (start_url), NULL, 0);
+ string_set_add (blacklist, start_url);
- /* Decide what to do with each of the URLs. A URL will be loaded if
- it meets several requirements, discussed later. */
- for (cur_url = url_list; cur_url; cur_url = cur_url->next)
+ while (1)
{
- /* If quota was exceeded earlier, bail out. */
+ int descend = 0;
+ char *url, *referer, *file = NULL;
+ int depth;
+      int dash_p_leaf_HTML = FALSE;
+
if (downloaded_exceeds_quota ())
break;
- /* Parse the URL for convenient use in other functions, as well
- as to get the optimized form. It also checks URL integrity. */
- u = url_parse (cur_url->url, NULL);
- if (!u)
- {
- DEBUGP (("Yuck! A bad URL.\n"));
- continue;
- }
- assert (u->url != NULL);
- constr = xstrdup (u->url);
-
- /* Several checkings whether a file is acceptable to load:
- 1. check if URL is ftp, and we don't load it
- 2. check for relative links (if relative_only is set)
- 3. check for domain
- 4. check for no-parent
- 5. check for excludes && includes
- 6. check for suffix
- 7. check for same host (if spanhost is unset), with possible
- gethostbyname baggage
- 8. check for robots.txt
-
- Addendum: If the URL is FTP, and it is to be loaded, only the
- domain and suffix settings are "stronger".
-
- Note that .html and (yuck) .htm will get loaded regardless of
- suffix rules (but that is remedied later with unlink) unless
- the depth equals the maximum depth.
-
- More time- and memory- consuming tests should be put later on
- the list. */
-
- /* inl is set if the URL we are working on (constr) is stored in
- undesirable_urls. Using it is crucial to avoid unnecessary
- repeated continuous hits to the hash table. */
- inl = string_set_contains (undesirable_urls, constr);
-
- /* If it is FTP, and FTP is not followed, chuck it out. */
- if (!inl)
- if (u->scheme == SCHEME_FTP && !opt.follow_ftp && !this_url_ftp)
- {
- DEBUGP (("Uh, it is FTP but i'm not in the mood to follow FTP.\n"));
- string_set_add (undesirable_urls, constr);
- inl = 1;
- }
- /* If it is absolute link and they are not followed, chuck it
- out. */
- if (!inl && u->scheme != SCHEME_FTP)
- if (opt.relative_only && !cur_url->link_relative_p)
- {
- DEBUGP (("It doesn't really look like a relative link.\n"));
- string_set_add (undesirable_urls, constr);
- inl = 1;
- }
- /* If its domain is not to be accepted/looked-up, chuck it out. */
- if (!inl)
- if (!accept_domain (u))
+
+ if (status == FWRITEERR)
+ break;
+
+ /* Get the next URL from the queue. */
+
+ if (!url_dequeue (queue,
+ (const char **)&url, (const char **)&referer,
+ &depth))
+ break;
+
+ /* And download it. */
+
+ {
+ int dt = 0;
+ char *redirected = NULL;
+ int oldrec = opt.recursive;
+
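+	/* Turn off recursion inside retrieve_url: retrieve_tree itself
+	   drives the traversal, one dequeued URL at a time.  */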
+ opt.recursive = 0;
+ status = retrieve_url (url, &file, &redirected, NULL, &dt);
+ opt.recursive = oldrec;
+
+ if (redirected)
{
- DEBUGP (("I don't like the smell of that domain.\n"));
- string_set_add (undesirable_urls, constr);
- inl = 1;
+ xfree (url);
+ url = redirected;
}
- /* Check for parent directory. */
- if (!inl && opt.no_parent
- /* If the new URL is FTP and the old was not, ignore
- opt.no_parent. */
- && !(!this_url_ftp && u->scheme == SCHEME_FTP))
+ if (file && status == RETROK
+ && (dt & RETROKF) && (dt & TEXTHTML))
+ descend = 1;
+ }
+
+ if (descend
+ && depth >= opt.reclevel && opt.reclevel != INFINITE_RECURSION)
{
- /* Check for base_dir first. */
- if (!(base_dir && frontcmp (base_dir, u->dir)))
+ if (opt.page_requisites && depth == opt.reclevel)
+ /* When -p is specified, we can do one more partial
+ recursion from the "leaf nodes" on the HTML document
+ tree. The recursion is partial in that we won't
+ traverse any <A> or <AREA> tags, nor any <LINK> tags
+ except for <LINK REL="stylesheet">. */
+ /* #### This would be the place to implement the TODO
+ entry saying that -p should do two more hops on
+ framesets. */
+ dash_p_leaf_HTML = TRUE;
+ else
{
- /* Failing that, check for parent dir. */
- struct url *ut = url_parse (this_url, NULL);
- if (!ut)
- DEBUGP (("Double yuck! The *base* URL is broken.\n"));
- else if (!frontcmp (ut->dir, u->dir))
- {
- /* Failing that too, kill the URL. */
- DEBUGP (("Trying to escape parental guidance with no_parent on.\n"));
- string_set_add (undesirable_urls, constr);
- inl = 1;
- }
- url_free (ut);
+ /* Either -p wasn't specified or it was and we've
+ already gone the one extra (pseudo-)level that it
+ affords us, so we need to bail out. */
+ DEBUGP (("Not descending further; at depth %d, max. %d.\n",
+ depth, opt.reclevel));
+ descend = 0;
}
}
- /* If the file does not match the acceptance list, or is on the
- rejection list, chuck it out. The same goes for the
- directory exclude- and include- lists. */
- if (!inl && (opt.includes || opt.excludes))
+
+ /* If the downloaded document was HTML, parse it and enqueue the
+ links it contains. */
+
+ if (descend)
{
- if (!accdir (u->dir, ALLABS))
+ int meta_disallow_follow = 0;
+ struct urlpos *children = get_urls_html (file, url, dash_p_leaf_HTML,
+ &meta_disallow_follow);
+
+ if (opt.use_robots && meta_disallow_follow)
{
- DEBUGP (("%s (%s) is excluded/not-included.\n", constr, u->dir));
- string_set_add (undesirable_urls, constr);
- inl = 1;
+ free_urlpos (children);
+ children = NULL;
}
- }
- if (!inl)
- {
- char *suf = NULL;
- /* We check for acceptance/rejection rules only for non-HTML
- documents. Since we don't know whether they really are
- HTML, it will be deduced from (an OR-ed list):
-
- 1) u->file is "" (meaning it is a directory)
- 2) suffix exists, AND:
- a) it is "html", OR
- b) it is "htm"
-
- If the file *is* supposed to be HTML, it will *not* be
- subject to acc/rej rules, unless a finite maximum depth has
- been specified and the current depth is the maximum depth. */
- if (!
- (!*u->file
- || (((suf = suffix (constr)) != NULL)
- && ((!strcmp (suf, "html") || !strcmp (suf, "htm"))
- && ((opt.reclevel != INFINITE_RECURSION) &&
- (depth != opt.reclevel))))))
+
+ if (children)
{
- if (!acceptable (u->file))
+ struct urlpos *child = children;
+	  struct url *url_parsed = url_parse (url, NULL);
+ assert (url_parsed != NULL);
+
+ for (; child; child = child->next)
{
- DEBUGP (("%s (%s) does not match acc/rej rules.\n",
- constr, u->file));
- string_set_add (undesirable_urls, constr);
- inl = 1;
+ if (descend_url_p (child, url_parsed, depth, start_url_parsed,
+ blacklist))
+ {
+ url_enqueue (queue, xstrdup (child->url->url),
+ xstrdup (url), depth + 1);
+ /* We blacklist the URL we have enqueued, because we
+ don't want to enqueue (and hence download) the
+ same URL twice. */
+ string_set_add (blacklist, child->url->url);
+ }
}
+
+ url_free (url_parsed);
+ free_urlpos (children);
}
- FREE_MAYBE (suf);
}
- /* Optimize the URL (which includes possible DNS lookup) only
- after all other possibilities have been exhausted. */
- if (!inl)
+
+ if (opt.delete_after || (file && !acceptable (file)))
{
- if (!opt.simple_check)
- {
- /* Find the "true" host. */
- char *host = realhost (u->host);
- xfree (u->host);
- u->host = host;
-
- /* Refresh the printed representation of the URL. */
- xfree (u->url);
- u->url = url_string (u, 0);
- }
- else
- {
- char *p;
- /* Just lowercase the hostname. */
- for (p = u->host; *p; p++)
- *p = TOLOWER (*p);
- xfree (u->url);
- u->url = url_string (u, 0);
- }
- xfree (constr);
- constr = xstrdup (u->url);
- /* After we have canonicalized the URL, check if we have it
- on the black list. */
- if (string_set_contains (undesirable_urls, constr))
- inl = 1;
- /* This line is bogus. */
- /*string_set_add (undesirable_urls, constr);*/
-
- if (!inl && !((u->scheme == SCHEME_FTP) && !this_url_ftp))
- if (!opt.spanhost && this_url && !same_host (this_url, constr))
- {
- DEBUGP (("This is not the same hostname as the parent's.\n"));
- string_set_add (undesirable_urls, constr);
- inl = 1;
- }
+ /* Either --delete-after was specified, or we loaded this
+ otherwise rejected (e.g. by -R) HTML file just so we
+ could harvest its hyperlinks -- in either case, delete
+ the local file. */
+	  DEBUGP (("Removing file due to %s in retrieve_tree():\n",
+ opt.delete_after ? "--delete-after" :
+ "recursive rejection criteria"));
+ logprintf (LOG_VERBOSE,
+ (opt.delete_after ? _("Removing %s.\n")
+ : _("Removing %s since it should be rejected.\n")),
+ file);
+ if (unlink (file))
+ logprintf (LOG_NOTQUIET, "unlink: %s\n", strerror (errno));
}
- /* What about robots.txt? */
- if (!inl && opt.use_robots && u->scheme == SCHEME_HTTP)
+
+ xfree (url);
+ FREE_MAYBE (referer);
+ FREE_MAYBE (file);
+ }
+
+  /* If anything is left in the queue due to a premature exit, free it
+ now. */
+ {
+ char *d1, *d2;
+ int d3;
+ while (url_dequeue (queue, (const char **)&d1, (const char **)&d2, &d3))
+ {
+ xfree (d1);
+ FREE_MAYBE (d2);
+ }
+ }
+ url_queue_delete (queue);
+
+ if (start_url_parsed)
+ url_free (start_url_parsed);
+ string_set_free (blacklist);
+
+ if (downloaded_exceeds_quota ())
+ return QUOTEXC;
+ else if (status == FWRITEERR)
+ return FWRITEERR;
+ else
+ return RETROK;
+}
+
+/* Based on the context provided by retrieve_tree, decide whether a
+ URL is to be descended to. This is only ever called from
+ retrieve_tree, but is in a separate function for clarity. */
+
+static int
+descend_url_p (const struct urlpos *upos, struct url *parent, int depth,
+ struct url *start_url_parsed, struct hash_table *blacklist)
+{
+ struct url *u = upos->url;
+ const char *url = u->url;
+
+ DEBUGP (("Deciding whether to enqueue \"%s\".\n", url));
+
+ if (string_set_contains (blacklist, url))
+ {
+ DEBUGP (("Already on the black list.\n"));
+ goto out;
+ }
+
+ /* Several things to check for:
+     1. check for scheme: only HTTP is followed (and FTP, with --follow-ftp)
+ 2. check for relative links (if relative_only is set)
+ 3. check for domain
+ 4. check for no-parent
+ 5. check for excludes && includes
+ 6. check for suffix
+     7. check for same host (if spanhost is unset)
+ 8. check for robots.txt
+
+ Addendum: If the URL is FTP, and it is to be loaded, only the
+ domain and suffix settings are "stronger".
+
+ Note that .html files will get loaded regardless of suffix rules
+ (but that is remedied later with unlink) unless the depth equals
+ the maximum depth.
+
+ More time- and memory- consuming tests should be put later on
+ the list. */
+
+ /* 1. Schemes other than HTTP are normally not recursed into. */
+ if (u->scheme != SCHEME_HTTP
+ && !(u->scheme == SCHEME_FTP && opt.follow_ftp))
+ {
+ DEBUGP (("Not following non-HTTP schemes.\n"));
+ goto blacklist;
+ }
+
+ /* 2. If it is an absolute link and they are not followed, throw it
+ out. */
+ if (u->scheme == SCHEME_HTTP)
+ if (opt.relative_only && !upos->link_relative_p)
+ {
+ DEBUGP (("It doesn't really look like a relative link.\n"));
+ goto blacklist;
+ }
+
+ /* 3. If its domain is not to be accepted/looked-up, chuck it
+ out. */
+ if (!accept_domain (u))
+ {
+ DEBUGP (("The domain was not accepted.\n"));
+ goto blacklist;
+ }
+
+ /* 4. Check for parent directory.
+
+ If we descended to a different host or changed the scheme, ignore
+ opt.no_parent. Also ignore it for -p leaf retrievals. */
+ if (opt.no_parent
+ && u->scheme == parent->scheme
+ && 0 == strcasecmp (u->host, parent->host)
+ && u->port == parent->port)
+ {
+ if (!frontcmp (parent->dir, u->dir))
{
- struct robot_specs *specs = res_get_specs (u->host, u->port);
- if (!specs)
- {
- char *rfile;
- if (res_retrieve_file (constr, &rfile))
- {
- specs = res_parse_from_file (rfile);
- xfree (rfile);
- }
- else
- {
- /* If we cannot get real specs, at least produce
- dummy ones so that we can register them and stop
- trying to retrieve them. */
- specs = res_parse ("", 0);
- }
- res_register_specs (u->host, u->port, specs);
- }
+ DEBUGP (("Trying to escape the root directory with no_parent in effect.\n"));
+ goto blacklist;
+ }
+ }
- /* Now that we have (or don't have) robots.txt specs, we can
- check what they say. */
- if (!res_match_path (specs, u->path))
- {
- DEBUGP (("Not following %s because robots.txt forbids it.\n",
- constr));
- string_set_add (undesirable_urls, constr);
- inl = 1;
- }
+ /* 5. If the file does not match the acceptance list, or is on the
+ rejection list, chuck it out. The same goes for the directory
+ exclusion and inclusion lists. */
+ if (opt.includes || opt.excludes)
+ {
+ if (!accdir (u->dir, ALLABS))
+ {
+ DEBUGP (("%s (%s) is excluded/not-included.\n", url, u->dir));
+ goto blacklist;
}
+ }
- filename = NULL;
- /* If it wasn't chucked out, do something with it. */
- if (!inl)
+ /* 6. */
+ {
+ char *suf = NULL;
+ /* Check for acceptance/rejection rules. We ignore these rules
+ for HTML documents because they might lead to other files which
+ need to be downloaded. Of course, we don't know which
+ documents are HTML before downloading them, so we guess.
+
+ A file is subject to acceptance/rejection rules if:
+
+ * u->file is not "" (i.e. it is not a directory)
+ and either:
+ + there is no file suffix,
+       + or there is a suffix, but it is not "html" or "htm",
+ + both:
+ - recursion is not infinite,
+ - and we are at its very end. */
+
+ if (u->file[0] != '\0'
+ && ((suf = suffix (url)) == NULL
+ || (0 != strcmp (suf, "html") && 0 != strcmp (suf, "htm"))
+	    || (opt.reclevel != INFINITE_RECURSION && depth >= opt.reclevel)))
+ {
+ if (!acceptable (u->file))
+ {
+ DEBUGP (("%s (%s) does not match acc/rej rules.\n",
+ url, u->file));
+ FREE_MAYBE (suf);
+ goto blacklist;
+ }
+ }
+ FREE_MAYBE (suf);
+ }
+
+ /* 7. */
+ if (u->scheme == parent->scheme)
+ if (!opt.spanhost && 0 != strcasecmp (parent->host, u->host))
+ {
+ DEBUGP (("This is not the same hostname as the parent's (%s and %s).\n",
+ u->host, parent->host));
+ goto blacklist;
+ }
+
+ /* 8. */
+ if (opt.use_robots && u->scheme == SCHEME_HTTP)
+ {
+ struct robot_specs *specs = res_get_specs (u->host, u->port);
+ if (!specs)
{
- DEBUGP (("I've decided to load it -> "));
- /* Add it to the list of already-loaded URL-s. */
- string_set_add (undesirable_urls, constr);
- /* Automatically followed FTPs will *not* be downloaded
- recursively. */
- if (u->scheme == SCHEME_FTP)
+ char *rfile;
+ if (res_retrieve_file (url, &rfile))
{
- /* Don't you adore side-effects? */
- opt.recursive = 0;
- }
- /* Reset its type. */
- dt = 0;
- /* Retrieve it. */
- retrieve_url (constr, &filename, &newloc,
- canon_this_url ? canon_this_url : this_url, &dt);
- if (u->scheme == SCHEME_FTP)
- {
- /* Restore... */
- opt.recursive = 1;
- }
- if (newloc)
- {
- xfree (constr);
- constr = newloc;
- }
- /* If there was no error, and the type is text/html, parse
- it recursively. */
- if (dt & TEXTHTML)
- {
- if (dt & RETROKF)
- recursive_retrieve (filename, constr);
+ specs = res_parse_from_file (rfile);
+ xfree (rfile);
}
else
- DEBUGP (("%s is not text/html so we don't chase.\n",
- filename ? filename: "(null)"));
-
- if (opt.delete_after || (filename && !acceptable (filename)))
- /* Either --delete-after was specified, or we loaded this otherwise
- rejected (e.g. by -R) HTML file just so we could harvest its
- hyperlinks -- in either case, delete the local file. */
{
- DEBUGP (("Removing file due to %s in recursive_retrieve():\n",
- opt.delete_after ? "--delete-after" :
- "recursive rejection criteria"));
- logprintf (LOG_VERBOSE,
- (opt.delete_after ? _("Removing %s.\n")
- : _("Removing %s since it should be rejected.\n")),
- filename);
- if (unlink (filename))
- logprintf (LOG_NOTQUIET, "unlink: %s\n", strerror (errno));
- dt &= ~RETROKF;
+ /* If we cannot get real specs, at least produce
+ dummy ones so that we can register them and stop
+ trying to retrieve them. */
+ specs = res_parse ("", 0);
}
+ res_register_specs (u->host, u->port, specs);
+ }
- /* If everything was OK, and links are to be converted, let's
- store the local filename. */
- if (opt.convert_links && (dt & RETROKF) && (filename != NULL))
- {
- cur_url->convert = CO_CONVERT_TO_RELATIVE;
- cur_url->local_name = xstrdup (filename);
- }
+ /* Now that we have (or don't have) robots.txt specs, we can
+ check what they say. */
+ if (!res_match_path (specs, u->path))
+ {
+ DEBUGP (("Not following %s because robots.txt forbids it.\n", url));
+ goto blacklist;
}
- else
- DEBUGP (("%s already in list, so we don't load.\n", constr));
- /* Free filename and constr. */
- FREE_MAYBE (filename);
- FREE_MAYBE (constr);
- url_free (u);
- /* Increment the pbuf for the appropriate size. */
}
- if (opt.convert_links && !opt.delete_after)
- /* This is merely the first pass: the links that have been
- successfully downloaded are converted. In the second pass,
- convert_all_links() will also convert those links that have NOT
- been downloaded to their canonical form. */
- convert_links (file, url_list);
- /* Free the linked list of URL-s. */
- free_urlpos (url_list);
- /* Free the canonical this_url. */
- FREE_MAYBE (canon_this_url);
- /* Decrement the recursion depth. */
- --depth;
- if (downloaded_exceeds_quota ())
- return QUOTEXC;
- else
- return RETROK;
+
+ /* The URL has passed all the tests. It can be placed in the
+ download queue. */
+ DEBUGP (("Decided to load it.\n"));
+
+ return 1;
+
+ blacklist:
+ string_set_add (blacklist, url);
+
+ out:
+ DEBUGP (("Decided NOT to load it.\n"));
+
+ return 0;
}
\f
+/* Register that URL has been successfully downloaded to FILE. */
+
void
register_download (const char *url, const char *file)
{
return;
if (!dl_file_url_map)
dl_file_url_map = make_string_hash_table (0);
- hash_table_put (dl_file_url_map, xstrdup (file), xstrdup (url));
if (!dl_url_file_map)
dl_url_file_map = make_string_hash_table (0);
- hash_table_put (dl_url_file_map, xstrdup (url), xstrdup (file));
+
+ if (!hash_table_contains (dl_file_url_map, file))
+ hash_table_put (dl_file_url_map, xstrdup (file), xstrdup (url));
+ if (!hash_table_contains (dl_url_file_map, url))
+ hash_table_put (dl_url_file_map, xstrdup (url), xstrdup (file));
}
+/* Register that FROM has been redirected to TO. This assumes that TO
+ is successfully downloaded and already registered using
+ register_download() above. */
+
+void
+register_redirection (const char *from, const char *to)
+{
+ char *file;
+
+ if (!opt.convert_links)
+ return;
+
+ file = hash_table_get (dl_url_file_map, to);
+ assert (file != NULL);
+ if (!hash_table_contains (dl_url_file_map, from))
+ hash_table_put (dl_url_file_map, xstrdup (from), xstrdup (file));
+}
+
+/* Register that URL corresponds to the HTML file FILE. */
+
void
register_html (const char *url, const char *file)
{
for (html = downloaded_html_files; html; html = html->next)
{
- urlpos *urls, *cur_url;
+ struct urlpos *urls, *cur_url;
char *url;
DEBUGP (("Rescanning %s\n", html->string));
+
/* Determine the URL of the HTML file. get_urls_html will need
it. */
url = hash_table_get (dl_file_url_map, html->string);
DEBUGP (("It should correspond to %s.\n", url));
else
DEBUGP (("I cannot find the corresponding URL.\n"));
+
/* Parse the HTML file... */
urls = get_urls_html (html->string, url, FALSE, NULL);
+
/* We don't respect meta_disallow_follow here because, even if
the file is not followed, we might still want to convert the
links that have been followed from other files. */
+
for (cur_url = urls; cur_url; cur_url = cur_url->next)
{
char *local_name;
+ struct url *u = cur_url->url;
- /* The URL must be in canonical form to be compared. */
- struct url *u = url_parse (cur_url->url, NULL);
- if (!u)
- continue;
/* We decide the direction of conversion according to whether
a URL was downloaded. Downloaded URLs will be converted
ABS2REL, whereas non-downloaded will be converted REL2ABS. */
if (local_name)
DEBUGP (("%s marked for conversion, local %s\n",
u->url, local_name));
+
/* Decide on the conversion direction. */
if (local_name)
{
cur_url->convert = CO_CONVERT_TO_COMPLETE;
cur_url->local_name = NULL;
}
- url_free (u);
}
/* Convert the links in the file. */
convert_links (html->string, urls);
free_urlpos (urls);
}
}
+
+/* Cleanup the data structures associated with recursive retrieving
+ (the variables above). */
+void
+recursive_cleanup (void)
+{
+ if (dl_file_url_map)
+ {
+ free_keys_and_values (dl_file_url_map);
+ hash_table_destroy (dl_file_url_map);
+ dl_file_url_map = NULL;
+ }
+ if (dl_url_file_map)
+ {
+ free_keys_and_values (dl_url_file_map);
+ hash_table_destroy (dl_url_file_map);
+ dl_url_file_map = NULL;
+ }
+ slist_free (downloaded_html_files);
+ downloaded_html_files = NULL;
+}
#define RECUR_H
void recursive_cleanup PARAMS ((void));
-void recursive_reset PARAMS ((void));
-uerr_t recursive_retrieve PARAMS ((const char *, const char *));
+uerr_t retrieve_tree PARAMS ((const char *));
void register_download PARAMS ((const char *, const char *));
+void register_redirection PARAMS ((const char *, const char *));
void register_html PARAMS ((const char *, const char *));
void convert_all_links PARAMS ((void));
int allowedp, int exactp)
{
struct path_info pp;
+ if (path_b < path_e && *path_b == '/')
+ /* Our path representation doesn't use a leading slash, so remove
+ one from theirs. */
+ ++path_b;
pp.path = strdupdelim (path_b, path_e);
pp.allowedp = allowedp;
pp.user_agent_exact_p = exactp;
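
(For instance, a hypothetical `Disallow: /cgi-bin/' line is now stored
as `cgi-bin/', so it compares cleanly against the slash-less path
representation used on the Wget side.)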
static void
free_specs (struct robot_specs *specs)
{
+ int i;
+ for (i = 0; i < specs->count; i++)
+ xfree (specs->paths[i].path);
FREE_MAYBE (specs->paths);
xfree (specs);
}
}
return err == RETROK;
}
+\f
+static int
+cleanup_hash_table_mapper (void *key, void *value, void *arg_ignored)
+{
+ xfree (key);
+ free_specs (value);
+ return 0;
+}
+
+void
+res_cleanup (void)
+{
+ if (registered_specs)
+ {
+ hash_table_map (registered_specs, cleanup_hash_table_mapper, NULL);
+ hash_table_destroy (registered_specs);
+ registered_specs = NULL;
+ }
+}
int res_retrieve_file PARAMS ((const char *, char **));
+void res_cleanup PARAMS ((void));
return res;
}
\f
+static int
+register_redirections_mapper (void *key, void *value, void *arg)
+{
+ const char *redirected_from = (const char *)key;
+ const char *redirected_to = (const char *)arg;
+ if (0 != strcmp (redirected_from, redirected_to))
+ register_redirection (redirected_from, redirected_to);
+ return 0;
+}
+
+/* Register the redirections that lead to the successful download of
+ this URL. This is necessary so that the link converter can convert
+ redirected URLs to the local file. */
+
+static void
+register_all_redirections (struct hash_table *redirections, const char *final)
+{
+ hash_table_map (redirections, register_redirections_mapper, (void *)final);
+}
+
#define USE_PROXY_P(u) (opt.use_proxy && getproxy((u)->scheme) \
&& no_proxy_match((u)->host, \
(const char **)opt.no_proxy))
proxy_url = url_parse (proxy, &up_error_code);
if (!proxy_url)
{
- logprintf (LOG_NOTQUIET, "Error parsing proxy URL %s: %s.\n",
+ logprintf (LOG_NOTQUIET, _("Error parsing proxy URL %s: %s.\n"),
proxy, url_error (up_error_code));
if (redirections)
string_set_free (redirections);
if (location_changed)
{
char *construced_newloc;
- struct url *newloc_struct;
+ struct url *newloc_parsed;
assert (mynewloc != NULL);
mynewloc = construced_newloc;
/* Now, see if this new location makes sense. */
- newloc_struct = url_parse (mynewloc, &up_error_code);
- if (!newloc_struct)
+ newloc_parsed = url_parse (mynewloc, &up_error_code);
+ if (!newloc_parsed)
{
logprintf (LOG_NOTQUIET, "%s: %s.\n", mynewloc,
url_error (up_error_code));
- url_free (newloc_struct);
url_free (u);
if (redirections)
string_set_free (redirections);
return result;
}
- /* Now mynewloc will become newloc_struct->url, because if the
+ /* Now mynewloc will become newloc_parsed->url, because if the
Location contained relative paths like .././something, we
don't want that propagating as url. */
xfree (mynewloc);
- mynewloc = xstrdup (newloc_struct->url);
+ mynewloc = xstrdup (newloc_parsed->url);
if (!redirections)
{
/* The new location is OK. Check for redirection cycle by
peeking through the history of redirections. */
- if (string_set_contains (redirections, newloc_struct->url))
+ if (string_set_contains (redirections, newloc_parsed->url))
{
logprintf (LOG_NOTQUIET, _("%s: Redirection cycle detected.\n"),
mynewloc);
- url_free (newloc_struct);
+ url_free (newloc_parsed);
url_free (u);
if (redirections)
string_set_free (redirections);
xfree (mynewloc);
return WRONGCODE;
}
- string_set_add (redirections, newloc_struct->url);
+ string_set_add (redirections, newloc_parsed->url);
xfree (url);
url = mynewloc;
url_free (u);
- u = newloc_struct;
+ u = newloc_parsed;
goto redirected;
}
if (*dt & RETROKF)
{
register_download (url, local_file);
+ if (redirections)
+ register_all_redirections (redirections, url);
if (*dt & TEXTHTML)
register_html (url, local_file);
}
retrieve_from_file (const char *file, int html, int *count)
{
uerr_t status;
- urlpos *url_list, *cur_url;
+ struct urlpos *url_list, *cur_url;
url_list = (html ? get_urls_html (file, NULL, FALSE, NULL)
: get_urls_file (file));
status = RETROK; /* Suppose everything is OK. */
*count = 0; /* Reset the URL count. */
- recursive_reset ();
+
for (cur_url = url_list; cur_url; cur_url = cur_url->next, ++*count)
{
- char *filename, *new_file;
+ char *filename = NULL, *new_file;
int dt;
      if (downloaded_exceeds_quota ())
	{
	  status = QUOTEXC;
	  break;
	}
- status = retrieve_url (cur_url->url, &filename, &new_file, NULL, &dt);
- if (opt.recursive && status == RETROK && (dt & TEXTHTML))
- status = recursive_retrieve (filename, new_file ? new_file
- : cur_url->url);
+ if (opt.recursive && cur_url->url->scheme != SCHEME_FTP)
+ status = retrieve_tree (cur_url->url->url);
+ else
+ status = retrieve_url (cur_url->url->url, &filename, &new_file, NULL, &dt);
if (filename && opt.delete_after && file_exists_p (filename))
{
#include "utils.h"
#include "url.h"
#include "host.h"
+#include "hash.h"
#ifndef errno
extern int errno;
{
if (UNSAFE_CHAR (*p1))
{
- const unsigned char c = *p1++;
+ unsigned char c = *p1++;
*p2++ = '%';
*p2++ = XDIGIT_TO_XCHAR (c >> 4);
*p2++ = XDIGIT_TO_XCHAR (c & 0xf);
{
case CM_ENCODE:
{
- char c = *p1++;
+ unsigned char c = *p1++;
*p2++ = '%';
*p2++ = XDIGIT_TO_XCHAR (c >> 4);
*p2++ = XDIGIT_TO_XCHAR (c & 0xf);
return p;
}
+/* Turn STR into lowercase; return non-zero if a character was
+ actually changed. */
+
+static int
+lowercase_str (char *str)
+{
+ int change = 0;
+ for (; *str; str++)
+    if (ISUPPER (*str))
+ {
+ change = 1;
+ *str = TOLOWER (*str);
+ }
+ return change;
+}
+
static char *parse_errors[] = {
#define PE_NO_ERROR 0
"No error",
{
struct url *u;
const char *p;
+ int path_modified, host_modified;
enum url_scheme scheme;
int port;
char *user = NULL, *passwd = NULL;
- const char *url_orig = url;
-
- p = url = reencode_string (url);
+ char *url_encoded;
scheme = url_scheme (url);
if (scheme == SCHEME_INVALID)
return NULL;
}
+ url_encoded = reencode_string (url);
+ p = url_encoded;
+
p += strlen (supported_schemes[scheme].leading_string);
uname_b = p;
p += url_skip_uname (p);
u = (struct url *)xmalloc (sizeof (struct url));
memset (u, 0, sizeof (*u));
- if (url == url_orig)
- u->url = xstrdup (url);
- else
- u->url = (char *)url;
-
u->scheme = scheme;
u->host = strdupdelim (host_b, host_e);
u->port = port;
u->passwd = passwd;
u->path = strdupdelim (path_b, path_e);
- path_simplify (u->path);
+ path_modified = path_simplify (u->path);
+ parse_path (u->path, &u->dir, &u->file);
+
+ host_modified = lowercase_str (u->host);
if (params_b)
u->params = strdupdelim (params_b, params_e);
if (fragment_b)
u->fragment = strdupdelim (fragment_b, fragment_e);
- parse_path (u->path, &u->dir, &u->file);
+
+ if (path_modified || u->fragment || host_modified)
+ {
+ /* If path_simplify modified the path, or if a fragment is
+ present, or if the original host name had caps in it, make
+ sure that u->url is equivalent to what would be printed by
+ url_string. */
+ u->url = url_string (u, 0);
+
+ if (url_encoded != url)
+ xfree ((char *) url_encoded);
+ }
+ else
+ {
+ if (url_encoded == url)
+ u->url = xstrdup (url);
+ else
+ u->url = url_encoded;
+ }
+ url_encoded = NULL;
return u;
}
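
(For instance, given the hypothetical input "http://Host.COM/a/../b#sec":
lowercase_str reports the host modified ("host.com"), path_simplify
reduces "/a/../b" to "/b", and u->fragment is non-NULL -- any one of
these forces u->url to be re-rendered through url_string, yielding the
canonical "http://host.com/b", with the fragment kept only in
u->fragment.)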
FREE_MAYBE (url->fragment);
FREE_MAYBE (url->user);
FREE_MAYBE (url->passwd);
- FREE_MAYBE (url->dir);
- FREE_MAYBE (url->file);
+
+ xfree (url->dir);
+ xfree (url->file);
xfree (url);
}
\f
-urlpos *
+struct urlpos *
get_urls_file (const char *file)
{
struct file_memory *fm;
- urlpos *head, *tail;
+ struct urlpos *head, *tail;
const char *text, *text_end;
/* Load the file. */
--line_end;
if (line_end > line_beg)
{
- urlpos *entry = (urlpos *)xmalloc (sizeof (urlpos));
+ int up_error_code;
+ char *url_text;
+ struct urlpos *entry;
+ struct url *url;
+
+ /* We must copy the URL to a zero-terminated string. *sigh*. */
+ url_text = strdupdelim (line_beg, line_end);
+ url = url_parse (url_text, &up_error_code);
+ if (!url)
+ {
+ logprintf (LOG_NOTQUIET, "%s: Invalid URL %s: %s\n",
+ file, url_text, url_error (up_error_code));
+ xfree (url_text);
+ continue;
+ }
+ xfree (url_text);
+
+ entry = (struct urlpos *)xmalloc (sizeof (struct urlpos));
memset (entry, 0, sizeof (*entry));
entry->next = NULL;
- entry->url = strdupdelim (line_beg, line_end);
+ entry->url = url;
+
if (!head)
head = entry;
else
\f
/* Free the linked list of urlpos. */
void
-free_urlpos (urlpos *l)
+free_urlpos (struct urlpos *l)
{
while (l)
{
- urlpos *next = l->next;
- xfree (l->url);
+ struct urlpos *next = l->next;
+ if (l->url)
+ url_free (l->url);
FREE_MAYBE (l->local_name);
xfree (l);
l = next;
static char *
mkstruct (const struct url *u)
{
- char *host, *dir, *file, *res, *dirpref;
+ char *dir, *dir_preencoding;
+ char *file, *res, *dirpref;
+ char *query = u->query && *u->query ? u->query : NULL;
int l;
if (opt.cut_dirs)
else
dir = u->dir + (*u->dir == '/');
- host = xstrdup (u->host);
/* Check for the true name (or at least a consistent name for saving
to directory) of HOST, reusing the hlist if possible. */
- if (opt.add_hostdir && !opt.simple_check)
- {
- char *nhost = realhost (host);
- xfree (host);
- host = nhost;
- }
- /* Add dir_prefix and hostname (if required) to the beginning of
- dir. */
if (opt.add_hostdir)
{
+ /* Add dir_prefix and hostname (if required) to the beginning of
+ dir. */
+ dirpref = (char *)alloca (strlen (opt.dir_prefix) + 1
+ + strlen (u->host)
+ + 1 + numdigit (u->port)
+ + 1);
if (!DOTP (opt.dir_prefix))
+ sprintf (dirpref, "%s/%s", opt.dir_prefix, u->host);
+ else
+ strcpy (dirpref, u->host);
+
+ if (u->port != scheme_default_port (u->scheme))
{
- dirpref = (char *)alloca (strlen (opt.dir_prefix) + 1
- + strlen (host) + 1);
- sprintf (dirpref, "%s/%s", opt.dir_prefix, host);
+ int len = strlen (dirpref);
+ dirpref[len] = ':';
+ long_to_string (dirpref + len + 1, u->port);
}
- else
- STRDUP_ALLOCA (dirpref, host);
}
- else /* not add_hostdir */
+ else /* not add_hostdir */
{
if (!DOTP (opt.dir_prefix))
dirpref = opt.dir_prefix;
else
dirpref = "";
}
- xfree (host);
/* If there is a prefix, prepend it. */
if (*dirpref)
sprintf (newdir, "%s%s%s", dirpref, *dir == '/' ? "" : "/", dir);
dir = newdir;
}
- dir = encode_string (dir);
+
+ dir_preencoding = dir;
+ dir = reencode_string (dir_preencoding);
+
l = strlen (dir);
if (l && dir[l - 1] == '/')
dir[l - 1] = '\0';
file = u->file;
/* Finally, construct the full name. */
- res = (char *)xmalloc (strlen (dir) + 1 + strlen (file) + 1);
+ res = (char *)xmalloc (strlen (dir) + 1 + strlen (file)
+ + (query ? (1 + strlen (query)) : 0)
+ + 1);
sprintf (res, "%s%s%s", dir, *dir ? "/" : "", file);
- xfree (dir);
+ if (query)
+ {
+ strcat (res, "?");
+ strcat (res, query);
+ }
+ if (dir != dir_preencoding)
+ xfree (dir);
return res;
}
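
Editorial illustration, not part of the patch: with the changes above, the host
directory gains a ":port" suffix for non-default ports and a non-empty query
string is kept in the local file name.  A stand-alone sketch of that
construction (hypothetical helper name; plain sprintf ()/%d stands in for
Wget's numdigit () and long_to_string ()):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *
sketch_local_name (const char *host, int port, int default_port,
                   const char *dir, const char *file, const char *query)
{
  char *res = malloc (strlen (host) + strlen (dir) + strlen (file)
                      + (query ? strlen (query) : 0) + 32);
  int n = sprintf (res, "%s", host);
  if (port != default_port)
    n += sprintf (res + n, ":%d", port);      /* host directory gets a port */
  n += sprintf (res + n, "/%s%s%s", dir, *dir ? "/" : "", file);
  if (query && *query)
    sprintf (res + n, "?%s", query);          /* query kept in the file name */
  return res;
}

int
main (void)
{
  char *name = sketch_local_name ("example.com", 8000, 80,
                                  "pub", "file.txt", "id=3");
  printf ("%s\n", name);   /* example.com:8000/pub/file.txt?id=3 */
  free (name);
  return 0;
}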
{
if (UNSAFE_CHAR (*from))
{
- const unsigned char c = *from++;
+ unsigned char c = *from++;
*to++ = '%';
*to++ = XDIGIT_TO_XCHAR (c >> 4);
*to++ = XDIGIT_TO_XCHAR (c & 0xf);
static int
urlpath_length (const char *url)
{
- const char *q = strchr (url, '?');
- if (q)
- return q - url;
- return strlen (url);
+ const char *q = strpbrk_or_eos (url, "?;#");
+ return q - url;
}
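
Editorial note, not part of the patch: the path length is now cut off at ';'
and '#' as well as '?'.  A tiny stand-alone check of the expected results,
using strcspn () as a stand-in for Wget's strpbrk_or_eos ():

#include <assert.h>
#include <string.h>

static int
sketch_urlpath_length (const char *url)
{
  /* strcspn () returns the length of the initial span containing none
     of '?', ';' or '#' -- i.e. the length of the path proper.  */
  return (int) strcspn (url, "?;#");
}

int
main (void)
{
  assert (sketch_urlpath_length ("dir/file") == 8);
  assert (sketch_urlpath_length ("dir/file?q=1") == 8);
  assert (sketch_urlpath_length ("dir/file;type=a#frag") == 8);
  return 0;
}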
/* Find the last occurrence of character C in the range [b, e), or
{
const char *end = base + urlpath_length (base);
- if (*link != '/')
+ if (!*link)
{
- /* LINK is a relative URL: we need to replace everything
- after last slash (possibly empty) with LINK.
-
- So, if BASE is "whatever/foo/bar", and LINK is "qux/xyzzy",
- our result should be "whatever/foo/qux/xyzzy". */
- int need_explicit_slash = 0;
- int span;
- const char *start_insert;
- const char *last_slash = find_last_char (base, end, '/');
- if (!last_slash)
- {
- /* No slash found at all. Append LINK to what we have,
- but we'll need a slash as a separator.
-
- Example: if base == "foo" and link == "qux/xyzzy", then
- we cannot just append link to base, because we'd get
- "fooqux/xyzzy", whereas what we want is
- "foo/qux/xyzzy".
-
- To make sure the / gets inserted, we set
- need_explicit_slash to 1. We also set start_insert
- to end + 1, so that the length calculations work out
- correctly for one more (slash) character. Accessing
- that character is fine, since it will be the
- delimiter, '\0' or '?'. */
- /* example: "foo?..." */
- /* ^ ('?' gets changed to '/') */
- start_insert = end + 1;
- need_explicit_slash = 1;
- }
- else if (last_slash && last_slash != base && *(last_slash - 1) == '/')
- {
- /* example: http://host" */
- /* ^ */
- start_insert = end + 1;
- need_explicit_slash = 1;
- }
- else
- {
- /* example: "whatever/foo/bar" */
- /* ^ */
- start_insert = last_slash + 1;
- }
-
- span = start_insert - base;
- constr = (char *)xmalloc (span + linklength + 1);
- if (span)
- memcpy (constr, base, span);
- if (need_explicit_slash)
- constr[span - 1] = '/';
- if (linklength)
- memcpy (constr + span, link, linklength);
- constr[span + linklength] = '\0';
+ /* Empty LINK points back to BASE, query string and all. */
+ constr = xstrdup (base);
+ }
+ else if (*link == '?')
+ {
+ /* LINK points to the same location, but changes the query
+ string. Examples: */
+ /* uri_merge("path", "?new") -> "path?new" */
+ /* uri_merge("path?foo", "?new") -> "path?new" */
+ /* uri_merge("path?foo#bar", "?new") -> "path?new" */
+ /* uri_merge("path#foo", "?new") -> "path?new" */
+ int baselength = end - base;
+ constr = xmalloc (baselength + linklength + 1);
+ memcpy (constr, base, baselength);
+ memcpy (constr + baselength, link, linklength);
+ constr[baselength + linklength] = '\0';
}
- else /* *link == `/' */
+ else if (*link == '#')
+ {
+ /* uri_merge("path", "#new") -> "path#new" */
+ /* uri_merge("path#foo", "#new") -> "path#new" */
+ /* uri_merge("path?foo", "#new") -> "path?foo#new" */
+ /* uri_merge("path?foo#bar", "#new") -> "path?foo#new" */
+ int baselength;
+ const char *end1 = strchr (base, '#');
+ if (!end1)
+ end1 = base + strlen (base);
+ baselength = end1 - base;
+ constr = xmalloc (baselength + linklength + 1);
+ memcpy (constr, base, baselength);
+ memcpy (constr + baselength, link, linklength);
+ constr[baselength + linklength] = '\0';
+ }
+ else if (*link == '/')
{
/* LINK is an absolute path: we need to replace everything
after (and including) the FIRST slash with LINK.
memcpy (constr + span, link, linklength);
constr[span + linklength] = '\0';
}
+ else
+ {
+ /* LINK is a relative URL: we need to replace everything
+ after last slash (possibly empty) with LINK.
+
+ So, if BASE is "whatever/foo/bar", and LINK is "qux/xyzzy",
+ our result should be "whatever/foo/qux/xyzzy". */
+ int need_explicit_slash = 0;
+ int span;
+ const char *start_insert;
+ const char *last_slash = find_last_char (base, end, '/');
+ if (!last_slash)
+ {
+ /* No slash found at all. Append LINK to what we have,
+ but we'll need a slash as a separator.
+
+ Example: if base == "foo" and link == "qux/xyzzy", then
+ we cannot just append link to base, because we'd get
+ "fooqux/xyzzy", whereas what we want is
+ "foo/qux/xyzzy".
+
+ To make sure the / gets inserted, we set
+ need_explicit_slash to 1. We also set start_insert
+ to end + 1, so that the length calculations work out
+ correctly for one more (slash) character. Accessing
+ that character is fine, since it will be the
+ delimiter, '\0' or '?'. */
+ /* example: "foo?..." */
+ /* ^ ('?' gets changed to '/') */
+ start_insert = end + 1;
+ need_explicit_slash = 1;
+ }
+ else if (last_slash && last_slash != base && *(last_slash - 1) == '/')
+ {
+ /* example: http://host" */
+ /* ^ */
+ start_insert = end + 1;
+ need_explicit_slash = 1;
+ }
+ else
+ {
+ /* example: "whatever/foo/bar" */
+ /* ^ */
+ start_insert = last_slash + 1;
+ }
+
+ span = start_insert - base;
+ constr = (char *)xmalloc (span + linklength + 1);
+ if (span)
+ memcpy (constr, base, span);
+ if (need_explicit_slash)
+ constr[span - 1] = '/';
+ if (linklength)
+ memcpy (constr + span, link, linklength);
+ constr[span + linklength] = '\0';
+ }
}
else /* !no_scheme */
{
/* Change the links in an HTML document. Accepts a structure that
defines the positions of all the links. */
void
-convert_links (const char *file, urlpos *l)
+convert_links (const char *file, struct urlpos *l)
{
struct file_memory *fm;
FILE *fp;
const char *p;
downloaded_file_t downloaded_file_return;
+ int to_url_count = 0, to_file_count = 0;
logprintf (LOG_VERBOSE, _("Converting %s... "), file);
/* First we do a "dry run": go through the list L and see whether
any URL needs to be converted in the first place. If not, just
leave the file alone. */
- int count = 0;
- urlpos *dry = l;
+ int dry_count = 0;
+ struct urlpos *dry = l;
for (dry = l; dry; dry = dry->next)
if (dry->convert != CO_NOCONVERT)
- ++count;
- if (!count)
+ ++dry_count;
+ if (!dry_count)
{
logputs (LOG_VERBOSE, _("nothing to do.\n"));
return;
/* If the URL is not to be converted, skip it. */
if (l->convert == CO_NOCONVERT)
{
- DEBUGP (("Skipping %s at position %d.\n", l->url, l->pos));
+ DEBUGP (("Skipping %s at position %d.\n", l->url->url, l->pos));
continue;
}
char *quoted_newname = html_quote_string (newname);
replace_attr (&p, l->size, fp, quoted_newname);
DEBUGP (("TO_RELATIVE: %s to %s at position %d in %s.\n",
- l->url, newname, l->pos, file));
+ l->url->url, newname, l->pos, file));
xfree (newname);
xfree (quoted_newname);
+ ++to_file_count;
}
else if (l->convert == CO_CONVERT_TO_COMPLETE)
{
/* Convert the link to absolute URL. */
- char *newlink = l->url;
+ char *newlink = l->url->url;
char *quoted_newlink = html_quote_string (newlink);
replace_attr (&p, l->size, fp, quoted_newlink);
DEBUGP (("TO_COMPLETE: <something> to %s at position %d in %s.\n",
newlink, l->pos, file));
xfree (quoted_newlink);
+ ++to_url_count;
}
}
/* Output the rest of the file. */
fwrite (p, 1, fm->length - (p - fm->content), fp);
fclose (fp);
read_file_free (fm);
- logputs (LOG_VERBOSE, _("done.\n"));
+ logprintf (LOG_VERBOSE,
+ _("%d-%d\n"), to_file_count, to_url_count);
}
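
Editorial note, not part of the patch: with the two counters added above, the
verbose "Converting FILE... " line now ends with a pair of numbers rather than
"done.".  For example, "Converting index.html... 12-3" would mean twelve links
rewritten to relative local names and three rewritten to complete URLs.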
/* Construct and return a malloced copy of the relative link from two
return res;
}
\f
-/* Add URL to the head of the list L. */
-urlpos *
-add_url (urlpos *l, const char *url, const char *file)
-{
- urlpos *t;
-
- t = (urlpos *)xmalloc (sizeof (urlpos));
- memset (t, 0, sizeof (*t));
- t->url = xstrdup (url);
- t->local_name = xstrdup (file);
- t->next = l;
- return t;
-}
-
static void
write_backup_file (const char *file, downloaded_file_t downloaded_file_return)
{
-- Dan Harkless <wget@harkless.org>
This [adding a field to the urlpos structure] didn't work
- because convert_file() is called twice: once after all its
- sublinks have been retrieved in recursive_retrieve(), and
- once at the end of the day in convert_all_links(). The
- original linked list collected in recursive_retrieve() is
- lost after the first invocation of convert_links(), and
- convert_all_links() makes a new one (it calls get_urls_html()
- for each file it covers.) That's why your first approach didn't
- work. The way to make it work is perhaps to make this flag a
- field in the `urls_html' list.
+ because convert_file() is called from convert_all_links at
+ the end of the retrieval with a freshly built new urlpos
+ list.
-- Hrvoje Niksic <hniksic@arsdigita.com>
*/
converted_file_ptr = xmalloc(sizeof(*converted_file_ptr));
return 0;
}
-typedef struct _downloaded_file_list {
- char* file;
- downloaded_file_t download_type;
- struct _downloaded_file_list* next;
-} downloaded_file_list;
+/* We're storing "modes" of type downloaded_file_t in the hash table.
+ However, our hash tables only accept pointers for keys and values.
+ So when we need a pointer, we use the address of a
+ downloaded_file_t variable of static storage. */
+
+static downloaded_file_t *
+downloaded_mode_to_ptr (downloaded_file_t mode)
+{
+ static downloaded_file_t
+ v1 = FILE_NOT_ALREADY_DOWNLOADED,
+ v2 = FILE_DOWNLOADED_NORMALLY,
+ v3 = FILE_DOWNLOADED_AND_HTML_EXTENSION_ADDED,
+ v4 = CHECK_FOR_FILE;
+
+ switch (mode)
+ {
+ case FILE_NOT_ALREADY_DOWNLOADED:
+ return &v1;
+ case FILE_DOWNLOADED_NORMALLY:
+ return &v2;
+ case FILE_DOWNLOADED_AND_HTML_EXTENSION_ADDED:
+ return &v3;
+ case CHECK_FOR_FILE:
+ return &v4;
+ }
+ return NULL;
+}
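
Editorial aside, not part of the patch: the idiom above works because each mode
maps to the address of a distinct object with static storage duration, so the
pointer placed in the hash table stays valid after the function returns.  A
minimal stand-alone illustration of the same idiom:

#include <stdio.h>

enum color { RED, GREEN };

static enum color *
color_to_ptr (enum color c)
{
  /* Addresses of statics, never of locals, so they outlive the call.  */
  static enum color red = RED, green = GREEN;
  return c == RED ? &red : &green;
}

int
main (void)
{
  void *stored = color_to_ptr (GREEN);       /* what a table would hold */
  printf ("%d\n", *(enum color *) stored);   /* prints 1, still valid */
  return 0;
}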
+
+/* This should really be merged with dl_file_url_map and
+ downloaded_html_files in recur.c. This was originally a list, but
+   I changed it to a hash table because it was actually taking a lot of
+ time to find things in it. */
-static downloaded_file_list *downloaded_files;
+static struct hash_table *downloaded_files_hash;
/* Remembers which files have been downloaded. In the standard case, should be
called with mode == FILE_DOWNLOADED_NORMALLY for each file we actually
it, call with mode == CHECK_FOR_FILE. Please be sure to call this function
with local filenames, not remote URLs. */
downloaded_file_t
-downloaded_file (downloaded_file_t mode, const char* file)
+downloaded_file (downloaded_file_t mode, const char *file)
{
- boolean found_file = FALSE;
- downloaded_file_list* rover = downloaded_files;
+ downloaded_file_t *ptr;
- while (rover != NULL)
- if (strcmp(rover->file, file) == 0)
- {
- found_file = TRUE;
- break;
- }
- else
- rover = rover->next;
-
- if (found_file)
- return rover->download_type; /* file had already been downloaded */
- else
+ if (mode == CHECK_FOR_FILE)
{
- if (mode != CHECK_FOR_FILE)
- {
- rover = xmalloc(sizeof(*rover));
- rover->file = xstrdup(file); /* use xstrdup() so die on out-of-mem. */
- rover->download_type = mode;
- rover->next = downloaded_files;
- downloaded_files = rover;
- }
-
- return FILE_NOT_ALREADY_DOWNLOADED;
+ if (!downloaded_files_hash)
+ return FILE_NOT_ALREADY_DOWNLOADED;
+ ptr = hash_table_get (downloaded_files_hash, file);
+ if (!ptr)
+ return FILE_NOT_ALREADY_DOWNLOADED;
+ return *ptr;
}
+
+ if (!downloaded_files_hash)
+ downloaded_files_hash = make_string_hash_table (0);
+
+ ptr = hash_table_get (downloaded_files_hash, file);
+ if (ptr)
+ return *ptr;
+
+ ptr = downloaded_mode_to_ptr (mode);
+  hash_table_put (downloaded_files_hash, xstrdup (file), ptr);
+
+ return FILE_NOT_ALREADY_DOWNLOADED;
+}
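
Editorial illustration, not part of the patch, derived from the control flow
above: the first call for a local file records its mode and reports the file as
new; later CHECK_FOR_FILE queries return the recorded mode.

  downloaded_file_t r;
  r = downloaded_file (FILE_DOWNLOADED_NORMALLY, "index.html");
  /* r == FILE_NOT_ALREADY_DOWNLOADED; the file and its mode are recorded.  */
  r = downloaded_file (CHECK_FOR_FILE, "index.html");
  /* r == FILE_DOWNLOADED_NORMALLY; the recorded mode is returned.  */
  r = downloaded_file (CHECK_FOR_FILE, "other.html");
  /* r == FILE_NOT_ALREADY_DOWNLOADED; nothing is recorded for that name.  */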
+
+static int
+df_free_mapper (void *key, void *value, void *ignored)
+{
+ xfree (key);
+ return 0;
}
void
downloaded_files_free (void)
{
- downloaded_file_list* rover = downloaded_files;
- while (rover)
+ if (downloaded_files_hash)
{
- downloaded_file_list *next = rover->next;
- xfree (rover->file);
- xfree (rover);
- rover = next;
+ hash_table_map (downloaded_files_hash, df_free_mapper, NULL);
+ hash_table_destroy (downloaded_files_hash);
+ downloaded_files_hash = NULL;
}
}
/* A structure that defines the whereabouts of a URL, i.e. its
position in an HTML document, etc. */
-typedef struct _urlpos
-{
- char *url; /* linked URL, after it has been
- merged with the base */
- char *local_name; /* Local file to which it was saved */
+struct urlpos {
+ struct url *url; /* the URL of the link, after it has
+ been merged with the base */
+ char *local_name; /* local file to which it was saved
+ (used by convert_links) */
/* Information about the original link: */
int link_relative_p; /* was the link relative? */
/* URL's position in the buffer. */
int pos, size;
- struct _urlpos *next; /* Next struct in list */
-} urlpos;
+ struct urlpos *next; /* next list element */
+};
/* downloaded_file() takes a parameter of this type and returns this type. */
typedef enum
char *url_string PARAMS ((const struct url *, int));
-urlpos *get_urls_file PARAMS ((const char *));
-urlpos *get_urls_html PARAMS ((const char *, const char *, int, int *));
-void free_urlpos PARAMS ((urlpos *));
+struct urlpos *get_urls_file PARAMS ((const char *));
+struct urlpos *get_urls_html PARAMS ((const char *, const char *, int, int *));
+void free_urlpos PARAMS ((struct urlpos *));
char *uri_merge PARAMS ((const char *, const char *));
int mkalldirs PARAMS ((const char *));
char *url_filename PARAMS ((const struct url *));
-char *getproxy PARAMS ((uerr_t));
+char *getproxy PARAMS ((enum url_scheme));
int no_proxy_match PARAMS ((const char *, const char **));
-void convert_links PARAMS ((const char *, urlpos *));
-urlpos *add_url PARAMS ((urlpos *, const char *, const char *));
+void convert_links PARAMS ((const char *, struct urlpos *));
downloaded_file_t downloaded_file PARAMS ((downloaded_file_t, const char *));
#endif /* DEBUG_MALLOC */
\f
+/* Utility function: like xstrdup(), but also lowercases S. */
+
+char *
+xstrdup_lower (const char *s)
+{
+ char *copy = xstrdup (s);
+ char *p = copy;
+ for (; *p; p++)
+ *p = TOLOWER (*p);
+ return copy;
+}
+
/* Copy the string formed by two pointers (one on the beginning, other
on the char after the last char) to a new, malloc-ed location.
0-terminate it. */
}
#endif /* not WINDOWS */
\f
+#if 0
+/* debug */
char *
ps (char *orig)
{
path_simplify (r);
return r;
}
+#endif
/* Canonicalize PATH, and return a new path. The new path differs from PATH
in that:
Change the original string instead of strdup-ing.
React correctly when beginning with `./' and `../'.
Don't zip out trailing slashes. */
-void
+int
path_simplify (char *path)
{
- register int i, start, ddot;
+ register int i, start;
+ int changes = 0;
char stub_char;
if (!*path)
- return;
+ return 0;
- /*stub_char = (*path == '/') ? '/' : '.';*/
stub_char = '/';
- /* Addition: Remove all `./'-s preceding the string. If `../'-s
- precede, put `/' in front and remove them too. */
- i = 0;
- ddot = 0;
- while (1)
- {
- if (path[i] == '.' && path[i + 1] == '/')
- i += 2;
- else if (path[i] == '.' && path[i + 1] == '.' && path[i + 2] == '/')
- {
- i += 3;
- ddot = 1;
- }
- else
- break;
- }
- if (i)
- strcpy (path, path + i - ddot);
+ if (path[0] == '/')
+ /* Preserve initial '/'. */
+ ++path;
- /* Replace single `.' or `..' with `/'. */
+  /* Nix out leading `.' or `..' with '\0'.  */
if ((path[0] == '.' && path[1] == '\0')
|| (path[0] == '.' && path[1] == '.' && path[2] == '\0'))
{
- path[0] = stub_char;
- path[1] = '\0';
- return;
+ path[0] = '\0';
+ changes = 1;
+ return changes;
}
+
/* Walk along PATH looking for things to compact. */
i = 0;
while (1)
{
strcpy (path + start + 1, path + i);
i = start + 1;
+ changes = 1;
}
/* Check for `../', `./' or trailing `.' by itself. */
if (!path[i + 1])
{
path[--i] = '\0';
+ changes = 1;
break;
}
{
strcpy (path + i, path + i + 1);
i = (start < 0) ? 0 : start;
+ changes = 1;
continue;
}
(path[i + 2] == '/' || !path[i + 2]))
{
while (--start > -1 && path[start] != '/');
- strcpy (path + start + 1, path + i + 2);
+ strcpy (path + start + 1, path + i + 2 + (start == -1 && path[i + 2]));
i = (start < 0) ? 0 : start;
+ changes = 1;
continue;
}
} /* path == '.' */
} /* while */
+
+ /* Addition: Remove all `./'-s and `../'-s preceding the string. */
+ i = 0;
+ while (1)
+ {
+ if (path[i] == '.' && path[i + 1] == '/')
+ i += 2;
+ else if (path[i] == '.' && path[i + 1] == '.' && path[i + 2] == '/')
+ i += 3;
+ else
+ break;
+ }
+ if (i)
+ {
+      strcpy (path, path + i);
+ changes = 1;
+ }
+
+ return changes;
}
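
Editorial sketch, not part of the patch, assuming it is compiled inside the
Wget source tree: path_simplify () still edits PATH in place, but its return
value now tells callers such as url_parse () whether anything changed.  A
hypothetical demo helper:

#include <stdio.h>
#include <string.h>
#include "wget.h"
#include "utils.h"

static void
demo_simplify (const char *p)
{
  char buf[256];
  int changed;

  strcpy (buf, p);
  changed = path_simplify (buf);
  printf ("%-12s -> %-12s (changed: %d)\n", p, buf, changed);
}

/* demo_simplify ("a/b/../c") is expected to print "a/c" with changed == 1,
   while demo_simplify ("a/b/c") should leave the path alone and report 0.  */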
\f
/* "Touch" FILE, i.e. make its atime and mtime equal to the time
void print_malloc_debug_stats ();
#endif
+char *xstrdup_lower PARAMS ((const char *));
char *strdupdelim PARAMS ((const char *, const char *));
char **sepstring PARAMS ((const char *));
int frontcmp PARAMS ((const char *, const char *));
char *pwd_cuserid PARAMS ((char *));
void fork_to_background PARAMS ((void));
-void path_simplify PARAMS ((char *));
+int path_simplify PARAMS ((char *));
void touch PARAMS ((const char *, time_t));
int remove_link PARAMS ((const char *));
char *html_quote_string PARAMS ((const char *));
+int determine_screen_width PARAMS ((void));
+
#endif /* UTILS_H */
# define NDEBUG /* To kill off assertions */
#endif /* not DEBUG */
+/* Define this if you want primitive but extensive malloc debugging.
+ It will make Wget extremely slow, so only do it in development
+ builds. */
+#undef DEBUG_MALLOC
+
#ifndef PARAMS
# if PROTOTYPES
# define PARAMS(args) args
3) Finally, the debug messages are meant to be a clue for me to
debug problems with Wget. If I get them in a language I don't
- understand, debugging will become a new challenge of its own! :-) */
+ understand, debugging will become a new challenge of its own! */
/* Include these, so random files need not include them. */