Automated merge.

author Saint Xavier <wget@sxav.eu>

Sun, 20 Jul 2008 19:45:09 +0000 (21:45 +0200)

committer Saint Xavier <wget@sxav.eu>

Sun, 20 Jul 2008 19:45:09 +0000 (21:45 +0200)
diff --git a/.hgignore b/.hgignore
index d5895afecdc469f45ec31ff053a7a99c78074714..0b6a441f554249a89cdced790cb967ba7560f23f 100644
--- a/.hgignore
+++ b/.hgignore
@@ -36,6 +36,7 @@ src/.deps
  src/stamp-h1
  src/config.h
  src/config.h.in
+src/css.c
  src/wget
  src/cscope.out
  src/libunittest.a
diff --git a/AUTHORS b/AUTHORS
index 219877e9a4fb65bad0661875c0f5d7ef8376215b..ab44acb2aa366c9897d27562d8fb91ed4590d136 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -45,3 +45,8 @@ Micah Cowan.  Current Wget maintainer, from mid-2007.
  
  Ralf Wildenhues.  Contributed patches to convert Wget to use Automake as
  part of its build process, and various bugfixes.
+
+Steven Schubiger.  Many helpful patches, bugfixes and improvements.
+Notably, conversion of Wget to use the Gnulib quotes and quoteargs
+modules, and the addition of password prompts at the console, via the
+Gnulib getpasswd-gnu module.
diff --git a/ChangeLog b/ChangeLog
index 8989841447b3d37fd88b89114f348b6b9eeb1974..21d380b261252d9c27b43fe1f9efe87ad89ab99f 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,9 @@
+2008-06-30  Micah Cowan  <micah@cowan.name>
+
+       * NEWS: Entries for 1.11.4.
+
+       * AUTHORS: Added Steven Schubiger.
+
  2008-06-26  Xavier Saint  <wget@sxav.eu>
  
         * configure.ac : IRIs support required libiconv, check it.
@@ -54,6 +60,19 @@
         md5/m4/stdint.m4, md5/md5.c, md5/md5.h, md5/stdint.in.h,
         md5/wchar.in.h: Updated from gnulib.
   
+2008-04-24  Micah Cowan  <micah@cowan.name>
+
+       * NEWS: Removed info about move to Automake, Gnulib. Added item
+       about the addition of CSS support.
+
+2008-04-22  Micah Cowan  <micah@cowan.name>
+
+       * ylwrap: Added via automake -ac.
+
+2008-04-22  Ted Mielczarek  <ted.mielczarek@gmail.com>
+
+       * configure.ac: Added check for lex.
+
  2008-04-14  Micah Cowan  <micah@cowan.name>
  
         * GNUmakefile, lib/Makefile.am, lib/error.c, lib/error.h,
diff --git a/NEWS b/NEWS
index e74ce2a0ae1a9c74ecc539ec283386eff270aea9..a57747942afb81491f837cc56f02ec479780c038 100644
--- a/NEWS
+++ b/NEWS
@@ -8,9 +8,36 @@ Please send GNU Wget bug reports to <bug-wget@gnu.org>.
  \f
  * Changes in Wget 1.12 (MAINLINE)
  
+** Added support for CSS. This includes:
+     - Parsing links from CSS files, and from CSS content found in HTML
+       style tags and attributes.
+     - Supporting conversion of links found within CSS content, when
+       --convert-links is specified.
+     - Ensuring that CSS files end in the ".css" filename extension,
+       when --html-extension is specified.
+
+   CSS support in Wget is thanks to Ted Mielczarek
+   <ted.mielczarek@gmail.com>.
+
  ** --ask-password option (and associated wgetrc command) added to
  support password prompts at the console.
+ 
+** The --input-file option now also handles retrieving links from
+an external file.
+\f
+* Changes in Wget 1.11.4
+
+** Fixed an issue (apparently a regression) where -O would refuse to
+download when -nc was given, even though the file didn't exist.
+
+** Fixed a situation where Wget could abort with --continue if the
+remote server gives a content-length of zero when the file exists
+locally with content.
+
+** Fixed a crash on some systems, due to Wget casting a pointer-to-long
+to a pointer-to-time_t.
  
+** Translation updates for Catalan.
  \f
  * Changes in Wget 1.11.3
  
diff --git a/configure.ac b/configure.ac
index b29234360ac09c012099105edd718fbd11abcee4..fb0c65d10d10bef526dd59126868a463653e8c3e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -113,6 +113,8 @@ md5_EARLY
  
  AC_PROG_RANLIB
  
+AC_PROG_LEX
+
  dnl Turn on optimization by default.  Specifically:
  dnl
  dnl if the user hasn't specified CFLAGS, then
diff --git a/doc/ChangeLog b/doc/ChangeLog
index e028ad40380246dfeb1531cd8366cc6f2136cd8e..94d724438506e74994eaeb67b83bd008e9361466 100644
--- a/doc/ChangeLog
+++ b/doc/ChangeLog
@@ -1,3 +1,10 @@
+2008-06-29  Micah Cowan  <micah@cowan.name>
+
+       * wget.texi <Contributors>: Added Joao Ferreira, Mike Frysinger,
+       Alain Guibert, Madhusudan Hosaagrahara, Jim Paris, Kenny
+       Parnell, Benno Schulenberg, and Pranab Shenoy. Added Steven
+       Schubiger to the "Special Thanks" section.
+
  2008-06-13  Micah Cowan  <micah@cowan.name>
  
         * wget.texi (Mailing List): The wget-notify mailing list no longer
@@ -26,6 +33,11 @@
         * wget.texi (Download Options) <-O>: Elaborate on why certain
         options make poor combinations with -O.
  
+2008-04-24  Micah Cowan  <micah@cowan.name>
+
+       * wget.texi: Adjusted documentation to account for CSS support;
+       added Ted Mielczarek to contributors.
+
  2008-04-22  Mike Frysinger  <vapier@gentoo.org>
  
         * sample.wgetrc: Added prefer_family example. Resolves bug
diff --git a/doc/wget.texi b/doc/wget.texi
index 83f97a3768673feb1dc51072f3a75cc044f5be85..50ee459e95c33fc3709566167d6ad7b04c9dc159 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -3,7 +3,7 @@
  @c %**start of header
  @setfilename wget.info
  @include version.texi
-@set UPDATED Mar 2008
+@set UPDATED Jun 2008
  @settitle GNU Wget @value{VERSION} Manual
  @c Disable the monstrous rectangles beside overfull hbox-es.
  @finalout
@@ -133,13 +133,13 @@ which can be a great hindrance when transferring a lot of data.
  @c man end
  @end ignore
  @c man begin DESCRIPTION
-Wget can follow links in @sc{html} and @sc{xhtml} pages and create local 
-versions of remote web sites, fully recreating the directory structure of 
-the original site.  This is sometimes referred to as ``recursive
-downloading.''  While doing that, Wget respects the Robot Exclusion
-Standard (@file{/robots.txt}).  Wget can be instructed to convert the
-links in downloaded @sc{html} files to the local files for offline
-viewing.
+Wget can follow links in @sc{html}, @sc{xhtml}, and @sc{css} pages, to
+create local versions of remote web sites, fully recreating the
+directory structure of the original site.  This is sometimes referred to
+as ``recursive downloading.''  While doing that, Wget respects the Robot
+Exclusion Standard (@file{/robots.txt}).  Wget can be instructed to
+convert the links in downloaded files to point at the local files, for
+offline viewing.
  @c man end
  
  @item
@@ -480,9 +480,9 @@ printed.
  @cindex input-file
  @item -i @var{file}
  @itemx --input-file=@var{file}
-Read @sc{url}s from @var{file}.  If @samp{-} is specified as
-@var{file}, @sc{url}s are read from the standard input.  (Use
-@samp{./-} to read from a file literally named @samp{-}.)
+Read @sc{url}s from a local or external @var{file}.  If @samp{-} is
+specified as @var{file}, @sc{url}s are read from the standard input.  
+(Use @samp{./-} to read from a file literally named @samp{-}.)
  
  If this function is used, no @sc{url}s need be present on the command
  line.  If there are @sc{url}s both on the command line and in an input
@@ -1093,6 +1093,11 @@ re-downloading, you must use @samp{-k} and @samp{-K} so that the original
  version of the file will be saved as @file{@var{X}.orig} (@pxref{Recursive 
  Retrieval Options}).
  
+As of version 1.12, Wget will also ensure that any downloaded files of
+type @samp{text/css} end in the suffix @samp{.css}. Obviously, this
+makes the name @samp{--html-extension} misleading; a better name is
+expected to be offered as an alternative in the near future.
+
  @cindex http user
  @cindex http password
  @cindex authentication
@@ -1943,16 +1948,17 @@ GNU Wget is capable of traversing parts of the Web (or a single
  @sc{http} or @sc{ftp} server), following links and directory structure.
  We refer to this as to @dfn{recursive retrieval}, or @dfn{recursion}.
  
-With @sc{http} @sc{url}s, Wget retrieves and parses the @sc{html} from
-the given @sc{url}, documents, retrieving the files the @sc{html}
-document was referring to, through markup like @code{href}, or
-@code{src}.  If the freshly downloaded file is also of type
-@code{text/html} or @code{application/xhtml+xml}, it will be parsed and 
-followed further.
+With @sc{http} @sc{url}s, Wget retrieves and parses the @sc{html} or
+@sc{css} from the given @sc{url}, retrieving the files the document
+refers to, through markup like @code{href} or @code{src}, or @sc{css}
+@sc{uri} values specified using the @samp{url()} functional notation.
+If the freshly downloaded file is also of type @code{text/html},
+@code{application/xhtml+xml}, or @code{text/css}, it will be parsed
+and followed further.
  
-Recursive retrieval of @sc{http} and @sc{html} content is
+Recursive retrieval of @sc{http} and @sc{html}/@sc{css} content is
  @dfn{breadth-first}.  This means that Wget first downloads the requested
-@sc{html} document, then the documents linked from that document, then the
+document, then the documents linked from that document, then the
  documents linked by them, and so on.  In other words, Wget first
  downloads the documents at depth 1, then those at depth 2, and so on
  until the specified maximum depth.
@@ -2741,7 +2747,8 @@ Define a header for HTTP downloads, like using
  
  @item html_extension = on/off
  Add a @samp{.html} extension to @samp{text/html} or
-@samp{application/xhtml+xml} files without it, like @samp{-E}.
+@samp{application/xhtml+xml} files without it, or a @samp{.css}
+extension to @samp{text/css} files without it, like @samp{-E}.
  
  @item http_keep_alive = on/off
  Turn the keep-alive feature on or off (defaults to on).  Turning it
@@ -3103,7 +3110,7 @@ wget -r http://www.gnu.org/ -o gnulog
  @end example
  
  @item
-The same as the above, but convert the links in the @sc{html} files to
+The same as the above, but convert the links in the downloaded files to
  point to local files, so you can view the documents off-line:
  
  @example
@@ -3749,21 +3756,30 @@ Junio Hamano---donated support for Opie and @sc{http} @code{Digest}
  authentication.
  
  @item
-Mauro Tortonesi---Improved IPv6 support, adding support for dual
+Mauro Tortonesi---improved IPv6 support, adding support for dual
  family systems.  Refactored and enhanced FTP IPv6 code. Maintained GNU
  Wget from 2004--2007.
  
  @item
-Christopher G.@: Lewis---Maintenance of the Windows version of GNU WGet.
+Christopher G.@: Lewis---maintenance of the Windows version of GNU Wget.
  
  @item
-Gisle Vanem---Many helpful patches and improvements, especially for
+Gisle Vanem---many helpful patches and improvements, especially for
  Windows and MS-DOS support.
  
  @item
-Ralf Wildenhues---Contributed patches to convert Wget to use Automake as
+Ralf Wildenhues---contributed patches to convert Wget to use Automake as
  part of its build process, and various bugfixes.
  
+@item
+Steven Schubiger---many helpful patches, bugfixes and improvements.
+Notably, conversion of Wget to use the Gnulib quotes and quoteargs
+modules, and the addition of password prompts at the console, via the
+Gnulib getpasswd-gnu module.
+
+@item
+Ted Mielczarek---donated support for CSS.
+
  @item
  People who provided donations for development---including Brian Gough.
  @end itemize
@@ -3819,8 +3835,15 @@ Aleksandar Erkalovi@'{c},
  Aleksandar Erkalovic,
  @end ifnottex
  Andy Eskilsson,
+@iftex
+Jo@~{a}o Ferreira,
+@end iftex
+@ifnottex
+Joao Ferreira,
+@end ifnottex
  Christian Fraenkel,
  David Fritz,
+Mike Frysinger,
  Charles C.@: Fu,
  FUJISHIMA Satsuki,
  Masashi Fujita,
@@ -3828,10 +3851,12 @@ Howard Gayle,
  Marcel Gerrits,
  Lemble Gregory,
  Hans Grobler,
+Alain Guibert,
  Mathieu Guillaume,
  Aaron Hawley,
  Jochen Hein,
  Karl Heuer,
+Madhusudan Hosaagrahara,
  HIROSE Masaaki,
  Ulf Harnhammar,
  Gregor Hoffleit,
@@ -3895,6 +3920,7 @@ Andre Majorel,
  Aurelien Marchand,
  Matthew J.@: Mellon,
  Jordan Mendelson,
+Ted Mielczarek,
  Lin Zhe Min,
  Jan Minar,
  Tim Mooney,
@@ -3903,6 +3929,8 @@ Adam D.@: Moss,
  Simon Munton,
  Charlie Negyesi,
  R.@: K.@: Owen,
+Jim Paris,
+Kenny Parnell,
  Leonid Petrov,
  Simone Piunno,
  Andrew Pollock,
@@ -3937,9 +3965,11 @@ Edward J.@: Sabol,
  Heinz Salzmann,
  Robert Schmidt,
  Nicolas Schodet,
+Benno Schulenberg,
  Andreas Schwab,
  Steven M.@: Schweda,
  Chris Seawood,
+Pranab Shenoy,
  Dennis Smit,
  Toomas Soome,
  Tage Stabell-Kulo,
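Note on the wget.texi hunk above that describes recursive retrieval as breadth-first: the ordering can be illustrated with a small queue-based traversal. This is only a sketch under stated assumptions, not Wget's retrieve_tree. The adjacency-matrix input and the name sketch_bfs_depths are hypothetical, and the starting document is treated as depth 0.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical illustration, not Wget code.  ADJ is an N*N row-major
   matrix; adj[i * n + j] != 0 means document i links to document j.
   Fills DEPTH[i] with the depth at which document i is first reached
   (START is depth 0), or -1 if it is not reached within MAX_DEPTH.
   Returns the number of documents visited, or -1 if allocation fails.  */
int
sketch_bfs_depths (const int *adj, int n, int start, int max_depth,
                   int *depth)
{
  int *queue;
  int head = 0, tail = 0, i;

  assert (adj != NULL && depth != NULL);
  assert (n > 0 && start >= 0 && start < n);

  queue = malloc (n * sizeof *queue);
  if (queue == NULL)
    return -1;

  for (i = 0; i < n; i++)
    depth[i] = -1;

  depth[start] = 0;
  queue[tail++] = start;

  /* Each document is enqueued at most once, so TAIL never exceeds N.  */
  while (head < tail)
    {
      int doc = queue[head++];

      /* Do not follow links out of documents at the maximum depth.  */
      if (depth[doc] >= max_depth)
        continue;

      for (i = 0; i < n; i++)
        if (adj[doc * n + i] && depth[i] < 0)
          {
            depth[i] = depth[doc] + 1;
            queue[tail++] = i;
          }
    }

  free (queue);
  return tail;
}
```

Because the queue is first-in, first-out, every document at depth 1 is dequeued before any document at depth 2, which matches the order the manual text describes.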
diff --git a/src/ChangeLog b/src/ChangeLog
index 7aca052701951708624a479b1f793e21f1dfe1c9..02bc331b8e4a712f29cebce774b0df63a080ff0f 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -6,6 +6,11 @@
         * host.c : Show hostname to be resolved both in locale and
         ASCII encoded.
  
+2008-06-28  Steven Schubiger  <stsc@members.fsf.org>
+
+       * retr.c (retrieve_from_file): Allow for reading the links from
+       an external file (HTTP/FTP).
+
  2008-06-26  Xavier Saint  <wget@sxav.eu>
  
         * iri.c, iri.h : New functions locale_to_utf8() and
@@ -14,6 +19,11 @@
         * url.c : Convert URLs from locale to UTF-8 allowing a basic
         support of IRI/IDN
  
+2008-06-25  Steven Schubiger  <stsc@members.fsf.org>
+
+       * ftp.c (getftp): When spidering an FTP URL, emit a diagnostic
+       message if the remote file exists.
+
  2008-06-24  Steven Schubiger  <stsc@members.fsf.org>
  
         * http.c (http_loop): Replace escnonprint() occurence with
@@ -210,11 +220,55 @@
  
         * Makefile.am: -I foo -> -Ifoo.
  
+2008-04-24  Micah Cowan  <micah@cowan.name>
+
+       * main.c: Revised usage description of --convert-links to apply
+       to CSS as well as to HTML.
+
  2008-04-23  Micah Cowan  <micah@cowan.name>
  
         * utils.c (test_dir_matches_p): Added a test for the case
         described in issue #20518.
  
+2008-04-22  Micah Cowan  <micah@cowan.name>
+
+       * Makefile.am, css.lex, css.l: Renamed css.lex to css.l.
+       * recur.c (retrieve_tree): Fix typo to allow text/css files to
+       be parsed.
+
+2008-04-22  Ted Mielczarek  <ted.mielczarek@gmail.com>
+
+       * css.lex, css-url.c, css-url.h: Added to implement support for
+       parsing CSS in Wget.
+       * convert.c: Convert links in CSS files, too.
+       * convert.h (urlpos): Added bit-fields link_css_p and
+       link_expect_css.
+       * convert.h: Added prototype for new register_css function.
+       * html-parse.c: Added support for parsing element content, in
+       addition to tag starts and ends.
+       * html-parse.h (taginfo): Added delimiter fields for element
+       content.
+       * html-url.h: Added.
+       * html-url.c (append_url): No longer internal-linkage only. Now
+       takes position and size as explicit parameters.
+       * html-url.c: Use new html-url.h header, add support for
+       handling of "style" HTML attributes. Mark URIs obtained from
+       link tags with rel="stylesheet" with link_expect_css. Adapt
+       uses of append_url to supply the newly-added parameters for
+       position and size.
+       * http.c: Add detection for when the content-type is text/css;
+       and ensure that such files have the ".css" filename extension,
+       when --html-extension is active.
+       * recur.h: Remove declarations for functions found in
+       html-url.c (moved to html-url.h).
+       * recur.c: Add support for culling links from CSS files, too,
+       and tracking for when we're expecting the file to be CSS (even
+       when its content type isn't text/css).
+       * retr.c (retrieve_url): Add registration of CSS files.
+       * wget.h: Added TEXTCSS to dt flags enum.
+       * Makefile.am: Added css.lex, css-url.c, css-url.h, html-url.h
+       to wget_SOURCES.
+
  2008-04-22  Jim Paris  <jim@jtan.com>
  
         * openssl.c (ssl_init): Enable combined certificate/key in
diff --git a/src/Makefile.am b/src/Makefile.am
index 6ae5805d5ba6400354079aa871969c107c72f9f0..edbb592e8990422a8ae65d5954fa2c58d7aee599 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -40,13 +40,14 @@ LIBS     = @LIBSSL@ @LIBGNUTLS@ @LIBINTL@ @LIBS@
  
  bin_PROGRAMS = wget
  wget_SOURCES = build_info.c cmpt.c connect.c convert.c cookies.c ftp.c    \
+              css.l css-url.c \
                ftp-basic.c ftp-ls.c hash.c host.c html-parse.c html-url.c \
                http.c init.c log.c main.c netrc.c progress.c ptimer.c     \
                recur.c res.c retr.c snprintf.c spider.c url.c             \
                utils.c $(IRI_OBJ)                                         \
-              connect.h convert.h cookies.h                              \
-              ftp.h gen-md5.h hash.h host.h html-parse.h                 \
-              http.h http-ntlm.h init.h iri.h log.h mswindows.h netrc.h  \
+              css-url.h connect.h convert.h cookies.h                    \
+              ftp.h gen-md5.h hash.h host.h html-parse.h html-url.h      \
+              http.h http-ntlm.h init.h log.h mswindows.h netrc.h        \
                options.h progress.h ptimer.h recur.h res.h retr.h         \
                spider.h ssl.h sysdep.h url.h utils.h wget.h
  nodist_wget_SOURCES = version.c
diff --git a/src/convert.c b/src/convert.c
index 69d713f812d1d68e31cb343d936f876bcf512c0d..e72a4b0f50f4d8784ac63114c1da2b526ba96450 100644
--- a/src/convert.c
+++ b/src/convert.c
@@ -45,50 +45,37 @@ as that of the covered work.  */
  #include "hash.h"
  #include "ptimer.h"
  #include "res.h"
+#include "html-url.h"
+#include "css-url.h"
  
  static struct hash_table *dl_file_url_map;
  struct hash_table *dl_url_file_map;
  
-/* Set of HTML files downloaded in this Wget run, used for link
+/* Set of HTML/CSS files downloaded in this Wget run, used for link
     conversion after Wget is done.  */
  struct hash_table *downloaded_html_set;
+struct hash_table *downloaded_css_set;
  
  static void convert_links (const char *, struct urlpos *);
  
-/* This function is called when the retrieval is done to convert the
-   links that have been downloaded.  It has to be called at the end of
-   the retrieval, because only then does Wget know conclusively which
-   URLs have been downloaded, and which not, so it can tell which
-   direction to convert to.
-
-   The "direction" means that the URLs to the files that have been
-   downloaded get converted to the relative URL which will point to
-   that file.  And the other URLs get converted to the remote URL on
-   the server.
-
-   All the downloaded HTMLs are kept in downloaded_html_files, and
-   downloaded URLs in urls_downloaded.  All the information is
-   extracted from these two lists.  */
  
  void
-convert_all_links (void)
+convert_links_in_hashtable (struct hash_table *downloaded_set,
+                            int is_css,
+                            int *file_count)
  {
    int i;
-  double secs;
-  int file_count = 0;
-
-  struct ptimer *timer = ptimer_new ();
  
    int cnt;
    char **file_array;
  
    cnt = 0;
-  if (downloaded_html_set)
-    cnt = hash_table_count (downloaded_html_set);
+  if (downloaded_set)
+    cnt = hash_table_count (downloaded_set);
    if (cnt == 0)
-    goto cleanup;
+    return;
    file_array = alloca_array (char *, cnt);
-  string_set_to_array (downloaded_html_set, file_array);
+  string_set_to_array (downloaded_set, file_array);
  
    for (i = 0; i < cnt; i++)
      {
@@ -96,7 +83,7 @@ convert_all_links (void)
        char *url;
        char *file = file_array[i];
  
-      /* Determine the URL of the HTML file.  get_urls_html will need
+      /* Determine the URL of the file.  get_urls_{html,css} will need
           it.  */
        url = hash_table_get (dl_file_url_map, file);
        if (!url)
@@ -107,8 +94,9 @@ convert_all_links (void)
  
        DEBUGP (("Scanning %s (from %s)\n", file, url));
  
-      /* Parse the HTML file...  */
-      urls = get_urls_html (file, url, NULL);
+      /* Parse the file...  */
+      urls = is_css ? get_urls_css_file (file, url) :
+                      get_urls_html (file, url, NULL);
  
        /* We don't respect meta_disallow_follow here because, even if
           the file is not followed, we might still want to convert the
@@ -160,27 +148,55 @@ convert_all_links (void)
  
        /* Convert the links in the file.  */
        convert_links (file, urls);
-      ++file_count;
+      ++*file_count;
  
        /* Free the data.  */
        free_urlpos (urls);
      }
+}
+
+/* This function is called when the retrieval is done to convert the
+   links that have been downloaded.  It has to be called at the end of
+   the retrieval, because only then does Wget know conclusively which
+   URLs have been downloaded, and which not, so it can tell which
+   direction to convert to.
+
+   The "direction" means that the URLs to the files that have been
+   downloaded get converted to the relative URL which will point to
+   that file.  And the other URLs get converted to the remote URL on
+   the server.
+
+   Downloaded HTML and CSS files are kept in downloaded_html_set and
+   downloaded_css_set, and the URL/file mappings in dl_file_url_map
+   and dl_url_file_map.  All the information comes from these tables.  */
+
+void
+convert_all_links (void)
+{
+  double secs;
+  int file_count = 0;
+
+  struct ptimer *timer = ptimer_new ();
+
+  convert_links_in_hashtable (downloaded_html_set, 0, &file_count);
+  convert_links_in_hashtable (downloaded_css_set, 1, &file_count);
  
    secs = ptimer_measure (timer);
    logprintf (LOG_VERBOSE, _("Converted %d files in %s seconds.\n"),
               file_count, print_decimal (secs));
-cleanup:
+
    ptimer_destroy (timer);
  }
  
  static void write_backup_file (const char *, downloaded_file_t);
+static const char *replace_plain (const char *, int, FILE *, const char *);
  static const char *replace_attr (const char *, int, FILE *, const char *);
  static const char *replace_attr_refresh_hack (const char *, int, FILE *,
                                                const char *, int);
  static char *local_quote_string (const char *);
  static char *construct_relative (const char *, const char *);
  
-/* Change the links in one HTML file.  LINKS is a list of links in the
+/* Change the links in one file.  LINKS is a list of links in the
     document, along with their positions and the desired direction of
     the conversion.  */
  static void
@@ -277,7 +293,9 @@ convert_links (const char *file, struct urlpos *links)
              char *newname = construct_relative (file, link->local_name);
              char *quoted_newname = local_quote_string (newname);
  
-            if (!link->link_refresh_p)
+            if (link->link_css_p)
+              p = replace_plain (p, link->size, fp, quoted_newname);
+            else if (!link->link_refresh_p)
                p = replace_attr (p, link->size, fp, quoted_newname);
              else
                p = replace_attr_refresh_hack (p, link->size, fp, quoted_newname,
@@ -296,7 +314,9 @@ convert_links (const char *file, struct urlpos *links)
              char *newlink = link->url->url;
              char *quoted_newlink = html_quote_string (newlink);
  
-            if (!link->link_refresh_p)
+            if (link->link_css_p)
+              p = replace_plain (p, link->size, fp, quoted_newlink);
+            else if (!link->link_refresh_p)
                p = replace_attr (p, link->size, fp, quoted_newlink);
              else
                p = replace_attr_refresh_hack (p, link->size, fp, quoted_newlink,
@@ -406,6 +426,7 @@ write_backup_file (const char *file, downloaded_file_t downloaded_file_return)
    size_t         filename_len = strlen (file);
    char*          filename_plus_orig_suffix;
  
+  /* TODO: hack this to work with css files */
    if (downloaded_file_return == FILE_DOWNLOADED_AND_HTML_EXTENSION_ADDED)
      {
        /* Just write "orig" over "html".  We need to do it this way
@@ -465,6 +486,15 @@ write_backup_file (const char *file, downloaded_file_t downloaded_file_return)
  
  static bool find_fragment (const char *, int, const char **, const char **);
  
+/* Replace a string with NEW_TEXT.  Ignore quoting. */
+static const char *
+replace_plain (const char *p, int size, FILE *fp, const char *new_text)
+{
+  fputs (new_text, fp);
+  p += size;
+  return p;
+}
+
  /* Replace an attribute's original text with NEW_TEXT. */
  
  static const char *
@@ -832,6 +862,16 @@ register_html (const char *url, const char *file)
    string_set_add (downloaded_html_set, file);
  }
  
+/* Register that FILE is a CSS file that has been downloaded. */
+
+void
+register_css (const char *url, const char *file)
+{
+  if (!downloaded_css_set)
+    downloaded_css_set = make_string_hash_table (0);
+  string_set_add (downloaded_css_set, file);
+}
+
  static void downloaded_files_free (void);
  
  /* Cleanup the data structures associated with this file.  */
diff --git a/src/convert.h b/src/convert.h
index 0dd9d018473422b31281003c988b83cc508da4e7..3d8b30599ad43163360ec5b0eeac3a5e16fdb410 100644
--- a/src/convert.h
+++ b/src/convert.h
@@ -33,6 +33,7 @@ as that of the covered work.  */
  struct hash_table;             /* forward decl */
  extern struct hash_table *dl_url_file_map;
  extern struct hash_table *downloaded_html_set;
+extern struct hash_table *downloaded_css_set;
  
  enum convert_options {
    CO_NOCONVERT = 0,            /* don't convert this URL */
@@ -64,7 +65,9 @@ struct urlpos {
    unsigned int link_complete_p :1; /* the link was complete (had host name) */
    unsigned int link_base_p     :1; /* the url came from <base href=...> */
    unsigned int link_inline_p   :1; /* needed to render the page */
+  unsigned int link_css_p      :1; /* the url came from CSS */
    unsigned int link_expect_html        :1; /* expected to contain HTML */
+  unsigned int link_expect_css :1; /* expected to contain CSS */
  
    unsigned int link_refresh_p  :1; /* link was received from
                                        <meta http-equiv=refresh content=...> */
@@ -98,6 +101,7 @@ downloaded_file_t downloaded_file (downloaded_file_t, const char *);
  void register_download (const char *, const char *);
  void register_redirection (const char *, const char *);
  void register_html (const char *, const char *);
+void register_css (const char *, const char *);
  void register_delete_file (const char *);
  void convert_all_links (void);
  void convert_cleanup (void);
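Note on the convert.c and convert.h changes above: each struct urlpos records where a link sits in the file and how long its original text is, and replace_plain writes the new text and then skips the old bytes. The sketch below shows the same position-and-size rewriting on an in-memory string. It is an illustration only: struct sketch_span and sketch_rewrite are hypothetical names, and the real convert_links streams to a FILE rather than building a buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the position and size fields of
   struct urlpos, plus the replacement text for that link.  */
struct sketch_span
{
  size_t pos;                   /* offset of the original link text */
  size_t size;                  /* length of the original link text */
  const char *new_text;         /* text to write in its place */
};

/* Return a malloc'd copy of TEXT with every span replaced, or NULL if
   allocation fails.  SPANS must be sorted by pos and must not overlap.
   No quoting is applied, as in the CSS case.  */
char *
sketch_rewrite (const char *text, const struct sketch_span *spans,
                size_t nspans)
{
  size_t text_len = strlen (text);
  size_t out_len = text_len;
  size_t prev_end = 0, in = 0, out = 0, i;
  char *result;

  /* First pass: validate the spans and compute the output length.  */
  for (i = 0; i < nspans; i++)
    {
      assert (spans[i].pos >= prev_end);
      assert (spans[i].pos <= text_len);
      assert (spans[i].size <= text_len - spans[i].pos);
      prev_end = spans[i].pos + spans[i].size;
      out_len += strlen (spans[i].new_text);
      out_len -= spans[i].size;
    }

  result = malloc (out_len + 1);
  if (result == NULL)
    return NULL;

  /* Second pass: copy untouched text, emit the replacement, then skip
     over the original link text.  */
  for (i = 0; i < nspans; i++)
    {
      size_t keep = spans[i].pos - in;
      size_t new_len = strlen (spans[i].new_text);

      memcpy (result + out, text + in, keep);
      out += keep;
      memcpy (result + out, spans[i].new_text, new_len);
      out += new_len;
      in = spans[i].pos + spans[i].size;
    }

  /* Copy whatever follows the last span.  */
  memcpy (result + out, text + in, text_len - in);
  out += text_len - in;
  result[out] = '\0';
  return result;
}
```

Keeping the spans sorted lets a single forward pass handle any number of links, which is presumably why the link list carries explicit offsets instead of re-searching the file for each URL.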
diff --git a/src/css-tokens.h b/src/css-tokens.h
new file mode 100644
index 0000000..4feef42
--- /dev/null
+++ b/src/css-tokens.h
@@ -0,0 +1,66 @@
+/* Declarations for css.l
+   Copyright (C) 2006 Free Software Foundation, Inc.
+
+This file is part of GNU Wget.
+
+GNU Wget is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+
+GNU Wget is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with Wget; if not, write to the Free Software
+Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+In addition, as a special exception, the Free Software Foundation
+gives permission to link the code of its release of Wget with the
+OpenSSL project's "OpenSSL" library (or with modified versions of it
+that use the same license as the "OpenSSL" library), and distribute
+the linked executables.  You must obey the GNU General Public License
+in all respects for all of the code used other than "OpenSSL".  If you
+modify this file, you may extend this exception to your version of the
+file, but you are not obligated to do so.  If you do not wish to do
+so, delete this exception statement from your version.  */
+
+#ifndef CSS_TOKENS_H
+#define CSS_TOKENS_H
+
+enum {
+  CSSEOF,
+  S,
+  CDO,
+  CDC,
+  INCLUDES,
+  DASHMATCH,
+  LBRACE,
+  PLUS,
+  GREATER,
+  COMMA,
+  STRING,
+  INVALID,
+  IDENT,
+  HASH,
+  IMPORT_SYM,
+  PAGE_SYM,
+  MEDIA_SYM,
+  CHARSET_SYM,
+  IMPORTANT_SYM,
+  EMS,
+  EXS,
+  LENGTH,
+  ANGLE,
+  TIME,
+  FREQ,
+  DIMENSION,
+  PERCENTAGE,
+  NUMBER,
+  URI,
+  FUNCTION
+};
+
+#endif /* CSS_TOKENS_H */
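Note on the URI token declared above: the scanner returns the whole `url(...)` text as one token, so the caller still has to strip the `url(` prefix, the closing parenthesis, optional surrounding whitespace, and optional quotes. A minimal sketch of that trimming follows, assuming the token is passed with an explicit length. The name sketch_uri_from_token is hypothetical and this is not the patch's get_uri_string; unlike the patch it works on the token alone rather than adjusting a position within a larger buffer.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Wget.  Given a "url(...)" token of
   length LEN, return a malloc'd copy of the URI inside it, or NULL if
   TOK is not such a token or allocation fails.  The caller frees the
   result.  */
char *
sketch_uri_from_token (const char *tok, size_t len)
{
  size_t start = 4;             /* first byte after "url(" */
  size_t end;                   /* one past the last URI byte */
  char *uri;

  assert (tok != NULL);

  if (len < 5 || tok[len - 1] != ')')
    return NULL;
  if (tolower ((unsigned char) tok[0]) != 'u'
      || tolower ((unsigned char) tok[1]) != 'r'
      || tolower ((unsigned char) tok[2]) != 'l'
      || tok[3] != '(')
    return NULL;

  end = len - 1;                /* drop the closing parenthesis */

  /* Skip whitespace after "(" and before ")".  The casts matter:
     passing a negative char to isspace is undefined behavior.  */
  while (start < end && isspace ((unsigned char) tok[start]))
    start++;
  while (end > start && isspace ((unsigned char) tok[end - 1]))
    end--;

  /* Strip quotes only when an opening quote has a matching close.  */
  if (end - start >= 2
      && (tok[start] == '\'' || tok[start] == '"')
      && tok[end - 1] == tok[start])
    {
      start++;
      end--;
    }

  uri = malloc (end - start + 1);
  if (uri == NULL)
    return NULL;
  memcpy (uri, tok + start, end - start);
  uri[end - start] = '\0';
  return uri;
}
```

For example, the token `url( 'a.png' )` yields `a.png`, and a token that does not start with `url(` yields NULL.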
diff --git a/src/css-url.c b/src/css-url.c
new file mode 100644
index 0000000..42c8fc3
--- /dev/null
+++ b/src/css-url.c
@@ -0,0 +1,273 @@
+/* Collect URLs from CSS source.
+   Copyright (C) 1998, 2000, 2001, 2002, 2003 Free Software Foundation, Inc.
+
+This file is part of GNU Wget.
+
+GNU Wget is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+GNU Wget is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with Wget; if not, write to the Free Software
+Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+In addition, as a special exception, the Free Software Foundation
+gives permission to link the code of its release of Wget with the
+OpenSSL project's "OpenSSL" library (or with modified versions of it
+that use the same license as the "OpenSSL" library), and distribute
+the linked executables.  You must obey the GNU General Public License
+in all respects for all of the code used other than "OpenSSL".  If you
+modify this file, you may extend this exception to your version of the
+file, but you are not obligated to do so.  If you do not wish to do
+so, delete this exception statement from your version.  */
+
+/*
+  Note that this is not an actual CSS parser, but just a lexical
+  scanner with a tiny bit more smarts bolted on top.  A full parser
+  is somewhat overkill for this job.  The only things we're interested
+  in are @import rules and url() tokens, so it's easy enough to
+  grab those without truly understanding the input.  The only downside
+  to this is that we might be coerced into downloading files that
+  a browser would ignore.  That might merit some more investigation.
+ */
+
+#include <config.h>
+
+#include <stdio.h>
+#ifdef HAVE_STRING_H
+# include <string.h>
+#else
+# include <strings.h>
+#endif
+#include <stdlib.h>
+#include <ctype.h>
+#include <errno.h>
+#include <assert.h>
+
+#include "wget.h"
+#include "utils.h"
+#include "convert.h"
+#include "html-url.h"
+#include "css-tokens.h"
+
+/* from lex.yy.c */
+extern char *yytext;
+extern int yyleng;
+typedef struct yy_buffer_state *YY_BUFFER_STATE;
+extern YY_BUFFER_STATE yy_scan_bytes (const char *bytes,int len  );
+extern int yylex (void);
+
+#if 1
+const char *token_names[] = {
+  "CSSEOF",
+  "S",
+  "CDO",
+  "CDC",
+  "INCLUDES",
+  "DASHMATCH",
+  "LBRACE",
+  "PLUS",
+  "GREATER",
+  "COMMA",
+  "STRING",
+  "INVALID",
+  "IDENT",
+  "HASH",
+  "IMPORT_SYM",
+  "PAGE_SYM",
+  "MEDIA_SYM",
+  "CHARSET_SYM",
+  "IMPORTANT_SYM",
+  "EMS",
+  "EXS",
+  "LENGTH",
+  "ANGLE",
+  "TIME",
+  "FREQ",
+  "DIMENSION",
+  "PERCENTAGE",
+  "NUMBER",
+  "URI",
+  "FUNCTION"
+};
+#endif
+
+/*
+  Given a detected URI token, get only the URI specified within.
+  Also adjust the starting position and length of the string.
+
+  A URI can be specified with or without quotes, and the quotes
+  can be single or double quotes.  In addition there can be
+  whitespace after the opening parenthesis and before the closing
+  parenthesis.
+*/
+char *
+get_uri_string (const char *at, int *pos, int *length)
+{
+  char *uri;
+  /*char buf[1024];
+  strncpy(buf,at + *pos, *length);
+  buf[*length] = '\0';
+  DEBUGP (("get_uri_string: \"%s\"\n", buf));*/
+
+  if (0 != strncasecmp (at + *pos, "url(", 4))
+    return NULL;
+
+  *pos += 4;
+  *length -= 5; /* url() */
+  /* skip leading space */
+  while (isspace (at[*pos]))
+    {
+      (*pos)++;
+      (*length)--;
+    }
+  /* skip trailing space */
+  while (isspace (at[*pos + *length - 1]))
+    {
+      (*length)--;
+    }
+  /* trim off quotes */
+  if (at[*pos] == '\'' || at[*pos] == '"')
+    {
+      (*pos)++;
+      *length -= 2;
+    }
+
+  uri = xmalloc (*length + 1);
+  if (uri)
+    {
+      strncpy (uri, at + *pos, *length);
+      uri[*length] = '\0';      
+    }
+
+  return uri;
+}
+
+void
+get_urls_css (struct map_context *ctx, int offset, int buf_length)
+{
+  int token;
+  /*char tmp[2048];*/
+  int buffer_pos = 0;
+  int pos, length;
+  char *uri;
+
+  /*
+  strncpy(tmp,ctx->text + offset, buf_length);
+  tmp[buf_length] = '\0';
+  DEBUGP (("get_urls_css: \"%s\"\n", tmp));
+  */
+
+  /* tell flex to scan from this buffer */
+  yy_scan_bytes (ctx->text + offset, buf_length);
+
+  while((token = yylex()) != CSSEOF)
+    {
+      /*DEBUGP (("%s ", token_names[token]));*/
+      /* @import "foo.css"
+         or @import url(foo.css)
+      */
+      if(token == IMPORT_SYM)
+        {
+          do {
+            buffer_pos += yyleng;
+          } while((token = yylex()) == S);
+
+          /*DEBUGP (("%s ", token_names[token]));*/
+
+          if (token == STRING || token == URI)
+            {
+              /*DEBUGP (("Got URI "));*/
+              pos = buffer_pos + offset;
+              length = yyleng;
+
+              if (token == URI)
+                {
+                  uri = get_uri_string (ctx->text, &pos, &length);
+                }
+              else
+                {
+                  /* cut out quote characters */
+                  pos++;
+                  length -= 2;
+                  uri = xmalloc (length + 1);
+                  strncpy (uri, yytext + 1, length);
+                  uri[length] = '\0';
+                }
+
+              if (uri)
+                {
+                  struct urlpos *up = append_url (uri, pos, length, ctx);
+                  DEBUGP (("Found @import: [%s] at %d [%s]\n", yytext, buffer_pos, uri));
+
+                  if (up)
+                    {
+                      up->link_inline_p = 1;
+                      up->link_css_p = 1;
+                      up->link_expect_css = 1;
+                    }
+
+                  xfree(uri);
+                }
+            }
+        }
+      /* background-image: url(foo.png)
+         note that we don't care what
+         property this is actually on.
+      */
+      else if(token == URI)
+        {
+          pos = buffer_pos + offset;
+          length = yyleng;
+          uri = get_uri_string (ctx->text, &pos, &length);
+
+          if (uri)
+            {
+              struct urlpos *up = append_url (uri, pos, length, ctx);
+              DEBUGP (("Found URI: [%s] at %d [%s]\n", yytext, buffer_pos, uri));
+              if (up)
+                {
+                  up->link_inline_p = 1;
+                  up->link_css_p = 1;
+                }
+
+              xfree (uri);
+            }
+        }
+      buffer_pos += yyleng;
+    }
+  DEBUGP (("\n"));
+}
+
+struct urlpos *
+get_urls_css_file (const char *file, const char *url)
+{
+  struct file_memory *fm;
+  struct map_context ctx;
+
+  /* Load the file. */
+  fm = read_file (file);
+  if (!fm)
+    {
+      logprintf (LOG_NOTQUIET, "%s: %s\n", file, strerror (errno));
+      return NULL;
+    }
+  DEBUGP (("Loaded %s (size %s).\n", file, number_to_static_string (fm->length)));
+
+  ctx.text = fm->content;
+  ctx.head = ctx.tail = NULL;
+  ctx.base = NULL;
+  ctx.parent_base = url ? url : opt.base_href;
+  ctx.document_file = file;
+  ctx.nofollow = 0;
+
+  get_urls_css (&ctx, 0, fm->length);
+  read_file_free (fm);
+  return ctx.head;
+}
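
The trimming that get_uri_string performs on a url(...) token can be tried in isolation. The following is a minimal sketch, not the code from this patch: sketch_uri_from_token is a hypothetical name, it takes one complete token rather than a buffer plus position and length, it uses plain malloc instead of xmalloc, and it strips quotes only when the opening and closing quote match (the patch strips two characters whenever the first one is a quote).

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical helper, not part of Wget: return a malloc'ed copy of the
   URI inside a complete "url(...)" token of LEN bytes, or NULL if TOKEN
   is not such a token.  The caller frees the result.  */
char *
sketch_uri_from_token (const char *token, size_t len)
{
  const char *begin, *end;
  char *uri;

  if (len < 5 || strncasecmp (token, "url(", 4) != 0 || token[len - 1] != ')')
    return NULL;

  begin = token + 4;          /* just past "url(" */
  end = token + len - 1;      /* at the closing ")" */

  /* Skip whitespace inside the parentheses.  The casts keep isspace()
     well defined for bytes >= 0x80 where char is signed.  */
  while (begin < end && isspace ((unsigned char) *begin))
    begin++;
  while (end > begin && isspace ((unsigned char) end[-1]))
    end--;

  /* Drop one pair of matching quotes, if present.  */
  if (end - begin >= 2
      && (*begin == '\'' || *begin == '"')
      && end[-1] == *begin)
    {
      begin++;
      end--;
    }

  uri = malloc ((size_t) (end - begin) + 1);
  if (uri == NULL)
    return NULL;
  memcpy (uri, begin, (size_t) (end - begin));
  uri[end - begin] = '\0';
  return uri;
}
```

Unlike get_uri_string, the sketch does not report the adjusted position and length back to the caller; the patch needs those so that append_url can record where the link sits in the document.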
diff --git a/src/css-url.h b/src/css-url.h
new file mode 100644
index 0000000..772e2fd
--- /dev/null
+++ b/src/css-url.h
@@ -0,0 +1,36 @@
+/* Declarations for css-url.c.
+   Copyright (C) 2006 Free Software Foundation, Inc.
+
+This file is part of GNU Wget.
+
+GNU Wget is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+
+GNU Wget is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with Wget; if not, write to the Free Software
+Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+In addition, as a special exception, the Free Software Foundation
+gives permission to link the code of its release of Wget with the
+OpenSSL project's "OpenSSL" library (or with modified versions of it
+that use the same license as the "OpenSSL" library), and distribute
+the linked executables.  You must obey the GNU General Public License
+in all respects for all of the code used other than "OpenSSL".  If you
+modify this file, you may extend this exception to your version of the
+file, but you are not obligated to do so.  If you do not wish to do
+so, delete this exception statement from your version.  */
+
+#ifndef CSS_URL_H
+#define CSS_URL_H
+
+void get_urls_css (struct map_context *, int, int);
+struct urlpos *get_urls_css_file (const char *, const char *);
+
+#endif /* CSS_URL_H */
diff --git a/src/css.l b/src/css.l
new file mode 100644
index 0000000..8d1477a
--- /dev/null
+++ b/src/css.l
@@ -0,0 +1,137 @@
+%option case-insensitive
+%option noyywrap
+%option never-interactive
+
+%{
+/* Lex source for CSS tokenizing.
+   Taken from http://www.w3.org/TR/CSS21/grammar.html#q2
+   Copyright (C) 2006 Free Software Foundation, Inc.
+
+This file is part of GNU Wget.
+
+GNU Wget is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+
+GNU Wget is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with Wget; if not, write to the Free Software
+Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+In addition, as a special exception, the Free Software Foundation
+gives permission to link the code of its release of Wget with the
+OpenSSL project's "OpenSSL" library (or with modified versions of it
+that use the same license as the "OpenSSL" library), and distribute
+the linked executables.  You must obey the GNU General Public License
+in all respects for all of the code used other than "OpenSSL".  If you
+modify this file, you may extend this exception to your version of the
+file, but you are not obligated to do so.  If you do not wish to do
+so, delete this exception statement from your version.  */
+
+#include "css-tokens.h"
+
+/* {s}+\/\*[^*]*\*+([^/*][^*]*\*+)*\/      {unput(' '); } */
+/*replace by space*/
+%}
+
+h               [0-9a-f]
+nonascii        [\200-\377]
+unicode         \\{h}{1,6}(\r\n|[ \t\r\n\f])?
+escape          {unicode}|\\[^\r\n\f0-9a-f]
+nmstart         [_a-z]|{nonascii}|{escape}
+nmchar          [_a-z0-9-]|{nonascii}|{escape}
+string1         \"([^\n\r\f\\"]|\\{nl}|{escape})*\"
+string2         \'([^\n\r\f\\']|\\{nl}|{escape})*\'
+invalid1        \"([^\n\r\f\\"]|\\{nl}|{escape})*
+invalid2        \'([^\n\r\f\\']|\\{nl}|{escape})*
+
+comment         \/\*[^*]*\*+([^/*][^*]*\*+)*\/
+ident           -?{nmstart}{nmchar}*
+name            {nmchar}+
+num             [0-9]+|[0-9]*"."[0-9]+
+string          {string1}|{string2}
+invalid         {invalid1}|{invalid2}
+url             ([!#$%&*-~]|{nonascii}|{escape})*
+s               [ \t\r\n\f]
+w               ({s}|{comment})*
+nl              \n|\r\n|\r|\f
+
+A               a|\\0{0,4}(41|61)(\r\n|[ \t\r\n\f])?
+C               c|\\0{0,4}(43|63)(\r\n|[ \t\r\n\f])?
+D               d|\\0{0,4}(44|64)(\r\n|[ \t\r\n\f])?
+E               e|\\0{0,4}(45|65)(\r\n|[ \t\r\n\f])?
+G               g|\\0{0,4}(47|67)(\r\n|[ \t\r\n\f])?|\\g
+H               h|\\0{0,4}(48|68)(\r\n|[ \t\r\n\f])?|\\h
+I               i|\\0{0,4}(49|69)(\r\n|[ \t\r\n\f])?|\\i
+K               k|\\0{0,4}(4b|6b)(\r\n|[ \t\r\n\f])?|\\k
+M               m|\\0{0,4}(4d|6d)(\r\n|[ \t\r\n\f])?|\\m
+N               n|\\0{0,4}(4e|6e)(\r\n|[ \t\r\n\f])?|\\n
+P               p|\\0{0,4}(50|70)(\r\n|[ \t\r\n\f])?|\\p
+R               r|\\0{0,4}(52|72)(\r\n|[ \t\r\n\f])?|\\r
+S               s|\\0{0,4}(53|73)(\r\n|[ \t\r\n\f])?|\\s
+T               t|\\0{0,4}(54|74)(\r\n|[ \t\r\n\f])?|\\t
+X               x|\\0{0,4}(58|78)(\r\n|[ \t\r\n\f])?|\\x
+Z               z|\\0{0,4}(5a|7a)(\r\n|[ \t\r\n\f])?|\\z
+
+%%
+
+{s}                     {return S;}
+
+\/\*[^*]*\*+([^/*][^*]*\*+)*\/          {return S;} /* treat comments as whitespace */
+
+"<!--"          {return CDO;}
+"-->"                   {return CDC;}
+"~="                    {return INCLUDES;}
+"|="                    {return DASHMATCH;}
+
+{w}"{"                  {return LBRACE;}
+{w}"+"                  {return PLUS;}
+{w}">"                  {return GREATER;}
+{w}","                  {return COMMA;}
+
+{string}                {return STRING;}
+{invalid}               {return INVALID; /* unclosed string */}
+
+{ident}                 {return IDENT;}
+
+"#"{name}               {return HASH;}
+
+"@import"               {return IMPORT_SYM;}
+"@page"                 {return PAGE_SYM;}
+"@media"                {return MEDIA_SYM;}
+"@charset "             {return CHARSET_SYM;}
+
+"!"{w}"important"       {return IMPORTANT_SYM;}
+
+{num}{E}{M}             {return EMS;}
+{num}{E}{X}             {return EXS;}
+{num}{P}{X}             {return LENGTH;}
+{num}{C}{M}             {return LENGTH;}
+{num}{M}{M}             {return LENGTH;}
+{num}{I}{N}             {return LENGTH;}
+{num}{P}{T}             {return LENGTH;}
+{num}{P}{C}             {return LENGTH;}
+{num}{D}{E}{G}          {return ANGLE;}
+{num}{R}{A}{D}          {return ANGLE;}
+{num}{G}{R}{A}{D}       {return ANGLE;}
+{num}{M}{S}             {return TIME;}
+{num}{S}                {return TIME;}
+{num}{H}{Z}             {return FREQ;}
+{num}{K}{H}{Z}          {return FREQ;}
+{num}{ident}            {return DIMENSION;}
+
+{num}%                  {return PERCENTAGE;}
+{num}                   {return NUMBER;}
+
+"url("{w}{string}{w}")" {return URI;}
+"url("{w}{url}{w}")"    {return URI;}
+{ident}"("              {return FUNCTION;}
+
+.                       {return *yytext;}
+
+%%
diff --git a/src/ftp.c b/src/ftp.c
index 7256faa360b24252906cf1f7a91394807bb74e88..482651be62cd769f94ab404d25fc69297a0cc03f 100644
--- a/src/ftp.c
+++ b/src/ftp.c
@@ -805,8 +805,14 @@ Error in server response, closing control connection.\n"));
                     }
                   f = f->next;
                 }
-             if (!exists)
-               {
+              if (exists)
+                {
+                  logputs (LOG_VERBOSE, "\n");
+                  logprintf (LOG_NOTQUIET, _("File %s exists.\n"),
+                             quote (u->file));
+                }
+             else
+                {
                   logputs (LOG_VERBOSE, "\n");
                   logprintf (LOG_NOTQUIET, _("No such file %s.\n"),
                              quote (u->file));
diff --git a/src/html-parse.c b/src/html-parse.c
index fdf5b99bc791c8c420633be631903b8ab3f1be64..692e0de4b8c463d0eaa36bd4d3d34f7f42147f41 100644
--- a/src/html-parse.c
+++ b/src/html-parse.c
@@ -272,6 +272,94 @@ struct pool {
     to "<foo", but "&lt,foo" to "<,foo".  */
  #define SKIP_SEMI(p, inc) (p += inc, p < end && *p == ';' ? ++p : p)
  
+struct tagstack_item {
+  const char *tagname_begin;
+  const char *tagname_end;
+  const char *contents_begin;
+  struct tagstack_item *prev;
+  struct tagstack_item *next;
+};
+
+struct tagstack_item *
+tagstack_push (struct tagstack_item **head, struct tagstack_item **tail)
+{
+  struct tagstack_item *ts = xmalloc(sizeof(struct tagstack_item));
+  if (*head == NULL)
+    {
+      *head = *tail = ts;
+      ts->prev = ts->next = NULL;
+    }
+  else
+    {
+      (*tail)->next = ts;
+      ts->prev = *tail;
+      *tail = ts;
+      ts->next = NULL;
+    }
+
+  return ts;
+}
+
+/* remove ts and everything after it from the stack */
+void
+tagstack_pop (struct tagstack_item **head, struct tagstack_item **tail,
+              struct tagstack_item *ts)
+{
+  if (*head == NULL)
+    return;
+
+  if (ts == *tail)
+    {
+      if (ts == *head)
+        {
+          xfree (ts);
+          *head = *tail = NULL;
+        }
+      else
+        {
+          ts->prev->next = NULL;
+          *tail = ts->prev;
+          xfree (ts);
+        }
+    }
+  else
+    {
+      if (ts == *head)
+        {
+          *head = NULL;
+        }
+      *tail = ts->prev;
+
+      if (ts->prev)
+        {
+          ts->prev->next = NULL;
+        }
+      while (ts)
+        {
+          struct tagstack_item *p = ts->next;
+          xfree (ts);
+          ts = p;
+        }
+    }
+}
+
+struct tagstack_item *
+tagstack_find (struct tagstack_item *tail, const char *tagname_begin,
+               const char *tagname_end)
+{
+  int len = tagname_end - tagname_begin;
+  while (tail)
+    {
+      if (len == (tail->tagname_end - tail->tagname_begin))
+        {
+          if (0 == strncasecmp (tail->tagname_begin, tagname_begin, len))
+            return tail;
+        }
+      tail = tail->prev;
+    }
+  return NULL;
+}
+
  /* Decode the HTML character entity at *PTR, considering END to be end
     of buffer.  It is assumed that the "&" character that marks the
     beginning of the entity has been seen at *PTR-1.  If a recognized
@@ -757,6 +845,9 @@ map_html_tags (const char *text, int size,
    bool attr_pair_resized = false;
    struct attr_pair *pairs = attr_pair_initial_storage;
  
+  struct tagstack_item *head = NULL;
+  struct tagstack_item *tail = NULL;
+
    if (!size)
      return;
  
@@ -823,6 +914,18 @@ map_html_tags (const char *text, int size,
        goto look_for_tag;
      tag_name_end = p;
      SKIP_WS (p);
+
+    if (!end_tag)
+      {
+        struct tagstack_item *ts = tagstack_push (&head, &tail);
+        if (ts)
+          {
+            ts->tagname_begin  = tag_name_begin;
+            ts->tagname_end    = tag_name_end;
+            ts->contents_begin = NULL;
+          }
+      }
+
      if (end_tag && *p != '>')
        goto backout_tag;
  
@@ -984,6 +1087,11 @@ map_html_tags (const char *text, int size,
          ++nattrs;
        }
  
+    if (!end_tag && tail && (tail->tagname_begin == tag_name_begin))
+      {
+        tail->contents_begin = p+1;
+      }
+
      if (uninteresting_tag)
        {
          ADVANCE (p);
@@ -995,6 +1103,7 @@ map_html_tags (const char *text, int size,
      {
        int i;
        struct taginfo taginfo;
+      struct tagstack_item *ts = NULL;
  
        taginfo.name      = pool.contents;
        taginfo.end_tag_p = end_tag;
@@ -1011,6 +1120,23 @@ map_html_tags (const char *text, int size,
        taginfo.attrs = pairs;
        taginfo.start_position = tag_start_position;
        taginfo.end_position   = p + 1;
+      taginfo.contents_begin = NULL;
+      taginfo.contents_end = NULL;
+
+      if (end_tag)
+        {
+          ts = tagstack_find (tail, tag_name_begin, tag_name_end);
+          if (ts)
+            {
+              if (ts->contents_begin)
+                {
+                  taginfo.contents_begin = ts->contents_begin;
+                  taginfo.contents_end   = tag_start_position;
+                }
+              tagstack_pop (&head, &tail, ts);
+            }
+        }
+
        mapfun (&taginfo, maparg);
        ADVANCE (p);
      }
@@ -1030,6 +1156,8 @@ map_html_tags (const char *text, int size,
    POOL_FREE (&pool);
    if (attr_pair_resized)
      xfree (pairs);
+  /* pop any tag stack that's left */
+  tagstack_pop (&head, &tail, head);
  }
  
  #undef ADVANCE
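
The tag stack added to map_html_tags exists to answer one question at each end tag: where did the contents of the matching start tag begin? Below is a minimal sketch of that matching rule under simplifying assumptions. It uses a fixed-size array instead of the doubly linked list in the patch, and every sketch_* name and SKETCH_MAX_DEPTH are hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

#define SKETCH_MAX_DEPTH 64   /* arbitrary limit chosen for this sketch */

struct sketch_open_tag
{
  const char *name;            /* points into the document, not NUL-terminated */
  size_t name_len;
  const char *contents_begin;  /* first byte after the start tag's '>' */
};

struct sketch_tagstack
{
  struct sketch_open_tag items[SKETCH_MAX_DEPTH];
  size_t depth;
};

/* Record a start tag.  Returns 0 on success, -1 if the stack is full.  */
int
sketch_push (struct sketch_tagstack *st, const char *name, size_t name_len,
             const char *contents_begin)
{
  if (st->depth == SKETCH_MAX_DEPTH)
    return -1;
  st->items[st->depth].name = name;
  st->items[st->depth].name_len = name_len;
  st->items[st->depth].contents_begin = contents_begin;
  st->depth++;
  return 0;
}

/* Handle an end tag: find the innermost open tag with the same name
   (compared case-insensitively), discard it together with everything
   opened after it, and return where its contents began.  Returns NULL
   and leaves the stack untouched when no open tag matches.  */
const char *
sketch_close (struct sketch_tagstack *st, const char *name, size_t name_len)
{
  size_t i = st->depth;
  while (i > 0)
    {
      const struct sketch_open_tag *t = &st->items[i - 1];
      if (t->name_len == name_len
          && strncasecmp (t->name, name, name_len) == 0)
        {
          const char *begin = t->contents_begin;
          st->depth = i - 1;
          return begin;
        }
      i--;
    }
  return NULL;
}
```

For the input `<STYLE><b>x</style>`, closing `style` returns the position just after `<STYLE>` and also discards the unclosed `<b>`, which mirrors how tagstack_pop removes the found item and everything after it.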
diff --git a/src/html-parse.h b/src/html-parse.h
index abe3b08de935423373036b441a9f8c20cdfb2809..ed1c685562a868a39f2c55dbbcbf802f7e904b4a 100644
--- a/src/html-parse.h
+++ b/src/html-parse.h
@@ -52,6 +52,9 @@ struct taginfo {
  
    const char *start_position;  /* start position of tag */
    const char *end_position;    /* end position of tag */
+
+  const char *contents_begin;   /* delimiters of tag contents */
+  const char *contents_end;     /* only valid if end_tag_p */
  };
  
  struct hash_table;             /* forward declaration */
diff --git a/src/html-url.c b/src/html-url.c
index 5a0682d3f80eaab886cdc997fa08ab3c5d26ef54..ef93a7e49c5c61ded555c2b6c022be3346bbec2c 100644
--- a/src/html-url.c
+++ b/src/html-url.c
@@ -41,11 +41,11 @@ as that of the covered work.  */
  #include "utils.h"
  #include "hash.h"
  #include "convert.h"
-#include "recur.h"              /* declaration of get_urls_html */
+#include "recur.h"
+#include "html-url.h"
+#include "css-url.h"
  #include "iri.h"
  
-struct map_context;
-
  typedef void (*tag_handler_t) (int, struct taginfo *, struct map_context *);
  
  #define DECLARE_TAG_HANDLER(fun)                                \
@@ -164,11 +164,12 @@ static struct {
     from the information above.  However, some places in the code refer
     to the attributes not mentioned here.  We add them manually.  */
  static const char *additional_attributes[] = {
-  "rel",                        /* used by tag_handle_link */
-  "http-equiv",                 /* used by tag_handle_meta */
-  "name",                       /* used by tag_handle_meta */
-  "content",                    /* used by tag_handle_meta */
-  "action"                      /* used by tag_handle_form */
+  "rel",                        /* used by tag_handle_link  */
+  "http-equiv",                 /* used by tag_handle_meta  */
+  "name",                       /* used by tag_handle_meta  */
+  "content",                    /* used by tag_handle_meta  */
+  "action",                     /* used by tag_handle_form  */
+  "style"                       /* used by check_style_attr */
  };
  
  static struct hash_table *interesting_tags;
@@ -247,28 +248,20 @@ find_attr (struct taginfo *tag, const char *name, int *attrind)
    return NULL;
  }
  
-struct map_context {
-  char *text;                   /* HTML text. */
-  char *base;                   /* Base URI of the document, possibly
-                                   changed through <base href=...>. */
-  const char *parent_base;      /* Base of the current document. */
-  const char *document_file;    /* File name of this document. */
-  bool nofollow;                /* whether NOFOLLOW was specified in a
-                                   <meta name=robots> tag. */
-
-  struct urlpos *head, *tail;   /* List of URLs that is being
-                                   built. */
-};
+/* used for calls to append_url */
+#define ATTR_POS(tag, attrind, ctx) \
+ (tag->attrs[attrind].value_raw_beginning - ctx->text)
+#define ATTR_SIZE(tag, attrind) \
+ (tag->attrs[attrind].value_raw_size)
  
  /* Append LINK_URI to the urlpos structure that is being built.
  
-   LINK_URI will be merged with the current document base.  TAG and
-   ATTRIND are the necessary context to store the position and
-   size.  */
+   LINK_URI will be merged with the current document base.
+*/
  
-static struct urlpos *
-append_url (const char *link_uri,
-            struct taginfo *tag, int attrind, struct map_context *ctx)
+struct urlpos *
+append_url (const char *link_uri, int position, int size,
+            struct map_context *ctx)
  {
    int link_has_scheme = url_has_scheme (link_uri);
    struct urlpos *newel;
@@ -330,8 +323,8 @@ append_url (const char *link_uri,
  
    newel = xnew0 (struct urlpos);
    newel->url = url;
-  newel->pos = tag->attrs[attrind].value_raw_beginning - ctx->text;
-  newel->size = tag->attrs[attrind].value_raw_size;
+  newel->pos = position;
+  newel->size = size;
  
    /* A URL is relative if the host is not named, and the name does not
       start with `/'.  */
@@ -351,6 +344,18 @@ append_url (const char *link_uri,
    return newel;
  }
  \f
+static void
+check_style_attr (struct taginfo *tag, struct map_context *ctx)
+{
+  int attrind;
+  char *style = find_attr (tag, "style", &attrind);
+  if (!style)
+    return;
+
+  /* the raw position and raw size include the quotes, hence the +1 and -2 */
+  get_urls_css (ctx, ATTR_POS(tag,attrind,ctx)+1, ATTR_SIZE(tag,attrind)-2);
+}
+
  /* All the tag_* functions are called from collect_tags_mapper, as
     specified by KNOWN_TAGS.  */
  
@@ -399,7 +404,8 @@ tag_find_urls (int tagid, struct taginfo *tag, struct map_context *ctx)
            if (0 == strcasecmp (tag->attrs[attrind].name,
                                 tag_url_attributes[i].attr_name))
              {
-              struct urlpos *up = append_url (link, tag, attrind, ctx);
+              struct urlpos *up = append_url (link, ATTR_POS(tag,attrind,ctx),
+                                              ATTR_SIZE(tag,attrind), ctx);
                if (up)
                  {
                    int flags = tag_url_attributes[i].flags;
@@ -424,7 +430,8 @@ tag_handle_base (int tagid, struct taginfo *tag, struct map_context *ctx)
    if (!newbase)
      return;
  
-  base_urlpos = append_url (newbase, tag, attrind, ctx);
+  base_urlpos = append_url (newbase, ATTR_POS(tag,attrind,ctx),
+                            ATTR_SIZE(tag,attrind), ctx);
    if (!base_urlpos)
      return;
    base_urlpos->ignore_when_downloading = 1;
@@ -445,9 +452,11 @@ tag_handle_form (int tagid, struct taginfo *tag, struct map_context *ctx)
  {
    int attrind;
    char *action = find_attr (tag, "action", &attrind);
+
    if (action)
      {
-      struct urlpos *up = append_url (action, tag, attrind, ctx);
+      struct urlpos *up = append_url (action, ATTR_POS(tag,attrind,ctx),
+                                      ATTR_SIZE(tag,attrind), ctx);
        if (up)
          up->ignore_when_downloading = 1;
      }
@@ -470,14 +479,23 @@ tag_handle_link (int tagid, struct taginfo *tag, struct map_context *ctx)
    */
    if (href)
      {
-      struct urlpos *up = append_url (href, tag, attrind, ctx);
+      struct urlpos *up = append_url (href, ATTR_POS(tag,attrind,ctx),
+                                      ATTR_SIZE(tag,attrind), ctx);
        if (up)
          {
            char *rel = find_attr (tag, "rel", NULL);
-          if (rel
-              && (0 == strcasecmp (rel, "stylesheet")
-                  || 0 == strcasecmp (rel, "shortcut icon")))
-            up->link_inline_p = 1;
+          if (rel)
+            {
+              if (0 == strcasecmp (rel, "stylesheet"))
+                {
+                  up->link_inline_p = 1;
+                  up->link_expect_css = 1;
+                }
+              else if (0 == strcasecmp (rel, "shortcut icon"))
+                {
+                  up->link_inline_p = 1;
+                }
+            }
            else
              /* The external ones usually point to HTML pages, such as
                 <link rel="next" href="..."> */
@@ -531,7 +549,8 @@ tag_handle_meta (int tagid, struct taginfo *tag, struct map_context *ctx)
        while (c_isspace (*p))
          ++p;
  
-      entry = append_url (p, tag, attrind, ctx);
+      entry = append_url (p, ATTR_POS(tag,attrind,ctx),
+                          ATTR_SIZE(tag,attrind), ctx);
        if (entry)
          {
            entry->link_refresh_p = 1;
@@ -595,11 +614,26 @@ collect_tags_mapper (struct taginfo *tag, void *arg)
    struct map_context *ctx = (struct map_context *)arg;
  
    /* Find the tag in our table of tags.  This must not fail because
-     map_html_tags only returns tags found in interesting_tags.  */
+     map_html_tags only returns tags found in interesting_tags.
+
+     That no longer holds: for now, NULL is passed as interesting_tags
+     to map_html_tags so that every tag can be checked for a style
+     attribute, which means the lookup below may return NULL.
+  */
    struct known_tag *t = hash_table_get (interesting_tags, tag->name);
-  assert (t != NULL);
  
-  t->handler (t->tagid, tag, ctx);
+  if (t != NULL)
+    t->handler (t->tagid, tag, ctx);
+
+  check_style_attr (tag, ctx);
+
+  if (tag->end_tag_p && (0 == strcasecmp (tag->name, "style")) &&
+      tag->contents_begin && tag->contents_end)
+  {
+    /* parse contents */
+    get_urls_css (ctx, tag->contents_begin - ctx->text,
+                  tag->contents_end - tag->contents_begin);
+  }
  }
  \f
  /* Analyze HTML tags FILE and construct a list of URLs referenced from
@@ -643,8 +677,9 @@ get_urls_html (const char *file, const char *url, bool *meta_disallow_follow)
    if (opt.strict_comments)
      flags |= MHT_STRICT_COMMENTS;
  
+  /* the NULL here used to be interesting_tags */
    map_html_tags (fm->content, fm->length, collect_tags_mapper, &ctx, flags,
-                 interesting_tags, interesting_attributes);
+                 NULL, interesting_attributes);
  
    DEBUGP (("no-follow in %s: %d\n", file, ctx.nofollow));
    if (meta_disallow_follow)
diff --git a/src/html-url.h b/src/html-url.h
new file mode 100644
index 0000000..a94f0db
--- /dev/null
+++ b/src/html-url.h
@@ -0,0 +1,51 @@
+/* Declarations for html-url.c.
+   Copyright (C) 1995, 1996, 1997 Free Software Foundation, Inc.
+
+This file is part of GNU Wget.
+
+GNU Wget is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+
+GNU Wget is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with Wget; if not, write to the Free Software
+Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+In addition, as a special exception, the Free Software Foundation
+gives permission to link the code of its release of Wget with the
+OpenSSL project's "OpenSSL" library (or with modified versions of it
+that use the same license as the "OpenSSL" library), and distribute
+the linked executables.  You must obey the GNU General Public License
+in all respects for all of the code used other than "OpenSSL".  If you
+modify this file, you may extend this exception to your version of the
+file, but you are not obligated to do so.  If you do not wish to do
+so, delete this exception statement from your version.  */
+
+#ifndef HTML_URL_H
+#define HTML_URL_H
+
+struct map_context {
+  char *text;                  /* HTML text. */
+  char *base;                  /* Base URI of the document, possibly
+                                  changed through <base href=...>. */
+  const char *parent_base;     /* Base of the current document. */
+  const char *document_file;   /* File name of this document. */
+  bool nofollow;               /* whether NOFOLLOW was specified in a
+                                   <meta name=robots> tag. */
+
+  struct urlpos *head, *tail;  /* List of URLs that is being
+                                  built. */
+};
+
+struct urlpos *get_urls_file (const char *);
+struct urlpos *get_urls_html (const char *, const char *, bool *);
+struct urlpos *append_url (const char *, int, int, struct map_context *);
+void free_urlpos (struct urlpos *);
+
+#endif /* HTML_URL_H */
diff --git a/src/http.c b/src/http.c
index f79327c32ebfd4b30d5deb700fe593eda753219c..5ec70d27ba206132a53446aab5bae0f370891f4f 100644
--- a/src/http.c
+++ b/src/http.c
@@ -70,11 +70,13 @@ as that of the covered work.  */
  extern char *version_string;
  
  /* Forward decls. */
+struct http_stat;
  static char *create_authorization_line (const char *, const char *,
                                          const char *, const char *,
                                          const char *, bool *);
  static char *basic_authentication_encode (const char *, const char *);
  static bool known_authentication_scheme_p (const char *, const char *);
+static void ensure_extension (struct http_stat *, const char *, int *);
  static void load_cookies (void);
  
  #ifndef MIN
@@ -87,6 +89,7 @@ static struct cookie_jar *wget_cookie_jar;
  
  #define TEXTHTML_S "text/html"
  #define TEXTXHTML_S "application/xhtml+xml"
+#define TEXTCSS_S "text/css"
  
  /* Some status code validation macros: */
  #define H_20X(x)        (((x) >= 200) && ((x) < 300))
@@ -2130,34 +2133,25 @@ File %s already there; not retrieving.\n\n"), quote (hs->local_file));
    else
      *dt &= ~TEXTHTML;
  
-  if (opt.html_extension && (*dt & TEXTHTML))
-    /* -E / --html-extension / html_extension = on was specified, and this is a
-       text/html file.  If some case-insensitive variation on ".htm[l]" isn't
-       already the file's suffix, tack on ".html". */
-    {
-      char *last_period_in_local_filename = strrchr (hs->local_file, '.');
+  if (type &&
+      0 == strncasecmp (type, TEXTCSS_S, strlen (TEXTCSS_S)))
+    *dt |= TEXTCSS;
+  else
+    *dt &= ~TEXTCSS;
  
-      if (last_period_in_local_filename == NULL
-          || !(0 == strcasecmp (last_period_in_local_filename, ".htm")
-               || 0 == strcasecmp (last_period_in_local_filename, ".html")))
+  if (opt.html_extension)
+    {
+      if (*dt & TEXTHTML)
+        /* -E / --html-extension / html_extension = on was specified,
+           and this is a text/html file.  If some case-insensitive
+           variation on ".htm[l]" isn't already the file's suffix,
+           tack on ".html". */
          {
-          int local_filename_len = strlen (hs->local_file);
-          /* Resize the local file, allowing for ".html" preceded by
-             optional ".NUMBER".  */
-          hs->local_file = xrealloc (hs->local_file,
-                                     local_filename_len + 24 + sizeof (".html"));
-          strcpy(hs->local_file + local_filename_len, ".html");
-          /* If clobbering is not allowed and the file, as named,
-             exists, tack on ".NUMBER.html" instead. */
-          if (!ALLOW_CLOBBER && file_exists_p (hs->local_file))
-            {
-              int ext_num = 1;
-              do
-                sprintf (hs->local_file + local_filename_len,
-                         ".%d.html", ext_num++);
-              while (file_exists_p (hs->local_file));
-            }
-          *dt |= ADDED_HTML_EXTENSION;
+          ensure_extension (hs, ".html", dt);
+        }
+      else if (*dt & TEXTCSS)
+        {
+          ensure_extension (hs, ".css", dt);
          }
      }
  
@@ -3222,6 +3216,42 @@ http_cleanup (void)
      cookie_jar_delete (wget_cookie_jar);
  }
  
+static void
+ensure_extension (struct http_stat *hs, const char *ext, int *dt)
+{
+  char *last_period_in_local_filename = strrchr (hs->local_file, '.');
+  char shortext[8] = "";
+  int len = strlen (ext);
+  if (len == 5)
+    {
+      strncpy (shortext, ext, len - 1);
+      shortext[len - 1] = '\0';
+    }
+
+  if (last_period_in_local_filename == NULL
+      || !(0 == strcasecmp (last_period_in_local_filename, shortext)
+           || 0 == strcasecmp (last_period_in_local_filename, ext)))
+    {
+      int local_filename_len = strlen (hs->local_file);
+      /* Resize the local file, allowing for EXT preceded by
+         optional ".NUMBER".  */
+      hs->local_file = xrealloc (hs->local_file,
+                                 local_filename_len + 24 + len);
+      strcpy (hs->local_file + local_filename_len, ext);
+      /* If clobbering is not allowed and the file, as named,
+         exists, tack on ".NUMBER.html" instead. */
+      if (!ALLOW_CLOBBER && file_exists_p (hs->local_file))
+        {
+          int ext_num = 1;
+          do
+            sprintf (hs->local_file + local_filename_len,
+                     ".%d%s", ext_num++, ext);
+          while (file_exists_p (hs->local_file));
+        }
+      *dt |= ADDED_HTML_EXTENSION;
+    }
+}
+
  
  #ifdef TESTING
  
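The suffix test at the top of ensure_extension can be read in isolation. The sketch below is an illustration only, not code from this commit: has_extension is a hypothetical helper name, and it assumes the intended behavior is the one stated in the comment at the call site (any case-insensitive variation on ".htm[l]" counts as already present), with the short form only tried for five-character extensions.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical helper, not part of Wget: return true when NAME already
   ends in EXT, or in EXT minus its last character when EXT is five
   characters long (so ".htm" is accepted for ".html").  The comparison
   is case-insensitive.  */
static bool
has_extension (const char *name, const char *ext)
{
  const char *dot = strrchr (name, '.');
  char shortext[8] = "";          /* stays empty unless EXT is 5 chars */
  size_t len = strlen (ext);

  assert (len < sizeof shortext); /* EXT must fit in the scratch buffer */
  if (dot == NULL)
    return false;                 /* no suffix at all */
  if (len == 5)
    {
      memcpy (shortext, ext, len - 1);
      shortext[len - 1] = '\0';   /* ".html" -> ".htm" */
    }
  return strcasecmp (dot, ext) == 0
         || (shortext[0] != '\0' && strcasecmp (dot, shortext) == 0);
}
```

Under these assumptions, "index.HTML" and "index.htm" both count as already carrying ".html", while "style" and "a.cs" do not carry ".css", so only the latter two would get an extension appended.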
diff --git a/src/main.c b/src/main.c

index bf49bf8973b3bff84882ecd49f023bb9ea0a21a8..6135a67d8fb1e17894b6d55ad575799bf523be25 100644 (file)
--- a/src/main.c
+++ b/src/main.c
@@ -422,7 +422,7 @@ Logging and input file:\n"),
      N_("\
    -nv, --no-verbose          turn off verboseness, without being quiet.\n"),
      N_("\
-  -i,  --input-file=FILE     download URLs found in FILE.\n"),
+  -i,  --input-file=FILE     download URLs found in local or external FILE.\n"),
      N_("\
    -F,  --force-html          treat input file as HTML.\n"),
      N_("\
@@ -615,7 +615,8 @@ Recursive download:\n"),
      N_("\
         --delete-after       delete files locally after downloading them.\n"),
      N_("\
-  -k,  --convert-links      make links in downloaded HTML point to local files.\n"),
+  -k,  --convert-links      make links in downloaded HTML or CSS point to\n\
+                            local files.\n"),
      N_("\
    -K,  --backup-converted   before converting file X, back up as X.orig.\n"),
      N_("\
diff --git a/src/recur.c b/src/recur.c

index 6f5da2ae77b9b33dfa612a10c2e8a847adc47f2f..24b80ad475878203680ec355f9ea818758e00b0c 100644 (file)
--- a/src/recur.c
+++ b/src/recur.c
@@ -48,6 +48,8 @@ as that of the covered work.  */
  #include "hash.h"
  #include "res.h"
  #include "convert.h"
+#include "html-url.h"
+#include "css-url.h"
  #include "spider.h"
  #include "iri.h"
  \f
@@ -60,6 +62,8 @@ struct queue_element {
    bool html_allowed;            /* whether the document is allowed to
                                     be treated as HTML. */
    char *remote_encoding;
+  bool css_allowed;             /* whether the document is allowed to
+                                   be treated as CSS. */
    struct queue_element *next;   /* next element in queue */
  };
  
@@ -92,7 +96,8 @@ url_queue_delete (struct url_queue *queue)
  
  static void
  url_enqueue (struct url_queue *queue,
-             const char *url, const char *referer, int depth, bool html_allowed)
+             const char *url, const char *referer, int depth,
+             bool html_allowed, bool css_allowed)
  {
    struct queue_element *qel = xnew (struct queue_element);
    char *charset = get_current_charset ();
@@ -100,6 +105,7 @@ url_enqueue (struct url_queue *queue,
    qel->referer = referer;
    qel->depth = depth;
    qel->html_allowed = html_allowed;
+  qel->css_allowed = css_allowed;
    qel->next = NULL;
  
    if (charset)
@@ -130,7 +136,7 @@ url_enqueue (struct url_queue *queue,
  static bool
  url_dequeue (struct url_queue *queue,
               const char **url, const char **referer, int *depth,
-             bool *html_allowed)
+             bool *html_allowed, bool *css_allowed)
  {
    struct queue_element *qel = queue->head;
  
@@ -149,6 +155,7 @@ url_dequeue (struct url_queue *queue,
    *referer = qel->referer;
    *depth = qel->depth;
    *html_allowed = qel->html_allowed;
+  *css_allowed = qel->css_allowed;
  
    --queue->count;
  
@@ -216,7 +223,7 @@ retrieve_tree (const char *start_url)
  
    /* Enqueue the starting URL.  Use start_url_parsed->url rather than
       just URL so we enqueue the canonical form of the URL.  */
-  url_enqueue (queue, xstrdup (start_url_parsed->url), NULL, 0, true);
+  url_enqueue (queue, xstrdup (start_url_parsed->url), NULL, 0, true, false);
    string_set_add (blacklist, start_url_parsed->url);
  
    while (1)
@@ -224,7 +231,8 @@ retrieve_tree (const char *start_url)
        bool descend = false;
        char *url, *referer, *file = NULL;
        int depth;
-      bool html_allowed;
+      bool html_allowed, css_allowed;
+      bool is_css = false;
        bool dash_p_leaf_HTML = false;
  
        if (opt.quota && total_downloaded_bytes > opt.quota)
@@ -236,7 +244,7 @@ retrieve_tree (const char *start_url)
  
        if (!url_dequeue (queue,
                          (const char **)&url, (const char **)&referer,
-                        &depth, &html_allowed))
+                        &depth, &html_allowed, &css_allowed))
          break;
  
        /* ...and download it.  Note that this download is in most cases
@@ -254,10 +262,21 @@ retrieve_tree (const char *start_url)
            DEBUGP (("Already downloaded \"%s\", reusing it from \"%s\".\n",
                     url, file));
  
+          /* this sucks, needs to be combined! */
            if (html_allowed
                && downloaded_html_set
                && string_set_contains (downloaded_html_set, file))
-            descend = true;
+            {
+              descend = true;
+              is_css = false;
+            }
+          if (css_allowed
+              && downloaded_css_set
+              && string_set_contains (downloaded_css_set, file))
+            {
+              descend = true;
+              is_css = true;
+            }
          }
        else
          {
@@ -268,7 +287,21 @@ retrieve_tree (const char *start_url)
  
            if (html_allowed && file && status == RETROK
                && (dt & RETROKF) && (dt & TEXTHTML))
-            descend = true;
+            {
+              descend = true;
+              is_css = false;
+            }
+
+          /* Slightly different from the HTML case: css_allowed can
+             override the content type, because many web servers serve
+             CSS with an incorrect content type.  */
+          if (file && status == RETROK
+              && (dt & RETROKF) &&
+              ((dt & TEXTCSS) || css_allowed))
+            {
+              descend = true;
+              is_css = true;
+            }
  
            if (redirected)
              {
@@ -322,14 +355,15 @@ retrieve_tree (const char *start_url)
              }
          }
  
-      /* If the downloaded document was HTML, parse it and enqueue the
+      /* If the downloaded document was HTML or CSS, parse it and enqueue the
           links it contains. */
  
        if (descend)
          {
            bool meta_disallow_follow = false;
            struct urlpos *children
-            = get_urls_html (file, url, &meta_disallow_follow);
+            = is_css ? get_urls_css_file (file, url) :
+                       get_urls_html (file, url, &meta_disallow_follow);
  
            if (opt.use_robots && meta_disallow_follow)
              {
@@ -363,7 +397,8 @@ retrieve_tree (const char *start_url)
                      {
                        url_enqueue (queue, xstrdup (child->url->url),
                                     xstrdup (referer_url), depth + 1,
-                                   child->link_expect_html);
+                                   child->link_expect_html,
+                                   child->link_expect_css);
                        /* We blacklist the URL we have enqueued, because we
                           don't want to enqueue (and hence download) the
                           same URL twice.  */
@@ -412,9 +447,9 @@ retrieve_tree (const char *start_url)
    {
      char *d1, *d2;
      int d3;
-    bool d4;
+    bool d4, d5;
      while (url_dequeue (queue,
-                        (const char **)&d1, (const char **)&d2, &d3, &d4))
+                        (const char **)&d1, (const char **)&d2, &d3, &d4, &d5))
        {
          xfree (d1);
          xfree_null (d2);
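The descend decision that retrieve_tree makes after a fresh download can be summarized as a small pure function. This is a sketch under stated assumptions, not code from the commit: choose_parser and enum parser are hypothetical names, the file and status == RETROK checks are left out, and the numeric values of RETROKF and TEXTHTML are assumed for illustration (only TEXTCSS = 0x0040 appears in this diff).

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values: TEXTCSS is taken from the wget.h hunk in this diff;
   TEXTHTML and RETROKF are assumed values for this illustration.  */
enum { TEXTHTML = 0x0001, RETROKF = 0x0002, TEXTCSS = 0x0040 };

/* Hypothetical result type: which link extractor to run, if any.  */
enum parser { PARSE_NONE, PARSE_HTML, PARSE_CSS };

/* Mirror the order of the two checks in retrieve_tree: the HTML test
   runs first, then the CSS test, so CSS wins when both apply.  Note
   that css_allowed alone is enough for CSS, while HTML needs both
   html_allowed and the TEXTHTML flag.  */
static enum parser
choose_parser (int dt, bool html_allowed, bool css_allowed)
{
  enum parser p = PARSE_NONE;

  if (!(dt & RETROKF))
    return PARSE_NONE;            /* nothing usable was retrieved */
  if (html_allowed && (dt & TEXTHTML))
    p = PARSE_HTML;
  if ((dt & TEXTCSS) || css_allowed)
    p = PARSE_CSS;
  return p;
}
```

Because the starting URL is enqueued with css_allowed set to false, it is only parsed as CSS when the server labels it text/css; children reached through a link that expects CSS are parsed as CSS regardless of the reported type.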
diff --git a/src/recur.h b/src/recur.h

index d2c8e6148a64a1eedb40691acd7805c684e1f828..5ab26a950011366d6888c544908ef834312929bb 100644 (file)
--- a/src/recur.h
+++ b/src/recur.h
@@ -44,9 +44,4 @@ struct urlpos;
  void recursive_cleanup (void);
  uerr_t retrieve_tree (const char *);
  
-/* These are really in html-url.c. */
-struct urlpos *get_urls_file (const char *);
-struct urlpos *get_urls_html (const char *, const char *, bool *);
-void free_urlpos (struct urlpos *);
-
  #endif /* RECUR_H */
diff --git a/src/retr.c b/src/retr.c

index dd4978a7acb17297d54b81e51fe4abc50b4f7384..7a28ea32e76e825d452417d104fb538a178e79a8 100644 (file)
--- a/src/retr.c
+++ b/src/retr.c
@@ -52,6 +52,7 @@ as that of the covered work.  */
  #include "convert.h"
  #include "ptimer.h"
  #include "iri.h"
+#include "html-url.h"
  
  /* Total size of downloaded files.  Used to enforce quota.  */
  SUM_SIZE_INT total_downloaded_bytes;
@@ -795,6 +796,16 @@ retrieve_url (const char *origurl, char **file, char **newloc,
          register_redirection (origurl, u->url);
        if (*dt & TEXTHTML)
          register_html (u->url, local_file);
+      if (*dt & RETROKF)
+        {
+          register_download (u->url, local_file);
+          if (redirection_count && 0 != strcmp (origurl, u->url))
+            register_redirection (origurl, u->url);
+          if (*dt & TEXTHTML)
+            register_html (u->url, local_file);
+          if (*dt & TEXTCSS)
+            register_css (u->url, local_file);
+        }
      }
  
    if (file)
@@ -835,10 +846,24 @@ retrieve_from_file (const char *file, bool html, int *count)
    uerr_t status;
    struct urlpos *url_list, *cur_url;
  
-  url_list = (html ? get_urls_html (file, NULL, NULL)
-              : get_urls_file (file));
+  char *input_file = NULL;
+  const char *url = file;
+
    status = RETROK;             /* Suppose everything is OK.  */
    *count = 0;                  /* Reset the URL count.  */
+  
+  if (url_has_scheme (url))
+    {
+      uerr_t status;
+      status = retrieve_url (url, &input_file, NULL, NULL, NULL, false);
+      if (status != RETROK)
+        return status;
+    }
+  else
+    input_file = (char *) file;
+
+  url_list = (html ? get_urls_html (input_file, NULL, NULL)
+              : get_urls_file (input_file));
  
    for (cur_url = url_list; cur_url; cur_url = cur_url->next, ++*count)
      {
diff --git a/src/wget.h b/src/wget.h

index 8e2d3872f5a8260c01a03a13332af9fcf9d3ae29..d87dfcac85198b4b5f0f305c703029fba68d080d 100644 (file)
--- a/src/wget.h
+++ b/src/wget.h
@@ -317,7 +317,8 @@ enum
    HEAD_ONLY            = 0x0004,       /* only send the HEAD request */
    SEND_NOCACHE         = 0x0008,       /* send Pragma: no-cache directive */
    ACCEPTRANGES         = 0x0010,       /* Accept-ranges header was found */
-  ADDED_HTML_EXTENSION = 0x0020         /* added ".html" extension due to -E */
+  ADDED_HTML_EXTENSION = 0x0020,        /* added ".html" extension due to -E */
+  TEXTCSS              = 0x0040                /* document is of type text/css */
  };
  
  /* Universal error type -- used almost everywhere.  Error reporting of
diff --git a/ylwrap b/ylwrap

new file mode 100755 (executable)

index 0000000..102bd89
--- /dev/null
+++ b/ylwrap
@@ -0,0 +1,223 @@
+#! /bin/sh
+# ylwrap - wrapper for lex/yacc invocations.
+
+scriptversion=2005-05-14.22
+
+# Copyright (C) 1996, 1997, 1998, 1999, 2001, 2002, 2003, 2004, 2005
+#   Free Software Foundation, Inc.
+#
+# Written by Tom Tromey <tromey@cygnus.com>.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2, or (at your option)
+# any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+# 02110-1301, USA.
+
+# As a special exception to the GNU General Public License, if you
+# distribute this file as part of a program that contains a
+# configuration script generated by Autoconf, you may include it under
+# the same distribution terms that you use for the rest of that program.
+
+# This file is maintained in Automake, please report
+# bugs to <bug-automake@gnu.org> or send patches to
+# <automake-patches@gnu.org>.
+
+case "$1" in
+  '')
+    echo "$0: No files given.  Try \`$0 --help' for more information." 1>&2
+    exit 1
+    ;;
+  --basedir)
+    basedir=$2
+    shift 2
+    ;;
+  -h|--h*)
+    cat <<\EOF
+Usage: ylwrap [--help|--version] INPUT [OUTPUT DESIRED]... -- PROGRAM [ARGS]...
+
+Wrapper for lex/yacc invocations, renaming files as desired.
+
+  INPUT is the input file
+  OUTPUT is one file PROG generates
+  DESIRED is the file we actually want instead of OUTPUT
+  PROGRAM is program to run
+  ARGS are passed to PROG
+
+Any number of OUTPUT,DESIRED pairs may be used.
+
+Report bugs to <bug-automake@gnu.org>.
+EOF
+    exit $?
+    ;;
+  -v|--v*)
+    echo "ylwrap $scriptversion"
+    exit $?
+    ;;
+esac
+
+
+# The input.
+input="$1"
+shift
+case "$input" in
+  [\\/]* | ?:[\\/]*)
+    # Absolute path; do nothing.
+    ;;
+  *)
+    # Relative path.  Make it absolute.
+    input="`pwd`/$input"
+    ;;
+esac
+
+pairlist=
+while test "$#" -ne 0; do
+  if test "$1" = "--"; then
+    shift
+    break
+  fi
+  pairlist="$pairlist $1"
+  shift
+done
+
+# The program to run.
+prog="$1"
+shift
+# Make any relative path in $prog absolute.
+case "$prog" in
+  [\\/]* | ?:[\\/]*) ;;
+  *[\\/]*) prog="`pwd`/$prog" ;;
+esac
+
+# FIXME: add hostname here for parallel makes that run commands on
+# other machines.  But that might take us over the 14-char limit.
+dirname=ylwrap$$
+trap "cd `pwd`; rm -rf $dirname > /dev/null 2>&1" 1 2 3 15
+mkdir $dirname || exit 1
+
+cd $dirname
+
+case $# in
+  0) $prog "$input" ;;
+  *) $prog "$@" "$input" ;;
+esac
+ret=$?
+
+if test $ret -eq 0; then
+  set X $pairlist
+  shift
+  first=yes
+  # Since DOS filename conventions don't allow two dots,
+  # the DOS version of Bison writes out y_tab.c instead of y.tab.c
+  # and y_tab.h instead of y.tab.h. Test to see if this is the case.
+  y_tab_nodot="no"
+  if test -f y_tab.c || test -f y_tab.h; then
+    y_tab_nodot="yes"
+  fi
+
+  # The directory holding the input.
+  input_dir=`echo "$input" | sed -e 's,\([\\/]\)[^\\/]*$,\1,'`
+  # Quote $INPUT_DIR so we can use it in a regexp.
+  # FIXME: really we should care about more than `.' and `\'.
+  input_rx=`echo "$input_dir" | sed 's,\\\\,\\\\\\\\,g;s,\\.,\\\\.,g'`
+
+  while test "$#" -ne 0; do
+    from="$1"
+    # Handle y_tab.c and y_tab.h output by DOS
+    if test $y_tab_nodot = "yes"; then
+      if test $from = "y.tab.c"; then
+       from="y_tab.c"
+      else
+       if test $from = "y.tab.h"; then
+         from="y_tab.h"
+       fi
+      fi
+    fi
+    if test -f "$from"; then
+      # If $2 is an absolute path name, then just use that,
+      # otherwise prepend `../'.
+      case "$2" in
+       [\\/]* | ?:[\\/]*) target="$2";;
+       *) target="../$2";;
+      esac
+
+      # We do not want to overwrite a header file if it hasn't
+      # changed.  This avoids useless recompilations.  However the
+      # parser itself (the first file) should always be updated,
+      # because it is the destination of the .y.c rule in the
+      # Makefile.  Divert the output of all other files to a temporary
+      # file so we can compare them to existing versions.
+      if test $first = no; then
+       realtarget="$target"
+       target="tmp-`echo $target | sed s/.*[\\/]//g`"
+      fi
+      # Edit out `#line' or `#' directives.
+      #
+      # We don't want the resulting debug information to point at
+      # an absolute srcdir; it is better for it to just mention the
+      # .y file with no path.
+      #
+      # We want to use the real output file name, not yy.lex.c for
+      # instance.
+      #
+      # We want the include guards to be adjusted too.
+      FROM=`echo "$from" | sed \
+            -e 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/'\
+            -e 's/[^ABCDEFGHIJKLMNOPQRSTUVWXYZ]/_/g'`
+      TARGET=`echo "$2" | sed \
+            -e 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/'\
+            -e 's/[^ABCDEFGHIJKLMNOPQRSTUVWXYZ]/_/g'`
+
+      sed -e "/^#/!b" -e "s,$input_rx,," -e "s,$from,$2," \
+          -e "s,$FROM,$TARGET," "$from" >"$target" || ret=$?
+
+      # Check whether header files must be updated.
+      if test $first = no; then
+       if test -f "$realtarget" && cmp -s "$realtarget" "$target"; then
+         echo "$2" is unchanged
+         rm -f "$target"
+       else
+          echo updating "$2"
+          mv -f "$target" "$realtarget"
+        fi
+      fi
+    else
+      # A missing file is only an error for the first file.  This
+      # is a blatant hack to let us support using "yacc -d".  If -d
+      # is not specified, we don't want an error when the header
+      # file is "missing".
+      if test $first = yes; then
+        ret=1
+      fi
+    fi
+    shift
+    shift
+    first=no
+  done
+else
+  ret=$?
+fi
+
+# Remove the directory.
+cd ..
+rm -rf $dirname
+
+exit $ret
+
+# Local Variables:
+# mode: shell-script
+# sh-indentation: 2
+# eval: (add-hook 'write-file-hooks 'time-stamp)
+# time-stamp-start: "scriptversion="
+# time-stamp-format: "%:y-%02m-%02d.%02H"
+# time-stamp-end: "$"
+# End: