Other seemingly sensible approaches fall short:
- Doing a "View Page Source," "Select All," "Copy," and "Paste" into an empty Emacs buffer doesn't cut it, because that gives you the original HTML as it was served rather than the HTML that has been generated after the page's scripts have run.
- Copying the DOM as XML from the DOM Inspector gives you the current state in XML rather than HTML. I think you can get this to work if you add an XHTML doctype to your output, but it hardly seems worth it because then you have to work in XHTML instead of HTML.
- "Save Page As..." doesn't work on all pages, probably because it tries to pull down all of the stylesheets and images in addition to dumping the HTML, which is something that my script does not do. Because I own the pages that I'm trying to dump with this script, I can copy all of the images and CSS into the folder that contains the HTML file that my script outputs. Since those are static files that do not change often, this works well in practice, and it is likely faster than "Save Page As..." would be even if it worked.
My script also lets you scrub the <SCRIPT> tags from the HTML, which is often what you want in cases like this. Those tags often contain JavaScript to generate HTML for the page, but if the HTML has already been generated, then you don't want that JavaScript to run again when you load the dumped HTML in a web browser.
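To illustrate the scrubbing idea, here is a minimal sketch in Python. It is not the script described in this post; the names `scrub_scripts` and `SCRIPT_ELEMENT` are hypothetical, and it assumes the dumped HTML is already available as a string. It uses a regular expression rather than a real HTML parser, so it can be fooled by unusual markup (for example, a `>` inside a script tag's attribute value), but it shows the basic transformation.

```python
import re

# Matches an opening <script ...> tag, its body, and the closing </script>.
# IGNORECASE covers <SCRIPT> as well; DOTALL lets the body span multiple lines.
# Assumption: no '>' characters appear inside the opening tag's attributes.
SCRIPT_ELEMENT = re.compile(
    r"<script\b[^>]*>.*?</script\s*>",
    re.IGNORECASE | re.DOTALL,
)


def scrub_scripts(html: str) -> str:
    """Return html with every <script>...</script> element removed."""
    return SCRIPT_ELEMENT.sub("", html)


# A made-up example of dumped HTML, with one inline script in the head
# and one multi-line script in the body.
dumped = (
    '<html><head><SCRIPT type="text/javascript">buildPage();</SCRIPT></head>'
    "<body><p>Generated content</p>"
    "<script>\nrender();\n</script></body></html>"
)
cleaned = scrub_scripts(dumped)
print(cleaned)
```

The generated markup (the `<p>` element here) survives, while the code that would have generated it a second time is gone.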