
How to use HTTrack to make any website static (with link rewriting)

By Leon Stafford @leonstafford 2018-11-30 05:10:56.641Z

Sometimes, when the WP2Static plugin can't be used in your situation, HTTrack can be a good alternative.

It is a very mature application, with packages available for most operating systems and both GUI and CLI components, and it's a really powerful tool.

It can also be piped into other commands or used in conjunction with other tools to do things like auto-deployment to Netlify, S3, etc.

So, how to use it?


  1. @brackenhill_mob 2018-12-05 06:55:05.461Z

    Interesting that you should have brought up the idea of using HTTrack. I've just needed to go down this route, so in the hope that it helps others I thought I'd outline my experience. Note that I would much rather use WP2Static as it is a) faster and b) much easier to use. However, if you need to use HTTrack, read on.

    At the time of writing, I've got a weird problem with the output of WP2Static. Everything works except that it doesn't produce pages 3 and 4 of a 5-page home page (i.e. the links at the bottom of the screen to additional pages). I needed to verify whether the problem was with my WordPress site or the plugin. I'd deleted a lot of old posts, so I wasn't sure if the issue might be a broken database. Enter HTTrack, which I've used before to transfer static sites between hosts, so I have some experience of how it works.

    If you do an online search for WordPress and HTTrack you will see a lot of posts saying that HTTrack fails for one reason or another. When I started on this, I used a test server and I got garbage as the output. I also tried the wget CLI app and got similar results, so the issue was obviously WordPress not playing nice. I then tried it on the live server and got sensible (if useless, see below) results. After a bit of head scratching I realised that my live server used Cloudflare (http://www.cloudflare.com) as a cache/CDN. If you haven't tried Cloudflare I strongly recommend that you do, as it can dramatically cut down WordPress load times. Why does the crawl work through Cloudflare when native WordPress doesn't? I presume that it is because a caching engine delivers pure HTML to start with, whereas WordPress often serves gzipped HTML files to speed up load times, either a) because a plugin has been installed or b) because the webserver (often nginx) has been configured to do this by default. That would explain why so many people report problems using HTTrack. But I digress...
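If the presumption above is right, the "garbage" files should simply be gzip-compressed HTML. A minimal Python sketch for checking that on a downloaded mirror follows; it only looks for the two-byte gzip signature at the start of each .html file. The function names and the idea of scanning the mirror folder are illustrative assumptions, not something HTTrack or WP2Static provides.

```python
import gzip
from pathlib import Path

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(data: bytes) -> bool:
    """Return True if the bytes start with the gzip signature."""
    return data[:2] == GZIP_MAGIC


def find_gzipped_html(mirror_dir):
    """List .html files under mirror_dir whose content is still gzip-compressed."""
    hits = []
    for path in sorted(Path(mirror_dir).rglob("*.html")):
        with path.open("rb") as fh:
            if looks_gzipped(fh.read(2)):
                hits.append(path)
    return hits


def decompress_in_place(path):
    """Replace a gzip-compressed file with its decompressed content."""
    path = Path(path)
    path.write_bytes(gzip.decompress(path.read_bytes()))
```

If find_gzipped_html() returns nothing, the garbage has some other cause and the gzip theory does not apply to that crawl.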

    So how do you get static files?

    Step 1. Use WebHTTrack

    HTTrack has a lot of switches! Rather than spend ages testing them, WebHTTrack has some decent defaults that will get you started. When you are happy, you can copy the call to HTTrack with all the switches from the log file that is produced. WebHTTrack is not part of the HTTrack package (at least not on Ubuntu Linux which is what I use) so you will probably need to download and install it separately.

    Tip: Each time I change the switch settings, I start the run, abort it, then copy the call to HTTrack with its switches and run that from the command line. The CLI version is about 50% faster than the web version; on my i7 laptop with a fast SSD, my 500-page site takes over 2 minutes to finish using the web version.

    Step 2. Tweak the "ignore" settings

    HTTrack follows every link and downloads what it finds. This means that if your theme uses Google fonts or data from Twitter for example, you will find a folder for these fonts or Twitter stuff on your hard drive which is probably not what you want.

    When you let HTTrack run to completion you will probably see a number of folders at the same level as the folder that holds the downloaded site. Add each of these to the HTTrack call as an exclusion filter, using the "-" prefix.

    Tip: Don't forget to add "/*" (without the quotes) to the end of the folder name otherwise you won't stop these extraneous files from being downloaded.
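The step above can be sketched in Python as follows: list the extra folders next to the site folder, turn each one into a "-name/*" exclusion filter, and append the filters to the HTTrack call. This is a sketch under assumptions: the function names are made up for illustration, the folder called hts-cache is skipped because HTTrack creates it for its own use, and only the -O (output path) switch is shown. Keep the rest of the switches from your own WebHTTrack log.

```python
from pathlib import Path

# Folder HTTrack creates for its own bookkeeping; not a downloaded host.
HTTRACK_INTERNAL = {"hts-cache"}


def extra_host_folders(mirror_dir, site_host):
    """Names of folders in mirror_dir other than the site's own folder."""
    return sorted(
        p.name
        for p in Path(mirror_dir).iterdir()
        if p.is_dir() and p.name != site_host and p.name not in HTTRACK_INTERNAL
    )


def exclusion_filters(folders):
    """Turn folder names into HTTrack '-' filters, with the trailing /*."""
    return [f"-{name}/*" for name in folders]


def build_command(url, mirror_dir, filters):
    """Assemble the httrack call as an argument list (no shell quoting needed)."""
    return ["httrack", url, "-O", str(mirror_dir), *filters]
```

Passing the result to subprocess.run() as a list avoids having to quote the "*" characters for the shell.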

    Step 3. Edit functions.php

    Most WordPress sites generate a lot of unnecessary output, which results in HTTrack creating a number of files which are never referenced by your site and so just take up room. You will know if your site does because you will have a number of indexXXXXXXX.html files in your home directory when all you need is the main index.html one.

    To get rid of these, add one or more of the following lines to the functions.php file in your theme folder (create it if it doesn't exist):

    remove_action('wp_head', 'wp_generator');
    remove_action('wp_head', 'rsd_link');
    remove_action('wp_head', 'wlwmanifest_link');
    remove_action('wp_head', 'wp_shortlink_wp_head');
    remove_action('wp_head', 'index_rel_link');
    remove_action('wp_head', 'start_post_rel_link');
    remove_action('wp_head', 'adjacent_posts_rel_link_wp_head');
    

    And that's it. Assuming all goes well, you will have your WordPress site on your hard drive, available to upload to the host of your choice.
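To check that the functions.php lines above took effect, one option is to scan a page's HTML for the tags those hooks normally print. The Python sketch below does that with the standard html.parser module. The list of rel values is an assumption about what each WordPress hook emits (for example rel="EditURI" for rsd_link and a generator meta tag for wp_generator), so adjust it to what you see in your own page source.

```python
from html.parser import HTMLParser

# Assumed rel values printed by the hooks removed in functions.php.
UNWANTED_RELS = {"edituri", "wlwmanifest", "shortlink", "index", "start", "prev", "next"}


class HeadAudit(HTMLParser):
    """Collects leftover generator/link tags while parsing a page."""

    def __init__(self):
        super().__init__()
        self.leftovers = []

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive lowercased; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "generator":
            self.leftovers.append(("meta", "generator"))
        elif tag == "link":
            for rel in attr.get("rel", "").lower().split():
                if rel in UNWANTED_RELS:
                    self.leftovers.append(("link", rel))


def audit_head(html):
    """Return the (tag, kind) pairs that are still present in the HTML."""
    parser = HeadAudit()
    parser.feed(html)
    parser.close()
    return parser.leftovers
```

An empty list for your home page suggests the extra link tags are gone, so a fresh HTTrack run should have fewer stray links to follow.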

    But WP2Static is a much better option ...

    1. Joseph Toman @lillibolero 2018-12-23 08:32:44.952Z

      Why is WP2Static a much better option? I'm not criticizing the product, but I am having trouble getting it to crawl my site and it seems like using something like HTTrack and crawling externally instead of using a product that might conflict with other plugins or care about the version of PHP installed is a better idea. I don't care how the sausage, or the page, is made, just that it gets served to the viewer. I also don't care if viewers think they are looking at a WP site. In fact I would rather have bad actors under that false impression. So, I'm wondering if you could amplify on your statement. Thanks.

      1. @brackenhill_mob 2018-12-23 16:18:09.116Z

        @lillibolero I'm a bit late to this. Apologies.

        Between you and Leon I think you've covered most/all of the points. What I was driving at was that there is no getting away from the fact that HTTrack is a powerful, and therefore complex, tool. To get it to crawl my site successfully I had to spend about an hour tweaking all the settings and rerunning the app to see if they worked. What a PIA!! Most of us only need to crawl one, possibly two or three, of our sites, so using a plugin is a) simpler and b) less prone to human error.

        HTH

        1. In reply to lillibolero:
          Leon Stafford @leonstafford 2018-12-23 08:48:46.177Z

          Hi Joseph,

          Good question. It may not always be the better option. HTTrack is certainly an awesome tool.

          I'll try to give some scenarios where I might choose the plugin over HTTrack:

          • want the auto-deployment to Netlify, S3, GitHub Pages, that come from the plugin
          • not comfortable with CLI or configuring HTTrack options
          • no permissions to install software on the server
          • HTTrack missing pieces of the site (knowledge of the WP internals, along with plugin and theme detection is building up more solutions in the plugin codebase)
          • the rewriting/stripping of WP elements you mentioned (sometimes you still want to strip things, even if the WP paths are maintained, i.e., wp-admin and wp-json things can give 404s on your static site)

          And... coming soon... auto-export on post publish

          I think those are some valid scenarios, but I do love HTTrack, too!

          I find the new WP-CLI integration in the plugin my preferred way to run deploys now, and I think I'm setting fewer options via the CLI than I would be for HTTrack.

          Definitely give the WP-CLI a run if you're an HTTrack fan. I've just soft-released version 6.0 of the plugin this morning:

          https://wordpress.org/plugins/static-html-output-plugin/
          1. Joseph Toman @lillibolero 2018-12-23 09:02:59.664Z

            Thanks for the fast response. I think you missed one: there's a button that says 'Export' on it. For people whose computer knowledge is navigating the back end of a WP site, that gives them a tremendous feeling of control. The best HTTrack solution I have come up with is to simply run a cron job that does the HTTrack export and the upload to S3. Not optimal, but an easy enough workaround. I have not dug deeply into HTTrack configs yet, but can it do a differential crawl? That's another key feature if the site has been up for any length of time. Oh, for a multisite setup, does the plugin have per-site configs? Anyhow, I will check out the release of v. 6.0, thanks.
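A cron job like the one described above could call a small script along these lines. This is a sketch under assumptions, not the poster's actual setup: it assumes the httrack and aws command-line tools are installed, that the mirror ends up in a folder named after the site host inside the output directory, and the bucket name and function names are hypothetical. Only the -O switch is passed to HTTrack; the rest of your switches would go in mirror_command().

```python
import subprocess


def mirror_command(site_url, out_dir):
    """HTTrack call that mirrors site_url into out_dir."""
    return ["httrack", site_url, "-O", out_dir]


def upload_command(local_dir, bucket):
    """AWS CLI call that syncs local_dir to the bucket, deleting stale objects."""
    return ["aws", "s3", "sync", local_dir, f"s3://{bucket}", "--delete"]


def export_and_upload(site_url, site_host, out_dir, bucket):
    """Run the crawl, then the upload; stop if either command fails."""
    subprocess.run(mirror_command(site_url, out_dir), check=True)
    subprocess.run(upload_command(f"{out_dir}/{site_host}", bucket), check=True)
```

With check=True a failed crawl raises an exception before the upload starts, so a broken export does not overwrite the live bucket.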

            1. Leon Stafford @leonstafford 2018-12-23 10:47:23.922Z

              Indeed, good point!

              I’m just about to fall into a nap here, but you’ve given me an idea. I was just going to say that there was an implementation of diff-based deploys a few versions ago, but I dropped it as it wasn’t reliable enough. With WP-CLI breaking the stages into “generate” and “deploy”, we’re able to trivially do diff-based deploys (think I have a pseudo-code example on these forums). But what you made me think about is diff-based crawls. This should also be possible today, with WP-CLI and the previous export’s crawled links list. Feeding that into the next export’s exclusions list would work in some cases (where the header/footer/sidebar isn’t changing on each crawl or new post addition; think Latest Posts widget).
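One way to picture a diff-based deploy between a "generate" step and a "deploy" step is to hash every file in the previous and current export folders and upload only what differs. The Python sketch below shows that idea; it is not the plugin's implementation or the pseudo-code mentioned above, and the function names are made up for illustration.

```python
import hashlib
from pathlib import Path


def hash_tree(root):
    """Map each file's path (relative to root) to the SHA-256 of its content."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_exports(previous, current):
    """Compare two hash maps; return (files to upload, files to delete)."""
    changed = sorted(
        path for path, digest in current.items() if previous.get(path) != digest
    )
    removed = sorted(path for path in previous if path not in current)
    return changed, removed
```

Calling diff_exports(hash_tree(old_dir), hash_tree(new_dir)) gives the two lists a deploy step would need. Pages whose sidebar or footer changed will show up as changed too, which is the same caveat raised above for diff-based crawls.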

              1. In reply to lillibolero:
                Leon Stafford @leonstafford 2018-12-23 10:49:27.889Z

                And for multisite, it just installs into each site, so they’ll all have their own plugin’s settings. There’s no centralized control, via the UI, but again, easy to script with WP-CLI.

                Interested to hear your experiences with the plugin!

                1. Joseph Toman @lillibolero 2018-12-27 05:29:24.927Z

                  It percolated for a while, then threw a fit. It was configured to save a relative URL version to a ZIP file. No other URL mangling was configured. In the error log below, sensitive URL data has been replaced with "xxx".

                  STARTING EXPORT: 2018-12-27 05:19:45
                  STARTING EXPORT: PHP VERSION 7.2.10-0ubuntu0.18.04.1
                  STARTING EXPORT: OS VERSION Linux xxx 4.15.0-42-generic #45-Ubuntu SMP Thu Nov 15 19:32:10 UTC 2018 i686
                  STARTING EXPORT: WP VERSION 5.0.2
                  STARTING EXPORT: WP URL http://xxx
                  STARTING EXPORT: WP SITEURL http://xxx
                  STARTING EXPORT: WP HOME http://xxx
                  STARTING EXPORT: WP ADDRESS http://xxx
                  STARTING EXPORT: PLUGIN VERSION 6.1
                  STARTING EXPORT: VIA WP-CLI?
                  STARTING EXPORT: STATIC EXPORT URL http://example.com/
                  BAD RESPONSE STATUS (404): /sitemap.xml
                  BAD RESPONSE STATUS (500): /wp-content/uploads/paypal-wp-button-manager-logs/index.html
                  SAVING URL: FILE IS EMPTY /wp-content/plugins/advanced-twenty-seventeen/inc/libraries/kirki/assets/css/kirki-styles.css
                  Failed trying to rename: Source: to:

                  1. Leon Stafford @leonstafford 2018-12-27 05:36:26.040Z

                    Hi Joseph,

                    As it's trying to do some renaming of directories, do you have something in the settings for rewriting and renaming?

                    Maybe a path issue there.

                    Cheers,

                    Leon

                    1. Joseph Toman @lillibolero 2018-12-27 05:38:52.956Z

                      Sorry, that's what I meant by no URL mangling. I didn't add any path fragments to replace, e.g. /wp-content, /wp-admin, etc.

                      1. Leon Stafford @leonstafford 2018-12-27 05:46:07.480Z

                        OK, for some reason, it seems to be trying to.

                        Empty space left in the field?

                        If not, try the "Reset to defaults" to ensure it's cleared.

                        I hope that's just what it is...

                        1. In reply to lillibolero:
                          Leon Stafford @leonstafford 2018-12-27 05:46:43.845Z

                          Specifically, the "Rename Exported Directories" textarea

                          1. Joseph Toman @lillibolero 2018-12-27 06:15:11.460Z

                            I have to admit I was wrong. There were paths set, I just didn't see them until I went to check for spaces. I reset to defaults, then set it to create a ZIP file and checked "Allow offline usage". The ZIP was created this time. However when I expanded the ZIP the front page jumbo header image was prepended with 'example.com'. Looking at the HTML generated this is from an img srcset attribute that had the full path to the file in the original. The original domain got translated to example.com.
                            Some of the images did not have relative paths, i.e. they started with /wp-content/uploads/... in the original. These retained the /wp-content/... and so failed to load.
                            I hope that's helpful.

                            1. Leon Stafford @leonstafford 2018-12-27 07:54:39.625Z

                              OK, the example.com is a mechanism for creating an Offline ZIP.

                              Try turning that off, inputting your intended domain and then re-exporting.

                              Via the WP-CLI, you can do the export without the deploy to save a bit of time and do quick checking from within the terminal.

                              Else, folder export is quick to check in the browser.

                              I'll look into the Offline ZIP option. I haven't tested it in a while and it's not covered by automated tests.

                              Cheers,

                              Leon

                              1. Leon Stafford @leonstafford 2018-12-27 07:55:15.785Z
                                • explanation: for offline zip, we need to normalize all URLs first, then replace segments with things like ../

                                Should be a quick fix once I look into it...
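The "normalize all URLs first, then replace segments with things like ../" idea can be illustrated with a short Python sketch. It is not the plugin's code: the function name is hypothetical, and it simply treats any URL on the site's own host (or any root-relative path such as /wp-content/...) as a path from the site root, then rewrites it relative to the page that links to it.

```python
import posixpath
from urllib.parse import urlsplit


def to_relative(url, page_path, site_host):
    """Rewrite a same-site URL so it works when the page is opened from disk.

    page_path is the page's own path from the site root, e.g. /blog/post/index.html.
    Query strings and trailing slashes are dropped in this sketch.
    """
    parts = urlsplit(url)
    if parts.netloc and parts.netloc != site_host:
        return url  # external link: leave untouched
    target = parts.path or "/"
    if not target.startswith("/"):
        return url  # already relative to the page
    page_dir = posixpath.dirname(page_path)
    return posixpath.relpath(target, start=page_dir)
```

For example, an image at http://example.com/wp-content/uploads/a.jpg referenced from /blog/post/index.html becomes ../../wp-content/uploads/a.jpg, and a root-relative /wp-content/x.css on the home page becomes wp-content/x.css. An offline export would need to apply this to srcset entries as well as src and href values.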