Bash and the Drupal XML Sitemap Module for a Simple Cache Warmer Script

Caching content is critical to a high performance website.  For a quick background, let's discuss what caching is and how it can impact your site.

Let's take a quick look at our friends at the W3C and see how they officially define caching:

Caching is a required part of any efficient Internet access applications as it saves bandwidth and improves access performance significantly in almost all types of accesses.
- W3C

Essentially, what this really means is that content a website serves to you doesn't get repeatedly downloaded every time you visit.  Only new and changed content is downloaded, while old content that didn't change is pulled from the cache on your local machine / device.  This is ideal, especially in a world where everything is moving to mobile.  Imagine for a moment that you only had a data plan of 500 MB per month.  (Yes, this number seems low, but there are places all over the world where 500 MB is a lot of data.)  With a website that has a relatively large page weight, this can very quickly become a problem for a lot of people.  If you're curious exactly how much your website costs others to use, there are online tools that will show you some interesting statistics about your site.  Long story short, a page that weighs anywhere from 1 - 5 MB doesn't really seem like a lot, but take the scenario of someone with a 500 MB data plan: if your site is a bit heavier (closer to 5 MB), then every time someone visits your site you are literally using 1% of their data plan.  In other words, it adds up quickly.  We want to reduce the data load as much as possible, and reuse resources as much as possible, so the user gets a good experience and doesn't land with overages on their phone bill.
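To make that arithmetic concrete, here is a quick shell sketch of the data-plan scenario above (the numbers are the ones from the paragraph, not measurements of any real site):

```shell
# Back-of-the-envelope: one visit to a 5 MB page on a 500 MB data plan.
plan_mb=500
page_mb=5
pct=$(( page_mb * 100 / plan_mb ))
echo "One page view uses ${pct}% of the plan"
```

Run 100 page views against that plan and the entire monthly allowance is gone, which is exactly why reusing cached resources matters.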

That small tangent aside, let's dive into getting your Drupal website cached and how to keep it in cache.  Normally, these types of cache controls are used with external cache systems, such as a reverse proxy cache like Varnish.  They can also be helpful if you do not want to flush the cache of your entire website when you make small or insignificant changes.  Think of the performance downside for the poor soul who visits your website right after you clear the cache: literally every single page they navigate to has to be generated and rendered again.  In fact, the website would be painfully slow for the first users to hit it after the cache is cleared.  The solution is to make sure the cached content is always up to date, and that end users know to grab the new content.  We are going to add a couple of modules to our Drupal website to make sure we can adequately control the cache for our users, then we will worry about getting all the content cached.

You will need a few modules to make this work.

Installing modules is outside the scope of this tutorial, but the Drupal documentation can help if you are uncomfortable installing modules or need to brush up a bit.  The modules you will need are:

  • Advanced Page Expiration (link)
  • Cache Expiration (link)
  • XML Sitemap (link)

Please feel free to visit the links to the modules required, and brush up on any specifics about the modules in question.  Once you have the modules installed and enabled, the very first thing we want to do is make sure our Advanced Page Expiration settings are sane for testing.  Advanced Page Expiration can be administered by visiting your website and navigating to /admin/config/development/ape.  If you look at the image provided, you can see what our eventual results are going to be:

Advanced Page Expiration Settings

Notice how we have a couple of options to set.  The first is whether we are going to cache pages for anonymous users or not.  By default, we want to do this, as anonymous (not logged in) users will benefit the most from having content cached.  Pay attention to Global page expiration, as this controls when every page on the entire site expires.  One of the really nice things about this module is that we can cache server responses as well, so things like 301, 302, and 404 responses can be cached and sent back to the user without having to hit the backend server.  This is also a great performance gain if you have a site with a lot of changing content, moving content, or even the fringe case of a poorly designed site with a lot of 404 errors.  The backend server will not be overwhelmed, and as a result, you will see better performance for the site.  These settings can be tweaked, but the default values you see should be fine:

Server Response Caching

Remember, if the page isn't expired, the cached copy will be served.  Also remember, having a page expiration of 1 year with a faulty cache system setup could be disastrous.

If a client visits your website and you have the cache set to expire in a year, without cache control headers or ETags telling the browser to check for new content, the page will never change for that user for the entire year.

Obviously this is bad, because it essentially means any work you do on your website won't be seen for a year!  We have a solution:  enter the Cache Expiration module.

The Cache Expiration module

Since we are explicitly telling the website to cache content for as long as possible (a year in our case), we need to be able to tell people browsing the website if the content has changed.  The Cache Expiration module allows you to define custom rules for when to remove content from cache, including integration with other third party modules which control other cache systems (Varnish, Boost, Fastly, Akamai, Memcache, etc).  Once you've enabled the Cache Expiration module, you can administer it by navigating to /admin/config/system/expire and should see something similar to this image:

Cache Expiration Settings

We currently have our expiration set to External expiration because we use a cache system outside of Drupal (we have Varnish installed).  If you are not using an external cache system, you can still benefit from this module by using internal expiration.  Internal expiration will expire your site's own cache whenever a URL is flagged as expired.  While there is nothing wrong with internal caching, it does tend to be a lot slower than external caching methods because the internal cache relies on the database.  An external cache such as Varnish, on the other hand, can store the entire contents in memory: requests no longer have to touch a slow disk drive or query the database server for content, and you will generally see a great improvement in site speed.

When you are first using this module, we recommend setting the Debug level to Watchdog + site message.  Watchdog will send the information about anything the Cache Expiration module does to the database log, whereas site message will also show a nice message on node save / update / delete actions which lets you know exactly what URLs were expired.  This is beyond helpful if you are struggling to figure out what's happening (or not happening), and why.

If you take a look at the different options in the vertical tab to the left, you can set specific rules for the expiration of node content, comments, files, menu links, and more if you have other modules that hook into the Cache Expiration module.  With the "Node expiration" tab, you can set the defaults for every node on your website.  You can then edit the content type itself (like an article or basic page) and override the default settings for fine-grained control over individual content types.

Once you've set up your Cache Expiration defaults, and have Advanced Page Expiration set up in some sane manner, it's time to start testing your website.  From the command line, we can issue a curl command to check our headers and see if we are getting cached content or not.  Let's try it out on our website as an example:

Check Headers for a Website

Start out by issuing the following command (substitute your own site's address):

curl -I https://yourwebsite.com/

The response we get looks like this:

HTTP/1.1 200 OK
Date: Mon, 11 Sep 2017 22:10:17 GMT
X-Powered-By: PHP/5.6.31
Content-Language: en
X-Frame-Options: SAMEORIGIN
X-Generator: Binary Computer Solutions, Inc
Link: <https://...>; rel="image_src", <https://...>; rel="canonical", <https://...>; rel="shortlink"
Cache-Control: public, max-age=2592000
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Vary: Accept-Encoding,User-Agent
Last-Modified: Mon, 11 Sep 2017 17:58:06 GMT
Content-Type: text/html; charset=utf-8
Age: 73193
ETag: W/"1505152686-1"
Accept-Ranges: bytes
Connection: keep-alive

As you can see from the above response, the first thing we notice is the line "Cache-Control".  In our case, it is set to public, max-age=2592000 which is a great place to have our homepage resting.  The max-age in the Cache-Control header is always counted in seconds, so 2,592,000 seconds is exactly 30 days.  We can also see an Age: 73193 which is telling us the content is just over 20 hours old.  
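Both header values are plain seconds, so they are easy to sanity-check on the command line (the values below are copied from the response above):

```shell
# Convert the header values from seconds to friendlier units.
max_age=2592000   # from Cache-Control: public, max-age=2592000
age=73193         # from Age: 73193
echo "max-age: $(( max_age / 86400 )) days"   # 86400 seconds per day
echo "age: $(( age / 3600 )) hours"           # 3600 seconds per hour
```

This confirms the 30-day cache lifetime and the roughly 20-hour age of the cached copy.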

We also have an ETag, which is a unique identifier sent to let browsers know if content has changed since their last visit.  In other words, our website homepage is now being cached for 30 days.  When we make changes to the homepage, the Cache Expiration module will expire it, and the next person who views the homepage will load a fresh copy.  That fresh copy is then cached for the world until the next update expires the URL again, and so on.
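Conceptually, the ETag check works like a simple string comparison: the browser stores the tag from its last visit and sends it back, and the server decides whether a new download is needed.  A rough sketch (the tag values here are hypothetical, modeled on the header format shown above):

```shell
# Sketch of ETag revalidation: compare the browser's stored tag
# against the tag the server would send for the current content.
stored_etag='W/"1505152686-1"'    # what the browser cached last visit
current_etag='W/"1505152686-1"'   # what the server has right now
if [ "$stored_etag" = "$current_etag" ]; then
  result="304 Not Modified (use cached copy)"
else
  result="200 OK (download fresh copy)"
fi
echo "$result"
```

When the Cache Expiration module expires a page, the server-side tag changes, the comparison fails, and the browser fetches the fresh copy.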

Cache Warming

Now that we understand how caching works on the website, it would be ideal to have the content always cached.  The immediate downside, as explained before, is having a cache lifetime so long that nobody notices changes on your website; but if you set up the Cache Expiration module and Advanced Page Expiration to set your cache-control headers, you can cache a page for a year safely without worrying about the content not updating.  Ideally, when we expire a page, we want to re-cache it almost immediately.  The obvious upside is page load speed: if you have the entire website in cache, you will see a significant decrease in page load times.

So what happens after I expire a URL?  Well, honestly, there have been many people over the years who have done nothing but deal with caching systems, expiring cache, and how to control cache; all will agree that caching is the fine art of knowing how long to keep something and when to re-validate it.  With our setup above, we can keep a URL in cache indefinitely, then only change it if there was an actual change on the page.  Gone are the days of overly complex logic and hard to understand traffic flow on a website; now we can get away with a simple BASH script to warm our caches.

In order for this script to work, you will need to make sure you have the XML Sitemap module installed and working properly.  The simple crawler that we are about to make will use the sitemap for a list of links to navigate (and ultimately put into cache).  While this can work with other "sitemaps" it has only been tested on the sitemap generated by XML Sitemap.  Okay, that being said, let's dive into the code a bit.

Start out by making a new script file which we will execute to warm our caches.  Some common places you may think of saving this script are /usr/local/bin or an equivalent location that is accessible (unless you purposely do not want others to access the cache warming script).  We will run with the assumption that you want this command to be accessible to all, and will save it in /usr/local/bin.  To start, create the new script file:

vim /usr/local/bin/

Then in that file, we want to have a setup similar to the following code:

#!/bin/bash
# Set URL to your domain, with no protocol (no http:// or https://)
URL="yourwebsite.com"
wget --quiet "https://$URL/sitemap.xml" --no-cache --output-document - | egrep -o "https://$URL[^<]+" | while read line; do
   time curl -A 'Cache Warmer' -s -L "$line" >/dev/null 2>&1
   echo "$line"
done
In this script, there are two things you want to pay attention to.  First, make sure you set the URL at the top to your URL (no protocol like http or https should be in the URL), then make sure the wget line says https or http depending on your website.  As a side note, if you are still running strictly http or mixed http/https, you should look into upgrading to https everywhere.  Google has recently said sites that exhibit non-https behavior will be flagged and likely lose ranking in the SERPs.  This would be bad for many reasons, the primary one being lost visibility.  For more information on setting up a high performance system on CentOS 7, check out the topic on Installing Varnish, Pound, Apache, PHP-FPM, Percona, and Letsencrypt SSL as a complete bundle.

Getting back to our script: it loads our sitemap.xml file, then looks for any string beginning with https://$URL and grabs the entire URL.  It then loops over the results (see while read line; do) and, for each URL in the sitemap, issues a curl command to request that URL.  This in turn loads the page back into cache, and the cache is now warmed!  Be sure to save the file, then mark it as executable.  To do so, issue the following command:
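To see what the egrep stage of the pipeline extracts, you can feed it a sitemap-shaped snippet by hand.  The URLs below are made-up placeholders; a real XML Sitemap wraps each URL in <loc> tags like this:

```shell
# Simulate the sitemap fetch with a hand-written fragment and run the
# same egrep pattern the script uses to pull out the URLs.
URL="example.com"
sitemap='<url><loc>https://example.com/</loc></url><url><loc>https://example.com/about</loc></url>'
urls=$(echo "$sitemap" | egrep -o "https://$URL[^<]+")
echo "$urls"
```

Each match stops at the closing < of the <loc> tag, so you get one clean URL per line, ready for the while loop to feed into curl.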

chmod +x /usr/local/bin/

After making it executable, try running it to make sure it works.  Running this script, you will see it gives you the time taken to load each page, and the URL of each page (see: echo $line).  This is helpful when running the script by hand, but to keep our caches warmed we will likely want to run this script from cron.  So if the output of the script looks okay, we need to edit the crontab and add the script like this:

crontab -e

Now once we have the crontab open, we want to add the following line:

15,45        *       *       *       *       /usr/local/bin/  >/dev/null 2>&1

This particular cron task will run at 15 minutes past the hour and 45 minutes past the hour; in other words, every 30 minutes.  The very first time it runs, it will take some time, as we are putting all the content of our website (that is on the sitemap, anyway) into cache.  The second time it runs, there should be almost no time taken at all, and it will power through all the URLs quickly.  The reason for this is that each page is already in cache, and the curl command is only grabbing a cached copy.  The server load will be a pinch higher because of the surge of URLs being hit, but if you are using something like Varnish, it's designed for thousands of connections at any given time, so this script will not really cause many issues.
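As a side note, the same every-30-minutes schedule can also be written with cron's step syntax.  Assuming the script were saved as warm-cache.sh (a hypothetical name; use whatever you named yours), an equivalent crontab entry would be:

```
*/30 * * * * /usr/local/bin/warm-cache.sh >/dev/null 2>&1
```

The */30 form fires at :00 and :30 rather than :15 and :45, but the interval between runs is the same.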

The Gotcha(s)

  • This assumes you are using Drupal 7.  Without Drupal 7, you will need to find other ways to expire your cache, OR use purge with Drupal 8. :)
  • This assumes you are using the XML Sitemap module.  While it hasn't been tested with other sitemaps, it should work fine, as it's only pulling out URLs to crawl.
  • This assumes you've setup the XML Sitemap correctly.  Remember, if it's not in the sitemap, the crawler script will have no idea the URL exists.
  • This assumes you have the patience to wait for the script to run the first time, as none of your pages will initially be cached until you run the script.  Also, if you have a really low cache time on a page, this script may actually cause a Denial of Service because you are literally hitting every page you publicly show in rapid fire succession.  If you're finding your caches are not staying "warmed" you should make sure that your cache control headers are setting a time (in seconds) that is reasonable.  Too low of a time in cache, and this script will be loading everything into cache on every call - again leading to a potential Denial of Service issue.