What is a site schema, why am I using it and why should you too?

My last post about letting a robot update your website was based on some of the questions I got presenting the talk "Less human interaction, more security?" last year at Drupalcon North America. The talk is showing how we can discover and deploy updates to our Drupal websites automatically, using tools like violinist.io and Gitlab CI. In addition to the feedback about automation being scary, the most asked question I get is around database updates. What about database updates? How do we handle them? How can they be done correct automatically? In this blog post I will outline my approach, why I use this approach, and why you might consider doing the same.

The problem

I use violinist.io to update all of my PHP projects (both at work and personal, including this blog). So let's say I receive an automated pull request from violinist.io with an update to Drupal core. Maybe this update has this piece of new code in it:

/**
 * Clear caches due to behavior change in DefaultPluginManager.
 */
function system_update_8201() {
  // Empty update to cause a cache rebuild.
}

As we can see, this database update is actually empty, and only exists to make sure the cache is being cleared. Would this be OK to deploy automatically? For sure, no problem at all.

Let's look at something different. Let's say I receive an automated pull request containing this update:

/**
 * Notes on update for multilingual sites.
 */
function layout_builder_restrictions_update_8210() {
  $moduleHandler = \Drupal::service('module_handler');
  if ($moduleHandler->moduleExists('locale')) {
    $message = t("Please note: since your site uses the Locale module, you will likely need to manually resave each Layout Builder Restriction entity configuration, due to new code that makes Layout Builder Restrictions' configuration multilingual compatible. For more information, see <a href='https://www.drupal.org/project/layout_builder_restrictions/releases/8.x-2.6'>Layout Builder Restrictions version 2.6 release notes</a>.");
    \Drupal::logger('layout_builder_restrictions')->warning($message);
    return $message;
  }
}

This database update is taken from a contributed module, and serves as an example of an update hook that would literally require human interaction. It says that you will likely need to resave and re-export your configuration. So trying to apply that database update automatically would most likely work great, but our configuration will now possibly be wrong since the automated update only contain changes to composer files.

So the problem is that deploying database updates introduces unknown side effects. They might be harmless, they might not.

Another problem is that it's hard to even know if an update contains a database update. A diff of a typical dependency update would probably only show something like this:

},
         {
             "name": "drupal/core",
-            "version": "9.3.2",
+            "version": "9.3.3",
             "source": {
                 "type": "git",
                 "url": "https://github.com/drupal/core.git",
-                "reference": "6c9ba6b6314550e7efb8f5f4e2a40f54cfd6aee1"
+                "reference": "a9bd68be9a4e39724ea555f8040114759a8faf7f"
             },
             "dist": {
                 "type": "zip",
-                "url": "https://api.github.com/repos/drupal/core/zipball/6c9ba6b6314550e7efb8f5f4e2a40f54cfd6aee1",
-                "reference": "6c9ba6b6314550e7efb8f5f4e2a40f54cfd6aee1",
+                "url": "https://api.github.com/repos/drupal/core/zipball/a9bd68be9a4e39724ea555f8040114759a8faf7f",
+                "reference": "a9bd68be9a4e39724ea555f8040114759a8faf7f",
                 "shasum": ""
             },
             "require": {

Can you tell from that diff if there was a database update or not just by looking at it?

The toolbox

So the question then is: How do we identify these database updates and determine if we can apply them safely and automatically? The answer for me is: we don't. Instead we just always assume the worst, and err on the side of caution. So let's instead look at how we can identify a pending database update in the first place. When detected, we want to avoid the update being deployed automatically.

To do that, I am going to use a technique we can combine with detecting another potentially disruptive trait of dependency updates: Changed files. This is not quite as common as database updates, and hopefully you mostly see this from either Drupal core or from a distribution you are using. Sometimes an upgrade can change some of your project files, for example robots.txt or .gitignore. In most cases these changes are useful and something you want to commit to your codebase. But just like database updates, we have no reliable way of determining what is useful and what is disruptive without human interaction. Which is why I am an advocate for failing a test suite when the working tree is not clean after running integration tests. This will then take care of the case of avoiding deploying an update that changes other files than composer.json / composer.lock. Now let's see how it can help us with database updates.

The characterisation testing approach

Enter site schema. A site schema is a big list of the "current" schema of your production website. The idea of a site schema borrows from a testing paradigm called characterisation testing, (or golden-master testing, snapshot testing and probably other names). It contains a list of all the currently applied hook_update_N updates, in addition to all the currently applied post updates. The site schema package is a drush command that makes it possible to produce a file for the site that represents your current working state. This in turn makes it possible to commit a file containing this state. Which in turn makes it easy to see which automated updates contain database updates. All we have to do is dump the site schema as part of our tests, and it would produce a diff in the committed site schema file, failing the tests, showing the dependency update contains a database update. Then when we want to deploy a dependency update that includes a database update, we also would have to commit an update to our site schema file. It's a way of manually approving the unknown side effect.

So that sums up the why and how. We have looked at why applying database updates automatically on automated dependency updates can be harmful. We have touched on why it's hard to know whether a given database update can be harmful or not. Finally, we have looked at one way of detecting database updates, so we can avoid deploying these unknown results automatically.

I also hope this has inspired some ideas on how you can do similar things in your own projects so you can have more automation in both dependency updates and deployment. If you need more inspiration, the next blog post will include examples of actual implementations of this in CI workflows.

What I can say personally is that using this approach has greatly improved my confidence in deploying automated dependency updates that I am continuously getting from violinist.io. Are you still not convinced about automating boring maintenance tasks? Still think it sounds scary? Please let me know in the comments ✌️

I guess all that remains now is to finish off with an animated gif called "scary"!