When Deploys Go Wrong: Building a Rollback Mechanism

Tags: deployment, automation, bash, reliability

The Problem

The original deployment pipeline had one direction: forward. Push code, rsync it to the server, reload Gunicorn. If the new code was broken — a bad import, a missing dependency, a syntax error — the site went down and stayed down until I SSH'd in and fixed it manually. That could mean hours of downtime for a mistake that took seconds to make.

No backups. No validation. No way to undo. The pipeline trusted every push blindly.

What Rollback Actually Means

Rollback isn't just "put the old files back." The site depends on three things working together: the application code, the Python packages in the virtual environment, and the Gunicorn process serving requests. Restoring code without restoring dependencies leaves a mismatch — old code trying to import new packages, or missing packages entirely. All three must be restored as a unit, and the application server must be reloaded only after both code and venv are consistent.

Capturing State Before Deploying

Before the pipeline touches anything, it now captures two snapshots. First, pip freeze records the exact version of every installed package — not what requirements.txt requests, but what's actually installed. The difference matters: requirements.txt might say flask>=3.0, but pip freeze records flask==3.0.2. Only the freeze output is reproducible. This gets saved as requirements.lock.

Second, the entire site directory is copied to a backup location. Both snapshots live in /var/www/backups/, owned by the deployer user. The backup and the lock file represent one consistent, known-good state.
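A minimal sketch of that snapshot step follows. Only /var/www/backups/ and the requirements.lock name come from the setup described here; the site and venv paths, the site/ subdirectory inside the backup location, and the capture_state function name are assumptions for illustration.

```shell
# Assumed locations; only /var/www/backups comes from the post.
SITE_DIR="${SITE_DIR:-/var/www/site}"
VENV_DIR="${VENV_DIR:-/var/www/venv}"
BACKUP_DIR="${BACKUP_DIR:-/var/www/backups}"

capture_state() {
    mkdir -p "$BACKUP_DIR/site" || return 1

    # Record what is actually installed, not what requirements.txt asks for.
    # Write to a temp file first so a failed freeze never truncates a good lock.
    "$VENV_DIR/bin/pip" freeze > "$BACKUP_DIR/requirements.lock.tmp" || return 1
    mv "$BACKUP_DIR/requirements.lock.tmp" "$BACKUP_DIR/requirements.lock" || return 1

    # Mirror the site directory. The trailing slashes copy the directory's
    # contents, and --delete keeps the backup from accumulating stale files.
    rsync -a --delete "$SITE_DIR/" "$BACKUP_DIR/site/"
}
```

Calling capture_state as the first step of the hook, and aborting the deploy if it returns non-zero, keeps the lock file and the code copy in step with each other.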

Four Layers of Validation

After deploying new code and installing dependencies, the pipeline validates before declaring success.

The first check is a Python import test. A small script runs from app import app inside the virtual environment. If the Flask application can't be imported — missing module, syntax error, circular import — this catches it before Gunicorn ever sees the new code. This alone catches the majority of broken deploys.
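The import check can be as small as one command. This sketch assumes hypothetical site and venv paths and that the app module is importable from the site directory; check_import is an illustrative name, not necessarily what the real script uses.

```shell
SITE_DIR="${SITE_DIR:-/var/www/site}"   # assumed site location
VENV_DIR="${VENV_DIR:-/var/www/venv}"   # assumed venv location

check_import() {
    # Run in a subshell so the cd does not leak into the caller.
    # Using the venv's own interpreter means the import sees exactly
    # the packages Gunicorn will see.
    ( cd "$SITE_DIR" && "$VENV_DIR/bin/python" -c 'from app import app' )
}
```

The function returns Python's exit status, so any ImportError or SyntaxError raised during import shows up as a non-zero return.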

The second check confirms Gunicorn is still alive after reloading. A reload with fatally broken code can crash workers. The pipeline waits two seconds, then checks systemctl is-active. If Gunicorn died, the code killed it.
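As a rough sketch, the liveness check is a pause followed by one systemctl call. The unit name gunicorn and the SETTLE_SECONDS variable are assumptions; the two-second default matches the wait described above.

```shell
SERVICE="${SERVICE:-gunicorn}"   # assumed systemd unit name

check_service_alive() {
    # Give workers a moment to crash if they are going to.
    sleep "${SETTLE_SECONDS:-2}"
    # is-active exits 0 only when the unit is in the "active" state;
    # --quiet suppresses the state name it would otherwise print.
    systemctl is-active --quiet "$SERVICE"
}
```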

The third check makes an actual HTTP request to the local server and expects a 200 response. The application can be importable and Gunicorn can be running, but the site might still return errors when handling real requests. This verifies the full stack end-to-end: Flask, Jinja2 templates, and the WSGI pipeline.
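One way to express the HTTP check with curl is sketched below. The local address and port are assumptions, since the post doesn't say where Gunicorn listens, and check_http_ok is an illustrative name.

```shell
SITE_URL="${SITE_URL:-http://127.0.0.1:8000/}"   # assumed local address

check_http_ok() {
    local url="$1" status
    # -s: no progress output, -o /dev/null: discard the body,
    # -w '%{http_code}': print only the status code,
    # --max-time: do not hang the deploy on a stuck worker.
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || return 1
    [ "$status" = "200" ]
}
```

Usage would look like check_http_ok "$SITE_URL". Comparing against exactly 200 means redirects and error pages both count as failures.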

The fourth check hits a specific blog post URL to confirm that markdown parsing, YAML frontmatter extraction, and dynamic routing all work. This is a smoke test — not exhaustive, but enough to catch breakage in the content pipeline.

If any of the first three layers fails, rollback is triggered automatically.
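The way the layers chain together might look like the sketch below. It assumes check_import, check_service_alive, and check_http_ok are helper functions defined elsewhere in the hook that return 0 on success. The rollback script path, the URLs, and the choice to treat the fourth layer as warn-only are also assumptions; the text above only says the first three layers trigger rollback.

```shell
SERVICE="${SERVICE:-gunicorn}"                                      # assumed unit name
SITE_URL="${SITE_URL:-http://127.0.0.1:8000/}"                      # assumed address
POST_URL="${POST_URL:-http://127.0.0.1:8000/blog/example-post}"     # hypothetical post URL
ROLLBACK_SCRIPT="${ROLLBACK_SCRIPT:-/usr/local/bin/rollback.sh}"    # hypothetical path

validate_or_rollback() {
    # Layer 1 runs before the reload so Gunicorn never loads unimportable code.
    # A failed reload is treated like a failed layer.
    if check_import \
        && systemctl reload "$SERVICE" \
        && check_service_alive \
        && check_http_ok "$SITE_URL"; then
        # Layer 4: smoke test, reported but not a rollback trigger in this sketch.
        check_http_ok "$POST_URL" || echo "WARN: blog post smoke test failed" >&2
        return 0
    fi

    echo "validation failed, rolling back" >&2
    "$ROLLBACK_SCRIPT"
    return 1
}
```

The && chain stops at the first failing layer, so a broken import never reaches the reload.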

The Rollback Script

The rollback script is a standalone file. It can be run automatically from the post-receive hook or manually from an SSH session. The separation matters: a script embedded inside the hook can't be called independently when you need a manual recovery at 3am.

The script first checks that the backup files actually exist. Without this preflight check, a rollback from a missing or empty backup would overwrite the broken site with nothing, turning a bad deploy into data loss.
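A preflight along these lines would cover the dangerous cases. The site/ subdirectory name and the preflight function name are assumptions; /var/www/backups/ and requirements.lock are from the setup described earlier.

```shell
BACKUP_DIR="${BACKUP_DIR:-/var/www/backups}"

preflight() {
    # -s: the lock file must exist and be non-empty.
    if [ ! -s "$BACKUP_DIR/requirements.lock" ]; then
        echo "preflight: requirements.lock missing or empty" >&2
        return 1
    fi
    if [ ! -d "$BACKUP_DIR/site" ]; then
        echo "preflight: code backup directory missing" >&2
        return 1
    fi
    # An existing but empty directory is the worst case:
    # rsync --delete from it would wipe the live site.
    if [ -z "$(ls -A "$BACKUP_DIR/site")" ]; then
        echo "preflight: code backup directory is empty" >&2
        return 1
    fi
}
```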

It then restores code via rsync with --delete so that files added by the broken deploy are removed. Permissions and SELinux contexts are reapplied — restored files need the same treatment as freshly deployed ones. The virtual environment is rebuilt from the lock file. Only after both code and dependencies are restored does the script reload Gunicorn.
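The ordering described above can be sketched as a single function. The paths, the unit name, and the chmod mode are assumptions, and the sketch also assumes the venv lives outside the site directory so that rsync --delete doesn't touch it.

```shell
SITE_DIR="${SITE_DIR:-/var/www/site}"          # assumed site location
VENV_DIR="${VENV_DIR:-/var/www/venv}"          # assumed, outside the site dir
BACKUP_DIR="${BACKUP_DIR:-/var/www/backups}"
SERVICE="${SERVICE:-gunicorn}"                 # assumed unit name

restore_state() {
    # 1. Code: --delete removes files the broken deploy added.
    rsync -a --delete "$BACKUP_DIR/site/" "$SITE_DIR/" || return 1

    # 2. Permissions and SELinux contexts (the mode shown is an assumption).
    chmod -R u=rwX,go=rX "$SITE_DIR" || return 1
    restorecon -R "$SITE_DIR" || return 1

    # 3. Dependencies: recreate the venv and install the pinned versions.
    python3 -m venv --clear "$VENV_DIR" || return 1
    "$VENV_DIR/bin/pip" install -r "$BACKUP_DIR/requirements.lock" || return 1

    # 4. Only now, with code and venv consistent, reload the server.
    systemctl reload "$SERVICE"
}
```

Each step returns early on failure, so Gunicorn is never reloaded against a half-restored state.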

If reload fails — because Gunicorn crashed entirely rather than running with bad code — the script falls back to a full restart. If that also fails, it sends a critical alert via ntfy and exits. Some failures require human hands.
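A sketch of that escalation is shown below. The ntfy topic URL is a placeholder and the unit name is assumed; the Priority header and plain-text POST body are standard ntfy publishing features.

```shell
SERVICE="${SERVICE:-gunicorn}"                              # assumed unit name
NTFY_URL="${NTFY_URL:-https://ntfy.sh/my-deploy-alerts}"    # hypothetical topic

reload_or_restart() {
    # A reload only works if the master process is still running.
    systemctl reload "$SERVICE" && return 0

    echo "reload failed, trying a full restart" >&2
    systemctl restart "$SERVICE" && return 0

    # Nothing left to try automatically: page a human and give up.
    curl -s -o /dev/null \
        -H "Priority: urgent" \
        -d "CRITICAL: $SERVICE failed to reload and restart during rollback" \
        "$NTFY_URL"
    return 1
}
```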

After reload, the script validates the rollback by curling the site. A rollback that silently fails is worse than no rollback at all — you'd believe recovery happened while the site is still broken.

Every outcome generates an ntfy notification. Whether the result is success, failure, or critical, there's a record of what happened.
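One possible shape for the notification helper is sketched here. The topic URL is a placeholder, and the mapping from outcome to ntfy priority and tag is an assumption chosen for illustration.

```shell
NTFY_URL="${NTFY_URL:-https://ntfy.sh/my-deploy-alerts}"   # hypothetical topic

# Usage: notify_outcome success|failure|critical "message text"
notify_outcome() {
    local priority tags
    case "$1" in
        success)  priority=default; tags=white_check_mark ;;
        failure)  priority=high;    tags=warning ;;
        critical) priority=urgent;  tags=rotating_light ;;
        *) echo "notify_outcome: unknown outcome '$1'" >&2; return 2 ;;
    esac
    # ntfy reads the message from the POST body and metadata from headers.
    curl -s -o /dev/null \
        -H "Title: Rollback $1" \
        -H "Priority: $priority" \
        -H "Tags: $tags" \
        -d "$2" \
        "$NTFY_URL"
}
```

For example, notify_outcome success "Rollback complete, site returned 200" after the post-rollback curl check passes.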

What This Doesn't Solve

The pipeline still has no tests. Validation catches crashes and import errors, but it can't catch logic bugs — a template that renders but shows the wrong data, a route that returns 200 but with broken HTML. That's a CI/CD problem, not a deployment problem.

There's also no multi-version rollback. The backup holds exactly one previous state. If two bad pushes happen in a row and the first one passes validation, the second deploy's snapshot overwrites the backup with the first bad state, and the original good state is gone. For a solo blog, one backup is an acceptable tradeoff. For a production service with a team, it wouldn't be.

The Lesson

The interesting part wasn't the bash scripting. It was realizing that "rollback" is a coordination problem, not a file copy problem. Code, dependencies, and process state form a unit. Restoring one without the others creates a mismatch that can be harder to debug than the original failure. The mechanism works because it treats all three as a single atomic operation — restore together, validate together, or fail loudly together.