Chapter 8 — Release Cadence, Rollback, and the 2 a.m. Playbook

Key insight

Rollback is not a feature; it is a property of the release model. If your release pattern requires a meeting to decide how to roll back, you have not really designed for rollback. The goal is a single command that returns the system to a known-good digest.

The shape of a release day

For a signed-image appliance shipping to one or a handful of customers, a release day is divided into three short phases. None should take longer than its allotted time on a healthy project; if any phase routinely overruns, it is usually a sign of a missing automation rather than a missing process.

Phase	Operator effort	Wall-clock time
Pre-cut — pre-flight, version-bump, changelog	~10 min	~10 min
Cut — tag, wait for CI, merge pin-file PR, tag deploy repo	~5 min	~10 min
Post-cut — smoke test on staging, customer notification	~10 min	~15 min

The total operator effort is roughly twenty-five minutes. The wall-clock total, including CI waits, is roughly forty-five minutes. Anything longer is worth investigating.

The pre-cut checklist

Run the pre-flight hygiene script (chapter 7). If it fails, stop here.
Confirm the source-repository main branch is green — tests are passing, dependency alerts are zero or accepted, code-scanning findings are zero or accepted.
Bump the version in whatever file the project uses as canonical. The version string should appear in at most one place; everything else reads from it.
Update the changelog. Add a new entry with the version, date, and a customer-readable summary. The summary is the same text you will paste into the release-note template at the end.
Commit and push the version-bump and changelog. Wait for CI to pass.
Smoke test the latest build of main against your own staging environment. This is the last chance to catch a regression before it leaves your hands.

The cut itself

Tag the source repository with the new version. An annotated tag, with a one-sentence message that matches the changelog summary.
Push the tag. This triggers the release pipeline.
Watch the pipeline. The expected duration is the project’s known signed-build time, typically six to ten minutes. If the pipeline fails, the failure is almost always one of the same handful of issues (chapter 6); resolve and re-run.
Review the auto-generated pull request on the bootstrap repository. Confirm the pin file has the correct version, the correct digest, and the correct signing-identity expression. Confirm the changelog entry matches what you wrote in pre-cut.
Merge the pull request.
Tag the bootstrap repository with the matching version.

At this point the release exists. Customers running the install script would now see the new version. They have not yet been told.

The smoke test

The release is not real until somebody has installed it. The smoke test runs the new release end-to-end in a clean environment you control. The script is short:

Clone the bootstrap repository at the new tag, in a fresh directory.
Run the install script against a staging tenant.
Hit the running agent’s /version endpoint. Confirm the version, digest, and source revision are the values the release intended.
Run two or three known-good operations through the agent — whichever operations exercise the largest fraction of the integration surface in the smallest amount of time.
Roll back to the previous version using the documented rollback procedure (below). Confirm /version reports the previous version. Roll forward again.

The smoke test should take ninety seconds end-to-end on a healthy project. If it takes substantially longer, the install script is doing too much work that should have been pushed into the image. If it fails, customers are not notified; the release is treated as a failed cut and revisited.

The rollback procedure

Rollback for a signed-image appliance is structurally simple, which is the entire point of the model. The procedure has three steps:

Check out the previous tag of the bootstrap repository.
Re-run the install script.
Confirm /version reports the previous version.

The reason this works is that the pin file at the previous tag pins to the previous image digest. The install script verifies the signature, pulls the digest, and the container runtime rolls the previous version. No build is required; no source is touched; the rollback target is exactly what shipped before.

The discipline that makes this work is that every release ships an installable previous-tag. Never delete the previous tag of the bootstrap repository. Never overwrite a published image digest. Never publish a release that depends on a previous release being upgraded out from under it.

For customers running the agent on their own infrastructure, the rollback procedure is exactly the same. The customer’s installation does not depend on any vendor-side state at the moment of rollback; everything they need is in the previous tag of the bootstrap repository and the corresponding image in the registry.

What the customer needs to know — and when

The customer-facing release-note is small: three sections of two or three sentences each. The shape:

vX.Y.Z — release-date

What changed
— One or two sentences in customer-friendly language. No code references.
— If a behaviour changed, name the behaviour, not the code path.

What you need to do
— If nothing, say "nothing - run `git pull` and re-run the install script
  at your next convenience".
— If a configuration change is required, link to the exact change.
— If a downtime window is required, name the duration.

If something goes wrong
— The rollback command (one line).
— The contact for support, with a response-time expectation.

The release-note lives on the bootstrap repository’s release page, alongside the SBOM and any other release artefacts. The customer-facing email points at the release page, does not duplicate its content.

The 2 a.m. playbook

Eventually a release will be the proximate cause of a customer incident. The 2 a.m. playbook is the small set of pages your on-call engineer reaches for in that case.

The playbook has four steps. Each is one paragraph in the bootstrap repository’s troubleshooting guide.

Step 1, identify what is running. Hit /version on the agent. Capture the response. The response uniquely identifies the binary; with it, every subsequent step is precise rather than approximate.

Step 2, decide whether to roll back. If the incident’s symptoms match a known issue in the new release’s changelog, roll back. If the symptoms are new and severe, roll back. If the symptoms are new but tolerable, capture diagnostics and stay on the current version while you investigate.

Step 3, roll back (if decided). The procedure above. Two minutes, including the smoke test against the previous version.

Step 4, file a release-incident report. The report names what was deployed, what was observed, what was rolled back to, and what the root-cause hypothesis is. The report ships in the source repository, in the docs-internal folder, with a copy attached to the next post-mortem meeting.

Versioning policy

The choice of versioning scheme matters less than the choice to communicate it. Semantic versioning is the most widely understood and is appropriate for most agent projects. Calendar versioning works for projects with a strict regular release cadence.

Whichever scheme you use, document it in the bootstrap repository’s README, with a one-paragraph explanation of what a major-, minor-, and patch-level bump implies. Customers reading the changelog should be able to decide, in fifteen seconds, whether the new release is one they can install during business hours or one they need to schedule a maintenance window for.

Deprecation

Old releases do not stay supported forever. The policy that works for tenant-deployed agents in practice is “the last two minor versions are supported; older versions can be installed but receive no further fixes.” The policy lives in the bootstrap repository, alongside the versioning policy.

The install script should warn the operator if they are installing a version more than two minor versions behind. The warning is non-fatal — the install proceeds — but it appears in the operator’s terminal and in the install log, so a future customer audit can see whether the customer was warned.

Conclusion

Release cadence is the part of the discipline that is most visible to the customer. Architectural choices in the earlier chapters set the ceiling; operator habits in this chapter decide where, within that ceiling, your project actually lands. The patterns are not exotic. They are short, written down, and run every time. The accumulation of small disciplines, over many releases, is what separates an agent that customers trust to upgrade in production from one they only ever install once.

References & further reading

Semantic Versioning specification. semver.org
Calendar Versioning specification. calver.org
Keep a Changelog — a small, widely-adopted format for the customer-facing changelog. keepachangelog.com
NIST Secure Software Development Framework (SP 800-218), “Respond to Vulnerabilities” family of practices — the wider context for the deprecation and rollback policies above. csrc.nist.gov/Projects/ssdf