If it pages you,
it's already too late.
Monitoring, infrastructure-as-code, deploys, backups, patching, capacity. The dull work that decides whether your team ships or fights fires.
Caught upstream. Documented in writing. Run from git.
99.97%
Median uptime across managed fleets
Last 12 months · 14 customer environments
22 min
Median restore time on tested backups
Drilled monthly, signed off each time
0
Outages caused by missed patches
Calendar year to date
The view from a real NOC, mid-incident
Ten services. Live status. One incident.
Tap any node to see its dependencies light up. The amber one is auto-scaling. The red one has a fallback already routing traffic.
What the on-call rotation covers
Six disciplines. One on-call rotation.
Three incident reports we wrote
Where stacks were on fire. Then weren't.
30 servers, no observability
P1 INCIDENT
State before
No central monitoring meant outages were discovered by customer complaints. Senior engineer paged four nights a week. Root causes always elusive.
State after
Prometheus + Grafana rolled out. Alert routing to on-call rotation. Pages dropped 70% in the first month. Engineers slept.
Runbook · Observability rollout · 8 weeks · 3 engineers
Deploy fear
P2 INCIDENT
State before
Releases happened once a fortnight, after-hours, with rollback meaning a database restore. Engineers refused to touch prod on Fridays.
State after
Blue-green deploys via GitHub Actions. Database migrations gated. Eleven safe deploys in week one. Friday releases became boring.
Runbook · CI/CD modernisation · 4 weeks · 2 engineers
Backup theatre
P1 INCIDENT
State before
Client paid for offsite backups. Quarterly DR test revealed nothing had been restorable for 14 months. Backup job had been silently failing.
State after
Switched to immutable backups + monthly restore drills. Restore time now 22 minutes, tested, signed off.
Runbook · Backup + DR overhaul · 6 weeks · 2 engineers
The tools the on-call team trusts at 3am
What's on the control panel.
Open source by default. Your repo, your cloud, your data.
What an engagement looks like, week by week
Five steps. Then you own it.
Inventory
Map every system, every dependency, every owner. The first deliverable is a diagram you can hand to a new hire so they understand the stack.
Observe
Roll out monitoring first. We need a baseline before we change anything. Two weeks of data tells the truth that 'it feels slow' can't.
Stabilise
Top three pain points fixed. Pager noise reduced. Deploy pipeline trusted. The team stops dreading prod work.
Automate
Repetitive ops moved to code. Backups, scaling, certificate rotation, OS patching. Humans for judgement, code for the rest.
Handover
Runbooks, dashboards, repos, credentials, on-call rotation. You can keep us, you can fire us, you can hire your own. The system runs either way.