IT Operations  ·  Uptime

If it pages you,
it's already too late.

Monitoring, infrastructure-as-code, deploys, backups, patching, capacity. The dull work that decides whether your team ships or fights fires.

Caught upstream. Documented in writing. Run from git.

99.97%

Median uptime across managed fleets

Last 12 months · 14 customer environments
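To put the percentage in minutes: a rough sketch of the arithmetic, assuming a 365-day window and a simple "minutes down over minutes in the period" definition of uptime. The function name is illustrative, not part of any tool we run.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 365) -> float:
    """Minutes of downtime allowed over `days` at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    total_minutes = days * 24 * 60
    # The unavailable share of the period, expressed in minutes.
    return total_minutes * (100 - availability_pct) / 100
```

Under those assumptions, 99.97% over a year works out to roughly 158 minutes of downtime, a little over two and a half hours.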

22 min

Median restore time on tested backups

Drilled monthly, signed off each time

0

Outages caused by missed patches

Calendar year to date

The view from a real NOC, mid-incident

Ten services. Live status. One incident.

Tap any node to see its dependencies light up. The amber one is auto-scaling. The red one has a fallback already routing traffic.

NOC · Live · 07:50:29
8 healthy · 1 watching · 1 incident

What the on-call rotation covers

Six disciplines. One on-call rotation.

01 · MONITORING

24/7 Monitoring

Prometheus, Grafana, alerting via PagerDuty or WhatsApp. We know it broke before you do, and we know why.

02 · INFRASTRUCTURE

Infrastructure as Code

Terraform, Pulumi, Ansible. Your whole stack defined in git. Rebuild it on a new cloud in an afternoon.

03 · CI/CD

Build & Deploy Pipelines

GitHub Actions, GitLab CI, Jenkins where it has to live. Test, build, deploy, rollback. Every push.

04 · BACKUPS

Backups & Disaster Recovery

3-2-1 backups. Tested restores. Documented RTO and RPO. Restore drills run monthly so the surprises happen in rehearsal.
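The 3-2-1 rule is simple enough to check in code: at least three copies, on at least two kinds of media, with at least one offsite. A minimal sketch, assuming a hypothetical `BackupCopy` record; the field names and labels are illustrative, not taken from any backup tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str  # hypothetical label, e.g. "primary-dc"
    media: str     # e.g. "disk", "object-storage", "tape"
    offsite: bool  # True if stored away from the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """At least 3 copies, on at least 2 media types, at least 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )
```

A check like this only proves the copies exist. Whether they restore is what the monthly drill is for.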

05 · PATCHING

Patching & Hardening

Monthly maintenance windows. CIS benchmarks applied. Critical CVEs patched within 48 hours.
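A 48-hour window is easy to track once the start of the clock is agreed. The sketch below assumes the clock starts when the CVE is disclosed, which is our assumption for illustration; an engagement may define it differently. Function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed SLA for critical CVEs, measured from disclosure.
CRITICAL_SLA = timedelta(hours=48)


def patch_deadline(disclosed_at: datetime) -> datetime:
    """Latest moment a critical patch can land and still meet the SLA."""
    return disclosed_at + CRITICAL_SLA


def is_overdue(
    disclosed_at: datetime,
    now: datetime,
    patched_at: Optional[datetime] = None,
) -> bool:
    """True if the patch landed late, or has not landed and the deadline passed."""
    deadline = patch_deadline(disclosed_at)
    if patched_at is not None:
        return patched_at > deadline
    return now > deadline
```

Using timezone-aware datetimes (for example `datetime.now(timezone.utc)`) avoids off-by-hours surprises when fleets span regions.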

06 · CAPACITY

Capacity & Cost

Autoscaling rules that save money. Spot instances where they fit. Quarterly cost reviews to kill what nobody uses.

Three incident reports we wrote

Where stacks were on fire. Then weren't.

30 servers, no observability

P1 INCIDENT

State before

No central monitoring, so outages were discovered through customer complaints. A senior engineer was paged four nights a week. Root causes stayed elusive.

State after

Prometheus + Grafana rolled out. Alert routing to on-call rotation. Pages dropped 70% in the first month. Engineers slept.

Runbook · Observability rollout · 8 weeks · 3 engineers

Deploy fear

P2 INCIDENT

State before

Releases happened once a fortnight, after-hours, with rollback meaning a database restore. Engineers refused to touch prod on Fridays.

State after

Blue-green deploys via GitHub Actions. Database migrations gated. Eleven safe deploys in week one. Friday releases became boring.

Runbook · CI/CD modernisation · 4 weeks · 2 engineers
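The core of a blue-green deploy is small: keep two environments, send traffic to one, and only flip when the idle one passes a health check. The sketch below is a toy model of that idea, not the pipeline from this engagement; `BlueGreenRouter` and the `healthy` callback are hypothetical names.

```python
from typing import Callable


class BlueGreenRouter:
    """Tracks which of two environments receives live traffic."""

    def __init__(self, live: str = "blue") -> None:
        if live not in ("blue", "green"):
            raise ValueError("live must be 'blue' or 'green'")
        self.live = live

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def promote(self, healthy: Callable[[str], bool]) -> bool:
        """Flip traffic to the idle environment only if it passes the check."""
        candidate = self.idle
        if not healthy(candidate):
            return False  # traffic stays where it is
        self.live = candidate
        return True

    def rollback(self) -> None:
        """Rollback is just flipping back to the previous environment."""
        self.live = self.idle
```

Because the previous environment stays up, a rollback is a traffic switch. Database changes are not covered by the switch, which is why migrations are gated separately.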

Backup theatre

P1 INCIDENT

State before

Client paid for offsite backups. Quarterly DR test revealed nothing had been restorable for 14 months. Backup job had been silently failing.

State after

Switched to immutable backups + monthly restore drills. Restore time now 22 minutes, tested, signed off.

Runbook · Backup + DR overhaul · 6 weeks · 2 engineers

The tools the on-call team trusts at 3am

What's on the control panel.

Open source by default. Your repo, your cloud, your data.

Observability

Prometheus · Grafana · Loki · Tempo · Alertmanager. OpenTelemetry across services. Logs, metrics, traces in one pane.

Config + IaC

Terraform · Pulumi · Ansible · Helm · Kustomize. Every change reviewed in a PR. Every rollback is git revert.

Containers

Docker · Kubernetes (k3s, EKS, GKE) · Nomad when k8s is overkill. ArgoCD for GitOps deploys.

Cloud + On-prem

AWS · GCP · Hetzner · Azure · on-prem VMware · Proxmox. We meet the workload where it lives.

Backups + DR

Restic · Borg · Velero · cloud-native snapshots. Encrypted, immutable, tested. RTO and RPO in writing.

Runbooks

Every alert links to a runbook. Every runbook tested. Lives in the same repo as the code, where engineers find it.
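"Every alert links to a runbook" can be enforced in CI. A minimal sketch, assuming alert rules follow the Prometheus rule-file layout (groups containing rules) already parsed into Python dicts, and assuming the link lives in a `runbook_url` annotation. That annotation name is a common convention, not something Prometheus requires.

```python
def alerts_missing_runbook(rule_groups: list[dict]) -> list[str]:
    """Return names of alerting rules that lack a non-empty runbook_url annotation."""
    missing = []
    for group in rule_groups:
        for rule in group.get("rules") or []:
            name = rule.get("alert")
            if name is None:
                continue  # recording rules have no "alert" key
            annotations = rule.get("annotations") or {}
            url = annotations.get("runbook_url") or ""
            if not str(url).strip():
                missing.append(name)
    return missing
```

Failing the build when this list is non-empty keeps new alerts from shipping without a runbook.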

What an engagement looks like, week by week

Five steps. Then you own it.

stage 01

Inventory

Map every system, every dependency, every owner. The first deliverable is a diagram you can hand to a new hire so they understand the stack.

stage 02

Observe

Roll out monitoring first. We need a baseline before we change anything. Two weeks of data tells the truth that 'it feels slow' doesn't.

stage 03

Stabilise

Top three pain points fixed. Pager noise reduced. Deploy pipeline trusted. The team stops dreading prod work.

stage 04

Automate

Repetitive ops moved to code. Backups, scaling, certificate rotation, OS patching. Humans for judgement, code for the rest.

stage 05

Handover

Runbooks, dashboards, repos, credentials, on-call rotation. You can keep us, you can fire us, you can hire your own. The system runs either way.

Quiet pagers.
Confident deploys.

30-minute call. We audit the current stack, find the three things bleeding the team, and give you a plan.