sev0.help

What We Do

sev0.help provides 'on-call as a service': subscription-based, 24/7 access to expert Site Reliability Engineers (SREs) when your team needs help with major production issues.

Get access to world-class incident management and distributed systems troubleshooting without budgeting for full-time SREs!

24/7 On-Call

An experienced Site Reliability Engineer will join your chatroom, conference bridge, or Zoom call within 15 minutes of being paged by your team.

We're here when you need us!

Incident Management

We will serve as incident commander and guide your issue to resolution using the best practices of mature tech companies.

No production access necessary.

Live Troubleshooting

Share your screen! We will provide live debugging support. We have decades of experience troubleshooting distributed systems at the largest technology firms!

Linux internals do NOT scare us.

Writeup & Postmortem

We will document the incident and provide recommendations for preventing recurrence. We facilitate blameless postmortems upon request!

We help you learn from failure.

sev0.help is led by Amin Astaneh, an SRE veteran who has driven successful reliability transformations at companies ranging from small to hyperscale and has spent many years on call for business-critical production systems.

Testimonials

"I spent many years working side by side in the distributed systems trenches with Amin. Thanks to a decade and a half of experience, his technical skills and instincts are sharp as a knife, and I am always amazed by how he can jump into a fire and quickly hone in on the problem. But what really stands out about Amin is the calm, supportive, and encouraging demeanor he maintains even during the worst incidents, and how he enables a sense of focus and organization in those around him to conduct troubleshooting and mitigation tactics in an organized way, despite the chaos. There are few other people I'd want at the helm during a big incident."
-Mariano Asselborn, Senior Release Engineer, HashiCorp

"Is your pager rotation a mess? Are your on call engineers burnt out? Do you no longer trust your systems? Do you have haunted graveyards you are afraid to touch? Amin has been there, and can help you move forward to a better place."
-Jeff Stewart, Senior Staff Software Engineer, Google

"He (Amin) can be parachuted into a total mess and still find a way to quickly mitigate immediate issues and then propose a plan for incremental, long-term improvements."
-Mikhail Belov, Senior Engineering Manager, Meta

"Amin knows what a healthy on-call and triage process looks like, and can help you get there. He changed the way an entire company did incident response."
-Jen Norris, Senior Software Engineer

Frequently Asked Questions

What is Incident Management?

Incident management is the end-to-end business process of addressing an outage, service disruption, or other major incident, from the moment it begins to its full resolution.

It ensures that the organization and its customers are kept fully aware of the incident's impact, duration, and status; that the right people are involved in troubleshooting and fixing the problem; that changes to production are made in an orderly fashion and properly tracked; and that the incident's duration and impact are minimized as quickly as possible.