Navigating the Chaos - A Holistic Approach to Incident Management by Hila Fish
AI Summary
Summary: Incident Management Talk by Hila Fish
- Introduction
- Speaker: Hila Fish, Senior DevOps Engineer with 15+ years of experience.
- Topic: Why incident management matters, given that system failures are inevitable.
- Agenda
- Mindset for managing incidents.
- Structured process for incident management.
- Traits necessary for efficient incident management.
- Proactive measures for incident preparedness.
- What is Incident Management?
- A set of actions to resolve critical incidents.
- Involves detection, communication, responsibility assignment, investigation, response, and resolution.
- Mindset
- Shift from reactive to proactive handling.
- Understand the business impact of incidents.
- Prioritize incidents based on potential loss of revenue, customers, data, and reputation.
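One way to picture impact-based prioritization is the minimal Python sketch below. It is not from the talk: the `Impact` class, the 0-3 rating scale, and the P1-P4 labels are all hypothetical, chosen only to show an incident being ranked by its worst potential loss.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    """Hypothetical 0-3 ratings for each kind of potential loss (0 = none, 3 = severe)."""
    revenue: int = 0
    customers: int = 0
    data: int = 0
    reputation: int = 0


def priority(impact: Impact) -> str:
    """Map the worst single rating to a priority label (an assumed scheme)."""
    worst = max(impact.revenue, impact.customers, impact.data, impact.reputation)
    # Ratings are assumed to stay within 0-3; anything else raises KeyError.
    return {0: "P4", 1: "P3", 2: "P2", 3: "P1"}[worst]


# Made-up incidents, for illustration only.
incidents = {
    "slow internal dashboard": Impact(reputation=1),
    "checkout errors": Impact(revenue=3, customers=2),
}

# "P1" sorts before "P3", so the most urgent incident comes first.
ordered = sorted(incidents, key=lambda name: priority(incidents[name]))
```

Taking the worst rating, rather than a sum, keeps a single severe loss (for example, data loss) from being diluted by low ratings elsewhere; a team could just as reasonably choose a weighted score.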
- Structured Process
- Business mindset: Understand the “why” behind actions.
- Structured process leads to incident prevention, reduced resolution time, cost reduction, and preservation of business and reputation.
- Five Pillars of Incident Management
- Identify and Categorize
- Assess the full extent and business impact.
- Determine urgency and proper notification channels.
- Notify and Escalate
- Inform relevant stakeholders (customers, internal teams, management).
- Decide if escalation to other teams is necessary.
- Investigate and Diagnose
- Focus on relevant information for resolution.
- Identify and understand the root cause.
- Resolve and Recover
- Choose the best remediation step.
- Address any action items post-resolution.
- Review and Learn
- Notify stakeholders upon incident closure.
- Review and update alerts and runbooks.
- Determine if a postmortem is needed.
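The five pillars above can be sketched as an ordered checklist. This is an illustrative Python sketch, not something shown in the talk: the `Stage`, `CHECKLIST`, and `Incident` names are hypothetical, and enforcing a strict stage order is an assumption made for the example.

```python
from enum import IntEnum


class Stage(IntEnum):
    IDENTIFY_AND_CATEGORIZE = 1
    NOTIFY_AND_ESCALATE = 2
    INVESTIGATE_AND_DIAGNOSE = 3
    RESOLVE_AND_RECOVER = 4
    REVIEW_AND_LEARN = 5


# Checklist items taken from the summary's bullets for each pillar.
CHECKLIST = {
    Stage.IDENTIFY_AND_CATEGORIZE: [
        "Assess the full extent and business impact",
        "Determine urgency and proper notification channels",
    ],
    Stage.NOTIFY_AND_ESCALATE: [
        "Inform relevant stakeholders",
        "Decide if escalation to other teams is necessary",
    ],
    Stage.INVESTIGATE_AND_DIAGNOSE: [
        "Focus on relevant information for resolution",
        "Identify and understand the root cause",
    ],
    Stage.RESOLVE_AND_RECOVER: [
        "Choose the best remediation step",
        "Address any action items post-resolution",
    ],
    Stage.REVIEW_AND_LEARN: [
        "Notify stakeholders upon incident closure",
        "Review and update alerts and runbooks",
        "Determine if a postmortem is needed",
    ],
}


class Incident:
    """Tracks which pillar an incident is in and which items are done."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.stage = Stage.IDENTIFY_AND_CATEGORIZE
        self.done: set[str] = set()

    def complete(self, item: str) -> None:
        """Mark one checklist item of the current stage as done."""
        if item not in CHECKLIST[self.stage]:
            raise ValueError(f"{item!r} is not part of {self.stage.name}")
        self.done.add(item)

    def advance(self) -> Stage:
        """Move to the next stage once every item of the current one is done."""
        missing = [i for i in CHECKLIST[self.stage] if i not in self.done]
        if missing:
            raise RuntimeError(f"Unfinished items in {self.stage.name}: {missing}")
        if self.stage is not Stage.REVIEW_AND_LEARN:
            self.stage = Stage(self.stage + 1)
        return self.stage


# Usage: finish the first pillar, then move on to notification.
incident = Incident("API latency spike")
for item in CHECKLIST[incident.stage]:
    incident.complete(item)
incident.advance()
```

In practice stages often overlap (notification continues during investigation, for instance), so a real tool would likely be less strict than this sketch.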
- Traits of an Incident Manager
- Think on your feet, distinguish relevant from irrelevant information, operate under pressure, work methodically, ask for help, keep a problem-solving mindset, take ownership, communicate well, lead without authority, and care.
- Proactive Measures
- Post-incident: Write shift handoffs and postmortem notes, create new tasks, modify alerts, and update runbooks.
- Day-to-day: Read shift handoffs, know escalation contacts, understand system architecture, learn application flows, be aware of team tasks, be a go-to person.
- Conclusion
- Emphasize the business mindset, follow structured processes, develop necessary traits, and be proactive to prepare for and potentially prevent incidents.
- Q&A
- Implementing the process in a team: Start with a workshop, follow up with documentation, use incident runbooks, and integrate reminders into tools like PagerDuty.