Summary: Incident Management Talk by Hila Fish

  • Introduction
    • Speaker: Hila Fish, Senior DevOps Engineer with 15+ years of experience.
    • Topic: Why incident management matters, given that system failures are inevitable.
  • Agenda
    • Mindset for managing incidents.
    • Structured process for incident management.
    • Traits necessary for efficient incident management.
    • Proactive measures for incident preparedness.
  • What is Incident Management?
    • A set of actions to resolve critical incidents.
    • Involves detection, communication, responsibility assignment, investigation, response, and resolution.
  • Mindset
    • Shift from reactive to proactive handling.
    • Understand the business impact of incidents.
    • Prioritize incidents based on potential loss of revenue, customers, data, and reputation.
  • Structured Process
    • Business mindset: Understand the “why” behind actions.
    • A structured process helps prevent incidents, shortens resolution time, reduces cost, and protects the business and its reputation.
  • Five Pillars of Incident Management
    1. Identify and Categorize
      • Assess the full extent and business impact.
      • Determine urgency and proper notification channels.
    2. Notify and Escalate
      • Inform relevant stakeholders (customers, internal teams, management).
      • Decide if escalation to other teams is necessary.
    3. Investigate and Diagnose
      • Focus on relevant information for resolution.
      • Identify and understand the root cause.
    4. Resolve and Recover
      • Choose the best remediation step.
      • Address any action items post-resolution.
    5. Review and Learn
      • Notify stakeholders upon incident closure.
      • Review and update alerts and runbooks.
      • Determine if a postmortem is needed.
  • Traits of an Incident Manager
    • Think on your feet, separate relevant information from noise, operate under pressure, work methodically, ask for help, keep a problem-solving mindset, take ownership, communicate well, lead without authority, and care.
  • Proactive Measures
    • Post-incident: Write shift handoffs and postmortem notes, create new tasks, modify alerts, and update runbooks.
    • Day-to-day: Read shift handoffs, know escalation contacts, understand system architecture, learn application flows, be aware of team tasks, be a go-to person.
  • Conclusion
    • Keep a business mindset, follow a structured process, develop the necessary traits, and be proactive in order to prepare for incidents and, where possible, prevent them.
  • Q&A
    • Implementation: Start with a workshop, follow up with documentation, use incident runbooks, and integrate reminders into tools like PagerDuty.
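
The five pillars and the impact-based prioritization described above can be sketched as a small data model. This is only an illustration and is not taken from the talk: the names `Pillar`, `Impact`, and `Incident`, the 0 to 3 rating scale, and the rule that the worst single loss type decides the priority are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Pillar(IntEnum):
    """The five pillars, in the order the summary lists them."""
    IDENTIFY_AND_CATEGORIZE = 1
    NOTIFY_AND_ESCALATE = 2
    INVESTIGATE_AND_DIAGNOSE = 3
    RESOLVE_AND_RECOVER = 4
    REVIEW_AND_LEARN = 5


@dataclass
class Impact:
    """Hypothetical 0-3 ratings for the four loss types named under Mindset."""
    revenue: int = 0
    customers: int = 0
    data: int = 0
    reputation: int = 0

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed 0-3 scale.
        for name in ("revenue", "customers", "data", "reputation"):
            if not 0 <= getattr(self, name) <= 3:
                raise ValueError(f"{name} rating must be between 0 and 3")

    def priority(self) -> str:
        # Assumption: the worst single loss type drives the priority.
        worst = max(self.revenue, self.customers, self.data, self.reputation)
        return ["low", "medium", "high", "critical"][worst]


@dataclass
class Incident:
    title: str
    impact: Impact
    pillar: Pillar = Pillar.IDENTIFY_AND_CATEGORIZE
    log: List[str] = field(default_factory=list)

    def advance(self, note: str) -> None:
        """Record a note for the current pillar, then move to the next one."""
        self.log.append(f"{self.pillar.name}: {note}")
        if self.pillar < Pillar.REVIEW_AND_LEARN:
            self.pillar = Pillar(self.pillar + 1)


# Example usage with made-up incident details.
incident = Incident("Checkout API returning 500s", Impact(revenue=3, customers=2))
print(incident.impact.priority())  # critical
incident.advance("impact assessed, on-call paged")
print(incident.pillar.name)  # NOTIFY_AND_ESCALATE
```

Keeping the pillar as an ordered value makes it easy to see which step an incident is in and to attach notes per step, which lines up with the shift handoffs and postmortem notes mentioned under Proactive Measures. A real team would likely replace the scoring rule with its own severity definitions.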