Table of Contents
- Who this guide is for
- What you likely care about
- Before you begin
- Step 1: Use Delivery metrics to identify risky change patterns
- Step 2: Make reliability work visible in Iterations
- Step 3: Use Developer Coaching to spot operational strain
- Step 4: Pair reliability metrics with delivery trends (if configured)
- Step 5: Use gitStream to standardize safe-change patterns
- Recommended operating rhythm
- Recommended next articles
SRE, Infra, Reliability
This guide is for SRE and Infrastructure leaders who need clearer signals on change risk, operational load, and how delivery patterns intersect with reliability outcomes. It focuses on using Metrics → Delivery, Teams → Iterations, Developer Coaching, and gitStream (if enabled), plus reliability/incident metrics where configured.
TL;DR – SRE / Infra / Reliability:
- Use Metrics → Delivery to find flow patterns that increase change risk.
- Use Teams → Iterations (Completed) to track unplanned work and reliability-driven scope shifts.
- Use Developer Coaching to spot workload patterns that signal operational strain.
- If configured, pair incident / reliability metrics with Delivery trends to strengthen your story.
- Use gitStream to standardize safe-change behavior with low noise.
Start here in 15 minutes
- Pick one reliability-critical service or team.
- In Metrics → Delivery, set the time window to the last 4–8 weeks.
- Scan for:
- Spikes in PR size.
- Periods with slower Review or Deploy Time.
- Open Teams → Iterations → Completed for the same team and:
- Estimate how much work was unplanned (operational / incident-driven).
- Write a one-line summary:
“When X happens in delivery, we see more reliability load / incidents.”
- Use that summary to propose one experiment (e.g., smaller PRs or extra review on a service).
Who this guide is for
This path is for people who:
- Own or influence availability, incident response, and change management.
- Need to show how delivery practices affect reliability and operational load.
- Partner with DevEx, Platform, QA/Release, and PMO.
What you likely care about
- Are change patterns increasing reliability risk?
- Is operational work visible and linked to planning, or hidden as “background noise”?
- Where is unplanned reliability work eroding feature capacity?
- Which low-noise standards reduce risk without slowing flow?
Before you begin
- Git integration and key repos are connected.
- Teams, services, and ownership are clear enough to slice metrics by team or area.
- If available, incident / reliability metrics are configured and mapped to teams/services.
- Developer Coaching is enabled for relevant teams (where available).
- gitStream is enabled on at least some reliability-critical repos (if your org uses it).
Step 1: Use Delivery metrics to identify risky change patterns
Goal: Connect reliability issues to concrete delivery behavior.
Where: Metrics → Delivery
- Select a team or service that has seen incidents or reliability concerns.
- Choose a timeframe that includes recent incidents (e.g., last 4–8 weeks).
- Review:
- Cycle Time stage trends (especially Review and Deploy Time).
- PR size patterns and any spikes in large, late changes.
- Signs of rushed deploys just before release cutoffs.
- Mark 1–2 concrete risk signals, such as:
- “Frequent large PRs merged shortly before deploy.”
- “Review Time compressed when incident backlog is high.”
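If you want to check a risk signal such as "large PRs merged shortly before deploy" outside the dashboard, the idea can be sketched in a few lines of Python. This is a minimal illustration, not a product feature: it assumes you have per-PR data available in some form, and the field names (`lines_changed`, `hours_merge_to_deploy`) and the thresholds are hypothetical.

```python
from statistics import median

# Hypothetical per-PR records; field names and values are assumptions for illustration.
prs = [
    {"id": 101, "lines_changed": 120, "hours_merge_to_deploy": 30.0},
    {"id": 102, "lines_changed": 90, "hours_merge_to_deploy": 26.0},
    {"id": 103, "lines_changed": 1500, "hours_merge_to_deploy": 1.5},
    {"id": 104, "lines_changed": 150, "hours_merge_to_deploy": 20.0},
    {"id": 105, "lines_changed": 2200, "hours_merge_to_deploy": 0.5},
]


def flag_large_late_prs(prs, size_factor=3.0, late_hours=4.0):
    """Return IDs of PRs far larger than the median that deployed soon after merge."""
    typical_size = median(pr["lines_changed"] for pr in prs)
    return [
        pr["id"]
        for pr in prs
        if pr["lines_changed"] > size_factor * typical_size
        and pr["hours_merge_to_deploy"] <= late_hours
    ]


risky = flag_large_late_prs(prs)
print(risky)  # [103, 105]
```

Comparing against the median rather than a fixed line count keeps the check relative to each team's normal PR size; both thresholds are starting points to tune, not recommendations.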
Step 2: Make reliability work visible in Iterations
Goal: Show how unplanned reliability work affects delivery capacity.
Where: Teams → Iterations (Completed)
- Open the last few completed iterations for teams covering critical services.
- Review:
- Unplanned work that came from incidents / reliability tasks.
- Scope removed or delayed because of operational load.
- Patterns across iterations (e.g., every sprint loses 20–30% of capacity to incidents).
- Use these patterns to:
- Quantify reliability work in terms of lost feature capacity.
- Make the case for more SRE capacity or automation.
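The arithmetic behind a statement like "every sprint loses 20–30% of capacity to incidents" can be sketched as follows. The iteration names and numbers are invented for illustration, and the split into planned and unplanned completed work is an assumption about how you tally each iteration (story points or issue counts both work).

```python
# Hypothetical per-iteration totals (story points or issue counts); values are made up.
iterations = [
    {"name": "Sprint 1", "planned_done": 32, "unplanned_done": 8},
    {"name": "Sprint 2", "planned_done": 27, "unplanned_done": 9},
    {"name": "Sprint 3", "planned_done": 28, "unplanned_done": 12},
]


def unplanned_share(iteration):
    """Fraction of completed work in an iteration that was unplanned."""
    total = iteration["planned_done"] + iteration["unplanned_done"]
    return iteration["unplanned_done"] / total if total else 0.0


# Percentage of completed work that was unplanned, per iteration.
shares = {it["name"]: round(unplanned_share(it) * 100, 1) for it in iterations}
average_share = round(sum(shares.values()) / len(shares), 1)

print(shares)         # {'Sprint 1': 20.0, 'Sprint 2': 25.0, 'Sprint 3': 30.0}
print(average_share)  # 25.0
```

Note that this measures the unplanned share of completed work, which only approximates lost feature capacity; scope that was removed or delayed would need to be counted separately.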
Step 3: Use Developer Coaching to spot operational strain
Goal: Find hotspots where a few people carry too much reliability burden.
Where: Developer Coaching (if enabled)
- Look for contributors who:
- Handle a disproportionate share of reviews or critical PRs.
- Frequently appear in incident/operational work.
- Compare those hotspots with:
- High Cycle Time or Rework in their services.
- Known incident trends.
- Use this to justify:
- Spreading knowledge via pairing, documentation, or ownership changes.
- Targeted automation or standards for high-risk areas.
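One rough way to make "a disproportionate share of reviews" concrete is to compute each reviewer's share of reviews on critical PRs and flag anyone above a threshold. The sketch below uses made-up reviewer names and an arbitrary 50% threshold; it is not how Developer Coaching calculates anything, just a way to reason about concentration.

```python
from collections import Counter

# Hypothetical data: the reviewer of each review on critical PRs over some period.
reviews = ["ana", "ana", "ben", "ana", "chen", "ana", "ben", "ana"]


def review_hotspots(reviewers, threshold=0.5):
    """Return reviewers whose share of all reviews meets or exceeds the threshold."""
    counts = Counter(reviewers)
    total = sum(counts.values())
    return {
        name: count / total
        for name, count in counts.items()
        if count / total >= threshold
    }


hotspots = review_hotspots(reviews)
print(hotspots)  # {'ana': 0.625}
```

A single person above the threshold is a prompt for a conversation about pairing or ownership, not a verdict on that person's workload.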
Step 4: Pair reliability metrics with delivery trends (if configured)
Goal: Tell a clean “change → incident → improvement” story.
- Identify periods or services with higher incident volume or failure signals.
- Overlay those periods with:
- Spikes in large or rushed PRs.
- Increased unplanned work in Iterations.
- Capture 1–2 specific narratives per quarter to bring to leadership and DevEx/QA:
- “When we tightened review standards and reduced oversized PRs, incidents dropped the next month.”
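As a minimal sketch of the overlay described above, you can compare average incident counts in weeks with many large PRs against calmer weeks. The weekly figures and the `spike_threshold` value are invented for illustration; real data would come from whatever incident and delivery sources you have configured.

```python
from statistics import mean

# Hypothetical weekly counts; all values are made up for illustration.
weeks = [
    {"week": "W1", "large_prs": 1, "incidents": 1},
    {"week": "W2", "large_prs": 5, "incidents": 4},
    {"week": "W3", "large_prs": 0, "incidents": 1},
    {"week": "W4", "large_prs": 4, "incidents": 3},
    {"week": "W5", "large_prs": 1, "incidents": 0},
    {"week": "W6", "large_prs": 6, "incidents": 5},
]


def incidents_by_pr_load(weeks, spike_threshold=3):
    """Return (avg incidents in spike weeks, avg incidents in other weeks)."""
    spike = [w["incidents"] for w in weeks if w["large_prs"] >= spike_threshold]
    calm = [w["incidents"] for w in weeks if w["large_prs"] < spike_threshold]
    return (
        mean(spike) if spike else None,
        mean(calm) if calm else None,
    )


spike_avg, calm_avg = incidents_by_pr_load(weeks)
print(spike_avg, round(calm_avg, 2))
```

A gap between the two averages suggests a pattern worth investigating, but it does not show that large PRs caused the incidents; treat it as input to the narrative, not proof.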
Step 5: Use gitStream to standardize safe-change patterns
Goal: Turn reliability learnings into guardrails.
Where: gitStream Hub (if enabled)
- Start with patterns that directly reduce risk:
- Flagging changes in critical services for extra review.
- Protecting against massive PRs in sensitive areas.
- Encouraging AI review or additional checks for high-risk files.
- Roll guardrails out to a few services, then expand once teams are comfortable.
- Use Delivery and incident trends to verify impact.
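Before encoding a guardrail as an automation, it can help to prototype the rule logic and test it against recent PRs. The sketch below is plain Python, not gitStream configuration syntax; the path patterns, the line limit, and the action names are all hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical values: adjust to your own repositories and risk tolerance.
CRITICAL_PATTERNS = ["services/payments/*", "infra/terraform/*"]
MAX_LINES_IN_CRITICAL = 400


def guardrail_actions(changed_files, lines_changed):
    """Decide which guardrail actions a PR would trigger under this prototype rule."""
    touches_critical = any(
        fnmatch(path, pattern)
        for path in changed_files
        for pattern in CRITICAL_PATTERNS
    )
    actions = []
    if touches_critical:
        # Any change in a critical area asks for an additional reviewer.
        actions.append("require-extra-review")
        if lines_changed > MAX_LINES_IN_CRITICAL:
            # Oversized changes in critical areas get an extra flag.
            actions.append("flag-oversized-change")
    return actions


print(guardrail_actions(["services/payments/api/handler.py"], 900))
# ['require-extra-review', 'flag-oversized-change']
```

Running a rule like this over a few weeks of past PRs gives a sense of how often it would fire, which helps keep the eventual guardrail low-noise.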
Recommended operating rhythm
Weekly
- Review Delivery stage trends for high-risk services.
- Scan Completed Iterations for reliability-driven unplanned work.
- Bring one observation that links reliability and flow to your Platform/DevEx or EM partners.
Monthly / per release
- Summarize how delivery patterns correlated with incidents.
- Agree on one safe-change experiment (standard or automation) to test.
- Update gitStream and team standards based on outcomes.
Recommended next articles
Role Based Adoption - Start Here
Security & Compliance