Complete Workflow — Incident to Resolution

The question: What does a full Claude Code-augmented network operations workflow look like, using all five layers and MCP?

Scenario: BGP peer down between NYC and LAX data centers

PagerDuty fires: BGP session between core-nyc-01 and core-lax-01 is down. Prefixes from LAX are no longer reachable from NYC. You open your terminal.

Step 1: Start Claude Code (Layer 1 loads automatically)

cd ~/network-automation
claude

CLAUDE.md loads. Claude knows your vendor stack, naming conventions, IP scheme, routing design, and safety rules. The SessionStart hook fires and pulls the latest configs from git. You don't explain anything about your environment.
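
Hooks are plain executables, so that git pull can be a few lines of script. A minimal sketch of what the SessionStart hook might look like, assuming a Python hook script and an illustrative repo path (registration happens in your Claude Code settings):

#!/usr/bin/env python3
# Hypothetical SessionStart hook: refresh the local config repo before
# the session begins. The repo path is an assumption for illustration.
import subprocess
import sys
from pathlib import Path

REPO = Path("~/network-automation/configs").expanduser()

result = subprocess.run(
    ["git", "-C", str(REPO), "pull", "--ff-only"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    # Exit nonzero so the failure surfaces instead of the session
    # silently starting with stale configs.
    print(result.stderr.strip(), file=sys.stderr)
    sys.exit(1)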

Step 2: Initial triage (Bash + MCP, 45 seconds)

You: BGP between core-nyc-01 and core-lax-01 is down.
     Check both routers — BGP neighbor detail and interface
     status on the peering link. Also pull the last hour of
     syslog for both devices from the database.

Claude runs the show commands on both routers over SSH (pre-approved in your permissions settings). It queries the syslog database through the Postgres MCP server, then cross-references timestamps across the four data sources:

"Session dropped at 14:32:17 UTC. Both sides show hold timer expired simultaneously — ruling out unilateral reset. core-nyc-01 syslog shows three TCP retransmissions on port 179 starting at 14:32:02. The peering link interface (Ethernet49/1 on core-nyc-01) is up/up but shows 0.3% input errors."

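A sketch of the manual version of this triage, for scale. It assumes Netmiko for the SSH sessions, psycopg2 for the syslog database, and placeholder credentials, device types, and schema; the LAX-side interface name is also an assumption:

# Hypothetical manual equivalent of the Step 2 triage. Device types,
# credentials, the syslog schema, and the LAX-side interface are
# illustrative, not taken from this environment.
import os
import psycopg2
from netmiko import ConnectHandler

PEER_LINK = {"core-nyc-01": "Ethernet49/1", "core-lax-01": "Ethernet49/1"}

for host, intf in PEER_LINK.items():
    conn = ConnectHandler(
        device_type="arista_eos",
        host=host,
        username="netops",
        password=os.environ["NETOPS_PASSWORD"],
    )
    print(conn.send_command("show ip bgp neighbors"))
    print(conn.send_command(f"show interfaces {intf}"))
    conn.disconnect()

# Last hour of syslog for both devices, the same window Claude pulled.
db = psycopg2.connect("dbname=syslog host=logdb.example.net")
with db.cursor() as cur:
    cur.execute(
        "SELECT ts, device, message FROM syslog "
        "WHERE device = ANY(%s) AND ts > now() - interval '1 hour' "
        "ORDER BY ts",
        (list(PEER_LINK),),
    )
    for ts, device, message in cur.fetchall():
        print(ts, device, message)
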
Step 3: Root cause investigation (Skill auto-invokes)

Claude recognizes this as a BGP issue and auto-loads the bgp-troubleshoot skill from Layer 2, then begins working through its decision tree. You steer it toward the physical layer:

You: The retransmissions and input errors suggest a physical
     layer problem. Check transceiver levels on both ends
     and pull the circuit metrics from Grafana.

Claude pulls transceiver data over SSH and queries Grafana via MCP:

"Rx power on core-nyc-01 Ethernet49/1: -18.2 dBm. Threshold is -14 dBm — 4 dB over budget. The carrier's interface shows healthy Tx power. Grafana shows the Rx power has been degrading linearly over the past 72 hours, from -12.1 to -18.2 dBm. This is a failing SFP or dirty fiber — the degradation curve is consistent with contaminated optics."

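The arithmetic behind that conclusion is simple enough to sketch. A hypothetical helper using this incident's numbers (the threshold comes from the scenario; the "4 dB over budget" above is the rounded form of the 4.2 dB margin computed here):

# Hypothetical optics-budget math using this incident's values.
RX_ALARM_DBM = -14.0  # low-Rx threshold from the scenario

def rx_margin(rx_dbm, threshold_dbm=RX_ALARM_DBM):
    # Positive margin is healthy; negative means over budget.
    return rx_dbm - threshold_dbm

def degradation_rate(first_dbm, last_dbm, hours):
    # dB lost per hour, assuming the linear decay Grafana showed.
    return (first_dbm - last_dbm) / hours

print(rx_margin(-18.2))                    # -4.2 dB, i.e. 4.2 dB over budget
print(degradation_rate(-12.1, -18.2, 72))  # ~0.085 dB lost per hour
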
Step 4: Remediation plan (Skills + CLAUDE.md context, 30 seconds)

You: Draft the change ticket with remediation and rollback steps.

Claude generates a ServiceNow-formatted change ticket using your change control conventions from CLAUDE.md (a sketch of the corresponding API call follows the field list):

  • Impact: LAX prefixes unreachable from NYC (14 /24 subnets)
  • Affected: 12 production services with LAX backend dependencies
  • Remediation: Replace QSFP28-100G-LR4 in core-nyc-01 Ethernet49/1
  • Pre-change: capture BGP summary, interface counters, routing table count
  • Post-change: verify Rx power > -10 dBm, BGP Established, all prefixes received
  • Rollback: If SFP swap fails, activate backup transit path via route-map RM-FAILOVER-TRANSIT permit 10 on core-nyc-01
  • Downtime estimate: 60 seconds (the Arista 7280R3 supports optic hot-swap)
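
ServiceNow accepts change requests through its REST Table API, so a ticket like this can be filed programmatically. A minimal sketch, assuming basic auth, an illustrative instance URL, and a simplified field mapping:

# Hypothetical Table API call for the ticket above. Instance URL,
# credentials, and field choices are assumptions.
import os
import requests

INSTANCE = "https://example.service-now.com"

payload = {
    "short_description": "Replace QSFP28-100G-LR4 in core-nyc-01 Ethernet49/1",
    "description": "Rx power 4.2 dB below threshold; LAX prefixes unreachable from NYC.",
    "backout_plan": (
        "If the SFP swap fails, activate backup transit path via "
        "route-map RM-FAILOVER-TRANSIT permit 10 on core-nyc-01."
    ),
}

resp = requests.post(
    f"{INSTANCE}/api/now/table/change_request",
    json=payload,
    auth=(os.environ["SN_USER"], os.environ["SN_PASS"]),
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["result"]["number"])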

Step 5: Post-change validation (Hook + Bash, 30 seconds)

You: I've replaced the SFP. Run post-change validation.

Claude runs the pre-defined validation commands. The PostToolUse hook automatically logs every validation result. Output:

"Ethernet49/1 — up/up, Rx power -8.4 dBm (nominal). BGP session re-established in 18 seconds. All 14 prefixes received. Zero CRC errors since SFP swap. Circuit packet loss at 0.00% (Grafana confirms). Interface error counters reset."

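A minimal sketch of the validation pass itself, reusing the Netmiko setup from the triage sketch above; the output parsing is deliberately naive and illustrative, and a fuller script would also diff prefix counts and CRC counters against the pre-change capture:

# Hypothetical post-change checks against the ticket's pass criteria.
import os
import re
from netmiko import ConnectHandler

conn = ConnectHandler(
    device_type="arista_eos",
    host="core-nyc-01",
    username="netops",
    password=os.environ["NETOPS_PASSWORD"],
)
bgp = conn.send_command("show ip bgp neighbors")
optic = conn.send_command("show interfaces Ethernet49/1 transceiver")
conn.disconnect()

# Naive grab of the first dBm figure; real parsing would be stricter.
rx = re.search(r"(-?\d+\.\d+)", optic)
rx_dbm = float(rx.group(1)) if rx else None

results = {
    "bgp_established": "Established" in bgp,
    "rx_power_above_-10_dBm": rx_dbm is not None and rx_dbm > -10.0,
}
for check, ok in results.items():
    print(f"{check}: {'PASS' if ok else 'FAIL'}")
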
Step 6: Documentation (Export, 10 seconds)

/export

The entire troubleshooting session — every command, every output, every conclusion, every cross-device correlation — exports as a post-incident document. No manual report writing. The investigation is the documentation.

What changed

Total engineer engagement: under 8 minutes from alert to resolution and documented close.

Traditional workflow for the same incident:

  • SSH into both routers manually (3 minutes)
  • Check counters, interpret output, take notes (5 minutes)
  • Open Grafana in a browser, find the right dashboard, query the time range (4 minutes)
  • Open the syslog search interface, write the query, wait for results (3 minutes)
  • Cross-reference timestamps across four browser tabs (5 minutes)
  • Write the change ticket in ServiceNow (10 minutes)
  • Execute the change (5 minutes)
  • Run validation commands and compare against baseline (5 minutes)
  • Write the post-incident report (15 minutes)

Total: 55 minutes.

The engineering judgment didn't change. The diagnostic depth didn't change. The safety procedures didn't change. What changed is that the translation layer — from "I know what I need to check" to "I'm now looking at the results" — collapsed from minutes of typing and clicking to seconds of describing.
