Circuit Breaker
Purpose
If a child server crashes repeatedly, MetaMCP stops trying. The circuit breaker tracks consecutive failures per server and temporarily disables servers that exceed the failure threshold.
This prevents a broken server from consuming resources, slowing down requests, and generating noise. Once the cooldown period expires, MetaMCP probes the server again to see if it has recovered.
How It Works
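The circuit breaker (implemented in src/circuit-breaker.ts) tracks a tripped boolean. Combined with the cooldown timer, this yields two effective states plus a probe window that opens once the cooldown has expired: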
Normal (tripped = false). Requests pass through to the child server. On each failure, a counter increments. When the counter reaches the configured threshold, the breaker sets tripped = true, records the timestamp, and resets the counter.
Open (tripped = true, within cooldown). Requests are rejected immediately without contacting the child server. The LLM receives an error indicating the server is temporarily unavailable.
After cooldown expires. The isOpen() check returns false, allowing a single probe request through. If that probe succeeds, recordSuccess() resets the breaker to normal. If it fails, the breaker trips again.
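The behavior described above can be sketched as a small class. This is a minimal illustration, not the contents of src/circuit-breaker.ts: the class name, the constructor shape, the trippedAt field, and the injectable now clock are assumptions. Only tripped, isOpen(), recordSuccess(), and recordFailure() are named in this document. The sketch also assumes that a failure recorded while the breaker is tripped is a failed probe and re-trips immediately.

```typescript
// Illustrative sketch only; the real src/circuit-breaker.ts may differ.
class CircuitBreakerSketch {
  private failures = 0;
  private tripped = false;
  private trippedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  // Injectable clock (an assumption, added so the sketch is easy to test).
  private readonly now: () => number;

  constructor(failureThreshold = 5, cooldownMs = 30000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True only while the breaker is tripped and the cooldown has not elapsed.
  isOpen(): boolean {
    return this.tripped && this.now() - this.trippedAt < this.cooldownMs;
  }

  recordFailure(): void {
    // Assumption: a failure while tripped is a failed probe, so re-trip at once.
    if (this.tripped) {
      this.trip();
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.trip();
  }

  // Return to normal: clear the tripped flag and the failure counter.
  recordSuccess(): void {
    this.tripped = false;
    this.failures = 0;
  }

  // Set the flag, record the timestamp, and reset the counter.
  private trip(): void {
    this.tripped = true;
    this.trippedAt = this.now();
    this.failures = 0;
  }
}
```

A caller would check isOpen() before contacting the child server, then call recordSuccess() or recordFailure() depending on the outcome.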
Configuration
| Parameter | Default | CLI Flag | Description |
|---|---|---|---|
| failureThreshold | 5 | --failure-threshold | Consecutive failures before tripping |
| cooldownMs | 30000 | --cooldown | Cooldown period in milliseconds |
Example:

    {
      "pool": {
        "failureThreshold": 3,
        "cooldownMs": 60000
      }
    }

This trips after 3 consecutive failures and waits 60 seconds before allowing a probe.
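As a rough illustration of how such a config fragment might be combined with the documented defaults, the sketch below merges a partial pool object over the default values. The PoolConfig interface, the resolvePoolConfig function, and the validation rules are hypothetical and are not taken from MetaMCP.

```typescript
// Hypothetical shape for the "pool" settings; not MetaMCP's actual type.
interface PoolConfig {
  failureThreshold: number;
  cooldownMs: number;
}

// Defaults as listed in the configuration table.
const DEFAULT_POOL: PoolConfig = { failureThreshold: 5, cooldownMs: 30000 };

// Merge a partial user config over the defaults and reject invalid values.
// The validation rules here are assumptions made for the sketch.
function resolvePoolConfig(user: Partial<PoolConfig> = {}): PoolConfig {
  const merged: PoolConfig = { ...DEFAULT_POOL, ...user };
  if (!Number.isInteger(merged.failureThreshold) || merged.failureThreshold < 1) {
    throw new RangeError("failureThreshold must be a positive integer");
  }
  if (!Number.isFinite(merged.cooldownMs) || merged.cooldownMs < 0) {
    throw new RangeError("cooldownMs must be a non-negative number");
  }
  return merged;
}
```

With this sketch, resolvePoolConfig({ failureThreshold: 3 }) keeps the default cooldownMs of 30000 while lowering the threshold.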
State Transitions
The breaker moves through states in a predictable cycle:

    normal --[failures reach threshold]--> open
    open   --[cooldown expires]----------> probe
    probe  --[success]-------------------> normal
    probe  --[failure]-------------------> open

- Normal to Open. Consecutive failures accumulate. When they reach failureThreshold, the breaker trips.
- Open to Probe. After cooldownMs elapses, isOpen() returns false. The next request acts as a probe.
- Probe to Normal. If the probe request succeeds, recordSuccess() resets the tripped flag and clears the failure counter.
- Probe to Open. If the probe request fails, the breaker trips again immediately with a fresh cooldown timer.
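Because the breaker stores only a flag and a timestamp, the three states in the cycle above can be derived rather than stored. The following sketch shows one way to compute the effective state. The BreakerSnapshot type, the trippedAt field, and the effectiveState function are hypothetical names introduced for illustration.

```typescript
type BreakerState = "normal" | "open" | "probe";

// Hypothetical snapshot of the fields the breaker is described as tracking.
interface BreakerSnapshot {
  tripped: boolean;
  trippedAt: number; // epoch milliseconds of the most recent trip
}

// Derive the effective state from the tripped flag and the elapsed cooldown.
function effectiveState(s: BreakerSnapshot, cooldownMs: number, now: number): BreakerState {
  if (!s.tripped) return "normal";
  return now - s.trippedAt < cooldownMs ? "open" : "probe";
}
```

Under this reading, "probe" is not a stored state: it is simply a tripped breaker whose cooldown has run out.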
Error Classification
Not every error should count toward the circuit breaker. MetaMCP classifies errors before recording them:
| Category | Trips breaker? | Rationale |
|---|---|---|
| auth (401, 403) | No | Auth errors are permanent until credentials change. Tripping the breaker would only delay the inevitable and mask the real problem. |
| offline (ECONNREFUSED, timeout) | Yes | Transient network issues that may resolve after cooldown. |
| http (5xx) | Yes | Server-side errors that typically recover. |
| stdio-exit (process crash) | Yes | Child process failures that warrant backoff. |
| other | Yes | Unrecognized errors default to transient treatment. |
When an auth error occurs, MetaMCP logs a warning with the server name and error details but does not call recordFailure(). The failure counter stays unchanged, and the breaker remains in its current state.
This prevents a misconfigured API key from cycling a server through repeated breaker trips and cooldowns, a pattern that wastes resources without any chance of recovery.
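The classification rules above could be expressed roughly as follows. The category names come from the table, but the ChildError shape and both function names are assumptions: this document does not show which error fields MetaMCP actually inspects.

```typescript
type ErrorCategory = "auth" | "offline" | "http" | "stdio-exit" | "other";

// Hypothetical error shape, introduced only for this sketch.
interface ChildError {
  status?: number;    // HTTP status code, if the error came from an HTTP response
  code?: string;      // system error code such as "ECONNREFUSED"
  timedOut?: boolean; // the request timed out
  exited?: boolean;   // the child process exited
}

// Map an error to one of the categories in the table, checking auth first.
function classify(err: ChildError): ErrorCategory {
  if (err.status === 401 || err.status === 403) return "auth";
  if (err.code === "ECONNREFUSED" || err.timedOut === true) return "offline";
  if (err.status !== undefined && err.status >= 500 && err.status <= 599) return "http";
  if (err.exited === true) return "stdio-exit";
  return "other";
}

// Only auth errors are excluded from the failure count.
function countsTowardBreaker(category: ErrorCategory): boolean {
  return category !== "auth";
}
```

A caller would invoke recordFailure() only when countsTowardBreaker(classify(err)) is true, and log a warning otherwise.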
Per-Server Isolation
Each child server has its own circuit breaker instance. A failing sqlite server does not affect the playwright server.
This isolation means one misbehaving server cannot cascade failures across the system. The LLM can continue using healthy servers while a broken one is in cooldown.
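One simple way to get this kind of isolation is a map from server name to breaker instance. The sketch below uses a stand-in TinyBreaker class and a hypothetical breakerFor helper. It is not MetaMCP's actual registry, only an illustration of why failures on one server leave the others untouched.

```typescript
// Minimal stand-in for a breaker; only the failure count matters here.
class TinyBreaker {
  failures = 0;
  recordFailure(): void {
    this.failures += 1;
  }
}

// One breaker per server name, so state is never shared between servers.
const breakers = new Map<string, TinyBreaker>();

// Look up (or lazily create) the breaker that belongs to one server.
function breakerFor(serverName: string): TinyBreaker {
  let breaker = breakers.get(serverName);
  if (!breaker) {
    breaker = new TinyBreaker();
    breakers.set(serverName, breaker);
  }
  return breaker;
}
```

Recording failures through breakerFor("sqlite") changes only that entry. breakerFor("playwright") returns a separate instance whose counter is unaffected.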
Interaction with Retry
In the callTool function (child-manager.ts), vital servers or servers with restartCount < 1 get one automatic retry before a failure is recorded against the circuit breaker.
The retry sequence:
- First attempt fails.
- MetaMCP respawns the child process.
- Second attempt is made against the fresh process.
- If the second attempt also fails, the circuit breaker records the failure.
This prevents transient errors (a server crash on a single unlucky request) from incrementing the failure counter too quickly. A server must fail twice in a row, including on a fresh process, before the breaker counts it.
The retry only applies to servers marked as vital or those that have not yet been restarted. Non-vital servers that have already been restarted once do not get the automatic retry.
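The retry sequence might look roughly like the sketch below. It is not the callTool implementation from child-manager.ts: the ChildServer shape, getsRetry, callWithRetry, and the injected attempt, respawn, and recordFailure callbacks are all hypothetical. Incrementing restartCount inside the retry path is also an assumption, and error classification is left out for brevity.

```typescript
// Hypothetical view of a child server; only restartCount and "vital" appear in this document.
interface ChildServer {
  vital: boolean;
  restartCount: number;
}

// Vital servers, or servers that have not been restarted yet, get one retry.
function getsRetry(server: ChildServer): boolean {
  return server.vital || server.restartCount < 1;
}

// attempt, respawn, and recordFailure are injected placeholders, not MetaMCP functions.
async function callWithRetry<T>(
  server: ChildServer,
  attempt: () => Promise<T>,
  respawn: () => Promise<void>,
  recordFailure: () => void
): Promise<T> {
  try {
    return await attempt();
  } catch (firstError) {
    if (!getsRetry(server)) {
      // No retry available: the first failure counts against the breaker.
      recordFailure();
      throw firstError;
    }
    try {
      await respawn();
      server.restartCount += 1; // assumption: the respawn is counted here
      return await attempt();   // second attempt against the fresh process
    } catch (secondError) {
      // The retry path failed as well, so the breaker records one failure.
      recordFailure();
      throw secondError;
    }
  }
}
```

In this sketch a failure that is followed by a successful retry never reaches recordFailure(), which matches the behavior described above.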
Next Steps
- Connection Pool for how server lifecycles are managed
- Adding Servers for configuring child servers