Mutex deadlocks in production: the patterns I found in my codebase and how I diagnosed them
It was 11:47 PM and the service wasn't responding. No panic. No error in the logs. Railway showed the container alive, memory stable, CPU at zero. Zero. That's what caught my attention: zero activity on a service that should've been processing queues. I opened a tokio-console session and there it was — four tasks suspended, all waiting on the same Mutex. None of them were ever going to move.
That was the first one. Then came the second. Then the third. All three with the same face: total silence, container "healthy," and a chain of locks that was never going to resolve itself.
My thesis is this: mutex deadlocks in async Rust aren't rare or mysterious. They're predictable. They follow patterns. And once you've seen one, you recognize them from a mile away. The problem is that most resources teach you what a deadlock is, not how you diagnose it when it's already in production and you can't just pull a backtrace.
Mutex deadlocks in async Rust: why this isn't a trivial problem
The specific problem with async Rust isn't that locks are hard to understand. It's that tokio::sync::Mutex and std::sync::Mutex behave differently in ways that aren't obvious until something blows up.
When you use std::sync::Mutex inside an async runtime, the danger is holding the guard across an .await. If your task suspends while still holding the lock, any other task that calls .lock() blocks synchronously — not just that task, but the executor's entire worker thread. Every task scheduled on that thread stalls. With a single-threaded runtime, the holder can never be polled again to drop the guard, and the entire program freezes.
// ⚠️ This is poison in async — you're blocking the executor
use std::sync::{Arc, Mutex};

async fn process_item(state: Arc<Mutex<State>>) {
    let guard = state.lock().unwrap(); // synchronous block, no yield to the executor
    do_async_io().await;               // guard still held across the await
    // guard drops here — but the damage is already done
}
With tokio::sync::Mutex the behavior changes: .lock().await suspends the task, not the thread. The executor can keep running other tasks while you wait for the lock. But that doesn't save you from deadlocks — if you have circular dependencies, you're still stuck, just more politely.
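For contrast, here's the same function with tokio's Mutex — same hypothetical State and do_async_io as above. It's executor-friendly, but note the caveat in the comments:

use std::sync::Arc;
use tokio::sync::Mutex;

// Holding the guard across the .await now only suspends this task;
// the worker thread keeps driving other tasks. Circular waits between
// tasks can still deadlock — just without freezing the thread.
async fn process_item_async(state: Arc<Mutex<State>>) {
    let guard = state.lock().await; // suspends the task, not the thread
    do_async_io().await;            // other tasks keep making progress
    drop(guard);
}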
What I found in my codebase, after three incidents, is that I had three distinct deadlock patterns. Internally I call them: the Classic Deadly Embrace, the Reentrant Lock, and the Inverted Order Under Pressure.
The three concrete patterns I found and how I reproduced them
Pattern 1: The Classic Deadly Embrace
This is the most well-known one but it still got me. Two tasks, two resources, inverse acquisition order.
// Reproduction of the first real deadlock
// Task A: takes lock_cache, then asks for lock_db
// Task B: takes lock_db, then asks for lock_cache
use std::{sync::Arc, time::Duration};
use tokio::sync::Mutex;

async fn task_a(
    cache: Arc<Mutex<Cache>>,
    db: Arc<Mutex<DbPool>>,
) {
    let _cache_guard = cache.lock().await; // Task A takes cache
    tokio::time::sleep(Duration::from_millis(1)).await; // pause = deadlock window
    let _db_guard = db.lock().await; // Task A waits for db — which Task B holds
}

async fn task_b(
    cache: Arc<Mutex<Cache>>,
    db: Arc<Mutex<DbPool>>,
) {
    let _db_guard = db.lock().await; // Task B takes db
    tokio::time::sleep(Duration::from_millis(1)).await;
    let _cache_guard = cache.lock().await; // Task B waits for cache — which Task A holds
}
The fix isn't just "acquire locks in the same order." The real fix is asking yourself whether you need both locks at the same time. In my case, I didn't — I restructured to acquire, operate, release, and only then acquire the second.
// Fixed version: explicit scope, no guard overlap
async fn task_a_fixed(
    cache: Arc<Mutex<Cache>>,
    db: Arc<Mutex<DbPool>>,
) {
    // First we work with cache and release it
    let data = {
        let guard = cache.lock().await;
        guard.get_data()
    }; // guard dropped here

    // Only then do we use db
    let mut db_guard = db.lock().await;
    db_guard.write(data).await;
}
Pattern 2: The Reentrant Lock
This one took me longer because it didn't look like a classic deadlock. A single function, a single mutex. The problem: the function was calling itself (indirectly, through a callback) while it already held the lock.
// The internal callback was calling the same function that already held the lock
async fn process_event(
    state: Arc<Mutex<State>>,
    event: Event,
) {
    let mut guard = state.lock().await;
    // This internal handler calls process_event again
    // with the same Arc<Mutex<State>> — guaranteed deadlock
    guard.run_handlers(&event).await;
}
Rust's standard library doesn't ship a reentrant mutex, and tokio::sync::Mutex isn't reentrant either. The fix was separating the state the handlers need from the state the main lock holds, or cloning the necessary data before releasing the guard.
// Fix: clone what you need, drop the lock, then run handlers
async fn process_event_fixed(
    state: Arc<Mutex<State>>,
    event: Event,
) {
    // Take what we need and release the lock
    let handlers = {
        let guard = state.lock().await;
        guard.handlers_for(&event).clone() // deliberate clone
    }; // guard dropped

    // Run handlers without holding the lock
    for handler in handlers {
        handler.run(&event).await;
    }
}
Pattern 3: Inverted Order Under Pressure
This is the most treacherous one because the code never fails in development. It only shows up when there's real concurrency, under load, with multiple replicas. I saw it in production when Railway started horizontally scaling the service.
The pattern: the lock acquisition order looks consistent in the code, but under pressure tasks interleave at exactly the wrong moment and the effective acquisition order inverts. Related to this — in my analysis of Docker Compose in production over 30 days, I noticed that concurrency problems didn't appear until the second week, when real traffic started picking up.
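Since this one never fires on a laptop, here's a minimal sketch of how the inversion hides. The types (Cache, DbPool) and helpers are hypothetical, not my actual code — the point is the shape: each function looks locally consistent, and only one of them ever shows both locks to a reviewer.

use tokio::sync::Mutex;

// Hypothetical types for illustration only
struct Cache { entries: Vec<String> }
struct DbPool { rows: Vec<String> }

// Path 1: cache → db. The db acquisition hides inside a helper,
// so anyone reading flush_cache never sees both locks together.
async fn flush_cache(cache: &Mutex<Cache>, db: &Mutex<DbPool>) {
    let cache_guard = cache.lock().await;        // cache first...
    persist(db, &cache_guard.entries).await;     // ...then db, indirectly
}

async fn persist(db: &Mutex<DbPool>, entries: &[String]) {
    let mut db_guard = db.lock().await;
    db_guard.rows.extend_from_slice(entries);
}

// Path 2: db → cache. Harmless in isolation ("it only reads"),
// but the effective order is the inverse of path 1. Under load,
// these two paths interleave and lock each other out forever.
async fn rebuild(cache: &Mutex<Cache>, db: &Mutex<DbPool>) {
    let db_guard = db.lock().await;              // db first...
    let mut cache_guard = cache.lock().await;    // ...then cache: inverted
    cache_guard.entries = db_guard.rows.clone();
}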
The tool that changed everything was tokio-console. With it I could see exactly which tasks were in which state:
# Install tokio-console
cargo install tokio-console

# In code, enable the subscriber
# Cargo.toml:
#   console-subscriber = "0.4"
#   tokio = { version = "1", features = ["full", "tracing"] }
# Note: the task instrumentation is gated behind tokio's unstable cfg,
# so build with RUSTFLAGS="--cfg tokio_unstable"

# main.rs
fn main() {
    console_subscriber::init(); // one single line
    // ... rest of the runtime
}
The output showed me this:
Task 47: waiting on Mutex (owned by Task 23) — 4m 32s
Task 23: waiting on Mutex (owned by Task 47) — 4m 32s
Task 31: waiting on Mutex (owned by Task 47) — 4m 32s
Four and a half minutes. No log. No error. The service just... breathing.
The mistakes I made before I understood what I was looking for
The first mistake was looking in the wrong place. After that 11:47 PM incident, my instinct was to check Railway logs, look for panics, look for OOM. There was nothing. A healthy container that does nothing is exactly what a well-formed deadlock looks like.
The second mistake was using unwrap() on locks:
// This hides the problem — if the lock is poisoned, it panics.
// If it's in a deadlock, it never gets to execute at all.
let guard = mutex.lock().unwrap();

// Better: explicit timeout to catch deadlocks in development
use tokio::time::timeout;

match timeout(Duration::from_secs(5), mutex.lock()).await {
    Ok(guard) => { /* use guard */ }
    Err(_) => {
        // This saved me in staging: if it takes more than 5s, something is wrong
        tracing::error!("Possible deadlock detected acquiring mutex");
        return Err(AppError::LockTimeout);
    }
}
The third mistake was trusting that the agent architecture I built (you can see part of that stack in my post on autonomous deploy agents) wouldn't have concurrency problems because "it's async." Async doesn't protect you from deadlocks. It changes how they express themselves.
I validated this same intuition when I analyzed async Rust edge cases against my real codebase: the language gives you tools to reason about concurrency, but the tools don't think for you.
A pattern I learned to avoid after all this:
// ❌ Lock held across an .await — classic in careless async code
async fn bad_practice(state: Arc<Mutex<State>>) -> Result<()> {
    let mut guard = state.lock().await;
    // Any .await while guard is alive is a potential problem
    let result = external_call().await?; // ← right here
    guard.update(result);
    Ok(())
}

// ✅ Lock held for the minimum possible time
async fn good_practice(state: Arc<Mutex<State>>) -> Result<()> {
    // IO first, no lock
    let result = external_call().await?;

    // Lock only for the atomic write
    {
        let mut guard = state.lock().await;
        guard.update(result);
    } // guard dropped immediately

    Ok(())
}
FAQ: Mutex deadlocks in async Rust
What's the real difference between std::sync::Mutex and tokio::sync::Mutex for detecting deadlocks?
The most important difference for diagnosis is behavior under contention. A contended std::sync::Mutex blocks the executor's entire worker thread inside .lock(), which can freeze the whole runtime. tokio::sync::Mutex only suspends the task, but it's still vulnerable to circular deadlocks. To diagnose which one you have, tokio-console shows you the state of each task — if you see tasks waiting on mutexes longer than makes any sense, the deadlock is right there.
Does tokio-console work in production or only in development?
Both, but the tracing overhead isn't free. In production I only enabled it during the incident, behind a conditional feature flag. In staging I keep it always on. The overhead in development is totally acceptable; in production under high load, I measured around 3–5% additional CPU, which is manageable for a targeted diagnosis.
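For reference, this is roughly how I gate it — a sketch assuming a cargo feature named console; the fallback subscriber is whatever you normally run with:

// Cargo.toml (sketch):
//   [features]
//   console = ["dep:console-subscriber"]
//
//   [dependencies]
//   console-subscriber = { version = "0.4", optional = true }

fn init_tracing() {
    // Full tokio-console instrumentation only when built with --features console
    #[cfg(feature = "console")]
    console_subscriber::init();

    // Plain log subscriber otherwise
    #[cfg(not(feature = "console"))]
    tracing_subscriber::fmt::init();
}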
Does RwLock solve the problem or make it worse?
Depends. RwLock allows multiple simultaneous readers, which reduces contention on read-heavy workloads. But it adds a new deadlock vector: if a task holds a read lock and then requests the write lock on the same RwLock, the write waits for every reader to release — including the read guard that same task is still holding. You're blocked just the same. I used it where the ratio was 90% reads, 10% writes, and it improved performance without adding deadlocks. The trick is not mixing read and write acquisitions of the same lock in one call path.
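A minimal sketch of that upgrade hazard with tokio::sync::RwLock:

use std::sync::Arc;
use tokio::sync::RwLock;

// The write().await below waits for all readers to release —
// including the read guard this same task still holds. Self-deadlock.
async fn upgrade_hazard(state: Arc<RwLock<u64>>) {
    let read_guard = state.read().await;
    if *read_guard == 0 {
        let mut write_guard = state.write().await; // never resolves
        *write_guard = 1;
    }
}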
How do you reproduce a deadlock in tests to validate that the fix worked?
The most reliable approach I found is using tokio::time::timeout in tests and simulating the race condition with tokio::task::yield_now():
use std::{sync::Arc, time::Duration};
use tokio::{sync::Mutex, time::timeout};

#[tokio::test]
async fn test_no_deadlock() {
    let state = Arc::new(Mutex::new(State::new()));

    let result = timeout(
        Duration::from_secs(2),
        process_event_fixed(state.clone(), Event::Test),
    )
    .await;

    // If there's a deadlock, the timeout fires and the test fails
    assert!(result.is_ok(), "Possible deadlock detected");
}
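And to actually exercise the interleaving with yield_now(), I spawn both tasks with a forced yield in between. This sketch assumes a task_b_fixed restructured the same way as task_a_fixed from Pattern 1:

#[tokio::test]
async fn test_no_deadlock_under_interleaving() {
    let cache = Arc::new(Mutex::new(Cache::new()));
    let db = Arc::new(Mutex::new(DbPool::new()));

    // Spawn both tasks and yield between them, widening the window
    // where the original (broken) ordering used to deadlock
    let a = tokio::spawn(task_a_fixed(cache.clone(), db.clone()));
    tokio::task::yield_now().await;
    let b = tokio::spawn(task_b_fixed(cache, db));

    let result = timeout(Duration::from_secs(2), async {
        a.await.unwrap();
        b.await.unwrap();
    })
    .await;

    assert!(result.is_ok(), "Possible deadlock under interleaving");
}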
When should you use Arc<Mutex<T>> versus an actor pattern (channels)?
After three incidents, my rule is: if the shared state has more than two concurrent consumers, or if the access logic is complex, I use channels. Arc<Mutex<T>> is simple and correct for shared state with predictable access and low contention. The actor pattern — one task with an mpsc::Receiver that's the only one touching the state — eliminates deadlocks by design: there's no shared lock. I implemented it in my LLM pipeline — you can see part of that design in my analysis of the real cost of training an LLM from scratch — and it eliminated an entire class of problems.
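For reference, the shape of that actor — a minimal sketch with a hypothetical Command enum, not the actual pipeline code:

use tokio::sync::{mpsc, oneshot};

// Hypothetical command type; the real one carries pipeline messages
enum Command {
    Get { reply: oneshot::Sender<u64> },
    Set(u64),
}

// One task owns the state outright — no shared lock, so no lock
// ordering to get wrong and nothing to deadlock on
async fn run_actor(mut rx: mpsc::Receiver<Command>) {
    let mut state: u64 = 0;
    while let Some(cmd) = rx.recv().await {
        match cmd {
            Command::Get { reply } => { let _ = reply.send(state); }
            Command::Set(v) => state = v,
        }
    }
}

// Callers clone the mpsc::Sender and send commands instead of locking:
// tx.send(Command::Set(42)).await.ok();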
Does Clippy or any static analysis catch deadlocks?
No. Clippy doesn't detect deadlocks in async. Neither does the compiler. It's a runtime behavior problem, not a type problem. There are community proposals to add lock order analysis, but nothing stable yet. The only real tools I have are tokio-console at runtime and explicit timeouts on critical locks. Rust's static analysis is extraordinary for many things — I even validated it against things Chrome does without asking permission in terms of resource access — but deadlocks in async escape the type system.
What I changed in my architecture after the three incidents
The conclusion isn't "avoid mutexes." The conclusion is: mutexes are safe if you're explicit about how long you hold them and in what order you acquire them. What I was missing was structural discipline, not theory.
The concrete changes I made:
- Timeout on all critical locks — if something takes more than 10 seconds to acquire a lock in production, I want to know about it (see the helper sketch after this list).
- Explicit scope with braces — every lock gets a {} block that defines exactly its lifetime. No guards floating to the end of the function.
- tokio-console always active in staging — the deadlocks I caught in staging saved me from three more production incidents.
- Lock order review in code review — I added a checklist: does this PR acquire more than one lock? In what order? Is it consistent with the rest of the codebase?
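The timeout helper, roughly — a sketch with a hypothetical LockTimeout error type, wrapping every critical acquisition:

use std::time::Duration;
use tokio::sync::{Mutex, MutexGuard};
use tokio::time::timeout;

#[derive(Debug)]
struct LockTimeout;

// Instead of awaiting .lock() bare, every critical acquisition goes
// through this: past 10 seconds it logs and errors out instead of
// hanging silently
async fn lock_or_alert<'a, T>(
    m: &'a Mutex<T>,
    name: &str,
) -> Result<MutexGuard<'a, T>, LockTimeout> {
    match timeout(Duration::from_secs(10), m.lock()).await {
        Ok(guard) => Ok(guard),
        Err(_) => {
            tracing::error!(lock = %name, "acquisition exceeded 10s — possible deadlock");
            Err(LockTimeout)
        }
    }
}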
What I didn't do was migrate everything to channels. That would be over-engineering. Arc<Mutex<T>> is still the right tool for simple shared state. The difference is that now I know when it's the right tool and when it isn't.
If you're starting out with async Rust and this topic is new to you, the entry point I'd recommend is instrumenting with tokio-console before you have the problem, not after. The cost is low. The information it gives you when something explodes is priceless.
And if you've already had your own silent deadlock incident at 11 PM — welcome to the club. Membership includes a much deeper appreciation for real logs and a healthy distrust of containers that breathe but don't do anything.