LetItCrash

A testing library for crash recovery and OTP supervision behavior in Elixir.

Embrace the "let it crash" philosophy in your tests by easily simulating crashes and verifying that your GenServers and supervised processes recover correctly.

Why Use LetItCrash?

We know Elixir/OTP supervision works. LetItCrash doesn't test whether processes restart; it tests whether your application handles restarts correctly.

Real bugs this library helps catch:

🔍 Resource leaks: Database connections, file handles, ETS entries not cleaned up
🗂️ Registry inconsistencies: Stale entries pointing to dead processes
💾 State corruption: Shared caches with orphaned data after crashes
🔗 Cascade failures: Client processes crashing when servers restart
⚙️ Incomplete initialization: Processes not fully recovering their expected state
🔥 Silent async swallow: Fire-and-forget Tasks, Oban jobs, or handle_async/3 callbacks that raise without the failure ever reaching the test process

Think of it as integration testing for your crash recovery logic, not unit testing the BEAM.

Installation

Add let_it_crash to your list of dependencies in mix.exs:

def deps do
  [
    {:let_it_crash, "~> 0.5.0", only: :test}
  ]
end

Testing async work in Phoenix + Oban

The most common production bug LetItCrash helps catch is the silent swallow: a Task raises, nothing is awaiting it, the supervisor moves on, the test passes, and a real user gets stuck. LetItCrash.Async wraps your test code in an observer block that subscribes to telemetry exception events and lets you name the failure modes explicitly:

defmodule MyAppWeb.WidgetControllerTest do
  use MyAppWeb.ConnCase
  use LetItCrash

  test "POST /api/widgets fires a Task that doesn't silently swallow", %{conn: conn} do
    report =
      observe_async(fn ->
        conn = post(conn, "/api/widgets", %{name: "alpha"})
        assert json_response(conn, 202)
      end)

    assert :ok = assert_no_silent_swallow(report)
    assert :ok = assert_all_completed(report, within: 2_000)
  end

  test "the create-widget Oban worker is idempotent" do
    assert :ok =
             assert_idempotent(
               fn -> MyApp.Widgets.do_create(%{name: "alpha"}) end,
               state: fn -> MyApp.Repo.aggregate(MyApp.Widget, :count) end
             )
  end
end

See LetItCrash.Async for the full surface — failure-mode definitions, options, limitations, and the Ecto.Sandbox interaction note.
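
To make the failure mode concrete, here is a minimal sketch that uses only the Elixir standard library and no LetItCrash calls. The module name `SilentSwallowDemo` is hypothetical. It shows that a task started with `Task.start/1` is not linked to the caller, so a raise inside it is only visible if something monitors the task explicitly.

```elixir
defmodule SilentSwallowDemo do
  # Hypothetical demo module; standard library only, not LetItCrash's API.

  # Starts a fire-and-forget task that raises, and reports what the caller
  # can observe when it explicitly monitors the task.
  def run(timeout \\ 1_000) do
    # Task.start/1 does not link the task to the caller, so the raise below
    # never propagates to the calling (test) process on its own.
    {:ok, pid} =
      Task.start(fn ->
        receive do
          :go -> raise "boom"
        end
      end)

    # Monitor before triggering the failure so the :DOWN message is not missed.
    ref = Process.monitor(pid)
    send(pid, :go)

    receive do
      {:DOWN, ^ref, :process, ^pid, {%RuntimeError{message: message}, _stack}} ->
        {:task_raised, message}

      {:DOWN, ^ref, :process, ^pid, other_reason} ->
        {:task_exited, other_reason}
    after
      timeout -> :no_signal
    end
  end
end
```

Without the `Process.monitor/1` call the caller would carry on as if nothing had happened, apart from an error line in the log. That gap is what the observer block described above is meant to close.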

Usage

defmodule MyAppTest do
  use ExUnit.Case
  use LetItCrash

  test "supervised genserver recovers after crash" do
    # Start a supervisor with your GenServer
    {:ok, supervisor} = MySupervisor.start_link()
    {:ok, _pid} = MySupervisor.start_worker(supervisor, :my_worker)
    
    # Crash by name (automatic PID tracking)
    LetItCrash.crash(:my_worker)
    
    # Verify recovery - waits for new PID
    assert LetItCrash.recovered?(:my_worker)
    
    # Clean up
    Process.exit(supervisor, :shutdown)
  end

  test "process state resets after restart" do
    {:ok, supervisor} = MySupervisor.start_link()  
    {:ok, _pid} = MySupervisor.start_worker(supervisor, :stateful_server)
    
    LetItCrash.test_restart(:stateful_server, fn ->
      # This function runs before AND after the crash
      # State will be reset to initial after restart
      MyStatefulServer.increment()
      count = MyStatefulServer.get_count()
      IO.puts("Count: #{count}")  # Will be 1 before crash, 1 after (reset + increment)
    end)
    
    Process.exit(supervisor, :shutdown)
  end

  test "manual PID tracking" do
    {:ok, supervisor} = MySupervisor.start_link()
    {:ok, _pid} = MySupervisor.start_worker(supervisor, :manual_worker)
    
    # Store original PID manually
    original_pid = Process.whereis(:manual_worker)
    LetItCrash.crash(:manual_worker)
    
    # Check recovery with original PID and custom timeout
    assert LetItCrash.recovered?(:manual_worker, original_pid, timeout: 2000)
    
    Process.exit(supervisor, :shutdown)
  end
end

API

`crash/1` and `crash/2`

Crashes a process by PID or registered name. Follows the Process.exit/2 convention of taking the process as the first argument, which makes piping easy.

# crash/1 - Sends :shutdown signal (can be trapped)
LetItCrash.crash(pid)           # Crash by PID
LetItCrash.crash(:process_name) # Crash by name + auto tracking

# crash/2 - Specify the signal type
LetItCrash.crash(pid, :shutdown)       # Equivalent to crash/1
LetItCrash.crash(pid, :kill)           # :kill signal (cannot be trapped)
LetItCrash.crash(:process_name, :kill) # With registered name

# Piping support:
Process.whereis(:my_process)
|> LetItCrash.crash(:kill)

When to use :kill?

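Use crash(process, :kill) when testing processes that call Process.flag(:trap_exit, true), which is common in GenServers that need to run cleanup logic before they stop. A process that traps exits receives a :shutdown signal sent from the test as an {:EXIT, pid, reason} message and may keep running, whereas :kill cannot be trapped: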

defmodule ScoreCoordinator do
  use GenServer

  def init(_) do
    Process.flag(:trap_exit, true)  # Exit signals arrive as {:EXIT, pid, reason} messages
    {:ok, %{}}
  end

  def handle_info({:EXIT, _pid, _reason}, state) do
    # Cleanup logic here
    {:noreply, state}
  end
end

# In tests:
test "coordinator recovers from forced crash" do
  {:ok, supervisor} = MySupervisor.start_link()
  {:ok, _pid} = MySupervisor.start_coordinator(supervisor, :coordinator)

  # Use :kill to guarantee termination even with trap_exit
  LetItCrash.crash(:coordinator, :kill)

  assert LetItCrash.recovered?(:coordinator)
end
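
The difference between the two signals can be reproduced with plain processes. The following is a small sketch under that assumption: `TrapExitDemo` is a hypothetical module and no LetItCrash function is involved. A process that traps exits survives `:shutdown` sent by another process, and is terminated by `:kill` with the reason `:killed`.

```elixir
defmodule TrapExitDemo do
  # Hypothetical demo module; plain processes, no LetItCrash calls.

  def run do
    parent = self()

    pid =
      spawn(fn ->
        Process.flag(:trap_exit, true)
        send(parent, :ready)
        loop(parent)
      end)

    ref = Process.monitor(pid)

    # Wait until the flag is set before sending any exit signal.
    receive do
      :ready -> :ok
    end

    # :shutdown is trapped: it becomes a message and the process lives on.
    Process.exit(pid, :shutdown)

    trapped =
      receive do
        {:trapped, reason} -> reason
      after
        1_000 -> :nothing_trapped
      end

    alive_after_shutdown = Process.alive?(pid)

    # :kill cannot be trapped: the process dies with reason :killed.
    Process.exit(pid, :kill)

    down_reason =
      receive do
        {:DOWN, ^ref, :process, ^pid, reason} -> reason
      after
        1_000 -> :still_alive
      end

    %{trapped: trapped, alive_after_shutdown: alive_after_shutdown, down_reason: down_reason}
  end

  # Reports every trapped exit signal back to the parent and keeps running.
  defp loop(parent) do
    receive do
      {:EXIT, _from, reason} ->
        send(parent, {:trapped, reason})
        loop(parent)
    end
  end
end
```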

`wait_for_process/1,2`

Waits for a registered process to exist and be alive. Useful in test setup when you need to ensure a process is available before interacting with it.

# Basic usage - waits up to 1000ms (default)
:ok = LetItCrash.wait_for_process(:my_worker)

# With custom timeout for slow-starting processes
:ok = LetItCrash.wait_for_process(:heavy_worker, timeout: 5000)

# With custom polling interval
:ok = LetItCrash.wait_for_process(:worker, timeout: 2000, interval: 100)

Options:

:timeout - Maximum wait time (default: 1000ms)
:interval - Polling interval (default: 50ms)

Returns:

:ok - Process exists and is alive
{:error, :timeout} - Process did not appear within timeout
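
For readers who want a mental model of the `:timeout` and `:interval` options, here is a minimal polling sketch. It is an assumption about how such a wait could be written, not LetItCrash's actual implementation, and `WaitSketch.wait_for/2` is a hypothetical name. The defaults mirror the documented ones.

```elixir
defmodule WaitSketch do
  # Hypothetical helper that illustrates timeout/interval polling.
  # Not LetItCrash's actual implementation.

  def wait_for(name, opts \\ []) do
    timeout = Keyword.get(opts, :timeout, 1_000)
    interval = Keyword.get(opts, :interval, 50)
    deadline = System.monotonic_time(:millisecond) + timeout
    poll(name, interval, deadline)
  end

  defp poll(name, interval, deadline) do
    pid = Process.whereis(name)

    cond do
      # The name is registered to a live process: done.
      is_pid(pid) and Process.alive?(pid) ->
        :ok

      # The deadline has passed without the process appearing.
      System.monotonic_time(:millisecond) >= deadline ->
        {:error, :timeout}

      # Otherwise sleep for one interval and look again.
      true ->
        Process.sleep(interval)
        poll(name, interval, deadline)
    end
  end
end
```

A smaller `:interval` notices the process sooner at the cost of more lookups. A larger `:timeout` only matters when the process is slow to start.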

`recovered?/1,2,3`

Checks whether a registered process has recovered after a crash. Several call forms are available:

# Uses stored PID from crash/1 (recommended)
LetItCrash.recovered?(:process_name)

# With custom timeout/options
LetItCrash.recovered?(:process_name, timeout: 2000, interval: 100)

# Manual PID comparison
LetItCrash.recovered?(:process_name, original_pid)

# Manual PID + options
LetItCrash.recovered?(:process_name, original_pid, timeout: 3000)

Options:

:timeout - Maximum wait time for recovery (default: 1000ms)
:interval - Polling interval (default: 50ms)
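
One way to read "recovered" is that the registered name now points to a different PID than before the crash. The sketch below illustrates that reading with a plain `Supervisor` and `Agent`. It is an assumption for illustration, not the library's code, and `RecoverySketch` and the `:recovery_sketch_counter` name are hypothetical.

```elixir
defmodule RecoverySketch do
  # Hypothetical helper; treats "recovered" as "the name is registered to a
  # PID other than the original one". Not LetItCrash's actual implementation.

  def recovered?(name, original_pid, opts \\ []) do
    timeout = Keyword.get(opts, :timeout, 1_000)
    interval = Keyword.get(opts, :interval, 50)
    deadline = System.monotonic_time(:millisecond) + timeout
    poll(name, original_pid, interval, deadline)
  end

  defp poll(name, original_pid, interval, deadline) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != original_pid ->
        true

      _same_or_missing ->
        if System.monotonic_time(:millisecond) >= deadline do
          false
        else
          Process.sleep(interval)
          poll(name, original_pid, interval, deadline)
        end
    end
  end

  # Starts a supervised Agent, kills it, and checks that the name comes back.
  def demo do
    child = %{
      id: :counter,
      start: {Agent, :start_link, [fn -> 0 end, [name: :recovery_sketch_counter]]}
    }

    {:ok, supervisor} = Supervisor.start_link([child], strategy: :one_for_one)
    original = Process.whereis(:recovery_sketch_counter)
    Process.exit(original, :kill)

    result = recovered?(:recovery_sketch_counter, original)
    Supervisor.stop(supervisor)
    result
  end
end
```

Comparing against the original PID is what distinguishes a restart from a process that never died, which is why the manual forms above take `original_pid`.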

`test_restart/2,3`

Tests that a process recovers by running the same function before and after the crash.

# Basic usage
LetItCrash.test_restart(:process_name, fn ->
  # Test logic executed before AND after crash
end)

# With options
LetItCrash.test_restart(:process_name, fn ->
  # Test logic
end, timeout: 2000)
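
The before/crash/after shape can be sketched without the library to show why the same function sees fresh state on its second run. This is an illustrative assumption, not how `test_restart` is implemented. `RestartSketch` and the `:restart_sketch_counter` name are hypothetical.

```elixir
defmodule RestartSketch do
  # Hypothetical helper: run `fun`, kill the named process, wait for the
  # supervisor to restart it, then run `fun` again.
  def before_and_after(name, fun) do
    first = fun.()

    old_pid = Process.whereis(name)
    ref = Process.monitor(old_pid)
    Process.exit(old_pid, :kill)

    receive do
      {:DOWN, ^ref, :process, ^old_pid, _reason} -> :ok
    end

    wait_for_new(name, old_pid, 50)
    {first, fun.()}
  end

  defp wait_for_new(_name, _old_pid, 0), do: raise("process did not restart")

  defp wait_for_new(name, old_pid, attempts) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        :ok

      _not_yet ->
        Process.sleep(20)
        wait_for_new(name, old_pid, attempts - 1)
    end
  end

  # A supervised counter starts at 0, so incrementing yields 1 both times.
  def demo do
    child = %{
      id: :counter,
      start: {Agent, :start_link, [fn -> 0 end, [name: :restart_sketch_counter]]}
    }

    {:ok, supervisor} = Supervisor.start_link([child], strategy: :one_for_one)

    increment = fn ->
      Agent.get_and_update(:restart_sketch_counter, fn n -> {n + 1, n + 1} end)
    end

    result = before_and_after(:restart_sketch_counter, increment)
    Supervisor.stop(supervisor)
    result
  end
end
```

Here `RestartSketch.demo()` returns `{1, 1}`: the counter is incremented from 0 before the crash, and again from 0 after the restart.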

`assert_clean_registry/2,3`

Verifies that Registry entries are properly cleaned up when a process crashes and recreated when it recovers.

# Basic usage - verifies cleanup and re-registration
LetItCrash.assert_clean_registry(MyApp.Registry, :process_name)

# With custom timeout
LetItCrash.assert_clean_registry(MyApp.Registry, :process_name, timeout: 3000)

This function verifies that your processes:

Remove old Registry entries when crashing
Create new Registry entries when recovering
Point to the correct new PID after restart
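
The cleanup half of that check relies on standard `Registry` behavior: entries belong to the process that registered them and disappear once that process is dead, although the removal may not be visible the instant the process exits. A minimal sketch with the standard library only follows. `RegistrySketch` is a hypothetical module and nothing here is LetItCrash's implementation.

```elixir
defmodule RegistrySketch do
  # Hypothetical demo; standard Registry only, no LetItCrash calls.

  def run do
    {:ok, registry} = Registry.start_link(keys: :unique, name: __MODULE__.Names)
    parent = self()

    # The spawned process registers itself, as a named worker would.
    {pid, ref} =
      spawn_monitor(fn ->
        {:ok, _owner} = Registry.register(__MODULE__.Names, :worker, :v1)
        send(parent, :registered)
        Process.sleep(:infinity)
      end)

    receive do
      :registered -> :ok
    end

    [{^pid, :v1}] = Registry.lookup(__MODULE__.Names, :worker)

    Process.exit(pid, :kill)

    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
    end

    # Poll rather than assert immediately: cleanup is not instantaneous.
    result = await_empty(__MODULE__.Names, :worker, 50)
    Supervisor.stop(registry)
    result
  end

  defp await_empty(_registry, _key, 0), do: {:error, :still_registered}

  defp await_empty(registry, key, attempts) do
    case Registry.lookup(registry, key) do
      [] ->
        :ok

      _entries ->
        Process.sleep(20)
        await_empty(registry, key, attempts - 1)
    end
  end
end
```

Stale entries tend to come from hand-rolled lookups, such as a PID cached in another process's state or in ETS, rather than from `Registry` itself. The re-registration and new-PID checks cover that side.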

`verify_ets_cleanup/2,3`

Monitors ETS table entries to verify proper cleanup during process crashes.

# Verify entry is cleaned up (default behavior)
LetItCrash.verify_ets_cleanup(:my_cache, :process_data)

# Custom cleanup expectations
LetItCrash.verify_ets_cleanup(:shared_table, :key, 
  expect_cleanup: true,
  expect_recreate: false,
  timeout: 1500
)

# Verify recreation after cleanup
LetItCrash.verify_ets_cleanup(:cache_table, :data_key,
  expect_cleanup: true,
  expect_recreate: true
)

Options:

:expect_cleanup - Whether entry should be removed (default: true)
:expect_recreate - Whether entry should be recreated (default: false)
:timeout - Maximum wait time (default: 1000ms)
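
The bug class behind this check is easy to reproduce: an ETS row outlives the process that wrote it whenever the table is owned by a different process. The sketch below shows that with the standard library only. `EtsOrphanSketch` is a hypothetical module and the code makes no claim about how `verify_ets_cleanup` works internally.

```elixir
defmodule EtsOrphanSketch do
  # Hypothetical demo; shows a row orphaned by a crashed writer.

  def run do
    # The caller owns the table, so it survives the writer's death.
    table = :ets.new(:ets_orphan_sketch, [:set, :public])
    parent = self()

    {writer, ref} =
      spawn_monitor(fn ->
        :ets.insert(table, {:process_data, self()})
        send(parent, :written)
        Process.sleep(:infinity)
      end)

    receive do
      :written -> :ok
    end

    Process.exit(writer, :kill)

    receive do
      {:DOWN, ^ref, :process, ^writer, _reason} -> :ok
    end

    # Nothing removed the row: it still points at the dead writer.
    [{:process_data, stale_pid}] = :ets.lookup(table, :process_data)
    orphaned? = not Process.alive?(stale_pid)

    :ets.delete(table)
    orphaned?
  end
end
```

`EtsOrphanSketch.run()` returns `true`. A process killed with `:kill` gets no chance to run cleanup code, so removing such rows has to happen elsewhere, for example in a monitoring process or when the replacement process starts.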

`assert_supervision_impact/3`

Crashes a child process and verifies the expected impact on its siblings within a supervision tree. Validates that your chosen supervision strategy (:one_for_one, :one_for_all, :rest_for_one) behaves as expected for your specific tree.

# Verify one_for_one: only the crashed child restarts
LetItCrash.assert_supervision_impact(:my_supervisor, :worker_a,
  expect: [
    worker_a: :restarted,
    worker_b: :alive,
    worker_c: :alive
  ]
)

# Verify one_for_all: all children restart
LetItCrash.assert_supervision_impact(:my_supervisor, :worker_a,
  expect: [
    worker_a: :restarted,
    worker_b: :restarted,
    worker_c: :restarted
  ]
)

# Verify rest_for_one: crashed child and later siblings restart
LetItCrash.assert_supervision_impact(:my_supervisor, :worker_b,
  expect: [
    worker_a: :alive,
    worker_b: :restarted,
    worker_c: :restarted
  ]
)

Each status can be paired with an assertion function to verify application-level behavior after the supervision event — not just that processes restarted, but that your code actually handles the restart correctly:

LetItCrash.assert_supervision_impact(:my_supervisor, :coordinator,
  expect: [
    coordinator: {:restarted, fn ->
      # Verify the coordinator came back in a valid state
      assert MyCoordinator.get_status() == :idle
    end},
    worker_a: {:alive, fn ->
      # Verify the sibling is still functional
      assert MyWorker.ready?(:worker_a)
    end},
    worker_b: :restarted
  ]
)

Expected Statuses:

:restarted - Process is alive with a different PID (was restarted by supervisor)
:alive - Process is alive with the same PID (unaffected)
:stopped - Process is no longer registered or alive (expected for children with restart: :temporary)

Options:

:expect - (required) Keyword list of {child_name, expected_status} or {child_name, {expected_status, assertion_fn}}
:signal - Exit signal: :shutdown (default) or :kill
:timeout - Maximum wait time (default: 2000ms)
:interval - Polling interval (default: 50ms)
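
The three statuses can be derived by comparing each child's PID before and after the crash. The sketch below does that against a plain `Supervisor` with three named Agents. It is an illustration under stated assumptions, not the library's implementation. `ImpactSketch` and the `:impact_*` names are hypothetical.

```elixir
defmodule ImpactSketch do
  # Hypothetical demo; classifies children by comparing PIDs before and
  # after one child is killed. Not LetItCrash's actual implementation.
  @names [:impact_a, :impact_b, :impact_c]

  def run(strategy, crashed) do
    children =
      for name <- @names do
        %{id: name, start: {Agent, :start_link, [fn -> :ok end, [name: name]]}}
      end

    {:ok, supervisor} = Supervisor.start_link(children, strategy: strategy)
    pids_before = Map.new(@names, &{&1, Process.whereis(&1)})

    Process.exit(pids_before[crashed], :kill)

    # First wait for the crashed child to come back under a new PID ...
    await(fn ->
      pid = Process.whereis(crashed)
      is_pid(pid) and pid != pids_before[crashed]
    end, 50)

    # ... then for every sibling to be registered again (some strategies
    # stop siblings and restart them after the crashed child).
    await(fn -> Enum.all?(@names, &is_pid(Process.whereis(&1))) end, 50)

    statuses =
      Enum.map(@names, fn name ->
        pid = Process.whereis(name)

        cond do
          is_nil(pid) -> {name, :stopped}
          pid == pids_before[name] -> {name, :alive}
          true -> {name, :restarted}
        end
      end)

    Supervisor.stop(supervisor)
    statuses
  end

  defp await(check, attempts) when attempts > 0 do
    if check.() do
      :ok
    else
      Process.sleep(20)
      await(check, attempts - 1)
    end
  end

  defp await(_check, 0), do: raise("timed out waiting for restart")
end
```

For example, `ImpactSketch.run(:rest_for_one, :impact_b)` returns `[impact_a: :alive, impact_b: :restarted, impact_c: :restarted]`, matching the third example above.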

Advanced Usage Examples

Testing Registry and ETS Cleanup

defmodule MyAppTest do
  use ExUnit.Case
  use LetItCrash

  test "server cleans up resources properly on crash" do
    # Setup: Start Registry and ETS table
    {:ok, _} = Registry.start_link(keys: :unique, name: MyApp.Registry)
    :ets.new(:app_cache, [:set, :public, :named_table])
    
    {:ok, supervisor} = MySupervisor.start_link()
    {:ok, _pid} = MySupervisor.start_worker(supervisor, :resource_server)

    # The server registers itself on start; the test inserts an ETS row to stand in for data the server would write
    assert [{_pid, _}] = Registry.lookup(MyApp.Registry, :resource_server)
    :ets.insert(:app_cache, {:server_data, "important_data"})

    # Crash and verify proper cleanup + recovery
    LetItCrash.crash(:resource_server)
    
    # Verify Registry cleanup and re-registration
    assert :ok = LetItCrash.assert_clean_registry(MyApp.Registry, :resource_server)
    
    # Verify ETS cleanup
    assert :ok = LetItCrash.verify_ets_cleanup(:app_cache, :server_data)

    Process.exit(supervisor, :shutdown)
  end
end

Testing Supervision Strategy Impact

A common real-world scenario: you have a scoring system with a coordinator and multiple workers under a :one_for_all strategy. When the coordinator crashes mid-calculation, you need to verify that workers don't retain stale partial results and that the coordinator comes back ready to accept new work.

defmodule ScoringSystemTest do
  use ExUnit.Case
  use LetItCrash

  test "coordinator crash resets workers to clean state" do
    {:ok, supervisor} = ScoringSupervisor.start_link()

    # Workers are processing scores
    ScoreWorker.submit(:worker_a, %{team: "A", score: 42})
    ScoreWorker.submit(:worker_b, %{team: "B", score: 38})
    ScoreCoordinator.begin_normalization(:coordinator)

    # Coordinator crashes mid-normalization
    LetItCrash.assert_supervision_impact(supervisor, :coordinator,
      signal: :kill,
      expect: [
        coordinator: {:restarted, fn ->
          # Coordinator must come back idle, not stuck in :normalizing
          assert ScoreCoordinator.get_status(:coordinator) == :idle
          # Must be able to accept new work immediately
          assert :ok = ScoreCoordinator.begin_normalization(:coordinator)
        end},
        worker_a: {:restarted, fn ->
          # Workers must not retain partial/stale scores
          assert ScoreWorker.get_pending(:worker_a) == []
        end},
        worker_b: {:restarted, fn ->
          assert ScoreWorker.get_pending(:worker_b) == []
        end}
      ]
    )

    Process.exit(supervisor, :shutdown)
  end
end

Combined Testing Workflow

test "complete crash recovery validation" do
  {:ok, supervisor} = MySupervisor.start_link()
  {:ok, _pid} = MySupervisor.start_worker(supervisor, :full_test_server)

  # Test complete recovery workflow
  LetItCrash.test_restart(:full_test_server, fn ->
    # This runs before AND after crash
    MyServer.increment_counter()
    assert MyServer.get_counter() == 1  # Will be reset to 0, then incremented to 1
  end)

  # Verify additional cleanup
  LetItCrash.assert_clean_registry(MyApp.Registry, :full_test_server)
  LetItCrash.verify_ets_cleanup(:server_cache, :counter_data)

  Process.exit(supervisor, :shutdown)
end

Important Notes

⚠️ Requires Supervision: The recovered?/1 and test_restart/2 functions only work with supervised processes. Unsupervised processes won't restart after crashes.

🔄 State Reset: Process state is reset to initial values after restart (this is normal OTP behavior).

🏷️ Named Processes: Recovery detection only works with registered (named) processes.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for details on:

🐛 Reporting bugs
💡 Suggesting features
🔧 Submitting pull requests
🧪 Running tests

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

📝 Open an issue for bug reports or feature requests
🤝 Check our Contributing Guide to help improve the project
⭐ Star the project if you find it useful!

Embrace the crash, test the recovery! 💥➡️✅
