ADR-002: Async/Await Throughout the Codebase

Status

Accepted

Date

2024-01-15

Context

A MUD engine is inherently concurrent. Hundreds of players connect simultaneously via Telnet and WebSocket, each issuing commands, receiving room descriptions, and interacting with AI-powered NPCs. The engine must also handle a tick-based game loop, periodic AI API calls (with high latency), database I/O, and external bridge connections (IRC, Discord, RSS).

Traditional MUD engines in Python either use threads (one per connection) or blocking I/O with a homegrown event loop. Both approaches have significant drawbacks: threads introduce synchronization complexity and memory overhead, while custom event loops are hard to maintain and lack ecosystem support.

Decision

Use Python's asyncio as the single concurrency model for all I/O operations throughout the entire codebase. Specifically:

  • The game tick loop in GameEngine._tick_loop() (packages/maid-engine/src/maid_engine/core/engine.py) is an async loop using asyncio.sleep() for timing.
  • All ECS System.update(delta) methods are async.
  • All ContentPack lifecycle hooks (on_load, on_unload) are async.
  • The network layers are async: Telnet (packages/maid-engine/src/maid_engine/net/telnet/) uses asyncio.StreamReader and asyncio.StreamWriter, and WebSocket (packages/maid-engine/src/maid_engine/net/web/) uses FastAPI async endpoints.
  • The DocumentStore API and AccountManager authentication methods are all async.
  • External bridges like IRCBridge (packages/maid-engine/src/maid_engine/bridges/irc_bridge.py) use asyncio.open_connection() for TCP and asyncio.Event for coordination.
  • The ConversationManager for AI dialogue has all-async methods to support async DocumentStore persistence.
  • Hot reload operations use asyncio.Event for tick-loop pause/resume handshaking (_hot_reload_pause, _hot_reload_paused_ack in GameEngine).
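
The tick loop and the hot-reload handshake described above can be sketched roughly as follows. This is a minimal illustration, not the actual GameEngine code: MiniEngine, update_systems, hot_reload, and the _resume event are hypothetical names, and the exact semantics of _hot_reload_pause and _hot_reload_paused_ack are assumed here (the reloader sets the first, the tick loop sets the second once it is parked between ticks).

```python
import asyncio
import time


class MiniEngine:
    """Hypothetical stand-in for GameEngine; construct it inside a running loop."""

    def __init__(self, tick_rate=20.0):
        self.tick_interval = 1.0 / tick_rate
        self.ticks = 0
        self.running = False
        # Assumed semantics: the reloader sets "pause", the loop answers with "ack".
        self._hot_reload_pause = asyncio.Event()
        self._hot_reload_paused_ack = asyncio.Event()
        self._resume = asyncio.Event()  # hypothetical: releases the parked loop

    async def update_systems(self, delta):
        # Stand-in for awaiting every ECS System.update(delta).
        self.ticks += 1

    async def _tick_loop(self):
        self.running = True
        last = time.monotonic()
        while self.running:
            if self._hot_reload_pause.is_set():
                # Park between ticks and tell the reloader it is safe to proceed.
                self._hot_reload_paused_ack.set()
                await self._resume.wait()
                self._resume.clear()
                self._hot_reload_paused_ack.clear()
                last = time.monotonic()  # do not count the pause as game time
            started = time.monotonic()
            await self.update_systems(started - last)
            last = started
            elapsed = time.monotonic() - started
            # Sleeping yields to connection handlers and background tasks.
            await asyncio.sleep(max(0.0, self.tick_interval - elapsed))

    async def hot_reload(self, reload_fn):
        # Assumes the tick loop is running; otherwise the ack never arrives.
        self._hot_reload_pause.set()
        await self._hot_reload_paused_ack.wait()
        try:
            await reload_fn()
        finally:
            self._hot_reload_pause.clear()
            self._resume.set()


async def demo():
    engine = MiniEngine(tick_rate=100.0)
    loop_task = asyncio.create_task(engine._tick_loop())
    await asyncio.sleep(0.05)  # let a few ticks run
    seen = []

    async def reload_fn():
        before = engine.ticks
        await asyncio.sleep(0.05)  # pretend to reload a content pack
        seen.append((before, engine.ticks))

    await engine.hot_reload(reload_fn)
    engine.running = False
    await loop_task
    return seen[0]
```

Running asyncio.run(demo()) returns the tick count observed before and after the simulated reload. In this sketch the two values match, because the loop stays parked until the reload finishes.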

Consequences

Positive

  • Single-threaded concurrency: No locks needed for most game state mutations. The World, EntityManager, and SystemManager can be accessed without synchronization because a coroutine runs uninterrupted between await points. Only updates that span an await can be interleaved with other coroutines and need extra care.
  • Natural I/O multiplexing: Hundreds of Telnet/WebSocket connections are handled by a single thread. AI API calls to Anthropic/OpenAI/Ollama providers naturally yield during network waits.
  • Ecosystem compatibility: FastAPI (for the admin REST API), aiohttp (for AI provider HTTP clients), and asyncio streams (for Telnet/IRC) all share the same event loop without adapter layers.
  • Tick-loop integration: asyncio.sleep() in the tick loop naturally yields to connection handlers and background tasks between ticks.
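
As a rough illustration of the first two points, the sketch below runs 200 simulated players on one thread alongside a slow, simulated AI call. Everything in it is hypothetical (player, slow_npc_reply, run_world, and the world dictionary are not engine APIs). It only shows that a read-modify-write with no await in the middle needs no lock, and that a high-latency call does not hold up other coroutines.

```python
import asyncio


async def player(world, commands):
    for _ in range(commands):
        # No await between the read and the write, so no other coroutine
        # can interleave here and no lock is needed.
        world["gold"] += 1
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would


async def slow_npc_reply():
    # Stand-in for a high-latency AI provider call.
    await asyncio.sleep(0.2)
    return "The innkeeper nods."


async def run_world(players=200, commands=50):
    world = {"gold": 0}  # hypothetical shared game state
    loop = asyncio.get_running_loop()
    started = loop.time()
    # gather() returns results in argument order, so the reply comes first.
    reply, *_ = await asyncio.gather(
        slow_npc_reply(),
        *(player(world, commands) for _ in range(players)),
    )
    return world["gold"], reply, loop.time() - started
```

Calling asyncio.run(run_world()) returns the final counter, the NPC reply, and the elapsed time. The counter equals players × commands because no increment is ever lost.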

Negative

  • Viral async: Every caller of an async function must itself be async and await it. A single blocking synchronous call anywhere in the chain stalls the entire event loop. This is why even the EventBus provides emit_sync for fire-and-forget events within the tick.
  • Testing complexity: Tests require pytest-asyncio and, unless its auto mode is enabled, @pytest.mark.asyncio decorators. Mocking async methods requires AsyncMock. Test setup often needs an event loop fixture.
  • CPU-bound work: Long-running computations (e.g., A* pathfinding in GridManager, procedural terrain generation in SimplexNoise) can block the event loop. These must be kept fast or offloaded to thread executors (or to a process pool where pure-Python computation would still contend for the GIL).
  • Stack traces: Async stack traces are harder to read than synchronous ones, especially when tracing through the tick loop, system manager, and individual system updates.
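
The blocking problem and the executor workaround can be sketched as below. This is an illustration under stated assumptions, not engine code: blocking_work, heartbeat, and measure are hypothetical names, and time.sleep() stands in for a long synchronous computation so that the result is deterministic. For genuinely CPU-bound pure-Python work a thread still contends for the GIL, so a process pool may be the better executor.

```python
import asyncio
import contextlib
import time


def blocking_work(seconds):
    # Stand-in for a long synchronous computation (e.g. a pathfinding call).
    time.sleep(seconds)
    return seconds


async def heartbeat(counter):
    # Stand-in for everything else the loop should keep doing (ticks, I/O).
    while True:
        await asyncio.sleep(0.01)
        counter["beats"] += 1


async def measure(offload):
    """Count heartbeats that occur while the blocking work is in progress."""
    counter = {"beats": 0}
    task = asyncio.create_task(heartbeat(counter))
    await asyncio.sleep(0)  # let the heartbeat start its first sleep
    if offload:
        loop = asyncio.get_running_loop()
        # None selects the loop's default ThreadPoolExecutor.
        await loop.run_in_executor(None, blocking_work, 0.2)
    else:
        blocking_work(0.2)  # called directly: the event loop is stalled
    beats = counter["beats"]
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return beats
```

In this sketch, asyncio.run(measure(offload=False)) reports zero heartbeats because nothing else runs during the direct call, while asyncio.run(measure(offload=True)) reports several.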

Alternatives Considered

Threading (one thread per connection)

Rejected due to GIL contention, memory overhead of hundreds of threads, and the need for locks around all shared game state (World, EntityManager, rooms).

Gevent / Green Threads

Gevent patches standard library I/O to be cooperative. While this avoids explicit async/await syntax, it introduces implicit yielding that is harder to reason about, and it is incompatible with many modern Python libraries (FastAPI, Pydantic v2).

Synchronous with Select-Based Event Loop

Writing a custom select()-based loop (as many classic MUDs do) was rejected because it would preclude use of FastAPI for the admin API, modern HTTP clients for AI providers, and the broader asyncio ecosystem.