ADR-007: Known Scale Limitations and Tradeoffs

Status

Accepted

Date

2024-02-15

Context

MAID runs as a single Python process using asyncio for concurrency. The GameEngine (packages/maid-engine/src/maid_engine/core/engine.py) drives a tick-based loop where all ECS systems are updated sequentially. The World (packages/maid-engine/src/maid_engine/core/world.py) maintains all game state in-process: the EntityManager holds all entities in a dictionary keyed by UUID, the RoomIndex tracks entity-room mappings, and the GridManager provides coordinate-based spatial indexing.
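The in-memory entity storage described above can be sketched as follows. This is a simplified illustration, not the actual `EntityManager` from `packages/maid-engine`: the dictionary layout, index structures, and method signatures here are assumptions, though the query methods mirror the `with_components`/`with_tag` set-intersection approach the ADR describes.

```python
import uuid


class EntityManager:
    """Illustrative in-memory entity store. Entities live in a dict keyed
    by UUID; component and tag indexes enable set-intersection queries."""

    def __init__(self):
        self._entities = {}        # entity id -> entity data
        self._by_component = {}    # component name -> set of entity ids
        self._by_tag = {}          # tag -> set of entity ids

    def create(self, components, tags=()):
        eid = uuid.uuid4()
        self._entities[eid] = {"components": dict(components), "tags": set(tags)}
        for name in components:
            self._by_component.setdefault(name, set()).add(eid)
        for tag in tags:
            self._by_tag.setdefault(tag, set()).add(eid)
        return eid

    def with_components(self, *names):
        # In-memory set intersection across per-component index sets
        sets = [self._by_component.get(n, set()) for n in names]
        return set.intersection(*sets) if sets else set()

    def with_tag(self, tag):
        return set(self._by_tag.get(tag, set()))
```

Because queries are plain set intersections over in-process indexes, there is no I/O and no query planner; the cost is proportional to the smallest index set involved.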

Python's Global Interpreter Lock (GIL) means only one thread executes Python bytecode at a time. While asyncio provides cooperative concurrency for I/O-bound work, CPU-bound work within a tick (ECS system updates, pathfinding, procedural generation) runs single-threaded.

The tick loop in GameEngine._tick_loop() updates all systems sequentially: SystemManager.update(delta) iterates through the _enabled_systems list and awaits each system's update() method in priority order. If the sum of all system update times exceeds the tick interval (default: 250ms at 4 ticks/second), the engine records a tick overrun (_tick_overruns).
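The shape of that loop can be sketched as below. This is a hedged illustration of the pattern, not the real `GameEngine._tick_loop()`: the `max_ticks` parameter and the exact sleep/overrun bookkeeping are assumptions for demonstration.

```python
import asyncio
import time


class GameEngine:
    """Toy tick loop: systems run sequentially, overruns are counted."""

    def __init__(self, systems, tick_interval=0.25):
        self._systems = systems            # async callables, priority order
        self._tick_interval = tick_interval
        self._tick_overruns = 0
        self._running = False

    async def _tick_loop(self, max_ticks=None):
        self._running = True
        ticks = 0
        last = time.monotonic()
        while self._running:
            start = time.monotonic()
            delta = start - last
            last = start
            for system in self._systems:   # sequential: no intra-tick parallelism
                await system(delta)
            elapsed = time.monotonic() - start
            if elapsed > self._tick_interval:
                self._tick_overruns += 1   # tick blew its budget
                sleep_for = 0.0
            else:
                sleep_for = self._tick_interval - elapsed
            await asyncio.sleep(sleep_for)
            ticks += 1
            if max_ticks is not None and ticks >= max_ticks:
                self._running = False
```

The key property the ADR relies on is visible here: a single slow system delays every system after it, and there is no mechanism to reclaim the lost time within the tick.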

Decision

Target a scale of 100-500 concurrent players per server instance with a single world. Accept the following design constraints:

  • Single process: All game state lives in one Python process. No distributed state, no inter-process communication for game logic.
  • Single world instance: The World object is a singleton within a GameEngine. While WorldManager (packages/maid-engine/src/maid_engine/core/multiworld.py) supports multiple worlds with portals, all worlds run in the same process.
  • In-memory entity storage: EntityManager stores all entities in a Python dictionary. Queries (with_components, with_tag) use in-memory set intersection. There is no database-backed entity query layer.
  • Sequential tick processing: Systems run one after another within a tick. There is no parallel system execution within a single tick.
  • AI call budgeting: AI-powered NPC dialogue is rate-limited per player (MAID_AI_DIALOGUE_PER_PLAYER_RATE_LIMIT_RPM=10) and globally (MAID_AI_DIALOGUE_GLOBAL_RATE_LIMIT_RPM=60) to prevent AI API calls from consuming the tick budget. The RateLimiter uses datetime/deque for sliding window enforcement.
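The deque-based sliding window mentioned in the last bullet can be sketched like this. The real RateLimiter's API is not shown in this ADR, so the class and method names below are illustrative; only the deque-of-timestamps technique is taken from the text.

```python
from collections import deque
import time


class RateLimiter:
    """Sliding-window limiter: keep a deque of recent call timestamps and
    evict any that have aged out of the window before admitting a call."""

    def __init__(self, max_calls_per_minute, window_seconds=60.0):
        self._max_calls = max_calls_per_minute
        self._window = window_seconds
        self._timestamps = deque()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop timestamps that have fallen out of the sliding window
        while self._timestamps and now - self._timestamps[0] >= self._window:
            self._timestamps.popleft()
        if len(self._timestamps) < self._max_calls:
            self._timestamps.append(now)
            return True
        return False


# Per-player limit matching MAID_AI_DIALOGUE_PER_PLAYER_RATE_LIMIT_RPM=10
per_player = RateLimiter(max_calls_per_minute=10)
```

A sliding window avoids the burst-at-the-boundary problem of fixed windows: a player cannot fire 10 calls at the end of one minute and 10 more at the start of the next.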

Profiling tools (@profile, @timing, @memory commands) are built into the engine to help operators identify bottlenecks. The TickCollector (packages/maid-engine/src/maid_engine/profiling/) tracks per-tick timing, and the SystemManager can disable individual systems via disable().
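The kind of per-system timing data that TickCollector gathers and @timing reports can be approximated with a small wrapper. Everything below is hypothetical (the real collector lives in packages/maid-engine/src/maid_engine/profiling/ and may look nothing like this); it only demonstrates the measure-and-rank idea.

```python
import asyncio
import time
from collections import defaultdict


class TickTimings:
    """Toy per-system timing collector in the spirit of TickCollector."""

    def __init__(self):
        self.totals = defaultdict(float)   # system name -> accumulated seconds
        self.calls = defaultdict(int)

    def timed(self, name, fn):
        """Wrap an async system update so each call's duration is recorded."""
        async def wrapper(delta):
            start = time.perf_counter()
            try:
                return await fn(delta)
            finally:
                self.totals[name] += time.perf_counter() - start
                self.calls[name] += 1
        return wrapper

    def slowest(self):
        """(name, mean seconds per call) pairs, worst first."""
        return sorted(
            ((name, self.totals[name] / self.calls[name]) for name in self.calls),
            key=lambda item: item[1],
            reverse=True,
        )
```

Paired with SystemManager.disable(), this is enough for an operator workflow: find the system dominating the tick, disable it, and confirm the overrun count stops climbing.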

Consequences

Positive

  • Simplicity: No distributed state coordination, no consensus protocols, no cache invalidation across processes. All game state is authoritative and immediately consistent because it lives in one process.
  • Low latency: In-memory entity queries and room lookups are microsecond-scale. There is no network round-trip for game state access.
  • Easy debugging: All state is visible in a single process. The @examine, @stat, and @memory admin commands can inspect any entity or system directly.
  • Deterministic behavior: Sequential tick processing means system execution order is predictable and reproducible, which is critical for game logic correctness (e.g., damage must be applied before death checks).

Negative

  • Vertical scaling only: To handle more players, you need a faster CPU, not more servers. There is no horizontal scaling path without fundamental architectural changes.
  • Memory ceiling: All entities, rooms, and wilderness cache reside in process memory. A world with 100,000 rooms and 50,000 entities may consume several gigabytes. The WildernessManager mitigates this with stale room cleanup, but there is no paging to disk.
  • Tick budget pressure: With 500 players, each system must process all relevant entities within the tick interval. A combat system iterating over 500 players with 10 NPCs each (5,500 entities) must complete in a fraction of the 250ms budget.
  • Single point of failure: If the process crashes, the entire game goes down. Persistence is handled by periodic saves to the DocumentStore, not real-time replication.
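The tick budget pressure above is worth making concrete with back-of-envelope arithmetic. The 20% combat share below is an assumption chosen for illustration; the player and NPC counts come from the bullet.

```python
# Back-of-envelope per-entity budget at the target scale.
TICK_BUDGET_S = 0.250            # 4 ticks/second
PLAYERS = 500
NPCS_PER_PLAYER = 10

entities = PLAYERS + PLAYERS * NPCS_PER_PLAYER   # 5,500 combat-relevant entities

# Assume the combat system may consume 20% of the tick (illustrative):
combat_share = 0.20
per_entity_us = (TICK_BUDGET_S * combat_share / entities) * 1e6
print(f"{entities} entities, {per_entity_us:.1f} us each")
```

Roughly 9 microseconds per entity leaves room for simple attribute math but not for per-entity pathfinding or AI calls inside the combat loop, which is why the AI call budgeting above routes such work outside the tick budget.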

Alternatives Considered

Multi-Process Architecture

Running multiple Python processes (e.g., one per zone) with shared state via Redis or a message queue was considered. Rejected because it introduces distributed state consistency problems (what happens when a player moves between zones?), requires serialization of all game state mutations, and dramatically increases operational complexity for the expected 100-500 player target.

Distributed ECS

Systems like SpatialOS or custom distributed ECS frameworks split entity ownership across workers. Rejected because the coordination overhead for a text-based MUD is not justified. MUDs have far fewer entities and far simpler physics than 3D MMOs where distributed ECS is warranted.

Sharding by World/Realm

Running separate server instances per "realm" or "shard" with no shared state was considered. This remains a viable future option (each shard runs its own GameEngine) but was deferred because it requires a meta-service for realm selection and does not help with single-world density. The maid-registry service could potentially evolve to support shard discovery.

PyPy or Cython Optimization

Using PyPy for JIT compilation or Cython for hot paths was considered for improving single-process throughput. Deferred because the current CPython performance is adequate for the target scale, and these runtimes have compatibility constraints with C extensions used by dependencies (Pydantic, cryptographic libraries).