# ADR-007: Known Scale Limitations and Tradeoffs

## Status

Accepted

## Date

2024-02-15

## Context
MAID runs as a single Python process using asyncio for concurrency. The `GameEngine`
(`packages/maid-engine/src/maid_engine/core/engine.py`) drives a tick-based loop
in which all ECS systems are updated sequentially. The `World`
(`packages/maid-engine/src/maid_engine/core/world.py`) maintains all game state
in-process: the `EntityManager` holds all entities in a dictionary keyed by UUID,
the `RoomIndex` tracks entity-room mappings, and the `GridManager` provides
coordinate-based spatial indexing.
Python's Global Interpreter Lock (GIL) means only one thread executes Python bytecode at a time. While asyncio provides cooperative concurrency for I/O-bound work, CPU-bound work within a tick (ECS system updates, pathfinding, procedural generation) runs single-threaded.
The tick loop in `GameEngine._tick_loop()` updates all systems sequentially:
`SystemManager.update(delta)` iterates through the `_enabled_systems` list
and awaits each system's `update()` method in priority order. If the sum of all
system update times exceeds the tick interval (default: 250 ms at 4 ticks/second),
the engine records a tick overrun (`_tick_overruns`).
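The overrun accounting described above can be sketched as follows. This is a minimal, hypothetical simplification (the class, field, and method names here are illustrative, not the engine's actual API); the real `GameEngine._tick_loop()` awaits `SystemManager.update(delta)` over prioritized systems.

```python
import asyncio
import time

class TickLoopSketch:
    """Illustrative sequential tick loop with overrun tracking (not the
    real GameEngine; a sketch of the pattern ADR-007 describes)."""

    def __init__(self, tick_interval: float = 0.25):
        self.tick_interval = tick_interval  # 250 ms at 4 ticks/second
        self.tick_overruns = 0
        self.systems = []  # async callables taking a delta, in priority order

    async def run(self, ticks: int) -> None:
        for _ in range(ticks):
            start = time.perf_counter()
            for system in self.systems:  # sequential: no intra-tick parallelism
                await system(self.tick_interval)
            elapsed = time.perf_counter() - start
            if elapsed > self.tick_interval:
                self.tick_overruns += 1  # budget blown: record a tick overrun
            else:
                await asyncio.sleep(self.tick_interval - elapsed)
```

Note that a CPU-bound system body (e.g., pathfinding) blocks the event loop for its full duration despite `async`/`await`; cooperative scheduling only helps between awaits.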
## Decision
Target a scale of 100-500 concurrent players per server instance with a single world. Accept the following design constraints:
- Single process: All game state lives in one Python process. No distributed state, no inter-process communication for game logic.
- Single world instance: The `World` object is a singleton within a `GameEngine`. While `WorldManager` (`packages/maid-engine/src/maid_engine/core/multiworld.py`) supports multiple worlds connected by portals, all worlds run in the same process.
- In-memory entity storage: `EntityManager` stores all entities in a Python dictionary. Queries (`with_components`, `with_tag`) use in-memory set intersection. There is no database-backed entity query layer.
- Sequential tick processing: Systems run one after another within a tick. There is no parallel system execution within a single tick.
- AI call budgeting: AI-powered NPC dialogue is rate-limited per player (`MAID_AI_DIALOGUE_PER_PLAYER_RATE_LIMIT_RPM=10`) and globally (`MAID_AI_DIALOGUE_GLOBAL_RATE_LIMIT_RPM=60`) to prevent AI API calls from consuming the tick budget. The `RateLimiter` uses `datetime`/`deque` for sliding-window enforcement.
Profiling tools (the `@profile`, `@timing`, and `@memory` commands) are built into the
engine to help operators identify bottlenecks. The `TickCollector`
(`packages/maid-engine/src/maid_engine/profiling/`) tracks per-tick timing,
and the `SystemManager` can disable individual systems via `disable()`.
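Per-system timing collection in the spirit of the `TickCollector` can be sketched as below. All names here are hypothetical; the real profiling package's interface is not reproduced.

```python
import time
from collections import defaultdict

class SystemTimings:
    """Illustrative per-system timing collector (assumed names, not the
    actual TickCollector API)."""

    def __init__(self):
        self.samples = defaultdict(list)  # system name -> durations in seconds

    def timed(self, func):
        """Decorator recording each call's wall-clock duration."""
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                self.samples[func.__name__].append(time.perf_counter() - start)
        return wrapper

    def slowest(self) -> str:
        """Name of the system with the highest mean duration."""
        return max(self.samples,
                   key=lambda name: sum(self.samples[name]) / len(self.samples[name]))
```

An operator-facing command like `@timing` could then report which system is eating the tick budget and feed a decision to `disable()` it.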
## Consequences

### Positive
- Simplicity: No distributed state coordination, no consensus protocols, no cache invalidation across processes. All game state is authoritative and immediately consistent because it lives in one process.
- Low latency: In-memory entity queries and room lookups are microsecond-scale. There is no network round-trip for game state access.
- Easy debugging: All state is visible in a single process. The `@examine`, `@stat`, and `@memory` admin commands can inspect any entity or system directly.
- Deterministic behavior: Sequential tick processing means system execution order is predictable and reproducible, which is critical for game logic correctness (e.g., damage must be applied before death checks).
### Negative
- Vertical scaling only: To handle more players, you need a faster CPU, not more servers. There is no horizontal scaling path without fundamental architectural changes.
- Memory ceiling: All entities, rooms, and wilderness cache reside in process memory. A world with 100,000 rooms and 50,000 entities may consume several gigabytes. The `WildernessManager` mitigates this with stale-room cleanup, but there is no paging to disk.
- Tick budget pressure: With 500 players, each system must process all relevant entities within the tick interval. A combat system iterating 500 players with 10 NPCs each must complete in a fraction of 250 ms.
- Single point of failure: If the process crashes, the entire game goes down. Persistence is handled by periodic saves to the `DocumentStore`, not real-time replication.
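The tick-budget pressure above is worth quantifying with back-of-envelope arithmetic. Assuming the combat figures from the list (500 players, 10 NPCs each) and, pessimistically, that combat alone got the entire tick:

```python
TICK_INTERVAL_MS = 250       # 4 ticks/second
PLAYERS = 500
NPCS_PER_PLAYER = 10

# Players plus their NPCs as combat participants.
entities = PLAYERS * (1 + NPCS_PER_PLAYER)           # 5,500 entities
budget_per_entity_us = TICK_INTERVAL_MS * 1000 / entities

print(f"{entities} entities -> {budget_per_entity_us:.0f} us each per tick")
```

That is roughly 45 microseconds per entity per tick, and in practice far less, since combat shares the tick with every other enabled system.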
## Alternatives Considered

### Multi-Process Architecture
Running multiple Python processes (e.g., one per zone) with shared state via Redis or a message queue was considered. Rejected because it introduces distributed state consistency problems (what happens when a player moves between zones?), requires serialization of all game state mutations, and dramatically increases operational complexity for the expected 100-500 player target.
### Distributed ECS
Systems like SpatialOS or custom distributed ECS frameworks split entity ownership across workers. Rejected because the coordination overhead for a text-based MUD is not justified. MUDs have far fewer entities and far simpler physics than 3D MMOs where distributed ECS is warranted.
### Sharding by World/Realm
Running separate server instances per "realm" or "shard" with no shared state was
considered. This remains a viable future option (each shard runs its own
`GameEngine`) but was deferred because it requires a meta-service for realm
selection and does not help with single-world density. The `maid-registry` service
could potentially evolve to support shard discovery.
### PyPy or Cython Optimization
Using PyPy for JIT compilation or Cython for hot paths was considered for improving single-process throughput. Deferred because current CPython performance is adequate for the target scale, and both approaches have compatibility constraints with C extensions used by dependencies (Pydantic, cryptographic libraries).