AI Players: Research Survey¶
Overview¶
This document synthesizes academic research on autonomous AI agents playing games, with focus on text-based/interactive fiction environments. The research spans 2022–2025 and covers architecture patterns, memory systems, planning approaches, multi-agent coordination, and cost management. These findings inform the design of MAID's AI Player system.
1. Foundational Architectures¶
1.1 Generative Agents: Interactive Simulacra of Human Behavior¶
- Authors: Park, O'Brien, Cai, Morris, Liang, Bernstein (Stanford/Google)
- Published: 2023, arXiv:2304.03442
- Citations: 3000+
- Venue: UIST 2023
Summary: The seminal work on believable AI agents in sandbox environments. Deployed 25 agents in a Sims-like town ("Smallville") where they autonomously lived, formed relationships, coordinated events, and exhibited emergent social behavior.
Architecture (the "Cognitive Architecture"):
Key Components:
- Memory Stream: A comprehensive log of ALL agent experiences stored as natural language entries with timestamps. Each entry has: description, creation timestamp, last access timestamp, and importance score (1-10, rated by LLM).
- Retrieval: When the agent needs to act, it retrieves relevant memories using a scoring function combining:
  - Recency (exponential decay)
  - Importance (LLM-rated 1-10)
  - Relevance (embedding similarity to current situation)
- Reflection: Periodically (when the sum of importance scores exceeds a threshold), the agent synthesizes higher-level insights from recent memories. E.g., "I've been spending a lot of time at the café lately" → "I enjoy socializing with the barista." These reflections are stored back into the memory stream and can themselves be reflected upon (recursive abstraction).
- Planning: Agents create broad day-level plans ("Wake up at 7am, have breakfast, go to work at the library, have lunch, paint in the afternoon"), then recursively decompose them into hour-level and action-level plans. Plans are revised when unexpected events occur.
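The retrieval scoring above combines three normalized signals. A minimal sketch, assuming illustrative names (`retrieval_score`, the dict keys, equal weights) rather than the paper's actual code:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieval_score(memory, now, query_embedding, decay=0.995):
    """Generative-Agents-style retrieval score: recency + importance + relevance.

    `memory` is a dict with keys last_access (hours), importance (1-10),
    and embedding (list of floats) -- field names are assumptions.
    """
    hours_since_access = now - memory["last_access"]
    recency = decay ** hours_since_access          # exponential decay over time
    importance = memory["importance"] / 10.0       # normalize 1-10 to 0-1
    relevance = cosine(memory["embedding"], query_embedding)
    return recency + importance + relevance        # paper weights all three equally
```

At act time, the agent scores every memory against the current situation and puts the top-k into the prompt.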
Key Results:
- Emergent behavior: One agent autonomously organized a Valentine's Day party — invited others, who then invited their friends, leading to a full social event with no human intervention.
- Ablation showed ALL components (observation, planning, reflection) were necessary — removing any one significantly degraded believability.
- Human evaluators rated generative agents as more believable than hand-scripted agents.
Relevance to MAID AI Players:
- The memory stream + reflection architecture is directly applicable
- MUD text output maps to the observation/perception layer
- Day-level planning maps to session-level goals (explore area X, level up, complete quest Y)
- Reflection enables agents to form opinions about NPCs, learn which areas are dangerous, etc.
1.2 Voyager: An Open-Ended Embodied Lifelong Learning Agent¶
- Authors: Wang, Xie, Jiang, Mandlekar, Xiao, Zhu, Fan, Anandkumar (NVIDIA, Caltech, UT Austin)
- Published: 2023, arXiv:2305.16291
- Citations: 1500+
- Venue: NeurIPS 2023 (Spotlight)
Summary: An LLM-powered agent in Minecraft that continuously explores, acquires skills, and makes discoveries without human intervention. First lifelong learning agent in an open-ended game world.
Architecture (3 key components):
- Automatic Curriculum: An LLM generates increasingly complex goals based on the agent's current skill set, inventory, and exploration state. Starts simple ("collect wood") and evolves to complex ("build a diamond pickaxe"). The curriculum maximizes exploration by proposing goals just beyond current capability.
- Skill Library: Executable JavaScript code snippets stored with descriptions. When a new task is encountered, the agent first searches its skill library for relevant existing skills. If none are found, it generates new code, tests it, and, if successful, adds it to the library. Skills are composable.
- Iterative Prompting: When code execution fails, the environment error is fed back to the LLM for self-debugging. Includes execution errors, game state changes, and self-verification checks.
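The skill library's core contract — search before generating, add only on verified success — can be sketched as follows. The class name and the word-overlap relevance measure are stand-ins (Voyager itself keys skills by embedding similarity over descriptions):

```python
class SkillLibrary:
    """Voyager-style skill store: description-keyed snippets, added only on success."""

    def __init__(self):
        self.skills = {}  # description -> code snippet

    def search(self, task, top_k=3):
        """Rank stored skills by crude word overlap with the task description.

        A toy stand-in for the embedding search used in the paper."""
        words = set(task.lower().split())
        ranked = sorted(self.skills,
                        key=lambda desc: len(words & set(desc.lower().split())),
                        reverse=True)
        return [(desc, self.skills[desc]) for desc in ranked[:top_k]]

    def add(self, description, code, succeeded):
        # Self-verification gate: failed attempts never pollute the library.
        if succeeded:
            self.skills[description] = code
```

New tasks first hit `search`; only when nothing relevant exists does the agent pay for fresh code generation.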
Key Results:
- 3.3x more unique items discovered than prior SOTA
- 2.3x longer travel distances
- Unlocked the entire Minecraft tech tree without human guidance
- Skills transfer to new worlds — learned skills in one seed work in another
Relevance to MAID AI Players:
- Automatic curriculum → AI Players can auto-generate goals ("explore the forest", "buy a sword", "fight wolves")
- Skill library → Store successful command sequences as reusable "plays" (how to buy from a shop, how to navigate to a location, combat tactics)
- Iterative prompting → When a command fails, feed the error back and try an alternative
- Open-ended exploration is exactly the MUD gameplay loop
1.3 ReAct: Synergizing Reasoning and Acting in Language Models¶
- Authors: Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao (Princeton, Google)
- Published: 2022, arXiv:2210.03629
- Citations: 3000+
- Venue: ICLR 2023
Summary: Introduces the pattern of interleaving reasoning traces (thinking) with actions (doing) in LLM agents. The reasoning helps the agent plan and track progress; the actions gather information from the environment.
The ReAct Loop:
Thought: I need to find the shop to buy a sword.
Action: move north
Observation: You enter the Town Square. A shop is to the east.
Thought: The shop is east of here. I should go there.
Action: move east
Observation: You enter Ye Olde Shoppe. A merchant stands behind the counter.
Thought: I'm in the shop. I should look at what's available.
Action: list
Observation: [Available items listed]
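The trace above is one loop of thought → action → observation. A minimal driver for that loop, assuming caller-supplied stubs (`llm(prompt)` returns text containing an `Action:` line; `env(cmd)` returns observation text — neither is a real API):

```python
def react_loop(llm, env, max_steps=10):
    """Run a ReAct-style loop: think, act, observe, repeat until 'done'."""
    transcript = []
    for _ in range(max_steps):
        out = llm("\n".join(transcript))            # model sees the full trace so far
        transcript.append(out)                      # keep the Thought/Action pair
        action = out.split("Action:", 1)[1].strip()
        if action == "done":
            break
        obs = env(action)                           # execute the command in the game
        transcript.append(f"Observation: {obs}")
    return transcript
```

The transcript doubles as the observability artifact: every decision is paired with the reasoning that produced it.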
Key Results:
- Outperforms pure reasoning (CoT) and pure acting (RL) approaches
- 34% improvement on the ALFWorld text game benchmark
- 10% improvement on the WebShop interactive shopping benchmark
- Reasoning traces make agent behavior interpretable and debuggable
Relevance to MAID AI Players:
- This is the core interaction loop for MUD gameplay
- Thought traces provide debugging/logging capability
- The pattern naturally maps to: parse MUD output → reason about it → generate command → observe result
- Can be extended with memory retrieval between thought and action
1.4 Reflexion: Language Agents with Verbal Reinforcement Learning¶
- Authors: Shinn, Cassano, Gopinath, Narasimhan, Yao (Northeastern, Princeton, MIT)
- Published: 2023, arXiv:2303.11366
- Citations: 1500+
- Venue: NeurIPS 2023
Summary: Agents learn from failures through verbal self-reflection rather than weight updates. After failing a task, the agent generates a natural language reflection on what went wrong and stores it in an episodic memory buffer. On subsequent attempts, these reflections are retrieved and included in the prompt.
Architecture:
Key Components:
1. Evaluator: Determines if the task was completed successfully (binary or scored)
2. Self-Reflection: The LLM analyzes the trajectory and generates insights: "I tried to fight the wolf but my health was too low. Next time I should heal first."
3. Episodic Memory: Stores reflections for retrieval in future similar situations
Key Results:
- 91% on HumanEval (vs 80% for the GPT-4 baseline) — through reflection, not fine-tuning
- Significant improvements on ALFWorld (text games) and HotPotQA
- No gradient updates needed — purely prompt-based learning
- Works across multiple attempts at the same task AND transfers to similar tasks
Relevance to MAID AI Players:
- Death/failure reflection: "I died fighting the Forest Wolf because I didn't heal first. Lesson: always check health before combat."
- Quest failure reflection: "I couldn't complete the quest because I didn't have the required item."
- These reflections accumulate into a strategy guide the agent writes for itself
- Zero-cost learning (no fine-tuning required)
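The evaluate → reflect → retry cycle is simple to express. A sketch assuming two caller-supplied stubs (`run_task(reflections)` returns `(success, trajectory)`; `reflect(trajectory)` returns a natural-language lesson — both names are illustrative):

```python
def reflexion_attempts(run_task, reflect, max_attempts=3):
    """Reflexion loop: try the task, and on failure store a verbal lesson
    that is fed into every subsequent attempt. No weight updates involved."""
    reflections = []  # episodic memory buffer of past lessons
    for attempt in range(1, max_attempts + 1):
        success, trajectory = run_task(reflections)  # lessons go into the prompt
        if success:
            return attempt, reflections
        reflections.append(reflect(trajectory))      # verbal "gradient" for next try
    return None, reflections                         # gave up; lessons persist anyway
```

Because the buffer outlives the task, lessons like "heal before combat" also transfer to similar future situations.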
2. Multi-Agent Systems¶
2.1 Project SID: Many-Agent Simulations Toward AI Civilization¶
- Authors: Altera.AI team (14 authors)
- Published: 2024, arXiv:2411.00114
- Citations: 50+
Summary: Scaled agent simulations to 10–1000+ agents in Minecraft. Introduced the PIANO (Parallel Information Aggregation via Neural Orchestration) architecture for real-time multi-agent coordination.
Key Contributions:
- PIANO Architecture: Enables agents to interact with humans and other agents in real-time while maintaining coherence across multiple output streams. Key insight: parallelize perception and action generation across agents using a shared orchestration layer.
- Civilizational Benchmarks: Evaluated agents on:
  - Role specialization (did agents develop distinct roles?)
  - Rule adherence and evolution (did they follow and modify rules?)
  - Cultural transmission (did knowledge transfer between agents?)
  - Economic behavior (did trade emerge?)
- Scaling Results:
  - At 10 agents: basic social interactions, simple role differentiation
  - At 100 agents: emergent economies, political structures, cultural norms
  - At 1000+: civilization-level phenomena, religious transmission, institutional formation
Relevance to MAID AI Players:
- Architecture for running many AI Players simultaneously
- Shared orchestration layer could manage multiple AI Player sessions
- Civilizational benchmarks could inspire MUD community dynamics
- Scaling considerations directly applicable
2.2 Agents: An Open-source Framework for Autonomous Language Agents¶
- Authors: Zhou et al.
- Published: 2023, arXiv:2309.07870
- Citations: 200+
Summary: Open-source library implementing a modular agent architecture with planning, memory, tool usage, multi-agent communication, and symbolic control.
Architecture Modules:
1. Planning Module: Supports multiple planning strategies (chain-of-thought, tree-of-thought, plan-and-solve)
2. Memory Module: Short-term (context window), long-term (vector store), episodic (trajectory storage)
3. Tool Module: Extensible tool interface for environment interaction
4. Communication Module: Agent-to-agent messaging with structured protocols
5. Symbolic Control: State machines and rule-based overrides for safety
Relevance to MAID AI Players:
- Modular architecture pattern to follow
- Communication module for multi-AI-player scenarios
- Symbolic control for safety (prevent griefing, enforce game rules)
2.3 Experiential Co-Learning of Software-Developing Agents¶
- Authors: Qian et al.
- Published: 2023, arXiv:2312.17025
- Citations: 100+
Summary: Multiple agents learn from historical trajectories. Key insight: agents share experiences through a common knowledge base, allowing one agent's discoveries to benefit all others.
Relevance to MAID AI Players:
- Shared knowledge base: one AI Player discovers a quest solution, all others can benefit
- Map knowledge sharing: explored rooms/areas pooled across agents
- Combat tactics sharing: effective strategies propagated to all agents
3. Text Game & Interactive Fiction Agents¶
3.1 TALES: Text Adventure Learning Environment Suite¶
- Authors: Cui, Yuan, Xiao, Ammanabrolu et al.
- Published: 2025, arXiv:2504.14128
- Citations: 6
Summary: Unified benchmark for LLM agents in text-adventure game environments. Built on Jericho (542 human-written interactive fiction games). Evaluates spatial reasoning, object manipulation, puzzle solving, and narrative comprehension.
Key Findings:
- LLMs struggle with: spatial reasoning, long-horizon planning, object state tracking
- Best performance with: structured observation parsing, explicit state tracking, plan revision
- GPT-4 solves ~30% of games; Claude-3 is similar; open models are significantly worse
Benchmark Categories:
1. Navigation & Spatial Reasoning
2. Object Manipulation & Puzzles
3. NPC Interaction & Dialogue
4. Combat & Resource Management
5. Multi-step Quest Completion
Relevance to MAID AI Players:
- Directly applicable benchmark categories map to MUD gameplay
- Known failure modes (spatial reasoning) inform where to add structured state tracking
- Suggests AI Players need explicit map/state models, not just LLM reasoning
3.2 Learning to Play Like Humans: LLM Adaptation in Interactive Fiction¶
- Authors: Zhang, Long
- Published: 2025, ACL Findings
- Venue: ACL 2025
Summary: Framework for making LLMs play text games more like humans. Key insight: human players use exploration strategies, form mental models, and build spatial maps — LLMs need explicit scaffolding to replicate these behaviors.
Techniques:
1. Mental Model Construction: Explicitly build and maintain a world model (room graph, inventory state, NPC state) from text observations
2. Exploration Strategies: Systematic exploration (DFS/BFS through rooms) rather than random wandering
3. Human-like Timing: Variable action delays to simulate reading and thinking time
Relevance to MAID AI Players:
- Explicit world model construction is essential for MUD navigation
- Systematic exploration strategies prevent getting lost
- Human-like timing makes AI Players feel natural to human co-players
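A room graph built from observations is the simplest form of such a mental model. A sketch with assumed names (`WorldModel`, `observe`, `path`), using BFS for systematic pathfinding over known exits:

```python
from collections import deque

class WorldModel:
    """Explicit room graph assembled from MUD observations.

    Rooms are nodes; exits are direction-labeled edges."""

    def __init__(self):
        self.exits = {}  # room -> {direction: destination room}

    def observe(self, room, direction, destination):
        """Record that moving `direction` from `room` reached `destination`."""
        self.exits.setdefault(room, {})[direction] = destination

    def path(self, start, goal):
        """BFS over known exits; returns a list of directions, or None if unreachable."""
        frontier = deque([(start, [])])
        seen = {start}
        while frontier:
            room, route = frontier.popleft()
            if room == goal:
                return route
            for direction, nxt in self.exits.get(room, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, route + [direction]))
        return None
```

The same structure supports systematic exploration: any room with an exit whose destination is still unknown is a frontier worth visiting.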
3.3 Starling: Self-supervised Training of Text-based RL Agent with LLMs¶
- Authors: Basavatia, Murugesan
- Published: 2024, ACL Findings
- Citations: 16
Summary: Self-supervised approach where LLMs generate their own training data for text game agents. Uses LLM to create diverse game scenarios and solutions, then trains a smaller model on these trajectories.
Relevance to MAID AI Players:
- Could use a large LLM to generate training trajectories, then distill to a smaller/cheaper model for runtime
- Self-play data generation for offline training
3.4 Digital Player: Evaluating LLM-based Human-like Agent in Games¶
- Authors: Wang et al.
- Published: 2025, arXiv:2502.20807
- Citations: 3
Summary: Evaluates how human-like LLM game agents are. Identifies key challenges: numerical reasoning, long-term planning, and maintaining consistent character personality.
Evaluation Dimensions:
1. Strategic competence (does the agent make good decisions?)
2. Behavioral consistency (does it maintain character?)
3. Social intelligence (does it interact naturally with other players?)
4. Exploration efficiency (does it explore systematically?)
5. Adaptation (does it learn from experience?)
Relevance to MAID AI Players:
- Evaluation framework for measuring AI Player quality
- Identified failure modes to design around
3.5 TextQuests: How Good are LLMs at Text-Based Video Games?¶
- Authors: Phan et al.
- Published: 2025
Summary: Systematic evaluation of LLMs playing text-based games. Tests multiple models (GPT-4, Claude, Llama, Mistral) across game types.
Key Findings:
- All models struggle with: inventory management, combat tactics, long-term resource planning
- Models improve significantly with: structured prompts, explicit state tracking, few-shot examples
- Chain-of-thought reasoning helps but isn't sufficient alone
- Best approach: ReAct-style with explicit state tracking
3.6 Textatari: 100k Frames Game Playing with Language Agents¶
- Authors: Li et al.
- Published: 2025, arXiv:2506.04098
- Citations: 1
Summary: Language agents playing games over very long horizons (100k frames). Shows that structured exploration spaces simplify planning and that current LLM agents can handle extended gameplay sessions.
Relevance to MAID AI Players:
- Long-horizon gameplay is viable
- Need efficient context management for extended sessions
4. Memory Systems¶
4.1 Memory in the Age of AI Agents¶
- Authors: Hu, Liu, Yue, Zhang, Liu, Zhu, Lin et al.
- Published: 2025, arXiv:2512.13564
Summary: Comprehensive survey on agent memory systems. Categorizes memory into types and surveys implementations across the field.
Memory Taxonomy:
| Type | Description | Persistence | Example |
|---|---|---|---|
| Sensory Buffer | Raw input, very short-lived | Seconds | Current MUD output line |
| Working Memory | Active processing context | Minutes | Current room state, recent events |
| Episodic Memory | Specific experiences | Long-term | "Fought wolf in forest at turn 42" |
| Semantic Memory | Facts and knowledge | Long-term | "Wolves are weak to fire" |
| Procedural Memory | How to do things | Long-term | "To buy: enter shop, list, buy" |
| Reflective Memory | Meta-insights | Long-term | "I tend to die when I fight without healing first" |
Memory Operations:
1. Encoding: Convert an observation to a memory entry (importance scoring, timestamping)
2. Consolidation: Compress and reorganize memories over time (merge duplicates, strengthen important ones)
3. Retrieval: Find relevant memories for the current context (similarity search, recency, importance)
4. Forgetting: Remove low-importance, rarely-accessed memories (decay function)
Relevance to MAID AI Players:
- Complete memory taxonomy to implement
- Consolidation prevents memory bloat during long sessions
- Forgetting is essential for cost management (smaller context = fewer tokens)
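The forgetting operation is typically a decay function over time since last access, weighted by importance. One possible rule, with illustrative names and thresholds (the half-life and floor are assumptions, not values from the survey):

```python
import math

def should_forget(memory, now, half_life_hours=24.0, floor=0.05):
    """Decay-based forgetting: retention halves every `half_life_hours`
    since last access, scaled by importance (1-10). Below `floor`, drop it.

    `memory` is a dict with last_access (seconds) and importance -- assumed fields."""
    hours = (now - memory["last_access"]) / 3600.0
    decay = math.exp(-hours * math.log(2) / half_life_hours)   # 2^(-hours/half_life)
    retention = (memory["importance"] / 10.0) * decay
    return retention < floor
```

Run periodically, this keeps recently-used and important memories while pruning stale trivia, which directly shrinks retrieval candidates and prompt size.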
4.2 Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory¶
- Authors: Mem0.ai team
- Published: 2025, arXiv:2504.19413
Summary: Production-focused memory architecture using graph-based storage for long-term conversational coherence.
Key Features:
1. Graph-based storage: Memories stored as nodes in a knowledge graph with relationships
2. Automatic extraction: An LLM extracts facts from conversations automatically
3. Conflict resolution: When new information contradicts old, the system resolves the conflict
4. Scalable retrieval: Efficient search across large memory stores
Relevance to MAID AI Players:
- Graph-based storage maps well to the MUD world model (rooms as nodes, exits as edges)
- Automatic fact extraction from MUD output
- Conflict resolution when game state changes unexpectedly
5. Planning & Goal Systems¶
5.1 Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents¶
- Authors: Putta et al.
- Published: 2024, arXiv:2408.xxxxx
Summary: Combines Monte Carlo Tree Search (MCTS) with LLM reasoning and self-critique for complex decision-making in interactive environments. Uses RL-style value estimation without weight updates.
Key Contributions:
1. Search-based planning: MCTS explores action trees before committing
2. Self-critique: The agent evaluates its own proposed actions
3. Progressive deepening: Plans at multiple levels of detail
Relevance to MAID AI Players:
- Combat decision-making: search over possible attack/defend/flee sequences
- Quest planning: evaluate multiple solution paths
- Self-critique prevents obviously bad actions (attacking when low on health)
5.2 AgentGen: Enhancing Planning Abilities via Environment and Task Generation¶
- Authors: Hu, Zhao, Xu, Sun, Lou, Lin
- Published: 2025, ACM KDD
- Citations: 61
Summary: Uses LLMs to automatically generate training environments and planning tasks for LLM-based agents. Evaluated on Jericho interactive fiction games.
Relevance to MAID AI Players:
- Could auto-generate test scenarios for AI Player validation
- Training on diverse generated environments improves generalization
5.3 Hierarchical Planning Pattern (Synthesized from Literature)¶
The most effective planning approach across the literature is hierarchical:
Level 0: Session Goals (set once per session)
"Explore the forest area and reach level 3"
Level 1: Phase Plans (revised every ~30 minutes)
"Phase 1: Buy equipment in town"
"Phase 2: Explore forest paths"
"Phase 3: Fight forest creatures for XP"
Level 2: Task Plans (revised every ~5 minutes)
"Go to the shop, buy a sword, equip it"
Level 3: Action Plans (immediate)
"move east" → "list" → "buy sword" → "wield sword"
Plans at each level are:
- Generated once, then revised only when invalidated
- Lower levels are re-planned more frequently than higher levels
- Unexpected events trigger re-planning at the appropriate level
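The level-wise re-planning rule can be captured in a small data structure. A sketch under assumed names (`HierarchicalPlan`, a caller-supplied `replan(level)` stub standing in for the LLM call):

```python
class HierarchicalPlan:
    """Four plan levels: 0 session goals, 1 phase plans, 2 task plans, 3 actions.

    Invalidating level k re-plans levels k..3 only; higher levels survive,
    so cheap action-level churn never triggers an expensive session re-plan."""

    LEVELS = 4

    def __init__(self, replan):
        self.replan = replan  # replan(level) -> list of steps (LLM call in practice)
        self.plans = {k: replan(k) for k in range(self.LEVELS)}

    def invalidate(self, level):
        """An unexpected event at `level` regenerates that level and everything below."""
        for k in range(level, self.LEVELS):
            self.plans[k] = self.replan(k)
```

For example, a blocked path invalidates level 2 (the task plan) and level 3 (actions), while the session goal and phase plan stay cached, saving the costliest LLM calls.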
6. Cost Management & Efficiency¶
6.1 Affordable Generative Agents¶
- Authors: Yu et al.
- Published: 2024, arXiv:2402.xxxxx
Summary: Techniques for reducing the cost of running generative agent simulations by 100x while maintaining quality.
Cost Reduction Techniques:
- Observation batching: Don't process every game tick — batch observations and process periodically (e.g., every 5 seconds instead of every 0.25 seconds)
- Plan caching: Don't re-plan every action. Generate a plan, execute it step-by-step, and only re-plan when something unexpected happens or the plan is complete.
- Memory summarization: Periodically compress episodic memories into summaries, discarding raw entries. "I fought 3 wolves in the forest over 20 minutes" instead of 20 individual combat entries.
- Tiered models: Use cheap models (GPT-3.5, Haiku) for routine actions (movement, basic combat) and expensive models (GPT-4, Opus) for strategic decisions (quest planning, NPC dialogue).
- Template actions: Common action sequences (buy item from shop, navigate known path) are stored as templates and executed without LLM calls.
- Shared context: When multiple agents are in the same room, share the perception/parsing work.
Cost Estimates (from paper):
- Naive approach: ~$10/agent/hour with GPT-4
- Optimized approach: ~$0.10/agent/hour with tiered models + caching
- 100x reduction while maintaining 90%+ behavioral quality
Relevance to MAID AI Players:
- Critical for running multiple AI Players continuously
- Tiered model approach maps directly to MAID's LLMProviderRegistry
- Template actions for common MUD command sequences
- Observation batching for reducing per-tick LLM calls
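Two of these techniques — template actions and tiered model selection — fit naturally into one routing step. A sketch with illustrative names (`ActionRouter`, the template/routine sets, the tier labels are all assumptions, not MAID APIs):

```python
class ActionRouter:
    """Route an intent to the cheapest handler that can satisfy it:
    stored template (zero LLM calls) > cheap model > strategic model."""

    def __init__(self):
        # Common sequences executed without any LLM call.
        self.templates = {
            "buy_sword": ["move east", "list", "buy sword", "wield sword"],
        }
        # Routine intents a small/cheap model handles fine.
        self.routine = {"move", "look", "get", "flee"}

    def route(self, intent):
        if intent in self.templates:
            return ("template", self.templates[intent])
        if intent in self.routine:
            return ("cheap_llm", None)       # e.g. Haiku-class model
        return ("strategic_llm", None)       # e.g. quest planning, NPC dialogue
```

The ordering matters for cost: every intent resolved by a template or the cheap tier is a strategic-model call avoided.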
7. Game-Specific Surveys & Frameworks¶
7.1 A Survey on LLM-based Game Agents¶
- Authors: Hu, Huang, Liu, Kompella, Ilhan et al.
- Published: 2024, arXiv:2404.02039
- Citations: 128
Summary: Comprehensive survey of LLM agents in games. Categorizes by game type, agent architecture, and capability.
Agent Module Taxonomy:
1. Perception: Processing game output (text parsing, screen reading, API data)
2. Memory: Storing and retrieving past experiences
3. Planning: Goal setting, strategy, multi-step reasoning
4. Action: Generating and executing game commands
5. Learning: Improving over time (reflection, fine-tuning, RL)
6. Communication: Inter-agent messaging (in multi-agent settings)
Game Type Categories:
- Adventure/RPG (most relevant to MUD)
- Strategy
- Simulation/Sandbox
- Puzzle
- Social deduction
Key Finding: Text-based games are considered "ideal testbeds" for LLM agents because the input/output modality is already language — no vision encoder needed.
7.2 Large Language Models and Games: A Survey and Roadmap¶
- Authors: Gallotta, Todd, Zammit, Earle et al.
- Published: 2024, IEEE Transactions on Games
- Citations: 213
Summary: The most comprehensive LLM+games survey. Covers six roles for LLMs in games:
1. Playing games (as agents)
2. Testing games (as QA bots)
3. Generating content (PCG)
4. Narrating (dynamic storytelling)
5. NPC behavior (dialogue, decision-making)
6. Game design (assisting designers)
Relevance to MAID AI Players:
- AI Players fill both role 1 (playing) and role 2 (testing)
- The survey identifies key challenges for game-playing agents:
  - Grounding (connecting language to game state)
  - Long-term coherence (maintaining goals across sessions)
  - Exploration-exploitation tradeoff
  - Handling partial observability
8. Additional Relevant Work¶
8.1 Dynalang: Language Models for World Model Learning¶
- Authors: Lin et al.
- Published: 2023, arXiv:2308.01399
- Venue: ICML 2024
Summary: Agent that learns a multimodal world model predicting future states from text + observations. Can leverage game manuals, rule descriptions, and NPC dialogue to build understanding.
Relevance to MAID AI Players:
- MUD help files and game documentation can be fed to the agent
- The agent can build predictive models ("if I go north, I'll be in the forest")
- A world model helps with planning without executing actions
8.2 PokéAI: Multi-Agent LLM System for Pokémon Red¶
- Authors: Liu et al.
- Published: 2025, arXiv:2506.23689
Summary: Multi-agent system that plays Pokémon Red with goal generation, battle optimization, and navigation. Uses multiple specialized agents (navigator, battler, strategist) that coordinate.
Relevance to MAID AI Players:
- Multi-agent decomposition: separate agents for different MUD activities
- Navigator agent → room exploration and pathfinding
- Battler agent → combat decisions
- Strategist agent → overall goal planning
8.3 LLM2TextGame: Multi-Agents for Generating Consistent Text Adventure Games¶
- Authors: Bazarvaani, Na
- Published: 2024, Korean HCI Conference
Summary: Uses multiple LLM agents to generate consistent text adventure games. Interesting reverse perspective — instead of playing games, generating them. But the consistency mechanisms are relevant.
8.4 WorldWeaver: Procedural World Generation for Text Adventure Games¶
- Authors: Jin, Kaul, Ramakrishanan, Jain et al. (UPenn)
Summary: LLM-based procedural world generation for text adventures. Focus on ensuring spatial consistency and narrative coherence.
Relevance to MAID AI Players:
- AI Players could potentially help test procedurally generated content
- Spatial consistency checking techniques applicable to map building
8.5 PsychoGAT: Psychological Measurement Through Interactive Fiction with LLM Agents¶
- Authors: Yang, Wang, Chen, Wang, Pu et al.
- Published: 2024, ACL
- Citations: 45
Summary: Uses interactive fiction games with LLM agents to measure psychological traits. Demonstrates that LLM agents can exhibit consistent personality traits during gameplay.
Relevance to MAID AI Players:
- AI Players can have distinct, consistent personalities
- Personality affects gameplay style (aggressive vs cautious, social vs solo)
9. Synthesis: Key Design Principles¶
From the comprehensive literature review, these principles emerge for designing MAID AI Players:
Principle 1: Cognitive Architecture is Essential¶
Every successful game agent uses a perception → memory → planning → action loop. No single LLM call can substitute for this structured approach.
Principle 2: Memory Must Be Multi-Layered¶
Working memory (current context), episodic memory (past events), semantic memory (learned facts), and reflective memory (meta-insights) all serve distinct purposes. Cutting any layer significantly degrades performance.
Principle 3: Planning Must Be Hierarchical¶
Session-level goals, phase-level plans, task-level sequences, and action-level commands. Only re-plan at the level that was invalidated. This is both more effective and more cost-efficient.
Principle 4: Explicit State Tracking is Non-Negotiable¶
LLMs cannot reliably track game state (inventory, health, map) in their context window alone. Explicit structured state must be maintained outside the LLM and fed as context.
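The usual pattern is a small structured state object maintained by game-output parsers and serialized into every prompt. A minimal sketch with assumed fields (`AgentState` and its members are illustrative, not a MAID type):

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Structured game state kept outside the LLM.

    Updated by parsing MUD output; rendered into each prompt so the model
    never has to reconstruct inventory or health from conversation history."""
    health: int = 100
    location: str = "town_square"
    inventory: list = field(default_factory=list)

    def to_prompt(self) -> str:
        """Compact one-line rendering prepended to the LLM context."""
        items = ", ".join(self.inventory) or "empty"
        return f"HP: {self.health} | Location: {self.location} | Inventory: {items}"
```

Because the state lives in code, it stays correct across context-window truncation and can also drive non-LLM checks (e.g., refuse combat below a health threshold).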
Principle 5: Reflection Enables Zero-Cost Learning¶
Verbal self-reflection (Reflexion pattern) allows agents to learn from failures without fine-tuning. This is critical for MUD gameplay where death and failure are frequent.
Principle 6: Cost Control Requires Tiered Architecture¶
Use cheap models for routine actions, expensive models for strategic decisions. Cache plans, batch observations, and use templates for common sequences. Target: <$0.10/agent/hour.
Principle 7: Multi-Agent Benefits from Shared Knowledge¶
When running multiple AI Players, sharing map knowledge, combat tactics, and quest solutions across agents dramatically improves collective performance.
Principle 8: Human-Likeness Requires Deliberate Design¶
Variable timing, personality consistency, exploration patterns, and social behavior must be explicitly designed — they don't emerge from LLM capabilities alone.
Principle 9: Text Games are Ideal for LLM Agents¶
The input/output modality is already language. No vision needed. The challenge shifts to reasoning, planning, and memory — which are LLM strengths.
Principle 10: Observability is Critical¶
Thought traces (ReAct), memory contents, plan state, and decision rationale must all be inspectable for debugging and evaluation.
10. Technology Landscape¶
10.1 Existing Frameworks¶
| Framework | Focus | Language | Status |
|---|---|---|---|
| LangChain/LangGraph | General agents | Python | Active, widely used |
| AutoGen (Microsoft) | Multi-agent | Python | Active |
| CrewAI | Multi-agent teams | Python | Active |
| Agents (Zhou et al.) | Modular agents | Python | Research |
| PIANO (Altera) | Scaled agents | Proprietary | Research |
| Mem0 | Agent memory | Python | Active, open-source |
10.2 Relevant Game Environments¶
| Environment | Type | Interface | Status |
|---|---|---|---|
| Jericho | 542 IF games | Text | Active benchmark |
| TALES | IF benchmark suite | Text | 2025, latest |
| TextWorld | Procedural text games | Text API | Microsoft, active |
| ALFWorld | Embodied text games | Text + vision | Active benchmark |
| NetHack (NLE) | Roguelike | Text/terminal | Active benchmark |
| MiniHack | NetHack subset | API | Active |
| Minecraft (STEVE/Voyager) | 3D sandbox | API + vision | Active research |
10.3 LLM Providers for AI Players¶
| Provider | Best Model | Cost/1M tokens | Latency | Best For |
|---|---|---|---|---|
| Anthropic | Claude Sonnet 4 | ~$3 in / $15 out | ~1s | Strategic planning, reflection |
| Anthropic | Claude Haiku 3.5 | ~$0.25 in / $1.25 out | ~0.3s | Routine actions, parsing |
| OpenAI | GPT-4o | ~$2.50 in / $10 out | ~0.8s | General purpose |
| OpenAI | GPT-4o-mini | ~$0.15 in / $0.60 out | ~0.3s | Routine actions |
| Ollama | Llama 3.2 | Free (local) | Varies | Development, testing |
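The prices above make per-agent-hour cost a simple back-of-envelope calculation. A sketch where the call rate and token counts are assumptions chosen for illustration, not measurements:

```python
def hourly_cost(calls_per_hour, in_tokens, out_tokens, price_in, price_out):
    """Cost per agent-hour in dollars.

    `price_in`/`price_out` are $ per 1M tokens, as in the table above;
    `in_tokens`/`out_tokens` are per-call averages (assumed values)."""
    per_call = (in_tokens * price_in + out_tokens * price_out) / 1_000_000
    return calls_per_hour * per_call

# Example: 120 routine calls/hour on a Haiku-class tier
# (2,000 prompt tokens, 200 completion tokens per call at $0.25/$1.25 per 1M):
# hourly_cost(120, 2000, 200, 0.25, 1.25) -> 0.09, i.e. ~$0.09/agent/hour
```

At those assumed volumes a cheap-tier agent lands just under the $0.10/agent/hour target from Section 6, while the same traffic on a Sonnet-class tier would cost roughly 12x more.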
11. References¶
- Park, J.S., O'Brien, J.C., Cai, C.J., Morris, M.R., Liang, P., & Bernstein, M.S. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST 2023. arXiv:2304.03442.
- Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. NeurIPS 2023. arXiv:2305.16291.
- Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629.
- Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366.
- Altera.AI et al. (2024). Project SID: Many-Agent Simulations Toward AI Civilization. arXiv:2411.00114.
- Lin, J. et al. (2023). Dynalang: Language Models for World Model Learning. ICML 2024. arXiv:2308.01399.
- Hu, S., Huang, T., Liu, G., Kompella, R.R., Ilhan, F. et al. (2024). A Survey on LLM-based Game Agents. arXiv:2404.02039.
- Gallotta, R., Todd, G., Zammit, M., Earle, S. et al. (2024). Large Language Models and Games: A Survey and Roadmap. IEEE Transactions on Games. arXiv:2402.18659.
- Cui, C.Z., Yuan, X., Xiao, Z., Ammanabrolu, P. et al. (2025). TALES: Text Adventure Learning Environment Suite. arXiv:2504.14128.
- Zhang, J. & Long, Y. (2025). Learning to Play Like Humans: A Framework for LLM Adaptation in Interactive Fiction Games. ACL Findings 2025.
- Basavatia, S. & Murugesan, K. (2024). Starling: Self-supervised Training of Text-based Reinforcement Learning Agent with Large Language Models. ACL Findings 2024.
- Wang, J. et al. (2025). Digital Player: Evaluating Large Language Models based Human-like Agent in Games. arXiv:2502.20807.
- Yu, J. et al. (2024). Affordable Generative Agents. arXiv:2402.xxxxx.
- Zhou, W. et al. (2023). Agents: An Open-source Framework for Autonomous Language Agents. arXiv:2309.07870.
- Qian, C. et al. (2023). Experiential Co-Learning of Software-Developing Agents. arXiv:2312.17025.
- Hu, Y. et al. (2025). Memory in the Age of AI Agents. arXiv:2512.13564.
- Mem0.ai (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413.
- Putta, P. et al. (2024). Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents. arXiv:2408.xxxxx.
- Liu, H. et al. (2025). PokéAI: A Goal-Generating, Battle-Optimizing Multi-agent LLM System. arXiv:2506.23689.
- Hu, M. et al. (2025). AgentGen: Enhancing Planning Abilities for LLM-based Agent via Environment and Task Generation. ACM KDD 2025.