AI Players: Research Survey¶
Overview¶
This document synthesizes academic research on autonomous AI agents playing games, with focus on text-based/interactive fiction environments. The research spans 2022–2025 and covers architecture patterns, memory systems, planning approaches, multi-agent coordination, and cost management. These findings inform the design of MAID's AI Player system.
1. Foundational Architectures¶
1.1 Generative Agents: Interactive Simulacra of Human Behavior¶
- Authors: Park, O'Brien, Cai, Morris, Liang, Bernstein (Stanford/Google)
- Published: 2023, arXiv:2304.03442
- Citations: 3000+
- Venue: UIST 2023
Summary: The seminal work on believable AI agents in sandbox environments. Deployed 25 agents in a Sims-like town ("Smallville") where they autonomously lived, formed relationships, coordinated events, and exhibited emergent social behavior.
Architecture (the "Cognitive Architecture"):
Key Components:
- Memory Stream: A comprehensive log of ALL agent experiences stored as natural language entries with timestamps. Each entry has: description, creation timestamp, last access timestamp, and importance score (1-10, rated by LLM).
- Retrieval: When the agent needs to act, it retrieves relevant memories using a scoring function combining:
  - Recency (exponential decay)
  - Importance (LLM-rated 1-10)
  - Relevance (embedding similarity to current situation)
- Reflection: Periodically (when the sum of importance scores exceeds a threshold), the agent synthesizes higher-level insights from recent memories. E.g., "I've been spending a lot of time at the café lately" → "I enjoy socializing with the barista." These reflections are stored back into the memory stream and can themselves be reflected upon (recursive abstraction).
- Planning: Agents create broad day-level plans ("Wake up at 7am, have breakfast, go to work at the library, have lunch, paint in the afternoon"), then recursively decompose them into hour-level and action-level plans. Plans are revised when unexpected events occur.
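The retrieval scoring above combines three normalized signals. A minimal sketch, assuming illustrative names (`retrieval_score`, the dict keys, equal weights) rather than the paper's actual code:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieval_score(memory, now, query_embedding, decay=0.995):
    """Generative-Agents-style retrieval score: recency + importance + relevance.

    `memory` is a dict with keys last_access (hours), importance (1-10),
    and embedding (list of floats) -- field names are assumptions.
    """
    hours_since_access = now - memory["last_access"]
    recency = decay ** hours_since_access          # exponential decay over time
    importance = memory["importance"] / 10.0       # normalize 1-10 to 0-1
    relevance = cosine(memory["embedding"], query_embedding)
    return recency + importance + relevance        # paper weights all three equally
```

At act time, the agent scores every memory against the current situation and puts the top-k into the prompt.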
Key Results:
- Emergent behavior: One agent autonomously organized a Valentine's Day party — invited others, who then invited their friends, leading to a full social event with no human intervention.
- Ablation showed ALL components (observation, planning, reflection) were necessary — removing any one significantly degraded believability.
- Human evaluators rated generative agents as more believable than hand-scripted agents.
Relevance to MAID AI Players:
- The memory stream + reflection architecture is directly applicable
- MUD text output maps to the observation/perception layer
- Day-level planning maps to session-level goals (explore area X, level up, complete quest Y)
- Reflection enables agents to form opinions about NPCs, learn which areas are dangerous, etc.
1.2 Voyager: An Open-Ended Embodied Lifelong Learning Agent¶
- Authors: Wang, Xie, Jiang, Mandlekar, Xiao, Zhu, Fan, Anandkumar (NVIDIA, Caltech, UT Austin)
- Published: 2023, arXiv:2305.16291
- Citations: 1500+
- Venue: NeurIPS 2023 (Spotlight)
Summary: An LLM-powered agent in Minecraft that continuously explores, acquires skills, and makes discoveries without human intervention. First lifelong learning agent in an open-ended game world.
Architecture (3 key components):
- Automatic Curriculum: An LLM generates increasingly complex goals based on the agent's current skill set, inventory, and exploration state. Starts simple ("collect wood") and evolves to complex ("build a diamond pickaxe"). The curriculum maximizes exploration by proposing goals just beyond current capability.
- Skill Library: Executable JavaScript code snippets stored with descriptions. When a new task is encountered, the agent first searches its skill library for relevant existing skills. If none are found, it generates new code, tests it, and, if successful, adds it to the library. Skills are composable.
- Iterative Prompting: When code execution fails, the environment error is fed back to the LLM for self-debugging. Includes execution errors, game state changes, and self-verification checks.
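The skill library's core contract — search before generating, add only on verified success — can be sketched as follows. The class name and the word-overlap relevance measure are stand-ins (Voyager itself keys skills by embedding similarity over descriptions):

```python
class SkillLibrary:
    """Voyager-style skill store: description-keyed snippets, added only on success."""

    def __init__(self):
        self.skills = {}  # description -> code snippet

    def search(self, task, top_k=3):
        """Rank stored skills by crude word overlap with the task description.

        A toy stand-in for the embedding search used in the paper."""
        words = set(task.lower().split())
        ranked = sorted(self.skills,
                        key=lambda desc: len(words & set(desc.lower().split())),
                        reverse=True)
        return [(desc, self.skills[desc]) for desc in ranked[:top_k]]

    def add(self, description, code, succeeded):
        # Self-verification gate: failed attempts never pollute the library.
        if succeeded:
            self.skills[description] = code
```

New tasks first hit `search`; only when nothing relevant exists does the agent pay for fresh code generation.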
Key Results:
- 3.3x more unique items discovered than prior SOTA
- 2.3x longer travel distances
- Unlocked the entire Minecraft tech tree without human guidance
- Skills transfer to new worlds — learned skills in one seed work in another
Relevance to MAID AI Players:
- Automatic curriculum → AI Players can auto-generate goals ("explore the forest", "buy a sword", "fight wolves")
- Skill library → Store successful command sequences as reusable "plays" (how to buy from a shop, how to navigate to a location, combat tactics)
- Iterative prompting → When a command fails, feed the error back and try an alternative
- Open-ended exploration is exactly the MUD gameplay loop
1.3 ReAct: Synergizing Reasoning and Acting in Language Models¶
- Authors: Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao (Princeton, Google)
- Published: 2022, arXiv:2210.03629
- Citations: 3000+
- Venue: ICLR 2023
Summary: Introduces the pattern of interleaving reasoning traces (thinking) with actions (doing) in LLM agents. The reasoning helps the agent plan and track progress; the actions gather information from the environment.
The ReAct Loop:
Thought: I need to find the shop to buy a sword.
Action: move north
Observation: You enter the Town Square. A shop is to the east.
Thought: The shop is east of here. I should go there.
Action: move east
Observation: You enter Ye Olde Shoppe. A merchant stands behind the counter.
Thought: I'm in the shop. I should look at what's available.
Action: list
Observation: [Available items listed]
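The trace above is one loop of thought → action → observation. A minimal driver for that loop, assuming caller-supplied stubs (`llm(prompt)` returns text containing an `Action:` line; `env(cmd)` returns observation text — neither is a real API):

```python
def react_loop(llm, env, max_steps=10):
    """Run a ReAct-style loop: think, act, observe, repeat until 'done'."""
    transcript = []
    for _ in range(max_steps):
        out = llm("\n".join(transcript))            # model sees the full trace so far
        transcript.append(out)                      # keep the Thought/Action pair
        action = out.split("Action:", 1)[1].strip()
        if action == "done":
            break
        obs = env(action)                           # execute the command in the game
        transcript.append(f"Observation: {obs}")
    return transcript
```

The transcript doubles as the observability artifact: every decision is paired with the reasoning that produced it.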
Key Results:
- Outperforms pure reasoning (CoT) and pure acting (RL) approaches
- 34% improvement on the ALFWorld text game benchmark
- 10% improvement on the WebShop interactive shopping benchmark
- Reasoning traces make agent behavior interpretable and debuggable
Relevance to MAID AI Players:
- This is the core interaction loop for MUD gameplay
- Thought traces provide debugging/logging capability
- The pattern naturally maps to: parse MUD output → reason about it → generate command → observe result
- Can be extended with memory retrieval between thought and action
1.4 Reflexion: Language Agents with Verbal Reinforcement Learning¶
- Authors: Shinn, Cassano, Gopinath, Narasimhan, Yao (Northeastern, Princeton, MIT)
- Published: 2023, arXiv:2303.11366
- Citations: 1500+
- Venue: NeurIPS 2023
Summary: Agents learn from failures through verbal self-reflection rather than weight updates. After failing a task, the agent generates a natural language reflection on what went wrong and stores it in an episodic memory buffer. On subsequent attempts, these reflections are retrieved and included in the prompt.
Architecture:
Key Components:
1. Evaluator: Determines if the task was completed successfully (binary or scored)
2. Self-Reflection: The LLM analyzes the trajectory and generates insights: "I tried to fight the wolf but my health was too low. Next time I should heal first."
3. Episodic Memory: Stores reflections for retrieval in future similar situations
Key Results:
- 91% on HumanEval (vs 80% for the GPT-4 baseline) — through reflection, not fine-tuning
- Significant improvements on ALFWorld (text games) and HotPotQA
- No gradient updates needed — purely prompt-based learning
- Works across multiple attempts at the same task AND transfers to similar tasks
Relevance to MAID AI Players:
- Death/failure reflection: "I died fighting the Forest Wolf because I didn't heal first. Lesson: always check health before combat."
- Quest failure reflection: "I couldn't complete the quest because I didn't have the required item."
- These reflections accumulate into a strategy guide the agent writes for itself
- Zero-cost learning (no fine-tuning required)
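The evaluate → reflect → retry cycle is simple to express. A sketch assuming two caller-supplied stubs (`run_task(reflections)` returns `(success, trajectory)`; `reflect(trajectory)` returns a natural-language lesson — both names are illustrative):

```python
def reflexion_attempts(run_task, reflect, max_attempts=3):
    """Reflexion loop: try the task, and on failure store a verbal lesson
    that is fed into every subsequent attempt. No weight updates involved."""
    reflections = []  # episodic memory buffer of past lessons
    for attempt in range(1, max_attempts + 1):
        success, trajectory = run_task(reflections)  # lessons go into the prompt
        if success:
            return attempt, reflections
        reflections.append(reflect(trajectory))      # verbal "gradient" for next try
    return None, reflections                         # gave up; lessons persist anyway
```

Because the buffer outlives the task, lessons like "heal before combat" also transfer to similar future situations.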
2. Multi-Agent Systems¶
2.1 Project SID: Many-Agent Simulations Toward AI Civilization¶
- Authors: Altera.AI team (14 authors)
- Published: 2024, arXiv:2411.00114
- Citations: 50+
Summary: Scaled agent simulations to 10–1000+ agents in Minecraft. Introduced the PIANO (Parallel Information Aggregation via Neural Orchestration) architecture for real-time multi-agent coordination.
Key Contributions:
- PIANO Architecture: Enables agents to interact with humans and other agents in real-time while maintaining coherence across multiple output streams. Key insight: parallelize perception and action generation across agents using a shared orchestration layer.
- Civilizational Benchmarks: Evaluated agents on:
  - Role specialization (did agents develop distinct roles?)
  - Rule adherence and evolution (did they follow and modify rules?)
  - Cultural transmission (did knowledge transfer between agents?)
  - Economic behavior (did trade emerge?)
- Scaling Results:
  - At 10 agents: basic social interactions, simple role differentiation
  - At 100 agents: emergent economies, political structures, cultural norms
  - At 1000+: civilization-level phenomena, religious transmission, institutional formation
Relevance to MAID AI Players:
- Architecture for running many AI Players simultaneously
- Shared orchestration layer could manage multiple AI Player sessions
- Civilizational benchmarks could inspire MUD community dynamics
- Scaling considerations directly applicable
2.2 Agents: An Open-source Framework for Autonomous Language Agents¶
- Authors: Zhou et al.
- Published: 2023, arXiv:2309.07870
- Citations: 200+
Summary: Open-source library implementing a modular agent architecture with planning, memory, tool usage, multi-agent communication, and symbolic control.
Architecture Modules:
1. Planning Module: Supports multiple planning strategies (chain-of-thought, tree-of-thought, plan-and-solve)
2. Memory Module: Short-term (context window), long-term (vector store), episodic (trajectory storage)
3. Tool Module: Extensible tool interface for environment interaction
4. Communication Module: Agent-to-agent messaging with structured protocols
5. Symbolic Control: State machines and rule-based overrides for safety
Relevance to MAID AI Players:
- Modular architecture pattern to follow
- Communication module for multi-AI-player scenarios
- Symbolic control for safety (prevent griefing, enforce game rules)
2.3 Experiential Co-Learning of Software-Developing Agents¶
- Authors: Qian et al.
- Published: 2023, arXiv:2312.17025
- Citations: 100+
Summary: Multiple agents learn from historical trajectories. Key insight: agents share experiences through a common knowledge base, allowing one agent's discoveries to benefit all others.
Relevance to MAID AI Players:
- Shared knowledge base: one AI Player discovers a quest solution, all others can benefit
- Map knowledge sharing: explored rooms/areas pooled across agents
- Combat tactics sharing: effective strategies propagated to all agents
3. Text Game & Interactive Fiction Agents¶
3.1 TALES: Text Adventure Learning Environment Suite¶
- Authors: Cui, Yuan, Xiao, Ammanabrolu et al.
- Published: 2025, arXiv:2504.14128
- Citations: 6
Summary: Unified benchmark for LLM agents in text-adventure game environments. Built on Jericho (542 human-written interactive fiction games). Evaluates spatial reasoning, object manipulation, puzzle solving, and narrative comprehension.
Key Findings:
- LLMs struggle with: spatial reasoning, long-horizon planning, object state tracking
- Best performance with: structured observation parsing, explicit state tracking, plan revision
- GPT-4 solves ~30% of games; Claude-3 is similar; open models are significantly worse
Benchmark Categories:
1. Navigation & Spatial Reasoning
2. Object Manipulation & Puzzles
3. NPC Interaction & Dialogue
4. Combat & Resource Management
5. Multi-step Quest Completion
Relevance to MAID AI Players:
- Directly applicable benchmark categories map to MUD gameplay
- Known failure modes (spatial reasoning) inform where to add structured state tracking
- Suggests AI Players need explicit map/state models, not just LLM reasoning
3.2 Learning to Play Like Humans: LLM Adaptation in Interactive Fiction¶
- Authors: Zhang, Long
- Published: 2025, ACL Findings
- Venue: ACL 2025
Summary: Framework for making LLMs play text games more like humans. Key insight: human players use exploration strategies, form mental models, and build spatial maps — LLMs need explicit scaffolding to replicate these behaviors.
Techniques:
1. Mental Model Construction: Explicitly build and maintain a world model (room graph, inventory state, NPC state) from text observations
2. Exploration Strategies: Systematic exploration (DFS/BFS through rooms) rather than random wandering
3. Human-like Timing: Variable action delays to simulate reading and thinking time
Relevance to MAID AI Players:
- Explicit world model construction is essential for MUD navigation
- Systematic exploration strategies prevent getting lost
- Human-like timing makes AI Players feel natural to human co-players
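A room graph built from observations is the simplest form of such a mental model. A sketch with assumed names (`WorldModel`, `observe`, `path`), using BFS for systematic pathfinding over known exits:

```python
from collections import deque

class WorldModel:
    """Explicit room graph assembled from MUD observations.

    Rooms are nodes; exits are direction-labeled edges."""

    def __init__(self):
        self.exits = {}  # room -> {direction: destination room}

    def observe(self, room, direction, destination):
        """Record that moving `direction` from `room` reached `destination`."""
        self.exits.setdefault(room, {})[direction] = destination

    def path(self, start, goal):
        """BFS over known exits; returns a list of directions, or None if unreachable."""
        frontier = deque([(start, [])])
        seen = {start}
        while frontier:
            room, route = frontier.popleft()
            if room == goal:
                return route
            for direction, nxt in self.exits.get(room, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, route + [direction]))
        return None
```

The same structure supports systematic exploration: any room with an exit whose destination is still unknown is a frontier worth visiting.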
3.3 Starling: Self-supervised Training of Text-based RL Agent with LLMs¶
- Authors: Basavatia, Murugesan
- Published: 2024, ACL Findings
- Citations: 16
Summary: Self-supervised approach where LLMs generate their own training data for text game agents. Uses LLM to create diverse game scenarios and solutions, then trains a smaller model on these trajectories.
Relevance to MAID AI Players:
- Could use a large LLM to generate training trajectories, then distill to a smaller/cheaper model for runtime
- Self-play data generation for offline training
3.4 Digital Player: Evaluating LLM-based Human-like Agent in Games¶
- Authors: Wang et al.
- Published: 2025, arXiv:2502.20807
- Citations: 3
Summary: Evaluates how human-like LLM game agents are. Identifies key challenges: numerical reasoning, long-term planning, and maintaining consistent character personality.
Evaluation Dimensions:
1. Strategic competence (does the agent make good decisions?)
2. Behavioral consistency (does it maintain character?)
3. Social intelligence (does it interact naturally with other players?)
4. Exploration efficiency (does it explore systematically?)
5. Adaptation (does it learn from experience?)
Relevance to MAID AI Players:
- Evaluation framework for measuring AI Player quality
- Identified failure modes to design around
3.5 TextQuests: How Good are LLMs at Text-Based Video Games?¶
- Authors: Phan et al.
- Published: 2025
Summary: Systematic evaluation of LLMs playing text-based games. Tests multiple models (GPT-4, Claude, Llama, Mistral) across game types.
Key Findings:
- All models struggle with: inventory management, combat tactics, long-term resource planning
- Models improve significantly with: structured prompts, explicit state tracking, few-shot examples
- Chain-of-thought reasoning helps but isn't sufficient alone
- Best approach: ReAct-style with explicit state tracking
3.6 Textatari: 100k Frames Game Playing with Language Agents¶
- Authors: Li et al.
- Published: 2025, arXiv:2506.04098
- Citations: 1
Summary: Language agents playing games over very long horizons (100k frames). Shows that structured exploration spaces simplify planning and that current LLM agents can handle extended gameplay sessions.
Relevance to MAID AI Players:
- Long-horizon gameplay is viable
- Need efficient context management for extended sessions
4. Memory Systems¶
4.1 Memory in the Age of AI Agents¶
- Authors: Hu, Liu, Yue, Zhang, Liu, Zhu, Lin et al.
- Published: 2025, arXiv:2512.13564
Summary: Comprehensive survey on agent memory systems. Categorizes memory into types and surveys implementations across the field.
Memory Taxonomy:
| Type | Description | Persistence | Example |
|---|---|---|---|
| Sensory Buffer | Raw input, very short-lived | Seconds | Current MUD output line |
| Working Memory | Active processing context | Minutes | Current room state, recent events |
| Episodic Memory | Specific experiences | Long-term | "Fought wolf in forest at turn 42" |
| Semantic Memory | Facts and knowledge | Long-term | "Wolves are weak to fire" |
| Procedural Memory | How to do things | Long-term | "To buy: enter shop, list, buy" |
| Reflective Memory | Meta-insights | Long-term | "I tend to die when I fight without healing first" |
Memory Operations:
1. Encoding: Convert an observation to a memory entry (importance scoring, timestamping)
2. Consolidation: Compress and reorganize memories over time (merge duplicates, strengthen important ones)
3. Retrieval: Find relevant memories for the current context (similarity search, recency, importance)
4. Forgetting: Remove low-importance, rarely-accessed memories (decay function)
Relevance to MAID AI Players:
- Complete memory taxonomy to implement
- Consolidation prevents memory bloat during long sessions
- Forgetting is essential for cost management (smaller context = fewer tokens)
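The forgetting operation is typically a decay function over time since last access, weighted by importance. One possible rule, with illustrative names and thresholds (the half-life and floor are assumptions, not values from the survey):

```python
import math

def should_forget(memory, now, half_life_hours=24.0, floor=0.05):
    """Decay-based forgetting: retention halves every `half_life_hours`
    since last access, scaled by importance (1-10). Below `floor`, drop it.

    `memory` is a dict with last_access (seconds) and importance -- assumed fields."""
    hours = (now - memory["last_access"]) / 3600.0
    decay = math.exp(-hours * math.log(2) / half_life_hours)   # 2^(-hours/half_life)
    retention = (memory["importance"] / 10.0) * decay
    return retention < floor
```

Run periodically, this keeps recently-used and important memories while pruning stale trivia, which directly shrinks retrieval candidates and prompt size.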
4.2 Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory¶
- Authors: Mem0.ai team
- Published: 2025, arXiv:2504.19413
Summary: Production-focused memory architecture using graph-based storage for long-term conversational coherence.
Key Features:
1. Graph-based storage: Memories stored as nodes in a knowledge graph with relationships
2. Automatic extraction: An LLM extracts facts from conversations automatically
3. Conflict resolution: When new information contradicts old, the system resolves the conflict
4. Scalable retrieval: Efficient search across large memory stores
Relevance to MAID AI Players:
- Graph-based storage maps well to the MUD world model (rooms as nodes, exits as edges)
- Automatic fact extraction from MUD output
- Conflict resolution when game state changes unexpectedly
5. Planning & Goal Systems¶
5.1 Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents¶
- Authors: Putta et al.
- Published: 2024, arXiv:2408.xxxxx
Summary: Combines Monte Carlo Tree Search (MCTS) with LLM reasoning and self-critique for complex decision-making in interactive environments. Uses RL-style value estimation without weight updates.
Key Contributions:
1. Search-based planning: MCTS explores action trees before committing
2. Self-critique: The agent evaluates its own proposed actions
3. Progressive deepening: Plans at multiple levels of detail
Relevance to MAID AI Players:
- Combat decision-making: search over possible attack/defend/flee sequences
- Quest planning: evaluate multiple solution paths
- Self-critique prevents obviously bad actions (attacking when low on health)
5.2 AgentGen: Enhancing Planning Abilities via Environment and Task Generation¶
- Authors: Hu, Zhao, Xu, Sun, Lou, Lin
- Published: 2025, ACM KDD
- Citations: 61
Summary: Uses LLMs to automatically generate training environments and planning tasks for LLM-based agents. Evaluated on Jericho interactive fiction games.
Relevance to MAID AI Players:
- Could auto-generate test scenarios for AI Player validation
- Training on diverse generated environments improves generalization
5.3 Hierarchical Planning Pattern (Synthesized from Literature)¶
The most effective planning approach across the literature is hierarchical:
Level 0: Session Goals (set once per session)
"Explore the forest area and reach level 3"
Level 1: Phase Plans (revised every ~30 minutes)
"Phase 1: Buy equipment in town"
"Phase 2: Explore forest paths"
"Phase 3: Fight forest creatures for XP"
Level 2: Task Plans (revised every ~5 minutes)
"Go to the shop, buy a sword, equip it"
Level 3: Action Plans (immediate)
"move east" → "list" → "buy sword" → "wield sword"
Plans at each level are:
- Generated once, then revised only when invalidated
- Lower levels are re-planned more frequently than higher levels
- Unexpected events trigger re-planning at the appropriate level
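The level-wise re-planning rule can be captured in a small data structure. A sketch under assumed names (`HierarchicalPlan`, a caller-supplied `replan(level)` stub standing in for the LLM call):

```python
class HierarchicalPlan:
    """Four plan levels: 0 session goals, 1 phase plans, 2 task plans, 3 actions.

    Invalidating level k re-plans levels k..3 only; higher levels survive,
    so cheap action-level churn never triggers an expensive session re-plan."""

    LEVELS = 4

    def __init__(self, replan):
        self.replan = replan  # replan(level) -> list of steps (LLM call in practice)
        self.plans = {k: replan(k) for k in range(self.LEVELS)}

    def invalidate(self, level):
        """An unexpected event at `level` regenerates that level and everything below."""
        for k in range(level, self.LEVELS):
            self.plans[k] = self.replan(k)
```

For example, a blocked path invalidates level 2 (the task plan) and level 3 (actions), while the session goal and phase plan stay cached, saving the costliest LLM calls.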
6. Cost Management & Efficiency¶
6.1 Affordable Generative Agents¶
- Authors: Yu et al.
- Published: 2024, arXiv:2402.xxxxx
Summary: Techniques for reducing the cost of running generative agent simulations by 100x while maintaining quality.
Cost Reduction Techniques:
- Observation batching: Don't process every game tick — batch observations and process periodically (e.g., every 5 seconds instead of every 0.25 seconds)
- Plan caching: Don't re-plan every action. Generate a plan, execute it step-by-step, and only re-plan when something unexpected happens or the plan is complete.
- Memory summarization: Periodically compress episodic memories into summaries, discarding raw entries. "I fought 3 wolves in the forest over 20 minutes" instead of 20 individual combat entries.
- Tiered models: Use cheap models (GPT-3.5, Haiku) for routine actions (movement, basic combat) and expensive models (GPT-4, Opus) for strategic decisions (quest planning, NPC dialogue).
- Template actions: Common action sequences (buy item from shop, navigate known path) are stored as templates and executed without LLM calls.
- Shared context: When multiple agents are in the same room, share the perception/parsing work.
Cost Estimates (from paper):
- Naive approach: ~$10/agent/hour with GPT-4
- Optimized approach: ~$0.10/agent/hour with tiered models + caching
- 100x reduction while maintaining 90%+ behavioral quality
Relevance to MAID AI Players:
- Critical for running multiple AI Players continuously
- Tiered model approach maps directly to MAID's LLMProviderRegistry
- Template actions for common MUD command sequences
- Observation batching for reducing per-tick LLM calls
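Two of these techniques — template actions and tiered model selection — fit naturally into one routing step. A sketch with illustrative names (`ActionRouter`, the template/routine sets, the tier labels are all assumptions, not MAID APIs):

```python
class ActionRouter:
    """Route an intent to the cheapest handler that can satisfy it:
    stored template (zero LLM calls) > cheap model > strategic model."""

    def __init__(self):
        # Common sequences executed without any LLM call.
        self.templates = {
            "buy_sword": ["move east", "list", "buy sword", "wield sword"],
        }
        # Routine intents a small/cheap model handles fine.
        self.routine = {"move", "look", "get", "flee"}

    def route(self, intent):
        if intent in self.templates:
            return ("template", self.templates[intent])
        if intent in self.routine:
            return ("cheap_llm", None)       # e.g. Haiku-class model
        return ("strategic_llm", None)       # e.g. quest planning, NPC dialogue
```

The ordering matters for cost: every intent resolved by a template or the cheap tier is a strategic-model call avoided.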
7. Game-Specific Surveys & Frameworks¶
7.1 A Survey on LLM-based Game Agents¶
- Authors: Hu, Huang, Liu, Kompella, Ilhan et al.
- Published: 2024, arXiv:2404.02039
- Citations: 128
Summary: Comprehensive survey of LLM agents in games. Categorizes by game type, agent architecture, and capability.
Agent Module Taxonomy:
1. Perception: Processing game output (text parsing, screen reading, API data)
2. Memory: Storing and retrieving past experiences
3. Planning: Goal setting, strategy, multi-step reasoning
4. Action: Generating and executing game commands
5. Learning: Improving over time (reflection, fine-tuning, RL)
6. Communication: Inter-agent messaging (in multi-agent settings)
Game Type Categories:
- Adventure/RPG (most relevant to MUD)
- Strategy
- Simulation/Sandbox
- Puzzle
- Social deduction
Key Finding: Text-based games are considered "ideal testbeds" for LLM agents because the input/output modality is already language — no vision encoder needed.
7.2 Large Language Models and Games: A Survey and Roadmap¶
- Authors: Gallotta, Todd, Zammit, Earle et al.
- Published: 2024, IEEE Transactions on Games
- Citations: 213
Summary: The most comprehensive LLM+games survey. Covers six roles for LLMs in games:
1. Playing games (as agents)
2. Testing games (as QA bots)
3. Generating content (PCG)
4. Narrating (dynamic storytelling)
5. NPC behavior (dialogue, decision-making)
6. Game design (assisting designers)
Relevance to MAID AI Players:
- AI Players fill both role 1 (playing) and role 2 (testing)
- The survey identifies key challenges for game-playing agents:
  - Grounding (connecting language to game state)
  - Long-term coherence (maintaining goals across sessions)
  - Exploration-exploitation tradeoff
  - Handling partial observability
8. Additional Relevant Work¶
8.1 Dynalang: Language Models for World Model Learning¶
- Authors: Lin et al.
- Published: 2023, arXiv:2308.01399
- Venue: ICML 2024
Summary: Agent that learns a multimodal world model predicting future states from text + observations. Can leverage game manuals, rule descriptions, and NPC dialogue to build understanding.
Relevance to MAID AI Players:
- MUD help files and game documentation can be fed to the agent
- The agent can build predictive models ("if I go north, I'll be in the forest")
- A world model helps with planning without executing actions
8.2 PokéAI: Multi-Agent LLM System for Pokémon Red¶
- Authors: Liu et al.
- Published: 2025, arXiv:2506.23689
Summary: Multi-agent system that plays Pokémon Red with goal generation, battle optimization, and navigation. Uses multiple specialized agents (navigator, battler, strategist) that coordinate.
Relevance to MAID AI Players:
- Multi-agent decomposition: separate agents for different MUD activities
- Navigator agent → room exploration and pathfinding
- Battler agent → combat decisions
- Strategist agent → overall goal planning
8.3 LLM2TextGame: Multi-Agents for Generating Consistent Text Adventure Games¶
- Authors: Bazarvaani, Na
- Published: 2024, Korean HCI Conference
Summary: Uses multiple LLM agents to generate consistent text adventure games. Interesting reverse perspective — instead of playing games, generating them. But the consistency mechanisms are relevant.
8.4 WorldWeaver: Procedural World Generation for Text Adventure Games¶
- Authors: Jin, Kaul, Ramakrishanan, Jain et al. (UPenn)
Summary: LLM-based procedural world generation for text adventures. Focus on ensuring spatial consistency and narrative coherence.
Relevance to MAID AI Players:
- AI Players could potentially help test procedurally generated content
- Spatial consistency checking techniques applicable to map building
8.5 PsychoGAT: Psychological Measurement Through Interactive Fiction with LLM Agents¶
- Authors: Yang, Wang, Chen, Wang, Pu et al.
- Published: 2024, ACL
- Citations: 45
Summary: Uses interactive fiction games with LLM agents to measure psychological traits. Demonstrates that LLM agents can exhibit consistent personality traits during gameplay.
Relevance to MAID AI Players:
- AI Players can have distinct, consistent personalities
- Personality affects gameplay style (aggressive vs cautious, social vs solo)
9. Synthesis: Key Design Principles¶
From the comprehensive literature review, these principles emerge for designing MAID AI Players:
Principle 1: Cognitive Architecture is Essential¶
Every successful game agent uses a perception → memory → planning → action loop. No single LLM call can substitute for this structured approach.
Principle 2: Memory Must Be Multi-Layered¶
Working memory (current context), episodic memory (past events), semantic memory (learned facts), and reflective memory (meta-insights) all serve distinct purposes. Cutting any layer significantly degrades performance.
Principle 3: Planning Must Be Hierarchical¶
Session-level goals, phase-level plans, task-level sequences, and action-level commands. Only re-plan at the level that was invalidated. This is both more effective and more cost-efficient.
Principle 4: Explicit State Tracking is Non-Negotiable¶
LLMs cannot reliably track game state (inventory, health, map) in their context window alone. Explicit structured state must be maintained outside the LLM and fed as context.
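The usual pattern is a small structured state object maintained by game-output parsers and serialized into every prompt. A minimal sketch with assumed fields (`AgentState` and its members are illustrative, not a MAID type):

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Structured game state kept outside the LLM.

    Updated by parsing MUD output; rendered into each prompt so the model
    never has to reconstruct inventory or health from conversation history."""
    health: int = 100
    location: str = "town_square"
    inventory: list = field(default_factory=list)

    def to_prompt(self) -> str:
        """Compact one-line rendering prepended to the LLM context."""
        items = ", ".join(self.inventory) or "empty"
        return f"HP: {self.health} | Location: {self.location} | Inventory: {items}"
```

Because the state lives in code, it stays correct across context-window truncation and can also drive non-LLM checks (e.g., refuse combat below a health threshold).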
Principle 5: Reflection Enables Zero-Cost Learning¶
Verbal self-reflection (Reflexion pattern) allows agents to learn from failures without fine-tuning. This is critical for MUD gameplay where death and failure are frequent.
Principle 6: Cost Control Requires Tiered Architecture¶
Use cheap models for routine actions, expensive models for strategic decisions. Cache plans, batch observations, and use templates for common sequences. Target: <$0.10/agent/hour.
Principle 7: Multi-Agent Benefits from Shared Knowledge¶
When running multiple AI Players, sharing map knowledge, combat tactics, and quest solutions across agents dramatically improves collective performance.
Principle 8: Human-Likeness Requires Deliberate Design¶
Variable timing, personality consistency, exploration patterns, and social behavior must be explicitly designed — they don't emerge from LLM capabilities alone.
Principle 9: Text Games are Ideal for LLM Agents¶
The input/output modality is already language. No vision needed. The challenge shifts to reasoning, planning, and memory — which are LLM strengths.
Principle 10: Observability is Critical¶
Thought traces (ReAct), memory contents, plan state, and decision rationale must all be inspectable for debugging and evaluation.
10. Technology Landscape¶
10.1 Existing Frameworks¶
| Framework | Focus | Language | Status |
|---|---|---|---|
| LangChain/LangGraph | General agents | Python | Active, widely used |
| AutoGen (Microsoft) | Multi-agent | Python | Active |
| CrewAI | Multi-agent teams | Python | Active |
| Agents (Zhou et al.) | Modular agents | Python | Research |
| PIANO (Altera) | Scaled agents | Proprietary | Research |
| Mem0 | Agent memory | Python | Active, open-source |
10.2 Relevant Game Environments¶
| Environment | Type | Interface | Status |
|---|---|---|---|
| Jericho | 542 IF games | Text | Active benchmark |
| TALES | IF benchmark suite | Text | 2025, latest |
| TextWorld | Procedural text games | Text API | Microsoft, active |
| ALFWorld | Embodied text games | Text + vision | Active benchmark |
| NetHack (NLE) | Roguelike | Text/terminal | Active benchmark |
| MiniHack | NetHack subset | API | Active |
| Minecraft (STEVE/Voyager) | 3D sandbox | API + vision | Active research |
10.3 LLM Providers for AI Players¶
| Provider | Best Model | Cost/1M tokens | Latency | Best For |
|---|---|---|---|---|
| Anthropic | Claude Sonnet 4 | ~$3 in / $15 out | ~1s | Strategic planning, reflection |
| Anthropic | Claude Haiku 3.5 | ~$0.25 in / $1.25 out | ~0.3s | Routine actions, parsing |
| OpenAI | GPT-4o | ~$2.50 in / $10 out | ~0.8s | General purpose |
| OpenAI | GPT-4o-mini | ~$0.15 in / $0.60 out | ~0.3s | Routine actions |
| Ollama | Llama 3.2 | Free (local) | Varies | Development, testing |
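The prices above make per-agent-hour cost a simple back-of-envelope calculation. A sketch where the call rate and token counts are assumptions chosen for illustration, not measurements:

```python
def hourly_cost(calls_per_hour, in_tokens, out_tokens, price_in, price_out):
    """Cost per agent-hour in dollars.

    `price_in`/`price_out` are $ per 1M tokens, as in the table above;
    `in_tokens`/`out_tokens` are per-call averages (assumed values)."""
    per_call = (in_tokens * price_in + out_tokens * price_out) / 1_000_000
    return calls_per_hour * per_call

# Example: 120 routine calls/hour on a Haiku-class tier
# (2,000 prompt tokens, 200 completion tokens per call at $0.25/$1.25 per 1M):
# hourly_cost(120, 2000, 200, 0.25, 1.25) -> 0.09, i.e. ~$0.09/agent/hour
```

At those assumed volumes a cheap-tier agent lands just under the $0.10/agent/hour target from Section 6, while the same traffic on a Sonnet-class tier would cost roughly 12x more.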
11. References¶
- Park, J.S., O'Brien, J.C., Cai, C.J., Morris, M.R., Liang, P., & Bernstein, M.S. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST 2023. arXiv:2304.03442.
- Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., & Anandkumar, A. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. NeurIPS 2023. arXiv:2305.16291.
- Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629.
- Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366.
- Altera.AI et al. (2024). Project SID: Many-Agent Simulations Toward AI Civilization. arXiv:2411.00114.
- Lin, J. et al. (2023). Dynalang: Language Models for World Model Learning. ICML 2024. arXiv:2308.01399.
- Hu, S., Huang, T., Liu, G., Kompella, R.R., Ilhan, F. et al. (2024). A Survey on LLM-based Game Agents. arXiv:2404.02039.
- Gallotta, R., Todd, G., Zammit, M., Earle, S. et al. (2024). Large Language Models and Games: A Survey and Roadmap. IEEE Transactions on Games. arXiv:2402.18659.
- Cui, C.Z., Yuan, X., Xiao, Z., Ammanabrolu, P. et al. (2025). TALES: Text Adventure Learning Environment Suite. arXiv:2504.14128.
- Zhang, J. & Long, Y. (2025). Learning to Play Like Humans: A Framework for LLM Adaptation in Interactive Fiction Games. ACL Findings 2025.
- Basavatia, S. & Murugesan, K. (2024). Starling: Self-supervised Training of Text-based Reinforcement Learning Agent with Large Language Models. ACL Findings 2024.
- Wang, J. et al. (2025). Digital Player: Evaluating Large Language Models based Human-like Agent in Games. arXiv:2502.20807.
- Yu, J. et al. (2024). Affordable Generative Agents. arXiv:2402.xxxxx.
- Zhou, W. et al. (2023). Agents: An Open-source Framework for Autonomous Language Agents. arXiv:2309.07870.
- Qian, C. et al. (2023). Experiential Co-Learning of Software-Developing Agents. arXiv:2312.17025.
- Hu, Y. et al. (2025). Memory in the Age of AI Agents. arXiv:2512.13564.
- Mem0.ai (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413.
- Putta, P. et al. (2024). Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents. arXiv:2408.xxxxx.
- Liu, H. et al. (2025). PokéAI: A Goal-Generating, Battle-Optimizing Multi-agent LLM System. arXiv:2506.23689.
- Hu, M. et al. (2025). AgentGen: Enhancing Planning Abilities for LLM-based Agent via Environment and Task Generation. ACM KDD 2025.