# Observability Guide

## Overview
MAID includes production-grade observability via the maid_engine.observability package. All observability is protocol-based (no singletons), owned by GameEngine, and controllable via configuration. The system is designed to never interfere with game logic — all instrumentation calls are wrapped in safety mechanisms that prevent observability failures from affecting gameplay.
Key features:
- Structured Logging: JSON and console output via structlog with context propagation
- Prometheus Metrics: Tick timing, entity counts, command latency, AI cost tracking
- Health Checks: Kubernetes-compatible /healthz, /readyz, /livez endpoints
- OpenTelemetry Tracing: Optional distributed tracing with configurable verbosity
- AI Cost Tracking: Per-model token budgets, cost estimation, and circuit breakers
- Pre-built Dashboards: Grafana dashboard JSON definitions for common views
- SLO/SLI Definitions: Tick, command, and AI latency targets with burn-rate alerting
- Safe Instrumentation: safe_observe() ensures observability never crashes game logic
## Configuration

All observability settings use the `MAID_OBSERVABILITY__` environment variable prefix.
```bash
# Master switch
MAID_OBSERVABILITY__ENABLED=true

# Logging
MAID_OBSERVABILITY__JSON_LOGS=true          # JSON (prod) or console (dev)
MAID_OBSERVABILITY__LOG_LEVEL=INFO
MAID_OBSERVABILITY__LOG_SAMPLING_ENABLED=true

# Metrics
MAID_OBSERVABILITY__METRICS_ENABLED=true
MAID_OBSERVABILITY__INTERNAL_HOST=127.0.0.1
MAID_OBSERVABILITY__INTERNAL_PORT=9090
MAID_OBSERVABILITY__METRICS_TOKEN=          # Required when host != 127.0.0.1

# Tracing (optional)
MAID_OBSERVABILITY__TRACING_MODE=default    # minimal, default, verbose
```
When MAID_OBSERVABILITY__ENABLED=false, all instrumentation becomes a no-op. Individual subsystems (metrics, tracing) can be toggled independently.
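To illustrate how a prefixed-env-var scheme like this is typically parsed, here is a minimal, self-contained sketch. The helper `load_observability_flags` is hypothetical and is not part of MAID's actual settings loader; it only demonstrates the prefix-stripping and boolean-coercion idea.

```python
import os

def load_observability_flags(environ=os.environ):
    """Parse MAID_OBSERVABILITY__* variables into a plain dict of flags.

    Hypothetical helper for illustration only -- MAID's real loader
    may differ in naming and behavior.
    """
    prefix = "MAID_OBSERVABILITY__"
    flags = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):].lower()
        # Coerce the common boolean spellings; keep everything else as a string.
        if value.lower() in ("true", "false"):
            flags[name] = value.lower() == "true"
        else:
            flags[name] = value
    return flags

# Example: only the prefixed variables are picked up.
env = {
    "MAID_OBSERVABILITY__ENABLED": "true",
    "MAID_OBSERVABILITY__LOG_LEVEL": "INFO",
    "PATH": "/usr/bin",
}
flags = load_observability_flags(env)
```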
## Health Checks
Three health check endpoints are served on the internal port (default 9090):
| Endpoint | Purpose | 200 When |
|---|---|---|
| `/healthz` | Basic health | Process is running |
| `/readyz` | Readiness | Accepting player connections |
| `/livez` | Liveness | Tick loop is active |
These are designed for Kubernetes probes or any external monitoring system:
```yaml
# Example Kubernetes probe configuration
livenessProbe:
  httpGet:
    path: /livez
    port: 9090
  initialDelaySeconds: 10
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /readyz
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 3
```
## Prometheus Metrics

Metrics are exposed at `/metrics` on the internal port in Prometheus exposition format.

### Core Metrics
| Metric | Type | Description |
|---|---|---|
| `maid_tick_duration_seconds` | Histogram | Tick loop timing |
| `maid_entity_count` | Gauge | Current entity count |
| `maid_command_duration_seconds` | Histogram | Command execution latency |
| `maid_connections_active` | Gauge | Active player connections |
| `maid_save_duration_seconds` | Histogram | Entity persistence save timing |
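These histograms lend themselves to percentile queries. As an illustrative sketch (the recording-rule names below are suggestions, not shipped rules), p99 latencies can be precomputed with Prometheus recording rules:

```yaml
# Illustrative recording rules built on the metrics above.
groups:
  - name: maid_latency
    rules:
      - record: maid:tick_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(maid_tick_duration_seconds_bucket[5m])) by (le))
      - record: maid:command_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(maid_command_duration_seconds_bucket[5m])) by (le))
```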
### AI Metrics

| Metric | Type | Description |
|---|---|---|
| `maid_ai_request_duration_seconds` | Histogram | LLM call latency |
| `maid_ai_tokens_total` | Counter | Token usage (labels: provider, model, direction) |
| `maid_ai_cost_dollars_total` | Counter | Estimated spend in US dollars |
### Content Pack Metrics
Content packs can register custom metrics using the pack metrics API:
```python
from maid_engine.observability.metrics import register_pack_counter, register_pack_histogram

# Register a counter for your content pack
events_counter = register_pack_counter(
    "my_pack", "events_processed", "Events processed", ["event_type"]
)

# Register a histogram
duration_hist = register_pack_histogram(
    "my_pack", "action_duration_seconds", "Action processing time"
)

# Use in your systems
events_counter.labels(event_type="combat").inc()
with duration_hist.time():
    await process_action()
```
All pack metrics are automatically prefixed with `maid_pack_<pack_name>_` to avoid collisions.
## Structured Logging
MAID uses structlog for structured logging with two output modes:
- JSON mode (`MAID_OBSERVABILITY__JSON_LOGS=true`): Machine-readable JSON lines for production log aggregators (ELK, Loki, Datadog)
- Console mode (`MAID_OBSERVABILITY__JSON_LOGS=false`): Human-readable colored output for development
### Context Propagation

MaidContext automatically attaches contextual fields to every log entry:

| Field | Description |
|---|---|
| `correlation_id` | Unique ID for tracing a request flow |
| `player_id` | Current player (anonymized) |
| `session_id` | Network session identifier |
| `command` | Command being executed |
| `tick` | Current game tick number |
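The core idea behind this kind of context propagation can be shown with a standard-library sketch. The `log_event` helper below is hypothetical and stands in for MaidContext plus structlog; it only demonstrates how a bound `correlation_id` ends up on every emitted line.

```python
import contextvars
import json
import uuid

# Context variable carrying the current request's correlation ID
# (illustrative stand-in for MAID's MaidContext).
correlation_id = contextvars.ContextVar("correlation_id", default="-")

def log_event(event, **fields):
    """Render one JSON log line with the bound correlation_id attached."""
    entry = {"event": event, "correlation_id": correlation_id.get(), **fields}
    return json.dumps(entry)

# Bind once at the start of a request flow; every later log line carries it.
correlation_id.set(str(uuid.uuid4()))
line = log_event("command_executed", command="look", tick=1042)
```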
### Privacy-Aware Redaction
The PlayerIDAnonymizer processor automatically redacts player-identifying information from operational logs. Audit logs retain full identifiers through a separate log channel.
### Adaptive Log Sampling
When MAID_OBSERVABILITY__LOG_SAMPLING_ENABLED=true, high-volume operational messages (e.g., per-tick system logs) are adaptively sampled to reduce log volume without losing visibility into errors or unusual events. Errors and warnings are never sampled.
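A toy version of the sampling rule makes the guarantee concrete: records below WARNING are thinned out, while warnings and errors always pass. This `RateSampler` is a simplified stand-in, not MAID's adaptive implementation.

```python
import logging

class RateSampler:
    """Keep 1 out of every N sub-WARNING records; never drop warnings/errors.

    Simplified stand-in for MAID's adaptive sampler, which adjusts
    its rate to traffic rather than using a fixed N.
    """
    def __init__(self, keep_one_in=10):
        self.keep_one_in = keep_one_in
        self._seen = 0

    def should_emit(self, level):
        if level >= logging.WARNING:
            return True  # errors and warnings are never sampled
        self._seen += 1
        return self._seen % self.keep_one_in == 1

sampler = RateSampler(keep_one_in=10)
# Out of 100 INFO records, only 10 are emitted...
emitted = sum(sampler.should_emit(logging.INFO) for _ in range(100))
# ...but an ERROR always passes through.
error_kept = sampler.should_emit(logging.ERROR)
```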
### Audit vs Operational Logs
MAID maintains two separate log channels:
- Operational logs: Standard application logging (filtered, sampled, redacted)
- Audit logs: Security-relevant events — logins, permission changes, admin actions (never sampled, full detail)
## OpenTelemetry Tracing
Optional distributed tracing is available via OpenTelemetry with three verbosity modes:
| Mode | What is traced |
|---|---|
| `minimal` | Entry-point spans only (connections, startup) |
| `default` | Commands and AI calls |
| `verbose` | All systems, including per-tick processing |
Configure the OTLP exporter endpoint:
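The exact MAID variable for the exporter endpoint is not documented here; assuming the standard OpenTelemetry environment variables are honored (verify against your deployment), a typical setup might look like:

```bash
# Standard OpenTelemetry exporter variables (assumed; check your deployment)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
MAID_OBSERVABILITY__TRACING_MODE=default
```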
Traces integrate with Jaeger, Zipkin, or any OTLP-compatible backend. Spans are automatically correlated with structured log entries via the correlation_id.
## AI Cost Tracking
The AICostTracker monitors LLM usage and estimated costs across all AI providers.
### Features
- Per-model pricing: Configurable input/output token prices per model
- Rolling token budgets: Hourly and daily budget enforcement
- Runtime price updates: Admin API at `/admin/ai/pricing` for adjusting prices without restart
- Circuit breaker integration: Automatic provider fallback on repeated failures
### Budget Configuration

```bash
# Global daily token budget
MAID_AI_DIALOGUE_DAILY_TOKEN_BUDGET=100000

# Per-player daily budget
MAID_AI_DIALOGUE_PER_PLAYER_DAILY_BUDGET=5000

# Global rate limit
MAID_AI_DIALOGUE_GLOBAL_RATE_LIMIT_RPM=60

# Per-player rate limit
MAID_AI_DIALOGUE_PER_PLAYER_RATE_LIMIT_RPM=10
```
When a budget is exhausted, AI requests are gracefully rejected with a player-facing message rather than failing silently.
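The enforcement pattern can be sketched in a few lines. This `DailyTokenBudget` class and its rejection message are illustrative only; `AICostTracker` is the real implementation.

```python
import time

class DailyTokenBudget:
    """Toy daily token budget with a graceful rejection message.

    Illustrative sketch -- MAID's AICostTracker handles the real
    hourly/daily windows, per-player budgets, and rate limits.
    """
    def __init__(self, limit, now=time.time):
        self.limit = limit
        self.now = now
        self._window_start = now()
        self._used = 0

    def try_spend(self, tokens):
        # Reset the window once 24 hours have elapsed.
        if self.now() - self._window_start >= 86400:
            self._window_start = self.now()
            self._used = 0
        if self._used + tokens > self.limit:
            # Reject with a player-facing message instead of failing silently.
            return False, "The spirits are resting; please try again later."
        self._used += tokens
        return True, None

budget = DailyTokenBudget(limit=100)
ok, _ = budget.try_spend(90)      # within budget
ok2, msg = budget.try_spend(20)   # would exceed the limit: rejected
```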
## Grafana Dashboards

Pre-built Grafana dashboard JSON definitions are located at `observability/dashboards/`:
- Server Health — Tick timing, entity counts, active connections, error rates
- AI Cost Overview — Token usage by provider/model, cost trends, rate limit hits
### Importing Dashboards

1. Open Grafana → Dashboards → Import
2. Upload the JSON file from `observability/dashboards/`
3. Select your Prometheus data source
4. Save
Dashboards can also be provisioned automatically via Grafana's dashboard provisioning configuration.
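A minimal provisioning file in Grafana's standard format might look like the following (the file path and provider name are examples, not shipped configuration):

```yaml
# e.g. grafana/provisioning/dashboards/maid.yaml (illustrative path)
apiVersion: 1
providers:
  - name: maid
    type: file
    options:
      # Copy the JSON files from observability/dashboards/ here.
      path: /var/lib/grafana/dashboards/maid
```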
## SLO/SLI Definitions

Service Level Objectives are defined in `observability/slo/` as Prometheus recording and alerting rules.
| SLI | Target |
|---|---|
| Tick latency p99 | < 100ms |
| Command latency p99 | < 500ms |
| AI call latency p99 | < 5s |
Alerting uses multi-window burn-rate calculation to minimize false positives while catching sustained degradation early.
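The shape of such an alert can be sketched as follows. The recording-rule names and thresholds here are invented for illustration; the shipped rules in `observability/slo/` are authoritative.

```yaml
# Illustrative multi-window burn-rate alert for the tick-latency SLO.
# Both a long and a short window must burn fast before paging.
groups:
  - name: maid_slo_alerts
    rules:
      - alert: MaidTickLatencyBudgetBurn
        expr: |
          maid:tick_latency_error_rate:1h > (14.4 * 0.001)
          and
          maid:tick_latency_error_rate:5m > (14.4 * 0.001)
        labels:
          severity: page
```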
## Runbooks

Operational runbooks are located at `observability/runbooks/` and provide step-by-step response procedures for common alerts:
| Alert | Trigger | Runbook Summary |
|---|---|---|
| `MaidTickLoopStalled` | Tick loop not advancing | Check system load, inspect stuck systems |
| `MaidAIBudgetExhausted` | AI token budget exhausted | Review usage, adjust budgets or disable |
| `MaidTargetDown` | Server unreachable by Prometheus | Check process, network, firewall |
Each runbook includes severity classification, investigation steps, and resolution procedures.
## safe_observe()
All instrumentation calls throughout the codebase are wrapped in the safe_observe() context manager. This ensures that observability code never interferes with game logic.
```python
from maid_engine.observability import safe_observe

async def process_tick(self, delta: float) -> None:
    with safe_observe():
        tick_histogram.observe(delta)
        entity_gauge.set(self.world.entity_count)
    # Game logic continues regardless of instrumentation errors
    await self.run_systems(delta)
```
### Behavior

- Catches all exceptions raised by instrumentation code
- Rate-limits error logging to avoid log flooding from repeated failures
- Tracks a meta-counter (`maid_observability_errors_total`) for instrumentation errors
- Never propagates exceptions to the calling game logic
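One way a guard with these properties could be built (a sketch, not MAID's actual implementation; the meta-counter increment is omitted for brevity):

```python
import contextlib
import logging
import time

logger = logging.getLogger("maid.observability")

# Module-level state keeps the sketch short; a real implementation
# would track this on an object owned by the engine.
_last_logged = [0.0]

@contextlib.contextmanager
def safe_observe():
    """Swallow instrumentation errors and rate-limit the resulting logs.

    Illustrative sketch of the behavior listed above.
    """
    try:
        yield
    except Exception:
        now = time.monotonic()
        if now - _last_logged[0] > 60:  # log at most once per minute
            _last_logged[0] = now
            logger.exception("instrumentation error suppressed")

# A failing metric call is absorbed; execution continues afterwards.
with safe_observe():
    raise RuntimeError("broken metric")
```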