# Observability Guide

## Overview
MAID includes production-grade observability via the maid_engine.observability package. All observability is protocol-based (no singletons), owned by GameEngine, and controllable via configuration. The system is designed to never interfere with game logic — all instrumentation calls are wrapped in safety mechanisms that prevent observability failures from affecting gameplay.
Key features:
- Structured Logging: JSON and console output via structlog with context propagation
- Prometheus Metrics: Tick timing, entity counts, command latency, AI cost tracking
- Health Checks: Kubernetes-compatible /healthz, /readyz, /livez endpoints
- OpenTelemetry Tracing: Optional distributed tracing with configurable verbosity
- AI Cost Tracking: Per-model token budgets, cost estimation, and circuit breakers
- Pre-built Dashboards: Grafana dashboard JSON definitions for common views
- SLO/SLI Definitions: Tick, command, and AI latency targets with burn-rate alerting
- Safe Instrumentation: safe_observe() ensures observability never crashes game logic
## Configuration

All observability settings use the `MAID_OBSERVABILITY__` environment variable prefix.
```bash
# Master switch
MAID_OBSERVABILITY__ENABLED=true

# Logging
MAID_OBSERVABILITY__JSON_LOGS=true          # JSON (prod) or console (dev)
MAID_OBSERVABILITY__LOG_LEVEL=INFO
MAID_OBSERVABILITY__LOG_SAMPLING_ENABLED=true

# Metrics
MAID_OBSERVABILITY__METRICS_ENABLED=true
MAID_OBSERVABILITY__INTERNAL_HOST=127.0.0.1
MAID_OBSERVABILITY__INTERNAL_PORT=9090
MAID_OBSERVABILITY__METRICS_TOKEN=          # Required when host != 127.0.0.1

# Tracing (optional)
MAID_OBSERVABILITY__TRACING_MODE=default    # minimal, default, verbose
```
When MAID_OBSERVABILITY__ENABLED=false, all instrumentation becomes a no-op. Individual subsystems (metrics, tracing) can be toggled independently.
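To illustrate how a prefixed-env-var scheme like this is typically parsed, here is a minimal, self-contained sketch. The helper `load_observability_flags` is hypothetical and is not part of MAID's actual settings loader; it only demonstrates the prefix-stripping and boolean-coercion idea.

```python
import os

def load_observability_flags(environ=os.environ):
    """Parse MAID_OBSERVABILITY__* variables into a plain dict of flags.

    Hypothetical helper for illustration only -- MAID's real loader
    may differ in naming and behavior.
    """
    prefix = "MAID_OBSERVABILITY__"
    flags = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):].lower()
        # Coerce the common boolean spellings; keep everything else as a string.
        if value.lower() in ("true", "false"):
            flags[name] = value.lower() == "true"
        else:
            flags[name] = value
    return flags

# Example: only the prefixed variables are picked up.
env = {
    "MAID_OBSERVABILITY__ENABLED": "true",
    "MAID_OBSERVABILITY__LOG_LEVEL": "INFO",
    "PATH": "/usr/bin",
}
flags = load_observability_flags(env)
```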
## Health Checks
Three health check endpoints are served on the internal port (default 9090):
| Endpoint | Purpose | 200 When |
|---|---|---|
| `/healthz` | Basic health | Process is running |
| `/readyz` | Readiness | Accepting player connections |
| `/livez` | Liveness | Tick loop is active |
These are designed for Kubernetes probes or any external monitoring system:
```yaml
# Example Kubernetes probe configuration
livenessProbe:
  httpGet:
    path: /livez
    port: 9090
  initialDelaySeconds: 10
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /readyz
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 3
```
## Prometheus Metrics

Metrics are exposed at `/metrics` on the internal port in Prometheus exposition format.

### Core Metrics
| Metric | Type | Description |
|---|---|---|
| `maid_tick_duration_seconds` | Histogram | Tick loop timing |
| `maid_entity_count` | Gauge | Current entity count |
| `maid_command_duration_seconds` | Histogram | Command execution latency |
| `maid_connections_active` | Gauge | Active player connections |
| `maid_save_duration_seconds` | Histogram | Entity persistence save timing |
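These histograms lend themselves to percentile queries. As an illustrative sketch (the recording-rule names below are suggestions, not shipped rules), p99 latencies can be precomputed with Prometheus recording rules:

```yaml
# Illustrative recording rules built on the metrics above.
groups:
  - name: maid_latency
    rules:
      - record: maid:tick_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(maid_tick_duration_seconds_bucket[5m])) by (le))
      - record: maid:command_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(maid_command_duration_seconds_bucket[5m])) by (le))
```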
### AI Metrics

| Metric | Type | Description |
|---|---|---|
| `maid_ai_request_duration_seconds` | Histogram | LLM call latency |
| `maid_ai_tokens_total` | Counter | Token usage (labels: provider, model, direction) |
| `maid_ai_cost_dollars_total` | Counter | Estimated spend in US dollars |
### Content Pack Metrics
Content packs can register custom metrics using the pack metrics API:
```python
from maid_engine.observability.metrics import register_pack_counter, register_pack_histogram

# Register a counter for your content pack
events_counter = register_pack_counter(
    "my_pack", "events_processed", "Events processed", ["event_type"]
)

# Register a histogram
duration_hist = register_pack_histogram(
    "my_pack", "action_duration_seconds", "Action processing time"
)

# Use in your systems
events_counter.labels(event_type="combat").inc()
with duration_hist.time():
    await process_action()
```
All pack metrics are automatically prefixed with `maid_pack_<pack_name>_` to avoid collisions.
## Structured Logging
MAID uses structlog for structured logging with two output modes:
- JSON mode (`MAID_OBSERVABILITY__JSON_LOGS=true`): Machine-readable JSON lines for production log aggregators (ELK, Loki, Datadog)
- Console mode (`MAID_OBSERVABILITY__JSON_LOGS=false`): Human-readable colored output for development
### Context Propagation

MaidContext automatically attaches contextual fields to every log entry:

| Field | Description |
|---|---|
| `correlation_id` | Unique ID for tracing a request flow |
| `player_id` | Current player (anonymized) |
| `session_id` | Network session identifier |
| `command` | Command being executed |
| `tick` | Current game tick number |
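The core idea behind this kind of context propagation can be shown with a standard-library sketch. The `log_event` helper below is hypothetical and stands in for MaidContext plus structlog; it only demonstrates how a bound `correlation_id` ends up on every emitted line.

```python
import contextvars
import json
import uuid

# Context variable carrying the current request's correlation ID
# (illustrative stand-in for MAID's MaidContext).
correlation_id = contextvars.ContextVar("correlation_id", default="-")

def log_event(event, **fields):
    """Render one JSON log line with the bound correlation_id attached."""
    entry = {"event": event, "correlation_id": correlation_id.get(), **fields}
    return json.dumps(entry)

# Bind once at the start of a request flow; every later log line carries it.
correlation_id.set(str(uuid.uuid4()))
line = log_event("command_executed", command="look", tick=1042)
```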
### Privacy-Aware Redaction
The PlayerIDAnonymizer processor automatically redacts player-identifying information from operational logs. Audit logs retain full identifiers through a separate log channel.
### Adaptive Log Sampling
When MAID_OBSERVABILITY__LOG_SAMPLING_ENABLED=true, high-volume operational messages (e.g., per-tick system logs) are adaptively sampled to reduce log volume without losing visibility into errors or unusual events. Errors and warnings are never sampled.
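A toy version of the sampling rule makes the guarantee concrete: records below WARNING are thinned out, while warnings and errors always pass. This `RateSampler` is a simplified stand-in, not MAID's adaptive implementation.

```python
import logging

class RateSampler:
    """Keep 1 out of every N sub-WARNING records; never drop warnings/errors.

    Simplified stand-in for MAID's adaptive sampler, which adjusts
    its rate to traffic rather than using a fixed N.
    """
    def __init__(self, keep_one_in=10):
        self.keep_one_in = keep_one_in
        self._seen = 0

    def should_emit(self, level):
        if level >= logging.WARNING:
            return True  # errors and warnings are never sampled
        self._seen += 1
        return self._seen % self.keep_one_in == 1

sampler = RateSampler(keep_one_in=10)
# Out of 100 INFO records, only 10 are emitted...
emitted = sum(sampler.should_emit(logging.INFO) for _ in range(100))
# ...but an ERROR always passes through.
error_kept = sampler.should_emit(logging.ERROR)
```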
### Audit vs Operational Logs
MAID maintains two separate log channels:
- Operational logs: Standard application logging (filtered, sampled, redacted)
- Audit logs: Security-relevant events — logins, permission changes, admin actions (never sampled, full detail)
## OpenTelemetry Tracing
Optional distributed tracing is available via OpenTelemetry with three verbosity modes:
| Mode | What is traced |
|---|---|
| `minimal` | Entry-point spans only (connections, startup) |
| `default` | Commands and AI calls |
| `verbose` | All systems, including per-tick processing |
Configure the OTLP exporter endpoint:
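The exact MAID variable for the exporter endpoint is not documented here; assuming the standard OpenTelemetry environment variables are honored (verify against your deployment), a typical setup might look like:

```bash
# Standard OpenTelemetry exporter variables (assumed; check your deployment)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
MAID_OBSERVABILITY__TRACING_MODE=default
```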
Traces integrate with Jaeger, Zipkin, or any OTLP-compatible backend. Spans are automatically correlated with structured log entries via the correlation_id.
## AI Cost Tracking
The AICostTracker monitors LLM usage and estimated costs across all AI providers.
### Features
- Per-model pricing: Configurable input/output token prices per model
- Rolling token budgets: Hourly and daily budget enforcement
- Runtime price updates: Admin API at `/admin/ai/pricing` for adjusting prices without restart
- Circuit breaker integration: Automatic provider fallback on repeated failures
### Budget Configuration

```bash
# Global daily token budget
MAID_AI_DIALOGUE_DAILY_TOKEN_BUDGET=100000

# Per-player daily budget
MAID_AI_DIALOGUE_PER_PLAYER_DAILY_BUDGET=5000

# Global rate limit
MAID_AI_DIALOGUE_GLOBAL_RATE_LIMIT_RPM=60

# Per-player rate limit
MAID_AI_DIALOGUE_PER_PLAYER_RATE_LIMIT_RPM=10
```
When a budget is exhausted, AI requests are gracefully rejected with a player-facing message rather than failing silently.
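The enforcement pattern can be sketched in a few lines. This `DailyTokenBudget` class and its rejection message are illustrative only; `AICostTracker` is the real implementation.

```python
import time

class DailyTokenBudget:
    """Toy daily token budget with a graceful rejection message.

    Illustrative sketch -- MAID's AICostTracker handles the real
    hourly/daily windows, per-player budgets, and rate limits.
    """
    def __init__(self, limit, now=time.time):
        self.limit = limit
        self.now = now
        self._window_start = now()
        self._used = 0

    def try_spend(self, tokens):
        # Reset the window once 24 hours have elapsed.
        if self.now() - self._window_start >= 86400:
            self._window_start = self.now()
            self._used = 0
        if self._used + tokens > self.limit:
            # Reject with a player-facing message instead of failing silently.
            return False, "The spirits are resting; please try again later."
        self._used += tokens
        return True, None

budget = DailyTokenBudget(limit=100)
ok, _ = budget.try_spend(90)      # within budget
ok2, msg = budget.try_spend(20)   # would exceed the limit: rejected
```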
## Grafana Dashboards

Pre-built Grafana dashboard JSON definitions are located at `observability/dashboards/`:
- Server Health — Tick timing, entity counts, active connections, error rates
- AI Cost Overview — Token usage by provider/model, cost trends, rate limit hits
### Importing Dashboards

1. Open Grafana → Dashboards → Import
2. Upload the JSON file from `observability/dashboards/`
3. Select your Prometheus data source
4. Save
Dashboards can also be provisioned automatically via Grafana's dashboard provisioning configuration.
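A minimal provisioning file in Grafana's standard format might look like the following (the file path and provider name are examples, not shipped configuration):

```yaml
# e.g. grafana/provisioning/dashboards/maid.yaml (illustrative path)
apiVersion: 1
providers:
  - name: maid
    type: file
    options:
      # Copy the JSON files from observability/dashboards/ here.
      path: /var/lib/grafana/dashboards/maid
```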
## SLO/SLI Definitions

Service Level Objectives are defined in `observability/slo/` as Prometheus recording and alerting rules.
| SLI | Target |
|---|---|
| Tick latency p99 | < 100ms |
| Command latency p99 | < 500ms |
| AI call latency p99 | < 5s |
Alerting uses multi-window burn-rate calculation to minimize false positives while catching sustained degradation early.
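The shape of such an alert can be sketched as follows. The recording-rule names and thresholds here are invented for illustration; the shipped rules in `observability/slo/` are authoritative.

```yaml
# Illustrative multi-window burn-rate alert for the tick-latency SLO.
# Both a long and a short window must burn fast before paging.
groups:
  - name: maid_slo_alerts
    rules:
      - alert: MaidTickLatencyBudgetBurn
        expr: |
          maid:tick_latency_error_rate:1h > (14.4 * 0.001)
          and
          maid:tick_latency_error_rate:5m > (14.4 * 0.001)
        labels:
          severity: page
```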
## Runbooks

Operational runbooks are located at `observability/runbooks/` and provide step-by-step response procedures for common alerts:
| Alert | Trigger | Runbook Summary |
|---|---|---|
| `MaidTickLoopStalled` | Tick loop not advancing | Check system load, inspect stuck systems |
| `MaidAIBudgetExhausted` | AI token budget exhausted | Review usage, adjust budgets or disable |
| `MaidTargetDown` | Server unreachable by Prometheus | Check process, network, firewall |
Each runbook includes severity classification, investigation steps, and resolution procedures.
## safe_observe()
All instrumentation calls throughout the codebase are wrapped in the safe_observe() context manager. This ensures that observability code never interferes with game logic.
```python
from maid_engine.observability import safe_observe

async def process_tick(self, delta: float) -> None:
    with safe_observe():
        tick_histogram.observe(delta)
        entity_gauge.set(self.world.entity_count)
    # Game logic continues regardless of instrumentation errors
    await self.run_systems(delta)
```
### Behavior

- Catches all exceptions raised by instrumentation code
- Rate-limits error logging to avoid log flooding from repeated failures
- Tracks a meta-counter (`maid_observability_errors_total`) for instrumentation errors
- Never propagates exceptions to the calling game logic
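One way a guard with these properties could be built (a sketch, not MAID's actual implementation; the meta-counter increment is omitted for brevity):

```python
import contextlib
import logging
import time

logger = logging.getLogger("maid.observability")

# Module-level state keeps the sketch short; a real implementation
# would track this on an object owned by the engine.
_last_logged = [0.0]

@contextlib.contextmanager
def safe_observe():
    """Swallow instrumentation errors and rate-limit the resulting logs.

    Illustrative sketch of the behavior listed above.
    """
    try:
        yield
    except Exception:
        now = time.monotonic()
        if now - _last_logged[0] > 60:  # log at most once per minute
            _last_logged[0] = now
            logger.exception("instrumentation error suppressed")

# A failing metric call is absorbed; execution continues afterwards.
with safe_observe():
    raise RuntimeError("broken metric")
```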