Observability Guide

Overview

MAID includes production-grade observability via the maid_engine.observability package. All observability is protocol-based (no singletons), owned by GameEngine, and controllable via configuration. The system is designed to never interfere with game logic — all instrumentation calls are wrapped in safety mechanisms that prevent observability failures from affecting gameplay.

Key features:

  • Structured Logging: JSON and console output via structlog with context propagation
  • Prometheus Metrics: Tick timing, entity counts, command latency, AI cost tracking
  • Health Checks: Kubernetes-compatible /healthz, /readyz, /livez endpoints
  • OpenTelemetry Tracing: Optional distributed tracing with configurable verbosity
  • AI Cost Tracking: Per-model token budgets, cost estimation, and circuit breakers
  • Pre-built Dashboards: Grafana dashboard JSON definitions for common views
  • SLO/SLI Definitions: Tick, command, and AI latency targets with burn-rate alerting
  • Safe Instrumentation: safe_observe() ensures observability never crashes game logic

Configuration

All observability settings use the MAID_OBSERVABILITY__ environment variable prefix.

# Master switch
MAID_OBSERVABILITY__ENABLED=true

# Logging
MAID_OBSERVABILITY__JSON_LOGS=true        # JSON (prod) or console (dev)
MAID_OBSERVABILITY__LOG_LEVEL=INFO
MAID_OBSERVABILITY__LOG_SAMPLING_ENABLED=true

# Metrics
MAID_OBSERVABILITY__METRICS_ENABLED=true
MAID_OBSERVABILITY__INTERNAL_HOST=127.0.0.1
MAID_OBSERVABILITY__INTERNAL_PORT=9090
MAID_OBSERVABILITY__METRICS_TOKEN=        # Required when host != 127.0.0.1

# Tracing (optional)
MAID_OBSERVABILITY__TRACING_MODE=default  # minimal, default, verbose

When MAID_OBSERVABILITY__ENABLED=false, all instrumentation becomes a no-op. Individual subsystems (metrics, tracing) can be toggled independently.
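The `MAID_OBSERVABILITY__` prefix maps nested environment variables onto individual settings. A minimal sketch of that mapping (the function name and type-coercion rules here are illustrative, not the engine's actual settings loader):

```python
def load_observability_config(environ: dict) -> dict:
    """Collect MAID_OBSERVABILITY__* variables into a plain dict.

    Illustrative sketch only: strips the prefix, lowercases the key, and
    coerces booleans and integers. Unrelated variables are ignored.
    """
    prefix = "MAID_OBSERVABILITY__"
    config = {}
    for key, raw in environ.items():
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):].lower()
        if raw.lower() in ("true", "false"):
            config[name] = raw.lower() == "true"
        elif raw.isdigit():
            config[name] = int(raw)
        else:
            config[name] = raw
    return config

cfg = load_observability_config({
    "MAID_OBSERVABILITY__ENABLED": "true",
    "MAID_OBSERVABILITY__INTERNAL_PORT": "9090",
    "MAID_OBSERVABILITY__LOG_LEVEL": "INFO",
    "HOME": "/root",  # ignored: no MAID_OBSERVABILITY__ prefix
})
```

In practice you would pass `os.environ` rather than a literal dict; the sketch takes a dict so the behavior is easy to see in isolation.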

Health Checks

Three health check endpoints are served on the internal port (default 9090):

Endpoint   Purpose        Returns 200 when
/healthz   Basic health   Process is running
/readyz    Readiness      Accepting player connections
/livez     Liveness       Tick loop is active

These are designed for Kubernetes probes or any external monitoring system:

# Example Kubernetes probe configuration
livenessProbe:
  httpGet:
    path: /livez
    port: 9090
  initialDelaySeconds: 10
  periodSeconds: 5

readinessProbe:
  httpGet:
    path: /readyz
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 3
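The table above boils down to a condition-per-endpoint check. A hypothetical sketch of that logic (the `ServerState` fields and `probe_status` helper are illustrative, not the engine's actual handler): 200 when the condition holds, 503 otherwise.

```python
from dataclasses import dataclass

@dataclass
class ServerState:
    process_running: bool
    accepting_connections: bool
    tick_loop_active: bool

def probe_status(endpoint: str, state: ServerState) -> int:
    """HTTP status each probe would report for the given state (sketch)."""
    conditions = {
        "/healthz": state.process_running,
        "/readyz": state.accepting_connections,
        "/livez": state.tick_loop_active,
    }
    return 200 if conditions[endpoint] else 503

# A server that is up and ticking but not yet accepting players
# passes liveness while failing readiness:
booting = ServerState(process_running=True,
                      accepting_connections=False,
                      tick_loop_active=True)
```

This split is why Kubernetes uses both probes: a failing readiness check removes the pod from load balancing, while a failing liveness check restarts it.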

Prometheus Metrics

Metrics are exposed at /metrics on the internal port in Prometheus exposition format.

Core Metrics

Metric                         Type       Description
maid_tick_duration_seconds     Histogram  Tick loop timing
maid_entity_count              Gauge      Current entity count
maid_command_duration_seconds  Histogram  Command execution latency
maid_connections_active        Gauge      Active player connections
maid_save_duration_seconds     Histogram  Entity persistence save timing

AI Metrics

Metric                            Type       Description
maid_ai_request_duration_seconds  Histogram  LLM call latency
maid_ai_tokens_total              Counter    Token usage (labels: provider, model, direction)
maid_ai_cost_dollars_total        Counter    Estimated AI cost in dollars

Content Pack Metrics

Content packs can register custom metrics using the pack metrics API:

from maid_engine.observability.metrics import register_pack_counter, register_pack_histogram

# Register a counter for your content pack
events_counter = register_pack_counter(
    "my_pack", "events_processed", "Events processed", ["event_type"]
)

# Register a histogram
duration_hist = register_pack_histogram(
    "my_pack", "action_duration_seconds", "Action processing time"
)

# Use in your systems
events_counter.labels(event_type="combat").inc()
with duration_hist.time():
    await process_action()

All pack metrics are automatically prefixed with maid_pack_<pack_name>_ to avoid collisions.
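The prefixing convention can be stated as a one-line naming rule. This hypothetical helper only illustrates the documented `maid_pack_<pack_name>_` scheme; the real registration functions apply it internally, so you never build these names yourself:

```python
def pack_metric_name(pack_name: str, metric_name: str) -> str:
    """Full Prometheus name for a pack metric (naming sketch only)."""
    return f"maid_pack_{pack_name}_{metric_name}"

# Two packs registering the same metric name cannot collide:
a = pack_metric_name("my_pack", "events_processed")
b = pack_metric_name("other_pack", "events_processed")
```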

Structured Logging

MAID uses structlog for structured logging with two output modes:

  • JSON mode (MAID_OBSERVABILITY__JSON_LOGS=true): Machine-readable JSON lines for production log aggregators (ELK, Loki, Datadog)
  • Console mode (MAID_OBSERVABILITY__JSON_LOGS=false): Human-readable colored output for development

Context Propagation

MaidContext automatically attaches contextual fields to every log entry:

Field           Description
correlation_id  Unique ID for tracing a request flow
player_id       Current player (anonymized)
session_id      Network session identifier
command         Command being executed
tick            Current game tick number
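Context propagation of this kind is typically built on Python's contextvars, which keep bound fields isolated per async task. A stdlib-only sketch of the idea (the `bind` and `log_event` names are illustrative, not MaidContext's actual API):

```python
import contextvars

# One context variable holding all bound fields; each async task sees
# its own copy, which is how per-request fields stay separate.
_log_context = contextvars.ContextVar("log_context", default={})

def bind(**fields):
    """Merge fields into the current logging context (sketch)."""
    merged = {**_log_context.get(), **fields}
    _log_context.set(merged)

def log_event(event: str) -> dict:
    """Build a log entry with all bound context fields attached (sketch)."""
    return {"event": event, **_log_context.get()}

bind(correlation_id="abc-123", tick=42)
entry = log_event("command_executed")
```

structlog ships equivalent machinery in `structlog.contextvars`, so fields bound once at connection time appear on every subsequent log line without being passed around explicitly.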

Privacy-Aware Redaction

The PlayerIDAnonymizer processor automatically redacts player-identifying information from operational logs. Audit logs retain full identifiers through a separate log channel.

Adaptive Log Sampling

When MAID_OBSERVABILITY__LOG_SAMPLING_ENABLED=true, high-volume operational messages (e.g., per-tick system logs) are adaptively sampled to reduce log volume without losing visibility into errors or unusual events. Errors and warnings are never sampled.
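The invariant is easy to state in code: severity gates sampling. A deterministic sketch of the documented behavior (the real sampler is adaptive; this one keeps a fixed 1-in-N per message key purely for illustration):

```python
class LogSampler:
    """Keep every warning/error; keep 1-in-N of repeated info messages."""

    def __init__(self, every_n: int = 10) -> None:
        self.every_n = every_n
        self.counts: dict = {}

    def should_emit(self, level: str, message_key: str) -> bool:
        if level in ("warning", "error", "critical"):
            return True  # errors and warnings are never sampled
        n = self.counts.get(message_key, 0)
        self.counts[message_key] = n + 1
        return n % self.every_n == 0  # first of every N duplicates passes

sampler = LogSampler(every_n=10)
# 100 identical per-tick info messages collapse to 10 emitted lines:
kept = sum(sampler.should_emit("info", "tick_processed") for _ in range(100))
```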

Audit vs Operational Logs

MAID maintains two separate log channels:

  • Operational logs: Standard application logging (filtered, sampled, redacted)
  • Audit logs: Security-relevant events — logins, permission changes, admin actions (never sampled, full detail)

OpenTelemetry Tracing

Optional distributed tracing is available via OpenTelemetry with three verbosity modes:

Mode     What is traced
minimal  Entry point spans only (connections, startup)
default  Commands and AI calls
verbose  All systems including per-tick processing
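Each mode can be read as a set of span categories that strictly grows with verbosity. A sketch of that gate (the category names here are invented for illustration; the actual span taxonomy may differ):

```python
# Span categories recorded at each verbosity mode, per the table above.
_MODE_CATEGORIES = {
    "minimal": {"entry_point"},
    "default": {"entry_point", "command", "ai_call"},
    "verbose": {"entry_point", "command", "ai_call", "system", "tick"},
}

def should_trace(mode: str, category: str) -> bool:
    """Whether a span of this category is recorded under the given mode (sketch)."""
    return category in _MODE_CATEGORIES.get(mode, set())
```

The point of the gate is that verbose per-tick spans are opt-in: in `default` mode the tick loop produces no spans at all, so tracing overhead stays bounded.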

Configure the OTLP exporter endpoint:

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

Traces integrate with Jaeger, Zipkin, or any OTLP-compatible backend. Spans are automatically correlated with structured log entries via the correlation_id.

AI Cost Tracking

The AICostTracker monitors LLM usage and estimated costs across all AI providers.

Features

  • Per-model pricing: Configurable input/output token prices per model
  • Rolling token budgets: Hourly and daily budget enforcement
  • Runtime price updates: Admin API at /admin/ai/pricing for adjusting prices without restart
  • Circuit breaker integration: Automatic provider fallback on repeated failures

Budget Configuration

# Global daily token budget
MAID_AI_DIALOGUE_DAILY_TOKEN_BUDGET=100000

# Per-player daily budget
MAID_AI_DIALOGUE_PER_PLAYER_DAILY_BUDGET=5000

# Global rate limit
MAID_AI_DIALOGUE_GLOBAL_RATE_LIMIT_RPM=60

# Per-player rate limit
MAID_AI_DIALOGUE_PER_PLAYER_RATE_LIMIT_RPM=10

When a budget is exhausted, AI requests are gracefully rejected with a player-facing message rather than failing silently.
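The budget check itself is simple accounting. A hypothetical sketch mirroring the documented behavior (the helper name and the player-facing string are invented; only the reject-with-message semantics come from the docs):

```python
from typing import Optional, Tuple

def check_budget(tokens_used_today: int, daily_budget: int,
                 requested: int) -> Tuple[bool, Optional[str]]:
    """Decide whether an AI request fits the remaining daily budget (sketch).

    Rejections carry a player-facing message rather than failing silently.
    """
    if tokens_used_today + requested > daily_budget:
        return False, "The spirits are resting; please try again later."
    return True, None

# 99,800 of a 100,000-token budget already spent; a 500-token request
# would overshoot, so it is rejected with a message:
allowed, message = check_budget(tokens_used_today=99_800,
                                daily_budget=100_000,
                                requested=500)
```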

Grafana Dashboards

Pre-built Grafana dashboard JSON definitions are located at observability/dashboards/:

  • Server Health — Tick timing, entity counts, active connections, error rates
  • AI Cost Overview — Token usage by provider/model, cost trends, rate limit hits

Importing Dashboards

  1. Open Grafana → Dashboards → Import
  2. Upload the JSON file from observability/dashboards/
  3. Select your Prometheus data source
  4. Save

Dashboards can also be provisioned automatically via Grafana's dashboard provisioning configuration.

SLO/SLI Definitions

Service Level Objectives are defined in observability/slo/ as Prometheus recording and alerting rules.

SLI              Target
Tick latency     p99 < 100ms
Command latency  p99 < 500ms
AI call latency  p99 < 5s

Alerting uses multi-window burn-rate calculation to minimize false positives while catching sustained degradation early.
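Burn rate measures how fast the error budget is being consumed relative to the rate the SLO allows: for a 99% target the budget is 1%, so a 5% failure ratio burns it 5x too fast. A sketch of the calculation and the multi-window rule:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Error-budget consumption rate; 1.0 means exactly on budget."""
    return error_ratio / (1.0 - slo_target)

def should_alert(short_window_burn: float, long_window_burn: float,
                 threshold: float) -> bool:
    """Multi-window rule: both windows must burn fast before paging.

    The long window proves the problem is sustained; the short window
    proves it is still happening. Requiring both minimizes false
    positives from brief spikes.
    """
    return short_window_burn >= threshold and long_window_burn >= threshold

# 5% of ticks missing the 99% latency SLO burns the budget 5x too fast:
rate = burn_rate(error_ratio=0.05, slo_target=0.99)
```

In the Prometheus rules these ratios come from the recording rules in observability/slo/; the threshold values themselves are alert-specific.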

Runbooks

Operational runbooks are located at observability/runbooks/ and provide step-by-step response procedures for common alerts:

Alert                  Trigger                           Runbook summary
MaidTickLoopStalled    Tick loop not advancing           Check system load, inspect stuck systems
MaidAIBudgetExhausted  AI token budget exhausted         Review usage, adjust budgets or disable
MaidTargetDown         Server unreachable by Prometheus  Check process, network, firewall

Each runbook includes severity classification, investigation steps, and resolution procedures.

safe_observe()

All instrumentation calls throughout the codebase are wrapped in the safe_observe() context manager. This ensures that observability code never interferes with game logic.

from maid_engine.observability import safe_observe

async def process_tick(self, delta: float) -> None:
    with safe_observe():
        tick_histogram.observe(delta)
        entity_gauge.set(self.world.entity_count)

    # Game logic continues regardless of instrumentation errors
    await self.run_systems(delta)

Behavior

  • Catches all exceptions raised by instrumentation code
  • Rate-limits error logging to avoid log flooding from repeated failures
  • Tracks a meta-counter (maid_observability_errors_total) for instrumentation errors
  • Never propagates exceptions to the calling game logic
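The behavior above can be sketched as a small context manager. This is a simplified illustration, not the engine's implementation: the real safe_observe() additionally rate-limits the error log and increments maid_observability_errors_total.

```python
import contextlib
import logging

logger = logging.getLogger("maid.observability")

@contextlib.contextmanager
def safe_observe():
    """Swallow instrumentation errors so game logic is never affected (sketch)."""
    try:
        yield
    except Exception:
        # Catching the exception inside the generator suppresses it
        # at the `with` statement; the caller continues normally.
        logger.exception("instrumentation error suppressed")

results = []
with safe_observe():
    results.append("before")
    raise RuntimeError("broken metrics backend")
results.append("after")  # reached: the exception did not propagate
```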