Production operations
Run ATK in production with observability-driven operations, focusing on DALP lifecycle monitoring, automated troubleshooting workflows, and operational SLA tracking through integrated dashboards.
Operating ATK in production
Production operations center on the observability stack as your operational command center. Every critical system—blockchain nodes, settlement vaults, yield processors, RPC gateways, indexers—exposes metrics, logs, and traces that flow into Grafana dashboards. This integration transforms operations from reactive firefighting into proactive monitoring with predictable SLA tracking.
The operational workflow starts with dashboards showing real-time health, drills into Loki logs when anomalies appear, correlates issues through Tempo traces, and ends with kubectl commands to remediate. This playbook walks through common operational scenarios using this observability-first approach.
Observability as competitive advantage
ATK ships with production-grade observability that most tokenization platforms force you to assemble from multiple vendors. The integrated stack provides comprehensive monitoring out of the box—a critical differentiator for production deployments.
Access Grafana at the configured hostname (default: grafana.k8s.orb.local):
kubectl get ingress -n atk grafana
# Default credentials (change immediately)
Username: settlemint
Password: atk
The stack includes:
VictoriaMetrics stores time-series metrics with 1-month retention
(configurable via
observability.victoria-metrics-single.server.retentionPeriod). All components
expose Prometheus-compatible metrics automatically scraped by Alloy agents.
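If a longer retention window is needed, the value can be overridden at upgrade time; a minimal sketch, assuming the release name atk and the chart path used elsewhere in this guide (VictoriaMetrics interprets a bare retentionPeriod number as months):
# Extend metric retention to 3 months; --reuse-values keeps the rest of the config
helm upgrade atk . \
  --namespace atk \
  --reuse-values \
  --set observability.victoria-metrics-single.server.retentionPeriod=3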
Loki aggregates logs from every pod with 7-day retention (168 hours).
Structured metadata enables advanced queries like
{namespace="atk", app="dapp"} |= "error" | json to filter JSON logs by
severity.
Tempo collects OpenTelemetry traces with 7-day retention, correlating requests across services. Click any trace span to see related logs—critical for debugging distributed transactions through settlement vaults and blockchain confirmations.
Alloy agents run as DaemonSets, scraping metrics from annotated pods, collecting logs, and receiving traces. Configure forwarding to external platforms (Grafana Cloud, Datadog) while keeping local observability operational.
Grafana provides pre-configured datasources and auto-loaded dashboards from ConfigMaps. Every ATK component ships with production-ready dashboards covering critical metrics and SLA indicators.
Production dashboards and SLA targets
Blockchain layer SLAs
Besu Dashboard tracks validator performance with these SLA targets:
- Block production rate: >1 block per 15 seconds (target: <10s average)
- Peer connectivity: Matches validator count (100% uptime)
- Transaction throughput: >100 tx/s sustained (varies by network size)
- Chain reorganizations: <1 per day (indicates stability)
Monitor block production in the Besu dashboard. If block times exceed 15 seconds
for 10 minutes, the SlowBlockProduction alert fires. Check validator
connectivity first—network partitions manifest as peer count drops before block
delays.
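For a quick command-line spot check of peer connectivity, the JSON-RPC endpoint on a validator can be queried directly; a sketch, assuming the pod name besu-validator-0-0 and the default RPC port 8545:
# Query net_peerCount; the hex result should equal validator count - 1 (e.g., 0x3 for 4 validators)
kubectl exec -n atk besu-validator-0-0 -- \
  curl -s -X POST http://localhost:8545 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":1}'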
Portal Metrics dashboard shows RPC gateway authentication and rate limiting. Target SLAs:
- Authentication success rate: >99.9%
- Rate limit rejections: <0.1% of requests
- Gateway availability: 99.95% uptime
Settlement and vault operations
DALP lifecycle monitoring tracks vault operations and settlement flows—key differentiators for production asset tokenization:
Vault operations dashboard monitors:
- Deposit confirmations: <30s from blockchain finality to subgraph indexing
- Withdrawal processing: <2 minutes end-to-end (includes compliance checks)
- Vault balance reconciliation: Real-time vs. blockchain state divergence <0.01%
Query vault deposit latency:
{namespace="atk"} |= "VaultDeposit" | json | latency_ms > 30000This identifies deposits taking longer than 30 seconds, indicating either blockchain congestion or indexing lag.
Settlement monitoring tracks DvP (Delivery vs Payment) execution:
- DvP success rate: >99.5% (excluding legitimate rejections)
- Settlement latency: <5s from initiation to on-chain confirmation
- Failed settlement reasons: Categorized by compliance rejection, insufficient balance, timeout
Yield processing operations
Yield distribution dashboard monitors dividend and interest payments:
- Distribution execution time: <10 minutes for 10,000 token holders
- Failed distributions: <0.5% (due to compliance or gas limits)
- Gas cost per distribution: <200,000 gas per token holder (optimized batch processing)
Check yield distribution progress:
{namespace="atk", app="yield-processor"} |= "DistributionBatch" | json | status="processing"Monitor in-progress distributions. Alert if any batch exceeds 15 minutes—indicates blockchain congestion or gas price spikes requiring manual intervention.
Indexing layer SLAs
TheGraph Indexing Status dashboard tracks subgraph synchronization:
- Sync lag: <10 blocks behind chain head (target: <5 blocks)
- Indexing errors: Zero per hour (any error requires investigation)
- Entity processing time: <500ms per block
- Query performance: <100ms p50, <500ms p99
TheGraph Query Performance shows GraphQL operation metrics:
- Cache hit rate: >60% after 1 hour (improves over time)
- Query complexity: Monitor for inefficient queries exceeding cost limits
- Response times: <100ms for cached, <500ms for uncached queries
RPC gateway performance
ERPC Dashboard provides comprehensive gateway monitoring:
- Request routing: Healthy upstream selection >99.9%
- Cache effectiveness: >40% hit rate for read operations
- Rate limiting: <1% of legitimate requests rejected
- Error rate: <0.1% for production readiness
- Latency percentiles: p50 <50ms, p95 <200ms, p99 <500ms
Infrastructure health
Kubernetes Overview dashboard tracks cluster resources:
- Node capacity: CPU <80%, memory <85% sustained
- Pod health: >99.9% running state
- PVC utilization: <85% (alert at 80% for capacity planning)
NGINX Ingress metrics:
- Request rate: Baseline established per deployment
- Error rate: <0.5% (4xx + 5xx responses)
- Response time: p95 <1s for all endpoints
Security hardening
Production deployments require security measures beyond default configurations. Start with secrets management, enforce TLS, restrict network traffic, and apply pod security standards.
Secrets management strategy
Replace inline credentials with Kubernetes Secret references. Create secrets separately from Helm releases to enable rotation without redeployment:
kubectl create secret generic atk-postgres-secret \
--from-literal=password='GENERATED_SECURE_PASSWORD' \
-n atk
kubectl create secret generic atk-auth-secret \
--from-literal=jwt-secret='GENERATED_JWT_SECRET' \
--from-literal=session-secret='GENERATED_SESSION_SECRET' \
-n atk
Reference secrets in values-production.yaml:
global:
datastores:
default:
postgresql:
existingSecret: "atk-postgres-secret"
existingSecretKeys:
password: password
dapp:
env:
AUTH_SECRET:
valueFrom:
secretKeyRef:
name: atk-auth-secret
key: jwt-secret
External secret management integrates with HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault using the External Secrets Operator. Configure automatic rotation by updating the secret backend; pods restart automatically on secret changes when configured with annotations.
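One common pattern is the Stakater Reloader annotation, which triggers a rolling restart whenever a referenced Secret or ConfigMap changes; a sketch, assuming Reloader is installed in the cluster and the deployment is named atk-dapp:
# Opt the DApp deployment into automatic restarts on Secret/ConfigMap changes
kubectl annotate deployment atk-dapp -n atk \
  reloader.stakater.com/auto="true" --overwrite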
TLS termination and certificate management
Configure ingress resources with TLS certificates. Use cert-manager for automatic certificate provisioning and renewal from Let's Encrypt or enterprise CAs:
dapp:
ingress:
enabled: true
ingressClassName: "nginx"
tls:
- secretName: dapp-tls
hosts:
- dapp.example.com
hosts:
- host: dapp.example.com
paths:
- path: /
pathType: Prefix
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
Verify certificate expiration in Grafana: the NGINX Ingress dashboard includes SSL certificate expiration warnings 30 days before renewal.
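Certificate issuance and renewal can also be verified from the command line; a sketch, assuming the TLS secret name dapp-tls from the ingress configuration above:
# Check Certificate readiness and renewal status managed by cert-manager
kubectl get certificate -n atk
kubectl describe certificate dapp-tls -n atk

# Inspect the expiry date embedded in the TLS secret
kubectl get secret dapp-tls -n atk -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -enddate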
Network isolation policies
Enable NetworkPolicy resources to implement zero-trust networking. Restrict pod-to-pod traffic to only required communication paths:
erpc:
networkPolicy:
enabled: true
ingress:
- from:
- podSelector:
matchLabels:
app: dapp
ports:
- protocol: TCP
port: 8545
graph-node:
networkPolicy:
enabled: true
ingress:
- from:
- podSelector:
matchLabels:
app: hasura
- podSelector:
matchLabels:
app: dapp
ports:
- protocol: TCP
port: 8000
Verify network policies block unauthorized traffic by testing from a pod without matching labels. Network policy violations appear in Kubernetes audit logs queryable in Loki.
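A minimal sketch of such a test, assuming the ERPC service is reachable as erpc on port 8545 inside the namespace; the connection should time out when the policy is enforced:
# Launch a throwaway pod without the labels allowed by the policy and attempt to connect
kubectl run netpol-test -n atk --rm -it --restart=Never \
  --image=busybox:1.36 -- \
  nc -zv -w 3 erpc 8545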
Pod security standards compliance
Apply restricted Pod Security Standards to prevent privilege escalation and enforce least-privilege principles:
dapp:
podSecurityContext:
runAsNonRoot: true
runAsUser: 1001
fsGroup: 1001
seccompProfile:
type: RuntimeDefault
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumeMounts:
- name: tmp
mountPath: /tmp
- name: cache
mountPath: /app/.cache
volumes:
- name: tmp
emptyDir: {}
- name: cache
emptyDir: {}
Read-only root filesystems require explicit volume mounts for writable directories. The DApp needs /tmp and cache directories; other applications may require different paths.
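To verify the hardened settings took effect, attempt writes inside a running pod; a sketch, assuming the deployment is named atk-dapp:
# Writes outside the mounted emptyDir volumes should fail with "Read-only file system"
kubectl exec -n atk deployment/atk-dapp -- sh -c 'touch /should-fail' || true

# Writes to the explicitly mounted writable paths should succeed
kubectl exec -n atk deployment/atk-dapp -- sh -c 'touch /tmp/ok && echo "/tmp is writable"'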
High availability architecture
HA configurations distribute workload across multiple nodes and replicas. Start with critical path components: blockchain validators, RPC gateways, DApp instances, and databases. Then add pod disruption budgets and anti-affinity rules.
Replica configuration strategy
Scale validator nodes for blockchain consensus (minimum 4 for IBFT 2.0), RPC nodes for read capacity, and application instances for request throughput:
network:
network-nodes:
validatorReplicaCount: 4
rpcReplicaCount: 3
dapp:
replicaCount: 3
hasura:
replicas: 2
erpc:
replicaCount: 3
Validator count determines Byzantine fault tolerance. IBFT 2.0 tolerates (n-1)/3 Byzantine nodes, so 4 validators tolerate 1 failure and 7 validators tolerate 2 failures.
RPC node scaling provides read capacity and redundancy. Each RPC node maintains full blockchain state; ERPC gateway load-balances requests across healthy nodes.
Application scaling increases throughput. Stateless DApp instances handle concurrent users; horizontal scaling adds capacity linearly up to database connection limits.
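Where load fluctuates, a HorizontalPodAutoscaler can complement the static replica counts; a sketch, assuming the metrics server is installed and the deployment is named atk-dapp:
# Scale the DApp between 3 and 10 replicas, targeting 70% average CPU utilization
kubectl autoscale deployment atk-dapp -n atk \
  --min=3 --max=10 --cpu-percent=70

# Watch autoscaler decisions
kubectl get hpa -n atk -w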
Pod disruption budgets
Most subcharts include PodDisruptionBudget resources preventing simultaneous pod termination during cluster maintenance:
dapp:
podDisruptionBudget:
enabled: true
minAvailable: 1
hasura:
podDisruptionBudget:
minAvailable: 1
minAvailable: 1 ensures at least one replica stays running during node drains or upgrades. Adjust for higher availability requirements: minAvailable: 2 with replicaCount: 3 guarantees two instances during maintenance.
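Before scheduling maintenance, confirm the budgets exist and report allowed disruptions; a sketch, assuming the PDB follows the release naming (check the actual names with the first command):
# ALLOWED DISRUPTIONS of 0 means a node drain would currently be blocked
kubectl get pdb -n atk
kubectl describe pdb -n atk atk-dapp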
Node distribution with anti-affinity
Spread replicas across nodes to survive node failures. Pod anti-affinity rules influence scheduler placement:
dapp:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app.kubernetes.io/name: dapp
topologyKey: kubernetes.io/hostname
preferredDuringSchedulingIgnoredDuringExecution allows scheduling even if anti-affinity cannot be satisfied (useful in small clusters). Change to requiredDuringSchedulingIgnoredDuringExecution for hard requirements in large production clusters.
Monitor replica distribution in the Kubernetes Overview dashboard. Pods concentrated on single nodes indicate scheduler constraints or insufficient cluster capacity.
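The spread can also be checked directly; a sketch, assuming the standard Helm app.kubernetes.io/name label:
# List which node each DApp replica landed on; replicas sharing a node defeat anti-affinity
kubectl get pods -n atk -l app.kubernetes.io/name=dapp \
  -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName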
Database replication
PostgreSQL replication can be handled by an external operator or by the Bitnami PostgreSQL subchart, which supports a replication architecture:
support:
postgresql:
architecture: replication
replicaCount: 3
replication:
synchronousCommit: "on"
numSynchronousReplicas: 1
Synchronous replication ensures zero data loss at a performance cost: each write waits for acknowledgment from one replica before returning. Asynchronous replication improves performance but risks data loss during primary failures.
Monitor replication lag in PostgreSQL dashboard. Lag exceeding 100MB or 10 seconds indicates network issues or insufficient replica resources.
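Replication lag can also be measured on the primary itself; a sketch, assuming the pod name postgresql-0 and that the postgres user can authenticate from inside the pod:
# Byte lag per replica as seen from the primary
kubectl exec -n atk postgresql-0 -- psql -U postgres -c \
  "SELECT application_name, state,
          pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
   FROM pg_stat_replication;"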
Backup and disaster recovery
Production operations require backup strategies for blockchain data, application databases, and configuration state. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) determine backup frequency and retention.
Database backup automation
Use Velero for Kubernetes-native backups or PostgreSQL-specific backup operators. Velero provides crash-consistent volume snapshots with application-aware hooks:
# Install Velero with cloud provider plugin
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.9.0 \
--bucket atk-backups \
--backup-location-config region=us-east-1
# Create scheduled backup
velero schedule create atk-daily \
--schedule="0 2 * * *" \
--include-namespaces atk \
--storage-location default \
--snapshot-volumes
Velero backups include PVCs, ConfigMaps, Secrets, and resource definitions. Restoring recreates the namespace with identical state:
velero restore create atk-restore-20250129 \
--from-backup atk-daily-20250129020000
PostgreSQL logical backups capture consistent database state:
kubectl exec -n atk postgresql-0 -- \
pg_dump -U postgres -d atk \
| gzip > atk-db-backup-$(date +%Y%m%d).sql.gz
Automate this using CronJobs that upload to S3-compatible storage. Retain daily backups for 7 days, weekly for 4 weeks, and monthly for 12 months, based on compliance requirements.
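A minimal sketch of such a CronJob, assuming a PostgreSQL service named postgresql and that credentials are injected from the atk-postgres-secret created earlier (the object storage upload would be added in the same container):
# Nightly logical backup at 01:00; image tag and connection details are assumptions
# (a full manifest would mount PGPASSWORD from atk-postgres-secret)
kubectl create cronjob atk-db-backup -n atk \
  --image=postgres:16 \
  --schedule="0 1 * * *" \
  -- /bin/sh -c 'pg_dump -h postgresql -U postgres -d atk | gzip > /tmp/atk-$(date +%Y%m%d).sql.gz'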
Blockchain data snapshots
Besu validator and RPC node data resides in PersistentVolumes. Volume snapshots provide point-in-time recovery:
# List blockchain PVCs
kubectl get pvc -n atk | grep besu
# Create snapshot (requires CSI driver support)
kubectl create -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: besu-validator-0-snapshot
namespace: atk
spec:
volumeSnapshotClassName: csi-snapshots
source:
persistentVolumeClaimName: besu-validator-0
EOF
Snapshot creation time depends on volume size (typically 1-5 minutes for 100GB). Snapshots consume cloud storage incrementally based on changed blocks.
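Confirm the snapshot reached a ready state before relying on it:
# READYTOUSE should report true once the CSI driver finishes the snapshot
kubectl get volumesnapshot -n atk besu-validator-0-snapshot
kubectl describe volumesnapshot -n atk besu-validator-0-snapshot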
Restore from snapshots by creating new PVCs referencing snapshot as data source:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: besu-validator-0-restored
namespace: atk
spec:
dataSource:
name: besu-validator-0-snapshot
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
Configuration backup and version control
Store Helm values files in version-controlled repositories. Extract current deployed configuration:
helm get values atk -n atk > atk-values-backup-$(date +%Y%m%d).yaml
Commit to Git alongside infrastructure-as-code definitions. Tag releases matching production deployments for traceable rollback:
git tag -a prod-v1.2.3 -m "Production deployment 2025-01-29"
git push origin prod-v1.2.3
Disaster recovery procedure:
- Provision new Kubernetes cluster
- Install ATK using backed-up Helm values
- Restore database from latest backup
- Restore blockchain PVCs from snapshots (or fast-sync from network)
- Verify observability dashboards show healthy state
RTO target: <4 hours for full cluster recovery. RPO target: <1 hour data loss (based on hourly database backups and blockchain network fast-sync capability).
Operational scenarios
Real production operations combine dashboard monitoring with log analysis and remediation commands. These scenarios walk through common issues using observability-first workflows.
Scenario: settlement transaction delays
Symptom: Users report vault withdrawals taking longer than expected. DApp shows "pending" status beyond normal 2-minute SLA.
Investigation workflow:
- Open Besu dashboard → Check block production rate. If blocks produce normally (<10s average), the blockchain is healthy.
- Open DALP vault operations dashboard → Check withdrawal processing time percentiles. If p95 >120s, the issue is in withdrawal logic, not the blockchain.
- Query Loki for withdrawal transactions:
{namespace="atk", app="dapp"} |= "WithdrawalRequest" | json | status="pending" | line_format "{{.userId}} {{.vaultAddress}} {{.amount}} {{.duration_ms}}"
This shows pending withdrawals with user IDs and processing time. If multiple transactions from the same vault show high duration, the vault contract may have high gas costs or compliance rule delays.
- Open Tempo traces → Search for trace IDs from slow withdrawal logs. Trace spans reveal time spent in each phase: compliance validation, vault unlock, blockchain submission, confirmation waiting.
- Identify the bottleneck from the trace breakdown:
  - Compliance validation >30s: Compliance rules executing expensive off-chain checks. Review rule complexity; consider caching verification results.
  - Blockchain submission >60s: Transaction pool congestion or low gas price. Check transaction pool size in the Besu dashboard; increase gas price if needed.
  - Confirmation waiting >45s: Normal for blockchain finality. Communicate expected wait time in the UI; no remediation needed.
Remediation: If compliance validation is the bottleneck, review identity registry queries in application logs:
{namespace="atk", app="dapp"} |= "ComplianceCheck" | json | validation_time_ms > 10000Optimize by adding database indexes on identity lookup fields or implementing compliance rule caching with 5-minute TTL.
Verification: After optimization, monitor withdrawal latency in vault operations dashboard. p95 should drop below 120s within 30 minutes as cache warms up.
Scenario: subgraph indexing lag
Symptom: TheGraph Indexing Status dashboard shows sync lag increasing beyond 10 blocks. Queries return stale data; recent transactions don't appear in DApp.
Investigation workflow:
- TheGraph Indexing Status dashboard → Check the "Blocks Behind Head" metric. If lag grows continuously, the subgraph cannot keep pace with block production.
- Check the indexing error rate. Any errors require immediate investigation:
{namespace="atk", app="graph-node"} |= "error" | json | line_format "{{.subgraphName}} {{.blockNumber}} {{.errorMessage}}"
Common errors:
- "Entity not found": Schema mismatch between subgraph and blockchain events. Verify deployed contract addresses match subgraph configuration.
- "Transaction reverted": Subgraph attempting to index failed transactions. Update subgraph handlers to check transaction receipt status.
- "Database timeout": PostgreSQL connection saturation. Check database connections in the PostgreSQL dashboard.
- If no errors appear, check entity processing time. High processing time indicates complex event handlers or database query inefficiency:
{namespace="atk", app="graph-node"} |= "BlockProcessed" | json | processing_time_ms > 500
- Review Tempo traces for subgraph handler execution. Slow database queries appear as long-duration spans. Common issues:
- Multiple database roundtrips per event (N+1 queries)
- Missing indexes on frequently-queried fields
- Complex entity relationships requiring optimization
Remediation: For slow database queries, add indexes to subgraph schema:
type Transfer @entity {
id: ID!
from: Bytes! @index
to: Bytes! @index
value: BigInt!
blockNumber: BigInt! @index
timestamp: BigInt! @index
}
Redeploy the subgraph after schema changes. Indexing resumes from the last successfully processed block.
For persistent lag without errors, scale graph-node replicas to handle parallel indexing:
graph-node:
replicaCount: 2
Verification: Monitor the "Blocks Behind Head" metric. Lag should decrease to <5 blocks within 15 minutes after scaling or optimization. Query performance improves as cache effectiveness increases.
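Sync progress can also be confirmed from the command line via Graph Node's indexing status API; a sketch, assuming the service is named graph-node and exposes the default index-node port 8030:
# Port-forward the index-node status port
kubectl port-forward -n atk svc/graph-node 8030:8030 &

# Compare latestBlock to chainHeadBlock to measure sync lag directly
curl -s http://localhost:8030/graphql \
  -H 'Content-Type: application/json' \
  -d '{"query":"{ indexingStatuses { subgraph synced health chains { chainHeadBlock { number } latestBlock { number } } } }"}'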
Scenario: high API error rate
Symptom: NGINX Ingress dashboard shows 5xx error rate spike above 1%. Users report intermittent "server error" messages in DApp.
Investigation workflow:
- NGINX Ingress dashboard → Check error breakdown by status code and upstream. If errors concentrate on a specific backend (e.g., Hasura), the issue is in that service, not the ingress.
- Query Loki for error responses:
{namespace="atk", app="hasura"} |= "error" | json | statusCode >= 500
Common 5xx errors:
- 503 Service Unavailable: Upstream connection refused. Pod may be crashing or not ready.
- 502 Bad Gateway: Upstream closed connection. Timeout or OOM kill likely.
- 500 Internal Server Error: Application logic error. Check error details in logs.
- Check pod health in the Kubernetes Overview dashboard:
  - Restart count >5: Pod crashlooping. Check logs for panic or fatal errors.
  - Memory usage >90%: Approaching OOM. Increase memory limits.
  - CPU throttling: Pod receiving insufficient CPU. Increase requests or limits.
- For 503 errors, verify pod readiness:
kubectl get pods -n atk -l app=hasura -o wide
Pods in Running state but failing readiness checks don't receive traffic. Readiness failures appear in describe output:
kubectl describe pod -n atk hasura-PODNAME
- Open Tempo traces for failed requests. The trace reveals where the error originated:
- Error in Hasura GraphQL execution: Database connection exhausted or query timeout
- Error in DApp backend: Application logic exception or external API failure
- Error in authentication middleware: JWT validation failure or session expiration
Remediation: For database connection exhaustion, check PostgreSQL dashboard connection pool metrics:
hasura:
config:
HASURA_GRAPHQL_MAX_CONNECTIONS: 50
HASURA_GRAPHQL_POOL_TIMEOUT: 180
Increase MAX_CONNECTIONS to 100 if pool utilization exceeds 80% sustained.
Scale PostgreSQL replicas if connection limit cannot be increased further.
For memory-related OOM kills, increase pod memory limits:
hasura:
resources:
limits:
memory: 2Gi
requests:
memory: 1Gi
Verification: Monitor the error rate in the NGINX Ingress dashboard. Errors should drop below 0.5% within 5 minutes after remediation. Check pod metrics in the Kubernetes dashboard to verify memory usage stays below 85%.
Scenario: validator node connectivity loss
Symptom: Besu dashboard shows peer count drop on one validator. Block production continues but may become unstable if additional validators fail.
Investigation workflow:
- Besu dashboard → Identify the affected validator by node name. Check the peer connectivity graph showing connections between validators.
- Query Loki for network errors on the affected pod:
{namespace="atk", pod=~"besu-validator-2.*"} |= "peer" |= "disconnect"
Peer disconnections indicate:
- Network partition between validators
- Firewall rules blocking P2P ports (30303 TCP/UDP)
- Pod network policy too restrictive
- Validator crashed and restarted (peers reconnect slowly)
- Check pod events for network-related issues:
kubectl describe pod -n atk besu-validator-2-0
Look for:
- NetworkNotReady: CNI plugin failure
- FailedMount: Volume mount issues preventing startup
- Unhealthy: Liveness probe failures indicating a crash
- Test connectivity between validators:
kubectl exec -n atk besu-validator-0-0 -- \
  nc -zv besu-validator-2-0.besu-validator.atk.svc.cluster.local 30303
Connection refused indicates the pod is not listening. Connection timeout indicates a network policy or firewall blocking traffic.
- Review network policies:
kubectl get networkpolicy -n atk
kubectl describe networkpolicy -n atk besu-network-policy
Remediation: For network policy issues, verify P2P port ingress is allowed:
besu:
networkPolicy:
enabled: true
ingress:
- from:
- podSelector:
matchLabels:
app: besu
ports:
- protocol: TCP
port: 30303
- protocol: UDP
port: 30303
For crashed pods, check logs for errors before restarting:
kubectl logs -n atk besu-validator-2-0 --previous
Common crash causes:
- OutOfMemory: Increase memory limits from 4Gi to 8Gi
- Corrupted database: Restore from PVC snapshot or reinitialize
- Genesis mismatch: Regenerate genesis.json and restart all validators
Restart affected validator after remediation:
kubectl rollout restart statefulset/besu-validator -n atk
Verification: Monitor peer count in the Besu dashboard. All validators should show full connectivity (peer count = validator count - 1) within 5 minutes. Block production rate returns to normal.
CI/CD integration
Automated deployment pipelines ensure consistent, repeatable releases. The included GitHub Actions workflows provide foundation for extending to production deployment.
QA workflow structure
The .github/workflows/qa.yml workflow executes a comprehensive testing pipeline:
Artifact generation phase compiles smart contracts, generates ABIs, creates TypeScript bindings, and produces genesis allocation:
- name: Generate artifacts
run: bun run artifacts
- name: Upload contract ABIs
uses: actions/upload-artifact@v4
with:
name: contract-artifacts
path: kit/contracts/artifacts/
Artifacts become Docker image layers and Helm chart ConfigMaps for deployment.
Backend services startup launches Docker Compose stack with PostgreSQL, Redis, Besu node, and supporting services. Tests execute against this local environment mimicking production:
- name: Start development environment
run: bun dev:up
timeout-minutes: 5
Test execution runs parallel test suites across packages:
- name: Run test suite
run: bunx turbo run ci:gha --concurrency=3
This includes unit tests (Vitest, Foundry), integration tests (contract interactions), E2E tests (Playwright), linting, and type checking.
Chart validation (conditional on Helm chart changes) uses ct
(chart-testing) to lint and validate against Kubernetes API:
- name: Lint and install charts
run: |
ct lint --config ct.yaml
ct install --config ct.yaml
Image builds create container images tagged with the commit SHA and push them to GitHub Container Registry:
- name: Build and push DApp image
uses: docker/build-push-action@v5
with:
context: .
file: Dockerfile.dapp
push: true
tags: ghcr.io/${{ github.repository }}/dapp:${{ github.sha }}
Production deployment workflow
Extend QA workflow for production deployment triggered by git tags:
name: Deploy to production
on:
push:
tags:
- "v*"
jobs:
deploy:
runs-on: ubuntu-latest
environment: production
steps:
- uses: actions/checkout@v4
- name: Configure kubectl
run: |
echo "${{ secrets.KUBE_CONFIG }}" | base64 -d > kubeconfig
export KUBECONFIG=kubeconfig
kubectl cluster-info
- name: Deploy ATK
working-directory: kit/charts/atk
run: |
helm upgrade atk . \
--install \
--namespace atk \
--create-namespace \
--values values-production.yaml \
--set dapp.image.tag=${{ github.ref_name }} \
--set contracts.image.tag=${{ github.ref_name }} \
--timeout 20m \
--wait \
--atomic
The --atomic flag rolls back automatically if the deployment fails. --wait waits for all resources to reach a ready state before the deployment is considered successful.
Post-deployment verification checks observability dashboards programmatically:
- name: Verify deployment health
run: |
# Wait for pods to be ready
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/instance=atk \
-n atk \
--timeout=600s
# Check Besu validator connectivity
PEER_COUNT=$(kubectl exec -n atk besu-validator-0-0 -- \
curl -s -X POST http://localhost:8545 \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":1}' \
| jq -r '.result' | xargs printf "%d")
if [ "$PEER_COUNT" -lt 3 ]; then
echo "Insufficient peers: $PEER_COUNT"
exit 1
fi
GitOps deployment with ArgoCD
Declarative deployments using ArgoCD provide automatic synchronization and drift detection:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: atk-production
namespace: argocd
spec:
project: tokenization
source:
repoURL: https://github.com/your-org/asset-tokenization-kit
targetRevision: main
path: kit/charts/atk
helm:
valueFiles:
- values-production.yaml
parameters:
- name: dapp.image.tag
value: v1.2.3
destination:
server: https://kubernetes.default.svc
namespace: atk
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
ArgoCD monitors the Git repository and automatically applies changes. Drift detection alerts when manual kubectl changes deviate from Git state. Self-healing automatically reverts unauthorized changes.
Progressive delivery with ArgoCD Rollouts enables canary deployments:
dapp:
rollout:
enabled: true
strategy:
canary:
steps:
- setWeight: 20
- pause: { duration: 5m }
- setWeight: 50
- pause: { duration: 5m }
- setWeight: 80
- pause: { duration: 5m }
Canary deployment gradually shifts traffic to the new version. The pause between steps allows observability dashboard monitoring for increased error rates or latency before proceeding.
Troubleshooting with observability
Integrated observability accelerates troubleshooting by correlating metrics, logs, and traces. Start with high-level dashboards to identify symptoms, drill into logs for details, follow traces through distributed operations, then execute remediation commands.
General troubleshooting workflow
Pod crashes and restart loops
Using observability: Open Kubernetes Overview dashboard → Find pods with restart count >3. Click pod name to filter metrics showing resource usage before crash.
Query Loki for crash logs:
{namespace="atk"} |= "panic" or "fatal" or "OOMKilled"Filter by specific application:
{namespace="atk", pod=~"dapp-.*"} |= "error" | json | level="fatal"Check memory trends before OOM kills in Node Exporter dashboard. Memory usage climbing to pod limit indicates undersized resources.
Using kubectl: Check previous pod logs (before crash):
kubectl logs -n atk <pod-name> --previous --tail=100
Describe the pod for the termination reason:
kubectl describe pod -n atk <pod-name>
Look for:
- OOMKilled: Memory limit exceeded → Increase resources.limits.memory
- Error: Nonzero exit code → Check application logs for the panic
- CrashLoopBackOff: Pod repeatedly failing startup → Fix configuration or application code
Common remediation:
Database connection errors:
{namespace="atk", app="dapp"} |= "database" |= "connection refused"PostgreSQL dashboard shows connection pool exhausted. Increase application connection pool size or scale database replicas.
RPC connection failures:
{namespace="atk", app="dapp"} |= "RPC" |= "timeout"ERPC dashboard displays all upstreams unhealthy. Check Besu validator status; if validators are healthy, restart ERPC gateway to refresh upstream health checks.
Ingress connectivity issues
Using observability: Open NGINX Ingress dashboard → Check request rate and status code distribution. High 502/503 rate indicates backend unavailability; 404 rate suggests misconfigured routes.
Query ingress controller logs:
{namespace="atk", app="ingress-nginx"} |= "error" | json | statusCode >= 500SSL/TLS errors:
{namespace="atk", app="ingress-nginx"} |= "TLS" |= "handshake"Check certificate expiration in dashboard SSL certificate panel. Expired certificates require renewal via cert-manager or manual secret update.
Using kubectl: Verify ingress controller pods running:
kubectl get pods -n atk -l app.kubernetes.io/name=ingress-nginx
Check ingress resources:
kubectl get ingress -n atk
kubectl describe ingress -n atk dapp-ingress
Ensure the ADDRESS field is populated with the LoadBalancer IP. A missing address indicates a LoadBalancer provisioning failure or cloud provider issue.
Test DNS resolution:
nslookup dapp.example.com
Verify the DNS record points to the ingress LoadBalancer IP. DNS propagation takes 1-60 minutes after changes.
Database migration failures
The DApp includes a post-install Helm hook job that executes Drizzle migrations. Hook failures prevent the deployment from succeeding.
Check migration job logs:
kubectl logs -n atk job/atk-dapp-migrations
Common failures:
- Schema already exists: Migration previously executed. Delete the job and retry:
kubectl delete job -n atk atk-dapp-migrations
- Connection refused: Database not ready. Increase the helm.sh/hook-wait-timeout annotation or add an init container polling database readiness.
- Permission denied: Database user lacks schema modification rights. Grant superuser or schema owner privileges.
Manually execute migrations:
kubectl exec -n atk deployment/atk-dapp -- bun run db:migrate
Genesis file regeneration
Modifying smart contract deployments requires regenerating genesis allocations and restarting validators with new genesis.
Regenerate artifacts locally:
bun run artifacts
Update the genesis ConfigMap:
kubectl create configmap atk-genesis \
--from-file=genesis.json=kit/contracts/genesis.json \
--dry-run=client -o yaml \
| kubectl apply -n atk -f -
Restart validators to load the new genesis:
kubectl rollout restart statefulset/besu-validator -n atk
kubectl rollout restart statefulset/besu-rpc -n atk
Monitor the restart in the Besu dashboard. Block production resumes within 2 minutes after a majority of validators restart.
WARNING: Genesis changes on a running network cause a chain fork. Only regenerate genesis for development/test environments or when performing a coordinated network upgrade with all participants.
Helm release recovery
Failed Helm upgrades leave releases in pending or failed state. Recover using rollback or force upgrade.
Check release status:
helm status atk -n atk
helm history atk -n atk
Roll back to the last successful revision:
helm rollback atk -n atk
Specify a revision number for a targeted rollback:
helm rollback atk 3 -n atk
Force upgrade (use cautiously; it may cause resource conflicts):
helm upgrade atk . \
--namespace atk \
--values values-production.yaml \
--force \
--cleanup-on-fail
--force deletes and recreates resources with changes. --cleanup-on-fail removes resources if the upgrade fails, preventing a stuck state.
For severely corrupted releases, uninstall and reinstall (destroys data):
helm uninstall atk -n atk
helm install atk . -n atk -f values-production.yaml
Restore data from backups after reinstalling.
Alerting configuration
Proactive monitoring requires alerting on SLA violations and anomalous conditions. Configure Grafana alert rules with notification channels for critical events.
Alert rule structure
Navigate to Grafana → Alerting → Alert rules → New alert rule. Define query, threshold, evaluation interval, and notification policy.
Example alert for slow block production:
groups:
- name: blockchain_health
interval: 1m
rules:
- alert: SlowBlockProduction
expr: rate(besu_blockchain_height[5m]) < 0.067
for: 10m
labels:
severity: warning
component: blockchain
annotations:
summary: "Block production slower than target"
description:
"Block production rate {{ $value | humanize }} blocks/s is below
target 0.067 blocks/s (15s block time) for 10 minutes on {{
$labels.instance }}"Alert fires when 5-minute block production rate drops below 1 block per 15 seconds sustained for 10 minutes.
Critical SLA alerts
Settlement latency:
- alert: HighSettlementLatency
expr: histogram_quantile(0.95, rate(settlement_duration_seconds_bucket[5m])) > 5
for: 5m
labels:
severity: critical
annotations:
summary: "Settlement latency exceeds SLA"
description: "95th percentile settlement time {{ $value }}s exceeds 5s SLA"Subgraph sync lag:
- alert: SubgraphSyncLag
expr: (ethereum_chain_head_number - ethereum_block_number) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "Subgraph falling behind chain head"
description: "Subgraph {{ $labels.subgraph }} is {{ $value }} blocks behind chain head"Database connection exhaustion:
- alert: DatabaseConnectionPoolSaturated
expr: (sum(pg_stat_activity_count) by (instance) /
sum(pg_settings_max_connections) by (instance)) > 0.9
for: 3m
labels:
severity: critical
annotations:
summary: "Database connection pool >90% utilized"
description: "PostgreSQL instance {{ $labels.instance }} has {{ $value |
humanizePercentage }} connection pool utilization"
Disk space critical:
- alert: PVCDiskSpaceCritical
expr:
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) >
0.85
for: 5m
labels:
severity: warning
annotations:
summary: "PVC {{ $labels.persistentvolumeclaim }} >85% full"
description: "Disk usage {{ $value | humanizePercentage }} on PVC {{
$labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }}"
Notification channels
Configure contact points for alert delivery:
Slack integration:
- Navigate to Grafana → Alerting → Contact points → New contact point
- Select "Slack" type
- Enter webhook URL from Slack App configuration
- Test contact point to verify delivery
PagerDuty integration:
- Create PagerDuty integration key for ATK service
- Add contact point with "PagerDuty" type
- Enter integration key
- Configure routing by severity (critical → page on-call, warning → notification)
Email notification:
- Configure SMTP settings in Grafana Helm values:
grafana:
smtp:
enabled: true
host: smtp.example.com:587
user: [email protected]
password: "${SMTP_PASSWORD}"
from_address: [email protected]
- Add email contact point with recipient addresses
Notification policies
Route alerts to appropriate channels based on severity and component:
# Grafana notification policy configuration
policies:
- match:
severity: critical
continue: true
contact_point: pagerduty
- match:
severity: warning
component: blockchain
contact_point: slack-ops
- match:
severity: warning
contact_point: email-team
Critical alerts page the on-call engineer via PagerDuty. Warning-level blockchain alerts notify the operations Slack channel. Other warnings send email to the engineering team.
Incident response runbook
Production incidents require structured response workflows to minimize downtime and ensure proper root cause analysis. The incident response process follows a clear escalation path from detection through resolution and postmortem.
Incident severity classification
P0 - Critical (immediate response):
- Complete service outage affecting all users
- Data loss or corruption detected
- Security breach or unauthorized access
- Blockchain consensus failure (validators offline)
P1 - High (15-minute response):
- Partial service degradation affecting >50% users
- Settlement transaction failures >5% rate
- Database replication lag >5 minutes
- Critical component failure with redundancy active
P2 - Medium (1-hour response):
- Performance degradation within SLA but trending worse
- Single component failure with no user impact
- Subgraph indexing lag >50 blocks
- Non-critical alert patterns requiring investigation
P3 - Low (next business day):
- Minor performance issues
- Non-blocking errors in logs
- Resource utilization trending toward limits
- Documentation or configuration improvements needed
Escalation criteria
Escalate incidents when:
- Mitigation attempts fail after 30 minutes
- Impact scope increases during response
- Root cause requires specialized expertise (smart contracts, blockchain infrastructure)
- Incident affects compliance or regulatory obligations
- Multiple components failing simultaneously suggests systemic issue
Escalation contacts maintained in PagerDuty rotation. For P0 incidents, engage incident commander role to coordinate cross-team response.
Communication protocols
Status page updates:
Update status page at:
- Incident start (within 5 minutes of confirmation)
- Every 30 minutes during active mitigation
- Significant changes in scope or impact
- Resolution confirmation
- Postmortem publication
Status page example entries:
[Investigating] Settlement transactions experiencing delays >2 minutes. Team investigating blockchain node connectivity.
[Identified] Issue identified as RPC gateway upstream health check failure. Restarting ERPC pods.
[Monitoring] ERPC gateway restored. Monitoring settlement latency for 30 minutes before confirming resolution.
[Resolved] Settlement transactions returned to normal latency <5s. Root cause: upstream timeout configuration too aggressive.
Stakeholder notifications:
- P0/P1: Immediate Slack notification to #incidents channel + PagerDuty alert
- P0: Email to executive stakeholders within 15 minutes
- All: Post-incident summary to #engineering within 24 hours
- P0/P1: Postmortem document within 5 business days
Mitigation playbooks
Rollback deployment:
# Check recent deployments
helm history atk -n atk
# Rollback to previous revision
helm rollback atk -n atk
# Monitor rollback progress
kubectl rollout status deployment/atk-dapp -n atk
watch kubectl get pods -n atk
Use rollback when:
- Recent deployment correlates with incident start time
- New version showing increased error rates or latency
- Configuration change caused service disruption
Scale resources:
# Scale application replicas
kubectl scale deployment/atk-dapp -n atk --replicas=6
# Scale RPC nodes for increased blockchain read capacity
kubectl scale statefulset/besu-rpc -n atk --replicas=5
# Scale Hasura for database query load
helm upgrade atk . -n atk \
--set hasura.replicaCount=4 \
--reuse-values
Use scaling when:
- CPU/memory utilization >85% sustained
- Request queue depth increasing
- Response times degrading under load
- Connection pool exhaustion
Restart services:
# Restart DApp deployment
kubectl rollout restart deployment/atk-dapp -n atk
# Restart ERPC gateway to refresh upstream health
kubectl rollout restart deployment/erpc -n atk
# Restart Hasura for database connection reset
kubectl rollout restart deployment/hasura -n atk
# Restart specific pod (last resort)
kubectl delete pod -n atk hasura-PODNAME
Use restarts when:
- Application showing memory leaks or resource exhaustion
- Stale connections to external services
- Configuration reload required without full redeployment
- Clearing transient error states
Root cause analysis
After immediate mitigation, identify root cause through observability stack:
Metrics analysis: Review dashboard metrics for 2 hours before incident. Compare against baseline to identify anomalies. Look for:
- Resource utilization spikes (CPU, memory, network)
- Error rate increases in specific components
- Latency percentile degradation (p95, p99)
- Throughput drops or queue depth increases
Log correlation: Query Loki for error patterns:
{namespace="atk"} |= "error" or "fatal" or "exception"
| json
| __error__=""
| line_format "{{.timestamp}} {{.level}} {{.component}} {{.message}}"Filter to incident time window. Identify first occurrence of errors and trace backward to triggering event.
Trace analysis: Open Tempo and search for failed requests during incident window. Follow trace spans to identify:
- Which component introduced latency or errors
- External dependency timeouts or failures
- Database query performance issues
- Authentication or authorization failures
Change correlation: Review recent changes:
# Recent deployments
helm history atk -n atk
# Recent pod restarts
kubectl get events -n atk --sort-by='.lastTimestamp' | head -50
# Git commits deployed
git log --since="2 hours ago" --onelineCorrelate incident timing with deployments, configuration changes, or infrastructure modifications.
Postmortem documentation
Document every P0 and P1 incident with structured postmortem:
Incident summary:
- Date and time (start, detection, mitigation, resolution)
- Severity classification
- Impact scope (users affected, revenue impact, SLA breach duration)
- Incident commander and responders
Timeline:
Chronological sequence of events with timestamps:
14:23 UTC - Alert fired: HighSettlementLatency
14:25 UTC - On-call engineer acknowledged
14:28 UTC - Identified ERPC upstream health check failures
14:30 UTC - Status page updated: Investigating
14:32 UTC - Restarted ERPC deployment
14:35 UTC - Settlement latency returning to normal
14:45 UTC - Confirmed resolution, updated status page
Root cause:
Single-sentence root cause statement, followed by detailed explanation:
Root cause: ERPC gateway upstream health check timeout set to 5s caused false negatives during normal Besu block production delays of 10-15s, marking healthy RPC nodes as unhealthy and routing traffic to a single remaining node.
Impact assessment:
- Users affected: Estimated 200 active users during incident window
- Transactions affected: 347 settlement transactions delayed >2 minutes
- SLA impact: Settlement latency SLA (<5s) breached for 22 minutes
- Revenue impact: Minimal (transactions completed, delayed only)
Mitigation actions:
- Restarted ERPC deployment to reset upstream health state (14:32 UTC)
- Verified settlement latency returned to normal <5s (14:35 UTC)
- Monitored for 30 minutes to confirm stable state
Corrective actions (with owners and due dates):
- Action 1: Increase ERPC upstream health check timeout from 5s to 30s (Owner: DevOps, Due: 2025-01-30)
- Action 2: Add alerting for ERPC upstream health check failures (Owner: SRE, Due: 2025-02-05)
- Action 3: Document ERPC tuning guidelines in runbook (Owner: Tech Writer, Due: 2025-02-10)
- Action 4: Review all service health check timeouts for appropriate values (Owner: Platform Team, Due: 2025-02-15)
Lessons learned:
- Health check timeouts must account for normal system variability
- Upstream health check failures should trigger alerts before routing impact
- Need better observability into ERPC upstream selection decisions
- Status page updates were timely and clear (positive feedback from users)
Store postmortems in shared documentation repository (Git or Confluence). Reference in future incident response and system design discussions.
External observability forwarding
Send telemetry to external platforms for long-term retention, advanced analysis, or centralized monitoring across multiple deployments.
Configure Alloy to forward metrics, logs, and traces while maintaining local observability:
observability:
alloy:
endpoints:
external:
prometheus:
enabled: true
url: https://prometheus-prod.grafana.net/api/prom/push
basicAuth:
username: "123456"
passwordSecretRef:
name: grafana-cloud-secrets
key: prometheus-password
loki:
enabled: true
url: https://logs-prod.grafana.net/loki/api/v1/push
basicAuth:
username: "123456"
passwordSecretRef:
name: grafana-cloud-secrets
key: loki-password
tempo:
enabled: true
url: https://tempo-prod.grafana.net:443
basicAuth:
username: "123456"
passwordSecretRef:
name: grafana-cloud-secrets
key: tempo-password
Create secret with platform credentials:
kubectl create secret generic grafana-cloud-secrets \
--from-literal=prometheus-password='PROMETHEUS_TOKEN' \
--from-literal=loki-password='LOKI_TOKEN' \
--from-literal=tempo-password='TEMPO_TOKEN' \
-n atk
Alloy forwards all telemetry to both the local VictoriaMetrics/Loki/Tempo stack and the external endpoints. Local observability remains fully operational during external platform outages.
Datadog integration uses OTLP exporter:
observability:
alloy:
endpoints:
external:
otel:
enabled: true
url: https://api.datadoghq.com
headers:
DD-API-KEY: "${DATADOG_API_KEY}"
Uninstallation and cleanup
Complete removal requires deleting the Helm release, the namespace, and PVCs (which may not delete automatically, depending on the reclaim policy).
Uninstall Helm release:
helm uninstall atk -n atk
Delete namespace (removes all resources):
kubectl delete namespace atk
Warning: This permanently deletes all data, including blockchain state, databases, and configuration. Back up critical data before uninstalling.
List remaining PVCs with Retain reclaim policy:
kubectl get pv | grep atk
Manually delete retained PVs:
kubectl delete pv <pv-name>
Cloud provider volumes may require manual deletion through the provider console (AWS EBS volumes, Azure Disks, GCP Persistent Disks).
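For example, on AWS, the EBS volume backing a released PV can be identified from the PV spec and removed with the AWS CLI; a sketch with a placeholder volume ID:
# Read the EBS volume ID from the CSI volume handle, then delete the volume
kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}'
aws ec2 delete-volume --volume-id vol-0123456789abcdef0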
Next steps
- Review Installation & configuration for initial deployment and configuration reference
- See Testing and QA for running tests against deployed environments
- Explore Development FAQ for common deployment issues
- Consult Code Structure to understand monorepo layout
- Review Contract Reference for smart contract interfaces