Production operations
Run ATK in production with observability-driven operations, focusing on DALP lifecycle monitoring, automated troubleshooting workflows, and operational SLA tracking through integrated dashboards.
Operating ATK in production
Production operations center on the observability stack as your operational command center. Every critical system—blockchain nodes, settlement vaults, yield processors, RPC gateways, indexers—exposes metrics, logs, and traces that flow into Grafana dashboards. This integration transforms operations from reactive firefighting into proactive monitoring with predictable SLA tracking.
The operational workflow starts with dashboards showing real-time health, drills into Loki logs when anomalies appear, correlates issues through Tempo traces, and ends with kubectl commands to remediate. This playbook walks through common operational scenarios using this observability-first approach.
Observability as competitive advantage
ATK ships with production-grade observability that most tokenization platforms force you to assemble from multiple vendors. The integrated stack provides comprehensive monitoring out of the box—a critical differentiator for production deployments.
Access Grafana at the configured hostname (default: grafana.k8s.orb.local):
kubectl get ingress -n atk grafana
# Default credentials (change immediately)
Username: settlemint
Password: atk
The stack includes:
VictoriaMetrics stores time-series metrics with 1-month retention
(configurable via
observability.victoria-metrics-single.server.retentionPeriod). All components
expose Prometheus-compatible metrics automatically scraped by Alloy agents.
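If a longer retention window is needed, the value can be overridden at upgrade time; a minimal sketch, assuming the release name atk and the chart path used elsewhere in this guide (VictoriaMetrics interprets a bare retentionPeriod number as months):
# Extend metric retention to 3 months; --reuse-values keeps the rest of the config
helm upgrade atk . \
  --namespace atk \
  --reuse-values \
  --set observability.victoria-metrics-single.server.retentionPeriod=3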
Loki aggregates logs from every pod with 7-day retention (168 hours).
Structured metadata enables advanced queries like
{namespace="atk", app="dapp"} |= "error" | json to filter JSON logs by
severity.
Tempo collects OpenTelemetry traces with 7-day retention, correlating requests across services. Click any trace span to see related logs—critical for debugging distributed transactions through settlement vaults and blockchain confirmations.
Alloy agents run as DaemonSets, scraping metrics from annotated pods, collecting logs, and receiving traces. Configure forwarding to external platforms (Grafana Cloud, Datadog) while keeping local observability operational.
Grafana provides pre-configured datasources and auto-loaded dashboards from ConfigMaps. Every ATK component ships with production-ready dashboards covering critical metrics and SLA indicators.
Production dashboards and SLA targets
Blockchain layer SLAs
Besu Dashboard tracks validator performance with these SLA targets:
- Block production rate: >1 block per 15 seconds (target: <10s average)
- Peer connectivity: Matches validator count (100% uptime)
- Transaction throughput: >100 tx/s sustained (varies by network size)
- Chain reorganizations: <1 per day (indicates stability)
Monitor block production in the Besu dashboard. If block times exceed 15 seconds
for 10 minutes, the SlowBlockProduction alert fires. Check validator
connectivity first—network partitions manifest as peer count drops before block
delays.
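For a quick command-line spot check of peer connectivity, the JSON-RPC endpoint on a validator can be queried directly; a sketch, assuming the pod name besu-validator-0-0 and the default RPC port 8545:
# Query net_peerCount; the hex result should equal validator count - 1 (e.g., 0x3 for 4 validators)
kubectl exec -n atk besu-validator-0-0 -- \
  curl -s -X POST http://localhost:8545 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":1}'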
Portal Metrics dashboard shows RPC gateway authentication and rate limiting. Target SLAs:
- Authentication success rate: >99.9%
- Rate limit rejections: <0.1% of requests
- Gateway availability: 99.95% uptime
Settlement and vault operations
DALP lifecycle monitoring tracks vault operations and settlement flows—key differentiators for production asset tokenization:
Vault operations dashboard monitors:
- Deposit confirmations: <30s from blockchain finality to subgraph indexing
- Withdrawal processing: <2 minutes end-to-end (includes compliance checks)
- Vault balance reconciliation: Real-time vs. blockchain state divergence <0.01%
Query vault deposit latency:
{namespace="atk"} |= "VaultDeposit" | json | latency_ms > 30000This identifies deposits taking longer than 30 seconds, indicating either blockchain congestion or indexing lag.
Settlement monitoring tracks DvP (Delivery vs Payment) execution:
- DvP success rate: >99.5% (excluding legitimate rejections)
- Settlement latency: <5s from initiation to on-chain confirmation
- Failed settlement reasons: Categorized by compliance rejection, insufficient balance, timeout
Yield processing operations
Yield distribution dashboard monitors dividend and interest payments:
- Distribution execution time: <10 minutes for 10,000 token holders
- Failed distributions: <0.5% (due to compliance or gas limits)
- Gas cost per distribution: <200,000 gas per token holder (optimized batch processing)
Check yield distribution progress:
{namespace="atk", app="yield-processor"} |= "DistributionBatch" | json | status="processing"Monitor in-progress distributions. Alert if any batch exceeds 15 minutes—indicates blockchain congestion or gas price spikes requiring manual intervention.
Indexing layer SLAs
TheGraph Indexing Status dashboard tracks subgraph synchronization:
- Sync lag: <10 blocks behind chain head (target: <5 blocks)
- Indexing errors: Zero per hour (any error requires investigation)
- Entity processing time: <500ms per block
- Query performance: <100ms p50, <500ms p99
TheGraph Query Performance shows GraphQL operation metrics:
- Cache hit rate: >60% after 1 hour (improves over time)
- Query complexity: Monitor for inefficient queries exceeding cost limits
- Response times: <100ms for cached, <500ms for uncached queries
RPC gateway performance
ERPC Dashboard provides comprehensive gateway monitoring:
- Request routing: Healthy upstream selection >99.9%
- Cache effectiveness: >40% hit rate for read operations
- Rate limiting: <1% of legitimate requests rejected
- Error rate: <0.1% for production readiness
- Latency percentiles: p50 <50ms, p95 <200ms, p99 <500ms
Infrastructure health
Kubernetes Overview dashboard tracks cluster resources:
- Node capacity: CPU <80%, memory <85% sustained
- Pod health: >99.9% running state
- PVC utilization: <85% (alert at 80% for capacity planning)
NGINX Ingress metrics:
- Request rate: Baseline established per deployment
- Error rate: <0.5% (4xx + 5xx responses)
- Response time: p95 <1s for all endpoints
Security hardening
Production deployments require security measures beyond default configurations. Start with secrets management, enforce TLS, restrict network traffic, and apply pod security standards.
Secrets management strategy
Replace inline credentials with Kubernetes Secret references. Create secrets separately from Helm releases to enable rotation without redeployment:
kubectl create secret generic atk-postgres-secret \
--from-literal=password='GENERATED_SECURE_PASSWORD' \
-n atk
kubectl create secret generic atk-auth-secret \
--from-literal=jwt-secret='GENERATED_JWT_SECRET' \
--from-literal=session-secret='GENERATED_SESSION_SECRET' \
-n atk
Reference secrets in values-production.yaml:
global:
datastores:
default:
postgresql:
existingSecret: "atk-postgres-secret"
existingSecretKeys:
password: password
dapp:
env:
AUTH_SECRET:
valueFrom:
secretKeyRef:
name: atk-auth-secret
key: jwt-secret
External secret management integrates with HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault using the External Secrets Operator. Configure automatic rotation by updating the secret backend; pods restart automatically on secret changes when configured with annotations.
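One common pattern is the Stakater Reloader annotation, which triggers a rolling restart whenever a referenced Secret or ConfigMap changes; a sketch, assuming Reloader is installed in the cluster and the deployment is named atk-dapp:
# Opt the DApp deployment into automatic restarts on Secret/ConfigMap changes
kubectl annotate deployment atk-dapp -n atk \
  reloader.stakater.com/auto="true" --overwrite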
TLS termination and certificate management
Configure ingress resources with TLS certificates. Use cert-manager for automatic certificate provisioning and renewal from Let's Encrypt or enterprise CAs:
dapp:
ingress:
enabled: true
ingressClassName: "nginx"
tls:
- secretName: dapp-tls
hosts:
- dapp.example.com
hosts:
- host: dapp.example.com
paths:
- path: /
pathType: Prefix
annotations:
cert-manager.io/cluster-issuer: "letsencrypt-prod"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
Verify certificate expiration in Grafana: the NGINX Ingress dashboard includes SSL certificate expiration warnings 30 days before renewal.
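Certificate issuance and renewal can also be verified from the command line; a sketch, assuming the TLS secret name dapp-tls from the ingress configuration above:
# Check Certificate readiness and renewal status managed by cert-manager
kubectl get certificate -n atk
kubectl describe certificate dapp-tls -n atk

# Inspect the expiry date embedded in the TLS secret
kubectl get secret dapp-tls -n atk -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -enddate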
Network isolation policies
Enable NetworkPolicy resources to implement zero-trust networking. Restrict pod-to-pod traffic to only required communication paths:
erpc:
networkPolicy:
enabled: true
ingress:
- from:
- podSelector:
matchLabels:
app: dapp
ports:
- protocol: TCP
port: 8545
graph-node:
networkPolicy:
enabled: true
ingress:
- from:
- podSelector:
matchLabels:
app: hasura
- podSelector:
matchLabels:
app: dapp
ports:
- protocol: TCP
port: 8000
Verify network policies block unauthorized traffic by testing from a pod without matching labels. Network policy violations appear in Kubernetes audit logs queryable in Loki.
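A minimal sketch of such a test, assuming the ERPC service is reachable as erpc on port 8545 inside the namespace; the connection should time out when the policy is enforced:
# Launch a throwaway pod without the labels allowed by the policy and attempt to connect
kubectl run netpol-test -n atk --rm -it --restart=Never \
  --image=busybox:1.36 -- \
  nc -zv -w 3 erpc 8545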
Pod security standards compliance
Apply restricted Pod Security Standards to prevent privilege escalation and enforce least-privilege principles:
dapp:
podSecurityContext:
runAsNonRoot: true
runAsUser: 1001
fsGroup: 1001
seccompProfile:
type: RuntimeDefault
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumeMounts:
- name: tmp
mountPath: /tmp
- name: cache
mountPath: /app/.cache
volumes:
- name: tmp
emptyDir: {}
- name: cache
emptyDir: {}
Read-only root filesystems require explicit volume mounts for writable directories. The DApp needs /tmp and cache directories; other applications may require different paths.
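To verify the hardened settings took effect, attempt writes inside a running pod; a sketch, assuming the deployment is named atk-dapp:
# Writes outside the mounted emptyDir volumes should fail with "Read-only file system"
kubectl exec -n atk deployment/atk-dapp -- sh -c 'touch /should-fail' || true

# Writes to the explicitly mounted writable paths should succeed
kubectl exec -n atk deployment/atk-dapp -- sh -c 'touch /tmp/ok && echo "/tmp is writable"'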
High availability architecture
HA configurations distribute workload across multiple nodes and replicas. Start with critical path components: blockchain validators, RPC gateways, DApp instances, and databases. Then add pod disruption budgets and anti-affinity rules.
Replica configuration strategy
Scale validator nodes for blockchain consensus (minimum 4 for IBFT 2.0), RPC nodes for read capacity, and application instances for request throughput:
network:
network-nodes:
validatorReplicaCount: 4
rpcReplicaCount: 3
dapp:
replicaCount: 3
hasura:
replicas: 2
erpc:
replicaCount: 3
Validator count determines Byzantine fault tolerance. IBFT 2.0 tolerates (n-1)/3 Byzantine nodes, so 4 validators tolerate 1 failure and 7 validators tolerate 2 failures.
RPC node scaling provides read capacity and redundancy. Each RPC node maintains full blockchain state; ERPC gateway load-balances requests across healthy nodes.
Application scaling increases throughput. Stateless DApp instances handle concurrent users; horizontal scaling adds capacity linearly up to database connection limits.
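Where load fluctuates, a HorizontalPodAutoscaler can complement the static replica counts; a sketch, assuming the metrics server is installed and the deployment is named atk-dapp:
# Scale the DApp between 3 and 10 replicas, targeting 70% average CPU utilization
kubectl autoscale deployment atk-dapp -n atk \
  --min=3 --max=10 --cpu-percent=70

# Watch autoscaler decisions
kubectl get hpa -n atk -w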
Pod disruption budgets
Most subcharts include PodDisruptionBudget resources preventing simultaneous pod termination during cluster maintenance:
dapp:
podDisruptionBudget:
enabled: true
minAvailable: 1
hasura:
podDisruptionBudget:
minAvailable: 1
minAvailable: 1 ensures at least one replica stays running during node drains or upgrades. Adjust for higher availability requirements: minAvailable: 2 with replicaCount: 3 guarantees two instances during maintenance.
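Before scheduling maintenance, confirm the budgets exist and report allowed disruptions; a sketch, assuming the PDB follows the release naming (check the actual names with the first command):
# ALLOWED DISRUPTIONS of 0 means a node drain would currently be blocked
kubectl get pdb -n atk
kubectl describe pdb -n atk atk-dapp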
Node distribution with anti-affinity
Spread replicas across nodes to survive node failures. Pod anti-affinity rules influence scheduler placement:
dapp:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app.kubernetes.io/name: dapp
topologyKey: kubernetes.io/hostname
preferredDuringSchedulingIgnoredDuringExecution allows scheduling even if anti-affinity cannot be satisfied (useful in small clusters). Change to requiredDuringSchedulingIgnoredDuringExecution for hard requirements in large production clusters.
Monitor replica distribution in the Kubernetes Overview dashboard. Pods concentrated on single nodes indicate scheduler constraints or insufficient cluster capacity.
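The spread can also be checked directly; a sketch, assuming the standard Helm app.kubernetes.io/name label:
# List which node each DApp replica landed on; replicas sharing a node defeat anti-affinity
kubectl get pods -n atk -l app.kubernetes.io/name=dapp \
  -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName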
Database replication
PostgreSQL replication can be handled by an external operator or by the Bitnami PostgreSQL subchart, which supports a replication architecture:
support:
postgresql:
architecture: replication
replicaCount: 3
replication:
synchronousCommit: "on"
numSynchronousReplicas: 1
Synchronous replication ensures zero data loss at a performance cost: each write waits for acknowledgment from one replica before returning. Asynchronous replication improves performance but risks data loss during primary failures.
Monitor replication lag in PostgreSQL dashboard. Lag exceeding 100MB or 10 seconds indicates network issues or insufficient replica resources.
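Replication lag can also be measured on the primary itself; a sketch, assuming the pod name postgresql-0 and that the postgres user can authenticate from inside the pod:
# Byte lag per replica as seen from the primary
kubectl exec -n atk postgresql-0 -- psql -U postgres -c \
  "SELECT application_name, state,
          pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
   FROM pg_stat_replication;"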
Backup and disaster recovery
Production operations require backup strategies for blockchain data, application databases, and configuration state. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) determine backup frequency and retention.
Database backup automation
Use Velero for Kubernetes-native backups or PostgreSQL-specific backup operators. Velero provides crash-consistent volume snapshots with application-aware hooks:
# Install Velero with cloud provider plugin
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.9.0 \
--bucket atk-backups \
--backup-location-config region=us-east-1
# Create scheduled backup
velero schedule create atk-daily \
--schedule="0 2 * * *" \
--include-namespaces atk \
--storage-location default \
--snapshot-volumes
Velero backups include PVCs, ConfigMaps, Secrets, and resource definitions. Restoring recreates the namespace with identical state:
velero restore create atk-restore-20250129 \
--from-backup atk-daily-20250129020000
PostgreSQL logical backups capture consistent database state:
kubectl exec -n atk postgresql-0 -- \
pg_dump -U postgres -d atk \
| gzip > atk-db-backup-$(date +%Y%m%d).sql.gz
Automate this using CronJobs that upload to S3-compatible storage. Retain daily backups for 7 days, weekly for 4 weeks, and monthly for 12 months, based on compliance requirements.
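A minimal sketch of such a CronJob, assuming a PostgreSQL service named postgresql and that credentials are injected from the atk-postgres-secret created earlier (the object storage upload would be added in the same container):
# Nightly logical backup at 01:00; image tag and connection details are assumptions
# (a full manifest would mount PGPASSWORD from atk-postgres-secret)
kubectl create cronjob atk-db-backup -n atk \
  --image=postgres:16 \
  --schedule="0 1 * * *" \
  -- /bin/sh -c 'pg_dump -h postgresql -U postgres -d atk | gzip > /tmp/atk-$(date +%Y%m%d).sql.gz'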
Blockchain data snapshots
Besu validator and RPC node data resides in PersistentVolumes. Volume snapshots provide point-in-time recovery:
# List blockchain PVCs
kubectl get pvc -n atk | grep besu
# Create snapshot (requires CSI driver support)
kubectl create -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: besu-validator-0-snapshot
namespace: atk
spec:
volumeSnapshotClassName: csi-snapshots
source:
persistentVolumeClaimName: besu-validator-0
EOF
Snapshot creation time depends on volume size (typically 1-5 minutes for 100GB). Snapshots consume cloud storage incrementally based on changed blocks.
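Confirm the snapshot reached a ready state before relying on it:
# READYTOUSE should report true once the CSI driver finishes the snapshot
kubectl get volumesnapshot -n atk besu-validator-0-snapshot
kubectl describe volumesnapshot -n atk besu-validator-0-snapshot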
Restore from snapshots by creating new PVCs referencing snapshot as data source:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: besu-validator-0-restored
namespace: atk
spec:
dataSource:
name: besu-validator-0-snapshot
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
Configuration backup and version control
Store Helm values files in version-controlled repositories. Extract current deployed configuration:
helm get values atk -n atk > atk-values-backup-$(date +%Y%m%d).yaml
Commit to Git alongside infrastructure-as-code definitions. Tag releases matching production deployments for traceable rollback:
git tag -a prod-v1.2.3 -m "Production deployment 2025-01-29"
git push origin prod-v1.2.3
Disaster recovery procedure:
- Provision new Kubernetes cluster
- Install ATK using backed-up Helm values
- Restore database from latest backup
- Restore blockchain PVCs from snapshots (or fast-sync from network)
- Verify observability dashboards show healthy state
RTO target: <4 hours for full cluster recovery. RPO target: <1 hour data loss (based on hourly database backups and blockchain network fast-sync capability).
Operational scenarios
Real production operations combine dashboard monitoring with log analysis and remediation commands. These scenarios walk through common issues using observability-first workflows.
Scenario: settlement transaction delays
Symptom: Users report vault withdrawals taking longer than expected. DApp shows "pending" status beyond normal 2-minute SLA.
Investigation workflow:
- Open Besu dashboard → Check block production rate. If blocks produce normally (<10s average), the blockchain is healthy.
- Open DALP vault operations dashboard → Check withdrawal processing time percentiles. If p95 >120s, the issue is in withdrawal logic, not the blockchain.
- Query Loki for withdrawal transactions:
{namespace="atk", app="dapp"} |= "WithdrawalRequest" | json | status="pending" | line_format "{{.userId}} {{.vaultAddress}} {{.amount}} {{.duration_ms}}"
This shows pending withdrawals with user IDs and processing time. If multiple transactions from the same vault show high duration, the vault contract may have high gas costs or compliance rule delays.
- Open Tempo traces → Search for trace IDs from slow withdrawal logs. Trace spans reveal time spent in each phase: compliance validation, vault unlock, blockchain submission, confirmation waiting.
- Identify the bottleneck from the trace breakdown:
  - Compliance validation >30s: Compliance rules executing expensive off-chain checks. Review rule complexity; consider caching verification results.
  - Blockchain submission >60s: Transaction pool congestion or low gas price. Check transaction pool size in the Besu dashboard; increase gas price if needed.
  - Confirmation waiting >45s: Normal for blockchain finality. Communicate expected wait time in the UI; no remediation needed.
Remediation: If compliance validation is the bottleneck, review identity registry queries in application logs:
{namespace="atk", app="dapp"} |= "ComplianceCheck" | json | validation_time_ms > 10000Optimize by adding database indexes on identity lookup fields or implementing compliance rule caching with 5-minute TTL.
Verification: After optimization, monitor withdrawal latency in vault operations dashboard. p95 should drop below 120s within 30 minutes as cache warms up.
Scenario: subgraph indexing lag
Symptom: TheGraph Indexing Status dashboard shows sync lag increasing beyond 10 blocks. Queries return stale data; recent transactions don't appear in DApp.
Investigation workflow:
- TheGraph Indexing Status dashboard → Check the "Blocks Behind Head" metric. If lag grows continuously, the subgraph cannot keep pace with block production.
- Check the indexing error rate. Any errors require immediate investigation:
{namespace="atk", app="graph-node"} |= "error" | json | line_format "{{.subgraphName}} {{.blockNumber}} {{.errorMessage}}"
Common errors:
- "Entity not found": Schema mismatch between subgraph and blockchain events. Verify deployed contract addresses match subgraph configuration.
- "Transaction reverted": Subgraph attempting to index failed transactions. Update subgraph handlers to check transaction receipt status.
- "Database timeout": PostgreSQL connection saturation. Check database connections in the PostgreSQL dashboard.
- If no errors appear, check entity processing time. High processing time indicates complex event handlers or database query inefficiency:
{namespace="atk", app="graph-node"} |= "BlockProcessed" | json | processing_time_ms > 500
- Review Tempo traces for subgraph handler execution. Slow database queries appear as long-duration spans. Common issues:
- Multiple database roundtrips per event (N+1 queries)
- Missing indexes on frequently-queried fields
- Complex entity relationships requiring optimization
Remediation: For slow database queries, add indexes to subgraph schema:
type Transfer @entity {
id: ID!
from: Bytes! @index
to: Bytes! @index
value: BigInt!
blockNumber: BigInt! @index
timestamp: BigInt! @index
}
Redeploy the subgraph after schema changes. Indexing resumes from the last successfully processed block.
For persistent lag without errors, scale graph-node replicas to handle parallel indexing:
graph-node:
replicaCount: 2
Verification: Monitor the "Blocks Behind Head" metric. Lag should decrease to <5 blocks within 15 minutes after scaling or optimization. Query performance improves as cache effectiveness increases.
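Sync progress can also be confirmed from the command line via Graph Node's indexing status API; a sketch, assuming the service is named graph-node and exposes the default index-node port 8030:
# Port-forward the index-node status port
kubectl port-forward -n atk svc/graph-node 8030:8030 &

# Compare latestBlock to chainHeadBlock to measure sync lag directly
curl -s http://localhost:8030/graphql \
  -H 'Content-Type: application/json' \
  -d '{"query":"{ indexingStatuses { subgraph synced health chains { chainHeadBlock { number } latestBlock { number } } } }"}'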
Scenario: high API error rate
Symptom: NGINX Ingress dashboard shows 5xx error rate spike above 1%. Users report intermittent "server error" messages in DApp.
Investigation workflow:
- NGINX Ingress dashboard → Check error breakdown by status code and upstream. If errors concentrate on a specific backend (e.g., Hasura), the issue is in that service, not the ingress.
- Query Loki for error responses:
{namespace="atk", app="hasura"} |= "error" | json | statusCode >= 500
Common 5xx errors:
- 503 Service Unavailable: Upstream connection refused. Pod may be crashing or not ready.
- 502 Bad Gateway: Upstream closed connection. Timeout or OOM kill likely.
- 500 Internal Server Error: Application logic error. Check error details in logs.
- Check pod health in the Kubernetes Overview dashboard:
  - Restart count >5: Pod crashlooping. Check logs for panic or fatal errors.
  - Memory usage >90%: Approaching OOM. Increase memory limits.
  - CPU throttling: Pod receiving insufficient CPU. Increase requests or limits.
- For 503 errors, verify pod readiness:
kubectl get pods -n atk -l app=hasura -o wide
Pods in Running state but failing readiness checks don't receive traffic. Readiness failures appear in describe output:
kubectl describe pod -n atk hasura-PODNAME
- Open Tempo traces for failed requests. The trace reveals where the error originated:
- Error in Hasura GraphQL execution: Database connection exhausted or query timeout
- Error in DApp backend: Application logic exception or external API failure
- Error in authentication middleware: JWT validation failure or session expiration
Remediation: For database connection exhaustion, check PostgreSQL dashboard connection pool metrics:
hasura:
config:
HASURA_GRAPHQL_MAX_CONNECTIONS: 50
HASURA_GRAPHQL_POOL_TIMEOUT: 180
Increase MAX_CONNECTIONS to 100 if pool utilization exceeds 80% sustained.
Scale PostgreSQL replicas if connection limit cannot be increased further.
For memory-related OOM kills, increase pod memory limits:
hasura:
resources:
limits:
memory: 2Gi
requests:
memory: 1Gi
Verification: Monitor the error rate in the NGINX Ingress dashboard. Errors should drop below 0.5% within 5 minutes after remediation. Check pod metrics in the Kubernetes dashboard to verify memory usage stays below 85%.
Scenario: validator node connectivity loss
Symptom: Besu dashboard shows peer count drop on one validator. Block production continues but may become unstable if additional validators fail.
Investigation workflow:
- Besu dashboard → Identify the affected validator by node name. Check the peer connectivity graph showing connections between validators.
- Query Loki for network errors on the affected pod:
{namespace="atk", pod=~"besu-validator-2.*"} |= "peer" |= "disconnect"
Peer disconnections indicate:
- Network partition between validators
- Firewall rules blocking P2P ports (30303 TCP/UDP)
- Pod network policy too restrictive
- Validator crashed and restarted (peers reconnect slowly)
- Check pod events for network-related issues:
kubectl describe pod -n atk besu-validator-2-0
Look for:
- NetworkNotReady: CNI plugin failure
- FailedMount: Volume mount issues preventing startup
- Unhealthy: Liveness probe failures indicating a crash
- Test connectivity between validators:
kubectl exec -n atk besu-validator-0-0 -- \
  nc -zv besu-validator-2-0.besu-validator.atk.svc.cluster.local 30303
Connection refused indicates the pod is not listening. Connection timeout indicates a network policy or firewall blocking traffic.
- Review network policies:
kubectl get networkpolicy -n atk
kubectl describe networkpolicy -n atk besu-network-policy
Remediation: For network policy issues, verify P2P port ingress is allowed:
besu:
networkPolicy:
enabled: true
ingress:
- from:
- podSelector:
matchLabels:
app: besu
ports:
- protocol: TCP
port: 30303
- protocol: UDP
port: 30303
For crashed pods, check logs for errors before restarting:
kubectl logs -n atk besu-validator-2-0 --previous
Common crash causes:
- OutOfMemory: Increase memory limits from 4Gi to 8Gi
- Corrupted database: Restore from PVC snapshot or reinitialize
- Genesis mismatch: Regenerate genesis.json and restart all validators
Restart affected validator after remediation:
kubectl rollout restart statefulset/besu-validator -n atk
Verification: Monitor peer count in the Besu dashboard. All validators should show full connectivity (peer count = validator count - 1) within 5 minutes. Block production rate returns to normal.
CI/CD integration
Automated deployment pipelines ensure consistent, repeatable releases. The included GitHub Actions workflows provide foundation for extending to production deployment.
QA workflow structure
The .github/workflows/qa.yml workflow executes a comprehensive testing pipeline:
Artifact generation phase compiles smart contracts, generates ABIs, creates TypeScript bindings, and produces genesis allocation:
- name: Generate artifacts
run: bun run artifacts
- name: Upload contract ABIs
uses: actions/upload-artifact@v4
with:
name: contract-artifacts
path: kit/contracts/artifacts/
Artifacts become Docker image layers and Helm chart ConfigMaps for deployment.
Backend services startup launches Docker Compose stack with PostgreSQL, Redis, Besu node, and supporting services. Tests execute against this local environment mimicking production:
- name: Start development environment
run: bun dev:up
timeout-minutes: 5
Test execution runs parallel test suites across packages:
- name: Run test suite
run: bunx turbo run ci:gha --concurrency=3
This includes unit tests (Vitest, Foundry), integration tests (contract interactions), E2E tests (Playwright), linting, and type checking.
Chart validation (conditional on Helm chart changes) uses ct
(chart-testing) to lint and validate against Kubernetes API:
- name: Lint and install charts
run: |
ct lint --config ct.yaml
ct install --config ct.yaml
Image builds create container images tagged with the commit SHA and push them to GitHub Container Registry:
- name: Build and push DApp image
uses: docker/build-push-action@v5
with:
context: .
file: Dockerfile.dapp
push: true
tags: ghcr.io/${{ github.repository }}/dapp:${{ github.sha }}
Production deployment workflow
Extend QA workflow for production deployment triggered by git tags:
name: Deploy to production
on:
push:
tags:
- "v*"
jobs:
deploy:
runs-on: ubuntu-latest
environment: production
steps:
- uses: actions/checkout@v4
- name: Configure kubectl
run: |
echo "${{ secrets.KUBE_CONFIG }}" | base64 -d > kubeconfig
export KUBECONFIG=kubeconfig
kubectl cluster-info
- name: Deploy ATK
working-directory: kit/charts/atk
run: |
helm upgrade atk . \
--install \
--namespace atk \
--create-namespace \
--values values-production.yaml \
--set dapp.image.tag=${{ github.ref_name }} \
--set contracts.image.tag=${{ github.ref_name }} \
--timeout 20m \
--wait \
--atomic
The --atomic flag rolls back automatically if the deployment fails. --wait waits for all resources to reach a ready state before the deployment is considered successful.
Post-deployment verification checks observability dashboards programmatically:
- name: Verify deployment health
run: |
# Wait for pods to be ready
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/instance=atk \
-n atk \
--timeout=600s
# Check Besu validator connectivity
PEER_COUNT=$(kubectl exec -n atk besu-validator-0-0 -- \
curl -s -X POST http://localhost:8545 \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"net_peerCount","params":[],"id":1}' \
| jq -r '.result' | xargs printf "%d")
if [ "$PEER_COUNT" -lt 3 ]; then
echo "Insufficient peers: $PEER_COUNT"
exit 1
fi
GitOps deployment with ArgoCD
Declarative deployments using ArgoCD provide automatic synchronization and drift detection:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: atk-production
namespace: argocd
spec:
project: tokenization
source:
repoURL: https://github.com/your-org/asset-tokenization-kit
targetRevision: main
path: kit/charts/atk
helm:
valueFiles:
- values-production.yaml
parameters:
- name: dapp.image.tag
value: v1.2.3
destination:
server: https://kubernetes.default.svc
namespace: atk
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
ArgoCD monitors the Git repository and automatically applies changes. Drift detection alerts when manual kubectl changes deviate from Git state. Self-healing automatically reverts unauthorized changes.
Progressive delivery with ArgoCD Rollouts enables canary deployments:
dapp:
rollout:
enabled: true
strategy:
canary:
steps:
- setWeight: 20
- pause: { duration: 5m }
- setWeight: 50
- pause: { duration: 5m }
- setWeight: 80
- pause: { duration: 5m }
Canary deployment gradually shifts traffic to the new version. The pause between steps allows observability dashboard monitoring for increased error rates or latency before proceeding.
Troubleshooting with observability
Integrated observability accelerates troubleshooting by correlating metrics, logs, and traces. Start with high-level dashboards to identify symptoms, drill into logs for details, follow traces through distributed operations, then execute remediation commands.
General troubleshooting workflow
Pod crashes and restart loops
Using observability: Open Kubernetes Overview dashboard → Find pods with restart count >3. Click pod name to filter metrics showing resource usage before crash.
Query Loki for crash logs:
{namespace="atk"} |= "panic" or "fatal" or "OOMKilled"Filter by specific application:
{namespace="atk", pod=~"dapp-.*"} |= "error" | json | level="fatal"Check memory trends before OOM kills in Node Exporter dashboard. Memory usage climbing to pod limit indicates undersized resources.
Using kubectl: Check previous pod logs (before crash):
kubectl logs -n atk <pod-name> --previous --tail=100
Describe the pod for the termination reason:
kubectl describe pod -n atk <pod-name>
Look for:
- OOMKilled: Memory limit exceeded → Increase resources.limits.memory
- Error: Nonzero exit code → Check application logs for the panic
- CrashLoopBackOff: Pod repeatedly failing startup → Fix configuration or application code
Common remediation:
Database connection errors:
{namespace="atk", app="dapp"} |= "database" |= "connection refused"PostgreSQL dashboard shows connection pool exhausted. Increase application connection pool size or scale database replicas.
RPC connection failures:
{namespace="atk", app="dapp"} |= "RPC" |= "timeout"ERPC dashboard displays all upstreams unhealthy. Check Besu validator status; if validators are healthy, restart ERPC gateway to refresh upstream health checks.
Ingress connectivity issues
Using observability: Open NGINX Ingress dashboard → Check request rate and status code distribution. High 502/503 rate indicates backend unavailability; 404 rate suggests misconfigured routes.
Query ingress controller logs:
{namespace="atk", app="ingress-nginx"} |= "error" | json | statusCode >= 500SSL/TLS errors:
{namespace="atk", app="ingress-nginx"} |= "TLS" |= "handshake"Check certificate expiration in dashboard SSL certificate panel. Expired certificates require renewal via cert-manager or manual secret update.
Using kubectl: Verify ingress controller pods running:
kubectl get pods -n atk -l app.kubernetes.io/name=ingress-nginx
Check ingress resources:
kubectl get ingress -n atk
kubectl describe ingress -n atk dapp-ingress
Ensure the ADDRESS field is populated with the LoadBalancer IP. A missing address indicates a LoadBalancer provisioning failure or cloud provider issue.
Test DNS resolution:
nslookup dapp.example.com
Verify the DNS record points to the ingress LoadBalancer IP. DNS propagation takes 1-60 minutes after changes.
Database migration failures
The DApp includes a post-install Helm hook job that executes Drizzle migrations. Hook failures prevent the deployment from succeeding.
Check migration job logs:
kubectl logs -n atk job/atk-dapp-migrations
Common failures:
- Schema already exists: Migration previously executed. Delete the job and retry:
kubectl delete job -n atk atk-dapp-migrations
- Connection refused: Database not ready. Increase the helm.sh/hook-wait-timeout annotation or add an init container polling database readiness.
- Permission denied: Database user lacks schema modification rights. Grant superuser or schema owner privileges.
Manually execute migrations:
kubectl exec -n atk deployment/atk-dapp -- bun run db:migrate
Genesis file regeneration
Modifying smart contract deployments requires regenerating genesis allocations and restarting validators with new genesis.
Regenerate artifacts locally:
bun run artifacts
Update the genesis ConfigMap:
kubectl create configmap atk-genesis \
--from-file=genesis.json=kit/contracts/genesis.json \
--dry-run=client -o yaml \
| kubectl apply -n atk -f -
Restart validators to load the new genesis:
kubectl rollout restart statefulset/besu-validator -n atk
kubectl rollout restart statefulset/besu-rpc -n atk
Monitor the restart in the Besu dashboard. Block production resumes within 2 minutes after a majority of validators restart.
WARNING: Genesis changes on a running network cause a chain fork. Only regenerate genesis for development/test environments or when performing a coordinated network upgrade with all participants.
Helm release recovery
Failed Helm upgrades leave releases in pending or failed state. Recover using rollback or force upgrade.
Check release status:
helm status atk -n atk
helm history atk -n atk
Roll back to the last successful revision:
helm rollback atk -n atk
Specify a revision number for a targeted rollback:
helm rollback atk 3 -n atk
Force upgrade (use cautiously; it may cause resource conflicts):
helm upgrade atk . \
--namespace atk \
--values values-production.yaml \
--force \
--cleanup-on-fail
--force deletes and recreates resources with changes. --cleanup-on-fail removes resources if the upgrade fails, preventing a stuck state.
For severely corrupted releases, uninstall and reinstall (destroys data):
helm uninstall atk -n atk
helm install atk . -n atk -f values-production.yaml
Restore data from backups after reinstalling.
Alerting configuration
Proactive monitoring requires alerting on SLA violations and anomalous conditions. Configure Grafana alert rules with notification channels for critical events.
Alert rule structure
Navigate to Grafana → Alerting → Alert rules → New alert rule. Define query, threshold, evaluation interval, and notification policy.
Example alert for slow block production:
groups:
- name: blockchain_health
interval: 1m
rules:
- alert: SlowBlockProduction
expr: rate(besu_blockchain_height[5m]) < 0.067
for: 10m
labels:
severity: warning
component: blockchain
annotations:
summary: "Block production slower than target"
description:
"Block production rate {{ $value | humanize }} blocks/s is below
target 0.067 blocks/s (15s block time) for 10 minutes on {{
$labels.instance }}"Alert fires when 5-minute block production rate drops below 1 block per 15 seconds sustained for 10 minutes.
Critical SLA alerts
Settlement latency:
- alert: HighSettlementLatency
expr: histogram_quantile(0.95, rate(settlement_duration_seconds_bucket[5m])) > 5
for: 5m
labels:
severity: critical
annotations:
summary: "Settlement latency exceeds SLA"
description: "95th percentile settlement time {{ $value }}s exceeds 5s SLA"Subgraph sync lag:
- alert: SubgraphSyncLag
expr: (ethereum_chain_head_number - ethereum_block_number) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "Subgraph falling behind chain head"
description: "Subgraph {{ $labels.subgraph }} is {{ $value }} blocks behind chain head"Database connection exhaustion:
- alert: DatabaseConnectionPoolSaturated
expr: (sum(pg_stat_activity_count) by (instance) /
sum(pg_settings_max_connections) by (instance)) > 0.9
for: 3m
labels:
severity: critical
annotations:
summary: "Database connection pool >90% utilized"
description: "PostgreSQL instance {{ $labels.instance }} has {{ $value |
humanizePercentage }} connection pool utilization"
Disk space critical:
- alert: PVCDiskSpaceCritical
expr:
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) >
0.85
for: 5m
labels:
severity: warning
annotations:
summary: "PVC {{ $labels.persistentvolumeclaim }} >85% full"
description: "Disk usage {{ $value | humanizePercentage }} on PVC {{
$labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }}"
Notification channels
Configure contact points for alert delivery:
Slack integration:
- Navigate to Grafana → Alerting → Contact points → New contact point
- Select "Slack" type
- Enter webhook URL from Slack App configuration
- Test contact point to verify delivery
PagerDuty integration:
- Create PagerDuty integration key for ATK service
- Add contact point with "PagerDuty" type
- Enter integration key
- Configure routing by severity (critical → page on-call, warning → notification)
Email notification:
- Configure SMTP settings in Grafana Helm values:
grafana:
smtp:
enabled: true
host: smtp.example.com:587
user: [email protected]
password: "${SMTP_PASSWORD}"
from_address: [email protected]
- Add email contact point with recipient addresses
Notification policies
Route alerts to appropriate channels based on severity and component:
# Grafana notification policy configuration
policies:
- match:
severity: critical
continue: true
contact_point: pagerduty
- match:
severity: warning
component: blockchain
contact_point: slack-ops
- match:
severity: warning
contact_point: email-team
Critical alerts page the on-call engineer via PagerDuty. Warning-level blockchain alerts notify the operations Slack channel. Other warnings send email to the engineering team.
Incident response runbook
Production incidents require structured response workflows to minimize downtime and ensure proper root cause analysis. The incident response process follows a clear escalation path from detection through resolution and postmortem.
Incident severity classification
P0 - Critical (immediate response):
- Complete service outage affecting all users
- Data loss or corruption detected
- Security breach or unauthorized access
- Blockchain consensus failure (validators offline)
P1 - High (15-minute response):
- Partial service degradation affecting >50% users
- Settlement transaction failures >5% rate
- Database replication lag >5 minutes
- Critical component failure with redundancy active
P2 - Medium (1-hour response):
- Performance degradation within SLA but trending worse
- Single component failure with no user impact
- Subgraph indexing lag >50 blocks
- Non-critical alert patterns requiring investigation
P3 - Low (next business day):
- Minor performance issues
- Non-blocking errors in logs
- Resource utilization trending toward limits
- Documentation or configuration improvements needed
Escalation criteria
Escalate incidents when:
- Mitigation attempts fail after 30 minutes
- Impact scope increases during response
- Root cause requires specialized expertise (smart contracts, blockchain infrastructure)
- Incident affects compliance or regulatory obligations
- Multiple components failing simultaneously suggests systemic issue
Escalation contacts maintained in PagerDuty rotation. For P0 incidents, engage incident commander role to coordinate cross-team response.
Communication protocols
Status page updates:
Update status page at:
- Incident start (within 5 minutes of confirmation)
- Every 30 minutes during active mitigation
- Significant changes in scope or impact
- Resolution confirmation
- Postmortem publication
Status page example entries:
[Investigating] Settlement transactions experiencing delays >2 minutes. Team investigating blockchain node connectivity.
[Identified] Issue identified as RPC gateway upstream health check failure. Restarting ERPC pods.
[Monitoring] ERPC gateway restored. Monitoring settlement latency for 30 minutes before confirming resolution.
[Resolved] Settlement transactions returned to normal latency <5s. Root cause: upstream timeout configuration too aggressive.
Stakeholder notifications:
- P0/P1: Immediate Slack notification to #incidents channel + PagerDuty alert
- P0: Email to executive stakeholders within 15 minutes
- All: Post-incident summary to #engineering within 24 hours
- P0/P1: Postmortem document within 5 business days
Mitigation playbooks
Rollback deployment:
# Check recent deployments
helm history atk -n atk
# Rollback to previous revision
helm rollback atk -n atk
# Monitor rollback progress
kubectl rollout status deployment/atk-dapp -n atk
watch kubectl get pods -n atk
Use rollback when:
- Recent deployment correlates with incident start time
- New version showing increased error rates or latency
- Configuration change caused service disruption
Scale resources:
# Scale application replicas
kubectl scale deployment/atk-dapp -n atk --replicas=6
# Scale RPC nodes for increased blockchain read capacity
kubectl scale statefulset/besu-rpc -n atk --replicas=5
# Scale Hasura for database query load
helm upgrade atk . -n atk \
--set hasura.replicaCount=4 \
--reuse-values
Use scaling when:
- CPU/memory utilization >85% sustained
- Request queue depth increasing
- Response times degrading under load
- Connection pool exhaustion
Restart services:
# Restart DApp deployment
kubectl rollout restart deployment/atk-dapp -n atk
# Restart ERPC gateway to refresh upstream health
kubectl rollout restart deployment/erpc -n atk
# Restart Hasura for database connection reset
kubectl rollout restart deployment/hasura -n atk
# Restart specific pod (last resort)
kubectl delete pod -n atk hasura-PODNAME
Use restarts when:
- Application showing memory leaks or resource exhaustion
- Stale connections to external services
- Configuration reload required without full redeployment
- Clearing transient error states
Root cause analysis
After immediate mitigation, identify root cause through observability stack:
Metrics analysis: Review dashboard metrics for 2 hours before incident. Compare against baseline to identify anomalies. Look for:
- Resource utilization spikes (CPU, memory, network)
- Error rate increases in specific components
- Latency percentile degradation (p95, p99)
- Throughput drops or queue depth increases
Log correlation: Query Loki for error patterns:
{namespace="atk"} |= "error" or "fatal" or "exception"
| json
| __error__=""
| line_format "{{.timestamp}} {{.level}} {{.component}} {{.message}}"Filter to incident time window. Identify first occurrence of errors and trace backward to triggering event.
Trace analysis: Open Tempo and search for failed requests during incident window. Follow trace spans to identify:
- Which component introduced latency or errors
- External dependency timeouts or failures
- Database query performance issues
- Authentication or authorization failures
Change correlation: Review recent changes:
# Recent deployments
helm history atk -n atk
# Recent pod restarts
kubectl get events -n atk --sort-by='.lastTimestamp' | head -50
# Git commits deployed
git log --since="2 hours ago" --onelineCorrelate incident timing with deployments, configuration changes, or infrastructure modifications.
Postmortem documentation
Document every P0 and P1 incident with structured postmortem:
Incident summary:
- Date and time (start, detection, mitigation, resolution)
- Severity classification
- Impact scope (users affected, revenue impact, SLA breach duration)
- Incident commander and responders
Timeline:
Chronological sequence of events with timestamps:
14:23 UTC - Alert fired: HighSettlementLatency
14:25 UTC - On-call engineer acknowledged
14:28 UTC - Identified ERPC upstream health check failures
14:30 UTC - Status page updated: Investigating
14:32 UTC - Restarted ERPC deployment
14:35 UTC - Settlement latency returning to normal
14:45 UTC - Confirmed resolution, updated status page
Root cause:
Single-sentence root cause statement, followed by detailed explanation:
Root cause: ERPC gateway upstream health check timeout set to 5s caused false negatives during normal Besu block production delays of 10-15s, marking healthy RPC nodes as unhealthy and routing traffic to a single remaining node.
Impact assessment:
- Users affected: Estimated 200 active users during incident window
- Transactions affected: 347 settlement transactions delayed >2 minutes
- SLA impact: Settlement latency SLA (<5s) breached for 22 minutes
- Revenue impact: Minimal (transactions completed, delayed only)
Mitigation actions:
- Restarted ERPC deployment to reset upstream health state (14:32 UTC)
- Verified settlement latency returned to normal <5s (14:35 UTC)
- Monitored for 30 minutes to confirm stable state
Corrective actions (with owners and due dates):
- Action 1: Increase ERPC upstream health check timeout from 5s to 30s (Owner: DevOps, Due: 2025-01-30)
- Action 2: Add alerting for ERPC upstream health check failures (Owner: SRE, Due: 2025-02-05)
- Action 3: Document ERPC tuning guidelines in runbook (Owner: Tech Writer, Due: 2025-02-10)
- Action 4: Review all service health check timeouts for appropriate values (Owner: Platform Team, Due: 2025-02-15)
Lessons learned:
- Health check timeouts must account for normal system variability
- Upstream health check failures should trigger alerts before routing impact
- Need better observability into ERPC upstream selection decisions
- Status page updates were timely and clear (positive feedback from users)
Store postmortems in shared documentation repository (Git or Confluence). Reference in future incident response and system design discussions.
External observability forwarding
Send telemetry to external platforms for long-term retention, advanced analysis, or centralized monitoring across multiple deployments.
Configure Alloy to forward metrics, logs, and traces while maintaining local observability:
observability:
alloy:
endpoints:
external:
prometheus:
enabled: true
url: https://prometheus-prod.grafana.net/api/prom/push
basicAuth:
username: "123456"
passwordSecretRef:
name: grafana-cloud-secrets
key: prometheus-password
loki:
enabled: true
url: https://logs-prod.grafana.net/loki/api/v1/push
basicAuth:
username: "123456"
passwordSecretRef:
name: grafana-cloud-secrets
key: loki-password
tempo:
enabled: true
url: https://tempo-prod.grafana.net:443
basicAuth:
username: "123456"
passwordSecretRef:
name: grafana-cloud-secrets
key: tempo-password
Create secret with platform credentials:
kubectl create secret generic grafana-cloud-secrets \
--from-literal=prometheus-password='PROMETHEUS_TOKEN' \
--from-literal=loki-password='LOKI_TOKEN' \
--from-literal=tempo-password='TEMPO_TOKEN' \
-n atk
Alloy forwards all telemetry to both the local VictoriaMetrics/Loki/Tempo stack and the external endpoints. Local observability remains fully operational during external platform outages.
Datadog integration uses OTLP exporter:
observability:
alloy:
endpoints:
external:
otel:
enabled: true
url: https://api.datadoghq.com
headers:
DD-API-KEY: "${DATADOG_API_KEY}"
Uninstallation and cleanup
Complete removal requires deleting the Helm release, the namespace, and PVCs (which may not delete automatically, depending on the reclaim policy).
Uninstall Helm release:
helm uninstall atk -n atk
Delete namespace (removes all resources):
kubectl delete namespace atk
Warning: This permanently deletes all data, including blockchain state, databases, and configuration. Back up critical data before uninstalling.
List remaining PVCs with Retain reclaim policy:
kubectl get pv | grep atk
Manually delete retained PVs:
kubectl delete pv <pv-name>
Cloud provider volumes may require manual deletion through the provider console (AWS EBS volumes, Azure Disks, GCP Persistent Disks).
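For example, on AWS, the EBS volume backing a released PV can be identified from the PV spec and removed with the AWS CLI; a sketch with a placeholder volume ID:
# Read the EBS volume ID from the CSI volume handle, then delete the volume
kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}'
aws ec2 delete-volume --volume-id vol-0123456789abcdef0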
Next steps
- Review Installation & configuration for initial deployment and configuration reference
- See Testing and QA for running tests against deployed environments
- Explore Development FAQ for common deployment issues
- Consult Code Structure to understand monorepo layout
- Review Contract Reference for smart contract interfaces