Operational Resilience

How JIL maintains settlement availability under adverse conditions: signing-node fault tolerance, network partition handling, automated recovery, and disaster recovery procedures.

Resilience Philosophy

JIL's operational model prioritizes safety over liveness. The system is designed to halt cleanly under adverse conditions rather than produce incorrect settlements. Resilience means graceful degradation, not infinite availability.

Safety first: Incorrect settlements are categorically worse than delayed settlements
Graceful degradation: Node failures reduce capacity but do not compromise correctness
Automated recovery: SentinelAI triggers fleet-wide recovery when health drops below thresholds
Transparent status: System health is observable, not opaque

Signing-Node Network Topology

The independent signing-node network is distributed across 13+ jurisdictions, providing both geographic and jurisdictional diversity.

Zone	Location	Role	Specifications
Genesis	Nuremberg, DE	Full node (seed)	CPX52 - 16 vCPU, 32 GB RAM
US	Ashburn, US	Full node	CPX42 - 8 vCPU, 16 GB RAM
DE	Nuremberg, DE	Full node	CPX42 - 8 vCPU, 16 GB RAM
EU	Helsinki, FI	Full node	CPX42 - 8 vCPU, 16 GB RAM
SG	Singapore, SG	Full node	CPX42 - 8 vCPU, 16 GB RAM
CH	Zurich, CH	Compact node	CPX32 - 4 vCPU, 8 GB RAM
JP	Tokyo, JP	Compact node	CPX32 - 4 vCPU, 8 GB RAM
GB	London, GB	Compact node	CPX32 - 4 vCPU, 8 GB RAM
AE	Dubai, AE	Compact node	CPX32 - 4 vCPU, 8 GB RAM
BR	Sao Paulo, BR	Compact node	CPX32 - 4 vCPU, 8 GB RAM

Node Failure Tolerance

The independent signing-quorum design tolerates a minority of simultaneous node failures while maintaining settlement capability. Per the table below, settlement continues with up to 6 of 20 nodes offline.

Nodes Online	Status	Settlement Capability
20 of 20	Full capacity	Normal operation, maximum throughput
14-19 of 20	Degraded but operational	Full settlement capability, reduced redundancy
10-13 of 20	Below quorum	Settlement halted, safety preserved. Pending settlements are queued.
Below 10	Critical	Signing halts. No settlements processed. Fleet recovery triggered.

Currently, 10 of the 20 signing-node slots are filled. Additional signing nodes are planned after May 2026.
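The thresholds in the table can be expressed as a small classification function. This is a minimal sketch, not JIL's implementation: the names `FleetStatus`, `classify`, and `can_settle` are hypothetical, and the quorum value of 14 is an assumption inferred from the lowest node count the table lists as operational.

```python
from enum import Enum

TOTAL_SLOTS = 20     # signing-node slots, per the table
QUORUM = 14          # assumption: inferred from the "14-19 of 20" row
CRITICAL_BELOW = 10  # below this count the table marks the fleet as critical


class FleetStatus(Enum):
    FULL = "Full capacity"
    DEGRADED = "Degraded but operational"
    BELOW_QUORUM = "Below quorum"
    CRITICAL = "Critical"


def classify(nodes_online: int) -> FleetStatus:
    """Map a count of online signing nodes to the status rows of the table."""
    if not 0 <= nodes_online <= TOTAL_SLOTS:
        raise ValueError(f"nodes_online must be between 0 and {TOTAL_SLOTS}")
    if nodes_online == TOTAL_SLOTS:
        return FleetStatus.FULL
    if nodes_online >= QUORUM:
        return FleetStatus.DEGRADED
    if nodes_online >= CRITICAL_BELOW:
        return FleetStatus.BELOW_QUORUM
    return FleetStatus.CRITICAL


def can_settle(nodes_online: int) -> bool:
    """Settlement proceeds only at or above quorum; otherwise it halts."""
    return classify(nodes_online) in (FleetStatus.FULL, FleetStatus.DEGRADED)
```

Keeping the settlement decision as a pure function of the node count reflects the safety-first stance: anything below quorum returns `False`, so the caller queues the settlement instead of signing it.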

Network Partition Handling

Network partitions - where groups of signing nodes cannot communicate with each other - are handled according to the CAP theorem tradeoff: JIL chooses consistency (safety) over availability.

Minority Partition

If a partition holds fewer signing nodes than the quorum requires, that partition cannot reach quorum. Settlements halt in the minority partition. No incorrect settlements are produced.

Majority Partition

The partition holding the signing quorum continues processing. The minority partition halts. When connectivity is restored, the minority partition synchronizes from the majority.

Even Split

If the network splits evenly (10/10), neither partition can reach quorum. All settlement processing halts until connectivity is restored. This is the safest outcome.

Recovery

When partitions heal, signing nodes sync state and resume normal signing. Queued settlements are processed in order. No manual intervention is required for standard partitions.
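The minority, majority, and even-split cases all follow from one rule: a partition signs only if it holds a quorum on its own. The sketch below illustrates that rule. The function name `partition_outcomes` is hypothetical, and the quorum of 14 is again an assumption inferred from the failure-tolerance table.

```python
QUORUM = 14  # assumption: inferred from the failure-tolerance table


def partition_outcomes(partition_sizes: list[int], quorum: int = QUORUM) -> list[str]:
    """Return "continues" or "halts" for each partition, in input order.

    Because the quorum is more than half of the 20 slots, at most one
    partition can ever satisfy it, so two partitions never sign at once.
    """
    return ["continues" if size >= quorum else "halts" for size in partition_sizes]


# Majority/minority split: only the side holding the quorum keeps signing.
print(partition_outcomes([15, 5]))
# Even split: neither side reaches quorum, so all settlement halts.
print(partition_outcomes([10, 10]))
```

Note that under this assumed quorum a simple majority is not sufficient: a 12/8 split halts both sides, which matches the "Below quorum" row of the table.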

Infrastructure Design

Multi-data-center: Signing nodes run on Hetzner infrastructure across multiple data centers - no single data center failure takes out the network
Stateless application layer: Service containers are stateless and can be replaced without data loss - state lives in PostgreSQL and Kafka
Data persistence: PostgreSQL data on NVMe SSDs with automated backup schedules
Image distribution: Docker images distributed via JILHQ registry with digest verification - corrupted images are rejected
DNS redundancy: Cloudflare provides DNS with anycast routing and DDoS protection
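Digest verification, as mentioned under image distribution, amounts to recomputing a content hash and comparing it with the expected value. The following is a generic sketch of that check, not the JILHQ registry's code: the function name `verify_digest` is hypothetical, and the `sha256:<hex>` digest format is an assumption based on the convention common to container registries.

```python
import hashlib
import hmac


def verify_digest(blob: bytes, expected_digest: str) -> bool:
    """Return True only if blob hashes to the expected "sha256:<hex>" digest."""
    algorithm, _, expected_hex = expected_digest.partition(":")
    if algorithm != "sha256" or not expected_hex:
        return False  # unknown or malformed digest: reject rather than guess
    actual_hex = hashlib.sha256(blob).hexdigest()
    # Comparison that does not leak how many leading characters matched.
    return hmac.compare_digest(actual_hex.encode(), expected_hex.lower().encode())


blob = b"example image layer"
digest = "sha256:" + hashlib.sha256(blob).hexdigest()
print(verify_digest(blob, digest))         # intact content is accepted
print(verify_digest(blob + b"x", digest))  # altered content is rejected
```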

Monitoring Architecture

SentinelAI Fleet Inspector

Automated monitoring system that continuously evaluates signing-node fleet health.

Health checks every 60 seconds across all signing nodes
Threat scoring based on heartbeat patterns, resource usage, and attestation behavior
Automated fleet cycle when fleet health stays below 30% for 5 consecutive health checks
Anti-loop protection: max 3 fleet cycles per 2 hours, max 2 failed cycles per node per 2 hours
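The trigger and anti-loop rules above can be sketched as a small gate object. This is an illustration under stated assumptions, not SentinelAI's code: the class name `FleetCycleGate` is hypothetical, "consecutive cycles" is read as consecutive health evaluations, and the per-node limit of 2 failed cycles is not modeled.

```python
from collections import deque

HEALTH_THRESHOLD = 0.30       # fleet health below 30%
CONSECUTIVE_CHECKS = 5        # assumption: consecutive health evaluations
MAX_CYCLES = 3                # fleet cycles allowed per window
WINDOW_SECONDS = 2 * 60 * 60  # 2 hours


class FleetCycleGate:
    """Decides when a fleet cycle may start, with anti-loop protection."""

    def __init__(self) -> None:
        self._unhealthy_streak = 0
        self._cycle_times = deque()  # start times of recent fleet cycles

    def observe(self, fleet_health: float, now: float) -> bool:
        """Record one health evaluation; return True if a fleet cycle should start."""
        if fleet_health < HEALTH_THRESHOLD:
            self._unhealthy_streak += 1
        else:
            self._unhealthy_streak = 0  # any healthy reading resets the streak

        if self._unhealthy_streak < CONSECUTIVE_CHECKS:
            return False

        # Forget cycles that started more than one window ago.
        while self._cycle_times and now - self._cycle_times[0] >= WINDOW_SECONDS:
            self._cycle_times.popleft()

        if len(self._cycle_times) >= MAX_CYCLES:
            return False  # anti-loop: cycle budget for this window is spent

        self._cycle_times.append(now)
        self._unhealthy_streak = 0
        return True
```

Calling `observe` once per 60-second health check with a persistently unhealthy fleet would start a cycle on the 5th, 10th, and 15th checks, then refuse further cycles until the oldest one ages out of the 2-hour window.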

Metrics and Alerting

Prometheus metrics collection from all services
Settlement latency, throughput, and error rate tracking
Signing-node uptime and signing participation monitoring
Bridge balance reconciliation and deposit confirmation tracking

Incident Response

Severity	Description	Response Time	Actions
Critical	Signing failure, security breach, data corruption	Immediate	Halt settlements, isolate affected systems, forensic investigation
High	Signing-node outage (3+), bridge anomaly, performance degradation	Under 15 min	Automated recovery attempt, manual investigation if auto-recovery fails
Medium	Single node failure, elevated error rates, monitoring gaps	Under 1 hour	Node restart, log analysis, root cause investigation
Low	Performance warning, configuration drift, non-critical service issue	Next business day	Scheduled maintenance, configuration update

Disaster Recovery

Golden snapshots: Known-good signing-node state backed up to Hetzner S3. Recovery from snapshot restores last known healthy state.
Database backup: PostgreSQL continuous WAL archiving with point-in-time recovery capability
Image registry backup: JILHQ registry backed up to S3 daily at 04:00 UTC via systemd timer
Configuration recovery: All configuration stored in version control. Fresh node can be provisioned from fleet registry and compose files.
Recovery time objective: Single node recovery under 30 minutes. Full fleet recovery under 2 hours.

Resilience Testing

The following resilience scenarios are regularly tested.

Single signing-node restart during active signing
Multiple simultaneous signing-node failures (chaos testing)
Network partition simulation between signing-node groups
Image registry unavailability during deployment
Database failover and recovery from backup
SentinelAI fleet cycle under various health conditions
