Auto-Scaling
The Nstance Server automatically reconciles instance groups through an event-driven system to maintain desired capacity.
Group Reconciliation
When elected as shard leader, the Nstance Server continuously reconciles groups so that actual instance counts match desired group sizes.
Reconciliation Triggers:
- Initial Reconciliation: On server startup or when becoming shard leader, all groups are reconciled to desired state
- Group Configuration Changes: When group size, instance type, or vars are updated (via Operator API or config changes)
- Health-Based Replacement: When instances become unhealthy (gRPC disconnect, missed health reports, provider status checks)
- Instance Expiry: When instances exceed configured server-wide age limits (eligibleAge or forcedAge)
- Instance Deletion: When instances are deleted, groups are backfilled to maintain desired size
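The Instance Expiry trigger names two server-wide age limits, eligibleAge and forcedAge. As a minimal sketch, the function below assumes that exceeding eligibleAge makes an instance a candidate for opportunistic expiry and that exceeding forcedAge makes expiry mandatory. That mapping is an interpretation of the names, not something this page confirms, and `expiry_kind` and its parameters are hypothetical.

```python
from typing import Optional


def expiry_kind(age_seconds: float, eligible_age: float, forced_age: float) -> Optional[str]:
    """Classify an instance by age (hypothetical helper, not the Nstance API).

    Assumes eligible_age <= forced_age, both expressed in seconds.
    """
    if age_seconds > forced_age:
        return "forced"          # assumed: must be rotated regardless of convenience
    if age_seconds > eligible_age:
        return "opportunistic"   # assumed: may be rotated when convenient
    return None                  # within limits, no expiry trigger


# Example: eligible after 1 day, forced after 7 days.
DAY = 24 * 60 * 60
kinds = [expiry_kind(d * DAY, 1 * DAY, 7 * DAY) for d in (0.5, 3, 8)]
```

In this sketch an instance exactly at a limit has not yet exceeded it, which follows the wording "exceed" in the trigger description.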
Reconciliation Logic:
- Instance Counting: Only counts managed instances created by the reconciler; on-demand instances created via the Operator API are excluded from reconciliation decisions
- Scale Up: If actual < desired, create new instances (rate-limited, with subnet capacity checking)
- Scale Down: If actual > desired, delete oldest managed instances (waits for unhealthy instances to be replaced first)
- Unhealthy Replacement: Unhealthy managed instances are automatically replaced to maintain group health and size
- Instance Expiry: Instances exceeding age limits are expired with replacement, following drain coordination (see Instance Expiry)
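The counting, scale-up, and scale-down rules above can be sketched as a small planning function. This is an illustration under stated assumptions, not the server's implementation: the `Instance` type and `plan_group` function are hypothetical, and rate limiting, subnet capacity checks, and drain coordination are left out.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    id: str
    created_at: float      # Unix timestamp; a smaller value means an older instance
    managed: bool = True   # False for on-demand instances created via the Operator API
    healthy: bool = True


def plan_group(instances: list[Instance], desired: int) -> dict:
    """Return a hypothetical plan: how many to create, which IDs to delete or replace."""
    # Only managed instances count toward the group's actual size.
    managed = [i for i in instances if i.managed]
    unhealthy = [i for i in managed if not i.healthy]
    actual = len(managed)

    plan = {"create": 0, "delete": [], "replace": [i.id for i in unhealthy]}
    if actual < desired:
        plan["create"] = desired - actual
    elif actual > desired and not unhealthy:
        # Scale down waits until unhealthy instances have been replaced,
        # then removes the oldest managed instances first.
        oldest_first = sorted(managed, key=lambda i: i.created_at)
        plan["delete"] = [i.id for i in oldest_first[: actual - desired]]
    return plan


fleet = [
    Instance("a", 100.0),
    Instance("b", 200.0),
    Instance("c", 300.0),
    Instance("on-demand", 50.0, managed=False),  # ignored by the reconciler
]
```

With this fleet, `plan_group(fleet, 5)` asks for two new instances, and `plan_group(fleet, 2)` selects `a` for deletion because it is the oldest managed instance. The on-demand instance never influences either decision.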
Priority Order:
Reconciliation operations are prioritized as follows:
1. Scale Down (reduce group size)
2. Forced Expiry (compliance requirements)
3. Unhealthy Replacement (maintain health)
4. Opportunistic Expiry (routine rotation)
5. Scale Up (increase group size)
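One simple way to encode this ordering is an integer enum whose values follow the list, so that sorting pending work yields the processing order. The sketch assumes the list runs from highest to lowest priority; the `Op` enum and `order_operations` are hypothetical names.

```python
from enum import IntEnum


class Op(IntEnum):
    # Lower value is handled earlier, following the order listed above.
    SCALE_DOWN = 1
    FORCED_EXPIRY = 2
    UNHEALTHY_REPLACEMENT = 3
    OPPORTUNISTIC_EXPIRY = 4
    SCALE_UP = 5


def order_operations(pending: list[Op]) -> list[Op]:
    """Sort pending operations so that higher-priority work comes first."""
    return sorted(pending)


ordered = order_operations([Op.SCALE_UP, Op.OPPORTUNISTIC_EXPIRY, Op.SCALE_DOWN])
```

Because `IntEnum` members compare as integers, no custom sort key is needed.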
Dynamic Groups Storage
- Static groups are defined in config/{shard}.jsonc (enabling restricted editing for those groups)
- Dynamic groups (created via Operator API) are stored in groups/{shard}.jsonc (and have unrestricted editing)
- Dynamic groups override static groups by key, but only unrestricted fields (e.g. size, instance_type, vars) can be changed
- Restricted fields (e.g. template, subnet_pool) from static groups cannot be overridden, preventing breaking changes to critical groups
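The override rule can be illustrated with a small merge over plain dictionaries. This is a sketch under assumptions: the field set contains only the examples named above (the full list is not given here), `merge_groups` is a hypothetical helper, and silently ignoring a restricted override is one possible behavior. The server might reject such a change instead.

```python
# Example unrestricted fields taken from the text; the real list may be longer.
UNRESTRICTED_FIELDS = {"size", "instance_type", "vars"}


def merge_groups(static: dict, dynamic: dict) -> dict:
    """Overlay dynamic groups on static ones, keyed by group name."""
    # Copy each static group so the inputs are not mutated.
    merged = {key: dict(group) for key, group in static.items()}
    for key, override in dynamic.items():
        if key not in merged:
            # A purely dynamic group has no static counterpart: all fields apply.
            merged[key] = dict(override)
            continue
        for field, value in override.items():
            if field in UNRESTRICTED_FIELDS:
                merged[key][field] = value
            # Otherwise the static value is kept (assumed behavior).
    return merged


static = {"web": {"size": 3, "instance_type": "small", "template": "base", "subnet_pool": "a"}}
dynamic = {
    "web": {"size": 5, "template": "other"},   # template is restricted for a static group
    "batch": {"size": 1, "template": "job"},   # dynamic-only group
}
merged = merge_groups(static, dynamic)
```

Here `web` picks up the new size of 5 but keeps its static template, while `batch` is taken as written because no static group shares its key.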
Group Deletion
When a group is deleted (removed from config or via Operator API):
- The reconciler gracefully scales down all managed instances to 0
- Drain coordination is followed for each instance (if drain_timeout > 0)
- This ensures clean shutdown rather than immediate termination
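The deletion flow can be pictured as a per-instance sequence of steps, with a drain step included only when drain_timeout is positive. The function name and the step labels below are hypothetical; how the server actually signals and awaits a drain is not described in this section.

```python
def plan_group_deletion(instance_ids: list[str], drain_timeout: int) -> list[tuple[str, str]]:
    """Return (action, instance_id) steps for scaling a deleted group to 0."""
    steps = []
    for instance_id in instance_ids:
        if drain_timeout > 0:
            # Give the instance a chance to shut down cleanly first.
            steps.append(("drain", instance_id))
        steps.append(("delete", instance_id))
    return steps


with_drain = plan_group_deletion(["i-1", "i-2"], drain_timeout=30)
without_drain = plan_group_deletion(["i-1", "i-2"], drain_timeout=0)
```

With a timeout of 30, each instance is drained before it is deleted. With a timeout of 0, the plan contains only the delete steps.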