Spot Instances
The Nstance Agent detects and handles spot instance termination notices across multiple cloud providers. It identifies the cloud provider, polls that provider’s metadata endpoint for termination notices, and normalizes the result into a common format for health reports.
Provider Detection
The agent determines the cloud provider through:
- Environment Variable: NSTANCE_PROVIDER can explicitly set the provider (aws, gcp, proxmox)
- Auto-Detection: Falls back to checking provider-specific metadata endpoints
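The precedence described above (explicit variable first, probing second) could be sketched as follows. This is a minimal illustration, not the Agent's actual code: the function name detect_provider and the probes mapping are hypothetical, and rejecting unknown values is an assumption about how invalid input might be handled.

```python
import os

KNOWN_PROVIDERS = ("aws", "gcp", "proxmox")


def detect_provider(env=None, probes=None):
    """Return the provider name, preferring NSTANCE_PROVIDER over probing.

    `probes` is a hypothetical mapping of provider name to a callable that
    returns True when that provider's metadata endpoint answers.
    """
    env = os.environ if env is None else env
    explicit = env.get("NSTANCE_PROVIDER", "").strip()
    if explicit:
        # Assumption: an unrecognized value is treated as a configuration error.
        if explicit not in KNOWN_PROVIDERS:
            raise ValueError(f"unsupported provider: {explicit!r}")
        return explicit
    # Fallback: try each provider-specific probe in insertion order.
    for name, probe in (probes or {}).items():
        if probe():
            return name
    return None
```

Passing the environment and probes as arguments keeps the sketch testable without a real metadata service.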
Spot Instance Detection
On startup, the Nstance Agent automatically detects if it is running on a spot/preemptible instance by querying the detected cloud provider’s metadata service.
AWS Support:
- Query metadata endpoint: http://169.254.169.254/latest/meta-data/instance-life-cycle
- If response is spot, enable spot termination monitoring
- Auto-detected by default when running on AWS
Google Cloud Support:
- Query metadata endpoint: http://metadata.google.internal/computeMetadata/v1/instance/scheduling/preemptible
- If response is TRUE, enable preemptible instance monitoring
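As a rough sketch of the startup check, the snippet below separates fetching a metadata value from interpreting it. The helper names fetch_metadata and is_spot are hypothetical. The Google Cloud metadata server requires the Metadata-Flavor: Google request header; AWS instances that enforce IMDSv2 additionally require a session token, which this sketch omits.

```python
import urllib.request

AWS_LIFECYCLE_URL = "http://169.254.169.254/latest/meta-data/instance-life-cycle"
GCP_PREEMPTIBLE_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/scheduling/preemptible"
)
GCP_HEADERS = {"Metadata-Flavor": "Google"}


def fetch_metadata(url, headers=None, timeout=2.0):
    """Fetch a metadata value as text (only works on a cloud instance)."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def is_spot(provider, body):
    """Interpret the metadata response body for the given provider."""
    value = body.strip()
    if provider == "aws":
        return value == "spot"
    if provider == "gcp":
        return value.upper() == "TRUE"
    # Proxmox VE has no spot/preemptible concept in this document.
    return False
```

On AWS this might be used as is_spot("aws", fetch_metadata(AWS_LIFECYCLE_URL)), and on Google Cloud as is_spot("gcp", fetch_metadata(GCP_PREEMPTIBLE_URL, GCP_HEADERS)).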
Spot Termination Monitoring
When running on a spot instance, the Agent starts a background monitor that polls for termination notices using provider-specific endpoints:
Poll Configuration:
- Default Interval: 2 seconds (configurable via NSTANCE_SPOT_POLL_INTERVAL)
- Provider-Specific Endpoints:
  - AWS: http://169.254.169.254/latest/meta-data/spot/instance-action
  - Google Cloud: http://metadata.google.internal/computeMetadata/v1/instance/preempted
  - Proxmox VE: n/a
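The background monitor could look roughly like the loop below. It is a sketch under stated assumptions: this document does not specify the format of NSTANCE_SPOT_POLL_INTERVAL, so the code assumes a plain number of seconds and falls back to the 2-second default on anything it cannot parse. The names poll_interval, monitor, and check are hypothetical.

```python
import os
import time

DEFAULT_INTERVAL_SECONDS = 2.0


def poll_interval(env=None):
    """Read the poll interval, assuming the variable holds seconds as a number."""
    env = os.environ if env is None else env
    raw = env.get("NSTANCE_SPOT_POLL_INTERVAL", "")
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_INTERVAL_SECONDS
    return value if value > 0 else DEFAULT_INTERVAL_SECONDS


def monitor(check, interval, sleep=time.sleep, max_polls=None):
    """Call `check` until it returns a notice, sleeping between polls.

    `check` is a hypothetical callable returning a notice dict or None.
    `max_polls` bounds the loop for testing; None means poll until a notice.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        notice = check()
        if notice is not None:
            return notice
        polls += 1
        sleep(interval)
    return None
```

Injecting sleep and check keeps the loop independent of the real clock and of any particular provider endpoint.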
Detection Logic:
AWS:
- 404 response = no termination notice (normal)
- 200 response = termination notice detected
- Other errors = logged but don’t trigger termination handling
Google Cloud:
- Checks preempted metadata during graceful shutdown window
- Returns termination notice when TRUE, but no deadline provided
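The detection rules above map naturally onto a small classification step. The following sketch uses hypothetical names (PollResult, classify_aws, classify_gcp); treating a non-200 Google Cloud response as an error is an assumption, since the rules above only spell out status codes for AWS.

```python
from enum import Enum


class PollResult(Enum):
    NO_NOTICE = "no_notice"  # normal operation
    NOTICE = "notice"        # termination notice detected
    ERROR = "error"          # logged, does not trigger termination handling


def classify_aws(status_code):
    """Apply the AWS rules: 404 is normal, 200 is a notice, the rest are errors."""
    if status_code == 404:
        return PollResult.NO_NOTICE
    if status_code == 200:
        return PollResult.NOTICE
    return PollResult.ERROR


def classify_gcp(status_code, body):
    """Apply the Google Cloud rule: a body of TRUE means the instance is preempted."""
    if status_code != 200:
        # Assumption: any non-200 response is treated like an AWS "other error".
        return PollResult.ERROR
    if body.strip().upper() == "TRUE":
        return PollResult.NOTICE
    return PollResult.NO_NOTICE
```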
Termination Notice Format (AWS):
{
  "action": "terminate",
  "time": "2024-01-01T12:00:00Z"
}
Note: Each provider has different metadata formats and termination notice mechanisms. The Agent normalizes these into a common termination_notice structure for health reports. The deadline field is optional: AWS provides a termination time (typically 2 minutes notice), but GCP does not provide a specific deadline, because on GCP preemption is only known after termination has started.
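One way to picture the normalization step is a pair of small converters that produce the common termination_notice shape. This is an illustrative sketch: the function names are hypothetical, and using "terminate" as the action for a Google Cloud preemption is an assumption, since the preempted metadata value carries no action of its own.

```python
import json


def normalize_aws(body):
    """Convert the AWS instance-action JSON into a termination_notice dict.

    Returns None when the body is not valid JSON or lacks an action.
    """
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or "action" not in data:
        return None
    notice = {"action": data["action"]}
    if data.get("time"):
        # AWS "time" becomes the optional "deadline" field.
        notice["deadline"] = data["time"]
    return notice


def normalize_gcp(body):
    """Convert the GCP preempted value into a termination_notice dict."""
    if body.strip().upper() == "TRUE":
        # Assumed action; GCP supplies no deadline.
        return {"action": "terminate"}
    return None
```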
Health Report Integration
When a spot termination notice is detected:
- Parse Notice: Extract termination time (if available) and action from metadata response
- Include in Health Reports: Add termination_notice field to all subsequent health reports
- Continue Reporting: Agent continues sending health reports until actual termination occurs
Example Health Report with Termination Notice:
{
  "version": "1.0.0",
  "instance_id": "knc0000000001r010000000000000",
  "timestamp": "2024-01-01T11:58:00Z",
  "termination_notice": {
    "action": "terminate",
    "deadline": "2024-01-01T12:00:00Z"
  },
  "one_minute": {
    "cpu_usage": 15.2,
    "memory_usage": 45.8
  }
}
Note: The deadline field is optional. AWS provides a termination time (typically 2 minutes notice), but GCP does not provide a specific deadline.
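The "add to all subsequent health reports" behavior can be sketched as a reporter that remembers the first notice it sees. The class HealthReporter and its methods are hypothetical names used only for illustration; the field names follow the example report above.

```python
class HealthReporter:
    """Builds health reports and attaches a termination notice once one is seen."""

    def __init__(self, instance_id, version="1.0.0"):
        self.instance_id = instance_id
        self.version = version
        self._notice = None

    def record_notice(self, notice):
        # Keep the first notice; later detections of the same event are no-ops.
        if self._notice is None:
            self._notice = dict(notice)

    def build(self, timestamp, one_minute):
        report = {
            "version": self.version,
            "instance_id": self.instance_id,
            "timestamp": timestamp,
        }
        if self._notice is not None:
            report["termination_notice"] = dict(self._notice)
        report["one_minute"] = dict(one_minute)
        return report
```

Reports built before record_notice is called carry no termination_notice field; every report built afterwards includes it.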
Server-Side Termination Handling
When a health report contains a spot termination notice:
- Termination Notice Received: Health report includes termination_notice field with action and optional deadline
- Mark as Terminating: Instance marked as terminating in SQLite (similar to unhealthy instances)
- Schedule Replacement: Immediately trigger reconciliation to create replacement instance
- Initiate Drain: If group has drain_timeout > 0, follow standard drain coordination flow
- Cleanup: After drain completion or timeout, instance is deleted (or allowed to terminate naturally by cloud provider)
Termination Notice Structure:
- action: Action type (e.g., “terminate”, “stop”, “hibernate”)
- deadline: Optional timestamp when cloud provider will terminate the instance (AWS provides this with typically 2 minutes notice, GCP does not)
Processing Logic:
- Server treats spot termination notices similarly to unhealthy instance detection
- Replacement instances are created immediately upon first notice (don’t wait for actual termination)
- Drain coordination reuses existing infrastructure (drain_started_at, WatchInstanceEvents, AcknowledgeDrained)
- Multiple health reports with the same termination notice are handled idempotently
- If leader changes during spot termination, new leader may re-notify Operator (handled idempotently)
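The idempotency requirement can be illustrated with a toy in-memory handler. This is only a sketch: the real Server persists state in SQLite and coordinates drains through its existing infrastructure, whereas the class TerminationHandler and its attributes below are hypothetical stand-ins.

```python
class TerminationHandler:
    """Toy in-memory model of server-side notice handling."""

    def __init__(self):
        self.terminating = set()          # stands in for the SQLite state
        self.replacements_scheduled = []  # instances with a replacement requested
        self.drains_started = []          # instances with drain coordination begun

    def handle_report(self, report, drain_timeout=0):
        """Process one health report; return True only on the first notice."""
        if report.get("termination_notice") is None:
            return False
        instance_id = report["instance_id"]
        if instance_id in self.terminating:
            # Duplicate notice: already handled, nothing more to do.
            return False
        self.terminating.add(instance_id)
        # Schedule the replacement right away rather than waiting for termination.
        self.replacements_scheduled.append(instance_id)
        if drain_timeout > 0:
            self.drains_started.append(instance_id)
        return True
```

Keying the check on the instance ID means repeated reports, or a re-notification after a leader change, schedule at most one replacement per instance.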
Spot Termination Workflow
From Agent Perspective:
- Normal Operation: Agent runs normally, polling spot metadata every 2 seconds
- Notice Detected: Metadata endpoint returns termination notice (typically 2 minutes before termination)
- Report to Server: Include termination notice in health report
- Continue Operation: Agent continues normal operation and health reporting
- Await Termination: The cloud provider terminates the instance (on AWS, at the time specified in the notice)
Server Coordination:
The Server handles the termination notice by:
- Immediately scheduling a replacement instance
- Triggering drain coordination with the Operator (for Kubernetes nodes)
- Waiting for drain completion or timeout before allowing termination
Error Handling
Metadata Polling Errors:
- Transient network errors are logged but don’t affect normal operation
- Only a 200 response with valid termination data triggers termination handling
- Invalid JSON responses are logged as errors and ignored
Multiple Detections:
- Once a termination notice is detected, it persists in all subsequent health reports
- Server handles duplicate termination notices idempotently (similar to drain coordination)