Cost-Efficient Training and Checkpointing for Large Models on Preemptible Cloud VMs
Conference
Desai, O, Pei, S, Bhimani, J et al. (2026). Cost-Efficient Training and Checkpointing for Large Models on Preemptible Cloud VMs. 31-40. 10.1145/3805621.3807617
Training large models on discounted spot VMs offers significant cost savings but remains challenging because of their low availability and unilateral preemptions. To address these challenges, we present a cost-effective training and checkpointing system for large models on spot VMs. First, we predict the preemption rate of a spot instance from historical preemption data. Second, we dynamically tune the checkpointing interval with a mathematical model that uses the preemption predictions to balance the overheads of checkpointing and recovery. Finally, we guarantee consistent training throughput and minimize training cost through prediction-informed hybrid resource utilization: switching to on-demand instances when spot VM availability is low, and opportunistically scaling up the number of spot VMs used for training when availability is high. Compared with fixed-interval approaches, dynamic checkpoint interval tuning improves training throughput by up to 60.27%. Effective use of spot and on-demand instances also yields up to 2.04× higher throughput at a 51.41% lower cost.
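The abstract does not give the form of the mathematical model, so the sketch below is only an illustration of the trade-off it describes, not the authors' method. It uses the classical first-order approximation of Young, in which the interval that balances checkpoint overhead against expected rework is sqrt(2 × checkpoint cost × mean time between preemptions). The function names, the parameters, and the assumption of a constant preemption rate are all hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, preemption_rate_per_s: float) -> float:
    """Checkpoint interval (seconds) from Young's first-order approximation.

    Assumption: preemptions arrive at a constant predicted rate, so the mean
    time between preemptions is 1 / rate. This is a stand-in for the paper's
    model, which the abstract does not specify.
    """
    if checkpoint_cost_s <= 0 or preemption_rate_per_s <= 0:
        raise ValueError("checkpoint cost and preemption rate must be positive")
    mean_time_between_preemptions_s = 1.0 / preemption_rate_per_s
    return math.sqrt(2.0 * checkpoint_cost_s * mean_time_between_preemptions_s)


def overhead_fraction(
    interval_s: float,
    checkpoint_cost_s: float,
    preemption_rate_per_s: float,
    restart_cost_s: float = 0.0,
) -> float:
    """First-order fraction of wall-clock time lost to checkpointing and recovery.

    checkpoint_cost_s / interval_s is the time spent writing checkpoints.
    Each preemption loses on average half an interval of work plus a fixed
    restart cost, and preemptions occur preemption_rate_per_s times per second.
    """
    checkpointing = checkpoint_cost_s / interval_s
    recovery = preemption_rate_per_s * (interval_s / 2.0 + restart_cost_s)
    return checkpointing + recovery


if __name__ == "__main__":
    # Hypothetical inputs: a 50 s checkpoint and one predicted preemption every 4 hours.
    cost_s = 50.0
    rate_per_s = 1.0 / (4 * 3600)
    best = young_interval(cost_s, rate_per_s)  # roughly 1200 s (20 minutes)
    for interval in (best / 2, best, best * 2):
        loss = overhead_fraction(interval, cost_s, rate_per_s)
        print(f"interval={interval:7.1f}s  overhead={loss:.4f}")
```

With these hypothetical inputs, halving or doubling the interval raises the estimated overhead, which is the sense in which a fixed interval can be suboptimal once the predicted preemption rate changes. Re-evaluating `young_interval` whenever a new rate prediction arrives is one simple way to make the interval dynamic under these assumptions.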