SkyPilot

Run, scale, and manage AI workloads on any cloud

SkyPilot provides a unified, code‑first interface to launch, orchestrate, and auto‑scale GPU, TPU, or CPU jobs across Kubernetes clusters and 16+ cloud providers, cutting costs with spot instances and auto‑stop.

Overview

SkyPilot lets AI teams launch and manage GPU, TPU, or CPU jobs with a single YAML or Python definition. The same task file can run on Kubernetes, AWS, GCP, Azure, and over a dozen other clouds, eliminating vendor lock‑in.
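
A complete task definition can be just a few lines. The sketch below is illustrative, with a placeholder training script; `resources`, `setup`, and `run` are the core fields of SkyPilot's task YAML:

```yaml
# task.yaml -- minimal illustrative SkyPilot task.
resources:
  accelerators: A100:1   # ask for one A100; SkyPilot finds a cloud or cluster that has it

setup: |
  # Runs once when the machine is provisioned.
  pip install -r requirements.txt

run: |
  # The workload itself; train.py is a placeholder.
  python train.py
```

Running `sky launch task.yaml` executes the same file unchanged on Kubernetes or any enabled cloud.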

Features

The platform automatically selects the cheapest available resources, supports spot instances with automatic recovery from preemptions, and shuts down idle machines to save cost. Built‑in gang scheduling and multi‑cluster scaling let you run large‑scale LLM training or RL experiments without manual orchestration. Developers also get a local‑dev experience on Kubernetes, including SSH access, code sync, and IDE integration.

Deployments start with a simple `pip install` and a `sky launch` command, after which SkyPilot handles provisioning, monitoring, and cleanup across the chosen infrastructure.
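
Most of those cost and scheduling features are declarative. A hedged sketch, assuming SkyPilot's documented `use_spot` and `num_nodes` fields and its `-i`/`--down` CLI flags (cluster name and script are placeholders, and extras/versions may vary):

```yaml
# spot-train.yaml -- illustrative sketch of the cost-saving knobs.
# Install and launch:
#   pip install -U "skypilot[aws,gcp,kubernetes]"
#   sky launch -c train -i 10 --down spot-train.yaml   # auto-stop after 10 idle minutes
resources:
  accelerators: A100:8
  use_spot: true      # run on spot instances; cheaper, recoverable on preemption

num_nodes: 4          # gang-scheduled: all four nodes are provisioned together

run: |
  python train.py --distributed   # placeholder entry point
```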

Highlights

  • Unified YAML/Python API works across 16+ clouds and Kubernetes
  • Automatic cheapest-instance selection with spot support and auto-recovery
  • Built-in gang scheduling, multi-cluster scaling, and auto-stop for idle resources
  • Local development experience: SSH into pods, code sync, IDE integration

Pros

  • Reduces vendor lock-in with a single control plane
  • Simplifies AI job orchestration for both developers and ops
  • Achieves significant cost savings via spot instances and auto-stop
  • Supports GPUs, TPUs, and CPUs without code changes

Considerations

  • Requires familiarity with YAML/Python task definitions
  • Advanced features may need cloud-specific credential setup
  • Performance depends on underlying provider's availability
  • Debugging distributed jobs can be complex without Kubernetes expertise

Managed products teams compare with

When teams consider SkyPilot, these hosted platforms usually appear on the same shortlist.

Amazon SageMaker

Fully managed machine learning service to build, train, and deploy ML models at scale

Anyscale

Ray-powered platform for scalable LLM training and inference.

BentoML

Open-source model serving framework to ship AI applications.


Fit guide

Great for

  • AI research teams needing rapid multi-cloud training
  • ML Ops groups managing shared GPU clusters
  • Startups wanting cost-effective LLM finetuning
  • Enterprises seeking a unified interface for heterogeneous compute

Not ideal when

  • Small projects that run on a single local machine
  • Teams without access to cloud credentials or spot markets
  • Users preferring GUI-only workflow
  • Workloads requiring deep custom kernel modifications

How teams use it

Finetune Llama 2 on a multi-cloud GPU pool

Trains the model in half the time while cutting cloud spend by 60% using spot instances.
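
A hedged sketch of such a task (model name, bucket, and script flags are hypothetical; checkpointing to durable storage is what makes spot preemptions survivable):

```yaml
# finetune.yaml -- illustrative sketch, not a verified recipe.
resources:
  accelerators: A100-80GB:8
  use_spot: true                   # spot GPUs are where the cost savings come from

file_mounts:
  /ckpts: s3://my-ckpt-bucket      # hypothetical bucket; checkpoints outlive preemptions

run: |
  python finetune.py \
    --model meta-llama/Llama-2-7b-hf \
    --save-dir /ckpts --resume-from /ckpts   # placeholder flags
```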

Serve GPT-OSS 120B model with auto-scaling

Provides low-latency inference across clusters, automatically adding GPUs during traffic spikes and releasing them when idle.
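
Autoscaled serving is handled by SkyPilot's SkyServe layer, configured through a `service` section. A minimal sketch, assuming SkyServe's documented `replica_policy` fields; the serving entry point, port, and thresholds are illustrative:

```yaml
# serve.yaml -- illustrative SkyServe sketch.
service:
  readiness_probe: /v1/models       # polled before a replica receives traffic
  replica_policy:
    min_replicas: 1
    max_replicas: 8                 # add GPUs during spikes, release them after
    target_qps_per_replica: 2       # the autoscaling signal

resources:
  accelerators: A100-80GB:8
  ports: 8000

run: |
  python -m my_inference_server --port 8000   # hypothetical serving entry point
```

Deployed with `sky serve up serve.yaml`, replicas sit behind a single endpoint even when they land on different clouds or clusters.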

Run RL-based LLM training with PPO on Kubernetes

Orchestrates distributed PPO jobs, handling preemptions and ensuring reproducible results.
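
Distribution is coordinated through environment variables that SkyPilot injects into every node of the gang. A sketch using those documented variables; the PPO trainer script itself is a placeholder:

```yaml
# ppo.yaml -- illustrative multi-node sketch.
num_nodes: 2

resources:
  accelerators: H100:8

run: |
  # SKYPILOT_NODE_IPS, SKYPILOT_NODE_RANK, etc. are set by SkyPilot on each node.
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun --nnodes=$SKYPILOT_NUM_NODES \
           --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
           --node_rank=$SKYPILOT_NODE_RANK \
           --master_addr=$MASTER_ADDR --master_port=29500 \
           train_ppo.py   # placeholder trainer
```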

Deploy Retrieval-Augmented Generation pipeline on hybrid infra

Synchronizes code and data, launches the RAG service on the cheapest available GPUs, and shuts down resources after completion.
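
A sketch of the moving parts (bucket, script, and GPU choice are placeholders; `workdir` syncs local code and `file_mounts` attaches data):

```yaml
# rag.yaml -- illustrative sketch.
workdir: .                        # local project directory, synced to the remote

file_mounts:
  /corpus: s3://my-rag-corpus     # hypothetical bucket holding the document index

resources:
  accelerators: L4:1              # a modest GPU; SkyPilot picks the cheapest source

run: |
  python serve_rag.py --index /corpus   # placeholder entry point
```

Launching with an autostop flag (e.g. `sky launch -i 10 --down rag.yaml`) releases the resources once the job is done.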

Tech snapshot

Python 88%
JavaScript 8%
Jinja 2%
Shell 1%
HTML 1%
Go 1%

Tags

ml-infrastructure, tpu, cost-management, distributed-training, cloud-computing, multicloud, machine-learning, llm-serving, hyperparameter-tuning, cost-optimization, job-queue, gpu, cloud-management, job-scheduler, deep-learning, ml-platform, spot-instances, finops, data-science, llm-training

Frequently asked questions

How does SkyPilot choose which cloud provider to use?

It evaluates cost, GPU availability, and user-specified constraints, then automatically provisions the cheapest suitable instance, falling back to others if needed.

Can I run jobs on my on-premise Kubernetes cluster?

Yes. SkyPilot supports any Kubernetes cluster, allowing you to run tasks on it alongside your cloud resources.

What is required to enable spot-instance savings?

Provide credentials for the target cloud and set `use_spot: true` in the resource spec; SkyPilot handles preemption and auto-recovery.
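
Concretely, in the task YAML this is a single field (a minimal sketch):

```yaml
resources:
  accelerators: A100:4
  use_spot: true   # run on spot capacity; preempted jobs are recovered automatically
```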

How does auto-stop work?

SkyPilot monitors job activity and shuts down idle VMs or pods after a configurable idle timeout, preventing unnecessary charges.

Is there a GUI for monitoring jobs?

Currently SkyPilot is CLI-driven; job status can be inspected via `sky status` and logs are streamed to the console.

Project at a glance

Status: Active
Stars: 9,311
Watchers: 9,311
Forks: 917
License: Apache-2.0
Repo age: 4 years
Last commit: yesterday
Primary language: Python

Last synced yesterday