SkyPilot

Run, scale, and manage AI workloads on any cloud

SkyPilot provides a unified, code‑first interface to launch, orchestrate, and auto‑scale GPU, TPU, or CPU jobs across Kubernetes clusters and 16+ cloud providers, cutting costs with spot instances and auto‑stop.

Overview

SkyPilot lets AI teams launch and manage GPU, TPU, or CPU jobs with a single YAML or Python definition. The same task file can run on Kubernetes, AWS, GCP, Azure, and over a dozen other clouds, eliminating vendor lock‑in.
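
A complete task definition can be just a few lines. The sketch below is illustrative, with a placeholder training script; `resources`, `setup`, and `run` are the core fields of SkyPilot's task YAML:

```yaml
# task.yaml -- minimal illustrative SkyPilot task.
resources:
  accelerators: A100:1   # ask for one A100; SkyPilot finds a cloud or cluster that has it

setup: |
  # Runs once when the machine is provisioned.
  pip install -r requirements.txt

run: |
  # The workload itself; train.py is a placeholder.
  python train.py
```

Running `sky launch task.yaml` executes the same file unchanged on Kubernetes or any enabled cloud.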

Features

The platform automatically selects the cheapest available resources, supports spot instances with automatic recovery from preemptions, and shuts down idle machines to save cost. Built‑in gang scheduling and multi‑cluster scaling let you run large‑scale LLM training or RL experiments without manual orchestration. Developers also get a local‑dev experience on Kubernetes, including SSH access, code sync, and IDE integration.

Deployments start with a simple `pip install` and a `sky launch` command, after which SkyPilot handles provisioning, monitoring, and cleanup across the chosen infrastructure.
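
Most of those cost and scheduling features are declarative. A hedged sketch, assuming SkyPilot's documented `use_spot` and `num_nodes` fields and its `-i`/`--down` CLI flags (cluster name and script are placeholders, and extras/versions may vary):

```yaml
# spot-train.yaml -- illustrative sketch of the cost-saving knobs.
# Install and launch:
#   pip install -U "skypilot[aws,gcp,kubernetes]"
#   sky launch -c train -i 10 --down spot-train.yaml   # auto-stop after 10 idle minutes
resources:
  accelerators: A100:8
  use_spot: true      # run on spot instances; cheaper, recoverable on preemption

num_nodes: 4          # gang-scheduled: all four nodes are provisioned together

run: |
  python train.py --distributed   # placeholder entry point
```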

Highlights

  • Unified YAML/Python API works across 16+ clouds and Kubernetes
  • Automatic cheapest-instance selection with spot support and auto-recovery
  • Built-in gang scheduling, multi-cluster scaling, and auto-stop for idle resources
  • Local development experience: SSH into pods, code sync, IDE integration

Pros

  • Reduces vendor lock-in with a single control plane
  • Simplifies AI job orchestration for both developers and ops
  • Achieves significant cost savings via spot instances and auto-stop
  • Supports GPUs, TPUs, and CPUs without code changes

Considerations

  • Requires familiarity with YAML/Python task definitions
  • Advanced features may need cloud-specific credential setup
  • Performance depends on underlying provider's availability
  • Debugging distributed jobs can be complex without Kubernetes expertise

Managed products teams compare with

When teams consider SkyPilot, these hosted platforms usually appear on the same shortlist.

Amazon SageMaker

Fully managed machine learning service to build, train, and deploy ML models at scale

Anyscale

Ray-powered platform for scalable LLM training and inference.

BentoML

Open-source model serving framework to ship AI applications.


Fit guide

Great for

  • AI research teams needing rapid multi-cloud training
  • ML Ops groups managing shared GPU clusters
  • Startups wanting cost-effective LLM finetuning
  • Enterprises seeking a unified interface for heterogeneous compute

Not ideal when

  • Small projects that run on a single local machine
  • Teams without access to cloud credentials or spot markets
  • Users preferring GUI-only workflow
  • Workloads requiring deep custom kernel modifications

How teams use it

Finetune Llama 2 on a multi-cloud GPU pool

Trains the model in half the time while cutting cloud spend by 60% using spot instances.
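
A hedged sketch of such a task (model name, bucket, and script flags are hypothetical; checkpointing to durable storage is what makes spot preemptions survivable):

```yaml
# finetune.yaml -- illustrative sketch, not a verified recipe.
resources:
  accelerators: A100-80GB:8
  use_spot: true                   # spot GPUs are where the cost savings come from

file_mounts:
  /ckpts: s3://my-ckpt-bucket      # hypothetical bucket; checkpoints outlive preemptions

run: |
  python finetune.py \
    --model meta-llama/Llama-2-7b-hf \
    --save-dir /ckpts --resume-from /ckpts   # placeholder flags
```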

Serve GPT-OSS 120B model with auto-scaling

Provides low-latency inference across clusters, automatically adding GPUs during traffic spikes and releasing them when idle.
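
Autoscaled serving is handled by SkyPilot's SkyServe layer, configured through a `service` section. A minimal sketch, assuming SkyServe's documented `replica_policy` fields; the serving entry point, port, and thresholds are illustrative:

```yaml
# serve.yaml -- illustrative SkyServe sketch.
service:
  readiness_probe: /v1/models       # polled before a replica receives traffic
  replica_policy:
    min_replicas: 1
    max_replicas: 8                 # add GPUs during spikes, release them after
    target_qps_per_replica: 2       # the autoscaling signal

resources:
  accelerators: A100-80GB:8
  ports: 8000

run: |
  python -m my_inference_server --port 8000   # hypothetical serving entry point
```

Deployed with `sky serve up serve.yaml`, replicas sit behind a single endpoint even when they land on different clouds or clusters.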

Run RL-based LLM training with PPO on Kubernetes

Orchestrates distributed PPO jobs, handling preemptions and ensuring reproducible results.
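
Distribution is coordinated through environment variables that SkyPilot injects into every node of the gang. A sketch using those documented variables; the PPO trainer script itself is a placeholder:

```yaml
# ppo.yaml -- illustrative multi-node sketch.
num_nodes: 2

resources:
  accelerators: H100:8

run: |
  # SKYPILOT_NODE_IPS, SKYPILOT_NODE_RANK, etc. are set by SkyPilot on each node.
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun --nnodes=$SKYPILOT_NUM_NODES \
           --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
           --node_rank=$SKYPILOT_NODE_RANK \
           --master_addr=$MASTER_ADDR --master_port=29500 \
           train_ppo.py   # placeholder trainer
```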

Deploy Retrieval-Augmented Generation pipeline on hybrid infra

Synchronizes code and data, launches the RAG service on the cheapest available GPUs, and shuts down resources after completion.
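
A sketch of the moving parts (bucket, script, and GPU choice are placeholders; `workdir` syncs local code and `file_mounts` attaches data):

```yaml
# rag.yaml -- illustrative sketch.
workdir: .                        # local project directory, synced to the remote

file_mounts:
  /corpus: s3://my-rag-corpus     # hypothetical bucket holding the document index

resources:
  accelerators: L4:1              # a modest GPU; SkyPilot picks the cheapest source

run: |
  python serve_rag.py --index /corpus   # placeholder entry point
```

Launching with an autostop flag (e.g. `sky launch -i 10 --down rag.yaml`) releases the resources once the job is done.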

Tech snapshot

Python 88%
JavaScript 8%
Jinja 2%
Shell 1%
HTML 1%
Go 1%

Tags

ml-infrastructure, tpu, cost-management, distributed-training, cloud-computing, multicloud, machine-learning, llm-serving, hyperparameter-tuning, cost-optimization, job-queue, gpu, cloud-management, job-scheduler, deep-learning, ml-platform, spot-instances, finops, data-science, llm-training

Frequently asked questions

How does SkyPilot choose which cloud provider to use?

It evaluates cost, GPU availability, and user-specified constraints, then automatically provisions the cheapest suitable instance, falling back to others if needed.

Can I run jobs on my on-premise Kubernetes cluster?

Yes. SkyPilot supports any Kubernetes cluster, allowing you to run tasks on it alongside your cloud resources.

What is required to enable spot-instance savings?

Provide credentials for the target cloud and set `use_spot: true` in the resource spec; SkyPilot handles preemption and auto-recovery.
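
Concretely, in the task YAML this is a single field (a minimal sketch):

```yaml
resources:
  accelerators: A100:4
  use_spot: true   # run on spot capacity; preempted jobs are recovered automatically
```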

How does auto-stop work?

SkyPilot monitors job activity and shuts down idle VMs or pods after a configurable idle timeout, preventing unnecessary charges.

Is there a GUI for monitoring jobs?

Currently SkyPilot is CLI-driven; job status can be inspected via `sky status` and logs are streamed to the console.

Project at a glance

Status: Active
Stars: 9,311
Watchers: 9,311
Forks: 917
License: Apache-2.0
Repo age: 4 years
Last commit: yesterday
Primary language: Python

Last synced yesterday