

Kubernetes-native platform for scalable LLM fine‑tuning and distributed training
Kubeflow Trainer lets data scientists run large‑language‑model fine‑tuning and other ML workloads on Kubernetes, supporting PyTorch, TensorFlow, JAX, HuggingFace, DeepSpeed, and Megatron‑LM via a unified Python SDK.

Kubeflow Trainer is a Kubernetes‑native project that enables large‑language‑model fine‑tuning and general distributed model training. It supports major frameworks such as PyTorch, TensorFlow, and JAX, and integrates libraries such as HuggingFace, DeepSpeed, and Megatron‑LM.
The platform provides custom resource definitions (CRDs) and a Python SDK for building training runtimes. Users can choose among a CustomTrainer, a BuiltinTrainer, or local PyTorch execution. An MPI runtime adds high‑performance computing support, and the project is compatible with the broader Kubeflow ecosystem.
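As a rough sketch of the pattern, the snippet below shows the kind of self-contained training function a CustomTrainer would ship to the cluster, executed directly to mirror the local-execution option. The toy gradient-descent loop stands in for a real PyTorch training loop; `train_fn` and its hyperparameters are illustrative, not SDK API.

```python
# Illustrative training function in the shape a CustomTrainer expects:
# self-contained, so the SDK could package and run it on cluster nodes.
# The toy SGD loop below is a stand-in for a real PyTorch loop.

def train_fn(lr: float = 0.1, steps: int = 50) -> float:
    """Minimize f(w) = (w - 3)^2 with plain gradient descent."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3)   # df/dw
        w -= lr * grad
    return (w - 3) ** 2      # final loss

# Local execution: simply call the function on your own machine.
final_loss = train_fn()
print(f"final loss: {final_loss:.6f}")
```

Handing the same function to a CustomTrainer (instead of calling it directly) is what turns this into a distributed Kubernetes job.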
Training jobs are declared as Kubernetes resources and run on any K8s cluster with GPU support. Although the project is still maturing and APIs may evolve, the v2.0 release brings greater stability and active community backing. Documentation, a Slack community, and bi‑weekly working‑group meetings help teams adopt the solution quickly.
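To make "declared as Kubernetes resources" concrete, here is a sketch of a training job built as a plain Python dict. The field names (`runtimeRef`, `trainer`, `numNodes`) and the API group/version are assumptions modeled on the Trainer CRDs; check the schema of the version you deploy before relying on them.

```python
# Sketch of a TrainJob-style Kubernetes resource as a Python dict.
# Field names and apiVersion are illustrative assumptions, not a
# verified schema.

def make_train_job(name: str, runtime: str, num_nodes: int, image: str) -> dict:
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",  # assumed group/version
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            "runtimeRef": {"name": runtime},  # which training runtime to use
            "trainer": {
                "numNodes": num_nodes,        # worker pods in the job
                "image": image,               # container with the training code
            },
        },
    }

job = make_train_job("llm-finetune", "torch-distributed", 4,
                     "ghcr.io/example/train:latest")
```

Because the job is just a resource, it can be applied with `kubectl`, templated by GitOps tooling, or generated by the SDK.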
When teams consider Kubeflow Trainer, these hosted platforms usually appear on the same shortlist.

ML hub with curated foundation models, pretrained algorithms, and solution templates you can deploy and fine-tune in SageMaker

Enterprise AI platform providing LLMs (Command, Aya) plus Embed/Rerank for retrieval

API-first platform to run, fine-tune, and deploy AI models without managing infrastructure
Fine‑tune a GPT‑style LLM with DeepSpeed on a GPU cluster
Accelerated training completes in hours, leveraging DeepSpeed optimizations and Kubernetes auto‑scaling.
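A fine-tuning job like this typically mounts a DeepSpeed configuration. The sketch below builds a minimal one as a Python dict using DeepSpeed's documented config keys (`zero_optimization`, `fp16`, batch-size settings); the values are illustrative, not a tuned recipe.

```python
# Minimal DeepSpeed-style configuration a fine-tuning job might mount.
# Keys follow DeepSpeed's JSON config; values are illustrative only.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},            # mixed-precision training
    "zero_optimization": {"stage": 2},    # shard optimizer state + gradients
}

print(json.dumps(ds_config, indent=2))
```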
Run distributed TensorFlow training for image classification
Scales across multiple nodes using the TensorFlow operator, simplifying resource allocation via CRDs.
Integrate HuggingFace pipelines into a CI/CD MLOps workflow
Automated model updates are triggered by code changes, with training jobs managed as Kubernetes resources.
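One way a CI step can trigger such an update is to retarget the job manifest at the image built for the new commit, then apply it. The manifest shape and image repository below are illustrative; a real pipeline would run `kubectl apply` (or the SDK) on the result.

```python
# Sketch of a CI/CD step: point a training-job manifest at the image
# built for the current commit. Manifest layout is illustrative.
import copy

def retag_job(manifest: dict, commit_sha: str) -> dict:
    """Return a copy of the job manifest using the image for commit_sha."""
    updated = copy.deepcopy(manifest)
    repo = updated["spec"]["trainer"]["image"].rsplit(":", 1)[0]
    updated["spec"]["trainer"]["image"] = f"{repo}:{commit_sha[:12]}"
    return updated

base = {"spec": {"trainer": {"image": "ghcr.io/example/train:latest"}}}
patched = retag_job(base, "4f2a9c1d77e0beef")
print(patched["spec"]["trainer"]["image"])  # ghcr.io/example/train:4f2a9c1d77e0
```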
Leverage MPI runtime for large‑scale scientific simulations
High‑performance compute jobs execute efficiently on Kubernetes, reducing time‑to‑solution for HPC workloads.
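For sizing such MPI jobs, the total rank count follows the usual launcher convention of nodes × slots per node. The helper below is a trivial illustration of that arithmetic, not part of the Trainer SDK.

```python
# Illustrative MPI sizing arithmetic: total ranks = nodes x slots per node.
# Not a Trainer SDK function; just the standard launcher convention.

def mpi_world_size(num_nodes: int, slots_per_node: int) -> int:
    """Total MPI ranks across the cluster (one rank per slot)."""
    return num_nodes * slots_per_node

# e.g. 8 worker pods with 4 GPUs (slots) each
print(mpi_world_size(8, 4))  # 32
```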
Q: Which Kubernetes versions does Kubeflow Trainer support?
A: Kubeflow Trainer works with any Kubernetes version that supports Custom Resource Definitions and GPU scheduling; the latest stable release is recommended.

Q: Does Kubeflow Trainer support GPU training?
A: Yes, the platform integrates with Kubernetes device plugins to schedule GPUs for training pods.

Q: How does the Python SDK submit training jobs?
A: The SDK generates and applies the appropriate custom resources, handling lifecycle management and status monitoring.

Q: Can Kubeflow Trainer be used with Kubeflow Pipelines?
A: Yes. Trainer jobs can be invoked from Kubeflow Pipelines as steps, allowing end‑to‑end MLOps workflows.

Q: What is the project's roadmap?
A: The project aims to reach beta in the next release cycle, stabilizing APIs and expanding documentation.
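The status monitoring the SDK performs boils down to reading the job's Kubernetes-style `status.conditions` and reducing them to a single phase. The sketch below shows that reduction; the condition type names are illustrative assumptions, not the exact CRD vocabulary.

```python
# Sketch of status monitoring: reduce Kubernetes-style status conditions
# to one phase string. Condition type names are illustrative.

def job_phase(conditions: list[dict]) -> str:
    """Map a job's status conditions to Failed / Succeeded / Running."""
    for cond in conditions:
        if cond.get("type") == "Failed" and cond.get("status") == "True":
            return "Failed"
    for cond in conditions:
        if cond.get("type") == "Complete" and cond.get("status") == "True":
            return "Succeeded"
    return "Running"  # no terminal condition yet

print(job_phase([{"type": "Complete", "status": "True"}]))  # Succeeded
```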
Project at a glance
Status: Active. Last synced 4 days ago.