YTsaurus

Scalable, fault-tolerant platform for big-data storage and processing

YTsaurus delivers a multitenant, distributed storage and compute engine with MapReduce, SQL, and NoSQL capabilities, supporting exabyte-scale data, millions of cores, and seamless scaling.

Overview

YTsaurus is a distributed storage and processing platform designed for organizations that need to handle petabyte‑to‑exabyte data volumes across many users. It combines a MapReduce engine, an SQL query layer powered by ClickHouse (CHYT), a job scheduler, and a key‑value store for OLTP workloads, all within a single multitenant ecosystem.

Core Capabilities

The system offers fault‑tolerant operation with automated replication and zero‑downtime updates, while scaling to millions of CPU cores, thousands of GPUs, and tens of thousands of nodes. Data can reside on HDD, SSD, NVMe, or RAM, and the platform supports ACID transactions, a rich set of SDKs/APIs, and secure isolation of compute and storage resources. Integrated SPYT brings Apache Spark‑compatible tools for ETL, and the web UI simplifies cluster monitoring and job management.

Deployment Options

YTsaurus can be built from source for local runs or provisioned on Kubernetes using the provided Helm chart. An online demo is also available for hands‑on evaluation. The project is released under the Apache‑2.0 license, which permits commercial use, modification, and redistribution subject to the license's attribution and notice terms. It can run in both on‑premises data centers and cloud environments.

Highlights

Multitenant ecosystem with MapReduce, SQL, job scheduler, and key‑value store

Fault‑tolerant architecture with automated replication and zero‑downtime updates

Massive scalability to millions of CPU cores, exabytes of data, and tens of thousands of nodes

Integrated analytics via ClickHouse‑compatible CHYT and Spark‑compatible SPYT

Pros

High scalability for extreme data volumes
Robust fault tolerance and no single point of failure
Rich set of processing models (MapReduce, SQL, OLTP, Spark)
Compatibility with familiar ClickHouse and Spark ecosystems

Considerations

Complex deployment may require Kubernetes expertise
Resource‑intensive at large scale
Steep learning curve due to many subsystems
Primary codebase in C++ may limit contributions from non‑C++ developers

Managed products teams compare with

When teams consider YTsaurus, these hosted platforms usually appear on the same shortlist.

Amazon Redshift

Fully managed, petabyte-scale cloud data warehouse for analytics and reporting

Azure Synapse Analytics

Limitless analytics platform unifying enterprise data warehousing and big data analytics in a single service

Google BigQuery

Serverless, highly scalable cloud data warehouse


Fit guide

Great for

Enterprises needing a unified big‑data platform across teams
Data teams that run both OLAP and OLTP workloads
Organizations with large‑scale hardware resources
Teams leveraging ClickHouse or Spark tools

Not ideal when

Small teams with limited infrastructure
Projects needing a lightweight embedded database
Users seeking a fully managed SaaS solution
Workloads that cannot tolerate operational complexity of a distributed system

How teams use it

Real‑time clickstream analytics

Process billions of events per day, using MapReduce for batch aggregation and CHYT for low‑latency dashboard queries.
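
As a rough illustration of the MapReduce model mentioned above, the following sketch counts clicks per URL in plain Python. It runs entirely in one process and does not use any YTsaurus API. The event fields (`user`, `url`) and all function names are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Dict[str, str]


def map_event(event: Event) -> Iterator[Tuple[str, int]]:
    """Map step: emit one (url, 1) pair per click event."""
    yield event["url"], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group mapped values by key, the role a framework plays between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_clicks(url: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum all counts collected for one URL."""
    return url, sum(counts)


def count_clicks(events: Iterable[Event]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over an in-memory event list."""
    mapped = (pair for event in events for pair in map_event(event))
    grouped = shuffle(mapped)
    return dict(reduce_clicks(url, counts) for url, counts in grouped.items())


# Hypothetical sample events, for illustration only.
events = [
    {"user": "u1", "url": "/home"},
    {"user": "u2", "url": "/pricing"},
    {"user": "u1", "url": "/home"},
]
clicks_per_url = count_clicks(events)
print(clicks_per_url)
```

On a real cluster the map and reduce steps would run as distributed jobs over table data; the sketch only shows the shape of the computation.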

Large‑scale ETL pipelines

Use SPYT to run Spark jobs that transform and load petabytes of data into the lake.

Multi‑tenant data lake for business units

Provide isolated storage and compute environments for different departments while sharing underlying hardware.

High‑frequency trading data storage

Store tick‑level data with ACID guarantees and query it efficiently via the ClickHouse‑compatible SQL layer.

Tech snapshot

C++: 51%
C: 26%
Python: 15%
Go: 4%
Assembly: 2%
Cytho: 1%

Frequently asked questions

Which programming languages have SDKs or APIs?

YTsaurus offers SDKs and APIs for C++, Python, Java, Go, and additional languages through REST and gRPC interfaces.
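
To give a feel for the Python side, here is a minimal sketch of a write-then-read round trip. It assumes the Python SDK's client exposes `write_table` and `read_table` methods that accept and return rows as dictionaries. So that the example runs without a cluster, it uses a hypothetical in-memory stand-in (`InMemoryClient`) in place of the real client; the table path and row contents are also made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List


class InMemoryClient:
    """Hypothetical stand-in exposing only the two client methods used below."""

    def __init__(self) -> None:
        self._tables: Dict[str, List[dict]] = {}

    def write_table(self, path: str, rows: Iterable[dict]) -> None:
        # Overwrite the table at `path` with copies of the given rows.
        self._tables[path] = [dict(row) for row in rows]

    def read_table(self, path: str) -> Iterator[dict]:
        # Return an iterator over the stored rows.
        return iter(self._tables[path])


def round_trip(client, path: str, rows: List[dict]) -> List[dict]:
    """Write rows to a table, then read them back as a list."""
    client.write_table(path, rows)
    return list(client.read_table(path))


# Assumption: with the real SDK installed, the client would be created roughly as
#   import yt.wrapper as yt
#   client = yt.YtClient(proxy="<cluster-proxy-address>", token="<token>")
client = InMemoryClient()
rows = round_trip(client, "//tmp/example_table", [{"key": "a", "value": 1}])
print(rows)
```

Keeping the logic in a function that takes the client as a parameter makes it possible to exercise the code locally before pointing it at a real cluster.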

How does YTsaurus avoid a single point of failure?

The platform uses automated data replication across nodes and a distributed architecture that continues operating despite individual server failures.
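
The idea behind replication can be sketched with a toy model: each chunk of data is stored on several nodes, so it stays readable as long as at least one replica survives. The round-robin placement, the replication factor of 3, and all names below are assumptions made for illustration; this is not YTsaurus's actual placement algorithm.

```python
import itertools
from typing import Dict, List, Set


def place_replicas(
    chunks: List[str], nodes: List[str], replication_factor: int
) -> Dict[str, Set[str]]:
    """Assign each chunk to `replication_factor` distinct nodes, round-robin."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    node_cycle = itertools.cycle(nodes)
    placement: Dict[str, Set[str]] = {}
    for chunk in chunks:
        # Consecutive draws from the cycle are distinct because
        # replication_factor <= len(nodes).
        placement[chunk] = {next(node_cycle) for _ in range(replication_factor)}
    return placement


def readable_chunks(
    placement: Dict[str, Set[str]], failed_nodes: Set[str]
) -> Set[str]:
    """Return the chunks that still have at least one replica on a live node."""
    return {
        chunk for chunk, replicas in placement.items() if replicas - failed_nodes
    }


# Hypothetical cluster: five nodes, four chunks, three replicas per chunk.
nodes = ["n1", "n2", "n3", "n4", "n5"]
chunks = ["c1", "c2", "c3", "c4"]
placement = place_replicas(chunks, nodes, replication_factor=3)

# With three replicas, any two simultaneous node failures leave every chunk readable.
survivors = readable_chunks(placement, failed_nodes={"n1", "n2"})
print(sorted(survivors))
```

In this model, losing all three nodes that hold a chunk's replicas makes that chunk unreadable, which is why a real system also re-replicates data after a failure.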

Can YTsaurus be run on Kubernetes?

Yes, a Helm chart and quick‑start guide enable deployment of YTsaurus clusters on Kubernetes.

What SQL engine does YTsaurus use?

The CHYT layer is powered by ClickHouse, providing a familiar ClickHouse SQL dialect for fast analytic queries.

Is there a graphical interface for managing the cluster?

YTsaurus includes a web‑based UI for monitoring nodes, managing jobs, and interacting with stored data.

Project at a glance

Active


Stars: 2,198
Watchers: 2,198
Forks: 212

License: Apache-2.0
Repo age: 3 years old
Last commit: 7 hours ago
Primary language: C++

