YTsaurus

Scalable, fault-tolerant platform for big-data storage and processing

YTsaurus delivers a multitenant, distributed storage and compute engine with MapReduce, SQL, and NoSQL capabilities, supporting exabyte-scale data, millions of cores, and seamless scaling.

Overview

YTsaurus is a distributed storage and processing platform designed for organizations that need to handle petabyte‑to‑exabyte data volumes across many users. It combines a MapReduce engine, an SQL query layer powered by ClickHouse (CHYT), a job scheduler, and a key‑value store for OLTP workloads, all within a single multitenant ecosystem.

Core Capabilities

The system offers fault‑tolerant operation with automated replication and zero‑downtime updates, while scaling to millions of CPU cores, thousands of GPUs, and tens of thousands of nodes. Data can reside on HDD, SSD, NVMe, or RAM, and the platform supports ACID transactions, a rich set of SDKs/APIs, and secure isolation of compute and storage resources. Integrated SPYT brings Apache Spark‑compatible tools for ETL, and the web UI simplifies cluster monitoring and job management.

Deployment Options

YTsaurus can be built from source and run locally, or provisioned on Kubernetes using the provided Helm chart. An online demo is also available for hands‑on evaluation. The project is released under the permissive Apache‑2.0 license, which allows commercial use, modification, and redistribution, so it can be deployed in both on‑premises data centers and cloud environments.

Highlights

Multitenant ecosystem with MapReduce, SQL, job scheduler, and key‑value store
Fault‑tolerant architecture with automated replication and zero‑downtime updates
Massive scalability to millions of CPU cores, exabytes of data, and tens of thousands of nodes
Integrated analytics via ClickHouse‑compatible CHYT and Spark‑compatible SPYT

Pros

  • High scalability for extreme data volumes
  • Robust fault tolerance and no single point of failure
  • Rich set of processing models (MapReduce, SQL, OLTP, Spark)
  • Compatibility with familiar ClickHouse and Spark ecosystems

Considerations

  • Complex deployment may require Kubernetes expertise
  • Resource‑intensive at large scale
  • Steep learning curve due to many subsystems
  • Primary codebase in C++ may limit contributions from non‑C++ developers

Managed products teams compare with

When teams consider YTsaurus, these hosted platforms usually appear on the same shortlist.

Amazon Redshift

Fully managed, petabyte-scale cloud data warehouse for analytics and reporting

Azure Synapse Analytics

Limitless analytics platform unifying enterprise data warehousing and big data analytics in a single service

Google BigQuery

Serverless, highly scalable cloud data warehouse

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Enterprises needing a unified big‑data platform across teams
  • Data teams running both OLAP and OLTP workloads
  • Organizations with large‑scale hardware resources
  • Teams leveraging ClickHouse or Spark tools

Not ideal when

  • Small teams with limited infrastructure
  • Projects needing a lightweight embedded database
  • Users seeking a fully managed SaaS solution
  • Workloads that cannot tolerate operational complexity of a distributed system

How teams use it

Real‑time clickstream analytics

Process billions of events per day, using MapReduce for batch aggregation and CHYT for low‑latency dashboard queries.
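
As a rough illustration of the MapReduce model mentioned above, the sketch below counts clicks per page in plain Python. It does not use the YTsaurus API: the event fields (`page`, `user`) and the helpers `map_event`, `reduce_counts`, and `run_mapreduce` are hypothetical names chosen for this example, and a real job would be submitted through one of the YTsaurus SDKs.

```python
from collections import defaultdict
from itertools import chain


def map_event(event):
    """Map step: emit one (key, value) pair per click event."""
    yield event["page"], 1


def reduce_counts(key, values):
    """Reduce step: sum all values that share the same key."""
    return key, sum(values)


def run_mapreduce(events):
    """Toy single-process driver: map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_event(e) for e in events):
        groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())


# Hypothetical sample data standing in for a clickstream table.
events = [
    {"page": "/home", "user": "a"},
    {"page": "/docs", "user": "b"},
    {"page": "/home", "user": "c"},
]
page_counts = run_mapreduce(events)
```

On a cluster the map and reduce steps would run in parallel across many nodes; the single-process driver here only shows how the two steps fit together.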

Large‑scale ETL pipelines

Leverage SPYT to orchestrate Spark jobs that transform and load petabytes of data into the lake.

Multi‑tenant data lake for business units

Provide isolated storage and compute environments for different departments while sharing underlying hardware.

High‑frequency trading data storage

Store tick‑level data with ACID guarantees and query it efficiently via the ClickHouse‑compatible SQL layer.

Tech snapshot

C++: 51%
C: 26%
Python: 15%
Go: 4%
Assembly: 2%
Cython: 1%

Tags

distributed-database, spark, clickhouse, sql, ytsaurus, lakehouse, big-data, olap-database

Frequently asked questions

Which programming languages have SDKs or APIs?

YTsaurus offers SDKs and APIs for C++, Python, Java, Go, and additional languages through REST and gRPC interfaces.

How does YTsaurus avoid a single point of failure?

The platform uses automated data replication across nodes and a distributed architecture that continues operating despite individual server failures.
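
The idea behind replication-based fault tolerance can be sketched with a toy model. The code below is not YTsaurus's placement algorithm; the replication factor of 3, the node and chunk names, and the helpers `place_replicas` and `readable_chunks` are all assumptions made for illustration. It shows that when every chunk is stored on three distinct nodes, losing any two nodes still leaves each chunk readable.

```python
import random


def place_replicas(chunks, nodes, replication_factor, seed=0):
    """Assign each chunk to `replication_factor` distinct nodes, chosen at random."""
    rng = random.Random(seed)
    return {chunk: set(rng.sample(nodes, replication_factor)) for chunk in chunks}


def readable_chunks(placement, failed_nodes):
    """Return the chunks that still have at least one replica on a healthy node."""
    failed = set(failed_nodes)
    return {chunk for chunk, replicas in placement.items() if replicas - failed}


# Hypothetical cluster: 6 nodes, 20 chunks, 3 replicas per chunk (assumed values).
nodes = [f"node-{i}" for i in range(6)]
chunks = [f"chunk-{i}" for i in range(20)]
placement = place_replicas(chunks, nodes, replication_factor=3)

# With 3 replicas on distinct nodes, two simultaneous failures cannot remove
# every copy of a chunk, so all chunks remain readable.
survivors = readable_chunks(placement, failed_nodes=["node-0", "node-1"])
```

A production system would also re-create lost replicas on healthy nodes after a failure, which this sketch does not model.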

Can YTsaurus be run on Kubernetes?

Yes, a Helm chart and quick‑start guide enable deployment of YTsaurus clusters on Kubernetes.

What SQL engine does YTsaurus use?

The CHYT layer is powered by ClickHouse, providing a familiar ClickHouse SQL dialect for fast analytic queries.

Is there a graphical interface for managing the cluster?

YTsaurus includes a web‑based UI for monitoring nodes, managing jobs, and interacting with stored data.

Project at a glance

Active
Stars: 2,118
Watchers: 2,118
Forks: 188
License: Apache-2.0
Repo age: 3 years
Last commit: 4 hours ago
Primary language: C++