LakeSoul

Cloud-native lakehouse with ACID transactions and streaming upserts

End-to-end lakehouse framework supporting scalable metadata, concurrent upserts, CDC ingestion, and unified batch/streaming processing across Spark, Flink, Presto, and PyTorch.

Overview

Modern Lakehouse for Real-Time Analytics

LakeSoul is a cloud-native lakehouse framework designed for organizations building real-time data warehouses and AI pipelines. It combines ACID transaction guarantees with high-throughput upsert operations using an LSM-Tree architecture, enabling concurrent updates on hash-partitioned tables with primary keys. PostgreSQL-backed metadata management ensures scalable ACID control and MVCC isolation.
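The upsert path described above can be pictured with a small, self-contained sketch. This is a conceptual toy in plain Python, not LakeSoul's API: the class `ToyUpsertTable`, the helper `bucket_of`, and the bucket count are all hypothetical names chosen for illustration. It assumes each write appends a new delta file to a hash bucket chosen by primary key, and that reads merge the deltas so the newest row per key wins.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count for this sketch


def bucket_of(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Route a primary key to a hash bucket (plain modulo, to stay deterministic)."""
    return key % num_buckets


class ToyUpsertTable:
    """Toy model of LSM-style upserts; not a real LakeSoul class."""

    def __init__(self):
        # bucket id -> list of delta "files" (dict of key -> row), oldest first
        self.buckets = defaultdict(list)

    def upsert(self, rows):
        """Append one new delta file per touched bucket; older files are never rewritten."""
        per_bucket = defaultdict(dict)
        for row in rows:
            per_bucket[bucket_of(row["id"])][row["id"]] = row
        for bucket_id, delta in per_bucket.items():
            self.buckets[bucket_id].append(delta)

    def read(self):
        """Merge-on-read: replay each bucket's deltas oldest to newest, newest row wins."""
        merged = {}
        for deltas in self.buckets.values():
            for delta in deltas:
                merged.update(delta)
        return [merged[key] for key in sorted(merged)]


table = ToyUpsertTable()
table.upsert([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
table.upsert([{"id": 2, "name": "b2"}, {"id": 3, "name": "c"}])
rows = table.read()
```

Because writers only append and a given key always lands in the same bucket, two writers touching different buckets never contend for the same file in this model, which is the intuition behind concurrent upserts on hash-partitioned tables.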

Multi-Engine, Multi-Workload Support

The framework integrates with Spark, Flink, Presto, and PyTorch, supporting batch, streaming, MPP, and machine learning workloads on HDFS and S3 storage. Native Rust-based IO and metadata layers deliver optimized merge-on-read performance. CDC capabilities include automatic schema evolution, exactly-once guarantees, and whole-database synchronization from sources like MySQL.

Enterprise-Ready Operations

LakeSoul provides multi-workspace RBAC through PostgreSQL row-level security and Hadoop user groups, ensuring metadata and data isolation across teams. Automated disaggregated compaction, lifecycle management, and redundant data cleanup reduce operational overhead. Time travel, snapshot rollback, and incremental queries enable flexible analytics workflows for both BI and AI applications.

Highlights

LSM-Tree upserts with concurrent writes and automatic conflict resolution

CDC ingestion with auto DDL sync and exactly-once streaming guarantees

PostgreSQL-backed metadata for scalable ACID transactions and MVCC

Native Rust IO layer with vectorized merge-on-read and multi-engine support

Pros

High-throughput concurrent upserts on primary-keyed tables
Unified batch and streaming semantics across Spark, Flink, and Presto
Automatic compaction and lifecycle management reduce operational burden
Native Python reader and PyTorch integration for AI workloads

Considerations

Requires PostgreSQL for metadata management, adding an infrastructure dependency
LSM-Tree merge-on-read may increase query latency for heavily updated tables
Relatively new project with a smaller community than Delta Lake or Iceberg
Documentation primarily targets users familiar with big data ecosystems

Managed products teams compare with

When teams consider LakeSoul, these hosted platforms usually appear on the same shortlist.

Amazon Redshift

Fully managed, petabyte-scale cloud data warehouse for analytics and reporting

Azure Synapse Analytics

Limitless analytics platform unifying enterprise data warehousing and big data analytics in a single service

Google BigQuery

Serverless, highly scalable cloud data warehouse


Fit guide

Great for

Real-time data warehouses requiring CDC ingestion and streaming upserts
Organizations needing multi-engine support across Spark, Flink, and AI frameworks
Teams building unified data infrastructure for both BI and machine learning
Environments demanding fine-grained RBAC and multi-workspace isolation

Not ideal when

Projects requiring serverless metadata management without PostgreSQL
Read-heavy analytics workloads with minimal update requirements
Teams seeking maximum ecosystem maturity and third-party tool integrations
Use cases prioritizing simplicity over advanced concurrent write capabilities

How teams use it

Real-Time MySQL Replication

Sync entire MySQL databases to cloud storage with auto table creation, DDL propagation, and exactly-once CDC guarantees for downstream analytics.
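One way to reason about exactly-once application of a change stream is idempotent replay: the sink remembers the highest change-log offset it has applied and ignores anything at or below it. The sketch below is a deliberately simplified illustration in plain Python. The `CdcSink` class and the event fields (`offset`, `table`, `op`, `key`, `row`) are assumptions made for this example and do not reflect LakeSoul's or any CDC connector's actual interfaces; a real pipeline would typically rely on engine checkpoints and transactional commits instead.

```python
class CdcSink:
    """Toy change-event sink; hypothetical, for illustration only."""

    def __init__(self):
        self.tables = {}       # table name -> {primary key -> row}
        self.last_offset = -1  # highest change-log offset already applied

    def apply(self, event):
        """Apply one change event; return False if it was a replayed duplicate."""
        if event["offset"] <= self.last_offset:
            return False  # already applied, so skipping keeps the result exactly-once
        # Create the target table on first sight of its name.
        rows = self.tables.setdefault(event["table"], {})
        if event["op"] == "delete":
            rows.pop(event["key"], None)
        else:  # "insert" and "update" are both treated as upserts
            rows[event["key"]] = event["row"]
        self.last_offset = event["offset"]
        return True


events = [
    {"offset": 0, "table": "users", "op": "insert", "key": 1, "row": {"name": "ann"}},
    {"offset": 1, "table": "users", "op": "update", "key": 1, "row": {"name": "anne"}},
    {"offset": 2, "table": "orders", "op": "insert", "key": 7, "row": {"total": 30}},
    {"offset": 3, "table": "orders", "op": "delete", "key": 7, "row": None},
]

sink = CdcSink()
applied = [sink.apply(e) for e in events]
# Simulate a restart that replays the last two events.
replayed = [sink.apply(e) for e in events[2:]]
```

After the replay, the sink's state is the same as if every event had been delivered once.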

Concurrent Multi-Stream Merging

Merge multiple Kafka streams sharing primary keys into wide tables without joins, enabling real-time feature stores for ML pipelines.
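The idea of building a wide table without joins can be sketched as partial-column upserts keyed on a shared primary key. The example below is a plain-Python illustration under that assumption. The function `merge_streams`, the key column `user_id`, and the sample streams are hypothetical and stand in for Kafka topics; this is not LakeSoul code.

```python
def merge_streams(*streams, key="user_id"):
    """Fold events from several streams into one wide row per primary key.

    Each event carries the key plus only the columns its stream owns;
    later values for the same column overwrite earlier ones.
    """
    wide = {}
    for stream in streams:
        for event in stream:
            row = wide.setdefault(event[key], {key: event[key]})
            row.update(event)
    return wide


# Two hypothetical streams that share the primary key but own different columns.
clicks = [{"user_id": 1, "clicks": 5}, {"user_id": 2, "clicks": 1}]
orders = [{"user_id": 1, "orders": 2}]

wide = merge_streams(clicks, orders)
```

Each stream writes only its own columns, so no stream needs to read the others before writing, and the combined row emerges at read time.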

PyTorch Model Training on Lakehouse

Train distributed deep learning models directly on versioned lakehouse data using native Python readers, eliminating ETL to separate AI storage.

Time-Travel Analytics and Rollback

Query historical snapshots for auditing or A/B testing, then roll back tables to previous states when data quality issues arise.
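Time travel and rollback can be modeled as an append-only commit log: a snapshot at version N replays the first N commits, and a rollback is a new commit that restores an earlier state. The sketch below is a conceptual toy in plain Python. `VersionedTable` and its methods are hypothetical names, and the representation (a `None` value meaning "delete this key") is an assumption made for brevity, not LakeSoul's storage format.

```python
class VersionedTable:
    """Toy append-only commit log; hypothetical, for illustration only."""

    def __init__(self):
        self.commits = []  # each commit: dict of key -> value, None meaning delete

    def commit(self, changes):
        """Record a set of changes and return the new 1-based version number."""
        self.commits.append(dict(changes))
        return len(self.commits)

    def snapshot(self, version=None):
        """Rebuild table state as of `version` (latest when omitted)."""
        upto = len(self.commits) if version is None else version
        state = {}
        for changes in self.commits[:upto]:
            for key, value in changes.items():
                if value is None:
                    state.pop(key, None)
                else:
                    state[key] = value
        return state

    def rollback(self, version):
        """Restore an earlier state as a new commit, keeping history intact."""
        target = self.snapshot(version)
        current = self.snapshot()
        changes = {key: target.get(key) for key in current.keys() | target.keys()}
        return self.commit(changes)


t = VersionedTable()
v1 = t.commit({"a": 1})
v2 = t.commit({"a": 2, "b": 3})
before = t.snapshot(v1)      # time travel to version 1
v3 = t.rollback(v1)          # undo the second commit
after = t.snapshot()
```

Writing the rollback as a new commit means version 2 stays queryable for auditing even after the table has been restored.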

Tech snapshot

Java 38%

Scala 32%

Rust 23%

Python 5%

MDX 1%

Shell 1%

Frequently asked questions

What makes LakeSoul different from Delta Lake or Apache Iceberg?

LakeSoul uses an LSM-Tree architecture for high-throughput concurrent upserts on primary-keyed tables, PostgreSQL for scalable metadata management, and native Rust IO for performance. It emphasizes streaming CDC ingestion and multi-workspace RBAC.

Does LakeSoul require PostgreSQL to operate?

Yes, LakeSoul uses PostgreSQL for ACID metadata management, MVCC isolation, and RBAC. This enables scalable concurrent writes but adds an infrastructure dependency compared to file-based metadata systems.

Can I use LakeSoul with existing Spark or Flink jobs?

Yes, LakeSoul integrates with Spark (Table/DataFrame/SQL APIs) and Flink (Table API, batch/stream sources and sinks) through connectors. It also supports Presto for MPP queries and PyTorch for AI workloads.

How does automatic compaction work?

LakeSoul provides disaggregated, size-tiered multi-level compaction that runs automatically in the background, merging upsert delta files to optimize read performance without manual intervention.
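As a rough illustration of size-tiered compaction in general, the sketch below groups files into tiers by size and merges a tier once it accumulates enough files. The merged file may then land in a higher tier and trigger further merges. The functions `tier_of` and `compact`, the tier base, and the file-count threshold are hypothetical choices for this example and are not taken from LakeSoul's implementation or configuration.

```python
from collections import defaultdict


def tier_of(size, base=4):
    """Tier 0 holds sizes below `base`, tier 1 below base**2, and so on."""
    tier = 0
    while size >= base:
        size //= base
        tier += 1
    return tier


def compact(sizes, min_files=4, base=4):
    """Repeatedly merge the lowest tier that holds at least `min_files` files.

    `sizes` are file sizes in arbitrary units; merging is modeled as summing them.
    Returns the sorted file sizes once no tier is full.
    """
    files = list(sizes)
    while True:
        tiers = defaultdict(list)
        for size in files:
            tiers[tier_of(size, base)].append(size)
        full = [tier for tier, group in tiers.items() if len(group) >= min_files]
        if not full:
            return sorted(files)
        group = tiers[min(full)]
        for size in group:
            files.remove(size)
        files.append(sum(group))  # the merged file may fall into a higher tier


result = compact([1, 1, 1, 1, 5, 6, 7])
untouched = compact([1, 1, 5])
```

In the first call the four small files merge into one file of size 4, which fills the next tier alongside 5, 6, and 7 and triggers a second merge. The second call has no full tier, so nothing is rewritten.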

What storage systems does LakeSoul support?

LakeSoul works with HDFS and S3-compatible object storage, with native IO optimizations for cloud environments including multi-layer storage classes and local-disk caching.

Project at a glance

Active


Stars: 3,225
Watchers: 3,225
Forks: 414

License: Apache-2.0

Repo age: 4 years old

Last commit: 4 days ago

Primary language: Java

Last synced 11 hours ago