
Amazon Redshift
Fully managed, petabyte-scale cloud data warehouse for analytics and reporting

Cloud-native lakehouse with ACID transactions and streaming upserts
End-to-end lakehouse framework supporting scalable metadata, concurrent upserts, CDC ingestion, and unified batch/streaming processing across Spark, Flink, Presto, and PyTorch.

LakeSoul is a cloud-native lakehouse framework designed for organizations building real-time data warehouses and AI pipelines. It combines ACID transaction guarantees with high-throughput upsert operations using an LSM-Tree architecture, enabling concurrent updates on hash-partitioned tables with primary keys. PostgreSQL-backed metadata management ensures scalable ACID control and MVCC isolation.
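The upsert path described above can be illustrated with a toy model. This is a minimal sketch, not LakeSoul's implementation: rows are routed to a bucket by hashing the primary key, every upsert appends a new delta batch instead of rewriting old data, and a read replays the batches oldest to newest so the latest row for each key wins. The names `ToyTable` and `bucket_for` are hypothetical, and the bucket count is an arbitrary assumption.

```python
import zlib
from collections import defaultdict


class ToyTable:
    """Toy merge-on-read table: append-only delta batches per hash bucket."""

    def __init__(self, primary_key, num_buckets=4):
        self.primary_key = primary_key
        self.num_buckets = num_buckets
        # bucket id -> list of delta batches, oldest first
        self.buckets = defaultdict(list)

    def bucket_for(self, key):
        # Stable hash, so the same key always lands in the same bucket.
        return zlib.crc32(str(key).encode("utf-8")) % self.num_buckets

    def upsert(self, rows):
        # Writes never touch existing batches; they append one batch per bucket.
        per_bucket = defaultdict(dict)
        for row in rows:
            key = row[self.primary_key]
            per_bucket[self.bucket_for(key)][key] = dict(row)
        for bucket_id, batch in per_bucket.items():
            self.buckets[bucket_id].append(batch)

    def read(self):
        # Merge on read: replay batches oldest to newest, latest row wins.
        merged = {}
        for batches in self.buckets.values():
            for batch in batches:
                merged.update(batch)
        return sorted(merged.values(), key=lambda r: r[self.primary_key])


table = ToyTable(primary_key="id")
table.upsert([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
table.upsert([{"id": 2, "name": "b2"}, {"id": 3, "name": "c"}])
print(table.read())
# [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b2'}, {'id': 3, 'name': 'c'}]
```

Because a key always hashes to the same bucket, writers touching different buckets never need to coordinate on the same data, which is one reason this layout suits concurrent updates.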
The framework integrates with Spark, Flink, Presto, and PyTorch, supporting batch, streaming, MPP, and machine learning workloads on HDFS and S3 storage. Native Rust-based IO and metadata layers deliver optimized merge-on-read performance. CDC capabilities include automatic schema evolution, exactly-once guarantees, and whole-database synchronization from sources like MySQL.
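To make the CDC behavior above concrete, here is a small sketch of applying change events to a keyed table. It is an illustration under stated assumptions, not LakeSoul's code: the event shape (`offset`, `op`, `key`, `row`) and the function `apply_cdc` are hypothetical, schema evolution is modeled as appending unseen column names, and the exactly-once effect is imitated by skipping any event whose offset has already been applied. Real systems typically achieve this through checkpointing and transactional commits.

```python
def apply_cdc(events, table=None, columns=None, last_offset=-1):
    """Apply change events in order; returns (table, columns, last_offset)."""
    table = {} if table is None else table
    columns = [] if columns is None else columns
    for event in events:
        if event["offset"] <= last_offset:
            continue  # already applied: replaying a batch must not change state
        if event["op"] == "delete":
            table.pop(event["key"], None)
        else:  # "insert" and "update" are both treated as upserts
            for col in event["row"]:
                if col not in columns:
                    columns.append(col)  # new source column: widen the schema
            table.setdefault(event["key"], {}).update(event["row"])
        last_offset = event["offset"]
    return table, columns, last_offset


events = [
    {"offset": 0, "op": "insert", "key": 1, "row": {"id": 1, "name": "a"}},
    {"offset": 1, "op": "update", "key": 1,
     "row": {"id": 1, "name": "a2", "email": "a@example.com"}},
    {"offset": 2, "op": "insert", "key": 2, "row": {"id": 2, "name": "b"}},
    {"offset": 3, "op": "delete", "key": 2},
]

state = apply_cdc(events)
replayed = apply_cdc(events, *state)  # delivering the same batch again is a no-op
print(state[0])  # {1: {'id': 1, 'name': 'a2', 'email': 'a@example.com'}}
print(state[1])  # ['id', 'name', 'email']
```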
LakeSoul provides multi-workspace RBAC through PostgreSQL row-level security and Hadoop user groups, ensuring metadata and data isolation across teams. Automated disaggregated compaction, lifecycle management, and redundant data cleanup reduce operational overhead. Time travel, snapshot rollback, and incremental queries enable flexible analytics workflows for both BI and AI applications.
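Time travel, rollback, and incremental queries all follow from keeping a history of committed versions. The sketch below is a toy model of that idea and makes several assumptions that are not taken from LakeSoul: each commit stores a full copy of the table, versions are plain integers, and a rollback is recorded as a new commit so earlier history is preserved. The class `SnapshotLog` and its methods are hypothetical.

```python
_MISSING = object()  # sentinel for "key absent in the older version"


class SnapshotLog:
    """Toy version log: each commit stores the full table state."""

    def __init__(self):
        self.snapshots = [{}]  # version 0 is the empty table

    def commit(self, changes):
        new_state = dict(self.snapshots[-1])
        new_state.update(changes)
        self.snapshots.append(new_state)
        return len(self.snapshots) - 1  # the new version number

    def read(self, version=None):
        # Time travel: read any committed version; default is the latest.
        index = -1 if version is None else version
        return dict(self.snapshots[index])

    def changes_between(self, start, end):
        # Incremental query: keys added or changed between two versions.
        old, new = self.snapshots[start], self.snapshots[end]
        return {k: v for k, v in new.items() if old.get(k, _MISSING) != v}

    def rollback(self, version):
        # Rollback is recorded as a new commit, so history stays intact.
        self.snapshots.append(dict(self.snapshots[version]))
        return len(self.snapshots) - 1


log = SnapshotLog()
v1 = log.commit({"u1": 10})
v2 = log.commit({"u1": 99, "u2": 5})
print(log.read(v1))                 # {'u1': 10}
print(log.changes_between(v1, v2))  # {'u1': 99, 'u2': 5}
v3 = log.rollback(v1)
print(log.read())                   # {'u1': 10}
```

A production system would store only the files changed per commit rather than full copies; the full-copy approach here just keeps the example short.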
Real-Time MySQL Replication
Sync entire MySQL databases to cloud storage with auto table creation, DDL propagation, and exactly-once CDC guarantees for downstream analytics.
Concurrent Multi-Stream Merging
Merge multiple Kafka streams sharing primary keys into wide tables without joins, enabling real-time feature stores for ML pipelines.
PyTorch Model Training on Lakehouse
Train distributed deep learning models directly on versioned lakehouse data using native Python readers, eliminating ETL to separate AI storage.
Time-Travel Analytics and Rollback
Query historical snapshots for auditing or A/B testing, then roll back tables to previous states when data quality issues arise.
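The multi-stream merging use case above can be sketched in a few lines. This is a simplified illustration rather than LakeSoul's mechanism: each stream carries a different subset of columns for the same primary key, and each record is applied as a partial-column upsert, so a wide row accumulates without a join. The function `merge_streams`, the column names, and the sample records are all hypothetical.

```python
def merge_streams(*streams, primary_key="user_id"):
    """Build wide rows by applying every record as a partial-column upsert."""
    wide = {}
    for stream in streams:
        for record in stream:
            key = record[primary_key]
            # Only the columns present in this record are written.
            wide.setdefault(key, {primary_key: key}).update(record)
    return wide


clicks = [
    {"user_id": 1, "clicks_7d": 12},
    {"user_id": 2, "clicks_7d": 3},
]
orders = [
    {"user_id": 1, "orders_7d": 2},
    {"user_id": 3, "orders_7d": 1},
]

features = merge_streams(clicks, orders)
print(features[1])  # {'user_id': 1, 'clicks_7d': 12, 'orders_7d': 2}
print(features[3])  # {'user_id': 3, 'orders_7d': 1}
```

Because the two streams write disjoint columns, the arrival order between streams does not change the result, which is what lets the writers run concurrently.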
LakeSoul uses an LSM-Tree architecture for high-throughput concurrent upserts on primary-keyed tables, PostgreSQL for scalable metadata management, and native Rust IO for performance. It emphasizes streaming CDC ingestion and multi-workspace RBAC.
LakeSoul requires PostgreSQL, which it uses for ACID metadata management, MVCC isolation, and RBAC. This enables scalable concurrent writes but adds an infrastructure dependency compared to file-based metadata systems.
LakeSoul integrates with Spark (Table/DataFrame/SQL APIs) and Flink (Table API, batch/stream sources and sinks) through connectors. It also supports Presto for MPP queries and PyTorch for AI workloads.
LakeSoul provides disaggregated, size-tiered multi-level compaction that runs automatically in the background, merging upsert delta files to optimize read performance without manual intervention.
LakeSoul works with HDFS and S3-compatible object storage, with native IO optimizations for cloud environments including multi-layer storage classes and local-disk caching.
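The size-tiered compaction mentioned above can be sketched as a small simulation. It shows the general size-tiered technique, not LakeSoul's actual policy: files are grouped into tiers by size, and whenever a tier holds enough files they are merged into one larger file, which may in turn fill the next tier. The functions `tier_of` and `compact`, the fan-out of 4, and the four-file trigger are assumptions chosen for illustration, and the merged size is taken as the plain sum even though real upsert deltas usually shrink when merged.

```python
from collections import defaultdict


def tier_of(size, base_size=1, fanout=4):
    """Tier 0 holds sizes below base_size * fanout, tier 1 the next band, etc."""
    tier = 0
    limit = base_size * fanout
    while size >= limit:
        tier += 1
        limit *= fanout
    return tier


def compact(file_sizes, min_files=4, base_size=1, fanout=4):
    """Merge the lowest full tier repeatedly until no tier has min_files files."""
    files = list(file_sizes)
    while True:
        tiers = defaultdict(list)
        for size in files:
            tiers[tier_of(size, base_size, fanout)].append(size)
        full = [t for t, members in tiers.items() if len(members) >= min_files]
        if not full:
            return sorted(files)
        target = min(full)
        merged = sum(tiers[target])  # assumption: merged size is the plain sum
        files = [s for t, members in tiers.items() if t != target for s in members]
        files.append(merged)


print(compact([1, 1, 1, 1, 4, 4, 4]))  # [16]
print(compact([1, 1, 1, 8]))           # [1, 1, 1, 8]
```

In the first call the four small files merge into a file of size 4, which fills the next tier and triggers a second merge. In the second call no tier has four files, so nothing is rewritten.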