
OLake
Blazing-fast database replication to Apache Iceberg tables
OLake replicates PostgreSQL, MySQL, MongoDB, and Oracle to Apache Iceberg at high throughput with CDC support, with no Spark or Flink required. Kafka source support is in development.

Overview
Fast, Infrastructure-Light Data Replication
OLake is a high-performance connector designed for data engineers who need to replicate transactional databases into Apache Iceberg without the overhead of traditional streaming infrastructure. It supports PostgreSQL, MySQL, MongoDB, and Oracle sources with full load, CDC (change data capture), and incremental sync modes; the modes available vary by source.
Built for Speed and Simplicity
With benchmarks showing 235K RPS for PostgreSQL full loads (15.9× faster than Debezium) and 64K RPS for MySQL (9× faster than Airbyte), OLake delivers high throughput on minimal infrastructure. It removes the dependency on Spark, Flink, Kafka, and Debezium, reducing operational complexity and cost. The self-serve web UI and Docker Compose deployment let teams configure and launch pipelines in minutes.
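To put the quoted throughput in perspective, a rough estimate of full-load duration can be sketched as below. The 1-billion-row table size is a hypothetical input, and the sketch assumes that RPS means rows per second and that the benchmark rate holds steady for the whole load, which real workloads may not match.

```python
def full_load_minutes(total_rows: int, rows_per_second: float) -> float:
    """Estimate full-load duration in minutes at a constant throughput."""
    if rows_per_second <= 0:
        raise ValueError("rows_per_second must be positive")
    return total_rows / rows_per_second / 60


# Throughput figures quoted in the benchmark claims (assumed: rows per second).
POSTGRES_RPS = 235_000
MYSQL_RPS = 64_000

# Hypothetical table size, chosen only for illustration.
TABLE_ROWS = 1_000_000_000

postgres_minutes = full_load_minutes(TABLE_ROWS, POSTGRES_RPS)
mysql_minutes = full_load_minutes(TABLE_ROWS, MYSQL_RPS)

print(f"PostgreSQL: ~{postgres_minutes:.0f} min")
print(f"MySQL: ~{mysql_minutes:.0f} min")
```

Under these assumptions, a 1-billion-row full load would take roughly 71 minutes at the PostgreSQL rate and roughly 260 minutes at the MySQL rate.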
Iceberg-Native Architecture
OLake writes directly to Apache Iceberg tables and supports Glue, Hive, JDBC, and REST catalogs (including Nessie, Polaris, Unity Catalog, and AWS S3 Tables). It also outputs Parquet to filesystems, with Delta Lake and Hudi support planned. Advanced users can leverage the CLI for automation and orchestration with Airflow or Kubernetes.
Highlights
Pros
- Exceptional throughput with minimal infrastructure: no Spark, Flink, or Kafka required
- Automatic schema discovery and CDC replication simplify pipeline setup
- Supports multiple Iceberg catalogs and cloud storage backends (S3, ADLS, GCS, MinIO)
- Docker Compose quickstart and web UI lower the barrier to entry
Considerations
- MongoDB and Oracle CDC support marked as work-in-progress
- Kafka source and pgoutput for PostgreSQL still under development
- Delta Lake and Hudi destinations planned but not yet available
- Benchmark reproducibility and full reports pending publication
Fit guide
Great for
- Data teams migrating OLTP databases to Iceberg without heavy streaming infrastructure
- Organizations requiring high-throughput CDC replication on cost-efficient object storage
- Teams using Athena, Trino, Presto, Dremio, Databricks, or Snowflake for BI on Iceberg
- Engineers seeking self-serve UI and CLI automation for data lakehouse pipelines
Not ideal when
- Projects requiring production-ready Kafka source or pgoutput CDC for PostgreSQL today
- Teams needing Delta Lake or Hudi destinations in the immediate term
- Use cases demanding fully published, reproducible benchmark validation before adoption
- Environments where MongoDB or Oracle CDC must be production-stable now
How teams use it
OLTP to Iceberg Migration
Replicate PostgreSQL or MySQL transactional databases to Iceberg tables without deploying Spark or Flink, reducing infrastructure cost and complexity.
Real-Time BI on CDC Data
Stream change data capture events into Iceberg and query fresh data with Athena, Trino, Presto, or Snowflake for near real-time analytics.
Cost-Efficient Data Lakehouse
Build a lakehouse on S3, ADLS, or GCS with high-throughput ingestion, leveraging Iceberg's open format for multi-engine access.
Self-Service Data Pipelines
Enable analysts and engineers to configure and launch replication jobs via the web UI, accelerating time-to-insight without custom code.
Frequently asked questions
What databases does OLake support as sources?
OLake supports PostgreSQL, MySQL, MongoDB (full load and CDC), and Oracle (full load and incremental). Kafka source support is in development.
Does OLake require Spark, Flink, or Kafka?
No. OLake is infrastructure-light and does not depend on Spark, Flink, Kafka, or Debezium, reducing operational overhead and cost.
Which Iceberg catalogs are supported?
OLake supports AWS Glue, Hive, JDBC, and REST catalogs, including Nessie, Polaris, Unity Catalog, Lakekeeper, and AWS S3 Tables.
How do I deploy OLake?
Use Docker Compose for a quickstart with the web UI. Other options are Kubernetes with Helm, standalone Docker, or Airflow running on EC2 or Kubernetes.
What are the benchmark results?
OLake achieves 235K RPS for PostgreSQL full loads (15.9× faster than Debezium) and 64K RPS for MySQL (9× faster than Airbyte). Fully reproducible reports are forthcoming.
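The speedup multipliers imply approximate baseline throughputs for the compared tools. The sketch below derives them, assuming that "N× faster" means a plain ratio of throughputs. The derived baseline numbers are not stated in the source and should be read as estimates only.

```python
def implied_baseline_rps(olake_rps: float, speedup: float) -> float:
    """Back out the baseline throughput implied by a speedup ratio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return olake_rps / speedup


# Figures quoted in the answer above.
debezium_rps = implied_baseline_rps(235_000, 15.9)  # PostgreSQL full load
airbyte_rps = implied_baseline_rps(64_000, 9.0)     # MySQL

print(f"Implied Debezium throughput: ~{debezium_rps:,.0f} RPS")
print(f"Implied Airbyte throughput: ~{airbyte_rps:,.0f} RPS")
```

Read this way, the claims correspond to roughly 14.8K RPS for Debezium and roughly 7.1K RPS for Airbyte in the same tests.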
Project at a glance
Status: Active
- Stars: 1,272
- Watchers: 1,272
- Forks: 195