OLake

Blazing-fast database replication to Apache Iceberg tables

OLake replicates PostgreSQL, MySQL, MongoDB, and Oracle to Apache Iceberg at high throughput with CDC support, no Spark or Flink required. A Kafka source is in development.

Overview

Fast, Infrastructure-Light Data Replication

OLake is a high-performance connector for data engineers who need to replicate transactional databases into Apache Iceberg without the overhead of traditional streaming infrastructure. It supports PostgreSQL, MySQL, MongoDB, and Oracle sources across three sync modes: full load, CDC (change data capture), and incremental. Mode availability varies by source; for example, Oracle currently offers full load and incremental sync.
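
To make the difference between a full load and an incremental sync concrete, here is a minimal Python sketch. It is an illustration of the general technique only, not OLake's implementation: the in-memory `rows` list, the `updated_at` cursor column, and both function names are assumptions made up for this example.

```python
from typing import Any, Optional

Row = dict[str, Any]


def full_load(rows: list[Row]) -> list[Row]:
    """Full load: copy every row, regardless of what was synced before."""
    return list(rows)


def incremental_sync(
    rows: list[Row], cursor_field: str, last_cursor: Optional[int]
) -> tuple[list[Row], Optional[int]]:
    """Incremental sync: copy only rows whose cursor value is newer than
    the last saved cursor, and return the new cursor to persist."""
    new_rows = [
        r for r in rows if last_cursor is None or r[cursor_field] > last_cursor
    ]
    # If nothing new arrived, keep the previous cursor unchanged.
    new_cursor = max((r[cursor_field] for r in new_rows), default=last_cursor)
    return new_rows, new_cursor


# Hypothetical source table with an `updated_at` cursor column.
source = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 20},
    {"id": 3, "updated_at": 30},
]
batch, cursor = incremental_sync(source, "updated_at", last_cursor=10)
print([r["id"] for r in batch], cursor)
```

A cursor-based incremental sync like this does not see deleted rows, which is one reason CDC exists as a separate mode.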

Built for Speed and Simplicity

Benchmarks show 235K RPS for PostgreSQL full loads, 15.9× faster than Debezium, and 64K RPS for MySQL, all on minimal infrastructure. OLake removes the dependency on Spark, Flink, Kafka, and Debezium, which reduces operational complexity and cost. The self-serve web UI and Docker Compose deployment let teams configure and launch pipelines in minutes.

Iceberg-Native Architecture

OLake writes directly to Apache Iceberg tables and supports Glue, Hive, JDBC, and REST catalogs (including Nessie, Polaris, Unity Catalog, Lakekeeper, and AWS S3 Tables). It can also write Parquet files to filesystems; Delta Lake and Hudi support is planned. Advanced users can use the CLI for automation and orchestration with Airflow or Kubernetes.

Highlights

  • 235K RPS PostgreSQL throughput, 15.9× faster than Debezium in benchmarks
  • Full load, CDC, and incremental sync for PostgreSQL, MySQL, MongoDB, Oracle
  • Native Apache Iceberg writer supporting Glue, Hive, JDBC, and REST catalogs
  • Self-serve web UI and CLI with Docker Compose, Kubernetes, and Airflow deployment

Pros

  • High throughput with minimal infrastructure: no Spark, Flink, or Kafka required
  • Automatic schema discovery and CDC replication simplify pipeline setup
  • Supports multiple Iceberg catalogs and cloud storage backends (S3, ADLS, GCS, MinIO)
  • Docker Compose quickstart and web UI lower the barrier to entry

Considerations

  • MongoDB and Oracle CDC support is marked as work-in-progress
  • Kafka source and pgoutput for PostgreSQL are still under development
  • Delta Lake and Hudi destinations planned but not yet available
  • Benchmark reproducibility and full reports pending publication

Fit guide

Great for

  • Data teams migrating OLTP databases to Iceberg without heavy streaming infrastructure
  • Organizations requiring high-throughput CDC replication on cost-efficient object storage
  • Teams using Athena, Trino, Presto, Dremio, Databricks, or Snowflake for BI on Iceberg
  • Engineers seeking self-serve UI and CLI automation for data lakehouse pipelines

Not ideal when

  • You need a production-ready Kafka source or pgoutput CDC for PostgreSQL today
  • You need Delta Lake or Hudi destinations in the immediate term
  • You require fully published, reproducible benchmark validation before adoption
  • MongoDB or Oracle CDC must be production-stable now

How teams use it

OLTP to Iceberg Migration

Replicate PostgreSQL or MySQL transactional databases to Iceberg tables without deploying Spark or Flink, reducing infrastructure cost and complexity.

Real-Time BI on CDC Data

Stream change data capture events into Iceberg and query fresh data with Athena, Trino, Presto, or Snowflake for near real-time analytics.
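
As a rough sketch of what CDC replication means in practice, the Python below replays insert, update, and delete events against a keyed copy of a table. This shows the general idea only, under stated assumptions: the event shape (`op`, `key`, `row`) and the `apply_changes` function are hypothetical, and an in-memory dict stands in for an Iceberg table. It does not reflect OLake's internal format or writer.

```python
from typing import Any

Table = dict[int, dict[str, Any]]


def apply_changes(table: Table, events: list[dict[str, Any]]) -> Table:
    """Apply change events in order so the target mirrors the source."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Upsert: the latest row image for a key replaces any earlier one.
            table[key] = event["row"]
        elif op == "delete":
            # Deleting a key that is already absent is treated as a no-op.
            table.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op}")
    return table


# Hypothetical target state and change stream.
target: Table = {1: {"name": "alice"}}
changes = [
    {"op": "insert", "key": 2, "row": {"name": "bob"}},
    {"op": "update", "key": 1, "row": {"name": "alicia"}},
    {"op": "delete", "key": 2},
]
print(apply_changes(target, changes))
```

Order matters: replaying the same events in a different sequence can leave the target in a different state, which is why CDC pipelines track log positions.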

Cost-Efficient Data Lakehouse

Build a lakehouse on S3, ADLS, or GCS with high-throughput ingestion, leveraging Iceberg's open format for multi-engine access.

Self-Service Data Pipelines

Enable analysts and engineers to configure and launch replication jobs via the web UI, accelerating time-to-insight without custom code.

Tech snapshot

Go 86%
Java 11%
Shell 2%
Dockerfile 1%
Makefile 1%

Tags

cdc, change-data-capture, hacktoberfest, replication, elt, apache-iceberg, s3, lakehouse, database, data-pipeline, parquet

Frequently asked questions

What databases does OLake support as sources?

OLake supports PostgreSQL, MySQL, MongoDB (full load and CDC), and Oracle (full load and incremental). Kafka source support is in development.

Does OLake require Spark, Flink, or Kafka?

No. OLake is infrastructure-light and does not depend on Spark, Flink, Kafka, or Debezium, reducing operational overhead and cost.

Which Iceberg catalogs are supported?

OLake supports AWS Glue, Hive, JDBC, and REST catalogs, including Nessie, Polaris, Unity Catalog, Lakekeeper, and AWS S3 Tables.

How do I deploy OLake?

Use Docker Compose for a quick start with the web UI, or deploy with standalone Docker, on Kubernetes with Helm, or through Airflow on EC2 or Kubernetes.

What are the benchmark results?

OLake achieves 235K RPS for PostgreSQL full loads (15.9× faster than Debezium) and 64K RPS for MySQL (9× faster than Airbyte). Fully reproducible reports are forthcoming.
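
For a sense of scale, the speedup ratios above imply a baseline throughput for the comparison tools. The short Python sketch below only divides the quoted figures; the resulting baselines (roughly 14.8K RPS for Debezium and 7.1K RPS for Airbyte) are derived here, not published measurements, and the function name is made up for this example.

```python
def implied_baseline(rps: float, speedup: float) -> float:
    """Back out the comparison tool's throughput from a speedup ratio."""
    return rps / speedup


# Figures quoted above: 235K RPS at 15.9x (PostgreSQL vs Debezium)
# and 64K RPS at 9x (MySQL vs Airbyte).
postgres_baseline = implied_baseline(235_000, 15.9)
mysql_baseline = implied_baseline(64_000, 9)

print(f"Implied Debezium baseline: {postgres_baseline:,.0f} RPS")
print(f"Implied Airbyte baseline:  {mysql_baseline:,.0f} RPS")
```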

Project at a glance

Active
Stars: 1,272
Watchers: 1,272
Forks: 195
License: Apache-2.0
Repo age: 1 year old
Last commit: 2 days ago
Primary language: Go

Last synced yesterday