chDB

In-process SQL OLAP engine powered by ClickHouse

Embedded SQL analytics engine bringing ClickHouse's columnar performance directly into Python applications without external dependencies or separate server installations.

Overview

What is chDB?

chDB is an embedded SQL OLAP engine that runs the ClickHouse engine inside your Python applications as an in-process library. Unlike traditional database deployments, chDB requires no separate server installation, configuration, or network overhead: run pip install chdb and start querying.

Who Should Use chDB?

chDB is designed for data engineers, analysts, and Python developers who need high-performance analytical queries on local data without the operational complexity of managing database infrastructure. Whether you're processing Parquet files, transforming Pandas DataFrames, or running ad-hoc analytics, chDB delivers ClickHouse-grade performance with minimal setup.

Core Capabilities

chDB supports 60+ data formats including Parquet, CSV, JSON, Arrow, and ORC with zero-copy data access via Python memoryview. It offers multiple query interfaces: a DB-API 2.0 compliant connection API, direct file querying, stateful sessions with persistent tables and views, and Python UDF support for custom transformations. Query results can be returned as DataFrames, JSON, CSV, or any ClickHouse-supported format.

Deployment

Available via pip for Python 3.8+ on macOS and Linux (x86_64 and ARM64). Use in-memory mode for ephemeral analytics or file-based mode for persistent storage across sessions.

Highlights

Zero-installation embedded ClickHouse engine with no separate server required

Native support for 60+ formats including Parquet, Arrow, ORC, CSV, and JSON

Zero-copy data transfer between C++ and Python via memoryview for maximum performance

Multiple query interfaces: DB-API 2.0, direct file queries, stateful sessions, and Python UDFs

Pros

No infrastructure overhead: runs entirely in-process within Python applications
ClickHouse-powered columnar performance for analytical workloads on local data
Seamless integration with Pandas DataFrames and common data science workflows
Supports both ephemeral in-memory and persistent file-based storage modes

Considerations

Limited to single-process execution without distributed query capabilities
Requires Python 3.8+ and runs only on macOS and Linux (x86_64 and ARM64)
UDFs are stateless and line-by-line, unsuitable for complex aggregations
Lacks the full feature set and scalability of a standalone ClickHouse cluster

Managed products teams compare with

When teams consider chDB, these hosted platforms usually appear on the same shortlist.

Amazon Redshift

Fully managed, petabyte-scale cloud data warehouse for analytics and reporting

Azure Synapse Analytics

Limitless analytics platform unifying enterprise data warehousing and big data analytics in a single service

Google BigQuery

Serverless, highly scalable cloud data warehouse

Fit guide

Great for

Local analytics on files without spinning up database infrastructure
Embedded analytics in Python applications requiring fast OLAP queries
Data transformation pipelines processing Parquet, Arrow, or CSV at scale
Prototyping and testing ClickHouse queries before production deployment

Not ideal for

Multi-user concurrent access or client-server database architectures
Distributed queries across multiple nodes or large-scale data warehousing
Applications requiring OLTP workloads or transactional consistency guarantees
Production systems needing high availability, replication, or failover

How teams use it

Ad-hoc Parquet Analysis

Query multi-gigabyte Parquet files directly from disk with SQL, returning results as Pandas DataFrames without ETL pipelines or database imports.

Embedded Application Analytics

Integrate real-time OLAP queries into Python applications for dashboards, reporting, or user-facing analytics without external database dependencies.

DataFrame Join Acceleration

Perform complex joins and aggregations on multiple Pandas DataFrames using SQL, leveraging ClickHouse's columnar engine for 10-100x speedups over native Pandas.

ETL Pipeline Prototyping

Test and validate ClickHouse SQL transformations locally before deploying to production clusters, using identical query syntax and behavior.

Tech snapshot

C++: 82%
Assembly: 10%
C: 3%
Python: 2%
CMake: 1%
Jupyter Notebook: 1%

Frequently asked questions

Does chDB require a separate ClickHouse server installation?

No. chDB embeds the ClickHouse engine directly into Python as a library. Run pip install chdb and start querying; no server setup, configuration files, or network ports are required.

What data formats can chDB query?

chDB supports 60+ formats including Parquet, CSV, JSON, Arrow, ORC, and all formats supported by ClickHouse. You can query files directly from disk or work with in-memory data structures like Pandas DataFrames.

Can I persist data across sessions?

Yes. Use file-based connections (e.g., chdb.connect('test.db')) to create persistent databases with tables and views that survive across Python sessions. In-memory mode (:memory:) is ephemeral.

How does chDB compare to DuckDB?

Both are embedded OLAP engines. chDB brings ClickHouse's columnar engine and SQL dialect to Python, while DuckDB has its own engine. Choose chDB if you need ClickHouse compatibility or plan to migrate queries to a ClickHouse cluster.

What are the limitations of Python UDFs in chDB?

UDFs must be stateless, pure Python functions that process input line-by-line. They do not support user-defined aggregations (UDAFs) or stateful operations. All inputs are strings (tab-separated), and you must handle type conversions manually.

Project at a glance

Active

Stars: 2,628
Watchers: 2,628
Forks: 103

License: Apache-2.0
Repo age: 3 years old
Last commit: 3 days ago
Primary language: Python

Last synced 5 hours ago