Apache Doris

High-performance real-time analytical database with MPP architecture

Apache Doris is an MPP-based analytical database delivering sub-second query responses on massive datasets, supporting high-concurrency queries and complex analysis for real-time data warehousing.

Overview

Apache Doris is a high-performance, real-time analytical database built on MPP (Massively Parallel Processing) architecture. It delivers sub-second query responses on massive datasets while supporting both high-concurrency point queries and high-throughput complex analysis scenarios.

Architecture & Capabilities

Doris uses a storage-compute integrated architecture with two core components: Frontend (FE) nodes handle query parsing, metadata management, and request routing, while Backend (BE) nodes manage data storage and query execution. Both components scale horizontally to support hundreds of machines and tens of petabytes of storage.
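The split between planning and execution can be pictured with a small scatter-gather sketch in Python. This is not Doris code: the shard names, `backend_partial_sum`, and `run_query` are hypothetical stand-ins, and the final merge is shown inline only for brevity (where a real MPP engine performs it depends on the query plan).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data shards, one per backend node (illustrative only).
SHARDS = {
    "be-1": [("cn", 10), ("us", 5)],
    "be-2": [("cn", 7), ("de", 3)],
    "be-3": [("us", 1), ("de", 4)],
}


def backend_partial_sum(rows):
    """Stand-in for a BE node: aggregate only the rows stored locally."""
    partial = {}
    for key, value in rows:
        partial[key] = partial.get(key, 0) + value
    return partial


def run_query(shards):
    """Stand-in for the coordinating step: fan out, then merge partials."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(backend_partial_sum, shards.values()))
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged


totals = run_query(SHARDS)
```

Adding a shard to `SHARDS` adds another parallel worker without changing the merge logic, which is the property that lets this style of architecture scale horizontally.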

The database is highly compatible with the MySQL protocol and supports standard SQL syntax, including most MySQL and Hive functions. Columnar storage and a vectorized execution engine improve query performance and compression ratios, while the Pipeline execution model keeps resource utilization efficient.

Use Cases

Apache Doris excels at real-time reporting, ad-hoc analysis, user behavior analytics, and lakehouse query acceleration. Organizations use it to build unified data warehouses, accelerate data lake queries through federated analytics, and perform log analysis for observability. Typical applications include A/B testing platforms, user profiling, order analysis, and real-time dashboards that ingest data from upstream transactional databases within seconds.

Highlights

Sub-second query response times on massive datasets with MPP architecture

MySQL protocol compatibility with standard SQL and seamless BI tool integration

Storage-compute integrated architecture with horizontal scalability to petabyte scale

Real-time data ingestion from upstream databases with latency measured in seconds

Pros

Extreme query performance with vectorized execution and columnar storage
Simple two-component architecture reduces operational complexity
High availability through multi-replica storage and quorum-based consistency
Unified lakehouse support for federated queries across multiple data sources

Considerations

Storage-compute integrated architecture may limit independent scaling flexibility
Requires careful capacity planning for both FE and BE node deployment
Learning curve for optimizing data modeling approaches and materialized views
Resource-intensive for small-scale deployments compared to simpler databases

Managed products teams compare with

When teams consider Apache Doris, these hosted platforms usually appear on the same shortlist.

Amazon Redshift

Fully managed, petabyte-scale cloud data warehouse for analytics and reporting

Azure Synapse Analytics

Limitless analytics platform unifying enterprise data warehousing and big data analytics in a single service

Google BigQuery

Serverless, highly scalable cloud data warehouse

Fit guide

Great for

Organizations needing real-time dashboards and sub-second analytical queries
Teams building unified data warehouses with lakehouse query acceleration
Enterprises requiring high-concurrency user-facing analytics applications
Data teams familiar with MySQL seeking scalable OLAP capabilities

Not ideal when

Small datasets where simpler databases provide sufficient performance
Workloads requiring frequent small transactional updates (OLTP)
Teams needing complete separation of storage and compute resources
Projects with limited infrastructure for distributed system management

How teams use it

Real-Time Business Dashboards

Deliver sub-second reporting and decision-making dashboards with real-time data ingestion from transactional databases, enabling automated business processes and instant insights.

User Behavior Analytics Platform

Analyze user participation, retention, and conversion patterns with multidimensional ad-hoc queries, supporting population insights and targeted audience selection for marketing campaigns.

Lakehouse Query Acceleration

Accelerate queries across data lakes (Hive, Iceberg, Hudi) using federated analytics, eliminating data silos and simplifying architecture while maintaining data lake management capabilities.

Log Analysis for Observability

Perform real-time or batch analysis of distributed system logs and events to identify performance bottlenecks, troubleshoot issues, and optimize system reliability.

Tech snapshot

Java: 48%
C++: 44%
Python: 5%
Shell: 1%
Thrift: 1%
C: 1%

Frequently asked questions

What is the difference between FE and BE nodes in Apache Doris?

Frontend (FE) nodes handle query parsing, metadata management, and request routing, while Backend (BE) nodes manage data storage and query execution. Both scale horizontally and work together in the storage-compute integrated architecture.

How does Apache Doris achieve high availability?

Doris stores both metadata and data in multiple replicas and synchronizes them with a quorum protocol. It supports Master, Follower, and Observer FE roles for disaster recovery, and automatically isolates faulty nodes to maintain cluster availability.
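As a rough illustration of what a quorum rule implies, the Python sketch below assumes the common majority form of quorum (more than half of the replicas must acknowledge). The text above only says "quorum protocol", so the majority rule and the function names here are assumptions, not a description of Doris internals.

```python
def majority(replica_count):
    """Smallest number of replicas that forms a strict majority."""
    return replica_count // 2 + 1


def write_committed(acks, replica_count):
    """A write counts as committed once a majority has acknowledged it."""
    if not 0 <= acks <= replica_count:
        raise ValueError("acks must be between 0 and replica_count")
    return acks >= majority(replica_count)


def tolerated_failures(replica_count):
    """How many replicas can be lost while a majority is still reachable."""
    return replica_count - majority(replica_count)
```

Under this assumption, three replicas tolerate one failure and five tolerate two, which is why odd replica counts are the usual choice for quorum-based systems.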

Can I use existing MySQL tools with Apache Doris?

Yes. Apache Doris is highly compatible with the MySQL protocol and supports standard SQL syntax, including most MySQL and Hive functions. You can connect with MySQL client tools and integrate with BI reporting and data transmission tools.
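For example, the stock `mysql` command-line client can be pointed at an FE node. The Python helper below only assembles the command; the host name and user are hypothetical, and 9030 is the usual default FE query port, so verify it against your cluster's configuration before relying on it.

```python
import shlex


def mysql_client_command(host, user, port=9030, database=None, query=None):
    """Build an argv list for the stock `mysql` command-line client."""
    argv = ["mysql", "-h", host, "-P", str(port), "-u", user]
    if database:
        argv += ["-D", database]  # select a default database
    if query:
        argv += ["-e", query]  # run one statement and exit
    return argv


# Hypothetical FE host and user, for illustration only.
argv = mysql_client_command("fe.example.internal", "analyst", query="SHOW DATABASES")
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run` on a machine that has the MySQL client installed and network access to the FE node.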

What data modeling approaches does Apache Doris support?

Doris offers flexible modeling including wide table models, pre-aggregation models, and star/snowflake schemas. You can flatten data during import or perform modeling through views, materialized views, and real-time multi-table joins.
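The idea behind a pre-aggregation model can be sketched in a few lines of Python: rows that share the same key columns are merged at load time, with one aggregation function per metric column. This is a conceptual illustration only; `pre_aggregate`, the column names, and the sample rows are hypothetical and do not reflect how Doris stores data.

```python
def pre_aggregate(rows, key_columns, aggregations):
    """Collapse rows that share the same key, applying one function per metric."""
    table = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in table:
            # First row for this key: store its metric values as-is.
            table[key] = {c: row[c] for c in aggregations}
        else:
            stored = table[key]
            for column, func in aggregations.items():
                stored[column] = func(stored[column], row[column])
    return table


rows = [
    {"day": "2024-01-01", "user": "a", "cost": 5, "last_seen": 10},
    {"day": "2024-01-01", "user": "a", "cost": 7, "last_seen": 42},
    {"day": "2024-01-01", "user": "b", "cost": 1, "last_seen": 8},
]
# Sum the cost column, keep the maximum of last_seen.
AGGS = {"cost": lambda x, y: x + y, "last_seen": max}
result = pre_aggregate(rows, ["day", "user"], AGGS)
```

The trade-off is that queries over the key columns read fewer rows, while the original row-level detail is no longer available, which is why the choice of model depends on the queries you expect to run.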

How quickly can Apache Doris ingest data from upstream sources?

Apache Doris can capture incremental changes from upstream transactional databases within seconds, which supports real-time data warehouse scenarios.

Project at a glance

Active

Stars: 15,077
Watchers: 15,077
Forks: 3,724

License: Apache-2.0
Repo age: 8 years old
Last commit: 19 hours ago
Primary language: Java

Last synced 11 hours ago
