Apache Doris

High-performance real-time analytical database with MPP architecture

Apache Doris is an MPP-based analytical database delivering sub-second query responses on massive datasets, supporting high-concurrency queries and complex analysis for real-time data warehousing.

Overview

Apache Doris is a high-performance, real-time analytical database built on MPP (Massively Parallel Processing) architecture. It delivers sub-second query responses on massive datasets while supporting both high-concurrency point queries and high-throughput complex analysis scenarios.

Architecture & Capabilities

Doris uses a storage-compute integrated architecture with two core components: Frontend (FE) nodes handle query parsing, metadata management, and request routing, while Backend (BE) nodes manage data storage and query execution. Both components scale horizontally to support hundreds of machines and tens of petabytes of storage.
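The division of labor described above can be illustrated with a toy scatter-gather aggregation. This is only a sketch of the general MPP idea, not Doris's actual planner or execution engine: the functions `partial_sum` and `scatter_gather_sum` and the sample shards are hypothetical, with threads standing in for BE nodes and the merging function standing in for a coordinator.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partial_sum(rows):
    """Work done by one hypothetical BE node: aggregate only its local rows."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def scatter_gather_sum(shards):
    """Hypothetical coordinator: fan out to every shard, then merge the partials."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(partial_sum, shards))
    merged = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)


# Three shards standing in for data stored on three BE nodes (made-up sample data).
shards = [
    [("us", 10), ("eu", 5)],
    [("us", 7), ("apac", 3)],
    [("eu", 1), ("apac", 4)],
]
result = scatter_gather_sum(shards)
print(result)
```

Each shard is aggregated independently, so adding shards (and workers) spreads the scan across more machines, while the final merge only touches the small partial results.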

The database is highly compatible with the MySQL protocol and supports standard SQL syntax, including most MySQL and Hive functions. Its vectorized, columnar storage engine improves query performance and compression ratios, while the Pipeline execution model keeps CPU resources efficiently utilized.

Use Cases

Apache Doris excels in real-time reporting, ad-hoc analysis, user behavior analytics, and lakehouse query acceleration. Organizations use it to build unified data warehouses, accelerate data lake queries through federated analytics, and perform log analysis for observability. Typical applications include A/B testing platforms, user profiling, order analysis, and real-time dashboards fed by second-level data ingestion from upstream transactional databases.

Highlights

  • Sub-second query response times on massive datasets with MPP architecture
  • MySQL protocol compatibility with standard SQL and seamless BI tool integration
  • Storage-compute integrated architecture with horizontal scalability to petabyte scale
  • Real-time data ingestion with second-level latency from upstream databases

Pros

  • Extreme query performance with vectorized execution and columnar storage
  • Simple two-component architecture reduces operational complexity
  • High availability through multi-replica storage and quorum-based consistency
  • Unified lakehouse support for federated queries across multiple data sources

Considerations

  • Storage-compute integrated architecture may limit independent scaling flexibility
  • Requires careful capacity planning for both FE and BE node deployment
  • Learning curve for optimizing data modeling approaches and materialized views
  • Resource-intensive for small-scale deployments compared to simpler databases

Managed products teams compare with

When teams consider Apache Doris, these hosted platforms usually appear on the same shortlist.

Amazon Redshift

Fully managed, petabyte-scale cloud data warehouse for analytics and reporting

Azure Synapse Analytics

Limitless analytics platform unifying enterprise data warehousing and big data analytics in a single service

Google BigQuery

Serverless, highly scalable cloud data warehouse

Fit guide

Great for

  • Organizations needing real-time dashboards and sub-second analytical queries
  • Teams building unified data warehouses with lakehouse query acceleration
  • Enterprises requiring high-concurrency user-facing analytics applications
  • Data teams familiar with MySQL seeking scalable OLAP capabilities

Not ideal when

  • Small datasets where simpler databases provide sufficient performance
  • Workloads requiring frequent small transactional updates (OLTP)
  • Teams needing complete separation of storage and compute resources
  • Projects with limited infrastructure for distributed system management

How teams use it

Real-Time Business Dashboards

Deliver sub-second reporting and decision-making dashboards with real-time data ingestion from transactional databases, enabling automated business processes and instant insights.

User Behavior Analytics Platform

Analyze user participation, retention, and conversion patterns with multidimensional ad-hoc queries, supporting population insights and targeted audience selection for marketing campaigns.

Lakehouse Query Acceleration

Accelerate queries across data lakes (Hive, Iceberg, Hudi) using federated analytics, eliminating data silos and simplifying architecture while maintaining data lake management capabilities.

Log Analysis for Observability

Perform real-time or batch analysis of distributed system logs and events to identify performance bottlenecks, troubleshoot issues, and optimize system reliability.

Tech snapshot

Java 48%
C++ 44%
Python 5%
Shell 1%
Thrift 1%
C 1%

Tags

ai, real-time, iceberg, query-engine, delta-lake, dbt, snowflake, spark, bigquery, olap, sql, elt, paimon, agent, redshift, lakehouse, database, hudi

Frequently asked questions

What is the difference between FE and BE nodes in Apache Doris?

Frontend (FE) nodes handle query parsing, metadata management, and request routing, while Backend (BE) nodes manage data storage and query execution. Both scale horizontally and work together in the storage-compute integrated architecture.

How does Apache Doris achieve high availability?

Doris stores both metadata and data with multiple replicas, using a quorum protocol for synchronization. It supports Master, Follower, and Observer FE roles for disaster recovery, and automatically isolates faulty nodes to maintain cluster availability.
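To make the quorum idea concrete, here is a minimal sketch of the arithmetic, assuming a simple majority quorum (the usual meaning of the term). The exact rules Doris applies to metadata and data replicas may differ, and the helper names below are hypothetical.

```python
def majority(replica_count):
    """Smallest number of replicas that forms a strict majority."""
    return replica_count // 2 + 1


def write_succeeds(acks, replica_count):
    """A write is accepted once a majority of replicas acknowledge it (assumed rule)."""
    return acks >= majority(replica_count)


def tolerated_failures(replica_count):
    """How many replicas can be unavailable while a majority is still reachable."""
    return replica_count - majority(replica_count)


for n in (1, 3, 5):
    print(n, "replicas -> quorum", majority(n), "tolerates", tolerated_failures(n), "failures")
```

Under this assumption, three replicas survive the loss of one node and five survive the loss of two, which is why odd replica counts are the common choice.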

Can I use existing MySQL tools with Apache Doris?

Yes. Apache Doris is highly compatible with the MySQL protocol and supports standard SQL syntax, including most MySQL and Hive functions. You can connect with MySQL client tools and integrate with BI reporting and data transmission tools.
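As an illustration, a standard MySQL driver can be pointed at an FE node. The sketch below uses the third-party PyMySQL package. The host name, the `demo` database, and the `root` user with an empty password are placeholders, and port 9030 is assumed to be the FE's MySQL-protocol query port, so check your own cluster configuration.

```python
def doris_connection_params(fe_host, port=9030, user="root", password="", database=None):
    """Build keyword arguments for a MySQL-protocol driver (all defaults are assumptions)."""
    params = {"host": fe_host, "port": port, "user": user, "password": password}
    if database is not None:
        params["database"] = database
    return params


def run_query(params, sql):
    """Run one statement through PyMySQL and return all rows."""
    import pymysql  # third-party driver, imported lazily so the helper above works without it

    conn = pymysql.connect(**params)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()


# Placeholder host and database; replace with a real FE address.
params = doris_connection_params("fe.example.internal", database="demo")
print(params)
# rows = run_query(params, "SELECT 1")  # requires a reachable cluster and PyMySQL installed
```

Because only the wire protocol matters to the driver, the same parameters should work in other MySQL-compatible clients and BI tools.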

What data modeling approaches does Apache Doris support?

Doris offers flexible modeling including wide table models, pre-aggregation models, and star/snowflake schemas. You can flatten data during import or perform modeling through views, materialized views, and real-time multi-table joins.
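The pre-aggregation idea can be sketched in a few lines: rows that share the same key are merged at ingest time, so queries scan fewer rows. This is a toy model under stated assumptions, not Doris's storage format. The column names, the choice of SUM and MAX as merge functions, and the `ingest` helper are all hypothetical.

```python
def ingest(table, row):
    """Merge a row into the table: SUM the cost, keep the MAX timestamp per key."""
    key = (row["date"], row["user_id"])
    current = table.get(key)
    if current is None:
        table[key] = {"cost": row["cost"], "last_seen": row["ts"]}
    else:
        current["cost"] += row["cost"]
        current["last_seen"] = max(current["last_seen"], row["ts"])


# Made-up sample events; two of them share the same (date, user_id) key.
events = [
    {"date": "2024-01-01", "user_id": 1, "cost": 20, "ts": 100},
    {"date": "2024-01-01", "user_id": 1, "cost": 15, "ts": 250},
    {"date": "2024-01-01", "user_id": 2, "cost": 5, "ts": 120},
]

table = {}
for event in events:
    ingest(table, event)
print(table)
```

Three input rows collapse into two stored rows here. The trade-off is that the original detail rows are no longer available, which is why wide tables or star/snowflake schemas may suit workloads that need them.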

How quickly can Apache Doris ingest data from upstream sources?

Apache Doris provides second-level data ingestion capabilities, capturing incremental changes from upstream transactional databases within seconds to support real-time data warehouse scenarios.

Project at a glance

Status: Active
Stars: 14,906
Watchers: 14,906
Forks: 3,688
License: Apache-2.0
Repo age: 8 years old
Last commit: yesterday
Primary language: Java

Last synced yesterday