
OpenMLDB

SQL‑driven feature platform delivering millisecond real‑time ML features

OpenMLDB lets data teams define, test, and deploy ML features with SQL, ensuring consistency between offline training and online inference while delivering ultra‑low latency real‑time features at scale.

Overview

Consistent, SQL‑First Feature Engineering

OpenMLDB targets the data‑centric work that, by common estimates, accounts for some 95% of an AI project's effort: engineers write feature logic once in SQL and reuse it for both offline model training and online inference. A unified execution plan guarantees that the same feature definitions produce identical results in both modes, eliminating online/offline skew, data leakage, and costly back‑filling.
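The write‑once pattern can be sketched in plain Python, with SQLite standing in for OpenMLDB's batch and real‑time engines; the table, columns, and feature name below are invented for illustration, not OpenMLDB's API:

```python
import sqlite3

# Illustrative only: SQLite plays the role of both engines here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (user_id TEXT, fare REAL, ts INTEGER)")
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)",
                 [("u1", 10.0, 1), ("u1", 20.0, 2), ("u1", 30.0, 3)])

# One feature definition, written once...
FEATURE_SQL = """
SELECT user_id, ts,
       AVG(fare) OVER (PARTITION BY user_id ORDER BY ts
                       ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS avg_fare_3
FROM trips
"""

# ...used offline to build a training set,
training_rows = conn.execute(FEATURE_SQL + " ORDER BY user_id, ts").fetchall()

# ...and online to fetch the latest feature value for a request.
latest = conn.execute(
    f"SELECT avg_fare_3 FROM ({FEATURE_SQL}) "
    "WHERE user_id = ? ORDER BY ts DESC LIMIT 1", ("u1",)).fetchone()

print(training_rows)  # [('u1', 1, 10.0), ('u1', 2, 15.0), ('u1', 3, 20.0)]
print(latest)         # (20.0,)
```

Because both paths evaluate the identical `FEATURE_SQL` string, the training set and the serving lookup cannot drift apart; this is the property OpenMLDB enforces at the execution‑plan level.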

Ultra‑Low Latency Real‑Time Serving

A purpose‑built real‑time SQL engine processes time‑series data in a few milliseconds, while a batch engine (based on a tailored Spark distribution) handles large‑scale offline jobs. Deployment follows three simple steps: develop features offline with SQL, deploy them online with a single command, and configure a real‑time data source. Built‑in enterprise capabilities—distributed storage, fault recovery, high availability, seamless scaling, monitoring, and heterogeneous memory support—make OpenMLDB production‑ready for recommendation systems, risk analytics, finance, IoT, and more.

Highlights

Consistent feature generation for training and inference via unified execution plan
Real‑time SQL engine produces features in a few milliseconds
SQL‑first feature definition with extensions like LAST JOIN and WINDOW UNION
Enterprise‑grade reliability: distributed storage, fault recovery, high availability, and seamless scaling
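LAST JOIN, one of the SQL extensions mentioned above, joins each request row to the single most recent matching row by an ordering key; WINDOW UNION similarly merges rows from secondary tables into a window. A hedged sketch of LAST JOIN semantics (not OpenMLDB's implementation; the data is invented):

```python
from bisect import bisect_right

# Right table: symbol -> price history sorted by timestamp.
prices = {
    "AAPL": [(1, 100.0), (5, 105.0), (9, 103.0)],
}

def last_join(symbol, ts):
    """Return the latest price at or before ts, or None if none exists yet."""
    history = prices.get(symbol, [])
    i = bisect_right(history, (ts, float("inf")))
    return history[i - 1][1] if i else None

print(last_join("AAPL", 6))   # 105.0  (ts=5 is the latest row <= 6)
print(last_join("AAPL", 0))   # None   (no row at or before ts=0)
```

A plain LEFT JOIN would return every matching row; LAST JOIN's "at most one, and the most recent" contract is what makes it point‑in‑time correct for feature lookups.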

Pros

  • Reduces engineering effort by unifying offline and online feature pipelines
  • Ultra‑low latency suitable for real‑time recommendation and risk analytics
  • Leverages familiar SQL, lowering the learning curve for data scientists
  • Built‑in production features (HA, scaling, monitoring) support enterprise deployments

Considerations

  • Requires SQL expertise; non‑SQL environments may need adaptation
  • Real‑time engine optimized for time‑series data; other patterns may see less benefit
  • Limited to provided streaming connectors; custom integrations may require development
  • Community size smaller than major commercial feature stores, potentially affecting support speed

Managed products teams compare with

When teams consider OpenMLDB, these hosted platforms usually appear on the same shortlist.


Amazon SageMaker Feature Store

Fully managed repository to create, store, share, and serve ML features


Databricks Feature Store

Feature registry with governance, lineage, and MLflow integration


Tecton Feature Store

Central hub to manage, govern, and serve ML features across batch, streaming, and real time

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Teams needing consistent features across training and serving
  • Applications demanding sub‑10 ms real‑time feature computation
  • Organizations preferring SQL‑centric feature engineering workflows
  • Enterprises seeking an on‑premise, open‑source alternative to SaaS feature stores

Not ideal when

  • Projects that rely heavily on Python‑based feature pipelines without SQL conversion
  • Use cases where batch‑only feature computation suffices and ultra‑low latency is unnecessary
  • Environments lacking expertise in managing distributed SQL engines
  • Scenarios requiring out‑of‑the‑box connectors for niche streaming platforms not yet supported

How teams use it

NYC Taxi Trip Duration Prediction

End‑to‑end ML pipeline built with OpenMLDB and LightGBM to predict ride duration, demonstrating rapid feature development and deployment.

Real‑time Data Ingestion from Apache Kafka

Seamless import of streaming events into OpenMLDB via the Kafka connector, enabling millisecond‑level feature computation for online services.
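The ingest‑then‑compute loop behind this pattern can be sketched generically; the Kafka consumer and the OpenMLDB connector are omitted, and the events and 10‑second window below are invented:

```python
from collections import deque

WINDOW_SECONDS = 10
events = deque()  # (ts, amount) pairs currently inside the window

def ingest(ts, amount):
    """Append an event, evict expired ones, and return a rolling-sum feature."""
    events.append((ts, amount))
    while events and events[0][0] <= ts - WINDOW_SECONDS:
        events.popleft()
    return sum(a for _, a in events)

stream = [(1, 5.0), (4, 3.0), (12, 2.0)]  # stand-ins for decoded Kafka records
features = [ingest(ts, amt) for ts, amt in stream]
print(features)  # [5.0, 8.0, 5.0]
```

In OpenMLDB the window maintenance and aggregation are expressed in SQL and executed by the real‑time engine; the connector's job is only to deliver each record, as in the loop above.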

Real‑time Data Ingestion from Apache Pulsar

Pulsar streams are ingested through the OpenMLDB‑Pulsar connector, supporting low‑latency feature serving in cloud‑native environments.

End‑to‑end ML Pipelines in DolphinScheduler

Automated scheduling of feature engineering, model training, and deployment using DolphinScheduler integrated with OpenMLDB.

Tech snapshot

C++ 74%
Java 18%
Scala 3%
Python 3%
Shell 1%
CMake 1%

Tags

mlops · database-for-ai · in-memory-database · featurestore · machine-learning · database-for-machine-learning · machine-learning-database · feature-store · featureops · feature-extraction · feature-engineering

Frequently asked questions

What are the primary use cases of OpenMLDB?

It serves as a feature platform for ML applications requiring ultra‑low latency real‑time features, and also functions as a time‑series database for finance, IoT, and similar domains.

Is OpenMLDB a feature store?

OpenMLDB goes beyond a traditional feature store: it computes fresh features at request time in a few milliseconds, whereas most feature stores only serve pre‑computed offline values.

Why does OpenMLDB use SQL to define features?

SQL is concise, expressive, and already familiar to most data teams, which keeps the learning curve low and eases collaboration; OpenMLDB's extensions (such as LAST JOIN and WINDOW UNION) cover feature‑engineering patterns that standard SQL handles awkwardly.

How does OpenMLDB ensure consistency between training and inference?

A unified execution plan generator creates identical execution plans for both batch and real‑time engines, guaranteeing feature consistency and preventing data leakage.

What deployment steps are required?

Develop features offline with SQL, deploy them online with a single command, and configure a real‑time data source to start serving features.

Project at a glance

Active
Stars: 1,678
Watchers: 1,678
Forks: 324
License: Apache-2.0
Repo age: 4 years old
Last commit: 2 days ago
Primary language: C++

Last synced 12 hours ago