
OpenMLDB

SQL‑driven feature platform delivering millisecond real‑time ML features

OpenMLDB lets data teams define, test, and deploy ML features with SQL, ensuring consistency between offline training and online inference while delivering ultra‑low latency real‑time features at scale.

Overview

Consistent, SQL‑First Feature Engineering

OpenMLDB targets the data‑centric work that, by common estimates, accounts for some 95% of an AI project's effort: engineers write feature logic once in SQL and reuse it for both offline model training and online inference. A unified execution plan guarantees that the same feature definitions produce identical results in both modes, eliminating online/offline skew, data leakage, and costly back‑filling.
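The write‑once pattern can be sketched in plain Python, with SQLite standing in for OpenMLDB's batch and real‑time engines; the table, columns, and feature name below are invented for illustration, not OpenMLDB's API:

```python
import sqlite3

# Illustrative only: SQLite plays the role of both engines here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (user_id TEXT, fare REAL, ts INTEGER)")
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)",
                 [("u1", 10.0, 1), ("u1", 20.0, 2), ("u1", 30.0, 3)])

# One feature definition, written once...
FEATURE_SQL = """
SELECT user_id, ts,
       AVG(fare) OVER (PARTITION BY user_id ORDER BY ts
                       ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS avg_fare_3
FROM trips
"""

# ...used offline to build a training set,
training_rows = conn.execute(FEATURE_SQL + " ORDER BY user_id, ts").fetchall()

# ...and online to fetch the latest feature value for a request.
latest = conn.execute(
    f"SELECT avg_fare_3 FROM ({FEATURE_SQL}) "
    "WHERE user_id = ? ORDER BY ts DESC LIMIT 1", ("u1",)).fetchone()

print(training_rows)  # [('u1', 1, 10.0), ('u1', 2, 15.0), ('u1', 3, 20.0)]
print(latest)         # (20.0,)
```

Because both paths evaluate the identical `FEATURE_SQL` string, the training set and the serving lookup cannot drift apart; this is the property OpenMLDB enforces at the execution‑plan level.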

Ultra‑Low Latency Real‑Time Serving

A purpose‑built real‑time SQL engine processes time‑series data in a few milliseconds, while a batch engine (based on a tailored Spark distribution) handles large‑scale offline jobs. Deployment follows three simple steps: develop features offline with SQL, deploy them online with a single command, and configure a real‑time data source. Built‑in enterprise capabilities—distributed storage, fault recovery, high availability, seamless scaling, monitoring, and heterogeneous memory support—make OpenMLDB production‑ready for recommendation systems, risk analytics, finance, IoT, and more.

Highlights

Consistent feature generation for training and inference via unified execution plan
Real‑time SQL engine produces features in a few milliseconds
SQL‑first feature definition with extensions like LAST JOIN and WINDOW UNION
Enterprise‑grade reliability: distributed storage, fault recovery, high availability, and seamless scaling
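LAST JOIN, one of the SQL extensions mentioned above, joins each request row to the single most recent matching row by an ordering key; WINDOW UNION similarly merges rows from secondary tables into a window. A hedged sketch of LAST JOIN semantics (not OpenMLDB's implementation; the data is invented):

```python
from bisect import bisect_right

# Right table: symbol -> price history sorted by timestamp.
prices = {
    "AAPL": [(1, 100.0), (5, 105.0), (9, 103.0)],
}

def last_join(symbol, ts):
    """Return the latest price at or before ts, or None if none exists yet."""
    history = prices.get(symbol, [])
    i = bisect_right(history, (ts, float("inf")))
    return history[i - 1][1] if i else None

print(last_join("AAPL", 6))   # 105.0  (ts=5 is the latest row <= 6)
print(last_join("AAPL", 0))   # None   (no row at or before ts=0)
```

A plain LEFT JOIN would return every matching row; LAST JOIN's "at most one, and the most recent" contract is what makes it point‑in‑time correct for feature lookups.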

Pros

  • Reduces engineering effort by unifying offline and online feature pipelines
  • Ultra‑low latency suitable for real‑time recommendation and risk analytics
  • Leverages familiar SQL, lowering the learning curve for data scientists
  • Built‑in production features (HA, scaling, monitoring) support enterprise deployments

Considerations

  • Requires SQL expertise; non‑SQL environments may need adaptation
  • Real‑time engine optimized for time‑series data; other patterns may see less benefit
  • Limited to provided streaming connectors; custom integrations may require development
  • Community size smaller than major commercial feature stores, potentially affecting support speed

Managed products teams compare with

When teams consider OpenMLDB, these hosted platforms usually appear on the same shortlist.


Amazon SageMaker Feature Store

Fully managed repository to create, store, share, and serve ML features


Databricks Feature Store

Feature registry with governance, lineage, and MLflow integration


Tecton Feature Store

Central hub to manage, govern, and serve ML features across batch, streaming, and real time

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Teams needing consistent features across training and serving
  • Applications demanding sub‑10 ms real‑time feature computation
  • Organizations preferring SQL‑centric feature engineering workflows
  • Enterprises seeking an on‑premise, open‑source alternative to SaaS feature stores

Not ideal when

  • Projects that rely heavily on Python‑based feature pipelines without SQL conversion
  • Use cases where batch‑only feature computation suffices and ultra‑low latency is unnecessary
  • Environments lacking expertise in managing distributed SQL engines
  • Scenarios requiring out‑of‑the‑box connectors for niche streaming platforms not yet supported

How teams use it

NYC Taxi Trip Duration Prediction

End‑to‑end ML pipeline built with OpenMLDB and LightGBM to predict ride duration, demonstrating rapid feature development and deployment.

Real‑time Data Ingestion from Apache Kafka

Seamless import of streaming events into OpenMLDB via the Kafka connector, enabling millisecond‑level feature computation for online services.
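The ingest‑then‑compute loop behind this pattern can be sketched generically; the Kafka consumer and the OpenMLDB connector are omitted, and the events and 10‑second window below are invented:

```python
from collections import deque

WINDOW_SECONDS = 10
events = deque()  # (ts, amount) pairs currently inside the window

def ingest(ts, amount):
    """Append an event, evict expired ones, and return a rolling-sum feature."""
    events.append((ts, amount))
    while events and events[0][0] <= ts - WINDOW_SECONDS:
        events.popleft()
    return sum(a for _, a in events)

stream = [(1, 5.0), (4, 3.0), (12, 2.0)]  # stand-ins for decoded Kafka records
features = [ingest(ts, amt) for ts, amt in stream]
print(features)  # [5.0, 8.0, 5.0]
```

In OpenMLDB the window maintenance and aggregation are expressed in SQL and executed by the real‑time engine; the connector's job is only to deliver each record, as in the loop above.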

Real‑time Data Ingestion from Apache Pulsar

Pulsar streams are ingested through the OpenMLDB‑Pulsar connector, supporting low‑latency feature serving in cloud‑native environments.

End‑to‑end ML Pipelines in DolphinScheduler

Automated scheduling of feature engineering, model training, and deployment using DolphinScheduler integrated with OpenMLDB.

Tech snapshot

C++ 74%
Java 18%
Scala 3%
Python 3%
Shell 1%
CMake 1%

Tags

mlops · database-for-ai · in-memory-database · featurestore · machine-learning · database-for-machine-learning · machine-learning-database · feature-store · featureops · feature-extraction · feature-engineering

Frequently asked questions

What are the primary use cases of OpenMLDB?

It serves as a feature platform for ML applications requiring ultra‑low latency real‑time features, and also functions as a time‑series database for finance, IoT, and similar domains.

Is OpenMLDB a feature store?

OpenMLDB goes beyond a traditional feature store: it computes fresh features at request time in a few milliseconds, whereas most feature stores only serve pre‑computed offline values.

Why does OpenMLDB use SQL to define features?

SQL is concise, expressive, and already familiar to most data teams, which keeps the learning curve low and eases collaboration; OpenMLDB's extensions (such as LAST JOIN and WINDOW UNION) cover feature‑engineering patterns that standard SQL handles awkwardly.

How does OpenMLDB ensure consistency between training and inference?

A unified execution plan generator creates identical execution plans for both batch and real‑time engines, guaranteeing feature consistency and preventing data leakage.

What deployment steps are required?

Develop features offline with SQL, deploy them online with a single command, and configure a real‑time data source to start serving features.

Project at a glance

Active
Stars: 1,678
Watchers: 1,678
Forks: 324
License: Apache-2.0
Repo age: 4 years old
Last commit: 2 days ago
Primary language: C++

Last synced 12 hours ago