Feathr

Scalable feature store for unified data and AI engineering

Feathr provides Pythonic APIs to define, register, and share feature transformations across batch, streaming, and online environments, with point-in-time correctness and native cloud integrations for enterprise AI pipelines.

Overview

Feathr is a data and AI engineering platform used in production at LinkedIn for over six years and now available as an open‑source project under the LF AI & Data Foundation. It lets data scientists define feature transformations with Pythonic APIs, register them by name, and reuse them across teams, ensuring consistent, point‑in‑time‑correct data for model training and online serving.
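A minimal sketch of defining and registering a feature with Feathr's Python client, loosely modeled on the project's public quickstart; exact class names and signatures can vary between versions, and the config path, source location, and column names below are illustrative assumptions.

# Sketch only: config path, source path, and column names are hypothetical.
from feathr import (
    FeathrClient, FeatureAnchor, Feature, TypedKey, HdfsSource, ValueType, FLOAT,
)

client = FeathrClient(config_path="feathr_config.yaml")

trip_source = HdfsSource(
    name="trip_data",
    path="abfss://container@account.dfs.core.windows.net/trips/",
    event_timestamp_column="dropoff_datetime",
    timestamp_format="yyyy-MM-dd HH:mm:ss",
)

location_key = TypedKey(
    key_column="DOLocationID",
    key_column_type=ValueType.INT32,
    description="drop-off location id",
)

f_trip_distance = Feature(
    name="f_trip_distance",
    key=location_key,
    feature_type=FLOAT,
    transform="cast_float(trip_distance)",  # Spark SQL expression
)

anchor = FeatureAnchor(
    name="trip_features",
    source=trip_source,
    features=[f_trip_distance],
)

# Build the definitions locally, then publish them to the registry
# so other teams can discover and reuse them by name.
client.build_features(anchor_list=[anchor])
client.register_features()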

Capabilities & Deployment

Feathr supports batch, streaming, and online workloads with built‑in optimizations that can handle billions of rows and petabyte‑scale datasets. Its native integrations with Databricks and Azure Synapse, along with ARM templates and CLI guides, simplify cloud deployment. Users can start quickly with the Feathr Sandbox Docker container, which includes a UI and Jupyter notebooks for hands‑on experimentation, or install the client via pip for local development.
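A hedged sketch of materializing a registered feature to an online store (Redis) and reading it back at serving time, following the pattern used in the quickstart notebooks; the table name, feature name, key value, and timeout are illustrative assumptions.

# Sketch only: table name, feature name, and key value are hypothetical.
from feathr import FeathrClient, MaterializationSettings, RedisSink

client = FeathrClient(config_path="feathr_config.yaml")

settings = MaterializationSettings(
    name="trip_features_job",
    sinks=[RedisSink(table_name="tripFeatures")],
    feature_names=["f_trip_distance"],
)
client.materialize_features(settings)
client.wait_job_to_finish(timeout_sec=900)

# Low-latency lookup of the latest feature value for a given key.
values = client.get_online_features(
    feature_table="tripFeatures",
    key="247",
    feature_names=["f_trip_distance"],
)
print(values)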

Ecosystem

A built‑in registry and intuitive UI provide feature discovery, lineage tracking, and access control. Rich transformation primitives—including time‑based aggregations, sliding windows, and custom UDFs with PySpark or Spark SQL—enable flexible engineering of complex AI features.
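As an example of these transformation primitives, a sliding-window aggregation can be declared directly on a feature definition. This sketch reuses the location_key from the earlier example; the aggregation expression, function, and 90-day window are illustrative assumptions.

# Sketch only: aggregation expression and window length are illustrative.
from feathr import Feature, WindowAggTransformation, FLOAT

f_location_avg_fare = Feature(
    name="f_location_avg_fare",
    key=location_key,  # TypedKey from the earlier sketch
    feature_type=FLOAT,
    transform=WindowAggTransformation(
        agg_expr="cast_float(fare_amount)",  # Spark SQL expression over the source
        agg_func="AVG",                      # other aggregations such as SUM, COUNT, MAX
        window="90d",                        # sliding window length
    ),
)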

Highlights

  • Pythonic APIs with native PySpark and Spark SQL UDF support
  • Point-in-time correct feature computation for training and online serving
  • Scalable architecture handling billions of rows and petabyte-scale data
  • Built-in registry and UI for feature discovery, lineage, and access control

Pros

  • Proven in production at LinkedIn for over 6 years
  • Unified API works across batch, streaming, and online use cases
  • Native integrations with Databricks and Azure Synapse simplify cloud deployment
  • Extensible with custom UDFs and rich transformation primitives

Considerations

  • Requires a Spark environment, limiting use with non‑Spark stacks
  • Learning curve for advanced point‑in‑time semantics
  • Enterprise‑grade scaling may need substantial cloud resources
  • Documentation can assume familiarity with LinkedIn‑style data pipelines

Managed products teams compare with

When teams consider Feathr, these hosted platforms usually appear on the same shortlist.

Amazon SageMaker Feature Store

Fully managed repository to create, store, share, and serve ML features

Databricks Feature Store

Feature registry with governance, lineage, and MLflow integration

Tecton Feature Store

Central hub to manage, govern, and serve ML features across batch, streaming, and real time

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Data science teams building large‑scale feature pipelines for ML models
  • Enterprises needing reusable, versioned feature definitions across projects
  • Organizations leveraging Azure or Databricks for AI workloads
  • Teams requiring strict data leakage prevention via point‑in‑time joins

Not ideal when

  • Small projects without Spark infrastructure
  • Use cases focused solely on simple data ETL without ML features
  • Environments where a lightweight, non‑distributed feature store is preferred
  • Teams lacking Python or Spark expertise

How teams use it

NYC Taxi fare prediction

Rapidly define, materialize, and serve fare prediction features with point‑in‑time correctness

Fraud detection pipeline

Combine user account and transaction streams into real‑time fraud risk features

Product recommendation system

Generate and serve user‑item interaction features for personalized ranking

Feature embedding for NLP

Create embedding features using transformer models and serve them in online inference

Tech snapshot

Scala 47%
Java 30%
Python 19%
TypeScript 3%
Shell 1%
Dockerfile 1%

Tags

mlops, feature-metadata, feature-marketplace, apache-spark, machine-learning, feature-governance, artificial-intelligence, feature-platform, feature-management, data-quality, feature-store, data-engineering, azure, data-science, feature-engineering

Frequently asked questions

How do I try Feathr locally?

Run the Feathr Sandbox Docker container, which includes UI, Jupyter, and core services, and follow the quickstart notebook.

What languages are supported?

Feathr’s APIs are Pythonic; transformations can be expressed with native PySpark or Spark SQL.

Can Feathr run on cloud platforms?

Yes, it has native integrations with Databricks and Azure Synapse, with deployment guides and ARM templates.

How does Feathr prevent data leakage?

It computes features using point‑in‑time‑correct semantics, ensuring training data only sees information available up to the event timestamp.
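A hedged sketch of the point-in-time join described above, loosely following the quickstart: each observation row is joined only with feature values computed from data at or before its event timestamp. Paths, column names, and feature names are illustrative assumptions.

# Sketch only: paths, timestamp column, and feature names are hypothetical.
from feathr import FeathrClient, FeatureQuery, ObservationSettings

client = FeathrClient(config_path="feathr_config.yaml")

query = FeatureQuery(
    feature_list=["f_trip_distance", "f_location_avg_fare"],
    key=location_key,  # the TypedKey used when the features were defined
)
settings = ObservationSettings(
    observation_path="abfss://container@account.dfs.core.windows.net/observations/",
    event_timestamp_column="dropoff_datetime",
    timestamp_format="yyyy-MM-dd HH:mm:ss",
)
client.get_offline_features(
    observation_settings=settings,
    feature_query=query,
    output_path="abfss://container@account.dfs.core.windows.net/training_data/",
)
client.wait_job_to_finish(timeout_sec=900)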

Is there a UI for feature management?

Feathr includes a web UI for searching, exploring lineage, and managing access to registered features.

Project at a glance

Dormant
Stars
1,926
Watchers
1,926
Forks
243
License Apache-2.0
Repo age 3 years old
Last commit 2 years ago
Primary language Scala

Last synced 2 days ago