Magda

Federated data catalog with scalable search and Kubernetes‑native deployment

Magda unifies datasets, files, APIs, and databases across an organization, offering federated search, automated metadata enrichment, and cloud‑agnostic deployment via Kubernetes.

Overview

Unified Data Discovery

Magda provides a single, federated view of all data assets—whether they reside in files, databases, APIs, or external portals. By crawling sources, enriching metadata, and tracking changes, it enables users to discover, prioritize, and trust the data they need.

Extensible, Cloud‑Agnostic Architecture

Built as Kubernetes‑orchestrated microservices, Magda is deployed via Helm charts and runs on any cloud or on‑premises environment. Its unopinionated Registry stores records as JSON aspects, while connectors and minions—packaged as Docker images—allow custom ingestion, validation, and enrichment in any language. Search is powered by OpenSearch, delivering fast, scalable results.
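The aspect model described above can be pictured with a minimal sketch. It assumes a record is an identifier plus a map from aspect name to a JSON document. The `Record` class, the `put_aspect` helper, and the aspect names and fields are illustrative assumptions, not Magda's actual schema or API.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Record:
    """Hypothetical registry record: an id plus independently written JSON aspects."""
    id: str
    name: str
    aspects: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def put_aspect(self, aspect_id: str, data: Dict[str, Any]) -> None:
        # Each aspect is replaced as a whole, so one writer (say, a connector)
        # does not touch an aspect written by another (say, a minion).
        self.aspects[aspect_id] = data

    def to_json(self) -> str:
        return json.dumps(
            {"id": self.id, "name": self.name, "aspects": self.aspects},
            sort_keys=True,
        )


record = Record(id="ds-1", name="Rainfall observations")
# Aspect names and fields below are assumptions chosen for illustration.
record.put_aspect("basic-metadata", {"title": "Rainfall observations", "format": "CSV"})
record.put_aspect("link-status", {"checked": True, "broken_links": 0})
print(record.to_json())
```

Keeping each concern in its own aspect is what lets new kinds of metadata be added without changing a fixed record schema.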

Ready for Enterprise Scale

Magda runs in production at data.gov.au and is designed for large, heterogeneous environments. It supports federated authentication (Google, Facebook, WSFed, AAF, CKAN, and custom providers). Automated cataloguing, policy‑based authorization with OPA, and native dataset storage are under ongoing development.

Highlights

Scalable OpenSearch‑based search across federated data sources

Kubernetes‑orchestrated microservices with Helm charts for cloud‑agnostic deployment

Extensible metadata registry using dynamic JSON aspects and plug‑in connectors/minions

Federated authentication via passport.js supporting multiple providers

Pros

Handles heterogeneous data sources (files, APIs, databases) in a single catalog
High‑performance search powered by OpenSearch
Extensible via Docker‑based connectors and minions written in any language
Cloud‑agnostic and reproducible deployments with Helm and Kubernetes

Considerations

Requires Kubernetes expertise for installation and upgrades
Documentation may lag behind new features
Dataset storage is still under development
Policy‑based authorization is not yet fully released

Managed products teams compare it with

When teams consider Magda, these hosted platforms usually appear on the same shortlist.

Alation

Data catalog platform for data discovery, governance, and lineage

Ataccama

Unified data management platform combining catalog, governance, data quality, and MDM

Atlan

Modern data catalog and collaborative metadata platform for data discovery and governance


Fit guide

Great for

Large enterprises needing a unified view of internal and external data assets
Government open‑data portals that require federated search across multiple sources
Teams comfortable with container orchestration and seeking extensible metadata handling
Organizations that want to customize catalog behavior via plug‑in connectors or minions

Not ideal for

Small projects without Kubernetes infrastructure or expertise
Users requiring out‑of‑the‑box dataset hosting and storage
Deployments that need fully documented, stable UI features immediately
Environments where a turnkey, single‑binary solution is preferred over microservice architecture

How teams use it

Cross‑agency open data portal

Aggregates datasets from multiple government portals, providing citizens a single searchable interface.

Enterprise data discovery platform

Indexes internal databases, file shares, and APIs, enabling analysts to locate and assess data assets quickly.

Automated metadata enrichment pipeline

Minions validate links, assess data quality, and enrich records, improving search relevance without manual effort.
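The enrichment step described above might look roughly like the sketch below. It assumes a minion reads one aspect of a record and writes its findings back as another aspect. The function name, the `distribution-links` and `link-status` aspect names, and the injected `is_reachable` check are all hypothetical.

```python
from typing import Callable, Dict, List


def link_check_minion(record: Dict, is_reachable: Callable[[str], bool]) -> Dict:
    """Return a hypothetical 'link-status' aspect for one record.

    `is_reachable` is injected so the sketch runs without network access;
    a real check would issue HTTP requests.
    """
    links = record.get("aspects", {}).get("distribution-links", {})
    urls: List[str] = links.get("urls", [])
    broken = [u for u in urls if not is_reachable(u)]
    return {"checked": len(urls), "broken": broken}


record = {
    "id": "ds-1",
    "aspects": {
        "distribution-links": {
            "urls": ["https://example.org/a.csv", "https://example.org/gone.csv"]
        }
    },
}
# Stand-in reachability check: treat any URL containing "gone" as broken.
status = link_check_minion(record, lambda url: "gone" not in url)
record["aspects"]["link-status"] = status
print(status)
```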

Custom compliance enforcement

Integrates Open Policy Agent to restrict dataset visibility based on user roles and regulatory rules.
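The kind of decision such a policy makes can be sketched in a few lines. This is a plain Python stand-in, not Rego and not Magda's policy code: a real deployment would express the rule in OPA's policy language, and the `access`, `roles`, and `allowed_roles` fields are assumptions made for the example.

```python
from typing import Dict, List


def can_view(user: Dict, dataset: Dict) -> bool:
    """Stand-in for a policy decision; field names are hypothetical."""
    if dataset.get("access") == "public":
        return True
    # Restricted datasets are visible only to users holding a permitted role.
    user_roles = set(user.get("roles", []))
    return bool(user_roles & set(dataset.get("allowed_roles", [])))


def visible_datasets(user: Dict, datasets: List[Dict]) -> List[str]:
    return [d["id"] for d in datasets if can_view(user, d)]


datasets = [
    {"id": "open-budget", "access": "public"},
    {"id": "payroll", "access": "restricted", "allowed_roles": ["hr"]},
]
analyst = {"roles": ["analyst"]}
hr_officer = {"roles": ["hr"]}
print(visible_datasets(analyst, datasets))
print(visible_datasets(hr_officer, datasets))
```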

Tech snapshot

JavaScript: 38%
Prolog: 32%
TypeScript: 17%
Scala: 8%
SCSS: 4%
Open Policy Agent: 1%

Frequently asked questions

What infrastructure is required to run Magda?

Magda runs as a set of Docker containers orchestrated by Kubernetes; a Helm chart is provided for installation on any K8s cluster, including local Minikube.

How does Magda handle authentication?

Authentication is federated through passport.js, supporting providers such as Google, Facebook, WSFed, AAF, CKAN, and custom OAuth/OpenID Connect services.

Can I add support for a new data source?

Yes. New connectors are implemented as Docker‑based microservices that crawl the source and import metadata into the registry.
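The crawl-and-import flow a connector performs could be sketched as follows. This is an in-memory illustration only: `crawl_source`, `to_record`, `run_connector`, the aspect names, and the list standing in for the registry are hypothetical, and a real connector would call the source's API and the registry over HTTP.

```python
from typing import Dict, Iterable, List


def crawl_source() -> Iterable[Dict]:
    """Stand-in for fetching items from an external portal (no network here)."""
    yield {"identifier": "42", "title": "Air quality readings",
           "download": "https://example.org/air.csv"}
    yield {"identifier": "43", "title": "Bus timetables",
           "download": "https://example.org/bus.json"}


def to_record(source_id: str, item: Dict) -> Dict:
    # Prefix ids with the source so records from different portals cannot collide.
    return {
        "id": f"{source_id}-{item['identifier']}",
        "name": item["title"],
        "aspects": {
            "source": {"id": source_id},
            "basic-metadata": {"title": item["title"], "url": item["download"]},
        },
    }


def run_connector(source_id: str, registry: List[Dict]) -> int:
    """Crawl, transform, and append records to an in-memory 'registry'."""
    count = 0
    for item in crawl_source():
        registry.append(to_record(source_id, item))
        count += 1
    return count


registry: List[Dict] = []
imported = run_connector("city-portal", registry)
print(imported, [r["id"] for r in registry])
```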

Is there built‑in data storage for datasets?

Dataset storage is currently under development; Magda presently catalogs metadata and links to external data locations.

How is search performance achieved?

All searchable aspects are indexed in an OpenSearch cluster, delivering fast, scalable full‑text and faceted search.
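To make "full‑text and faceted" concrete, here is a sketch of a request body in the OpenSearch query DSL that combines a `multi_match` text query with a `terms` aggregation. The field names (`title`, `description`, `publisher.keyword`) and the `build_search_body` helper are assumptions for illustration and do not reflect Magda's actual index mapping or queries.

```python
import json
from typing import Dict


def build_search_body(text: str, facet_field: str, size: int = 10) -> Dict:
    """Build a query-DSL body: full-text match plus one terms facet."""
    return {
        "size": size,
        "query": {
            "multi_match": {"query": text, "fields": ["title", "description"]}
        },
        # A terms aggregation returns per-value bucket counts, which a UI
        # can render as facet filters next to the hits.
        "aggs": {"facet": {"terms": {"field": facet_field}}},
    }


body = build_search_body("rainfall", "publisher.keyword")
print(json.dumps(body, indent=2))
```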

Project at a glance

Status: Active


Stars: 604
Watchers: 604
Forks: 107

License: Apache-2.0
Repo age: 9 years old
Last commit: 3 days ago
Primary language: JavaScript
