Magda

Federated data catalog with scalable search and Kubernetes‑native deployment

Magda unifies datasets, files, APIs, and databases across an organization, offering federated search, automated metadata enrichment, and cloud‑agnostic deployment via Kubernetes.

Overview

Unified Data Discovery

Magda provides a single, federated view of all data assets—whether they reside in files, databases, APIs, or external portals. By crawling sources, enriching metadata, and tracking changes, it enables users to discover, prioritize, and trust the data they need.

Extensible, Cloud‑Agnostic Architecture

Built as Kubernetes‑orchestrated microservices, Magda is deployed via Helm charts and runs on any cloud or on‑premises environment. Its unopinionated Registry stores records as JSON aspects, while connectors and minions—packaged as Docker images—allow custom ingestion, validation, and enrichment in any language. Search is powered by OpenSearch, delivering fast, scalable results.
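To make the "records as JSON aspects" idea concrete, here is a minimal sketch in plain Python. It does not call Magda's Registry API; the aspect names (`basic-metadata`, `link-status`) and the helper functions are hypothetical and only illustrate how a record can stay schema-light while independent aspects are attached to it.

```python
import json
from typing import Any, Dict


def make_record(record_id: str, name: str) -> Dict[str, Any]:
    """Create a bare record; all descriptive data lives in its aspects."""
    return {"id": record_id, "name": name, "aspects": {}}


def put_aspect(record: Dict[str, Any], aspect_id: str, data: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with one aspect added or replaced."""
    aspects = {**record["aspects"], aspect_id: data}
    return {**record, "aspects": aspects}


# Hypothetical aspect names, chosen for illustration only.
record = make_record("ds-001", "Rainfall observations")
record = put_aspect(record, "basic-metadata", {"title": "Rainfall observations", "keywords": ["weather"]})
record = put_aspect(record, "link-status", {"status": "active"})

print(json.dumps(record, indent=2))
```

Because each aspect is an independent JSON document keyed by name, one component can write `basic-metadata` while another later adds `link-status` without either needing to know the other's schema.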

Ready for Enterprise Scale

Used in production by data.gov.au, Magda supports federated authentication (Google, Facebook, WSFed, AAF, CKAN, custom) and is designed for large, heterogeneous environments. Ongoing development adds automated cataloguing, policy‑based authorization with OPA, and native dataset storage.

Highlights

  • Scalable OpenSearch‑based search across federated data sources
  • Kubernetes‑orchestrated microservices with Helm charts for cloud‑agnostic deployment
  • Extensible metadata registry using dynamic JSON aspects and plug‑in connectors/minions
  • Federated authentication via passport.js supporting multiple providers

Pros

  • Handles heterogeneous data sources (files, APIs, databases) in a single catalog
  • High‑performance search powered by OpenSearch
  • Extensible via Docker‑based connectors and minions written in any language
  • Cloud‑agnostic and reproducible deployments with Helm and Kubernetes

Considerations

  • Requires Kubernetes expertise for installation and upgrades
  • Documentation may lag behind new features
  • Dataset storage is still under development
  • Policy‑based authorization is not yet fully released

Managed products teams compare with

When teams consider Magda, these hosted platforms usually appear on the same shortlist.

Alation

Data catalog platform for data discovery, governance, and lineage

Ataccama

Unified data management platform combining catalog, governance, data quality, and MDM

Atlan

Modern data catalog and collaborative metadata platform for data discovery and governance

Fit guide

Great for

  • Large enterprises needing a unified view of internal and external data assets
  • Government open‑data portals that require federated search across multiple sources
  • Teams comfortable with container orchestration and seeking extensible metadata handling
  • Organizations that want to customize catalog behavior via plug‑in connectors or minions

Not ideal when

  • Small projects without Kubernetes infrastructure or expertise
  • Users requiring out‑of‑the‑box dataset hosting and storage
  • Deployments that need fully documented, stable UI features immediately
  • Environments where a turnkey, single‑binary solution is preferred over microservice architecture

How teams use it

Cross‑agency open data portal

Aggregates datasets from multiple government portals, providing citizens a single searchable interface.

Enterprise data discovery platform

Indexes internal databases, file shares, and APIs, enabling analysts to locate and assess data assets quickly.

Automated metadata enrichment pipeline

Minions validate links, assess data quality, and enrich records, improving search relevance without manual effort.
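The enrichment step described above can be sketched as a small function that reads a record, checks its links, and writes the result back as a new aspect. This is an illustration under stated assumptions, not Magda's minion framework: the aspect names (`distributions`, `link-status`), the `enrich` function, and the injected `is_reachable` checker are all hypothetical, and the checker is faked so the example runs without network access.

```python
from typing import Any, Callable, Dict, List


def check_links(urls: List[str], is_reachable: Callable[[str], bool]) -> List[Dict[str, str]]:
    """Label each URL as 'active' or 'broken' using the supplied checker."""
    return [
        {"url": url, "status": "active" if is_reachable(url) else "broken"}
        for url in urls
    ]


def enrich(record: Dict[str, Any], is_reachable: Callable[[str], bool]) -> Dict[str, Any]:
    """Return a copy of the record with a hypothetical 'link-status' aspect added."""
    urls = record["aspects"].get("distributions", {}).get("urls", [])
    links = check_links(urls, is_reachable)
    active = sum(1 for link in links if link["status"] == "active")
    ratio = active / len(links) if links else 0.0
    aspects = {**record["aspects"], "link-status": {"links": links, "active_ratio": ratio}}
    return {**record, "aspects": aspects}


def fake_is_reachable(url: str) -> bool:
    """Stand-in for a real HTTP check; treats only https URLs as reachable."""
    return url.startswith("https://")


record = {
    "id": "ds-001",
    "aspects": {
        "distributions": {
            "urls": ["https://example.org/rain.csv", "ftp://example.org/old.csv"]
        }
    },
}
enriched = enrich(record, fake_is_reachable)
print(enriched["aspects"]["link-status"])
```

In a real deployment the checker would issue HTTP requests and the result would be written to the registry; passing the checker in as a parameter keeps the enrichment logic testable on its own.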

Custom compliance enforcement

Integrates Open Policy Agent to restrict dataset visibility based on user roles and regulatory rules.
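As a rough illustration of role-based dataset visibility, the sketch below expresses one possible rule set in plain Python. It deliberately swaps in Python for OPA's Rego policy language, and every field name (`access`, `org_unit`, `roles`, `org_units`) and rule is a hypothetical example rather than Magda's actual policy model.

```python
from typing import Any, Dict, List


def can_view(user: Dict[str, Any], dataset: Dict[str, Any]) -> bool:
    """Decide visibility: public data, admins, or a matching organisational unit."""
    if dataset.get("access") == "public":
        return True
    if "admin" in user.get("roles", []):
        return True
    org_unit = dataset.get("org_unit")
    return org_unit is not None and org_unit in user.get("org_units", [])


def visible_datasets(user: Dict[str, Any], datasets: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Filter a result list down to the datasets the user may see."""
    return [dataset for dataset in datasets if can_view(user, dataset)]


datasets = [
    {"id": "ds-1", "access": "public"},
    {"id": "ds-2", "access": "restricted", "org_unit": "finance"},
    {"id": "ds-3", "access": "restricted", "org_unit": "health"},
]
analyst = {"roles": ["analyst"], "org_units": ["finance"]}

print([d["id"] for d in visible_datasets(analyst, datasets)])
```

With a policy engine such as OPA, the decision logic in `can_view` would live in a policy file that can be changed without redeploying the services that enforce it.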

Tech snapshot

JavaScript: 38%
Prolog: 32%
TypeScript: 17%
Scala: 8%
SCSS: 4%
Open Policy Agent: 1%

Tags

scala, kubernetes, postgresql, nodejs, open-data, opensearch

Frequently asked questions

What infrastructure is required to run Magda?

Magda runs as a set of Docker containers orchestrated by Kubernetes; a Helm chart is provided for installation on any Kubernetes cluster, including a local Minikube cluster.

How does Magda handle authentication?

Authentication is federated through passport.js, supporting providers such as Google, Facebook, WSFed, AAF, CKAN, and custom OAuth/OpenID Connect services.

Can I add support for a new data source?

Yes. New connectors are implemented as Docker‑based microservices that crawl the source and import metadata into the registry.
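The crawl-and-import loop of a connector might look roughly like the following. This is a sketch under assumptions, not Magda's connector SDK: the source item fields, the `crawl` and `to_record` helpers, and the aspect names are hypothetical, the source is an in-memory list, and the final step of sending records to the registry is omitted.

```python
from typing import Any, Callable, Dict, Iterator, List

Item = Dict[str, Any]


def crawl(fetch_page: Callable[[int, int], List[Item]], page_size: int = 2) -> Iterator[Item]:
    """Yield source items page by page until an empty page is returned."""
    start = 0
    while True:
        page = fetch_page(start, page_size)
        if not page:
            return
        yield from page
        start += len(page)


def to_record(source_id: str, item: Item) -> Dict[str, Any]:
    """Map one source item to a registry-style record with hypothetical aspects."""
    return {
        "id": f"{source_id}-{item['id']}",
        "name": item["title"],
        "aspects": {
            "basic-metadata": {"title": item["title"]},
            "source": {"id": source_id, "url": item["url"]},
        },
    }


# In-memory stand-in for a remote portal's paged listing API.
SOURCE_ITEMS = [
    {"id": "a1", "title": "Rainfall", "url": "https://example.org/a1"},
    {"id": "b2", "title": "Road closures", "url": "https://example.org/b2"},
    {"id": "c3", "title": "Census tables", "url": "https://example.org/c3"},
]


def fetch_page(start: int, limit: int) -> List[Item]:
    return SOURCE_ITEMS[start:start + limit]


records = [to_record("demo-portal", item) for item in crawl(fetch_page)]
print([record["id"] for record in records])
```

Prefixing each record id with the source id keeps identifiers unique when several portals are crawled into the same registry.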

Is there built‑in data storage for datasets?

Dataset storage is currently under development; Magda presently catalogs metadata and links to external data locations.

How is search performance achieved?

All searchable aspects are indexed in an OpenSearch cluster, delivering fast, scalable full‑text and faceted search.

Project at a glance

Active
Stars: 580
Watchers: 580
Forks: 97
License: Apache-2.0
Repo age: 9 years old
Last commit: 4 weeks ago
Primary language: JavaScript

Last synced yesterday