Amundsen

Google‑style search engine for data assets across your organization

Amundsen indexes tables, dashboards, ML features, and people, delivering relevance‑ranked search powered by usage patterns, with a Flask/React UI and integrations for major data stores.

Overview

Amundsen is a data discovery and metadata engine that helps analysts, data scientists, and engineers locate the data they need quickly. By indexing tables, dashboards, ML features, and people, it provides a Google‑style search experience where frequently used assets surface first.

Capabilities & Deployment

The platform consists of three microservices and an ingestion library: frontend (Flask + React), search (Elasticsearch), metadata (Neo4j, Apache Atlas, relational databases, or AWS Neptune), and databuilder for ingestion. Ingestion runs as Python scripts or Airflow DAGs and supports over 30 connectors, including Hive, Redshift, Snowflake, and BigQuery. Deployment requires Python ≥ 3.8 and Node 12, and each service can be containerized.

Audience

Amundsen is designed for organizations with diverse data ecosystems that need a unified, extensible catalog. The LF AI & Data community provides support, documentation, and a Slack channel for collaboration.

Highlights

Relevance‑ranked search across tables, dashboards, ML features, and people

Extensible ingestion pipeline with 30+ built‑in connectors

Pluggable metadata stores: Neo4j, Apache Atlas, relational DBs, AWS Neptune

Modern UI with inline previews, column statistics, and dashboard links

Pros

Improves data discoverability and reduces time to insight
Scalable microservice architecture
Rich ecosystem of connectors for major data platforms
Active community backed by LF AI & Data Foundation

Considerations

Requires multiple services (frontend, search, metadata) to run
Operational overhead for Elasticsearch and graph store management
Limited built‑in data quality features
Customization may need Python and Node development

Managed products teams compare it with

When teams consider Amundsen, these hosted platforms usually appear on the same shortlist.

Alation

Data catalog platform for data discovery, governance, and lineage

Ataccama

Unified data management platform combining catalog, governance, data quality, and MDM

Atlan

Modern data catalog and collaborative metadata platform for data discovery and governance


Fit guide

Great for

Large organizations with diverse data platforms seeking unified discovery
Teams already using Elasticsearch and Neo4j/Atlas
Companies wanting a self‑hosted, extensible metadata layer
Data engineers comfortable with Python and Airflow

Not ideal when

Small teams lacking resources to manage multiple services
Environments that need a turnkey SaaS solution
Organizations requiring built‑in governance without extra tooling
Users preferring a single‑binary deployment

How teams use it

Find frequently queried tables for ad‑hoc analysis

Analysts see high‑usage tables at the top of search results, which shortens the time spent hunting for the right source.

Catalog dashboards across BI tools

Data consumers search and navigate to relevant Superset, Tableau, or Looker dashboards.

Expose ML feature metadata to model developers

Feature engineers retrieve feature definitions and lineage, ensuring consistency across models.

Integrate with Airflow to keep metadata fresh

Automated DAGs run databuilder jobs, continuously updating the search index as new tables appear.
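The refresh that such a scheduled job performs can be pictured as an idempotent sync step. The sketch below is plain Python, not an Airflow DAG or a databuilder job: `sync_index`, the in-memory `index` dict, and the sample table listings are hypothetical stand-ins for a real source and search index. Removing stale entries is an added assumption; the text above only mentions picking up new tables.

```python
def sync_index(index, source_tables):
    """Bring `index` in line with `source_tables` and report what changed.

    index: dict mapping table name -> metadata dict (stand-in for a search index)
    source_tables: dict of the same shape, as listed from the source on this run
    """
    added = sorted(set(source_tables) - set(index))
    removed = sorted(set(index) - set(source_tables))  # assumption: drop stale tables
    updated = sorted(
        name
        for name in set(index) & set(source_tables)
        if index[name] != source_tables[name]
    )
    for name in added + updated:
        index[name] = dict(source_tables[name])
    for name in removed:
        del index[name]
    return {"added": added, "updated": updated, "removed": removed}


# Hypothetical sample data
index = {"orders": {"columns": 5}, "legacy_users": {"columns": 3}}
source = {"orders": {"columns": 6}, "payments": {"columns": 4}}

first_run = sync_index(index, source)
second_run = sync_index(index, source)  # a repeat run finds nothing to change
```

Because the step is idempotent, a scheduler can rerun it after a failure without duplicating entries.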

Tech snapshot

Python: 67%

TypeScript: 30%

SCSS: 2%

HTML: 1%

Makefile: 1%

Scala: 1%

Frequently asked questions

What storage backends does Amundsen support for metadata?

It can use Neo4j, Apache Atlas, relational databases via SQLAlchemy, or AWS Neptune through the Gremlin library.

How is search powered?

The search service uses Elasticsearch and ranks results using usage signals such as query frequency.
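As a rough illustration of usage-weighted ranking, the sketch below multiplies a simple text-match score by a dampened usage count. It does not reproduce Amundsen's actual Elasticsearch query; the `Table` class, the `rank` function, the `query_count` field, and the log-based weighting are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    description: str
    query_count: int  # hypothetical usage signal, e.g. recent query volume


def rank(tables, term):
    """Return the tables matching `term`, most relevant first."""
    term = term.lower()
    scored = []
    for t in tables:
        # Text component: a name match counts more than a description match.
        text_score = 2.0 * (term in t.name.lower()) + 1.0 * (term in t.description.lower())
        if text_score == 0:
            continue  # no match, leave out of the results
        # Usage component: log1p keeps very popular tables from dominating.
        score = text_score * (1.0 + math.log1p(t.query_count))
        scored.append((score, t))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored]


# Hypothetical sample data
tables = [
    Table("orders_raw", "Unprocessed order events", 12),
    Table("orders", "Cleaned orders, one row per order", 4800),
    Table("customers", "Customer dimension; joins to orders", 950),
]
results = [t.name for t in rank(tables, "orders")]
```

With this sample data, the heavily queried `customers` table outranks `orders_raw` even though only its description matches, which shows how a usage signal can reorder purely textual matches.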

Can I add a custom data source?

Yes, the databuilder library lets you write a Python extractor and loader for any DB-API or SQLAlchemy‑compatible source.
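The extractor-and-loader pattern can be sketched with the standard library alone. The example below reads table and column metadata from an in-memory SQLite database through the DB-API and hands each record to a loader. It does not import databuilder: `SqliteMetadataExtractor`, `ListLoader`, and `run_job` are hypothetical names that only mirror the shape of the pattern, and a real integration would build on databuilder's own classes instead.

```python
import sqlite3


class SqliteMetadataExtractor:
    """Hypothetical extractor: returns one column record per extract() call."""

    def __init__(self, conn):
        self._conn = conn
        self._records = None

    def _iter_records(self):
        tables = self._conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        for (table,) in tables:
            # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
            for row in self._conn.execute(f'PRAGMA table_info("{table}")'):
                yield {"table": table, "column": row[1], "type": row[2]}

    def extract(self):
        # Return the next record, or None once the source is exhausted.
        if self._records is None:
            self._records = self._iter_records()
        return next(self._records, None)


class ListLoader:
    """Hypothetical loader: collects records in memory instead of writing them out."""

    def __init__(self):
        self.records = []

    def load(self, record):
        self.records.append(record)


def run_job(extractor, loader):
    # Pull records until the extractor signals exhaustion with None.
    record = extractor.extract()
    while record is not None:
        loader.load(record)
        record = extractor.extract()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
loader = ListLoader()
run_job(SqliteMetadataExtractor(conn), loader)
```

Keeping extraction and loading behind separate small interfaces means a new source only needs a new extractor, while the loader stays unchanged.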

What are the main components to deploy?

The frontend (Flask + React), search service, and metadata service each run as a separate microservice; the databuilder ingestion library runs alongside them as scripts or scheduled jobs.

Is there a community or support channel?

Amundsen is hosted by the LF AI & Data Foundation; users can join the Slack workspace and contribute via GitHub.

Project at a glance

Active


Stars: 4,782
Watchers: 4,782
Forks: 964

License: Apache-2.0

Repo age: 7 years old

Last commit: 3 weeks ago

Primary language: Python

