Amundsen

Google‑style search engine for data assets across your organization

Amundsen indexes tables, dashboards, ML features, and people, delivering relevance‑ranked search powered by usage patterns, with a Flask/React UI and integrations for major data stores.

Overview

Amundsen is a data discovery and metadata engine that helps analysts, data scientists, and engineers locate the data they need quickly. By indexing tables, dashboards, ML features, and people, it provides a Google‑style search experience where frequently used assets surface first.

Capabilities & Deployment

The platform consists of three microservices and an ingestion library: a frontend (Flask + React), a search service (Elasticsearch), a metadata service (Neo4j, Apache Atlas, relational DBs, or AWS Neptune), and databuilder for ingestion. Ingestion can run as Python scripts or as Airflow DAGs, with over 30 connectors including Hive, Redshift, Snowflake, and BigQuery. Deployment requires Python ≥ 3.8 and Node 12, and each service can be containerized for scalable operation.

Audience

Designed for organizations with diverse data ecosystems that need a unified, extensible catalog. The active LF AI & Data community provides support, documentation, and a Slack channel for collaboration.

Highlights

Relevance‑ranked search across tables, dashboards, ML features, and people
Extensible ingestion pipeline with 30+ built‑in connectors
Pluggable metadata stores: Neo4j, Apache Atlas, relational DBs, AWS Neptune
Modern UI with inline previews, column statistics, and dashboard links

Pros

  • Improves data discoverability and reduces time to insight
  • Scalable microservice architecture
  • Rich ecosystem of connectors for major data platforms
  • Active community backed by LF AI & Data Foundation

Considerations

  • Requires multiple services (frontend, search, metadata) to run
  • Operational overhead for Elasticsearch and graph store management
  • Limited built‑in data quality features
  • Customization may need Python and Node development

Managed products teams compare it with

When teams consider Amundsen, these hosted platforms usually appear on the same shortlist.

Alation

Data catalog platform for data discovery, governance, and lineage

Ataccama

Unified data management platform combining catalog, governance, data quality, and MDM

Atlan

Modern data catalog and collaborative metadata platform for data discovery and governance

Fit guide

Great for

  • Large organizations with diverse data platforms seeking unified discovery
  • Teams already using Elasticsearch and Neo4j/Atlas
  • Companies wanting a self‑hosted, extensible metadata layer
  • Data engineers comfortable with Python and Airflow

Not ideal when

  • Small teams lacking resources to manage multiple services
  • Environments that need a turnkey SaaS solution
  • Organizations requiring built‑in governance without extra tooling
  • Users preferring a single‑binary deployment

How teams use it

Find frequently queried tables for ad‑hoc analysis

Usage‑ranked results let analysts find heavily queried tables quickly, shortening the discovery step of ad‑hoc analysis.

Catalog dashboards across BI tools

Data consumers search and navigate to relevant Superset, Tableau, or Looker dashboards.

Expose ML feature metadata to model developers

Feature engineers retrieve feature definitions and lineage, ensuring consistency across models.

Integrate with Airflow to keep metadata fresh

Automated DAGs run databuilder jobs, continuously updating the search index as new tables appear.
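
The refresh step such a scheduled job performs can be sketched in plain Python. This is only an illustration under stated assumptions: `refresh_index` and the dictionary-based "index" are hypothetical stand-ins, and a real deployment would run a databuilder job from an Airflow task against Elasticsearch instead.

```python
from typing import Dict, Set, Tuple


def refresh_index(
    source_tables: Dict[str, str], index: Dict[str, str]
) -> Tuple[Set[str], Set[str]]:
    """Sync `index` with `source_tables` in place; return (added, removed) names."""
    added = set(source_tables) - set(index)
    removed = set(index) - set(source_tables)
    for name in removed:
        del index[name]  # drop tables that no longer exist in the source
    # Upsert every source table so changed descriptions are picked up too.
    index.update(source_tables)
    return added, removed


# Hypothetical state: the index is stale relative to the source catalog.
index = {"orders": "order facts", "legacy_events": "deprecated"}
source = {"orders": "cleaned order facts", "customers": "customer dimension"}
added, removed = refresh_index(source, index)
```

Running the same function on every schedule tick keeps the index converging on the source, whether tables were added, dropped, or re-described since the last run.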

Tech snapshot

Python 67%
TypeScript 30%
SCSS 2%
HTML 1%
Makefile 1%
Scala 1%

Tags

amundsen, metadata, data-catalog, linuxfoundation, data-discovery

Frequently asked questions

What storage backends does Amundsen support for metadata?

It can use Neo4j, Apache Atlas, relational databases via SQLAlchemy, or AWS Neptune through the Gremlin library.

How is search powered?

The search service uses Elasticsearch and ranks results by usage signals such as query frequency.
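
As a rough illustration of usage-weighted ranking, the sketch below blends a simple text-match score with a log-scaled usage count. The `Asset` class, the `rank` function, and the scoring formula are hypothetical and are not Amundsen's actual Elasticsearch scoring; they only show why heavily queried assets tend to surface first.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Asset:
    name: str
    description: str
    usage_count: int  # hypothetical usage signal, e.g. recent query count


def text_score(query: str, asset: Asset) -> int:
    """Count query terms that appear in the asset name or description."""
    haystack = f"{asset.name} {asset.description}".lower()
    return sum(1 for term in query.lower().split() if term in haystack)


def rank(query: str, assets: List[Asset]) -> List[Asset]:
    """Return matching assets, best first, boosting text matches by usage."""
    scored = []
    for asset in assets:
        match = text_score(query, asset)
        if match == 0:
            continue  # skip assets that do not match the query at all
        # log1p dampens the boost so popularity cannot fully swamp relevance.
        scored.append((match * (1.0 + math.log1p(asset.usage_count)), asset))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [asset for _, asset in scored]


catalog = [
    Asset("orders_raw", "raw order events", usage_count=3),
    Asset("orders", "cleaned order facts", usage_count=250),
    Asset("customers", "customer dimension", usage_count=900),
]
results = rank("order", catalog)
```

Here `customers` is excluded despite its high usage because it does not match the query, while `orders` outranks `orders_raw` on usage alone.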

Can I add a custom data source?

Yes, the databuilder library lets you write a Python extractor and loader for any dbapi or SQLAlchemy‑compatible source.
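
A minimal sketch of what a custom extractor could look like follows. It mirrors the init/extract/get_scope shape used by databuilder extractors but does not import databuilder, so the class name `SqliteTableExtractor`, the `connection` config key, and the record format are assumptions made for illustration. Python's built-in `sqlite3` module stands in for any dbapi-compatible source.

```python
import sqlite3
from typing import Dict, Iterator, Optional


class SqliteTableExtractor:
    """Hypothetical extractor that yields one record per column of each table."""

    def init(self, conf: Dict[str, sqlite3.Connection]) -> None:
        # 'connection' is an assumed config key, not a databuilder constant.
        self._conn = conf["connection"]
        self._records: Optional[Iterator[Dict[str, str]]] = None

    def _iter_columns(self) -> Iterator[Dict[str, str]]:
        tables = self._conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        for (table,) in tables:
            # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
            for row in self._conn.execute(f"PRAGMA table_info('{table}')"):
                yield {"table": table, "column": row[1], "type": row[2]}

    def extract(self) -> Optional[Dict[str, str]]:
        """Return the next record, or None when the source is exhausted."""
        if self._records is None:
            self._records = self._iter_columns()
        return next(self._records, None)

    def get_scope(self) -> str:
        return "extractor.sqlite_table"


# Usage: build a throwaway in-memory database and drain the extractor.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")

extractor = SqliteTableExtractor()
extractor.init({"connection": conn})

records = []
record = extractor.extract()
while record is not None:
    records.append(record)
    record = extractor.extract()
```

Returning one record per call and `None` at the end lets a surrounding job pull records one at a time and hand each to a loader without holding the whole source in memory.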

What are the main components to deploy?

Three microservices: the frontend (Flask + React), the search service, and the metadata service. The databuilder ingestion library runs separately, as Python scripts or Airflow DAGs, rather than as a long-running service.

Is there a community or support channel?

Amundsen is hosted by the LF AI & Data Foundation; users can join the Slack workspace and contribute via GitHub.

Project at a glance

Active
Stars: 4,716
Watchers: 4,716
Forks: 973
License: Apache-2.0
Repo age: 6 years old
Last commit: 2 weeks ago
Primary language: Python

Last synced 3 hours ago