Apache Gravitino

Geo-distributed federated metadata lake for unified data governance

Apache Gravitino manages metadata across diverse sources, regions, and clouds through a unified API, enabling federated discovery, multi-region sync, and end-to-end governance for data and AI assets.

Overview

Unified Metadata for Global Architectures

Apache Gravitino is a high-performance, geo-distributed metadata lake designed for organizations managing data and AI assets across multiple sources, regions, and clouds. It provides a single API and model to access metadata from Hive, MySQL, HDFS, S3, and other systems, eliminating silos and enabling federated discovery without migrating data.
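As a rough illustration of what a single API over several sources can look like, the sketch below builds REST-style URLs for tables that live in different backends. The path layout (metalake, then catalog, schema, and table), the `localhost:8090` address, and all names such as `demo` and `hive_catalog` are assumptions made for this example; check them against the Gravitino documentation before relying on them.

```python
from urllib.parse import quote

# Assumed address of a local server; a real deployment will differ.
BASE_URL = "http://localhost:8090"


def metadata_url(metalake, catalog=None, schema=None, table=None):
    """Build a URL for one level of an assumed metalake/catalog/schema/table hierarchy.

    Levels are added in order and stop at the first one that is None.
    """
    parts = ["api", "metalakes", metalake]
    levels = (("catalogs", catalog), ("schemas", schema), ("tables", table))
    for collection, name in levels:
        if name is None:
            break
        parts += [collection, name]
    # Percent-encode each segment so unusual names cannot break the path.
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


# Hypothetical catalogs backed by different systems, addressed the same way.
hive_table = metadata_url("demo", "hive_catalog", "sales", "orders")
mysql_table = metadata_url("demo", "mysql_catalog", "crm", "customers")
print(hive_table)
print(mysql_table)
```

The point of the sketch is only that a Hive table and a MySQL table are addressed through the same URL shape, which is what a unified metadata model implies for client code.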

Direct Integration and Multi-Engine Support

Gravitino connects directly to the underlying metadata systems, so changes made in a source are reflected immediately, with no batch synchronization step. It integrates with query engines such as Trino and Spark, allowing teams to run federated queries without changing SQL dialects. Built-in support for the Iceberg REST catalog and for evolving AI model lineage standards makes it suitable for modern lakehouse and AI workflows.

Enterprise-Grade Governance

The platform delivers end-to-end data governance with unified access control, auditing, and discovery across all metadata assets. Geo-distribution capabilities enable metadata sharing across hybrid and multi-cloud environments, supporting global teams and compliance requirements. Licensed under Apache 2.0, Gravitino is built with Gradle and offers Docker Compose–based playground environments for rapid evaluation.

Highlights

  • Unified API for managing metadata across Hive, MySQL, HDFS, S3, and more
  • Geo-distributed architecture for multi-region and multi-cloud metadata sharing
  • Direct connector integration with immediate reflection of upstream changes
  • Native Iceberg REST catalog and Trino connector for federated queries

Pros

  • Eliminates metadata silos with a single access layer across diverse sources
  • Supports global architectures with geo-distribution and multi-cloud capabilities
  • Integrates with Trino, Spark, and Iceberg without SQL dialect changes
  • Provides end-to-end governance, including access control, auditing, and discovery

Considerations

  • Windows builds are not currently supported
  • AI asset management features are still a work in progress
  • Requires familiarity with distributed metadata concepts for optimal deployment
  • Java-based architecture may require JVM tuning for large-scale environments

Managed products teams compare it with

When teams consider Apache Gravitino, these hosted platforms usually appear on the same shortlist.

Alation

Data catalog platform for data discovery, governance, and lineage

Amazon Redshift

Fully managed, petabyte-scale cloud data warehouse for analytics and reporting

Ataccama

Unified data management platform combining catalog, governance, data quality, and MDM

Fit guide

Great for

  • Organizations with data spread across multiple clouds, regions, or on-premises systems
  • Teams needing federated metadata discovery across data lakes and warehouses
  • Enterprises requiring unified governance and audit trails for compliance
  • Data platforms integrating Trino, Spark, or Iceberg in hybrid architectures

Not ideal when

  • Single-source metadata scenarios where native tooling suffices
  • Windows-based development or deployment environments
  • Teams seeking mature AI model catalog features in production today
  • Projects requiring minimal infrastructure or lightweight metadata solutions

How teams use it

Multi-Cloud Data Lake Federation

Unified metadata access across AWS S3, Azure Data Lake, and on-premises HDFS, enabling cross-cloud analytics without data migration.

Global Metadata Synchronization

Geo-distributed teams share consistent metadata views across regions, supporting compliance and reducing query latency for local users.

Federated Query with Trino

Data engineers run SQL queries spanning Hive, MySQL, and Iceberg tables through Gravitino's Trino connector without rewriting queries.
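A minimal sketch of what such a cross-source query might look like, assuming the catalogs appear in Trino under hypothetical names like `hive_catalog` and `mysql_catalog`. Trino addresses tables as catalog.schema.table and accepts double-quoted identifiers; the catalog, schema, table, and column names here are invented for illustration, and the helper functions are not part of Gravitino or Trino.

```python
def quote_ident(name):
    """Double-quote an identifier, doubling any embedded quote characters."""
    return '"' + name.replace('"', '""') + '"'


def qualified(catalog, schema, table):
    """Return a quoted catalog.schema.table name."""
    return ".".join(quote_ident(part) for part in (catalog, schema, table))


def federated_join(left, right, key):
    """Build a join between two tables that may live in different catalogs."""
    k = quote_ident(key)
    return (
        f"SELECT o.*, c.* FROM {qualified(*left)} AS o "
        f"JOIN {qualified(*right)} AS c ON o.{k} = c.{k}"
    )


# Hypothetical example: orders in a Hive catalog, customers in a MySQL catalog.
sql = federated_join(
    ("hive_catalog", "sales", "orders"),
    ("mysql_catalog", "crm", "customers"),
    "customer_id",
)
print(sql)
```

The resulting statement is ordinary SQL; the only federation-specific part is that the two table names start with different catalog names.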

Unified Data Governance

Centralized access control and audit logs across all metadata assets, simplifying compliance reporting and security policy enforcement.

Tech snapshot

Java: 86%
Python: 10%
JavaScript: 3%
Rust: 1%
Shell: 1%
Dockerfile: 1%

Tags

skycomputing, stratosphere, model-catalog, metalake, metadata, data-catalog, lakehouse, datalake, federated-query, opendatacatalog, ai-catalog

Frequently asked questions

What metadata sources does Gravitino support?

Gravitino integrates with Hive, MySQL, HDFS, S3, Iceberg, and other systems through direct connectors, with changes reflected immediately in the unified metadata layer.

How does Gravitino handle geo-distributed metadata?

Gravitino shares metadata across regions and clouds, enabling global architectures where teams in different locations access consistent metadata views without replication delays.

Can I use Gravitino with existing query engines?

Yes. Gravitino provides native connectors for Trino and Spark, allowing federated queries without modifying SQL dialects or migrating existing workflows.

Is AI model metadata supported?

AI asset management, including model and feature tracking, is currently a work in progress. Check the documentation for the latest status and roadmap.

What is the Gravitino Playground?

A Docker Compose–based environment that provides a full-stack Gravitino experience for evaluation, including sample data sources and query engines.

Project at a glance

Active
Stars: 2,679
Watchers: 2,679
Forks: 713
License: Apache-2.0
Repo age: 2 years
Last commit: yesterday
Self-hosting: Supported
Primary language: Java

Last synced yesterday