Mara Pipelines

Lightweight Python ETL framework with PostgreSQL and web UI

A transparent data transformation framework that defines pipelines as Python code, uses PostgreSQL for processing, and provides an extensive web interface for debugging and execution.

Overview

Mara Pipelines is a lightweight data transformation framework designed for teams building ETL workflows who value transparency and simplicity over distributed complexity. It positions itself between plain scripts and heavyweight orchestrators like Apache Airflow.

Core Philosophy

Pipelines, tasks, and commands are defined declaratively in Python code. The framework uses PostgreSQL as its data processing engine and relies on command-line tools rather than in-app data processing. Execution follows GNU make semantics: a node runs once its upstream nodes have completed, and dependencies are declared between nodes rather than between data sets. Running on a single machine with Python's multiprocessing removes the need for a distributed task queue and keeps debugging straightforward.
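
The make-style ordering described above can be sketched with Python's standard library alone (graphlib, Python 3.9+). This is an illustration of the idea, not Mara Pipelines' own API: the node names and the run_in_dependency_order helper are hypothetical, and nodes run one at a time here rather than in parallel processes.

```python
from graphlib import TopologicalSorter

# Hypothetical node names; each maps to the set of upstream nodes it waits for.
upstreams = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "export": {"aggregate"},
    "audit_log": {"extract"},
}


def run_in_dependency_order(upstreams, run_node):
    """Run every node once, only after all of its upstream nodes have finished."""
    sorter = TopologicalSorter(upstreams)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    finished = []
    while sorter.is_active():
        for node in sorter.get_ready():
            run_node(node)
            finished.append(node)
            sorter.done(node)  # unblocks nodes that were waiting on this one
    return finished


order = run_in_dependency_order(upstreams, run_node=lambda node: print("running", node))
```

Only completion matters in this sketch: "clean" never inspects what "extract" produced, it simply waits until "extract" is done.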

Key Capabilities

The extensive web UI serves as the primary interface for inspecting, running, and debugging pipelines. Each pipeline displays dependency graphs, 30-day runtime charts, node priority tables, and execution logs. Tasks show upstream/downstream relationships, historical performance, and command output. Cost-based priority queues automatically run expensive nodes first based on recorded runtimes.
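
A minimal sketch of the cost-based idea, assuming a node's cost is simply its average recorded runtime; the framework's actual cost formula may differ. The node names, runtimes, and the schedule_by_cost helper are hypothetical, and the sketch produces a sequential order where a real runner would hand the most expensive ready node to the next free worker.

```python
import heapq

# Hypothetical average runtimes in seconds, as if read from recorded runs.
avg_runtime = {
    "load_orders": 120.0,
    "load_users": 15.0,
    "load_events": 600.0,
    "build_report": 45.0,
}
# Hypothetical dependencies: build_report waits for two of the load nodes.
upstreams = {
    "load_orders": set(),
    "load_users": set(),
    "load_events": set(),
    "build_report": {"load_orders", "load_users"},
}


def schedule_by_cost(upstreams, cost):
    """Return a run order that always picks the most expensive ready node next."""
    remaining = {node: set(deps) for node, deps in upstreams.items()}
    # heapq is a min-heap, so costs are negated to pop the largest cost first.
    ready = [(-cost[node], node) for node, deps in remaining.items() if not deps]
    heapq.heapify(ready)
    order = []
    while ready:
        _, node = heapq.heappop(ready)
        order.append(node)
        del remaining[node]
        for other, deps in remaining.items():
            if node in deps:
                deps.discard(node)
                if not deps:  # all upstream nodes finished, so it becomes ready
                    heapq.heappush(ready, (-cost[other], other))
    return order


order = schedule_by_cost(upstreams, avg_runtime)
print(order)
```

Starting the longest-running nodes first tends to shorten total wall-clock time when several workers are available, because cheap nodes can fill the gaps while the expensive ones are still running.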

Deployment

Install via pip and integrate into Flask applications. A PostgreSQL database stores runtime information and incremental processing status. Note: because the framework relies heavily on process forking, it does not run natively on Windows; use Docker or WSL instead.

Highlights

Declarative Python pipeline definitions with task dependencies and bash command execution

PostgreSQL-backed execution tracking with automatic schema migration and runtime storage

Comprehensive web UI for visualizing dependencies, monitoring performance, and running tasks

Cost-based priority queues that schedule expensive nodes first using historical runtime data

Pros

Simple single-machine execution eliminates distributed system complexity and debugging challenges
Rich web interface provides immediate visibility into pipeline structure, performance, and logs
Declarative Python code makes pipelines version-controllable and easy to review
Cost-based scheduling optimizes execution order based on actual historical performance

Considerations

Single-machine architecture limits scalability for very large or compute-intensive workloads
PostgreSQL is required even for simple pipelines without complex state needs
Does not run natively on Windows; requires Docker or WSL workarounds
Documentation is work-in-progress; relies heavily on example projects for onboarding

Managed products teams compare it with

When teams consider Mara Pipelines, these hosted platforms usually appear on the same shortlist.

Airbyte

Open-source data integration engine for ELT pipelines across data sources

Azure Data Factory

Cloud-based data integration service to create, schedule, and orchestrate ETL/ELT data pipelines at scale

Fivetran

Managed ELT data pipelines into warehouses

Fit guide

Great for

Teams building medium-scale ETL workflows on single servers who prioritize simplicity over distribution
Data engineers comfortable with PostgreSQL who want transparent, debuggable pipeline execution
Organizations seeking a middle ground between custom scripts and complex orchestration platforms
Flask-based applications needing integrated pipeline management with visual monitoring

Not ideal when

Workloads need large-scale distributed processing with horizontal scaling across multiple machines
The environment is Windows-native and neither Docker nor WSL is available
Teams require real-time streaming or event-driven data flows rather than batch dependencies
Projects need extensive out-of-the-box integrations with cloud services and SaaS platforms

How teams use it

Daily data warehouse refresh

Schedule nightly ETL jobs that extract from sources, transform via SQL, and load to PostgreSQL with automatic retry and performance tracking

Multi-stage reporting pipeline

Chain data extraction, cleaning, aggregation, and export tasks with dependency management and visual progress monitoring

Database migration orchestration

Coordinate complex schema changes and data backfills across multiple tables with rollback-friendly task isolation

Incremental data synchronization

Track file dependencies and timestamps to process only changed data sources, reducing runtime and resource consumption
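
The timestamp-based approach can be sketched as follows. This is a simplified illustration under stated assumptions: state lives in an in-memory dict (the framework keeps its incremental processing status in PostgreSQL), and the find_changed and mark_processed helpers and the file names are hypothetical.

```python
import os
import tempfile


def find_changed(paths, last_seen):
    """Return the paths whose modification time differs from the recorded one."""
    return [p for p in paths if last_seen.get(p) != os.stat(p).st_mtime_ns]


def mark_processed(paths, last_seen):
    """Record the current modification time of each successfully processed path."""
    for p in paths:
        last_seen[p] = os.stat(p).st_mtime_ns


with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.csv")
    b = os.path.join(tmp, "b.csv")
    for path in (a, b):
        with open(path, "w") as f:
            f.write("id\n1\n")

    state = {}  # stand-in for persisted processing state

    first = find_changed([a, b], state)   # nothing recorded yet: both files
    mark_processed(first, state)

    second = find_changed([a, b], state)  # nothing changed since the first run

    # Simulate an update to b.csv by moving its modification time forward.
    st = os.stat(b)
    os.utime(b, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))

    third = find_changed([a, b], state)   # only b.csv needs reprocessing
```

Recording the timestamp only after a file has been processed means a failed run leaves the file marked as changed, so the next run picks it up again.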

Tech snapshot

Python: 84%
JavaScript: 13%
PLpgSQL: 1%
CSS: 1%
Makefile: 1%

Frequently asked questions

Does Mara Pipelines require a distributed task queue like Celery?

No. Mara Pipelines uses Python's multiprocessing for single-machine execution, eliminating the need for distributed task queues and simplifying debugging.

Can I run Mara Pipelines on Windows?

Not natively due to heavy use of forking. You must use Docker or the Windows Subsystem for Linux (WSL) to run it on Windows.

Is PostgreSQL required for all pipelines?

Yes. A PostgreSQL database stores runtime information, execution logs, and incremental processing state, so it is needed even for simple pipelines. The framework is also designed around PostgreSQL as its data processing engine.

How does cost-based priority scheduling work?

Mara Pipelines records historical run times for each node and schedules nodes with higher costs (longer runtimes) first to optimize overall pipeline completion time.

Can I integrate Mara Pipelines into an existing Flask application?

Yes. The framework provides web UI components designed for Flask integration. Reference the mara example projects for implementation patterns.

Project at a glance

Dormant

Stars: 2,086
Watchers: 2,086
Forks: 99

License: MIT
Repo age: 7 years
Last commit: 2 years ago
Primary language: Python
