Apache Samza

Scalable, fault-tolerant stream processing with Kafka and YARN

Apache Samza delivers scalable, fault‑tolerant stream processing with a simple API, managed state, and tight integration with Kafka and YARN for Java and Scala workloads.

Overview

Apache Samza is a distributed stream processing framework that uses Apache Kafka for ordered, replayable messaging and Apache YARN for resource isolation, security, and fault tolerance. Its simple callback‑based API is comparable to MapReduce, which makes stateful jobs straightforward to write in Java or Scala.
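To make the callback model concrete, here is a minimal, self-contained sketch. The `Task` and `Collector` interfaces, the `CountPerKeyTask` class, and the `run` driver are hypothetical stand-ins written for this illustration; they are not Samza's real classes, and a real job would implement the framework's own task interface and run inside its containers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT Samza's real interfaces.
interface Collector {
    void send(String stream, String key, long value);
}

interface Task {
    // A runtime would invoke this callback once per incoming message.
    void process(String key, String message, Collector collector);
}

// Stateful task: keeps a running count of messages seen per key.
class CountPerKeyTask implements Task {
    private final Map<String, Long> counts = new HashMap<>();

    @Override
    public void process(String key, String message, Collector collector) {
        long updated = counts.merge(key, 1L, Long::sum);
        collector.send("counts", key, updated);
    }
}

class CallbackSketch {
    // Drives the task the way a runtime might: one callback per message, in order.
    static List<String> run(Task task, String[][] messages) {
        List<String> out = new ArrayList<>();
        Collector collector = (stream, key, value) -> out.add(stream + ":" + key + "=" + value);
        for (String[] m : messages) {
            task.process(m[0], m[1], collector);
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] input = {{"alice", "click"}, {"bob", "click"}, {"alice", "view"}};
        System.out.println(run(new CountPerKeyTask(), input));
    }
}
```

The point of the sketch is the shape of the API: the developer writes only the per-message callback and its state, while partitioning, ordering, and delivery are left to the runtime.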

Capabilities & Deployment

Samza snapshots each task's state and restores it to a consistent point after a failure, supports large per‑partition state, and relies on Kafka for message durability. It runs on YARN 2.x and 3.x clusters and builds with Gradle against Java 8 or Java 11 and Scala 2.11 or 2.12. Kafka is the default messaging system, but the pluggable architecture lets you connect others. To deploy, build with `./gradlew clean build` and launch jobs with the Samza shell tools.

Who Should Use It

Ideal for teams already invested in the Hadoop ecosystem that need reliable, at‑least‑once processing at scale and managed state without writing custom checkpoint logic.

Highlights

Simple callback‑based API comparable to MapReduce
Managed state with automatic snapshotting and restoration
Fault‑tolerant execution using YARN container migration
Pluggable architecture for alternative messaging systems

Pros

  • Tight integration with Kafka ensures ordered, durable streams
  • YARN provides robust resource isolation and security
  • Scales horizontally by partitioning at every level
  • Mature Apache project with active community

Considerations

  • Production deployments rely on YARN, limiting container‑native flexibility
  • Some modules lack full Java 11 support
  • Requires familiarity with Hadoop/YARN configuration
  • Learning curve for state management concepts

Managed products teams compare with

When teams consider Apache Samza, these hosted platforms usually appear on the same shortlist.

Aiven for Apache Flink

Fully managed Apache Flink service by Aiven.

Amazon Managed Service for Apache Flink

Serverless Apache Flink for real-time stream processing on AWS.

Azure Stream Analytics

Serverless real-time analytics with SQL on streams.

Fit guide

Great for

  • Teams needing at‑least‑once, stateful stream processing
  • Organizations already running Hadoop/YARN clusters
  • Java or Scala developers building low‑latency pipelines
  • Workloads that require large per‑partition state

Not ideal when

  • Projects preferring Kubernetes or other container orchestrators
  • Pure Python or lightweight data‑flow use cases
  • Small, ad‑hoc scripts where a heavyweight cluster is overkill
  • Ultra‑low‑latency scenarios beyond YARN’s scheduling granularity

How teams use it

Real‑time fraud detection

Processes transaction events from Kafka, maintains per‑account state, and flags suspicious activity with at‑least‑once delivery guarantees.

Clickstream aggregation

Aggregates website click events into hourly counts, persisting state to Kafka changelogs for fault‑tolerant roll‑ups.
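The hourly roll-up can be sketched as a small windowed counter. This is a minimal illustration in plain Java, not Samza code: the `HourlyCounts` class and its methods are hypothetical names, and the in-memory map stands in for state that a real job would keep in a managed store backed by a changelog.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; state lives in memory only.
class HourlyCounts {
    private static final long HOUR_MS = 3_600_000L;
    // Maps the start of each hour (epoch millis) to the number of events in that hour.
    private final Map<Long, Long> countsByHourStart = new TreeMap<>();

    // Assign the event to the hour window that contains its timestamp.
    void record(long eventTimeMs) {
        long hourStart = Math.floorDiv(eventTimeMs, HOUR_MS) * HOUR_MS;
        countsByHourStart.merge(hourStart, 1L, Long::sum);
    }

    // Return a copy so callers cannot mutate the internal state.
    Map<Long, Long> snapshot() {
        return new TreeMap<>(countsByHourStart);
    }
}

class HourlyCountsSketch {
    public static void main(String[] args) {
        HourlyCounts counts = new HourlyCounts();
        counts.record(0L);
        counts.record(3_599_999L);  // still inside the first hour
        counts.record(3_600_000L);  // first millisecond of the second hour
        System.out.println(counts.snapshot());
    }
}
```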

IoT sensor enrichment

Joins incoming sensor streams with reference data, storing enriched results while handling node failures transparently.

Log processing for alerting

Consumes log streams, applies pattern matching, and triggers alerts without losing messages, even during cluster outages.

Tech snapshot

Java 87%
Scala 12%
Python 1%
Shell 1%
Scaml 1%
Less 1%

Tags

scala, samza, big-data

Frequently asked questions

Does Samza require Kafka as the messaging system?

No. Kafka is the default and fully supported messaging system, but Samza’s pluggable API allows integration with other messaging systems.

Which Java runtimes are supported?

Samza runs on Java 8 and Java 11; Java 11 requires YARN 3.3.4+ and the `samza-yarn3` module.

How does Samza persist state?

Each task keeps its state in a local store and writes every change to a Kafka changelog topic. After a failure, the state is restored automatically by replaying that changelog.
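The changelog idea can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not Samza's implementation: `ChangelogStore` and its methods are hypothetical names, and the `List` stands in for a durable Kafka changelog topic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model; a real changelog would be a durable Kafka topic.
class ChangelogStore {
    private final Map<String, Long> local = new HashMap<>();
    private final List<String[]> changelog;  // each entry is {key, value}

    ChangelogStore(List<String[]> changelog) {
        this.changelog = changelog;
    }

    // Write to the local store and append the same change to the changelog.
    void put(String key, long value) {
        local.put(key, value);
        changelog.add(new String[] {key, Long.toString(value)});
    }

    Long get(String key) {
        return local.get(key);
    }

    // Rebuild a fresh store by replaying the changelog in order; later entries win.
    static ChangelogStore restore(List<String[]> changelog) {
        ChangelogStore store = new ChangelogStore(changelog);
        for (String[] entry : changelog) {
            store.local.put(entry[0], Long.parseLong(entry[1]));
        }
        return store;
    }
}

class ChangelogSketch {
    public static void main(String[] args) {
        List<String[]> log = new ArrayList<>();
        ChangelogStore original = new ChangelogStore(log);
        original.put("acct-1", 3L);
        original.put("acct-2", 7L);
        original.put("acct-1", 4L);
        // Simulate losing the container: only the changelog survives.
        ChangelogStore recovered = ChangelogStore.restore(log);
        System.out.println(recovered.get("acct-1") + " " + recovered.get("acct-2"));
    }
}
```

Because every write is appended to the log before the store is considered up to date, replaying the log from the beginning reproduces the latest value for each key.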

Can Samza be run on Kubernetes?

Samza does not natively support Kubernetes; you would need to run YARN on Kubernetes or build a custom integration.

What is the recommended way to build Samza from source?

Use the Gradle wrapper: `./gradlew clean build`. Scala version can be selected with `-PscalaSuffix=2.12`.

Project at a glance

Stable
Stars: 838
Watchers: 838
Forks: 331
License: Apache-2.0
Repo age: 10 years old
Last commit: 9 months ago
Primary language: Java

Last synced yesterday