CloudQuery

High-performance ELT framework powered by Apache Arrow

CloudQuery is a composable data movement framework that extracts from cloud infrastructure and SaaS APIs to any destination, running entirely on your infrastructure.

Overview

What is CloudQuery?

CloudQuery is a high-performance data movement framework designed for developers who need complete control over their data pipelines. Built on Apache Arrow, it extracts data from cloud infrastructure, SaaS platforms, and APIs and delivers it to any destination, all while running entirely on your infrastructure.

Who Uses CloudQuery?

Engineering and security teams leverage CloudQuery for cloud security posture management (CSPM), asset inventory, FinOps, and attack surface management. Data engineers use it as a flexible ELT platform to eliminate data silos across security, infrastructure, marketing, and finance teams.

Key Capabilities

The framework offers a code-first, extensible plugin system with no vendor lock-in. Its composable architecture integrates with your existing languages, destinations, and orchestrators. Specialized plugins provide first-class support for complex data sources including AWS, GCP, Azure, and hundreds of other integrations. Because your data never touches external servers, CloudQuery fits regulated, secure, and performance-critical environments where privacy is paramount.

Built in Go and distributed under MPL-2.0, CloudQuery combines the flexibility of open-source tooling with enterprise-grade performance for large-scale data movement.

Highlights

Apache Arrow-powered engine for high-performance data movement at scale

Runs entirely on your infrastructure with zero data egress to external servers

Extensible plugin system with hundreds of source and destination integrations

Code-first architecture with multi-language SDK support and no vendor lock-in

Pros

Complete data privacy with on-premises execution model
Specialized plugins for cloud infrastructure, security, and FinOps data
Composable design integrates with existing tools and orchestrators
High performance for large-scale data movement using Apache Arrow

Considerations

Requires managing your own infrastructure and orchestration
Code-first approach may have a steeper learning curve than GUI-based tools
Self-hosted model means you handle scaling and maintenance
Plugin ecosystem maturity varies across different integrations

Managed products teams compare with

When teams consider CloudQuery, these hosted platforms usually appear on the same shortlist.

Airbyte

Open-source data integration engine for ELT pipelines across data sources

Azure Data Factory

Cloud-based data integration service to create, schedule, and orchestrate ETL/ELT data pipelines at scale

Fivetran

Managed ELT data pipelines into warehouses


Fit guide

Great for

Security teams needing CSPM or attack surface management across multi-cloud
Data engineers building custom ELT pipelines with strict privacy requirements
FinOps teams consolidating billing data from multiple cloud providers
Organizations requiring on-premises data movement for compliance

Not ideal when

Your team wants a fully managed SaaS solution without infrastructure overhead
Your users are non-technical and prefer point-and-click configuration interfaces
The project is small enough that lightweight scripts suffice
Your organization cannot self-host and maintain data infrastructure

How teams use it

Cloud Security Posture Management

Monitor and enforce security policies across AWS, GCP, and Azure infrastructure with continuous compliance scanning and unified visibility.

Multi-Cloud Asset Inventory

Collect and centralize cloud configuration data from all major providers into a single queryable database for governance and auditing.
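
As a rough illustration of what "a single queryable database" can look like, the sketch below loads a few made-up asset rows into an in-memory SQLite database and counts publicly exposed resources per provider. The table name `cloud_assets`, its columns, and the sample rows are hypothetical and are not CloudQuery's actual schema; a real deployment would query whatever tables the chosen source plugins write to the destination.

```python
import sqlite3

# Made-up sample data: (provider, resource_type, resource_id, public flag).
SAMPLE_ROWS = [
    ("aws", "s3_bucket", "bucket-a", 1),
    ("aws", "ec2_instance", "i-001", 0),
    ("gcp", "storage_bucket", "bucket-b", 1),
    ("azure", "storage_account", "acct-c", 0),
]


def build_inventory(rows):
    """Create an in-memory database with one hypothetical inventory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cloud_assets ("
        "provider TEXT, resource_type TEXT, resource_id TEXT, public INTEGER)"
    )
    conn.executemany("INSERT INTO cloud_assets VALUES (?, ?, ?, ?)", rows)
    return conn


def public_assets_by_provider(conn):
    """Count publicly exposed assets per provider, sorted by provider name."""
    cur = conn.execute(
        "SELECT provider, COUNT(*) FROM cloud_assets "
        "WHERE public = 1 GROUP BY provider ORDER BY provider"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_inventory(SAMPLE_ROWS)
    for provider, count in public_assets_by_provider(conn):
        print(provider, count)
```

Because every provider lands in the same database, one SQL query can answer a governance question that would otherwise require a separate console or API per cloud.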

Cloud FinOps Optimization

Unify billing data across cloud providers to identify cost-saving opportunities and track spending trends in real time.
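
To make the idea of unifying billing data concrete, here is a minimal sketch that maps two differently shaped billing records onto one common shape and sums cost per month. The field names (`unblended_cost`, `invoice_month`, and so on) and the sample values are assumptions chosen for illustration, not the providers' real export schemas or anything CloudQuery defines.

```python
from collections import defaultdict

# Hypothetical raw billing rows; the two providers use different field names
# and month formats on purpose.
AWS_ROWS = [{"service": "ec2", "unblended_cost": "10.00", "month": "2024-01"}]
GCP_ROWS = [{"sku": "compute", "cost": 4.5, "invoice_month": "202401"}]


def normalize_aws(row):
    """Map a hypothetical AWS-style row onto the common record shape."""
    return {
        "provider": "aws",
        "month": row["month"],
        "cost": float(row["unblended_cost"]),
    }


def normalize_gcp(row):
    """Map a hypothetical GCP-style row onto the common record shape."""
    raw_month = row["invoice_month"]  # e.g. "202401"
    return {
        "provider": "gcp",
        "month": f"{raw_month[:4]}-{raw_month[4:]}",
        "cost": float(row["cost"]),
    }


def monthly_totals(records):
    """Sum cost per month across all providers."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["month"]] += rec["cost"]
    return dict(totals)


if __name__ == "__main__":
    unified = [normalize_aws(r) for r in AWS_ROWS]
    unified += [normalize_gcp(r) for r in GCP_ROWS]
    print(monthly_totals(unified))
```

Once records share one shape, spending trends become a simple aggregation rather than a per-provider reporting exercise.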

AI Model Data Pipelines

Feed LLM pipelines and AI applications with high-volume data from diverse sources using Apache Arrow's efficient columnar format.
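
The benefit of a columnar format can be sketched without any Arrow dependency. The example below uses only Python's standard `array` module to pivot row-oriented records into one contiguous typed buffer per field; it illustrates the general columnar idea and is not Apache Arrow's API or memory layout. The sample records and function names are made up for the illustration.

```python
from array import array

# Row-oriented records, as an API might return them (made-up sample data).
ROWS = [
    {"id": 1, "cost": 12.5},
    {"id": 2, "cost": 7.25},
    {"id": 3, "cost": 30.0},
]


def to_columns(rows):
    """Pivot row dicts into one contiguous typed array per field."""
    return {
        "id": array("q", (r["id"] for r in rows)),      # 64-bit signed ints
        "cost": array("d", (r["cost"] for r in rows)),  # 64-bit floats
    }


def total_cost(columns):
    """Aggregate one column; the "id" column is never read."""
    return sum(columns["cost"])


if __name__ == "__main__":
    cols = to_columns(ROWS)
    print(total_cost(cols))
```

An aggregation over one field only scans that field's buffer, which is the property that makes columnar layouts attractive for high-volume analytical workloads.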

Tech snapshot

Go: 91%

Python: 3%

Makefile: 2%

TypeScript: 2%

Java: 1%

Smarty: 1%

Frequently asked questions

Does CloudQuery store or process my data on external servers?

No. CloudQuery runs entirely on your infrastructure. Your data never touches CloudQuery's servers, which helps you meet privacy and data residency requirements.

What data sources and destinations does CloudQuery support?

CloudQuery supports hundreds of integrations including AWS, GCP, Azure, Kubernetes, GitHub, and many SaaS platforms. Destinations include PostgreSQL, BigQuery, Snowflake, S3, and more. Check the integrations hub for the full list.

How does CloudQuery compare to Airbyte or Fivetran?

CloudQuery is code-first and optimized for cloud infrastructure and security data, running on your infrastructure. It excels at CSPM, asset inventory, and FinOps use cases with specialized plugins, while Airbyte and Fivetran focus more on SaaS-to-warehouse replication.

Can I build custom plugins for proprietary data sources?

Yes. CloudQuery provides an open plugin SDK supporting multiple languages. You can develop, extend, and ship custom plugins without vendor approval or lock-in.

What license does CloudQuery use?

The CloudQuery framework, CLI, SDK, and some integrations are licensed under MPL-2.0, which allows commercial use and applies file-level copyleft requirements to modifications of the licensed files.

Project at a glance

Active


Stars: 6,464
Watchers: 6,464
Forks: 549

License: MPL-2.0

Repo age: 5 years old

Last commit: 4 days ago

Self-hosting: Supported

Primary language: Go

Last synced 2 days ago