
DataProfiler

Instantly profile data and uncover hidden sensitive information

Data Profiler loads CSV, JSON, Avro, Parquet, and text files, then automatically generates statistical summaries and schema insights and applies AI‑driven PII/NPI detection, all with a few lines of Python.


Overview

Data Profiler is a Python library that simplifies data analysis, monitoring, and sensitive data detection. It targets data scientists, engineers, and compliance teams who need fast, reproducible insights from structured, unstructured, or graph datasets.

Capabilities & Deployment

With a single import, the library auto‑detects file formats (CSV, AVRO, Parquet, JSON, plain text, URLs) and loads them into a Pandas‑compatible DataFrame. The profiling engine produces a comprehensive data profile—including global and column‑level statistics, schema inference, and AI‑powered entity recognition for PII/NPI. A pre‑trained deep‑learning model handles sensitive data detection out of the box, and developers can extend it with custom entities or replace the pipeline entirely. Installation options range from a full package with all ML dependencies to slimmer builds that omit heavy libraries like TensorFlow, allowing flexible deployment in notebooks, CI pipelines, or production ETL jobs.
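In practice that flow is only a few lines. A minimal sketch using the library's top-level `Data` and `Profiler` entry points; the file name is a placeholder:

```python
import json

import dataprofiler as dp

# Auto-detect the file format and load it (data.data exposes a pandas DataFrame)
data = dp.Data("your_file.csv")

# Profile it: global stats, column-level stats, and PII/NPI entity labels
profile = dp.Profiler(data)

# Render a compact, JSON-friendly report
report = profile.report(report_options={"output_format": "compact"})
print(json.dumps(report, indent=4))
```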

Highlights

Auto‑detects and loads multiple file formats into a Pandas DataFrame
Generates global and column‑level statistics with a single command
Built‑in deep‑learning model for PII/NPI detection
Modular installation extras for full, ML‑only, or lightweight use

Pros

  • Fast, one‑line data loading and profiling
  • Comprehensive statistical and entity insights
  • Extensible with custom entity pipelines
  • Flexible install options to suit resource constraints

Considerations

  • Full feature set requires heavy ML dependencies (e.g., TensorFlow)
  • Correlation matrix currently disabled by default
  • Limited to Python environments
  • Large datasets may need additional memory tuning

Managed products teams compare with

When teams consider DataProfiler, these hosted platforms usually appear on the same shortlist.


Amazon Macie

Managed sensitive data discovery and protection for Amazon S3.


BigID

Data intelligence platform focused on data privacy, security, and governance through sensitive data discovery and classification.


OneTrust

Unified trust platform for privacy, consent, data governance, and compliance automation.

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Data scientists needing quick exploratory analysis
  • Compliance teams monitoring for sensitive information
  • ETL pipelines that require automated data quality checks
  • Developers extending entity detection with domain‑specific labels

Not ideal when

  • Real‑time streaming data where latency is critical
  • Non‑Python ecosystems without a compatible bridge
  • Distributed big‑data platforms that need Spark‑style scaling
  • Environments that cannot install heavyweight ML libraries

How teams use it

Rapid data audit of a CSV file

Produces a compact JSON report with schema, statistics, and identified PII entities.

Automated data quality monitoring in an ETL workflow

Generates daily profiles and flags anomalies or new sensitive data for downstream alerts.
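A hedged sketch of what such a daily check might look like, using the profiler's save/load and diff helpers; the file paths are placeholders and the alerting step is left to the surrounding pipeline:

```python
import dataprofiler as dp

# Profile today's extract
todays_profile = dp.Profiler(dp.Data("daily_extract.csv"))

# Compare against yesterday's saved profile to spot drift or new sensitive entities
baseline = dp.Profiler.load("baseline_profile.pkl")
drift = todays_profile.diff(baseline)

# Roll the baseline forward; downstream alerting can key off the diff contents
todays_profile.save(filepath="baseline_profile.pkl")
```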

Custom entity detection for proprietary identifiers

Extends the pre‑trained model with user‑defined entities, enabling tailored compliance checks.
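As a rough sketch only (the training data, label set, and `fit` parameters shown here are illustrative assumptions rather than the project's exact recipe), a trainable labeler can be fit on domain-specific samples:

```python
import pandas as pd

import dataprofiler as dp

# Tiny illustrative training set: sample values and the entity each represents.
# 'CLAIM_ID' stands in for a hypothetical proprietary identifier.
train_df = pd.DataFrame(
    [
        ["CLM-000123", "CLAIM_ID"],
        ["CLM-774210", "CLAIM_ID"],
        ["help@example.com", "EMAIL_ADDRESS"],
    ],
    columns=["sample", "label"],
)

# Start from a trainable labeler and fit it on the custom label set.
# Consult the labeler documentation for the exact label requirements
# (e.g., reserved default labels) expected by the model.
labeler = dp.DataLabeler(labeler_type="structured", trainable=True)
labeler.fit(
    x=train_df["sample"],
    y=train_df["label"],
    labels=["UNKNOWN", "CLAIM_ID", "EMAIL_ADDRESS"],
    epochs=2,
)
```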

Documentation generation for data assets

Creates reproducible data profiles that can be embedded in data catalogs or stakeholder reports.
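One way to turn a profile into shareable documentation is to write the report out as JSON; a small sketch with placeholder file names, using the `pretty` output format for readability:

```python
import json

import dataprofiler as dp

data = dp.Data("customers.parquet")
profile = dp.Profiler(data)

# "pretty" favors human-readable output; other formats target serialization
report = profile.report(report_options={"output_format": "pretty"})

# Persist the report so it can be attached to a data catalog entry
with open("customers_profile.json", "w") as fh:
    json.dump(report, fh, indent=4, default=str)
```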

Tech snapshot

Python 100%
Makefile 1%
CSS 1%
HTML 1%
Batchfile 1%
Shell 1%

Tags

tabular-data, sensitive-data, machine-learning, data-analysis, pandas, dataset, nlp, python, pii, avro, data-profiling, graph-data, csv, privacy, network-data, gdpr, data-labels, security, data-science, npi

Frequently asked questions

Which file formats does Data Profiler support?

CSV, AVRO, Parquet, JSON, plain text, and URLs are auto‑detected and loaded.

How can I add custom entity types?

You can train or import a new model and register it with the profiler, or replace the labeling pipeline entirely.
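For example (a sketch, assuming a labeler previously saved to disk; the directory path is a placeholder), the profiler options can point at a custom labeler instead of the default model:

```python
import dataprofiler as dp

options = dp.ProfilerOptions()
# Swap in a custom labeler saved on disk; a labeler object can alternatively
# be passed via the 'data_labeler.data_labeler_object' option
options.set({"data_labeler.data_labeler_dirpath": "path/to/my_labeler"})

data = dp.Data("your_file.csv")
profile = dp.Profiler(data, options=options)
```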

What is the difference between the install extras?

`DataProfiler[full]` includes all ML dependencies and reporting; `DataProfiler[ml]` provides ML components without report generation; `DataProfiler[reports]` skips heavy ML libraries.

Do I need TensorFlow for basic profiling?

No. TensorFlow is only required for the default sensitive data detection. Use the `reports` extra to avoid it.

Can I enable the correlation matrix?

Yes. Enable it in the profiler options by setting the correlation option's `is_enabled` flag to `True`, as sketched below.
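A minimal sketch of enabling it through the profiler options; the file name is a placeholder:

```python
import dataprofiler as dp

options = dp.ProfilerOptions()
# Correlation is off by default; switch it on before profiling
options.structured_options.correlation.is_enabled = True

data = dp.Data("your_file.csv")
profile = dp.Profiler(data, options=options)
report = profile.report()  # the correlation matrix appears with the global statistics
```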

Project at a glance

Stable
Stars: 1,539
Watchers: 1,539
Forks: 181
License: Apache-2.0
Repo age: 5 years old
Last commit: 4 months ago
Primary language: Python

Last synced yesterday