
DataProfiler

Instantly profile data and uncover hidden sensitive information

Data Profiler loads CSV, JSON, Avro, Parquet, and text files, then automatically generates statistical summaries and schema insights and applies AI‑driven PII/NPI detection, all with a few lines of Python.


Overview

Data Profiler is a Python library that simplifies data analysis, monitoring, and sensitive data detection. It targets data scientists, engineers, and compliance teams who need fast, reproducible insights from structured, unstructured, or graph datasets.

Capabilities & Deployment

With a single import, the library auto‑detects file formats (CSV, AVRO, Parquet, JSON, plain text, URLs) and loads them into a Pandas‑compatible DataFrame. The profiling engine produces a comprehensive data profile—including global and column‑level statistics, schema inference, and AI‑powered entity recognition for PII/NPI. A pre‑trained deep‑learning model handles sensitive data detection out of the box, and developers can extend it with custom entities or replace the pipeline entirely. Installation options range from a full package with all ML dependencies to slimmer builds that omit heavy libraries like TensorFlow, allowing flexible deployment in notebooks, CI pipelines, or production ETL jobs.
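In practice that flow is only a few lines. A minimal sketch using the library's top-level `Data` and `Profiler` entry points; the file name is a placeholder:

```python
import json

import dataprofiler as dp

# Auto-detect the file format and load it (data.data exposes a pandas DataFrame)
data = dp.Data("your_file.csv")

# Profile it: global stats, column-level stats, and PII/NPI entity labels
profile = dp.Profiler(data)

# Render a compact, JSON-friendly report
report = profile.report(report_options={"output_format": "compact"})
print(json.dumps(report, indent=4))
```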

Highlights

Auto‑detects and loads multiple file formats into a Pandas DataFrame
Generates global and column‑level statistics with a single command
Built‑in deep‑learning model for PII/NPI detection
Modular installation extras for full, ML‑only, or lightweight use

Pros

  • Fast, one‑line data loading and profiling
  • Comprehensive statistical and entity insights
  • Extensible with custom entity pipelines
  • Flexible install options to suit resource constraints

Considerations

  • Full feature set requires heavy ML dependencies (e.g., TensorFlow)
  • Correlation matrix currently disabled by default
  • Limited to Python environments
  • Large datasets may need additional memory tuning

Managed products teams compare with

When teams consider DataProfiler, these hosted platforms usually appear on the same shortlist.


Amazon Macie

Managed sensitive data discovery and protection for Amazon S3.


BigID

Data intelligence platform focused on data privacy, security, and governance through sensitive data discovery and classification.


OneTrust

Unified trust platform for privacy, consent, data governance, and compliance automation.

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Data scientists needing quick exploratory analysis
  • Compliance teams monitoring for sensitive information
  • ETL pipelines that require automated data quality checks
  • Developers extending entity detection with domain‑specific labels

Not ideal when

  • Real‑time streaming data where latency is critical
  • Non‑Python ecosystems without a compatible bridge
  • Distributed big‑data platforms that need Spark‑style scaling
  • Environments that cannot install heavyweight ML libraries

How teams use it

Rapid data audit of a CSV file

Produces a compact JSON report with schema, statistics, and identified PII entities.

Automated data quality monitoring in an ETL workflow

Generates daily profiles and flags anomalies or new sensitive data for downstream alerts.
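A hedged sketch of what such a daily check might look like, using the profiler's save/load and diff helpers; the file paths are placeholders and the alerting step is left to the surrounding pipeline:

```python
import dataprofiler as dp

# Profile today's extract
todays_profile = dp.Profiler(dp.Data("daily_extract.csv"))

# Compare against yesterday's saved profile to spot drift or new sensitive entities
baseline = dp.Profiler.load("baseline_profile.pkl")
drift = todays_profile.diff(baseline)

# Roll the baseline forward; downstream alerting can key off the diff contents
todays_profile.save(filepath="baseline_profile.pkl")
```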

Custom entity detection for proprietary identifiers

Extends the pre‑trained model with user‑defined entities, enabling tailored compliance checks.
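As a rough sketch only (the training data, label set, and `fit` parameters shown here are illustrative assumptions rather than the project's exact recipe), a trainable labeler can be fit on domain-specific samples:

```python
import pandas as pd

import dataprofiler as dp

# Tiny illustrative training set: sample values and the entity each represents.
# 'CLAIM_ID' stands in for a hypothetical proprietary identifier.
train_df = pd.DataFrame(
    [
        ["CLM-000123", "CLAIM_ID"],
        ["CLM-774210", "CLAIM_ID"],
        ["help@example.com", "EMAIL_ADDRESS"],
    ],
    columns=["sample", "label"],
)

# Start from a trainable labeler and fit it on the custom label set.
# Consult the labeler documentation for the exact label requirements
# (e.g., reserved default labels) expected by the model.
labeler = dp.DataLabeler(labeler_type="structured", trainable=True)
labeler.fit(
    x=train_df["sample"],
    y=train_df["label"],
    labels=["UNKNOWN", "CLAIM_ID", "EMAIL_ADDRESS"],
    epochs=2,
)
```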

Documentation generation for data assets

Creates reproducible data profiles that can be embedded in data catalogs or stakeholder reports.
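One way to turn a profile into shareable documentation is to write the report out as JSON; a small sketch with placeholder file names, using the `pretty` output format for readability:

```python
import json

import dataprofiler as dp

data = dp.Data("customers.parquet")
profile = dp.Profiler(data)

# "pretty" favors human-readable output; other formats target serialization
report = profile.report(report_options={"output_format": "pretty"})

# Persist the report so it can be attached to a data catalog entry
with open("customers_profile.json", "w") as fh:
    json.dump(report, fh, indent=4, default=str)
```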

Tech snapshot

Python 100%
Makefile 1%
CSS 1%
HTML 1%
Batchfile 1%
Shell 1%

Tags

tabular-data, sensitive-data, machine-learning, data-analysis, pandas, dataset, nlp, python, pii, avro, data-profiling, graph-data, csv, privacy, network-data, gdpr, data-labels, security, data-science, npi

Frequently asked questions

Which file formats does Data Profiler support?

CSV, AVRO, Parquet, JSON, plain text, and URLs are auto‑detected and loaded.

How can I add custom entity types?

You can train or import a new model and register it with the profiler, or replace the labeling pipeline entirely.
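For example (a sketch, assuming a labeler previously saved to disk; the directory path is a placeholder), the profiler options can point at a custom labeler instead of the default model:

```python
import dataprofiler as dp

options = dp.ProfilerOptions()
# Swap in a custom labeler saved on disk; a labeler object can alternatively
# be passed via the 'data_labeler.data_labeler_object' option
options.set({"data_labeler.data_labeler_dirpath": "path/to/my_labeler"})

data = dp.Data("your_file.csv")
profile = dp.Profiler(data, options=options)
```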

What is the difference between the install extras?

`DataProfiler[full]` includes all ML dependencies and reporting; `DataProfiler[ml]` provides ML components without report generation; `DataProfiler[reports]` skips heavy ML libraries.

Do I need TensorFlow for basic profiling?

No. TensorFlow is only required for the default sensitive data detection. Use the `reports` extra to avoid it.

Can I enable the correlation matrix?

Yes. Enable it in the profiler options by setting the correlation option's `is_enabled` flag to `True`, as sketched below.
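A minimal sketch of enabling it through the profiler options; the file name is a placeholder:

```python
import dataprofiler as dp

options = dp.ProfilerOptions()
# Correlation is off by default; switch it on before profiling
options.structured_options.correlation.is_enabled = True

data = dp.Data("your_file.csv")
profile = dp.Profiler(data, options=options)
report = profile.report()  # the correlation matrix appears with the global statistics
```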

Project at a glance

Stable
Stars: 1,539
Watchers: 1,539
Forks: 181
License: Apache-2.0
Repo age: 5 years old
Last commit: 4 months ago
Primary language: Python

Last synced yesterday