

Instantly profile data and uncover hidden sensitive information
Data Profiler loads CSV, JSON, Avro, Parquet, and text files, then automatically generates statistical summaries, schema insights, and AI‑driven PII/NPI detection, all with a few lines of Python.

Data Profiler is a Python library that simplifies data analysis, monitoring, and sensitive data detection. It targets data scientists, engineers, and compliance teams who need fast, reproducible insights from structured, unstructured, or graph datasets.
With a single import, the library auto‑detects file formats (CSV, Avro, Parquet, JSON, plain text, URLs) and loads them into a Pandas‑compatible DataFrame. The profiling engine produces a comprehensive data profile, including global and column‑level statistics, schema inference, and AI‑powered entity recognition for PII/NPI. A pre‑trained deep‑learning model handles sensitive data detection out of the box, and developers can extend it with custom entities or replace the pipeline entirely. Installation options range from a full package with all ML dependencies to slimmer builds that omit heavy libraries like TensorFlow, allowing flexible deployment in notebooks, CI pipelines, or production ETL jobs.
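The load-then-profile flow described above can be sketched roughly as follows. This is a minimal illustration, not an official recipe: it assumes the package is installed (for example via `pip install DataProfiler[full]`), and the helper names `write_sample_csv` and `profile_file`, along with the sample rows, are made up for the example.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_sample_csv(directory: str) -> Path:
    """Write a tiny, made-up CSV so the example has something to profile."""
    path = Path(directory) / "sample.csv"
    rows = [
        ("name", "email", "age"),
        ("Ada", "ada@example.com", "36"),
        ("Grace", "grace@example.com", "45"),
    ]
    with path.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return path


def profile_file(path: Path) -> dict:
    """Load a file with format auto-detection and return a compact report."""
    # Imported lazily so the rest of the module works without the library.
    import dataprofiler as dp

    data = dp.Data(str(path))    # file format is detected automatically
    profile = dp.Profiler(data)  # computes statistics and entity labels
    return profile.report(report_options={"output_format": "compact"})


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = write_sample_csv(tmp)
        try:
            report = profile_file(sample)
        except ImportError:
            print("DataProfiler is not installed; skipping the profiling step.")
        else:
            # default=str guards against values json cannot serialize natively.
            print(json.dumps(report, indent=2, default=str))
```

On a real dataset, the same `profile_file` helper would be pointed at an existing file instead of the generated sample.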
When teams consider DataProfiler, these hosted platforms usually appear on the same shortlist.

Amazon Macie: Managed sensitive data discovery and protection for Amazon S3.

Data intelligence platform focused on data privacy, security, and governance through sensitive data discovery and classification

Unified trust platform for privacy, consent, data governance, and compliance automation.
Rapid data audit of a CSV file: produces a compact JSON report with schema, statistics, and identified PII entities.
Automated data quality monitoring in an ETL workflow: generates daily profiles and flags anomalies or new sensitive data for downstream alerts.
Custom entity detection for proprietary identifiers: extends the pre‑trained model with user‑defined entities, enabling tailored compliance checks.
Documentation generation for data assets: creates reproducible data profiles that can be embedded in data catalogs or stakeholder reports.
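As a rough illustration of the ETL monitoring use case, the sketch below compares two report dictionaries and flags columns that have newly acquired a watched label. It is plain Python and makes several assumptions: that each report carries a `data_stats` list whose entries have `column_name` and `data_label` keys, that multiple labels may be joined with `|`, and that the watch list shown is only an example. The function names are hypothetical and not part of the library.

```python
from typing import Dict, List, Set


def labels_by_column(report: dict) -> Dict[str, Set[str]]:
    """Map each column name to the set of labels found in the report."""
    result: Dict[str, Set[str]] = {}
    for column in report.get("data_stats", []):
        raw = column.get("data_label") or ""
        # Assumption: several labels may be joined with "|".
        result[column["column_name"]] = {part for part in raw.split("|") if part}
    return result


def new_sensitive_columns(
    previous: dict, current: dict, watched: Set[str]
) -> Dict[str, List[str]]:
    """Return columns whose watched labels were absent in the previous report."""
    before = labels_by_column(previous)
    after = labels_by_column(current)
    flagged: Dict[str, List[str]] = {}
    for name, labels in after.items():
        added = (labels & watched) - before.get(name, set())
        if added:
            flagged[name] = sorted(added)
    return flagged


# Made-up reports standing in for two consecutive daily profiles.
yesterday = {
    "data_stats": [
        {"column_name": "contact", "data_label": "UNKNOWN"},
        {"column_name": "age", "data_label": "INTEGER"},
    ]
}
today = {
    "data_stats": [
        {"column_name": "contact", "data_label": "EMAIL_ADDRESS"},
        {"column_name": "age", "data_label": "INTEGER"},
        {"column_name": "ssn", "data_label": "SSN"},
    ]
}

alerts = new_sensitive_columns(yesterday, today, watched={"EMAIL_ADDRESS", "SSN"})
print(alerts)
```

A downstream alerting step could then act on any non-empty result; how the daily reports are stored between runs is left open here.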
CSV, Avro, Parquet, JSON, plain text, and URLs are auto‑detected and loaded.
You can train or import a new model and register it with the profiler, or replace the labeling pipeline entirely.
`DataProfiler[full]` includes all ML dependencies and reporting; `DataProfiler[ml]` provides ML components without report generation; `DataProfiler[reports]` skips heavy ML libraries.
TensorFlow is not mandatory. It is only required for the default sensitive data detection. Use the `reports` extra to avoid it.
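One way to act on this at runtime is to check whether TensorFlow is importable and switch the default labeler off when it is not. The sketch below is an assumption-laden illustration: `labeler_available`, `build_option_overrides`, and `make_profiler` are hypothetical helpers, and the `data_labeler.is_enabled` key is assumed to be the relevant profiler option.

```python
import importlib.util


def labeler_available() -> bool:
    """Report whether TensorFlow can be imported in this environment."""
    return importlib.util.find_spec("tensorflow") is not None


def build_option_overrides(enable_labeler: bool) -> dict:
    """Build the option overrides to pass to the profiler options object."""
    # Assumption: this key toggles the default sensitive data detection.
    return {"data_labeler.is_enabled": enable_labeler}


def make_profiler(data):
    """Create a profiler, disabling the labeler when TensorFlow is missing."""
    # Imported lazily so the helpers above work without the library.
    import dataprofiler as dp

    options = dp.ProfilerOptions()
    options.set(build_option_overrides(labeler_available()))
    return dp.Profiler(data, options=options)
```

With this arrangement, a slim install still produces statistics and schema information, while sensitive data labels appear only where the ML dependencies are present.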
Yes. Set the `is_enabled` option to true in the profiler configuration to compute it.