
Amazon Macie
Managed sensitive data discovery and protection for Amazon S3.

Detect and tag PII across databases and data warehouses
PIICatcher scans databases and data warehouses for PII/PHI using regex and NLP, supports incremental scans and plugins, and integrates with Datahub and Amundsen for automated tagging.

PIICatcher is a command‑line and Docker‑based scanner that discovers personally identifiable information (PII) and protected health information (PHI) in relational databases and modern data warehouses. It matches column names against regular expressions and examines sample data with natural‑language processing (via a spaCy plugin), then classifies each column by PII type, such as email, address, or gender.
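The two detection passes described above can be sketched roughly as follows. This is a minimal illustration, not PIICatcher's actual rule set: the patterns, the function names, and the 0.8 threshold are all assumptions, and a plain regular expression stands in for the spaCy-based sample analysis so the example has no dependencies.

```python
import re
from typing import List, Optional

# Hypothetical column-name patterns, for illustration only.
COLUMN_NAME_PATTERNS = {
    "EMAIL": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "ADDRESS": re.compile(r"address|street|city|zip|postal", re.IGNORECASE),
    "GENDER": re.compile(r"gender|sex", re.IGNORECASE),
}

# Stand-in for the NLP pass: a simple shape check for email-like values.
EMAIL_VALUE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def classify_column_name(name: str) -> Optional[str]:
    """Return the first PII type whose pattern matches the column name."""
    for pii_type, pattern in COLUMN_NAME_PATTERNS.items():
        if pattern.search(name):
            return pii_type
    return None


def classify_samples(samples: List[str], threshold: float = 0.8) -> Optional[str]:
    """Return "EMAIL" when enough non-empty sample values look like emails."""
    values = [s for s in samples if s]
    if not values:
        return None
    hits = sum(1 for v in values if EMAIL_VALUE.match(v))
    return "EMAIL" if hits / len(values) >= threshold else None
```

Checking names first is cheap because it needs no data access; sampling values catches columns whose names give nothing away, at the cost of reading rows.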
The tool supports incremental scans, allowing you to schedule recurring jobs that only evaluate new or unscanned columns, and offers flexible include/exclude filters for schemas and tables to control compute usage. Results can be exported directly to data catalogs like Datahub and Amundsen, where columns and tables are automatically tagged with the detected PII types. Extensibility is built‑in: developers can create custom metadata or datum detectors by subclassing provided base classes and registering them via Python entry points.
Deploy PIICatcher with a single Docker alias or install it from PyPI. A lightweight SQLite catalog stores scan state by default, while production environments can point to a persistent catalog backend. The project ships with plugins for major databases—including SQLite, MySQL, PostgreSQL, Redshift, Athena, Snowflake, and BigQuery—making it suitable for heterogeneous data landscapes.
When teams consider PIICatcher, these hosted platforms usually appear on the same shortlist.

Managed sensitive data discovery and protection for Amazon S3.

Data intelligence platform focused on data privacy, security, and governance through sensitive data discovery and classification.

Unified trust platform for privacy, consent, data governance, and compliance automation.
Compliance audit of a data warehouse: identify and catalog all PII locations to satisfy GDPR/CCPA requirements.
Enriching a data catalog with privacy metadata: automatically tag columns in Datahub or Amundsen, making PII searchable for downstream users.
Scheduled incremental monitoring: run nightly scans on newly added tables, maintaining continuous compliance with minimal compute.
Custom domain‑specific PII detection: develop and register a spaCy‑based plugin to detect industry‑specific identifiers.
Supported databases are SQLite, MySQL, PostgreSQL, Amazon Redshift, Amazon Athena, Snowflake, and Google BigQuery.
PIICatcher records scanned columns in a catalog and only processes columns that are new or have not been scanned before, respecting include/exclude filters.
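The incremental behaviour described above might look something like the sketch below. It assumes a one-table SQLite catalog and regex-based table filters; the schema and the helper names (`open_catalog`, `pending_columns`, `mark_scanned`) are hypothetical and do not reflect PIICatcher's real catalog layout.

```python
import re
import sqlite3
from typing import Iterable, List, Optional, Tuple

Column = Tuple[str, str, str]  # (schema, table, column)


def open_catalog(path: str = ":memory:") -> sqlite3.Connection:
    """Open a catalog that remembers which columns were already scanned."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scanned ("
        "schema_name TEXT, table_name TEXT, column_name TEXT, "
        "PRIMARY KEY (schema_name, table_name, column_name))"
    )
    return conn


def pending_columns(
    conn: sqlite3.Connection,
    columns: Iterable[Column],
    include_table: Optional[str] = None,
    exclude_table: Optional[str] = None,
) -> List[Column]:
    """Return columns that pass the table filters and are not yet in the catalog."""
    inc = re.compile(include_table) if include_table else None
    exc = re.compile(exclude_table) if exclude_table else None
    todo = []
    for schema, table, column in columns:
        if inc and not inc.search(table):
            continue  # not in the include list
        if exc and exc.search(table):
            continue  # explicitly excluded
        seen = conn.execute(
            "SELECT 1 FROM scanned "
            "WHERE schema_name = ? AND table_name = ? AND column_name = ?",
            (schema, table, column),
        ).fetchone()
        if seen is None:
            todo.append((schema, table, column))
    return todo


def mark_scanned(conn: sqlite3.Connection, columns: Iterable[Column]) -> None:
    """Record columns as scanned so later runs skip them."""
    conn.executemany("INSERT OR IGNORE INTO scanned VALUES (?, ?, ?)", list(columns))
    conn.commit()
```

A nightly job would call `pending_columns`, scan only what it returns, then call `mark_scanned`, so the cost of each run tracks the number of new columns instead of the size of the warehouse.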
Ingestion functions are provided for Datahub and Amundsen that automatically apply PII tags to tables and columns.
PIICatcher does not need a dedicated server: it runs as a CLI tool or Docker container. State is stored in a lightweight SQLite catalog by default.
Create a class inheriting from MetadataDetector or DatumDetector, implement a detect method returning a PIIType, and register it via Python entry points or the API.
Project at a glance
Dormant. Last synced 4 days ago.