PIICatcher

Detect and tag PII across databases and data warehouses

PIICatcher scans databases and data warehouses for PII/PHI using regex and NLP, supports incremental scans and plugins, and integrates with Datahub and Amundsen for automated tagging.

Overview

PIICatcher is a command‑line and Docker‑based scanner that discovers personally identifiable information (PII) and protected health information (PHI) in relational databases and modern data warehouses. It combines regular‑expression matching on column names with natural‑language processing of sample data (via a spaCy plugin) to classify columns by PII type, such as email, address, or gender.
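To make the column-name half of that approach concrete, here is a minimal sketch of regex matching on column names. The pattern table and the `detect_column_name` function are hypothetical and written only for illustration; PIICatcher's own patterns and function names may differ.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; not PIICatcher's actual regexes.
COLUMN_NAME_PATTERNS = {
    "EMAIL": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "ADDRESS": re.compile(r"address|street|zip|postal", re.IGNORECASE),
    "GENDER": re.compile(r"gender", re.IGNORECASE),
}


def detect_column_name(column_name: str) -> Optional[str]:
    """Return the first PII label whose pattern matches the column name."""
    for label, pattern in COLUMN_NAME_PATTERNS.items():
        if pattern.search(column_name):
            return label
    return None  # no pattern matched, so the name alone gives no signal


for name in ("user_email", "home_address", "gender", "order_total"):
    print(name, "->", detect_column_name(name))
```

Name matching is cheap because it never reads table data, which is why it pairs well with the slower sample-data analysis for columns whose names are uninformative.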

Capabilities & Deployment

The tool supports incremental scans, allowing you to schedule recurring jobs that only evaluate new or unscanned columns, and offers flexible include/exclude filters for schemas and tables to control compute usage. Results can be exported directly to data catalogs like Datahub and Amundsen, where columns and tables are automatically tagged with the detected PII types. Extensibility is built‑in: developers can create custom metadata or datum detectors by subclassing provided base classes and registering them via Python entry points.
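The include/exclude filtering described above can be sketched as follows. This is an illustration under stated assumptions: the `filter_names` helper is hypothetical, and the use of glob patterns is an assumption, since PIICatcher's actual filter syntax may differ.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence


def filter_names(
    names: Iterable[str],
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
) -> List[str]:
    """Select schema or table names to scan.

    A name is kept if it matches any include pattern (or if no include
    patterns are given) and matches no exclude pattern.
    """
    selected = []
    for name in names:
        if include and not any(fnmatchcase(name, pat) for pat in include):
            continue  # not explicitly included
        if any(fnmatchcase(name, pat) for pat in exclude):
            continue  # explicitly excluded
        selected.append(name)
    return selected


tables = ["customers", "orders", "tmp_orders"]
print(filter_names(tables, include=["customers", "*orders"], exclude=["tmp_*"]))
```

Applying exclusions after inclusions means an exclude pattern always wins, which is the safer default when the goal is to keep expensive tables out of a scan.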

Getting Started

Deploy PIICatcher with a single Docker alias or install it from PyPI. A lightweight SQLite catalog stores scan state by default, while production environments can point to a persistent catalog backend. The project ships with plugins for major databases, including SQLite, MySQL, PostgreSQL, Redshift, Athena, Snowflake, and BigQuery, which makes it suitable for heterogeneous data landscapes.

Highlights

Regex‑based column name detection and NLP data sample analysis
Incremental scanning with schema/table include‑exclude filters
Plugin architecture for custom detectors (e.g., spaCy integration)
Direct ingestion into Datahub and Amundsen for automated PII tagging

Pros

  • Supports a wide range of databases and data warehouses
  • Extensible via plugins and custom detector classes
  • Incremental scans reduce compute cost for recurring jobs
  • Seamless integration with popular data catalog platforms

Considerations

  • Requires a Python environment and familiarity with the CLI or Docker
  • Plugin ecosystem is still growing; some detectors may be missing
  • No graphical user interface; operates via command line
  • Performance depends on data size and complexity of NLP models

Managed products teams compare it with

When teams consider PIICatcher, these hosted platforms usually appear on the same shortlist.

Amazon Macie

Managed sensitive data discovery and protection for Amazon S3.

BigID

Data intelligence platform focused on data privacy, security, and governance through sensitive data discovery and classification.

OneTrust

Unified trust platform for privacy, consent, data governance, and compliance automation.

Fit guide

Great for

  • Data engineering teams needing automated PII discovery across multiple sources
  • Organizations that already use Datahub or Amundsen for metadata management
  • Environments where Docker or Python deployment is standard
  • Teams comfortable extending functionality with custom Python plugins

Not ideal for

  • Real‑time streaming pipelines that require instant PII detection
  • Users seeking a full‑featured GUI for data classification
  • Small projects with only a few tables where the tool adds overhead
  • Organizations that need built‑in data masking or anonymization

How teams use it

Compliance audit of a data warehouse

Identify and catalog all PII locations to satisfy GDPR/CCPA requirements

Enriching a data catalog with privacy metadata

Automatically tag columns in Datahub or Amundsen, making PII searchable for downstream users

Scheduled incremental monitoring

Run nightly scans that cover only new or previously unscanned columns, keeping PII coverage current with little extra compute

Custom domain‑specific PII detection

Develop and register a spaCy‑based plugin to detect industry‑specific identifiers

Tech snapshot

Python 97%
Dockerfile 1%
Shell 1%

Tags

aws-glue, catalog, snowflake, python, pii, aws-redshift, aws-athena, data-catalog, database, phi, data

Frequently asked questions

Which databases does PIICatcher support?

SQLite, MySQL, PostgreSQL, AWS Redshift, AWS Athena, Snowflake, and Google BigQuery.

How does incremental scanning work?

PIICatcher records scanned columns in a catalog and only processes columns that are new or have not been scanned before, respecting include/exclude filters.
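A minimal sketch of that bookkeeping, assuming a single SQLite table of scanned columns. The table layout and the helper names (`open_state`, `pending_columns`, `mark_scanned`) are hypothetical; PIICatcher's real catalog schema is not shown here.

```python
import sqlite3
from typing import Iterable, List, Tuple

Column = Tuple[str, str, str]  # (schema, table, column)


def open_state(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small SQLite store of already-scanned columns."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scanned ("
        "schema_name TEXT, table_name TEXT, column_name TEXT, "
        "PRIMARY KEY (schema_name, table_name, column_name))"
    )
    return conn


def pending_columns(conn: sqlite3.Connection, columns: Iterable[Column]) -> List[Column]:
    """Return only the columns that have no record of a previous scan."""
    done = set(conn.execute("SELECT schema_name, table_name, column_name FROM scanned"))
    return [col for col in columns if col not in done]


def mark_scanned(conn: sqlite3.Connection, columns: Iterable[Column]) -> None:
    """Record columns as scanned; repeats are ignored via the primary key."""
    conn.executemany("INSERT OR IGNORE INTO scanned VALUES (?, ?, ?)", list(columns))
    conn.commit()


conn = open_state()
first_run = [("public", "users", "email"), ("public", "users", "id")]
mark_scanned(conn, pending_columns(conn, first_run))

# A later run sees one new column and skips the two already recorded.
second_run = first_run + [("public", "orders", "address")]
print(pending_columns(conn, second_run))
```

In a real job the include/exclude filters would be applied to the column list before `pending_columns`, so filtered-out columns are never recorded as scanned.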

Can PIICatcher integrate with my existing data catalog?

Yes, ingestion functions are provided for Datahub and Amundsen to automatically apply PII tags to tables and columns.

Do I need to run a separate server for PIICatcher?

No, it runs as a CLI tool or Docker container. State is stored in a lightweight SQLite catalog by default.

How can I add custom PII detectors?

Create a class inheriting from MetadataDetector or DatumDetector, implement a detect method returning a PIIType, and register it via Python entry points or the API.
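The shape of such a detector can be sketched as below. Everything here is a self-contained stand-in: the `PIIType` and `DatumDetector` classes, the `register_detector` decorator, and the `EMP-` identifier format are hypothetical and defined locally for illustration. They are not imported from PIICatcher, whose real base classes and registration mechanism (entry points or its API) may look different.

```python
import re
from abc import ABC, abstractmethod
from typing import Dict, Optional, Type


class PIIType:
    """Stand-in for a PII type label (hypothetical, not PIICatcher's class)."""

    def __init__(self, name: str) -> None:
        self.name = name


class DatumDetector(ABC):
    """Stand-in base class: inspects one sampled value at a time."""

    @abstractmethod
    def detect(self, datum: str) -> Optional[PIIType]:
        ...


# In-process registry used here in place of Python entry points.
DETECTORS: Dict[str, Type[DatumDetector]] = {}


def register_detector(cls: Type[DatumDetector]) -> Type[DatumDetector]:
    DETECTORS[cls.__name__] = cls
    return cls


@register_detector
class EmployeeIdDetector(DatumDetector):
    """Flags values shaped like a made-up employee ID, e.g. EMP-123456."""

    pattern = re.compile(r"EMP-\d{6}")

    def detect(self, datum: str) -> Optional[PIIType]:
        if self.pattern.fullmatch(datum.strip()):
            return PIIType("EMPLOYEE_ID")
        return None


detector = DETECTORS["EmployeeIdDetector"]()
for value in ("EMP-123456", "hello"):
    result = detector.detect(value)
    print(value, "->", result.name if result else None)
```

A metadata detector would follow the same pattern but receive a column name or definition instead of a sampled value.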

Project at a glance

Dormant
Stars: 336
Watchers: 336
Forks: 99
License: Apache-2.0
Repo age: 6 years old
Last commit: 2 years ago
Primary language: Python
