
Amazon Macie
Managed sensitive data discovery and protection for Amazon S3.

Detect leaked PII in images, PDFs, and web directories
Octopii scans files, S3 buckets, and open web directories for government IDs, addresses, emails, and other PII using OCR, regex, and NLP, helping teams uncover hidden data leaks.

Octopii is a command‑line scanner that extracts personally identifiable information from images, PDFs, and other documents. By combining optical character recognition, regular‑expression patterns, and natural‑language processing, it can locate government IDs, email addresses, phone numbers, and physical addresses that may be unintentionally exposed.
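The regular-expression stage can be illustrated with a minimal sketch. The patterns and the function name `find_contact_details` are illustrative assumptions, not Octopii's own rules. They run on plain text that an OCR step is assumed to have already extracted.

```python
import re

# Illustrative patterns only; production PII rules are usually stricter
# and vary by country. These are NOT taken from Octopii's source.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def find_contact_details(text: str) -> dict:
    """Return email addresses and phone-like numbers found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phone_numbers": [m.strip() for m in PHONE_RE.findall(text)],
    }


sample = "Contact: jane.doe@example.com, tel. +1 (555) 010-2030"
print(find_contact_details(sample))
```

Simple patterns like these tend to over-match, for example on long numeric IDs, which is one reason a scanner would pair them with OCR and NLP rather than rely on regex alone.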
Security engineers, compliance auditors, and privacy officers can run Octopii locally or on a server to audit file systems, Amazon S3 buckets, and Apache open directory listings. After installing Python dependencies, Tesseract OCR, and a spaCy language model, users invoke a single script with a path or URL, and the tool returns structured JSON describing any discovered PII.
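Once the JSON has been captured, it can be post-processed with a few lines of Python. The sketch below assumes the output is a list of per-file records with `file_path`, `emails`, `phone_numbers`, and `addresses` fields. These field names and the helper `summarize_findings` are assumptions for illustration and should be checked against the tool's actual output.

```python
import json
from collections import Counter

# Assumed record keys; adjust them to match the real output.
PII_FIELDS = ("emails", "phone_numbers", "addresses")


def summarize_findings(raw_json: str) -> Counter:
    """Count how many PII items of each kind appear across all records."""
    records = json.loads(raw_json)
    totals = Counter()
    for record in records:
        for field in PII_FIELDS:
            # A missing or null field counts as zero findings.
            totals[field] += len(record.get(field) or [])
    return totals


example = json.dumps([
    {"file_path": "scan/a.pdf", "emails": ["a@example.com"], "phone_numbers": []},
    {"file_path": "scan/b.png", "emails": [], "addresses": ["1 Main St"]},
])
print(dict(summarize_findings(example)))
```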
Octopii needs only a Python runtime plus Tesseract OCR and a spaCy language model. It suits ad‑hoc investigations, CI/CD integration, and scheduled scans in modest environments. Because it runs offline after installation, no external services are needed apart from optional access to cloud storage.
When teams consider Octopii, these hosted platforms usually appear on the same shortlist.

Managed sensitive data discovery and protection for Amazon S3.

Data intelligence platform focused on data privacy, security, and governance through sensitive data discovery and classification

Unified trust platform for privacy, consent, data governance, and compliance automation.
Audit internal file shares for exposed driver’s licenses: identifies hidden ID images and generates a report for remediation.
Check public S3 buckets for leaked customer emails: detects email addresses in stored PDFs and alerts the security team.
Validate open web directories for accidental PII exposure: scans directory listings and flags any documents containing personal data.
Integrate PII scans into CI/CD pipelines: automates detection of sensitive information before code deployment.
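For the CI/CD case, one possible gate is a small script that fails the pipeline step when a scan report contains findings. This is a sketch, not part of Octopii. The report layout (a JSON list of records with `identifiers`, `emails`, `phone_numbers`, and `addresses` lists) and the function name `ci_exit_code` are assumptions.

```python
import json
import sys
from pathlib import Path

# Assumed list-valued fields in each record of the report.
FINDING_FIELDS = ("identifiers", "emails", "phone_numbers", "addresses")


def ci_exit_code(report_path: str) -> int:
    """Return 1 if any record in the report lists PII, otherwise 0."""
    records = json.loads(Path(report_path).read_text(encoding="utf-8"))
    flagged = [
        record.get("file_path", "<unknown>")
        for record in records
        if any(record.get(field) for field in FINDING_FIELDS)
    ]
    # Log each offending file so the pipeline output shows what to fix.
    for path in flagged:
        print(f"PII found in {path}", file=sys.stderr)
    return 1 if flagged else 0
```

A pipeline step could end with `sys.exit(ci_exit_code("report.json"))`, where `report.json` is a hypothetical file holding the scan output, so that a non-zero exit status blocks the deployment.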
Images (e.g., JPEG, PNG) and PDF documents are supported; other formats are ignored.
Run `pip install -r requirements.txt`, install Tesseract OCR via your package manager, and download the spaCy English model with `python -m spacy download en_core_web_sm`.
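After installation, a quick check that the prerequisites are discoverable can save a failed first scan. The helper below is a hypothetical sketch using only the standard library. It assumes the Tesseract binary is named `tesseract` on the `PATH` and that the spaCy model is installed as an importable package, which is how `spacy download` normally installs it.

```python
import importlib.util
import shutil


def check_prerequisites(model: str = "en_core_web_sm") -> dict:
    """Report whether Tesseract, spaCy, and the language model can be found."""
    return {
        "tesseract": shutil.which("tesseract") is not None,
        "spacy": importlib.util.find_spec("spacy") is not None,
        model: importlib.util.find_spec(model) is not None,
    }


for name, present in check_prerequisites().items():
    print(f"{name}: {'ok' if present else 'missing'}")
```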
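Amazon S3 buckets can be scanned: provide an S3 URL as the scan location, and the tool will download and analyze the objects.
Internet access is needed only for initial dependency installation and for fetching remote targets such as S3 buckets or open web directories; the analysis itself runs locally without external calls.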
Accuracy depends on image quality and language support; OCR may miss poorly scanned text or produce false positives.