Octopii

Detect leaked PII in images, PDFs, and web directories

Octopii scans files, S3 buckets, and open web directories for government IDs, addresses, emails, and other PII using OCR, regex, and NLP, helping teams uncover hidden data leaks.

Overview

Octopii is a command‑line scanner that extracts personally identifiable information (PII) from images and PDF documents. By combining optical character recognition (OCR), regular‑expression patterns, and natural‑language processing (NLP), it locates government IDs, email addresses, phone numbers, and physical addresses that may have been exposed unintentionally.
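
To make the regex stage concrete, here is a minimal sketch of pattern matching over text that OCR has already extracted. The two patterns and the `find_pii` helper are illustrative assumptions, not Octopii's actual rule set, and real-world patterns would likely need to be stricter.

```python
import re

# Illustrative patterns only; these are assumptions, not Octopii's real rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def find_pii(text: str) -> dict[str, list[str]]:
    """Return de-duplicated email and phone-like matches found in the text."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    phones = sorted(set(match.strip() for match in PHONE_RE.findall(text)))
    return {"emails": emails, "phone_numbers": phones}


# Hypothetical OCR output from a scanned document.
sample = "Contact: jane.doe@example.com or call +1 (555) 010-9999."
print(find_pii(sample))
```

Patterns this loose will also match order numbers and other digit runs, which is one reason a scanner of this kind pairs regex with NLP and still reports some false positives.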

Who It Serves & How It Works

Security engineers, compliance auditors, and privacy officers can run Octopii locally or on a server to audit file systems, Amazon S3 buckets, and Apache open directory listings. After installing Python dependencies, Tesseract OCR, and a spaCy language model, users invoke a single script with a path or URL, and the tool returns structured JSON describing any discovered PII.

Deployment

Octopii needs only a Python runtime, Tesseract OCR, and a spaCy language model. It suits ad‑hoc investigations, CI/CD integration, and scheduled scans at modest scale. Once installed, it scans local files offline; network access is needed only to fetch files from S3 buckets or web directories.

Highlights

Scans local files, S3 URLs, and Apache directory listings
Combines OCR, regex lists, and NLP for comprehensive detection
Command‑line interface with JSON output for easy automation
Supports image and PDF document formats

Pros

  • Detects PII hidden in images and PDFs
  • Works with multiple source types (filesystem, S3, web)
  • Open‑source Python project, easy to extend
  • Simple command‑line usage

Considerations

  • Requires Tesseract OCR and spaCy language model setup
  • Limited to English language models out of the box
  • No graphical user interface
  • Potential false positives/negatives inherent to OCR/NLP

Managed products teams compare it with

When teams consider Octopii, these hosted platforms usually appear on the same shortlist.

Amazon Macie

Managed sensitive data discovery and protection for Amazon S3.

BigID

Data intelligence platform for data privacy, security, and governance, built on sensitive data discovery and classification.

OneTrust

Unified trust platform for privacy, consent, data governance, and compliance automation.

Fit guide

Great for

  • Security teams automating PII discovery across file shares and storage buckets
  • Compliance auditors reviewing data exposure risk
  • DevOps pipelines that need periodic document scans
  • Privacy officers assessing leak vectors in public assets

Not ideal when

  • You need real‑time monitoring of streaming data
  • You run large‑scale enterprise deployments and cannot add custom scaling
  • Your documents are in languages other than English
  • Your users require a built‑in GUI

How teams use it

Audit internal file shares for exposed driver’s licenses

Identifies hidden ID images and generates a report for remediation

Check public S3 buckets for leaked customer emails

Detects email addresses in stored PDFs and alerts the security team

Validate open web directories for accidental PII exposure

Scans directory listings and flags any documents containing personal data
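
As a rough illustration of the directory-listing step, the sketch below pulls links to images and PDFs out of an Apache-style index page. The `ListingLinkParser` class, the `scannable_urls` helper, and the extension list are assumptions made for this example; they are not taken from Octopii's source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed set of file types worth downloading for a scan.
SCANNABLE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".pdf")


class ListingLinkParser(HTMLParser):
    """Collect every href found in anchor tags of a listing page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def scannable_urls(base_url: str, listing_html: str) -> list[str]:
    """Return absolute URLs for links that look like images or PDFs."""
    parser = ListingLinkParser()
    parser.feed(listing_html)
    return [
        urljoin(base_url, href)
        for href in parser.hrefs
        if href.lower().endswith(SCANNABLE_EXTENSIONS)
    ]


# Hypothetical listing page, trimmed to the parts that matter here.
listing = """
<html><body><h1>Index of /files</h1>
<a href="?C=N;O=D">Name</a>
<a href="/">Parent Directory</a>
<a href="passport_scan.JPG">passport_scan.JPG</a>
<a href="invoice.pdf">invoice.pdf</a>
<a href="notes.txt">notes.txt</a>
</body></html>
"""
print(scannable_urls("http://example.com/files/", listing))
```

Sorting links and the parent-directory link are dropped because they do not end in a scannable extension; subdirectories would need a recursive crawl that this sketch leaves out.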

Integrate PII scans into CI/CD pipelines

Automates detection of sensitive information before code deployment
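
One way a pipeline could act on the JSON output is a small gate that fails the build when any file is flagged. The sketch below assumes the report is a list of per-file objects; the field names (`file_path`, `emails`, `phone_numbers`, `identifiers`, `addresses`) are hypothetical and should be checked against the output your version of the tool actually produces.

```python
import json

# Hypothetical field names; verify them against a real report before use.
PII_FIELDS = ("emails", "phone_numbers", "identifiers", "addresses")


def files_with_pii(report_json: str) -> list[str]:
    """Return the paths of report entries that list at least one PII value."""
    findings = json.loads(report_json)
    flagged = []
    for entry in findings:
        if any(entry.get(field) for field in PII_FIELDS):
            flagged.append(entry.get("file_path", "<unknown>"))
    return flagged


def exit_code(report_json: str) -> int:
    """Map a report to a process exit code: 1 if anything was flagged."""
    return 1 if files_with_pii(report_json) else 0


# Made-up report used only to exercise the functions above.
sample_report = json.dumps([
    {"file_path": "docs/readme.pdf", "emails": [], "phone_numbers": []},
    {"file_path": "scans/id_card.png", "emails": ["jane.doe@example.com"]},
])
print(files_with_pii(sample_report))
print(exit_code(sample_report))
```

In a pipeline step, passing the report contents to `exit_code` and handing the result to `sys.exit` would stop the build whenever a file is flagged.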

Tech snapshot

Python 100%

Tags

cybersecurity, ocr, optical-character-recognition, machine-learning, pii-detection, nlp, python, pii, blackhat, cloud, image-processing

Frequently asked questions

What file types does Octopii support?

Images (e.g., JPEG, PNG) and PDF documents are supported; other formats are ignored.

How do I install the required dependencies?

Run `pip install -r requirements.txt`, install Tesseract OCR via your package manager, and download the spaCy English model with `python -m spacy download en_core_web_sm`.
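
After installation, a quick check like the sketch below can confirm that the three prerequisites are visible from the current environment. The `check_prerequisites` helper is hypothetical; it relies on the Tesseract binary being on `PATH` as `tesseract` and on the spaCy model being installed as an importable `en_core_web_sm` package, which is how `spacy download` normally installs it.

```python
import importlib.util
import shutil


def check_prerequisites() -> dict[str, bool]:
    """Report which prerequisites are visible from the current environment."""
    spacy_installed = importlib.util.find_spec("spacy") is not None
    return {
        # Tesseract is a system binary, so look for it on PATH.
        "tesseract_binary": shutil.which("tesseract") is not None,
        "spacy_package": spacy_installed,
        # Only probe for the model when spaCy itself is importable.
        "en_core_web_sm_model": spacy_installed
        and importlib.util.find_spec("en_core_web_sm") is not None,
    }


for name, found in check_prerequisites().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```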

Can Octopii scan remote S3 buckets?

Yes, provide an S3 URL as the scan location; the tool will download and analyze the objects.
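
For context, a publicly listable bucket answers a plain GET on its URL with a `ListBucketResult` XML document. The sketch below shows how the image and PDF keys could be picked out of such a response. It is an illustration under that assumption, not a description of how Octopii enumerates objects, and the `scannable_keys` helper and extension list are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by S3 bucket listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}
# Assumed set of file types worth downloading for a scan.
SCANNABLE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".pdf")


def scannable_keys(list_bucket_xml: str) -> list[str]:
    """Return object keys from a listing that look like images or PDFs."""
    root = ET.fromstring(list_bucket_xml)
    keys = [el.text or "" for el in root.findall("s3:Contents/s3:Key", S3_NS)]
    return [key for key in keys if key.lower().endswith(SCANNABLE_EXTENSIONS)]


# Hand-written listing for a made-up bucket.
listing_xml = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Contents><Key>exports/customers.pdf</Key></Contents>
  <Contents><Key>logs/app.log</Key></Contents>
  <Contents><Key>uploads/license.PNG</Key></Contents>
</ListBucketResult>"""
print(scannable_keys(listing_xml))
```

Listings are paginated, so a complete enumeration of a large bucket would need to follow the truncation markers in the response; the sketch handles a single page only.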

Does Octopii need an internet connection during scanning?

An internet connection is needed to install dependencies and to fetch files from S3 buckets or web directories; scans of local files run offline without external calls.

How accurate is the OCR‑based detection?

Accuracy depends on image quality and language support; OCR may miss poorly scanned text or produce false positives.

Project at a glance

Stable
Stars: 717
Watchers: 717
Forks: 61
Repo age: 3 years old
Last commit: 12 months ago
Primary language: Python