skyshield
v0.1.7
Data Loss Prevention scanner for files, databases, and network traffic
██████╗ ██╗ ██████╗ ███████╗ ██████╗ █████╗ ███╗ ██╗
██╔══██╗██║ ██╔══██╗ ██╔════╝██╔════╝██╔══██╗████╗ ██║
██║ ██║██║ ██████╔╝█████╗███████╗██║ ███████║██╔██╗ ██║
██║ ██║██║ ██╔═══╝ ╚════╝╚════██║██║ ██╔══██║██║╚██╗██║
██████╔╝███████╗██║ ███████║╚██████╗██║ ██║██║ ╚████║
╚═════╝ ╚══════╝╚═╝ ╚══════╝ ╚═════╝╚═╝ ╚═╝╚═╝ ╚═══╝

Data Loss Prevention scanner for files, databases, and network traffic.
This is a quick overview. Security theory, architecture, and full walkthroughs are in the learn modules.
What It Does
- Scans files (PDF, DOCX, XLSX, CSV, JSON, XML, YAML, Parquet, Avro, archives, emails) for PII, credentials, financial data, and PHI
- Scans databases (PostgreSQL, MySQL, MongoDB, SQLite) with schema introspection and sampling
- Scans network captures (PCAP/PCAPNG) with protocol parsing, TCP reassembly, and DNS exfiltration detection
- Scores each match through a confidence pipeline: regex match, checksum validation (Luhn, Mod-97, Mod-11), context keyword proximity, and entity co-occurrence
- Maps findings to compliance frameworks (HIPAA, PCI-DSS, GDPR, CCPA, SOX, GLBA, FERPA)
- Reports in console (Rich tables), JSON, SARIF 2.1.0, or CSV
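
As an illustration of the confidence pipeline listed above, here is a minimal sketch that combines a regex match, Luhn checksum validation, and context keyword proximity for card numbers. The function names, keyword list, window size, and score weights are all assumptions made for this example; they are not taken from the project's code.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

# Hypothetical context keywords that raise confidence when found near a match.
CONTEXT_KEYWORDS = ("card", "visa", "credit", "payment")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def score_card_candidates(text: str, window: int = 40) -> list[tuple[str, float]]:
    """Score each card-like match; the weights are illustrative only."""
    results = []
    for match in CARD_RE.finditer(text):
        score = 0.30  # assumed base score for a bare regex match
        if luhn_valid(match.group()):
            score += 0.40  # assumed boost for a valid checksum
        start = max(0, match.start() - window)
        context = text[start:match.end() + window].lower()
        if any(keyword in context for keyword in CONTEXT_KEYWORDS):
            score += 0.20  # assumed boost for a nearby context keyword
        results.append((match.group(), round(score, 2)))
    return results


if __name__ == "__main__":
    for value, score in score_card_candidates("Visa card: 4111 1111 1111 1111"):
        print(value, score)
```

Layering the checks this way lets a cheap regex produce candidates while the checksum and context steps filter out digit runs such as order numbers, which would then fall below a reporting threshold like `min_confidence`.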
Quick Start
bash install.sh
skyshield scan ./data

[!TIP] This project uses `just` as a command runner. Type `just` to see all available commands.

Install:
curl -sSf https://just.systems/install.sh | bash -s -- --to ~/.local/bin
Usage
skyshield scan ./data # scan files & directories
skyshield scan ./report.pdf -f json # scan with JSON output
skyshield scan postgres://user:pass@host/db --db # scan PostgreSQL
skyshield scan sqlite:///path/to/local.db --db # scan SQLite
skyshield scan capture.pcap --network # scan network traffic
skyshield scan ./data -f sarif -o results.sarif # SARIF output for CI/CD
skyshield report convert results.json -f csv # convert report format
skyshield report summary results.json # print summary stats
skyshield --web <fingerprint> # view scan in web UI
skyshield hooks # list available hooks

Global Options
--config, -c Path to YAML config file
--verbose, -v Enable debug logging
--version Show version

Output Formats
| Format | Flag | Use Case |
|--------|------|----------|
| Console | -f console | Interactive review with Rich tables |
| JSON | -f json | Structured analysis and archival |
| SARIF | -f sarif | GitHub code scanning, CI/CD integration |
| CSV | -f csv | Compliance team export, spreadsheet import |
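
To show what the SARIF row implies for CI/CD consumers, the sketch below builds a minimal SARIF 2.1.0 log from a list of findings. The `Finding` dataclass, the `to_sarif` function, the tool name, and the confidence-to-level mapping are hypothetical; the scanner's actual output may carry more fields.

```python
import json
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical finding record, not the scanner's real data model."""
    rule_id: str
    message: str
    path: str
    line: int
    confidence: float


def to_sarif(findings: list[Finding], tool_name: str = "example-dlp-scanner") -> dict:
    """Build a minimal SARIF 2.1.0 log with one run and one result per finding."""
    results = []
    for finding in findings:
        results.append({
            "ruleId": finding.rule_id,
            # Assumed mapping: high-confidence findings become errors.
            "level": "error" if finding.confidence >= 0.8 else "warning",
            "message": {"text": finding.message},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": finding.path},
                    "region": {"startLine": finding.line},
                }
            }],
        })
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": tool_name}}, "results": results}],
    }


if __name__ == "__main__":
    sample = [Finding("EXAMPLE_CARD", "Possible card number", "data/export.csv", 12, 0.9)]
    print(json.dumps(to_sarif(sample), indent=2))
```

A code scanning service that ingests SARIF reads `ruleId`, `level`, and the physical location to annotate the affected file and line.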
Stack
Language: Python 3.12+
CLI: Typer 0.15+ with Rich integration
Detection: Regex + checksum validators + Shannon entropy + context keyword scoring
File Formats: PyMuPDF, python-docx, openpyxl, xlrd, defusedxml, lxml, pyarrow, fastavro, extract-msg
Databases: asyncpg (PostgreSQL), aiomysql (MySQL), pymongo async (MongoDB), aiosqlite (SQLite)
Network: dpkt (PCAP parsing), TCP reassembly, DPI protocol identification, DNS exfiltration heuristics
Config: Pydantic 2.10+ models with YAML config loading (ruamel.yaml)
Quality: ruff, mypy (strict), yapf, pytest + hypothesis, structlog
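
The Shannon entropy check named in the Detection line can be sketched as follows. This is a generic illustration of the technique, not the project's implementation; the `looks_random` helper and its threshold and minimum length are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Return the Shannon entropy of `value` in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    length = len(value)
    # Sum p * log2(1/p) over the distinct characters.
    return sum((n / length) * math.log2(length / n) for n in counts.values())


def looks_random(token: str, threshold: float = 3.5, min_length: int = 20) -> bool:
    """Hypothetical helper: flag long, high-entropy tokens as possible secrets."""
    return len(token) >= min_length and shannon_entropy(token) >= threshold


if __name__ == "__main__":
    print(shannon_entropy("abcd"))  # four equally likely characters
    print(looks_random("q9Zx2LmP0aVt7RkB3sYw5NcD"))
```

The same measure can be applied to DNS query labels, where unusually long, high-entropy subdomains are one common signal of encoded data leaving the network.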
Configuration
Copy .dlp-scanner.yml to your project root and customize. Key settings:
detection:
  min_confidence: 0.20 # minimum score to report
  enable_rules: ["*"] # glob patterns for rule IDs
  allowlists:
    values: ["123-45-6789"] # suppress known test values
output:
  format: "console" # console, json, sarif, csv
  redaction_style: "partial" # partial, full, none

Learn
This project includes step-by-step learning materials covering security theory, architecture, and implementation.
| Module | Topic |
|--------|-------|
| 00 - Overview | Prerequisites and quick start |
| 01 - Concepts | DLP theory and real-world breaches |
| 02 - Architecture | System design and data flow |
| 03 - Implementation | Code walkthrough |
| 04 - Challenges | Extension ideas and exercises |
