# Semgrep Integration

Scanipy can automatically clone the top 10 repositories matching a query and scan them with Semgrep for security vulnerabilities.
## Command Usage

Run from source with `python scanipy.py`.
### Prerequisites

- Semgrep API service running
- S3 bucket for storing analysis results
- Kubernetes cluster (for running analysis jobs)
### Basic Usage

```bash
# Run Semgrep analysis (requires --api-url and --s3-bucket)
python scanipy.py --query "extractall" --run-semgrep \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results
```
### Custom Rules

```bash
# Use custom Semgrep rules
python scanipy.py --query "extractall" --run-semgrep \
  --rules ./my_rules.yaml \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results

# Use the built-in tarslip rules
python scanipy.py --query "extractall" --run-semgrep \
  --rules ./tools/semgrep/rules/tarslip.yaml \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results
```
### Semgrep Pro

If you have a Semgrep Pro license:

```bash
python scanipy.py --query "extractall" --run-semgrep --pro \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results
```
### Additional Semgrep Arguments

Pass additional arguments directly to Semgrep:

```bash
python scanipy.py --query "extractall" --run-semgrep \
  --semgrep-args "--severity ERROR --json" \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results
```
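The quoted `--semgrep-args` string has to be split into individual CLI tokens before being handed to Semgrep. A minimal sketch of how such pass-through arguments are typically tokenized — this is an illustration of the pattern, not Scanipy's actual implementation:

```python
import shlex

def split_semgrep_args(arg_string: str) -> list[str]:
    """Split a quoted pass-through string into argv-style tokens,
    respecting shell-like quoting (values containing spaces stay intact)."""
    return shlex.split(arg_string)

print(split_semgrep_args('--severity ERROR --json'))
# ['--severity', 'ERROR', '--json']

# Quoted values survive as single tokens:
print(split_semgrep_args('--exclude "test dir"'))
# ['--exclude', 'test dir']
```

Using `shlex.split` rather than `str.split` matters when a rule path or exclude pattern contains spaces.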
## Containerized Execution (Kubernetes)

Scanipy uses containerized execution via Kubernetes Jobs to analyze multiple repositories in parallel. All Semgrep analysis runs through the API service.
### Prerequisites

- Kubernetes cluster (EKS, GKE, AKS, or local with k3d/kind)
- Semgrep API service running in the cluster
- Worker container image built and available
- S3 bucket for storing results (or S3-compatible storage)
### Basic Usage

```bash
# Run Semgrep analysis (always uses containerized execution)
python scanipy.py --query "extractall" --run-semgrep \
  --api-url http://scanipy-api:8000 \
  --s3-bucket scanipy-results
```
### Configuration Options

```bash
# Full example with all options
python scanipy.py --query "extractall" --run-semgrep \
  --api-url http://scanipy-api:8000 \
  --s3-bucket scanipy-results \
  --k8s-namespace scanipy \
  --max-parallel-jobs 20 \
  --rules ./custom-rules.yaml \
  --pro
```
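The `--max-parallel-jobs` cap is an instance of the standard bounded-concurrency pattern: launch all work, but gate execution behind a semaphore of fixed size. A hypothetical sketch of the idea (not Scanipy's code) using an asyncio semaphore:

```python
import asyncio

MAX_PARALLEL_JOBS = 3  # stand-in for --max-parallel-jobs

async def scan_repo(repo: str, sem: asyncio.Semaphore, stats: list[int]) -> str:
    async with sem:  # at most MAX_PARALLEL_JOBS bodies run at once
        stats[0] += 1                          # current in-flight count
        stats[1] = max(stats[1], stats[0])     # record peak concurrency
        await asyncio.sleep(0.01)              # placeholder for the real scan job
        stats[0] -= 1
    return f"{repo}: done"

async def main() -> int:
    sem = asyncio.Semaphore(MAX_PARALLEL_JOBS)
    stats = [0, 0]  # [current, peak]
    repos = [f"repo-{i}" for i in range(10)]
    await asyncio.gather(*(scan_repo(r, sem, stats) for r in repos))
    return stats[1]

peak = asyncio.run(main())
assert peak <= MAX_PARALLEL_JOBS  # the cap is never exceeded
```

In Scanipy's case the gated unit is a Kubernetes Job rather than a coroutine, but the throttling principle is the same.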
### Architecture

When using `--run-semgrep`:

- The CLI creates a scan session via the API service
- The API service creates Kubernetes Jobs (one per repository), up to the `--max-parallel-jobs` limit
- Worker containers (one per Job) execute in parallel:
  - Clone the repository
  - Run Semgrep analysis
  - Upload results to S3
  - Report status back to the API service
- The CLI polls the API service for completion and fetches the final results
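The final step is a poll-with-timeout loop governed by `--poll-interval` and `--max-wait-time`. A generic sketch of that loop — the status values and the `get_status` callable are assumptions for illustration, not Scanipy's actual API:

```python
import time
from typing import Callable

def wait_for_completion(
    get_status: Callable[[], str],   # e.g. wraps a GET to the API service (hypothetical)
    poll_interval: float = 10.0,     # mirrors --poll-interval (seconds)
    max_wait_time: float = 3600.0,   # mirrors --max-wait-time (seconds)
) -> str:
    """Poll until the scan session reports a terminal state or the deadline passes."""
    deadline = time.monotonic() + max_wait_time
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("analysis did not finish within max_wait_time")

# Example with a stubbed status source:
states = iter(["pending", "running", "completed"])
print(wait_for_completion(lambda: next(states), poll_interval=0.01))
# prints "completed"
```

Using `time.monotonic()` for the deadline keeps the timeout correct even if the system clock is adjusted mid-scan.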
### Benefits

- Parallel Execution: Multiple repositories scanned simultaneously
- Scalability: Leverage Kubernetes cluster resources
- Isolation: Each scan runs in its own container
- Resilience: Failed jobs don't affect others
- Production-Ready: Designed for EKS deployment
- Observability: Monitoring and troubleshooting via standard Kubernetes tooling
## Built-in Rules

Scanipy includes built-in security rules.

### Tarslip Rules

Detect path traversal vulnerabilities in archive extraction:

```bash
python scanipy.py --query "extractall" --run-semgrep \
  --rules ./tools/semgrep/rules/tarslip.yaml \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results
```
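For context, a "tarslip" finding targets code where a crafted archive member such as `../../etc/cron.d/evil` escapes the extraction directory. A sketch of the vulnerable pattern and one common mitigation — this illustrates the vulnerability class, not the exact contents of `tarslip.yaml`:

```python
import os
import tarfile

def unsafe_extract(archive: str, dest: str) -> None:
    with tarfile.open(archive) as tar:
        tar.extractall(dest)  # VULNERABLE: member names may contain "../" and escape dest

def safe_extract(archive: str, dest: str) -> None:
    dest = os.path.realpath(dest)
    with tarfile.open(archive) as tar:
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(dest, member.name))
            # Reject any member that would resolve outside the destination directory
            if not target.startswith(dest + os.sep):
                raise ValueError(f"blocked path traversal: {member.name}")
        tar.extractall(dest)
```

On Python 3.12+, passing `filter="data"` to `extractall` provides a built-in defense against this class of attack.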
## Semgrep Options Reference

| Option | Description |
|---|---|
| `--run-semgrep` | Enable Semgrep analysis (requires `--api-url` and `--s3-bucket`) |
| `--rules` | Path to custom Semgrep rules |
| `--pro` | Use Semgrep Pro |
| `--semgrep-args` | Additional Semgrep CLI arguments |
| `--api-url` | Semgrep API service URL (required for `--run-semgrep`) |
| `--s3-bucket` | S3 bucket for storing results (required for `--run-semgrep`) |
| `--k8s-namespace` | Kubernetes namespace for jobs (default: `default`) |
| `--max-parallel-jobs` | Maximum parallel Kubernetes jobs (default: 10) |
| `--max-wait-time` | Maximum time to wait for analysis completion, in seconds (default: 3600) |
| `--poll-interval` | Interval between status polls, in seconds (default: 10) |