Semgrep Integration

Scanipy can automatically clone and scan the top 10 repositories with Semgrep for security vulnerabilities.

Command Usage

Run from source using python scanipy.py.

Prerequisites

Semgrep API service running
S3 bucket for storing analysis results
Kubernetes cluster (for running analysis jobs)

Basic Usage

# Run Semgrep analysis (requires --api-url and --s3-bucket)
python scanipy.py --query "extractall" --run-semgrep \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results

Custom Rules

# Use custom Semgrep rules
python scanipy.py --query "extractall" --run-semgrep \
  --rules ./my_rules.yaml \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results

# Use built-in tarslip rules
python scanipy.py --query "extractall" --run-semgrep \
  --rules ./tools/semgrep/rules/tarslip.yaml \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results

Semgrep Pro

If you have a Semgrep Pro license:

python scanipy.py --query "extractall" --run-semgrep --pro \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results

Additional Semgrep Arguments

Pass additional arguments directly to Semgrep:

python scanipy.py --query "extractall" --run-semgrep \
  --semgrep-args "--severity ERROR --json" \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results

Containerized Execution (Kubernetes)

Scanipy uses containerized execution via Kubernetes Jobs for parallel analysis across multiple repositories. All Semgrep analysis runs through the API service.

Prerequisites

Kubernetes cluster (EKS, GKE, AKS, or local with k3d/kind)
Semgrep API service running in the cluster
Worker container image built and available
S3 bucket for storing results (or S3-compatible storage)

Basic Usage

# Run Semgrep analysis (always uses containerized execution)
python scanipy.py --query "extractall" --run-semgrep \
  --api-url http://scanipy-api:8000 \
  --s3-bucket scanipy-results

Configuration Options

# Full example with all options
python scanipy.py --query "extractall" --run-semgrep \
  --api-url http://scanipy-api:8000 \
  --s3-bucket scanipy-results \
  --k8s-namespace scanipy \
  --max-parallel-jobs 20 \
  --rules ./custom-rules.yaml \
  --pro

Architecture

When using --run-semgrep:

CLI creates a scan session via API service
API Service creates Kubernetes Jobs (one per repository) up to --max-parallel-jobs limit
Worker Containers (one per Job) execute in parallel:
- Clone the repository
- Run Semgrep analysis
- Upload results to S3
- Report status back to API service
CLI polls API service for completion and fetches final results

Benefits

Parallel Execution: Multiple repositories scanned simultaneously
Scalability: Leverage Kubernetes cluster resources
Isolation: Each scan runs in its own container
Resilience: Failed jobs don't affect others
Production-Ready: Designed for EKS deployment
Monitoring and troubleshooting

Built-in Rules

Scanipy includes built-in security rules:

Tarslip Rules

Detect path traversal vulnerabilities in archive extraction:

python scanipy.py --query "extractall" --run-semgrep \
  --rules ./tools/semgrep/rules/tarslip.yaml \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results

Semgrep Options Reference

Option	Description
`--run-semgrep`	Enable Semgrep analysis (requires `--api-url` and `--s3-bucket`)
`--rules`	Path to custom Semgrep rules
`--pro`	Use Semgrep Pro
`--semgrep-args`	Additional Semgrep CLI arguments
`--api-url`	Semgrep API service URL (required for `--run-semgrep`)
`--s3-bucket`	S3 bucket for storing results (required for `--run-semgrep`)
`--k8s-namespace`	Kubernetes namespace for jobs (default: `default`)
`--max-parallel-jobs`	Maximum parallel Kubernetes jobs (default: 10)
`--max-wait-time`	Maximum time to wait for analysis completion (seconds, default: 3600)
`--poll-interval`	Interval between status polls (seconds, default: 10)