Examples

Real-world usage examples for security research and code analysis.

Command Usage

Run from source using python scanipy.py.

Security Research

Command Injection

Find potential command injection vulnerabilities:

python scanipy.py --query "os.system" --language python \
  --keywords "user,input,request" --run-semgrep \
  --api-url http://localhost:8000 --s3-bucket scanipy-results

SQL Injection

Find potential SQL injection:

python scanipy.py --query "execute(" --language python \
  --keywords "format,user,%s" --run-semgrep \
  --api-url http://localhost:8000 --s3-bucket scanipy-results

Unsafe Deserialization

Find unsafe pickle usage:

python scanipy.py --query "pickle.loads" --language python --run-semgrep \
  --api-url http://localhost:8000 --s3-bucket scanipy-results

Path Traversal (Tarslip)

Find path traversal vulnerabilities in archive extraction:

python scanipy.py --query "extractall" --language python \
  --run-semgrep --rules ./tools/semgrep/rules/tarslip.yaml \
  --api-url http://localhost:8000 --s3-bucket scanipy-results

With CodeQL for deeper analysis:

python scanipy.py --query "extractall" --language python --run-codeql \
  --codeql-queries "codeql/python-queries:Security/CWE-022/TarSlip.ql"

Hardcoded Secrets

Find potential hardcoded credentials:

python scanipy.py --query "password =" --language python \
  --keywords "secret,api_key,token" --run-semgrep \
  --api-url http://localhost:8000 --s3-bucket scanipy-results

Code Pattern Analysis

Deprecated API Usage

Find deprecated urllib2 usage:

python scanipy.py --query "urllib2" --language python

Library Usage

Find specific library usage in popular repos:

python scanipy.py --query "import tensorflow" --language python \
  --search-strategy tiered

Advanced Filtering

Exclude Organizations

Search but exclude specific organizations:

python scanipy.py --query "eval(" \
  --additional-params "stars:>1000 -org:microsoft -org:google"

High-Star Repos Only

Focus on very popular repositories:

python scanipy.py --query "subprocess.Popen" --language python \
  --additional-params "stars:>10000"
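
The qualifier strings in the two examples above can be composed programmatically when the thresholds or excluded organizations vary between runs. The following is a minimal sketch, assuming `--additional-params` is passed through unchanged as search qualifiers; `build_additional_params` is a hypothetical helper, not part of Scanipy.

```python
def build_additional_params(min_stars=None, exclude_orgs=()):
    """Compose a qualifier string suitable for --additional-params.

    Hypothetical helper for illustration; it only concatenates the
    qualifier forms already shown in the examples above.
    """
    parts = []
    if min_stars is not None:
        parts.append(f"stars:>{min_stars}")
    # A leading "-" excludes matches for that qualifier.
    parts.extend(f"-org:{org}" for org in exclude_orgs)
    return " ".join(parts)


print(build_additional_params(1000, ["microsoft", "google"]))
# prints: stars:>1000 -org:microsoft -org:google
```

The returned string can be quoted and passed as the value of `--additional-params`.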

Combined Filters

python scanipy.py \
  --query "subprocess" \
  --language python \
  --keywords "shell=True,user" \
  --pages 10 \
  --search-strategy tiered \
  --run-semgrep \
  --api-url http://localhost:8000 --s3-bucket scanipy-results

Workflow Examples

Research Workflow

Search and save results:

python scanipy.py --query "extractall" --language python \
  --output tarslip_repos.json

Review results, then run analysis:

python scanipy.py --query "extractall" \
  --input-file tarslip_repos.json \
  --run-semgrep --rules ./tools/semgrep/rules/tarslip.yaml \
  --api-url http://localhost:8000 --s3-bucket scanipy-results

Run CodeQL for deeper analysis:

python scanipy.py --query "extractall" --language python \
  --input-file tarslip_repos.json \
  --run-codeql --codeql-output-dir ./tarslip_sarif
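
Once CodeQL has written its output, a quick per-rule tally helps decide which repositories to review first. This sketch assumes the output directory contains standard SARIF 2.1.0 files with a `.sarif` extension (the exact file layout Scanipy produces is an assumption); `count_findings_by_rule` is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def count_findings_by_rule(sarif_dir):
    """Count SARIF results per rule ID across all .sarif files under sarif_dir."""
    counts = Counter()
    # Assumption: files end in .sarif; rglob also covers nested directories.
    for path in sorted(Path(sarif_dir).rglob("*.sarif")):
        with path.open(encoding="utf-8") as fh:
            sarif = json.load(fh)
        # SARIF layout: top-level "runs", each holding a "results" array.
        for run in sarif.get("runs", []):
            for result in run.get("results", []):
                counts[result.get("ruleId", "<unknown>")] += 1
    return counts
```

For example, `count_findings_by_rule("./tarslip_sarif").most_common(5)` would list the five rules with the most findings.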

Long-Running Analysis

For large-scale analysis with resume capability:

# Start analysis (can be interrupted)
python scanipy.py --query "eval(" --language python \
  --pages 10 \
  --run-semgrep \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results

Multi-Tool Analysis

Run both Semgrep and CodeQL on the same repositories:

# First, search and save results
python scanipy.py --query "extractall" --language python \
  --output repos.json

# Run Semgrep on those repos (via API)
python scanipy.py --query "extractall" --language python \
  --input-file repos.json \
  --run-semgrep \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results

# Run CodeQL on the same repos
python scanipy.py --query "extractall" --language python \
  --input-file repos.json \
  --run-codeql \
  --codeql-output-dir ./sarif_results
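
The three steps above can also be scripted so they always run in the same order against the same saved repository list. The sketch below uses only the flags shown in the commands above; the helper names `build_steps` and `run_steps` are hypothetical, and the script defaults to a dry run that prints the commands without executing them.

```python
import shlex
import subprocess

# Shared prefix for every step, copied from the commands above.
BASE = ["python", "scanipy.py", "--query", "extractall", "--language", "python"]


def build_steps(repos_file, api_url, s3_bucket, sarif_dir):
    """Return the search, Semgrep, and CodeQL commands as argv lists."""
    return [
        BASE + ["--output", repos_file],
        BASE + ["--input-file", repos_file, "--run-semgrep",
                "--api-url", api_url, "--s3-bucket", s3_bucket],
        BASE + ["--input-file", repos_file, "--run-codeql",
                "--codeql-output-dir", sarif_dir],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in steps:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)


run_steps(build_steps("repos.json", "http://localhost:8000",
                      "scanipy-results", "./sarif_results"))
```

Passing `dry_run=False` runs the commands in sequence and stops if any step exits with a non-zero status.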

Language-Specific Examples

JavaScript/TypeScript

python scanipy.py --query "eval(" --language javascript --run-codeql

Java

python scanipy.py --query "Runtime.exec" --language java --run-codeql

Go

python scanipy.py --query "os/exec" --language go --run-codeql

C/C++

python scanipy.py --query "strcpy" --language c --run-codeql \
  --codeql-queries "cpp-security-extended"

Resuming Interrupted Analysis

Both Semgrep and CodeQL analyses can be resumed after an interruption, so a large scan does not have to start over.

Resume Semgrep Analysis

# Run Semgrep analysis (results stored in API service database)
python scanipy.py --query "SQL injection" --language python \
  --run-semgrep \
  --api-url http://localhost:8000 \
  --s3-bucket scanipy-results

Resume CodeQL Analysis

# Start CodeQL analysis with database tracking
python scanipy.py --query "path traversal" --language python \
  --run-codeql --codeql-results-db path_analysis.db

# Resume interrupted analysis
python scanipy.py --query "path traversal" --language python \
  --run-codeql --codeql-results-db path_analysis.db --codeql-resume

Large-Scale Analysis Workflow

For analyzing hundreds of repositories:

# Day 1: Start large scan (100+ repos)
python scanipy.py --query "unsafe deserialization" --language java \
  --pages 10 --run-codeql --codeql-results-db deserialization_scan.db

# Analysis interrupted after 40 repositories...

# Day 2: Resume (skips first 40, continues with remaining)
python scanipy.py --query "unsafe deserialization" --language java \
  --pages 10 --run-codeql --codeql-results-db deserialization_scan.db \
  --codeql-resume

# Completed: all repositories analyzed

Key Points

Session Matching: Resume works by matching query, language, and analysis parameters
Automatic Skipping: Already-analyzed repositories are automatically skipped
Incremental Saves: Results are saved after each repository
Crash Recovery: Analysis survives Ctrl+C, network errors, or system crashes
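
The four points above can be illustrated with a small sketch. This is not Scanipy's implementation: the table schema, `session_key`, and `analyze_with_resume` are assumptions chosen to show one common way to get session matching, skipping, and per-repository commits with SQLite.

```python
import hashlib
import json
import sqlite3


def session_key(query, language, params):
    """Derive a stable key from the inputs that define a scan session."""
    payload = json.dumps(
        {"query": query, "language": language, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def analyze_with_resume(db_path, key, repos, analyze):
    """Analyze repos, skipping any already recorded for this session key."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "session TEXT, repo TEXT, result TEXT, PRIMARY KEY (session, repo))"
    )
    # Automatic skipping: load the repositories finished in earlier runs.
    done = {row[0] for row in conn.execute(
        "SELECT repo FROM results WHERE session = ?", (key,))}
    analyzed = []
    try:
        for repo in repos:
            if repo in done:
                continue
            result = analyze(repo)
            conn.execute("INSERT INTO results VALUES (?, ?, ?)",
                         (key, repo, json.dumps(result)))
            # Incremental save: commit after every repository so an
            # interruption loses at most the repository in progress.
            conn.commit()
            analyzed.append(repo)
    finally:
        conn.close()
    return analyzed
```

Because the key is derived from the query, language, and parameters, rerunning with different inputs starts a fresh session instead of reusing earlier results.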

Containerized Execution (Semgrep via API)

For large-scale parallel analysis, Semgrep runs through the API service, which orchestrates Kubernetes Jobs.

Basic Semgrep Analysis

# Run Semgrep analysis (API orchestrates Kubernetes jobs)
python scanipy.py --query "extractall" --language python --run-semgrep \
  --api-url http://scanipy-api:8000 \
  --s3-bucket scanipy-results

High-Throughput Analysis

Run analysis on many repositories in parallel:

# Analyze 50+ repositories with 20 parallel jobs
python scanipy.py --query "extractall" --language python \
  --pages 10 \
  --run-semgrep \
  --api-url http://scanipy-api:8000 \
  --s3-bucket scanipy-results \
  --k8s-namespace scanipy \
  --max-parallel-jobs 20 \
  --rules ./tools/semgrep/rules/tarslip.yaml

Production Deployment Workflow

Deploy API service:

kubectl apply -f k8s/api-service.yaml
kubectl apply -f k8s/rbac.yaml
kubectl apply -f k8s/configmap.yaml

Run analysis from CLI:

python scanipy.py --query "SQL injection" --language python \
  --run-semgrep \
  --api-url http://scanipy-api.scanipy.svc.cluster.local:8000 \
  --s3-bucket scanipy-results \
  --max-parallel-jobs 30

Monitor progress:

# Check Kubernetes jobs
kubectl get jobs -n scanipy

# View API logs
kubectl logs -n scanipy deployment/scanipy-api

# Check worker logs
kubectl logs -n scanipy job/semgrep-1-repo-name-abc123

Benefits

Parallel Execution: Multiple repositories are analyzed simultaneously
Scalability: Scans can use the capacity of the entire Kubernetes cluster
Isolation: Each scan runs in an isolated container
Resilience: A failed job does not affect the others
Production-Ready: Designed for EKS/GKE/AKS deployment
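
The parallelism and resilience properties can be illustrated locally. In this sketch a thread pool stands in for Kubernetes Jobs, and `scan_all` is a hypothetical function rather than anything the API service exposes; the point is only that each scan's failure is captured separately while the rest continue.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scan_all(repos, scan, max_parallel_jobs=20):
    """Run scan(repo) for every repo with bounded parallelism.

    Returns (results, failures): two dicts keyed by repository name.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_parallel_jobs) as pool:
        futures = {pool.submit(scan, repo): repo for repo in repos}
        for future in as_completed(futures):
            repo = futures[future]
            try:
                results[repo] = future.result()
            except Exception as exc:
                # Record the failure and keep going; other scans are unaffected.
                failures[repo] = exc
    return results, failures
```

The `max_parallel_jobs` argument mirrors the role of `--max-parallel-jobs`: it caps how many scans are in flight at once.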
