Examples
Real-world usage examples for security research and code analysis.
Command Usage
Run from source using python scanipy.py.
Security Research
Command Injection
Find potential command injection vulnerabilities:
python scanipy.py --query "os.system" --language python \
--keywords "user,input,request" --run-semgrep \
--api-url http://localhost:8000 --s3-bucket scanipy-results
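For reference, a minimal illustration of the pattern this query and the user/input/request keywords are meant to surface; the function is contrived, not taken from any scanned repository:
import os

def ping(host: str) -> None:
    # command injection: user-supplied input is concatenated into a shell command,
    # so a value like "8.8.8.8; rm -rf ~" executes an extra command
    os.system("ping -c 1 " + host)

if __name__ == "__main__":
    ping(input("host to ping: "))  # untrusted user input reaches os.system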
SQL Injection
Find potential SQL injection:
python scanipy.py --query "execute(" --language python \
--keywords "format,user,%s" --run-semgrep \
--api-url http://localhost:8000 --s3-bucket scanipy-results
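As context, a minimal sketch of the pattern the execute( query and the format/%s keywords are aimed at; the table and column names are illustrative:
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    # SQL injection: user input is interpolated with %s instead of a bound parameter
    return conn.execute("SELECT * FROM users WHERE name = '%s'" % username).fetchall()
    # parameterized alternative:
    # conn.execute("SELECT * FROM users WHERE name = ?", (username,))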
Unsafe Deserialization
Find unsafe pickle usage:
python scanipy.py --query "pickle.loads" --language python --run-semgrep \
--api-url http://localhost:8000 --s3-bucket scanipy-results
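A minimal sketch of the unsafe pattern being searched for; the function name is illustrative:
import pickle

def load_session(raw: bytes):
    # unsafe deserialization: unpickling attacker-controlled bytes can execute
    # arbitrary code (e.g. via a crafted __reduce__), so never use it on untrusted data
    return pickle.loads(raw)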
Path Traversal (Tarslip)
Find path traversal vulnerabilities in archive extraction:
python scanipy.py --query "extractall" --language python \
--run-semgrep --rules ./tools/semgrep/rules/tarslip.yaml \
--api-url http://localhost:8000 --s3-bucket scanipy-results
With CodeQL for deeper analysis:
python scanipy.py --query "extractall" --language python --run-codeql \
--codeql-queries "codeql/python-queries:Security/CWE-022/TarSlip.ql"
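As background, a minimal sketch of the tarslip pattern these Semgrep and CodeQL queries target, together with the kind of member-path check that avoids it; the helper names are illustrative:
import os
import tarfile

def unpack(archive: str, dest: str) -> None:
    with tarfile.open(archive) as tar:
        tar.extractall(dest)  # tarslip: "../" member names can escape dest

def unpack_safely(archive: str, dest: str) -> None:
    dest_root = os.path.realpath(dest)
    with tarfile.open(archive) as tar:
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.name}")
        # on Python 3.12+, tar.extractall(dest, filter="data") adds built-in protection
        tar.extractall(dest_root)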
Hardcoded Secrets
Find potential hardcoded credentials:
python scanipy.py --query "password =" --language python \
--keywords "secret,api_key,token" --run-semgrep \
--api-url http://localhost:8000 --s3-bucket scanipy-results
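For context, a minimal sketch of what this search tends to turn up; the values are placeholders:
# hardcoded credentials: anyone who can read the repository (or a leaked copy) can read these
password = "correct-horse-battery-staple"   # placeholder value
api_key = "PLACEHOLDER-API-KEY"             # should come from an env var or secret store instead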
Code Pattern Analysis
Deprecated API Usage
Find deprecated urllib2 usage:
python scanipy.py --query "urllib2" --language python
Library Usage
Find specific library usage in popular repos:
python scanipy.py --query "import tensorflow" --language python \
--search-strategy tiered
Recently Updated
Find recently updated repos using a pattern:
python scanipy.py --query "FastAPI" --language python --sort-by updated
Advanced Filtering
Exclude Organizations
Search but exclude specific organizations:
python scanipy.py --query "eval(" \
--additional-params "stars:>1000 -org:microsoft -org:google"
High-Star Repos Only
Focus on very popular repositories:
python scanipy.py --query "subprocess.Popen" --language python \
--additional-params "stars:>10000"
Combined Filters
python scanipy.py \
--query "subprocess" \
--language python \
--keywords "shell=True,user" \
--pages 10 \
--search-strategy tiered \
--run-semgrep \
--api-url http://localhost:8000 --s3-bucket scanipy-results
Workflow Examples
Research Workflow
- Search and save results:
python scanipy.py --query "extractall" --language python \
--output tarslip_repos.json
- Review results, then run analysis:
python scanipy.py --query "extractall" \
--input-file tarslip_repos.json \
--run-semgrep --rules ./tools/semgrep/rules/tarslip.yaml \
--api-url http://localhost:8000 --s3-bucket scanipy-results
- Run CodeQL for deeper analysis:
python scanipy.py --query "extractall" --language python \
--input-file tarslip_repos.json \
--run-codeql --codeql-output-dir ./tarslip_sarif
Long-Running Analysis
For large-scale analysis with resume capability:
# Start analysis (can be interrupted)
python scanipy.py --query "eval(" --language python \
--pages 10 \
--run-semgrep \
--api-url http://localhost:8000 \
--s3-bucket scanipy-results
Multi-Tool Analysis
Run both Semgrep and CodeQL on the same repositories:
# First, search and save results
python scanipy.py --query "extractall" --language python \
--output repos.json
# Run Semgrep on those repos (via API)
python scanipy.py --query "extractall" --language python \
--input-file repos.json \
--run-semgrep \
--api-url http://localhost:8000 \
--s3-bucket scanipy-results
# Run CodeQL on the same repos
python scanipy.py --query "extractall" --language python \
--input-file repos.json \
--run-codeql \
--codeql-output-dir ./sarif_results
Language-Specific Examples
JavaScript/TypeScript
python scanipy.py --query "eval(" --language javascript --run-codeql
Java
python scanipy.py --query "Runtime.exec" --language java --run-codeql
Go
python scanipy.py --query "os/exec" --language go --run-codeql
C/C++
python scanipy.py --query "strcpy" --language c --run-codeql \
--codeql-queries "cpp-security-extended"
Resuming Interrupted Analysis
Both Semgrep and CodeQL support resuming an interrupted analysis, which is useful for large scans that cannot be completed in a single run.
Resume Semgrep Analysis
# Run Semgrep analysis (results are stored in the API service database);
# re-running the same command skips repositories that were already analyzed
python scanipy.py --query "SQL injection" --language python \
--run-semgrep \
--api-url http://localhost:8000 \
--s3-bucket scanipy-results
Resume CodeQL Analysis
# Start CodeQL analysis with database tracking
python scanipy.py --query "path traversal" --language python \
--run-codeql --codeql-results-db path_analysis.db
# Resume interrupted analysis
python scanipy.py --query "path traversal" --language python \
--run-codeql --codeql-results-db path_analysis.db --codeql-resume
Large-Scale Analysis Workflow
For analyzing hundreds of repositories:
# Day 1: Start large scan (100+ repos)
python scanipy.py --query "unsafe deserialization" --language java \
--pages 10 --run-codeql --codeql-results-db deserialization_scan.db
# Analysis interrupted after 40 repositories...
# Day 2: Resume (skips first 40, continues with remaining)
python scanipy.py --query "unsafe deserialization" --language java \
--pages 10 --run-codeql --codeql-results-db deserialization_scan.db \
--codeql-resume
# Completed: All 100 repositories analyzed
Key Points
- Session Matching: Resume works by matching query, language, and analysis parameters
- Automatic Skipping: Already-analyzed repositories are automatically skipped
- Incremental Saves: Results are saved after each repository
- Crash Recovery: Analysis survives Ctrl+C, network errors, or system crashes
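The sketch below is a hypothetical illustration of how this kind of resumable tracking can work, not scanipy's actual schema or code: a small SQLite table keyed by the session parameters is consulted before each repository is analyzed and updated after each result is saved.
import sqlite3

# Hypothetical sketch only; scanipy's real results database may differ.
def open_results_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS analyzed (
               query TEXT, language TEXT, repo TEXT, sarif_path TEXT,
               PRIMARY KEY (query, language, repo)
           )"""
    )
    return conn

def analyze_all(conn, query, language, repos, run_analysis):
    # session matching: only rows for the same query/language count as "done"
    done = {
        row[0]
        for row in conn.execute(
            "SELECT repo FROM analyzed WHERE query = ? AND language = ?",
            (query, language),
        )
    }
    for repo in repos:
        if repo in done:          # automatic skipping on resume
            continue
        sarif = run_analysis(repo)
        conn.execute(
            "INSERT OR REPLACE INTO analyzed VALUES (?, ?, ?, ?)",
            (query, language, repo, sarif),
        )
        conn.commit()             # incremental save: survives Ctrl+C or a crash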
Containerized Execution (Semgrep via API)
For large-scale parallel analysis, Semgrep runs via the API service, which orchestrates Kubernetes Jobs:
Basic Semgrep Analysis
# Run Semgrep analysis (API orchestrates Kubernetes jobs)
python scanipy.py --query "extractall" --language python --run-semgrep \
--api-url http://scanipy-api:8000 \
--s3-bucket scanipy-results
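As a mental model only (not scanipy's actual client code; the endpoint paths and payload fields below are invented for illustration), the CLI acts as a thin client that submits scan requests to the API and polls for completion, while the API launches one Kubernetes Job per repository and writes SARIF results to the S3 bucket:
import time
import requests  # hypothetical sketch; the real API contract may differ

API_URL = "http://scanipy-api:8000"

def submit_and_wait(repos, rules_path):
    # Hypothetical endpoints for illustration only.
    job_ids = []
    for repo in repos:
        resp = requests.post(f"{API_URL}/scans", json={"repo": repo, "rules": rules_path})
        resp.raise_for_status()
        job_ids.append(resp.json()["id"])
    pending = set(job_ids)
    while pending:
        for job_id in list(pending):
            status = requests.get(f"{API_URL}/scans/{job_id}").json()["status"]
            if status in ("succeeded", "failed"):
                pending.discard(job_id)   # results land in the S3 bucket
        time.sleep(10)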
High-Throughput Analysis
Run analysis on many repositories in parallel:
# Analyze 50+ repositories with 20 parallel jobs
python scanipy.py --query "pickle.loads" --language python \
--pages 10 \
--run-semgrep \
--api-url http://scanipy-api:8000 \
--s3-bucket scanipy-results \
--k8s-namespace scanipy \
--max-parallel-jobs 20 \
--rules ./tools/semgrep/rules/tarslip.yaml
Production Deployment Workflow
- Deploy API service:
kubectl apply -f k8s/api-service.yaml
kubectl apply -f k8s/rbac.yaml
kubectl apply -f k8s/configmap.yaml
- Run analysis from CLI:
python scanipy.py --query "SQL injection" --language python \
--run-semgrep \
--api-url http://scanipy-api.scanipy.svc.cluster.local:8000 \
--s3-bucket scanipy-results \
--max-parallel-jobs 30
- Monitor progress:
# Check Kubernetes jobs
kubectl get jobs -n scanipy
# View API logs
kubectl logs -n scanipy deployment/scanipy-api
# Check worker logs
kubectl logs -n scanipy job/semgrep-1-repo-name-abc123
Benefits
- Parallel Execution: Multiple repositories analyzed simultaneously
- Scalability: Leverage entire Kubernetes cluster
- Isolation: Each scan runs in an isolated container
- Resilience: Failed jobs don't affect others
- Production-Ready: Designed for EKS/GKE/AKS deployment