CodeQL Integration

Scanipy can run CodeQL, GitHub's code analysis engine, on the top 10 repositories from a search to perform deep semantic security scanning.

Command Usage

Run from source using `python scanipy.py`.

Prerequisites

Install the CodeQL CLI before using this feature:

Download from GitHub Releases
Extract and add to your PATH

# Download and extract (Linux)
wget https://github.com/github/codeql-cli-binaries/releases/latest/download/codeql-linux64.zip
unzip codeql-linux64.zip

# Add to PATH
export PATH="$PWD/codeql:$PATH"

# Verify installation
codeql --version

For detailed instructions, see the CodeQL CLI documentation.
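
If you drive Scanipy from a script, it can help to confirm the CLI is reachable before starting a long run. The following is a minimal sketch, not part of Scanipy: the helper name `find_codeql_version` is hypothetical, and it relies only on `codeql --version` as shown above.

```python
import shutil
import subprocess
from typing import Optional


def find_codeql_version(executable: str = "codeql") -> Optional[str]:
    """Return the first line printed by `<executable> --version`, or None."""
    path = shutil.which(executable)
    if path is None:
        return None  # executable is not on PATH
    try:
        completed = subprocess.run(
            [path, "--version"],
            capture_output=True,
            text=True,
            check=True,
            timeout=60,
        )
    except (OSError, subprocess.SubprocessError):
        return None  # found on PATH, but it could not be run successfully
    lines = completed.stdout.strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(find_codeql_version() or "CodeQL CLI not found on PATH")
```

A `None` result usually means the `export PATH=...` step has not been applied in the current shell.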

Basic Usage

CodeQL analysis requires a language to be specified with `--language`:

python scanipy.py --query "extractall" --language python --run-codeql

Supported Languages

Language	CodeQL Identifier
Python	`python`
JavaScript	`javascript`
TypeScript	`javascript` (uses JS extractor)
Java	`java`
Kotlin	`java` (uses Java extractor)
C	`cpp`
C++	`cpp`
C#	`csharp`
Go	`go`
Ruby	`ruby`
Swift	`swift`
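
The table above can be expressed as a lookup, which makes the shared extractors explicit (TypeScript maps to `javascript`, Kotlin to `java`, C and C++ to `cpp`). This is an illustrative sketch: the names `CODEQL_LANGUAGE_IDS` and `codeql_identifier` are hypothetical, and whether Scanipy itself accepts display names such as "TypeScript" is not stated here.

```python
from typing import Optional

# Language name (lowercased) -> CodeQL identifier, copied from the table above.
CODEQL_LANGUAGE_IDS = {
    "python": "python",
    "javascript": "javascript",
    "typescript": "javascript",  # uses the JavaScript extractor
    "java": "java",
    "kotlin": "java",  # uses the Java extractor
    "c": "cpp",
    "c++": "cpp",
    "c#": "csharp",
    "go": "go",
    "ruby": "ruby",
    "swift": "swift",
}


def codeql_identifier(language: str) -> Optional[str]:
    """Look up the CodeQL identifier for a language name, ignoring case."""
    return CODEQL_LANGUAGE_IDS.get(language.strip().lower())
```

For example, `codeql_identifier("Kotlin")` returns `"java"`, and a language missing from the table returns `None`.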

Custom Query Suites

# Use a different query suite
python scanipy.py --query "extractall" --language python --run-codeql \
  --codeql-queries "python-security-extended"

# Run a specific query for faster analysis
python scanipy.py --query "extractall" --language python --run-codeql \
  --codeql-queries "codeql/python-queries:Security/CWE-022/TarSlip.ql"

Output Formats

# SARIF format (default)
python scanipy.py --query "extractall" --language python --run-codeql \
  --codeql-format sarif-latest

# CSV format
python scanipy.py --query "extractall" --language python --run-codeql \
  --codeql-format csv

# Text format
python scanipy.py --query "extractall" --language python --run-codeql \
  --codeql-format text

Saving SARIF Results

Save SARIF results to files for later analysis:

# Save to default directory (./codeql_results)
python scanipy.py --query "extractall" --language python --run-codeql

# Save to custom directory
python scanipy.py --query "extractall" --language python --run-codeql \
  --codeql-output-dir ./my_sarif_results

SARIF files are saved with timestamped filenames:

my_sarif_results/
├── owner_repo1_20251229_120000.sarif
├── owner_repo2_20251229_120100.sarif
└── ...
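
The filenames in the listing appear to follow the pattern `owner_repo_YYYYMMDD_HHMMSS.sarif`. The sketch below reproduces that pattern for scripts that need to predict or match these names. The function `sarif_path` is hypothetical, and the pattern is inferred from the example listing rather than taken from Scanipy's source.

```python
from datetime import datetime
from pathlib import Path


def sarif_path(output_dir: str, repo_full_name: str, when: datetime) -> Path:
    """Build a timestamped SARIF path such as owner_repo_20251229_120000.sarif."""
    safe_name = repo_full_name.replace("/", "_")  # "owner/repo" -> "owner_repo"
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"{safe_name}_{stamp}.sarif"


if __name__ == "__main__":
    print(sarif_path("my_sarif_results", "owner/repo1", datetime(2025, 12, 29, 12, 0, 0)))
```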

Resume Capability

CodeQL analysis can be interrupted and resumed from where it left off. This is useful for long-running analyses that may be interrupted by network issues, Ctrl+C, or system restarts.

Basic Resume

# Start analysis with a results database
python scanipy.py --query "extractall" --language python --run-codeql \
  --codeql-results-db codeql_analysis.db

# If interrupted, resume from the same session
python scanipy.py --query "extractall" --language python --run-codeql \
  --codeql-results-db codeql_analysis.db --codeql-resume

How It Works

Session Tracking: Each analysis run creates a session tracked by query, language, and query suite
Incremental Saves: Results are saved to SQLite after analyzing each repository
Smart Resume: Already-analyzed repositories are automatically skipped
Survives Interruptions: Because results are saved per repository, progress made before a Ctrl+C, network error, or system crash is kept
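
The save-then-skip behaviour described above can be pictured with a small sketch. It is not Scanipy's implementation: the function `analyze_with_resume`, the `analyze` callback, and the two-column table layout are assumptions made for illustration. Only the `codeql_results` table name and its `repo_name` and `success` columns are mentioned on this page.

```python
import sqlite3
from typing import Callable, Iterable, List


def analyze_with_resume(
    conn: sqlite3.Connection,
    repos: Iterable[str],
    analyze: Callable[[str], bool],
) -> List[str]:
    """Analyze repos that are not yet recorded; return the ones analyzed now."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS codeql_results ("
        "repo_name TEXT PRIMARY KEY, success INTEGER NOT NULL)"
    )
    done = {row[0] for row in conn.execute("SELECT repo_name FROM codeql_results")}
    analyzed_now = []
    for repo in repos:
        if repo in done:
            continue  # smart resume: this repository was already analyzed
        success = analyze(repo)
        conn.execute(
            "INSERT INTO codeql_results (repo_name, success) VALUES (?, ?)",
            (repo, int(success)),
        )
        conn.commit()  # incremental save: persist after every repository
        analyzed_now.append(repo)
    return analyzed_now
```

Committing after each repository is what lets an interrupted run lose at most the repository that was in progress.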

Example Workflow

# Day 1: Start analyzing 100 repositories
python scanipy.py --query "path traversal" --language python --run-codeql \
  --codeql-results-db path_traversal.db --pages 10

# Analysis interrupted after 40 repos...
# Ctrl+C

# Day 2: Resume analysis with the same search parameters (skips first 40 repos)
python scanipy.py --query "path traversal" --language python --run-codeql \
  --codeql-results-db path_traversal.db --pages 10 --codeql-resume

# Continue where you left off - remaining 60 repos analyzed

Session Matching

Resume works by matching:

Query: The search query used
Language: The programming language
Query Suite: The CodeQL query suite (if specified)

If any of these change, a new session is created:

# Creates session 1
python scanipy.py --query "pickle.loads" --language python --run-codeql \
  --codeql-results-db analysis.db

# Creates session 2 (different query)
python scanipy.py --query "eval" --language python --run-codeql \
  --codeql-results-db analysis.db

# Creates session 3 (different query suite)
python scanipy.py --query "pickle.loads" --language python --run-codeql \
  --codeql-results-db analysis.db --codeql-queries "security-extended"
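
One way to think about session matching is as a lookup on a three-part key. The sketch below mirrors the three commands above. `SessionKey` and `session_id_for` are hypothetical names, and how Scanipy actually stores sessions is not described here.

```python
from typing import Dict, NamedTuple, Optional


class SessionKey(NamedTuple):
    query: str
    language: str
    query_suite: Optional[str]  # None when --codeql-queries is not given


def session_id_for(sessions: Dict[SessionKey, int], key: SessionKey) -> int:
    """Return the id of the matching session, creating a new one if none matches."""
    if key not in sessions:
        sessions[key] = len(sessions) + 1
    return sessions[key]


if __name__ == "__main__":
    sessions: Dict[SessionKey, int] = {}
    print(session_id_for(sessions, SessionKey("pickle.loads", "python", None)))
    print(session_id_for(sessions, SessionKey("eval", "python", None)))
    print(session_id_for(sessions, SessionKey("pickle.loads", "python", "security-extended")))
    # Repeating the first key matches the existing session instead of creating one.
    print(session_id_for(sessions, SessionKey("pickle.loads", "python", None)))
```

Changing any one field produces a different key, which is why a changed query or query suite starts from an empty session.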

Viewing Results

The database stores:

Repository names and URLs
Success/failure status
Error messages (for failures)
SARIF file paths (for successes)
Analysis timestamps

You can query the database directly using SQLite:

sqlite3 codeql_analysis.db "SELECT repo_name, success FROM codeql_results"
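
The same query can be run from Python with the standard `sqlite3` module. This sketch uses only the `codeql_results` table and the `repo_name` and `success` columns shown in the command above. The function name `split_by_status` is hypothetical, and it assumes `success` is stored as a truthy or falsy value such as 1 or 0.

```python
import sqlite3
from typing import List, Tuple


def split_by_status(conn: sqlite3.Connection) -> Tuple[List[str], List[str]]:
    """Return (succeeded, failed) repository names from codeql_results."""
    succeeded: List[str] = []
    failed: List[str] = []
    rows = conn.execute(
        "SELECT repo_name, success FROM codeql_results ORDER BY repo_name"
    )
    for repo_name, success in rows:
        (succeeded if success else failed).append(repo_name)
    return succeeded, failed
```

Typical usage would be `split_by_status(sqlite3.connect("codeql_analysis.db"))`, after which the failed list can be checked against the stored error messages.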

Best Practices

Use Descriptive Database Names: Name databases after the vulnerability or pattern you're searching for
```
--codeql-results-db sql_injection_scan.db
```
Always Use Resume Flag: When continuing analysis, always specify `--codeql-resume`
Match Parameters: Ensure query, language, and query suite match the original analysis
Check Session Info: The tool prints session information showing how many repos were already analyzed

Performance Tips

Use Specific Queries

Running the full security suite can take a long time. For faster analysis, use specific queries:

# Full suite (slow)
python scanipy.py --query "extractall" --language python --run-codeql

# Specific query (fast)
python scanipy.py --query "extractall" --language python --run-codeql \
  --codeql-queries "codeql/python-queries:Security/CWE-022/TarSlip.ql"

Limit Pages

Reduce the number of repositories to analyze by fetching fewer pages of search results:

python scanipy.py --query "extractall" --language python --run-codeql --pages 1

CodeQL Options Reference

Option	Description	Default
`--run-codeql`	Enable CodeQL analysis	False
`--codeql-queries`	Query suite or path	Default suite
`--codeql-format`	Output format (sarif-latest, csv, text)	`sarif-latest`
`--codeql-output-dir`	Directory to save SARIF results	`./codeql_results`
`--codeql-results-db`	Path to SQLite database for results	None
`--codeql-resume`	Resume from previous session	False
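
For readers wrapping these options in their own tooling, the table can be restated as an `argparse` declaration. This is a sketch under the assumption that the flags behave as the table describes. It is not Scanipy's actual parser, and `build_codeql_parser` is a hypothetical name.

```python
import argparse


def build_codeql_parser() -> argparse.ArgumentParser:
    """Declare the CodeQL options with the defaults listed in the table."""
    parser = argparse.ArgumentParser(prog="scanipy.py")
    parser.add_argument("--run-codeql", action="store_true",
                        help="Enable CodeQL analysis")
    parser.add_argument("--codeql-queries", default=None,
                        help="Query suite or path (default suite when omitted)")
    parser.add_argument("--codeql-format", default="sarif-latest",
                        choices=["sarif-latest", "csv", "text"],
                        help="Output format")
    parser.add_argument("--codeql-output-dir", default="./codeql_results",
                        help="Directory to save SARIF results")
    parser.add_argument("--codeql-results-db", default=None,
                        help="Path to SQLite database for results")
    parser.add_argument("--codeql-resume", action="store_true",
                        help="Resume from previous session")
    return parser


if __name__ == "__main__":
    print(build_codeql_parser().parse_args(["--run-codeql"]))
```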

Understanding Results

CodeQL results are displayed in a summary format:

--- CodeQL results for owner/repo ---
  [ERROR] py/tarslip at src/file.py:42
    This file extraction depends on a potentially untrusted source.

  Total findings: 1

SARIF files contain detailed information including:

Rule descriptions and severity
Code locations (file, line, column)
Code flow paths
Remediation suggestions
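
Saved SARIF files are JSON, so findings can be extracted with a few lines of Python. The sketch below walks the standard SARIF 2.1.0 layout (`runs`, `results`, `locations`). The function names are hypothetical, and the fallback to `"warning"` when a result carries no `level` is a simplification: a fuller reader would consult the rule's default configuration.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple

Finding = Tuple[str, str, str, int, str]  # (level, rule_id, file, line, message)


def iter_findings(sarif: Dict[str, Any]) -> Iterator[Finding]:
    """Yield one tuple per result in a parsed SARIF log."""
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            uri = physical.get("artifactLocation", {}).get("uri", "<unknown>")
            line = physical.get("region", {}).get("startLine", 0)
            yield (
                result.get("level", "warning"),  # simplified fallback
                result.get("ruleId", "<no rule id>"),
                uri,
                line,
                result.get("message", {}).get("text", ""),
            )


def load_findings(path: str) -> List[Finding]:
    """Read a SARIF file from disk and return its findings."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_findings(json.load(handle)))
```

Each tuple carries the same fields as one entry in the summary output: severity, rule ID, file, line, and message.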