scanipy taint engine reference
Status: v1 (0.2.0). This documents the taint engine that consumes the taint-DSL specs and the normalized IR. It is the companion to the DSL reference: the DSL is what to detect; this is how the engine finds it. Detection knowledge lives entirely in the YAML specs — the engine is class-agnostic (principle P4).
The engine takes an IRModule (from scanipy.frontends.PythonFrontend.parse) and
the active DetectorSpec pack and returns a deterministic, witness-backed
list[Finding]:
from scanipy.frontends.python_frontend import PythonFrontend
from scanipy.registry import load_builtin_detectors
from scanipy.engine import TaintEngine
module = PythonFrontend().parse(path) # IRModule | None
findings = TaintEngine(load_builtin_detectors()).analyze(module)
TaintEngine.analyze raises TypeError on anything that is not an IRModule. It
performs no network I/O and no file writes (principle P1); each module is
analyzed in isolation with no cross-file or global mutable state.
1. Pipeline
Analysis runs in two deterministic phases per module (P3):
IRModule ──▶ Phase 1: function summaries (TITO) summaries.py
│ reverse-topo over the in-file call graph; bounded SCC fixpoint
▼
Phase 2: intraprocedural dataflow taint.py
flow-sensitive forward worklist per CFG, union at joins
▼
dedup ─▶ fingerprint ─▶ total-order sort ─▶ list[Finding]
Module layout:
| File | Responsibility |
|---|---|
engine/taint_state.py |
Access-path taint lattice (AccessPath, TaintLabel, TaintEnv). |
engine/witness.py |
Witness chain selection + sha256 fingerprints. |
engine/propagation.py |
expr_taint: generic built-in + DSL propagation, calls, summaries. |
engine/summaries.py |
TITO FunctionSummary computation + call-graph SCC fixpoint. |
engine/taint.py |
TaintEngine.analyze, the CFG dataflow, sink emission, finalization. |
engine/matcher.py |
Pure Pattern ↔ IR matcher (shared with the frontend). |
2. Taint state: access paths
Taint is keyed by an access path — a base variable plus a bounded suffix of
.attr / constant-[index] steps (depth cap STEPS_CAP = 2):
x,x.a,x.a.b,d["k"]are tracked precisely.- At the cap a deeper path collapses to its prefix and over-approximates: a read of any sibling under the collapsed prefix sees the taint. This biases toward false positives, never false negatives (P5-safe), and deliberately diverges from tools that drop deep taint.
- A dynamic subscript
x[i](non-constanti) taints the whole containerx(collapse to base) on write, and reads from a tainted container conservatively.
The environment (TaintEnv) is an immutable AccessPath → frozenset[TaintLabel]
map. Every operation returns a new env:
- assign — kill the target path (and its extensions), then bind the RHS labels
(kill-then-reassign; reassigning
xclears priorx.a). - augment (
x += t) — union the RHS taint with the target's existing taint. - sanitize — remove one spec's labels at a path (and its extensions), one sided (only on the path where the sanitizer demonstrably runs).
- join — per-path union of label sets (see §4).
For bounded state and deterministic witnesses, the env keeps a single best
provenance per (access_path, spec_id) (shortest witness chain, then
lexicographically smallest). This also gives the lattice finite height, so the
forward dataflow always reaches a fixpoint.
3. Propagation (expr_taint)
expr_taint(expr, env, ctx) returns the labels flowing out of an expression. It
is detector-agnostic — every rule applies to every spec equally (P4):
- Sources. Any sub-expression matching a spec's
sourcepattern introduces a fresh label for that spec. Because source detection is part ofexpr_taint, a source nested directly in a sink (os.system(input())) still seeds taint, and one site can seed several specs (input()feeds both os-command and sql). - Names / attributes / subscripts — look up the access path in the env.
- String-shaped operators —
+,%,*carry taint from either operand; f-strings, containers, comprehensions, andand/or/conditional value-unions carry the taint of their values. The condition of anif-expression is not propagated (implicit / control-dependence flow is out of scope — P7). strmethods (.strip(),.format(),.replace(), …) — a built-in default carries receiver/argument taint to the return value.- Calls — dispatched as follows:
- A sanitizer match cleans that spec's taint from the return value.
- Matching DSL propagators move taint per their
flow(from → to). - An in-file summary (§5) propagates param/self/source → return and emits interprocedural sink findings.
- An unknown external callee falls back to
any-arg → return(conservative pass-through — FP-biased, P5-safe, documented P7).
4. Intraprocedural dataflow
For each function the engine runs a flow-sensitive forward worklist over the
frontend's per-function CFG (IRBlock + successors):
- a block's in-env is the union of its predecessors' out-envs;
- statements transfer the env (assign / augment / sanitize);
- joins union, never intersect — the load-bearing P5 rule. Code sanitized on
only one branch of an
ifis still flagged, because the other branch falls through to the join carrying live taint. - loops iterate to a bounded fixpoint (monotone lattice + a
FIXPOINT_CAPsafety net) so analysis always terminates.
Sink emission. For each sink Pattern that matches a call, the engine checks
the restricted positional argument indices (Pattern.args, receiver-excluded). If
a checked argument carries an unsanitized label for that spec and the when
constraints hold, it emits a Finding whose witness is the label's provenance
chain plus a final SINK step. Severity / CWE / message / id are copied from
the matched spec; the engine invents nothing.
5. Interprocedural summaries (TITO)
Intra-file interprocedural reach is achieved with transfer-input/transfer-output (TITO) function summaries, not inlining. The engine:
- builds the in-file call graph by matching
IRCall.callee_pathagainst in-file function qualnames (bare names resolve to top-level functions; method / nested resolution is approximate — P7); - condenses it into SCCs (deterministic Tarjan over sorted qualnames) and processes them in reverse-topological order (callees before callers);
- summarizes each function by analyzing it once with each formal parameter seeded
as a symbolic marker label, harvesting the flows that reach the return value
or a sink:
param_i → return,self → return, in-bodysource → return,param_i → sink S,self → sink S, in-bodysource → sink S, each carrying a witness fragment for splicing;
- solves cyclic SCCs (recursion / mutual recursion) with a bounded monotone
worklist fixpoint (
SUMMARY_FIXPOINT_CAP) — flows only ever grow, so it terminates.
Application + witness splicing. At a call site, a caller-side label feeding a
param_i → sink flow emits a finding whose witness is:
[caller chain: source → … → arg] + [call-site PROPAGATOR: arg enters param] + [callee fragment ending in SINK]
A param_i → return flow continues the label (with the splice) into the caller's
value, so chained wrappers compose.
6. Determinism (P3)
Output is byte-identical across runs, machines, and spec-input order:
- specs are iterated in sorted
idorder; functions by qualname; blocks by index; summary flows by a total key. Nodict/setiteration order leaks out. - Witness selection keeps the canonical chain per
(spec, sink, source): shortest, then the lexicographically smallest tuple of(role, file, line, column, end_line, end_column). - Dedup key:
(detector_id, sink location, source-start location). This also collapses a sink matched by two overlapping patterns (e.g.*.cursor.executeand*.execute) into one finding. - Fingerprint (
Finding.fingerprint): asha256hex digest overdetector_id,cwe, the sink location, and the ordered witness step tuples — derived only from field values (noid()/hash()/PYTHONHASHSEED). - Final sort key ends in the fingerprint, making the order total even when two findings share a location and detector id.
7. Honest scope (P7)
The OSS engine is single-language (Python) and intra-file. Known limitations, documented rather than hidden:
- No cross-file / whole-program analysis. Summaries are intra-file only.
- Method call resolution is approximate. A call's
callee_pathis matched to an in-file function by its last segment (obj.run→run), so a tainted value passed to a non-self positional parameter of a method (obj.m(t)wheretbinds the second formal) can be missed because the receiver shifts the argument indices — a false negative, P5-safe. A same-named top-level function may also be applied (a false positive — acceptable noise). Tainted receivers (self) resolve correctly. - Alias through mutation (
a = b; b.x = t; a.x…), dynamic subscripts, and dynamic /*imports are best-effort or out of scope; taint is tracked per access path, not per heap object. - Implicit / control-dependence flows (tainting via a branch condition) are out of scope.
when: {keyword: …}is literal-equality only —shell=<truthy variable>is a false negative.argsis positional-only — a dangerous value passed by keyword is not covered by a positionalargsrestriction (model such sinks withwhen).- External-callee fallback over-taints benign wrappers (FP-biased) by design; add a DSL propagator/sanitizer to refine.
The one-sidedness guarantee covers sanitizers (a missing sanitizer is noise, a false positive — never a silently-suppressed real vulnerability), not overall soundness.