Probes & sweeps

steerkit.probe.Probe dataclass

Multi-direction linear probe at a single layer.

directions holds one unit-normalized [d_model] vector per probe family. metrics holds named scalars (e.g. auc_test_logistic, cohens_d_logistic). default_method chooses which direction Probe.steer() uses by default.

direction property

The default-method direction tensor, ready for arithmetic.

auc property

Held-out logistic AUC if available, else train AUC, else NaN.

fit_all(activations, model, *, hook_site='resid_post', test_fraction=0.2, seed=42, default_method='logistic') classmethod

Fit one Probe per layer with all three candidate directions + cheap-tier metrics.

For each layer the training pipeline is

1) split pairs into train/test (test_fraction; pair-level split, no leakage).
2) fit logistic regression on train activations; store coef + AUC train/test.
3) compute difference-of-means direction on train; score AUC train/test via cosine.
4) fit LDA with Ledoit-Wolf shrinkage on train; score AUC train/test via decision_function.
5) compute Cohen's d on the held-out logistic decision-function scores.
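The split, direction, and scoring steps can be sketched in plain Python. This is an illustration under stated assumptions, not steerkit's implementation: the data is synthetic, only the difference-of-means family is fitted, and Cohen's d is computed on its cosine scores as a stand-in for the logistic decision-function scores. All helper names (dot, unit, mean_vec, cosine, auc, cohens_d) are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def auc(pos, neg):
    # Probability that a random positive outscores a random negative (ties count half).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cohens_d(pos, neg):
    # Standardized mean difference using the pooled sample standard deviation.
    mp, mn = sum(pos) / len(pos), sum(neg) / len(neg)
    vp = sum((x - mp) ** 2 for x in pos) / (len(pos) - 1)
    vn = sum((x - mn) ** 2 for x in neg) / (len(neg) - 1)
    pooled = math.sqrt(((len(pos) - 1) * vp + (len(neg) - 1) * vn)
                       / (len(pos) + len(neg) - 2))
    return (mp - mn) / pooled

# Synthetic stand-in for one layer: 40 (positive, negative) pairs in 4-D,
# where the concept shifts the positive side along axis 0.
rng = random.Random(42)
pairs = []
for _ in range(40):
    noise_pos = [rng.gauss(0, 1) for _ in range(4)]
    noise_neg = [rng.gauss(0, 1) for _ in range(4)]
    pairs.append(([noise_pos[0] + 5.0] + noise_pos[1:], noise_neg))

# Step 1: pair-level split, so both halves of a pair land on the same side.
test_fraction = 0.2
order = list(range(len(pairs)))
rng.shuffle(order)
n_test = round(len(pairs) * test_fraction)
test_idx, train_idx = order[:n_test], order[n_test:]

# Step 3: difference-of-means direction, fitted on train pairs only.
train_pos = [pairs[i][0] for i in train_idx]
train_neg = [pairs[i][1] for i in train_idx]
direction = unit([p - n for p, n in zip(mean_vec(train_pos), mean_vec(train_neg))])

# Score held-out activations by cosine similarity to the direction.
test_pos = [cosine(pairs[i][0], direction) for i in test_idx]
test_neg = [cosine(pairs[i][1], direction) for i in test_idx]
auc_test = auc(test_pos, test_neg)
d_test = cohens_d(test_pos, test_neg)
```

Splitting on pair indices rather than on individual responses is what keeps a pair's positive and negative halves from ending up on opposite sides of the split.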

best_layer(probes, by='auc_test_logistic') staticmethod

Pick the layer that maximizes the given metric. Falls back to a cheap-tier alternative if the requested metric isn't present (e.g. when test_fraction=0).
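A minimal sketch of metric-based layer selection with a fallback. The fallback metric name auc_train_logistic and the helper pick_best_layer are assumptions made for illustration; the actual fallback order used by best_layer is not documented here.

```python
def pick_best_layer(metrics_by_layer, by="auc_test_logistic",
                    fallbacks=("auc_train_logistic",)):
    """Return (layer, metric_used), trying `by` first and then each fallback."""
    for name in (by, *fallbacks):
        scored = {layer: m[name] for layer, m in metrics_by_layer.items() if name in m}
        if scored:
            return max(scored, key=scored.get), name
    raise KeyError(f"none of {(by, *fallbacks)} found in any layer's metrics")

# Layer 4 overfits: best train AUC, but layer 8 wins on held-out AUC.
with_test = {
    4: {"auc_test_logistic": 0.81, "auc_train_logistic": 0.99},
    8: {"auc_test_logistic": 0.93, "auc_train_logistic": 0.97},
}
# No test split was kept, so only train metrics exist.
no_test = {
    4: {"auc_train_logistic": 0.90},
    8: {"auc_train_logistic": 0.97},
}

best_with_test = pick_best_layer(with_test)
best_no_test = pick_best_layer(no_test)
```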

get_direction(method=None)

Return the unit direction for the given method (defaults to default_method).

score_tokens(model, prompt, response=None, *, method=None, include_prompt=False)

Project every token's residual-stream activation at this probe's layer onto this probe's direction. Returns a TokenScores with token strings aligned to scalar scores, one per position.

This is the interpretability complement to steer(): where steer() uses the direction to push generation, score_tokens() measures where in a sequence the direction is most active. Useful for asking questions like "which tokens in this refusal actually carry the refusal signal?" or "did my steering hook push activations the way I expected?".

Parameters:

model (ModelHandle, required): a loaded ModelHandle (any model; the probe's model_id is checked only at steering time, not here).

prompt (str, required): the user prompt to score (or to prepend to response).

response (str | None, default None): optional assistant response to score. With the default None, only the chat-formatted prompt is scored.

method (str | None, default None): which probe-family direction to project onto. Defaults to default_method.

include_prompt (bool, default False): when True, scores are returned for every token in the full chat-formatted sequence, with response_start indicating where the response begins. When False and response is supplied, prompt tokens are sliced off so you get only the response-side scores.

Returns:

TokenScores: TokenScores(tokens, scores, layer, method, response_start). Call .plot() on it for a heatmap-style visualization.

Note: scores are raw direction · activation projections — the sign and magnitude are interpretable within one sequence (which token fires hardest) but not across sequences without calibration. The logistic-method bias is omitted because it shifts all positions equally and does not change the relative ranking that this view is for.
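The projection and the include_prompt slicing can be sketched as follows. This is a toy illustration with hand-written 2-D activations; score_positions is a hypothetical helper, not part of steerkit, and a real call would read activations from the model at the probe's layer.

```python
def score_positions(tokens, activations, direction, response_start, include_prompt=False):
    """Project each position's activation onto `direction` (raw dot product, no bias)."""
    scores = [sum(a * d for a, d in zip(act, direction)) for act in activations]
    if include_prompt:
        return tokens, scores, response_start
    # Slice off the prompt side; the response then starts at index 0.
    return tokens[response_start:], scores[response_start:], 0

tokens = ["Hi", "I", " cannot", " help"]
activations = [[0.2, 0.3], [0.1, 0.0], [2.0, 1.0], [0.5, -0.5]]
direction = [1.0, 0.0]  # unit-norm direction along axis 0

resp_tokens, resp_scores, resp_start = score_positions(
    tokens, activations, direction, response_start=1)
full_tokens, full_scores, full_start = score_positions(
    tokens, activations, direction, response_start=1, include_prompt=True)

# Token with the largest projection inside the response.
hottest = resp_tokens[resp_scores.index(max(resp_scores))]
```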

steer(model, prompt, alpha=None, *, method=None, op='addition', target=None, gamma=None, max_new_tokens=60, temperature=0.0)

Generate a steered completion. op selects one of the four interventions:

  • addition (default): act ← act + α·v. alpha=None uses the calibrated auto_alpha, else 2.0.
  • projection: act ← act − (act·v̂)v̂. Ablates the concept component; alpha/target/gamma are ignored.
  • clamp: act ← act + (target − act·v̂)v̂. Requires target. Forces the concept's projection to a fixed value.
  • multiplicative: act ← act + (γ−1)(act·v̂)v̂. Requires gamma. Scales the existing component along the direction.
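The four update rules can be checked on a single activation vector. This is a sketch of the formulas only, assuming a plain list in place of a residual-stream tensor; steer_vector is a hypothetical helper and does not install any model hook.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer_vector(act, v, op="addition", alpha=2.0, target=None, gamma=None):
    """Apply one intervention to a single activation vector."""
    norm = math.sqrt(dot(v, v))
    v_hat = [x / norm for x in v]  # stored directions are already unit-norm; this is defensive
    proj = dot(act, v_hat)         # current component along the direction
    if op == "addition":
        coeff = alpha
    elif op == "projection":
        coeff = -proj              # remove the component entirely
    elif op == "clamp":
        if target is None:
            raise ValueError("clamp requires target")
        coeff = target - proj      # move the component to exactly `target`
    elif op == "multiplicative":
        if gamma is None:
            raise ValueError("multiplicative requires gamma")
        coeff = (gamma - 1.0) * proj  # scale the component by gamma
    else:
        raise ValueError(f"unknown op: {op}")
    return [a + coeff * x for a, x in zip(act, v_hat)]

act = [1.0, 2.0]
v = [0.6, 0.8]  # unit vector; act·v = 2.2

added = steer_vector(act, v, op="addition", alpha=2.0)
ablated = steer_vector(act, v, op="projection")
clamped = steer_vector(act, v, op="clamp", target=5.0)
amplified = steer_vector(act, v, op="multiplicative", gamma=2.0)
```

After each call, the component along v is 4.2 (addition), 0 (projection), 5.0 (clamp), and 4.4 (multiplicative); the part of act orthogonal to v is unchanged in every case.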

Pass method to override default_method (which probe-family direction).

ablate(model, prompt, **kwargs)

Convenience: steer with op='projection' (remove the concept component).

clamp(model, prompt, target, **kwargs)

Convenience: steer with op='clamp' at the given target projection value.

amplify(model, prompt, gamma, **kwargs)

Convenience: steer with op='multiplicative' at the given gamma scaling factor.

plot_logit_lens(model, **kwargs)

Convenience: render the steering direction projected through the unembed.

predict_at_mask(model, text, *, top_k=10, alpha=None, method=None, op='addition', target=None, gamma=None)

Run a single forward pass with the steering hook on, then read the top-K vocabulary predictions at every [MASK] token in text.

This is the encoder analog of Probe.steer(...) — encoder models don't autoregressively generate, but they expose token-level logits at each position. For a sentence like "I think this movie is [MASK].", this returns the top-K most-probable fillers for the mask, with the steering direction applied to the residual stream at the probe's layer.

Pass alpha=0.0 for the unsteered baseline (the hook is still installed but contributes nothing) — the typical use is to call this once with alpha=0.0 and once at the calibrated auto_alpha and compare the resulting top-K distributions side-by-side.

Returns a {mask_position: [(token_string, probability), ...]} dict. Raises ValueError if the input contains no [MASK] tokens.

export_gguf(path, *, method=None, scale=1.0)

Convenience: export this single-layer Probe to llama.cpp gguf format.

report(*, model=None, activations=None, per_layer=None, out=None, title=None)

Render a one-page HTML report. Returns the HTML string; if out is set, also writes it to disk and returns the path-as-string.

steerkit.probe.TokenScores dataclass

Per-token probe scores along a single sequence.

Built by Probe.score_tokens(...). scores[i] is the projection of the token-i residual-stream activation (at the probe's layer) onto the probe's direction. Higher values mean the direction is more active at that token; the sign matches positive vs. negative side of the probe.

Attributes:

tokens (list[str]): decoded token strings, one per position scored.

scores (Tensor): 1-D float tensor with one entry per token.

layer (int): the probe's layer index (so plots can label themselves).

method (str): which probe-family direction was used (logistic / diff_of_means / mass_mean).

response_start (int): index into tokens where the assistant response begins. Always 0 when score_tokens was called with include_prompt=False (the prompt portion has been sliced off); otherwise marks the user-prompt → assistant-response boundary.

plot(**kwargs)

Convenience: render with steerkit.viz.plot_token_scores.

steerkit.probe.MultinomialProbe dataclass

Multi-class linear classifier across the concepts of a mutually_exclusive ConceptGroup.

Diagnostic, not for steering: useful for "which concept is this activation expressing?" and for cross-concept similarity heatmaps (rows of weights are direction vectors, one per concept). Steering is still done with the per-concept binary Probes.

fit_at_layer(activations_by_concept, layer, model, *, hook_site='resid_post', test_fraction=0.2, seed=42) classmethod

Fit a multinomial probe at one chosen layer, using each concept's positive activations as its class. Returns a single MultinomialProbe with held-out accuracy.

fit_best_layer(activations_by_concept, model, *, hook_site='resid_post', test_fraction=0.2, seed=42) classmethod

Fit a multinomial probe at every layer; return the one with best test accuracy (or train accuracy if no test split was kept).

similarity_matrix()

Cosine-similarity matrix between class direction vectors (rows of weights). Returns a [n_classes, n_classes] tensor.
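For reference, a cosine-similarity matrix over the rows of a weight matrix can be computed as below. This is a pure-Python sketch on nested lists; the method itself returns a tensor, and cosine_similarity_matrix is a hypothetical name.

```python
import math

def cosine_similarity_matrix(weights):
    """Return an [n_classes][n_classes] nested list of row-to-row cosine similarities."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in weights]
    return [
        [sum(a * b for a, b in zip(ri, rj)) / (ni * nj)
         for rj, nj in zip(weights, norms)]
        for ri, ni in zip(weights, norms)
    ]

# Three class directions: the first two are orthogonal, the third is anti-parallel to the first.
weights = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
sim = cosine_similarity_matrix(weights)
```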

plot_similarity(**kwargs)

Convenience: render the cross-class similarity heatmap.

steerkit.sweep.sweep(group, model, *, hook_site='resid_post', test_fraction=0.2, seed=42, cache_dir=None, select_by='auc_test_logistic', with_steering_eval=False, teacher=None, eval_top_k=5, eval_alpha=4.0)

Run the full Phase-3+4 sweep on a ConceptGroup.

Steps

1) extract_group_activations per concept (Zarr-cached if cache_dir set).
2) Probe.fit_all per concept + select best layer by select_by.
3) For mutually_exclusive groups with ≥2 concepts: fit a MultinomialProbe at the layer with highest held-out multinomial accuracy.
4) If with_steering_eval=True and teacher is provided: call the LLM-judge expensive tier on each concept's top-K layers and attach metrics["steering_effect"].

steerkit.sweep.GroupFit dataclass

Result of sweep(group, model). Indexable by concept name to get the chosen Probe.

plot_layer_selection(concept_name, **kwargs)

Render the layer-selection dual-curve for one concept (requires per_concept).

plot_similarity(**kwargs)

Render the cross-concept similarity heatmap. Uses the multinomial probe if present, otherwise falls back to the per-concept best probes' steering directions.

report(*, model=None, out=None, title=None)

Render a one-page HTML report for this GroupFit. Returns the HTML string; if out is set, also writes it to disk and returns the path-as-string.

window(concept_name, *, k=1)

Build a window-of-(2k+1) multi-layer composite around the chosen best layer for concept_name. Requires the full per_concept fits (i.e. not loaded from disk).

save(dir_path)

Save the chosen Probe per concept + multinomial (if present) + the ConceptGroup snapshot (without contrast pairs) into a directory.

Layout

dir_path/
    group.json
    {concept_name}.probe.safetensors    (one per concept)
    multinomial.probe.safetensors       (only if mutex group)

load(dir_path) classmethod

Load a directory written by GroupFit.save.

steerkit.sweep.CompositeProbe dataclass

Multiple probes' steering vectors composed at inference time.

Each probe's direction is added at its own layer with its own per-probe weight (and per-probe alpha if desired). Probes targeting the same TL hook are combined into a single hook function.

export_gguf(path, *, method=None, scale=1.0)

Convenience: export this composite as a multi-layer gguf control vector.

steer(model, prompt, *, alphas=None, methods=None, max_new_tokens=60, temperature=0.0)

Generate a completion with all composed probes' steering vectors active.

alphas: per-probe alpha override; defaults to each probe's auto_alpha (or 2.0).

methods: per-probe direction method override.

steerkit.sweep.compose(probes, weights=None)

Compose multiple probes for simultaneous steering at inference time.

weights defaults to all 1.0 (equal contribution). Probes can come from different ConceptGroups — the design memory's "axes" composition.

steerkit.sweep.window(probes, center_layer, *, k=1)

Build a multi-layer steering composite over a window of layers around center_layer.

Parameters:

probes (dict[int, Probe], required): full per-layer dict from Probe.fit_all (or GroupFit.per_concept[name]).

center_layer (int, required): the chosen "best" layer; window is built around it.

k (int, default 1): half-window size. k=1 selects [center-1, center, center+1], the "window-of-3" mode the design memory commits to. k=0 collapses to the single best probe wrapped in a CompositeProbe.

The returned CompositeProbe has weights = [1/n]*n so each layer's contribution is scaled down proportionally. With each probe's auto_alpha left as the default α at steer time, the per-layer push is auto_alpha / n rather than auto_alpha, keeping the total perturbation roughly comparable to a single-layer steer.
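The layer selection and 1/n weighting can be sketched as below. Clipping the window to layers that actually exist is an assumption made for this sketch; how the library treats a window that runs past the first or last layer is not stated here. window_layers is a hypothetical helper.

```python
def window_layers(available_layers, center_layer, k=1):
    """Return (layers, weights) for a window of up to 2k+1 layers around center_layer."""
    available = set(available_layers)
    layers = [layer for layer in range(center_layer - k, center_layer + k + 1)
              if layer in available]  # assumption: silently clip at the edges
    weights = [1.0 / len(layers)] * len(layers)
    return layers, weights

layers, weights = window_layers(range(12), center_layer=6, k=1)
single_layers, single_weights = window_layers(range(12), center_layer=6, k=0)

# With the same auto_alpha on every probe, each layer gets auto_alpha / n.
auto_alpha = 3.0
per_layer_push = [auto_alpha * w for w in weights]
```

Here per_layer_push is 1.0 for each of the three layers, so the pushes sum to the single-layer auto_alpha of 3.0.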

steerkit.calibrate.calibrate_alpha(probe, model, *, prompts=None, candidates=DEFAULT_ALPHA_CANDIDATES, perplexity_ratio_max=1.5, max_new_tokens=30, method=None, attach=True)

Pick the largest α whose steered-output perplexity stays within perplexity_ratio_max times the unsteered perplexity.

Parameters:

probe (Probe, required): a fitted Probe to calibrate.

model (ModelHandle, required): the model to steer.

prompts (list[str] | None, default None): small calibration set (default: 4 generic prompts).

candidates (list[float] | tuple[float, ...], default DEFAULT_ALPHA_CANDIDATES): α values to sweep. Any order is accepted; the values are sorted internally.

perplexity_ratio_max (float, default 1.5): ceiling on steered_ppl / baseline_ppl.

max_new_tokens (int, default 30): response length per generation.

method (str | None, default None): which probe direction to use (default: probe.default_method).

attach (bool, default True): if True, set probe.auto_alpha = chosen value before returning.

Returns:

tuple[float, dict[float, float]]: (best_alpha, ratios), the best alpha and the full {alpha: perplexity_ratio} mapping. best_alpha is 0.0 if no candidate satisfies the constraint (steering destroys coherence for every α tried).
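The selection rule can be sketched on made-up token log-probabilities. The numbers are illustrative only, the helpers perplexity and choose_alpha are hypothetical, and treating the ceiling as inclusive (ratio ≤ perplexity_ratio_max) is an assumption.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def choose_alpha(ratios, perplexity_ratio_max=1.5):
    """Largest alpha whose perplexity ratio is within the ceiling, else 0.0."""
    passing = [alpha for alpha, ratio in ratios.items() if ratio <= perplexity_ratio_max]
    return max(passing) if passing else 0.0

# Illustrative per-token log-probs for an unsteered and four steered generations.
baseline_ppl = perplexity([-1.0, -1.2, -0.8])
steered_logprobs = {
    1.0: [-1.0, -1.3, -0.9],
    2.0: [-1.2, -1.4, -1.0],
    4.0: [-1.3, -1.6, -1.2],
    8.0: [-2.0, -2.5, -1.8],
}
ratios = {alpha: perplexity(lp) / baseline_ppl for alpha, lp in steered_logprobs.items()}

best_alpha = choose_alpha(ratios)                    # alpha=8.0 exceeds the 1.5 ceiling
no_alpha = choose_alpha({1.0: 2.0, 2.0: 3.0})        # nothing passes
```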

steerkit.eval.evaluate_steering_effect(probes, model, teacher, *, concept_description, eval_prompts=None, top_k=5, by='auc_test_logistic', alpha=4.0, max_new_tokens=60, method=None, on_failure=None, attach=True)

Score steering effect size for the top-K probes by a cheap-tier metric.

Parameters:

probes (dict[int, Probe], required): a dict of layer -> Probe (e.g. from Probe.fit_all).

model (ModelHandle, required): the model to steer.

teacher (TeacherModel, required): the TeacherModel to use as judge (often the same as the generator).

concept_description (str, required): short description of the concept (e.g. "refusal", "verbose, expansive language").

eval_prompts (list[str] | None, default None): list of evaluation prompts; defaults to a small bundled set.

top_k (int, default 5): how many top layers (ranked by the by metric) to evaluate. Use a number >= len(probes) to evaluate all.

by (str, default 'auc_test_logistic'): cheap-tier metric for narrowing. Falls back through train metrics if missing.

alpha (float, default 4.0): steering strength during evaluation.

max_new_tokens (int, default 60): response length per generation.

method (str | None, default None): which probe direction to use (defaults to each probe's default_method).

on_failure (Callable[[int, str, str], None] | None, default None): optional callback (layer, response, raw_judge_text) when rating parsing fails.

attach (bool, default True): if True, write metrics["steering_effect"] and metrics["steering_effect_n"] on each evaluated probe in place.

Returns a dict mapping layer index -> mean steering-effect score. Layers not in the top-K are not evaluated and not present in the return.
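The narrowing step and the per-layer averaging can be sketched as below. The judge ratings and their scale are invented for illustration, and dropping unparseable ratings (shown as None) from the mean is an assumption; the helpers top_k_layers and mean_effect are hypothetical.

```python
def top_k_layers(metric_by_layer, top_k=5):
    """Layers ranked by the cheap-tier metric, best first, truncated to top_k."""
    ranked = sorted(metric_by_layer, key=metric_by_layer.get, reverse=True)
    return ranked[:top_k]

def mean_effect(ratings):
    """Mean over ratings that parsed, plus how many parsed (None = parse failure)."""
    parsed = [r for r in ratings if r is not None]
    mean = sum(parsed) / len(parsed) if parsed else float("nan")
    return mean, len(parsed)

auc_by_layer = {4: 0.81, 8: 0.93, 12: 0.97, 16: 0.95, 20: 0.88}
chosen = top_k_layers(auc_by_layer, top_k=3)

# Hypothetical judge ratings per evaluated layer, one per eval prompt.
judge_ratings = {12: [4, 5, None, 3], 16: [3, 3, 4, 2], 8: [2, None, None, 1]}
effects = {layer: mean_effect(judge_ratings[layer]) for layer in chosen}
```

Layers 4 and 20 fall outside the top 3, so they are never steered or judged and do not appear in effects.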

steerkit.extract.extract_activations(pairs, model, hook_site='resid_post', *, include_boundaries=True, cache_dir=None, batch_size=8, pooling='last')

Extract pooled activations for each (positive, negative) response in each pair, per layer.

Returns a dict mapping layer index -> tensor of shape [n_pairs, 2, d_model]. Along the second axis, index 0 is the positive response and index 1 is the negative response.

With include_boundaries=True (default), the dict also contains entries at:

  • layer = -1 (embedding output, TL hook 'hook_embed')
  • layer = n_layers (final layernorm output, TL hook 'ln_final.hook_normalized')

These let the cheap-tier sweep span [embed → 0..N-1 → final_ln] in one pass.

pooling selects how the per-token residual stream is reduced to a single [d_model] vector per (pair, response):

  • "last" (default): final real-token position. Standard for decoder-only autoregressive LMs (Qwen / Llama / Gemma / Pythia / GPT-2): causal attention means the last token has attended to everything before it, so it carries a "summary" of the response.
  • "mean": average across all real positions. Use this for encoder models (BERT, RoBERTa, DeBERTa, ...) — bidirectional attention means every position sees the whole input, so the last token has no special summary status. Mean-pooling matches BERT-style classification heads.
  • "max": element-wise max across real positions. Picks up punctate signals at unknown positions; less common but occasionally useful.
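The three pooling modes reduce a [seq_len, d_model] block to one [d_model] vector. A minimal sketch on nested lists, assuming right-padding and a known count of real tokens; pool is a hypothetical helper, not the library's function.

```python
def pool(token_acts, n_real, pooling="last"):
    """Reduce per-token activations to one vector, ignoring right-padding."""
    real = token_acts[:n_real]  # pad positions are sliced off before pooling
    if pooling == "last":
        return list(real[-1])
    if pooling == "mean":
        return [sum(col) / len(real) for col in zip(*real)]
    if pooling == "max":
        return [max(col) for col in zip(*real)]
    raise ValueError(f"unknown pooling: {pooling}")

# Three real tokens followed by one pad position that must not affect the result.
token_acts = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0], [9.0, 9.0]]

pooled_last = pool(token_acts, n_real=3, pooling="last")
pooled_mean = pool(token_acts, n_real=3, pooling="mean")
pooled_max = pool(token_acts, n_real=3, pooling="max")
```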

cache_dir: optional directory for a Zarr v3 activation cache. The cache key is derived from (model_id, hook_site, include_boundaries, pooling, pairs hash). On a cache hit we skip the model entirely and load tensors from disk.
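One way such a key could be derived is sketched below. The hashing scheme (SHA-256 over JSON), the 16-character truncation, the function name cache_key, and the model id are all assumptions for illustration; the library's actual key format is not documented here.

```python
import hashlib
import json

def cache_key(model_id, hook_site, include_boundaries, pooling, pairs):
    """Hypothetical key: hash the pairs first, then hash the full settings tuple."""
    pairs_hash = hashlib.sha256(json.dumps(pairs).encode("utf-8")).hexdigest()
    payload = json.dumps([model_id, hook_site, include_boundaries, pooling, pairs_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

pairs = [["Sure, here is how.", "I can't help with that."]]

key_a = cache_key("example/model", "resid_post", True, "last", pairs)
key_b = cache_key("example/model", "resid_post", True, "last", pairs)
key_mean = cache_key("example/model", "resid_post", True, "mean", pairs)
key_other = cache_key("example/model", "resid_post", True, "last", pairs + [["a", "b"]])
```

Because every input that changes the extracted tensors is part of the key, changing the pooling mode or the dataset produces a different key instead of silently reusing stale activations.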

batch_size: number of (pair, response) sequences run through the model in a single forward pass. Sequences are right-padded to max length per batch; pad positions are sliced off before pooling so they never affect mean/max. Set to 1 for a strictly sequential path (e.g. for memory-constrained big models). Default 8 is a reasonable speedup-to-memory tradeoff for ≤4B-parameter models on MPS.

steerkit.extract.extract_group_activations(group, model, hook_site='resid_post', *, include_boundaries=True, cache_dir=None, batch_size=8, pooling='last')

Extract activations for every concept in a ConceptGroup.

Returns a dict mapping concept_name -> {layer: tensor [n_pairs, 2, d_model]}. Each concept's pairs are passed through the same activation pipeline. The cache_dir is forwarded to per-concept extraction; cache keys differ per concept because the dataset hash differs, so each concept gets its own Zarr store and they're loaded/written independently.