Probes & sweeps¶

`steerkit.probe.Probe` `dataclass` ¶

Multi-direction linear probe at a single layer.

directions holds one unit-normalized [d_model] vector per probe family. metrics holds named scalars (e.g. auc_test_logistic, cohens_d_logistic). default_method chooses which direction Probe.steer() uses by default.

`direction` `property` ¶

The default-method direction tensor, ready for arithmetic.

`auc` `property` ¶

Held-out logistic AUC if available, else train AUC, else NaN.

`fit_all(activations, model, *, hook_site='resid_post', test_fraction=0.2, seed=42, default_method='logistic')` `classmethod` ¶

Fit one Probe per layer with all three candidate directions + cheap-tier metrics.

For each layer the training pipeline is:

1) split pairs into train/test (test_fraction; pair-level split, no leakage).
2) fit logistic regression on train activations; store coef + AUC train/test.
3) compute difference-of-means direction on train; score AUC train/test via cosine.
4) fit LDA with Ledoit-Wolf shrinkage on train; score AUC train/test via decision_function.
5) compute Cohen's d on the held-out logistic decision-function scores.
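Steps 1 and 3 can be illustrated without the library. The following is a minimal sketch in plain Python, not steerkit's implementation: `split_pairs` and `diff_of_means` are hypothetical helper names, and the toy data is made up. It shows why splitting at the pair level keeps both responses of a pair on the same side of the split, and how a unit-normalized difference-of-means direction is obtained from the training pairs.

```python
import math
import random

def split_pairs(n_pairs, test_fraction=0.2, seed=42):
    """Shuffle pair indices, then split. Both responses of a pair share one index,
    so a pair can never straddle the train/test boundary."""
    idx = list(range(n_pairs))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_pairs * test_fraction))
    return idx[n_test:], idx[:n_test]  # (train, test)

def diff_of_means(pairs):
    """pairs: list of (positive_vec, negative_vec). Returns a unit-norm direction."""
    d = len(pairs[0][0])
    mean_pos = [sum(p[0][j] for p in pairs) / len(pairs) for j in range(d)]
    mean_neg = [sum(p[1][j] for p in pairs) / len(pairs) for j in range(d)]
    diff = [a - b for a, b in zip(mean_pos, mean_neg)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

# Toy data (assumption): 10 pairs in 3 dimensions, positives shifted along axis 0.
rng = random.Random(0)
pairs = []
for _ in range(10):
    base = [rng.gauss(0, 0.1) for _ in range(3)]
    pairs.append(([base[0] + 1.0, base[1], base[2]], base))

train_idx, test_idx = split_pairs(len(pairs))
direction = diff_of_means([pairs[i] for i in train_idx])
```

The direction is fit on `train_idx` only; the pairs in `test_idx` are left for held-out scoring.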

`best_layer(probes, by='auc_test_logistic')` `staticmethod` ¶

Pick the layer that maximizes the given metric. Falls back to a cheap-tier alternative if the requested metric isn't present (e.g. when test_fraction=0).
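The fallback behaviour can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function operates on a plain `{layer: metrics_dict}` mapping rather than on Probe objects, and the fallback metric name `auc_train_logistic` is a guess at what a train-side metric might be called.

```python
import math

def pick_best_layer(metrics_by_layer, by="auc_test_logistic",
                    fallbacks=("auc_train_logistic",)):  # fallback name is assumed
    """Return the layer with the highest value of `by`; if no layer has a usable
    value for `by`, try each fallback metric in turn."""
    for key in (by, *fallbacks):
        scored = {
            layer: m[key]
            for layer, m in metrics_by_layer.items()
            if key in m and not math.isnan(m[key])
        }
        if scored:
            return max(scored, key=scored.get)
    raise ValueError("no usable metric found on any layer")

with_test = {
    1: {"auc_train_logistic": 0.99, "auc_test_logistic": 0.70},
    2: {"auc_train_logistic": 0.95, "auc_test_logistic": 0.90},
}
# Simulates test_fraction=0: only train-side metrics exist.
train_only = {
    1: {"auc_train_logistic": 0.99},
    2: {"auc_train_logistic": 0.95},
}
```

With `with_test` the held-out metric decides (layer 2); with `train_only` the selection falls back to the train metric (layer 1).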

`get_direction(method=None)` ¶

Return the unit direction for the given method (defaults to default_method).

`score_tokens(model, prompt, response=None, *, method=None, include_prompt=False)` ¶

Project every token's residual-stream activation at this probe's layer onto this probe's direction. Returns a TokenScores with token strings aligned to scalar scores, one per position.

This is the interpretability complement to steer(): where steer() uses the direction to push generation, score_tokens() measures where in a sequence the direction is most active. Useful for asking questions like "which tokens in this refusal actually carry the refusal signal?" or "did my steering hook push activations the way I expected?".

Parameters:

Name	Type	Description	Default
`model`	`ModelHandle`	a loaded ModelHandle (any model — the probe's model_id is checked only at steering time, not here).	required
`prompt`	`str`	the user prompt to score (or prepend to `response`).	required
`response`	`str \| None`	optional assistant response to score. With the default `None`, only the chat-formatted prompt is scored.	`None`
`method`	`str \| None`	which probe-family direction to project onto. Defaults to `default_method`.	`None`
`include_prompt`	`bool`	when True, scores are returned for every token in the full chat-formatted sequence, with `response_start` indicating where the response begins. When False (default) and `response` is supplied, prompt tokens are sliced off so you get only the response-side scores.	`False`

Returns:

Type	Description
`TokenScores`	TokenScores(tokens, scores, layer, method, response_start). Call `.plot()` for a heatmap-style visualization.

Note: scores are raw direction · activation projections — the sign and magnitude are interpretable within one sequence (which token fires hardest) but not across sequences without calibration. The logistic-method bias is omitted because it shifts all positions equally and does not change the relative ranking that this view is for.
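The projection itself is a per-token dot product. Here is a small sketch in plain Python with made-up activations; `project_tokens` and `select_scores` are hypothetical names and the real method works on model tensors. It also checks the point made in the note: adding a constant bias to every position leaves the ranking of tokens unchanged.

```python
def project_tokens(activations, direction):
    """One scalar per token: the dot product of its activation with the direction."""
    return [sum(a * d for a, d in zip(vec, direction)) for vec in activations]

def select_scores(tokens, scores, response_start, include_prompt=False):
    """Mimic the include_prompt switch: keep everything, or slice off the prompt."""
    if include_prompt:
        return tokens, scores, response_start
    return tokens[response_start:], scores[response_start:], 0

# Toy sequence (assumption): 3 prompt tokens followed by 3 response tokens.
tokens = ["User", ":", "hi", "Sorry", ",", "no"]
activations = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.1],
               [2.0, 0.5], [0.2, 0.0], [1.5, -0.3]]
direction = [1.0, 0.0]

scores = project_tokens(activations, direction)
resp_tokens, resp_scores, start = select_scores(tokens, scores, response_start=3)

# A constant bias shifts every score equally, so the top token stays the same.
biased = [s + 0.7 for s in resp_scores]
top_raw = resp_tokens[resp_scores.index(max(resp_scores))]
top_biased = resp_tokens[biased.index(max(biased))]
```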

`steer(model, prompt, alpha=None, *, method=None, op='addition', target=None, gamma=None, max_new_tokens=60, temperature=0.0)` ¶

Generate a steered completion. op selects one of the four interventions:

- addition (default): act ← act + α·v. With alpha=None, the calibrated auto_alpha is used if set, otherwise 2.0.
- projection: act ← act − (act·v̂)v̂. Ablates the concept component; alpha/target/gamma are ignored.
- clamp: act ← act + (target − act·v̂)v̂. Requires target. Forces the concept's projection to a fixed value.
- multiplicative: act ← act + (γ−1)(act·v̂)v̂. Requires gamma. Scales the existing component along the direction.
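The four update rules are easy to check on a single activation vector. The sketch below applies each formula exactly as written above, in plain Python; `intervene` is a hypothetical helper and not the hook steerkit installs.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def intervene(act, v, op="addition", alpha=2.0, target=None, gamma=None):
    """Apply one of the four interventions to a single activation vector."""
    v_hat = unit(v)
    proj = dot(act, v_hat)  # act . v_hat, the current component along the direction
    if op == "addition":
        delta = [alpha * x for x in v]
    elif op == "projection":
        delta = [-proj * x for x in v_hat]
    elif op == "clamp":
        if target is None:
            raise ValueError("clamp requires target")
        delta = [(target - proj) * x for x in v_hat]
    elif op == "multiplicative":
        if gamma is None:
            raise ValueError("multiplicative requires gamma")
        delta = [(gamma - 1.0) * proj * x for x in v_hat]
    else:
        raise ValueError(f"unknown op: {op}")
    return [a + d for a, d in zip(act, delta)]

act = [3.0, 4.0]
v = [1.0, 0.0]  # already unit-norm, as probe directions are

added = intervene(act, v, "addition", alpha=2.0)
ablated = intervene(act, v, "projection")
clamped = intervene(act, v, "clamp", target=1.0)
doubled = intervene(act, v, "multiplicative", gamma=2.0)
```

Note that multiplicative with gamma=0 reduces to projection, and gamma=1 leaves the activation unchanged.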

Pass method to override default_method (which probe-family direction).

`ablate(model, prompt, **kwargs)` ¶

Convenience: steer with op='projection' (remove the concept component).

`clamp(model, prompt, target, **kwargs)` ¶

Convenience: steer with op='clamp' at the given target projection value.

`amplify(model, prompt, gamma, **kwargs)` ¶

Convenience: steer with op='multiplicative' at the given gamma scaling factor.

`plot_logit_lens(model, **kwargs)` ¶

Convenience: render the steering direction projected through the unembed.

`predict_at_mask(model, text, *, top_k=10, alpha=None, method=None, op='addition', target=None, gamma=None)` ¶

Run a single forward pass with the steering hook on, then read the top-K vocabulary predictions at every [MASK] token in text.

This is the encoder analog of Probe.steer(...) — encoder models don't autoregressively generate, but they expose token-level logits at each position. For a sentence like "I think this movie is [MASK].", this returns the top-K most-probable fillers for the mask, with the steering direction applied to the residual stream at the probe's layer.

Pass alpha=0.0 for the unsteered baseline (the hook is still installed but contributes nothing) — the typical use is to call this once with alpha=0.0 and once at the calibrated auto_alpha and compare the resulting top-K distributions side-by-side.

Returns a {mask_position: [(token_string, probability), ...]} dict. Raises ValueError if the input contains no [MASK] tokens.
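For the side-by-side comparison, one option is to diff the two returned dicts. The helper below is a hypothetical sketch that only assumes the documented return shape, `{mask_position: [(token_string, probability), ...]}`. The example values are invented, and a token missing from one top-K list is treated as probability 0.0, which is an approximation.

```python
def compare_fillers(baseline, steered):
    """Per mask position, list (token, steered_prob - baseline_prob),
    sorted from the largest increase to the largest decrease."""
    out = {}
    for pos, base_list in baseline.items():
        base = dict(base_list)
        steer = dict(steered.get(pos, []))
        deltas = {t: steer.get(t, 0.0) - base.get(t, 0.0)
                  for t in set(base) | set(steer)}
        out[pos] = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
    return out

# Invented outputs for "I think this movie is [MASK]." at mask position 5.
baseline = {5: [("good", 0.40), ("bad", 0.30), ("great", 0.10)]}
steered = {5: [("great", 0.45), ("good", 0.35), ("fine", 0.05)]}

shifts = compare_fillers(baseline, steered)
```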

`export_gguf(path, *, method=None, scale=1.0)` ¶

Convenience: export this single-layer Probe to llama.cpp gguf format.

`report(*, model=None, activations=None, per_layer=None, out=None, title=None)` ¶

Render a one-page HTML report. Returns the HTML string; if out is set, also writes it to disk and returns the path-as-string.

`steerkit.probe.TokenScores` `dataclass` ¶

Per-token probe scores along a single sequence.

Built by Probe.score_tokens(...). scores[i] is the projection of the token-i residual-stream activation (at the probe's layer) onto the probe's direction. Higher values mean the direction is more active at that token; the sign matches positive vs. negative side of the probe.

Attributes:

Name	Type	Description
`tokens`	`list[str]`	decoded token strings, one per position scored.
`scores`	`Tensor`	1-D float tensor with one entry per token.
`layer`	`int`	the probe's layer index (so plots can label themselves).
`method`	`str`	which probe-family direction was used (logistic / diff_of_means / mass_mean).
`response_start`	`int`	index into `tokens` where the assistant response begins. Always 0 when `score_tokens` was called with `include_prompt=False` (the prompt portion has been sliced off); otherwise marks the user-prompt → assistant-response boundary.

`plot(**kwargs)` ¶

Convenience: render with steerkit.viz.plot_token_scores.

`steerkit.probe.MultinomialProbe` `dataclass` ¶

Multi-class linear classifier across the concepts of a mutually_exclusive ConceptGroup.

Diagnostic, not for steering: useful for "which concept is this activation expressing?" and for cross-concept similarity heatmaps (rows of weights are direction vectors, one per concept). Steering is still done with the per-concept binary Probes.

`fit_at_layer(activations_by_concept, layer, model, *, hook_site='resid_post', test_fraction=0.2, seed=42)` `classmethod` ¶

Fit a multinomial probe at one chosen layer, using each concept's positive activations as its class. Returns a single MultinomialProbe with held-out accuracy.

`fit_best_layer(activations_by_concept, model, *, hook_site='resid_post', test_fraction=0.2, seed=42)` `classmethod` ¶

Fit a multinomial probe at every layer; return the one with best test accuracy (or train accuracy if no test split was kept).

`similarity_matrix()` ¶

Cosine-similarity matrix between class direction vectors (rows of weights). Returns a [n_classes, n_classes] tensor.
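As an illustration of what the matrix contains, here is a plain-Python sketch of pairwise cosine similarity over the rows of a weight matrix. The toy `weights` are made up and the real method returns a tensor rather than nested lists.

```python
import math

def similarity_matrix(weights):
    """Cosine similarity between every pair of rows; result is n_classes x n_classes."""
    units = []
    for row in weights:
        norm = math.sqrt(sum(x * x for x in row))
        units.append([x / norm for x in row])
    return [[sum(a * b for a, b in zip(u, v)) for v in units] for u in units]

# Three toy class directions in 2-D (assumption).
weights = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
sim = similarity_matrix(weights)
```

The diagonal is 1 by construction, the matrix is symmetric, and row length does not matter because each row is normalized first.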

`plot_similarity(**kwargs)` ¶

Convenience: render the cross-class similarity heatmap.

`steerkit.sweep.sweep(group, model, *, hook_site='resid_post', test_fraction=0.2, seed=42, cache_dir=None, select_by='auc_test_logistic', with_steering_eval=False, teacher=None, eval_top_k=5, eval_alpha=4.0)` ¶

Run the full Phase-3+4 sweep on a ConceptGroup.

Steps

1) extract_group_activations per concept (Zarr-cached if cache_dir set).
2) Probe.fit_all per concept + select best layer by select_by.
3) For mutually_exclusive groups with ≥2 concepts: fit a MultinomialProbe at the layer with highest held-out multinomial accuracy.
4) If with_steering_eval=True and teacher is provided: call the LLM-judge expensive tier on each concept's top-K layers and attach metrics["steering_effect"].

`steerkit.sweep.GroupFit` `dataclass` ¶

Result of sweep(group, model). Indexable by concept name to get the chosen Probe.

`plot_layer_selection(concept_name, **kwargs)` ¶

Render the layer-selection dual-curve for one concept (requires per_concept).

`plot_similarity(**kwargs)` ¶

Render the cross-concept similarity heatmap. Uses the multinomial probe if present, otherwise falls back to the per-concept best probes' steering directions.

`report(*, model=None, out=None, title=None)` ¶

Render a one-page HTML report for this GroupFit. Returns the HTML string; if out is set, also writes it to disk and returns the path-as-string.

`window(concept_name, *, k=1)` ¶

Build a window-of-(2k+1) multi-layer composite around the chosen best layer for concept_name. Requires the full per_concept fits (i.e. not loaded from disk).

`save(dir_path)` ¶

Save the chosen Probe per concept + multinomial (if present) + the ConceptGroup snapshot (without contrast pairs) into a directory.

Layout

dir_path/
    group.json
    {concept_name}.probe.safetensors    (one per concept)
    multinomial.probe.safetensors       (only if mutex group)

`load(dir_path)` `classmethod` ¶

Load a directory written by GroupFit.save.

`steerkit.sweep.CompositeProbe` `dataclass` ¶

Multiple probes' steering vectors composed at inference time.

Each probe's direction is added at its own layer with its own per-probe weight (and per-probe alpha if desired). Probes targeting the same TL hook are combined into a single hook function.
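One way to picture the per-hook combination: every probe contributes weight × alpha × direction, and contributions that land on the same layer are summed into one vector. This is a sketch under that assumption; `combine_by_layer` is a hypothetical helper, and probes are reduced to `(layer, direction)` tuples.

```python
def combine_by_layer(probes, weights, alphas):
    """probes: list of (layer, direction). Returns {layer: summed steering vector}."""
    deltas = {}
    for (layer, direction), w, a in zip(probes, weights, alphas):
        acc = deltas.setdefault(layer, [0.0] * len(direction))
        for j, x in enumerate(direction):
            acc[j] += w * a * x
    return deltas

# Two probes share layer 5, one sits alone at layer 7 (toy values).
probes = [(5, [1.0, 0.0]), (5, [0.0, 1.0]), (7, [1.0, 0.0])]
weights = [0.5, 0.5, 1.0]
alphas = [2.0, 4.0, 3.0]

deltas = combine_by_layer(probes, weights, alphas)
```

Each layer then needs only one hook, which adds its summed vector to the activation.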

`export_gguf(path, *, method=None, scale=1.0)` ¶

Convenience: export this composite as a multi-layer gguf control vector.

`steer(model, prompt, *, alphas=None, methods=None, max_new_tokens=60, temperature=0.0)` ¶

Generate a completion with all composed probes' steering vectors active.

alphas: per-probe alpha override; defaults to each probe's auto_alpha (or 2.0). methods: per-probe direction method override.

`steerkit.sweep.compose(probes, weights=None)` ¶

Compose multiple probes for simultaneous steering at inference time.

weights defaults to all 1.0 (equal contribution). Probes can come from different ConceptGroups — the design memory's "axes" composition.

`steerkit.sweep.window(probes, center_layer, *, k=1)` ¶

Build a multi-layer steering composite over a window of layers around center_layer.

Parameters:

Name	Type	Description	Default
`probes`	`dict[int, Probe]`	full per-layer dict from `Probe.fit_all` (or `GroupFit.per_concept[name]`).	required
`center_layer`	`int`	the chosen "best" layer; window is built around it.	required
`k`	`int`	half-window size. k=1 (default) selects [center-1, center, center+1] — the "window-of-3" mode the design memory commits to. k=0 collapses to the single best probe wrapped in a CompositeProbe.	`1`

The returned CompositeProbe has weights = [1/n]*n so each layer's contribution is scaled down proportionally. With each probe's auto_alpha left as the default α at steer time, the per-layer push is auto_alpha / n rather than auto_alpha, keeping the total perturbation roughly comparable to a single-layer steer.
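The arithmetic behind the 1/n weighting can be sketched as follows. `window_layers` is a hypothetical helper, and dropping layers that fall outside the fitted range is an assumption about edge handling that the text above does not specify.

```python
def window_layers(available_layers, center_layer, k=1):
    """Layers in [center-k, center+k] that exist, each with an equal weight of 1/n."""
    wanted = range(center_layer - k, center_layer + k + 1)
    layers = [layer for layer in wanted if layer in available_layers]
    weights = [1.0 / len(layers)] * len(layers)
    return layers, weights

available = set(range(12))   # a 12-layer model, for illustration
auto_alpha = 6.0             # made-up calibrated value

layers, weights = window_layers(available, center_layer=5, k=1)
per_layer_push = [auto_alpha * w for w in weights]

single_layers, single_weights = window_layers(available, center_layer=5, k=0)
```

With k=1 each of the three layers receives auto_alpha / 3, so the pushes sum back to auto_alpha; with k=0 the window is the single center layer at full weight.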

`steerkit.calibrate.calibrate_alpha(probe, model, *, prompts=None, candidates=DEFAULT_ALPHA_CANDIDATES, perplexity_ratio_max=1.5, max_new_tokens=30, method=None, attach=True)` ¶

Pick the largest α whose steered-output perplexity stays within perplexity_ratio_max times the unsteered perplexity.

Parameters:

Name	Type	Description	Default
`probe`	`Probe`	a fitted Probe to calibrate.	required
`model`	`ModelHandle`	the model to steer.	required
`prompts`	`list[str] \| None`	small calibration set (default: 4 generic prompts).	`None`
`candidates`	`list[float] \| tuple[float, ...]`	α values to sweep. They need not be ordered; they are sorted internally.	`DEFAULT_ALPHA_CANDIDATES`
`perplexity_ratio_max`	`float`	ceiling on steered_ppl / baseline_ppl.	`1.5`
`max_new_tokens`	`int`	response length per generation.	`30`
`method`	`str \| None`	which probe direction to use (default: probe.default_method).	`None`
`attach`	`bool`	if True, set probe.auto_alpha = chosen value before returning.	`True`

Returns:

Type	Description
`tuple[float, dict[float, float]]`	(best_alpha, ratios): the best alpha and the full {alpha: perplexity_ratio} mapping. best_alpha is 0.0 if no candidate satisfies the constraint (steering destroys coherence for every α tried).
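The selection rule can be sketched as follows, assuming the ceiling is inclusive (ratio ≤ perplexity_ratio_max). `perplexity` and `pick_alpha` are hypothetical helpers and the ratios are invented numbers. Perplexity here is the usual exponential of the mean per-token negative log-likelihood.

```python
import math

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood over the generated tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def pick_alpha(ratios, perplexity_ratio_max=1.5):
    """Largest alpha whose steered/baseline perplexity ratio stays under the ceiling;
    0.0 when every candidate exceeds it."""
    ok = [alpha for alpha, ratio in ratios.items() if ratio <= perplexity_ratio_max]
    return max(ok) if ok else 0.0

# Invented sweep results: {alpha: steered_ppl / baseline_ppl}.
ratios = {1.0: 1.05, 2.0: 1.20, 4.0: 1.45, 8.0: 2.30}
best = pick_alpha(ratios)
none_ok = pick_alpha({1.0: 1.9, 2.0: 3.4})
```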

`steerkit.eval.evaluate_steering_effect(probes, model, teacher, *, concept_description, eval_prompts=None, top_k=5, by='auc_test_logistic', alpha=4.0, max_new_tokens=60, method=None, on_failure=None, attach=True)` ¶

Score steering effect size for the top-K probes by a cheap-tier metric.

Parameters:

Name	Type	Description	Default
`probes`	`dict[int, Probe]`	a dict of layer -> Probe (e.g. from Probe.fit_all).	required
`model`	`ModelHandle`	the model to steer.	required
`teacher`	`TeacherModel`	the TeacherModel to use as judge (often the same as the generator).	required
`concept_description`	`str`	short description of the concept (e.g. "refusal", "verbose, expansive language").	required
`eval_prompts`	`list[str] \| None`	list of evaluation prompts; defaults to a small bundled set.	`None`
`top_k`	`int`	how many top layers (by `by`) to evaluate. Use a number >= len(probes) to evaluate all.	`5`
`by`	`str`	cheap-tier metric for narrowing. Falls back through train metrics if missing.	`'auc_test_logistic'`
`alpha`	`float`	steering strength during evaluation.	`4.0`
`max_new_tokens`	`int`	response length per generation.	`60`
`method`	`str \| None`	which probe direction to use (defaults to each probe's default_method).	`None`
`on_failure`	`Callable[[int, str, str], None] \| None`	optional callback (layer, response, raw_judge_text) when rating parsing fails.	`None`
`attach`	`bool`	if True, write `metrics["steering_effect"]` and `metrics["steering_effect_n"]` on each evaluated probe in place.	`True`

Returns a dict mapping layer index -> mean steering-effect score. Layers not in the top-K are not evaluated and not present in the return.
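The narrowing step and the per-layer mean can be sketched as below. This is an illustration, not the library's logic: `top_k_layers` and `mean_effect` are hypothetical names, the metric values are invented, and skipping ratings that failed to parse (represented as `None`) is an assumption about how failures are handled.

```python
def top_k_layers(metrics_by_layer, by="auc_test_logistic", top_k=5):
    """Layers ranked by the cheap-tier metric, best first, truncated to top_k."""
    ranked = sorted(metrics_by_layer,
                    key=lambda layer: metrics_by_layer[layer][by],
                    reverse=True)
    return ranked[:top_k]

def mean_effect(ratings):
    """Mean of the ratings that parsed, plus how many parsed."""
    parsed = [r for r in ratings if r is not None]
    mean = sum(parsed) / len(parsed) if parsed else float("nan")
    return mean, len(parsed)

metrics = {
    3: {"auc_test_logistic": 0.71},
    7: {"auc_test_logistic": 0.93},
    8: {"auc_test_logistic": 0.90},
    12: {"auc_test_logistic": 0.64},
}
chosen = top_k_layers(metrics, top_k=2)
score, n = mean_effect([4, None, 2])
```

Only the layers in `chosen` would go on to the expensive judge tier, which is why the returned dict omits the rest.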

`steerkit.extract.extract_activations(pairs, model, hook_site='resid_post', *, include_boundaries=True, cache_dir=None, batch_size=8, pooling='last')` ¶

Extract pooled activations for each (positive, negative) response in each pair, per layer.

Returns a dict mapping layer index -> tensor of shape [n_pairs, 2, d_model]. Along the second axis, index 0 is the positive response and index 1 is the negative response.

With include_boundaries=True (default), the dict also contains entries at:

- layer = -1 (embedding output, TL hook 'hook_embed')
- layer = n_layers (final layernorm output, TL hook 'ln_final.hook_normalized')

These let the cheap-tier sweep span [embed → 0..N-1 → final_ln] in one pass.

pooling selects how the per-token residual stream is reduced to a single [d_model] vector per (pair, response):

"last" (default): final real-token position. Standard for decoder-only autoregressive LMs (Qwen / Llama / Gemma / Pythia / GPT-2): causal attention means the last token has attended to everything before it, so it carries a "summary" of the response.
"mean": average across all real positions. Use this for encoder models (BERT, RoBERTa, DeBERTa, ...) — bidirectional attention means every position sees the whole input, so the last token has no special summary status. Mean-pooling matches BERT-style classification heads.
"max": element-wise max across real positions. Picks up punctate signals at unknown positions; less common but occasionally useful.
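The three pooling modes differ only in how the real-token rows are reduced. Here is a sketch for one right-padded sequence, in plain Python; `pool` and the `n_real` argument are illustrative names, and the real extractor works on batched tensors.

```python
def pool(token_vectors, n_real, pooling="last"):
    """Reduce a [seq_len, d_model] list to one [d_model] vector, ignoring padding."""
    real = token_vectors[:n_real]  # right-padded, so pad rows sit at the end
    if pooling == "last":
        return list(real[-1])
    if pooling == "mean":
        return [sum(col) / len(real) for col in zip(*real)]
    if pooling == "max":
        return [max(col) for col in zip(*real)]
    raise ValueError(f"unknown pooling: {pooling}")

# Two real tokens followed by one pad row whose values must never leak in.
sequence = [[1.0, 2.0], [3.0, 0.0], [9.0, 9.0]]

last = pool(sequence, n_real=2, pooling="last")
mean = pool(sequence, n_real=2, pooling="mean")
mx = pool(sequence, n_real=2, pooling="max")
```

Slicing before pooling is what keeps the pad row out of the mean and max results.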

cache_dir: optional directory for a Zarr v3 activation cache. The cache key is derived from (model_id, hook_site, include_boundaries, pooling, pairs hash). On a cache hit we skip the model entirely and load tensors from disk.
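How the key is actually computed is not specified here. The sketch below shows one plausible scheme, purely as an assumption: hash the pairs, then hash the tuple of settings together with that pairs hash. `cache_key` and the model id are made up for illustration.

```python
import hashlib
import json

def cache_key(model_id, hook_site, include_boundaries, pooling, pairs):
    """Deterministic key over everything that changes the extracted tensors."""
    pairs_hash = hashlib.sha256(
        json.dumps(pairs, sort_keys=True).encode("utf-8")
    ).hexdigest()
    payload = json.dumps(
        {
            "model_id": model_id,
            "hook_site": hook_site,
            "include_boundaries": include_boundaries,
            "pooling": pooling,
            "pairs_hash": pairs_hash,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

pairs = [["I can't help with that.", "Sure, here is how."]]
key_last = cache_key("example/model", "resid_post", True, "last", pairs)
key_mean = cache_key("example/model", "resid_post", True, "mean", pairs)
```

Because pooling is part of the key, switching from "last" to "mean" produces a different key and therefore a cache miss rather than stale tensors.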

batch_size: number of (pair, response) sequences run through the model in a single forward pass. Sequences are right-padded to max length per batch; pad positions are sliced off before pooling so they never affect mean/max. Set to 1 for a strictly sequential path (e.g. for memory-constrained big models). Default 8 is a reasonable speedup-to-memory tradeoff for ≤4B-parameter models on MPS.

`steerkit.extract.extract_group_activations(group, model, hook_site='resid_post', *, include_boundaries=True, cache_dir=None, batch_size=8, pooling='last')` ¶

Extract activations for every concept in a ConceptGroup.

Returns a dict mapping concept_name -> {layer: tensor [n_pairs, 2, d_model]}. Each concept's pairs are passed through the same activation pipeline. The cache_dir is forwarded to per-concept extraction; cache keys differ per concept because the dataset hash differs, so each concept gets its own Zarr store and they're loaded/written independently.
