Probes & sweeps
steerkit.probe.Probe
dataclass
Multi-direction linear probe at a single layer.
directions holds one unit-normalized [d_model] vector per probe family.
metrics holds named scalars (e.g. auc_test_logistic, cohens_d_logistic).
default_method chooses which direction Probe.steer() uses by default.
direction
property
The default-method direction tensor, ready for arithmetic.
auc
property
Held-out logistic AUC if available, else train AUC, else NaN.
fit_all(activations, model, *, hook_site='resid_post', test_fraction=0.2, seed=42, default_method='logistic')
classmethod
Fit one Probe per layer with all three candidate directions + cheap-tier metrics.
For each layer the training pipeline is:
1) split pairs into train/test (test_fraction; pair-level split, no leakage).
2) fit logistic regression on train activations; store coef + AUC train/test.
3) compute difference-of-means direction on train; score AUC train/test via cosine.
4) fit LDA with Ledoit-Wolf shrinkage on train; score AUC train/test via decision_function.
5) compute Cohen's d on the held-out logistic decision-function scores.
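As a rough illustration of steps 1 and 3, the sketch below does a pair-level split and scores a difference-of-means direction by AUC on toy data. It is plain Python and not steerkit's implementation; the helper names (split_pairs, diff_of_means, cosine, auc) and the toy activations are assumptions made for the example.

```python
import math
import random

def split_pairs(n_pairs, test_fraction, seed):
    """Shuffle pair indices and split them, so both halves of a pair land on the same side."""
    idx = list(range(n_pairs))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_pairs * test_fraction))
    return idx[n_test:], idx[:n_test]

def diff_of_means(pos, neg):
    """Unit-normalized mean(pos) - mean(neg)."""
    dim = len(pos[0])
    v = [sum(p[j] for p in pos) / len(pos) - sum(n[j] for n in neg) / len(neg)
         for j in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(x, v):
    """Cosine similarity between an activation and a direction."""
    dot = sum(a * b for a, b in zip(x, v))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in v)))

def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) score pairs ranked correctly; ties count half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Toy data (hypothetical): 10 pairs of 3-d "activations", positives shifted along axis 0.
rng = random.Random(0)
pairs = [([rng.gauss(2.0, 0.1), rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)],
          [rng.gauss(-2.0, 0.1), rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)])
         for _ in range(10)]

train_idx, test_idx = split_pairs(len(pairs), test_fraction=0.2, seed=42)
direction = diff_of_means([pairs[i][0] for i in train_idx],
                          [pairs[i][1] for i in train_idx])
test_auc = auc([cosine(pairs[i][0], direction) for i in test_idx],
               [cosine(pairs[i][1], direction) for i in test_idx])
```

Splitting by pair index rather than by individual response is what keeps a positive and its matched negative from ending up on opposite sides of the split.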
best_layer(probes, by='auc_test_logistic')
staticmethod
Pick the layer that maximizes the given metric. Falls back to a cheap-tier alternative if the requested metric isn't present (e.g. when test_fraction=0).
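The fallback behaviour can be pictured with a small stand-alone sketch. It is not the library's code: pick_best_layer is a hypothetical helper, and the fallback metric name auc_train_logistic is an assumption (the source only names auc_test_logistic explicitly).

```python
import math

def pick_best_layer(metrics_by_layer, by="auc_test_logistic",
                    fallbacks=("auc_train_logistic",)):
    """Return (layer, metric_name) for the best layer, trying `by` before each fallback."""
    for name in (by, *fallbacks):
        scored = {layer: m[name] for layer, m in metrics_by_layer.items()
                  if name in m and not math.isnan(m[name])}
        if scored:
            return max(scored, key=scored.get), name
    raise ValueError("no usable metric found")

# With a held-out split, the requested metric decides.
with_test = {3: {"auc_test_logistic": 0.80, "auc_train_logistic": 0.90},
             7: {"auc_test_logistic": 0.95, "auc_train_logistic": 0.85}}
# Without a held-out split (e.g. test_fraction=0), only train metrics exist.
train_only = {3: {"auc_train_logistic": 0.90},
              7: {"auc_train_logistic": 0.85}}

chosen_with_test = pick_best_layer(with_test)
chosen_train_only = pick_best_layer(train_only)
```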
get_direction(method=None)
Return the unit direction for the given method (defaults to default_method).
score_tokens(model, prompt, response=None, *, method=None, include_prompt=False)
Project every token's residual-stream activation at this probe's layer
onto this probe's direction. Returns a TokenScores with token strings
aligned to scalar scores, one per position.
This is the interpretability complement to steer(): where steer() uses
the direction to push generation, score_tokens() measures where in a
sequence the direction is most active. Useful for asking questions like
"which tokens in this refusal actually carry the refusal signal?" or
"did my steering hook push activations the way I expected?".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | ModelHandle | a loaded ModelHandle (any model; the probe's model_id is checked only at steering time, not here). | required |
| prompt | str | the user prompt to score (or pre-pend to | required |
| response | str \| None | optional assistant response to score. Default | None |
| method | str \| None | which probe-family direction to project onto. Defaults to | None |
| include_prompt | bool | when True, scores are returned for every token in the full chat-formatted sequence, with | False |
Returns:
| Type | Description |
|---|---|
| TokenScores | TokenScores(tokens, scores, layer, method, response_start). Call |
Note: scores are raw direction · activation projections — the sign and
magnitude are interpretable within one sequence (which token fires hardest)
but not across sequences without calibration. The logistic-method bias is
omitted because it shifts all positions equally and does not change the
relative ranking that this view is for.
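To make the note concrete, here is a minimal sketch of per-token projection on toy vectors. The names project_tokens and ranking are hypothetical, and the bias value is an arbitrary illustration; the point is only that adding a constant to every position leaves the ordering unchanged.

```python
def project_tokens(activations, direction):
    """Dot each token's activation vector with the probe direction."""
    return [sum(a * d for a, d in zip(act, direction)) for act in activations]

def ranking(scores):
    """Token indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

direction = [0.6, 0.8]                               # unit-norm toy direction
token_acts = [[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]  # one row per token

scores = project_tokens(token_acts, direction)
with_bias = [s + 0.7 for s in scores]                # same shift at every position
```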
steer(model, prompt, alpha=None, *, method=None, op='addition', target=None, gamma=None, max_new_tokens=60, temperature=0.0)
Generate a steered completion. op selects one of the four interventions:
addition (default) — act ← act + α·v. alpha=None uses the calibrated
auto_alpha, else 2.0.
projection — act ← act − (act·v̂)v̂. Ablates the concept component;
alpha/target/gamma are ignored.
clamp — act ← act + (target − act·v̂)v̂. Requires target. Forces
the concept's projection to a fixed value.
multiplicative — act ← act + (γ−1)(act·v̂)v̂. Requires gamma. Scales
the existing component along the direction.
Pass method to override default_method (which probe-family direction).
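The four update rules can be checked on a single activation vector. The sketch below is a stand-alone illustration in plain Python, not the hook steerkit installs; apply_op, dot, unit and add_scaled are hypothetical helper names.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def add_scaled(act, coeff, v):
    """act + coeff * v, element-wise."""
    return [a + coeff * x for a, x in zip(act, v)]

def apply_op(act, v, op="addition", alpha=2.0, target=None, gamma=None):
    """Apply one of the four interventions to a single activation vector."""
    v_hat = unit(v)
    proj = dot(act, v_hat)                      # current component along the direction
    if op == "addition":
        return add_scaled(act, alpha, v)
    if op == "projection":
        return add_scaled(act, -proj, v_hat)    # remove the component entirely
    if op == "clamp":
        if target is None:
            raise ValueError("clamp requires target")
        return add_scaled(act, target - proj, v_hat)
    if op == "multiplicative":
        if gamma is None:
            raise ValueError("multiplicative requires gamma")
        return add_scaled(act, (gamma - 1.0) * proj, v_hat)
    raise ValueError(f"unknown op: {op}")

act = [3.0, 4.0]
v = [1.0, 0.0]   # already unit-norm, as the stored directions are

added = apply_op(act, v, "addition", alpha=2.0)
ablated = apply_op(act, v, "projection")
clamped = apply_op(act, v, "clamp", target=5.0)
scaled = apply_op(act, v, "multiplicative", gamma=2.0)
```

After projection the component along v̂ is zero, after clamp it equals target, and after the multiplicative op it is γ times its original value; the component orthogonal to v̂ is untouched in all three.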
ablate(model, prompt, **kwargs)
Convenience: steer with op='projection' (remove the concept component).
clamp(model, prompt, target, **kwargs)
Convenience: steer with op='clamp' at the given target projection value.
amplify(model, prompt, gamma, **kwargs)
Convenience: steer with op='multiplicative' at the given gamma scaling factor.
plot_logit_lens(model, **kwargs)
Convenience: render the steering direction projected through the unembed.
predict_at_mask(model, text, *, top_k=10, alpha=None, method=None, op='addition', target=None, gamma=None)
Run a single forward pass with the steering hook on, then read the
top-K vocabulary predictions at every [MASK] token in text.
This is the encoder analog of Probe.steer(...) — encoder models
don't autoregressively generate, but they expose token-level logits at
each position. For a sentence like "I think this movie is [MASK].",
this returns the top-K most-probable fillers for the mask, with the
steering direction applied to the residual stream at the probe's layer.
Pass alpha=0.0 for the unsteered baseline (the hook is still
installed but contributes nothing) — the typical use is to call this
once with alpha=0.0 and once at the calibrated auto_alpha and
compare the resulting top-K distributions side-by-side.
Returns a {mask_position: [(token_string, probability), ...]} dict.
Raises ValueError if the input contains no [MASK] tokens.
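The shape of the return value can be sketched without a model: given per-position logits, take a softmax at each mask position and keep the top-K entries. Everything here is a toy stand-in (the function top_k_at_masks, the three-word vocabulary, and the logits are all assumptions), and no steering hook is involved.

```python
import math

def top_k_at_masks(tokens, logits, vocab, top_k=2, mask_token="[MASK]"):
    """Return {mask_position: [(token_string, probability), ...]} from raw logits."""
    positions = [i for i, tok in enumerate(tokens) if tok == mask_token]
    if not positions:
        raise ValueError("input contains no [MASK] tokens")
    result = {}
    for pos in positions:
        row = logits[pos]
        peak = max(row)                              # subtract max for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = sorted(range(len(vocab)), key=lambda j: probs[j], reverse=True)[:top_k]
        result[pos] = [(vocab[j], probs[j]) for j in best]
    return result

vocab = ["good", "bad", "fine"]
tokens = ["movie", "is", "[MASK]"]
logits = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [2.0, 0.0, 1.0]]   # toy logits at the mask position

predictions = top_k_at_masks(tokens, logits, vocab, top_k=2)
```

Comparing two such dicts, one from a baseline pass and one from a steered pass, is the side-by-side view described above.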
export_gguf(path, *, method=None, scale=1.0)
Convenience: export this single-layer Probe to llama.cpp gguf format.
report(*, model=None, activations=None, per_layer=None, out=None, title=None)
Render a one-page HTML report. Returns the HTML string; if out is set,
also writes it to disk and returns the path-as-string.
steerkit.probe.TokenScores
dataclass
Per-token probe scores along a single sequence.
Built by Probe.score_tokens(...). scores[i] is the projection of the
token-i residual-stream activation (at the probe's layer) onto the
probe's direction. Higher values mean the direction is more active at
that token; the sign matches positive vs. negative side of the probe.
Attributes:
| Name | Type | Description |
|---|---|---|
| tokens | list[str] | decoded token strings, one per position scored. |
| scores | Tensor | 1-D float tensor with one entry per token. |
| layer | int | the probe's layer index (so plots can label themselves). |
| method | str | which probe-family direction was used (logistic / diff_of_means / mass_mean). |
| response_start | int | index into |
plot(**kwargs)
Convenience: render with steerkit.viz.plot_token_scores.
steerkit.probe.MultinomialProbe
dataclass
Multi-class linear classifier across the concepts of a mutually_exclusive ConceptGroup.
Diagnostic, not for steering: useful for "which concept is this activation expressing?"
and for cross-concept similarity heatmaps (rows of weights are direction vectors,
one per concept). Steering is still done with the per-concept binary Probes.
fit_at_layer(activations_by_concept, layer, model, *, hook_site='resid_post', test_fraction=0.2, seed=42)
classmethod
Fit a multinomial probe at one chosen layer, using each concept's positive activations as its class. Returns a single MultinomialProbe with held-out accuracy.
fit_best_layer(activations_by_concept, model, *, hook_site='resid_post', test_fraction=0.2, seed=42)
classmethod
Fit a multinomial probe at every layer; return the one with best test accuracy (or train accuracy if no test split was kept).
similarity_matrix()
Cosine-similarity matrix between class direction vectors (rows of weights).
Returns a [n_classes, n_classes] tensor.
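For reference, a cosine-similarity matrix over weight rows can be computed as in the sketch below. It uses nested Python lists instead of tensors, and the three-class weight matrix is a made-up example.

```python
import math

def similarity_matrix(weights):
    """Cosine similarity between every pair of rows; result is n_classes x n_classes."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in weights]
    return [[sum(a * b for a, b in zip(wi, wj)) / (ni * nj)
             for wj, nj in zip(weights, norms)]
            for wi, ni in zip(weights, norms)]

weights = [[1.0, 0.0],   # class 0
           [0.0, 2.0],   # class 1, orthogonal to class 0
           [3.0, 3.0]]   # class 2, halfway between them

sim = similarity_matrix(weights)
```

Because each entry is divided by both row norms, the result depends only on the angle between class directions, not on how large each class's weights are.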
plot_similarity(**kwargs)
Convenience: render the cross-class similarity heatmap.
steerkit.sweep.sweep(group, model, *, hook_site='resid_post', test_fraction=0.2, seed=42, cache_dir=None, select_by='auc_test_logistic', with_steering_eval=False, teacher=None, eval_top_k=5, eval_alpha=4.0)
Run the full Phase-3+4 sweep on a ConceptGroup.
Steps
1) extract_group_activations per concept (Zarr-cached if cache_dir set).
2) Probe.fit_all per concept + select best layer by select_by.
3) For mutually_exclusive groups with ≥2 concepts: fit a MultinomialProbe
at the layer with highest held-out multinomial accuracy.
4) If with_steering_eval=True and teacher is provided: call the LLM-judge
expensive tier on each concept's top-K layers and attach metrics["steering_effect"].
steerkit.sweep.GroupFit
dataclass
Result of sweep(group, model). Indexable by concept name to get the chosen Probe.
plot_layer_selection(concept_name, **kwargs)
Render the layer-selection dual-curve for one concept (requires per_concept).
plot_similarity(**kwargs)
Render the cross-concept similarity heatmap. Uses the multinomial probe if present, otherwise falls back to the per-concept best probes' steering directions.
report(*, model=None, out=None, title=None)
Render a one-page HTML report for this GroupFit. Returns the HTML string;
if out is set, also writes it to disk and returns the path-as-string.
window(concept_name, *, k=1)
Build a window-of-(2k+1) multi-layer composite around the chosen best layer
for concept_name. Requires the full per_concept fits (i.e. not loaded from disk).
save(dir_path)
Save the chosen Probe per concept + multinomial (if present) + the ConceptGroup snapshot (without contrast pairs) into a directory.
Layout
dir_path/
    group.json
    {concept_name}.probe.safetensors    (one per concept)
    multinomial.probe.safetensors       (only if mutex group)
load(dir_path)
classmethod
Load a directory written by GroupFit.save.
steerkit.sweep.CompositeProbe
dataclass
Multiple probes' steering vectors composed at inference time.
Each probe's direction is added at its own layer with its own per-probe weight (and per-probe alpha if desired). Probes targeting the same TL hook are combined into a single hook function.
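One way to picture the per-hook combination is to sum the scaled directions of all probes that share a layer, so that each layer needs only one added vector. This sketch assumes the per-probe push is weight × alpha × direction; that reading is consistent with the rest of this page but is not confirmed as the library's exact rule, and combine_by_layer is a hypothetical helper.

```python
def combine_by_layer(entries):
    """entries: iterable of (layer, weight, alpha, direction). Returns {layer: summed vector}."""
    combined = {}
    for layer, weight, alpha, direction in entries:
        total = combined.setdefault(layer, [0.0] * len(direction))
        for j, x in enumerate(direction):
            total[j] += weight * alpha * x   # assumed push: weight * alpha * direction
    return combined

entries = [
    (5, 0.5, 2.0, [1.0, 0.0]),   # two probes share layer 5
    (5, 0.5, 4.0, [0.0, 1.0]),
    (7, 1.0, 2.0, [1.0, 0.0]),   # one probe at layer 7
]
vectors = combine_by_layer(entries)
```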
export_gguf(path, *, method=None, scale=1.0)
Convenience: export this composite as a multi-layer gguf control vector.
steer(model, prompt, *, alphas=None, methods=None, max_new_tokens=60, temperature=0.0)
Generate a completion with all composed probes' steering vectors active.
alphas: per-probe alpha override; defaults to each probe's auto_alpha (or 2.0).
methods: per-probe direction method override.
steerkit.sweep.compose(probes, weights=None)
Compose multiple probes for simultaneous steering at inference time.
weights defaults to all 1.0 (equal contribution). Probes can come from
different ConceptGroups — the design memory's "axes" composition.
steerkit.sweep.window(probes, center_layer, *, k=1)
Build a multi-layer steering composite over a window of layers around center_layer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| probes | dict[int, Probe] | full per-layer dict from | required |
| center_layer | int | the chosen "best" layer; window is built around it. | required |
| k | int | half-window size. k=1 (default) selects [center-1, center, center+1], the "window-of-3" mode the design memory commits to. k=0 collapses to the single best probe wrapped in a CompositeProbe. | 1 |
The returned CompositeProbe has weights = [1/n]*n so each layer's contribution
is scaled down proportionally. With each probe's auto_alpha left as the default
α at steer time, the per-layer push is auto_alpha / n rather than auto_alpha,
keeping the total perturbation roughly comparable to a single-layer steer.
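The weight arithmetic can be checked with a few lines of plain Python. window_layers is a hypothetical helper, and dropping layers that fall outside the available set is an assumption about edge handling that the source does not describe.

```python
def window_layers(available_layers, center_layer, k=1):
    """Layers in [center-k, center+k] that exist, with equal weights summing to 1."""
    layers = [layer for layer in range(center_layer - k, center_layer + k + 1)
              if layer in available_layers]
    n = len(layers)
    return layers, [1.0 / n] * n

layers, weights = window_layers(range(12), center_layer=6, k=1)

auto_alpha = 3.0                                 # toy calibrated value
per_layer_push = [w * auto_alpha for w in weights]

single_layers, single_weights = window_layers(range(12), center_layer=6, k=0)
```

With n = 3 each layer receives auto_alpha / 3, so the pushes sum back to auto_alpha.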
steerkit.calibrate.calibrate_alpha(probe, model, *, prompts=None, candidates=DEFAULT_ALPHA_CANDIDATES, perplexity_ratio_max=1.5, max_new_tokens=30, method=None, attach=True)
Pick the largest α whose steered-output perplexity is at most perplexity_ratio_max times the unsteered perplexity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| probe | Probe | a fitted Probe to calibrate. | required |
| model | ModelHandle | the model to steer. | required |
| prompts | list[str] \| None | small calibration set (default: 4 generic prompts). | None |
| candidates | list[float] \| tuple[float, ...] | α values to sweep; any order is accepted because they are sorted before use. | DEFAULT_ALPHA_CANDIDATES |
| perplexity_ratio_max | float | ceiling on steered_ppl / baseline_ppl. | 1.5 |
| max_new_tokens | int | response length per generation. | 30 |
| method | str \| None | which probe direction to use (default: probe.default_method). | None |
| attach | bool | if True, set probe.auto_alpha = chosen value before returning. | True |
Returns:
| Type | Description |
|---|---|
| tuple[float, dict[float, float]] | (best_alpha, ratios): best alpha and the full {alpha: perplexity_ratio} mapping. best_alpha is 0.0 if no candidate satisfies the constraint (steering destroys coherence for every α tried). |
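The selection rule itself is small enough to sketch. The code below is illustrative only: perplexity and pick_alpha are hypothetical helpers, and the per-token log-probabilities are invented toy values rather than model output.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def pick_alpha(ratios, perplexity_ratio_max=1.5):
    """Largest alpha whose ratio is within the ceiling, or 0.0 if none qualifies."""
    passing = [alpha for alpha, ratio in ratios.items() if ratio <= perplexity_ratio_max]
    return (max(passing) if passing else 0.0), ratios

baseline_ppl = perplexity([-1.0, -1.0])
steered_log_probs = {          # toy values: larger alpha, less likely text
    1.0: [-1.1, -1.1],
    2.0: [-1.3, -1.3],
    4.0: [-2.0, -2.0],
}
ratios = {alpha: perplexity(lp) / baseline_ppl for alpha, lp in steered_log_probs.items()}

best_alpha, all_ratios = pick_alpha(ratios, perplexity_ratio_max=1.5)
fallback_alpha, _ = pick_alpha({1.0: 3.0}, perplexity_ratio_max=1.5)
```

Here the ratios are roughly 1.11, 1.35 and 2.72, so α = 2.0 is the largest candidate under the 1.5 ceiling.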
steerkit.eval.evaluate_steering_effect(probes, model, teacher, *, concept_description, eval_prompts=None, top_k=5, by='auc_test_logistic', alpha=4.0, max_new_tokens=60, method=None, on_failure=None, attach=True)
Score steering effect size for the top-K probes by a cheap-tier metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| probes | dict[int, Probe] | a dict of layer -> Probe (e.g. from Probe.fit_all). | required |
| model | ModelHandle | the model to steer. | required |
| teacher | TeacherModel | the TeacherModel to use as judge (often the same as the generator). | required |
| concept_description | str | short description of the concept (e.g. "refusal", "verbose, expansive language"). | required |
| eval_prompts | list[str] \| None | list of evaluation prompts; defaults to a small bundled set. | None |
| top_k | int | how many top layers (by | 5 |
| by | str | cheap-tier metric for narrowing. Falls back through train metrics if missing. | 'auc_test_logistic' |
| alpha | float | steering strength during evaluation. | 4.0 |
| max_new_tokens | int | response length per generation. | 60 |
| method | str \| None | which probe direction to use (defaults to each probe's default_method). | None |
| on_failure | Callable[[int, str, str], None] \| None | optional callback (layer, response, raw_judge_text) when rating parsing fails. | None |
| attach | bool | if True, write | True |
Returns a dict mapping layer index -> mean steering-effect score. Layers not in the top-K are not evaluated and not present in the return.
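The narrowing step can be sketched as "rank by the cheap metric, then run the expensive judge only on the top K". In this stand-alone example the judge is a toy lambda standing in for the LLM-judge call, and evaluate_top_k and the metric values are assumptions made for illustration.

```python
def evaluate_top_k(metrics_by_layer, judge, by="auc_test_logistic", top_k=2):
    """Run `judge` only on the top_k layers ranked by the cheap-tier metric `by`."""
    ranked = sorted(metrics_by_layer,
                    key=lambda layer: metrics_by_layer[layer][by],
                    reverse=True)
    return {layer: judge(layer) for layer in ranked[:top_k]}

metrics = {
    4:  {"auc_test_logistic": 0.71},
    8:  {"auc_test_logistic": 0.93},
    12: {"auc_test_logistic": 0.88},
    16: {"auc_test_logistic": 0.65},
}
judge_calls = []

def toy_judge(layer):
    """Stand-in for the expensive tier; records which layers were actually scored."""
    judge_calls.append(layer)
    return layer / 100.0

effects = evaluate_top_k(metrics, toy_judge, top_k=2)
```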
steerkit.extract.extract_activations(pairs, model, hook_site='resid_post', *, include_boundaries=True, cache_dir=None, batch_size=8, pooling='last')
Extract pooled activations for each (positive, negative) response in each pair, per layer.
Returns a dict mapping layer index -> tensor of shape [n_pairs, 2, d_model]. Along the second axis, index 0 is the positive response and index 1 is the negative response.
With include_boundaries=True (default), the dict also contains entries at:
- layer = -1 (embedding output, TL hook 'hook_embed')
- layer = n_layers (final layernorm output, TL hook 'ln_final.hook_normalized')
These let the cheap-tier sweep span [embed → 0..N-1 → final_ln] in one pass.
pooling selects how the per-token residual stream is reduced to a single
[d_model] vector per (pair, response):
- "last" (default): final real-token position. Standard for decoder-only autoregressive LMs (Qwen / Llama / Gemma / Pythia / GPT-2): causal attention means the last token has attended to everything before it, so it carries a "summary" of the response.
- "mean": average across all real positions. Use this for encoder models (BERT, RoBERTa, DeBERTa, ...) — bidirectional attention means every position sees the whole input, so the last token has no special summary status. Mean-pooling matches BERT-style classification heads.
- "max": element-wise max across real positions. Picks up punctate signals at unknown positions; less common but occasionally useful.
cache_dir: optional directory for a Zarr v3 activation cache. The cache key
is derived from (model_id, hook_site, include_boundaries, pooling, pairs hash).
On a cache hit we skip the model entirely and load tensors from disk.
batch_size: number of (pair, response) sequences run through the model in a
single forward pass. Sequences are right-padded to max length per batch; pad
positions are sliced off before pooling so they never affect mean/max. Set to
1 for a strictly sequential path (e.g. for memory-constrained big models).
Default 8 is a reasonable speedup-to-memory tradeoff for ≤4B-parameter models
on MPS.
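The three pooling modes, and the effect of slicing off right-padding first, can be shown on a tiny example. This is a plain-Python sketch with a hypothetical pool helper; the real pipeline operates on batched tensors.

```python
def pool(token_acts, n_real, pooling="last"):
    """Reduce [seq_len, d_model] to [d_model] using only the first n_real positions."""
    real = token_acts[:n_real]        # right-padded, so pad rows sit at the end
    if pooling == "last":
        return list(real[-1])
    if pooling == "mean":
        return [sum(col) / len(real) for col in zip(*real)]
    if pooling == "max":
        return [max(col) for col in zip(*real)]
    raise ValueError(f"unknown pooling: {pooling}")

# Two real tokens followed by one pad row; the pad values are deliberately large
# so that any leak into mean/max would be visible.
sequence = [[1.0, 4.0],
            [3.0, 2.0],
            [9.0, 9.0]]

last_vec = pool(sequence, n_real=2, pooling="last")
mean_vec = pool(sequence, n_real=2, pooling="mean")
max_vec = pool(sequence, n_real=2, pooling="max")
```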
steerkit.extract.extract_group_activations(group, model, hook_site='resid_post', *, include_boundaries=True, cache_dir=None, batch_size=8, pooling='last')
Extract activations for every concept in a ConceptGroup.
Returns a dict mapping concept_name -> {layer: tensor [n_pairs, 2, d_model]}.
Each concept's pairs are passed through the same activation pipeline. The
cache_dir is forwarded to per-concept extraction; cache keys differ per
concept because the dataset hash differs, so each concept gets its own
Zarr store and they're loaded/written independently.