Visualization

steerkit.viz.plot_layer_selection(probes, *, by_classifier='auc_test_logistic', by_steering='steering_effect', x_axis='layer', title=None)

Dual-curve plot of probe-classifier metric and (if available) LLM-judge steering effect as a function of layer depth.

The classifier curve is always drawn (left y-axis); the steering curve is drawn on the right y-axis only if at least one probe has the by_steering metric attached.
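The "draw the steering curve only if at least one probe has the metric" rule can be sketched without steerkit. The snippet below is an illustration, not the library's implementation: the `layer_curves` helper and the plain `layer -> {metric: value}` dictionary are hypothetical stand-ins for real Probe objects, and the numbers are made up.

```python
# Hypothetical stand-in for per-layer probe metrics: layer -> {metric name: value}.
# Real steerkit Probe objects are not used here; all values are made up.
def layer_curves(metrics_by_layer, by_classifier="auc_test_logistic",
                 by_steering="steering_effect"):
    """Return (layers, classifier_values, steering_values_or_None)."""
    layers = sorted(metrics_by_layer)
    classifier = [metrics_by_layer[layer][by_classifier] for layer in layers]
    # The steering curve is only produced if at least one layer carries the metric.
    has_steering = any(by_steering in metrics_by_layer[layer] for layer in layers)
    if not has_steering:
        return layers, classifier, None
    # Layers without a steering score become gaps (None) in the second curve.
    steering = [metrics_by_layer[layer].get(by_steering) for layer in layers]
    return layers, classifier, steering


example = {
    4: {"auc_test_logistic": 0.71},
    8: {"auc_test_logistic": 0.93, "steering_effect": 0.40},
    12: {"auc_test_logistic": 0.88, "steering_effect": 0.55},
}
layers, clf_curve, steer_curve = layer_curves(example)
```

In a plotting routine, `clf_curve` would go on the left y-axis and `steer_curve`, when not None, on a twinned right y-axis.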

steerkit.viz.plot_activation_projection(activations, *, method='pca', title=None, pos_label='concept', neg_label='neutral')

2D projection of a [n_pairs, 2, d_model] activations tensor, colored by class.

The second axis is the contrast pair: index 0 is the positive (concept-bearing) response and index 1 is the negative (neutral) response. Only PCA is supported for now; UMAP can be added as an optional extra later.
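As a rough sketch of what such a projection involves, the snippet below flattens a `[n_pairs, 2, d_model]` array into labelled points and projects them onto the top two principal components with NumPy's SVD. The helper name `project_pairs_pca` and the random demo data are assumptions for illustration; steerkit's own PCA routine may differ.

```python
import numpy as np


def project_pairs_pca(activations):
    """Flatten [n_pairs, 2, d_model] to points and project onto the top 2 PCs."""
    acts = np.asarray(activations, dtype=float)
    n_pairs, pair_size, d_model = acts.shape
    if pair_size != 2:
        raise ValueError("second axis must be the (positive, negative) pair")
    # Row order after reshape is pos_0, neg_0, pos_1, neg_1, ...
    points = acts.reshape(n_pairs * 2, d_model)
    labels = np.tile([0, 1], n_pairs)  # 0 = positive (concept), 1 = negative (neutral)
    centered = points - points.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    return coords, labels


# Synthetic demo data (not real activations).
rng = np.random.default_rng(0)
demo = rng.normal(size=(16, 2, 32))
demo[:, 0, 0] += 3.0  # shift the positive class along one dimension
coords, labels = project_pairs_pca(demo)
```

Scattering `coords[labels == 0]` and `coords[labels == 1]` in two colors gives the class-colored 2D view.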

steerkit.viz.plot_alpha_curve(ratios, *, ratio_max=1.5, chosen_alpha=None, title=None)

Plot α vs perplexity ratio from calibrate_alpha's output.

A horizontal line at ratio_max shows the coherence ceiling; the chosen α (if provided) is annotated with a vertical marker. The intent is to make the auto-α decision transparent: which α values stayed under the ceiling, and which one was picked.
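The decision the plot makes visible can be sketched in a few lines. This example assumes that `ratios` is a mapping from α to perplexity ratio and that the picked value is the largest α at or below the ceiling. Neither assumption is confirmed by this page, so calibrate_alpha's real output format and selection rule may differ. The `pick_alpha` helper and the numbers are hypothetical.

```python
def pick_alpha(ratios, ratio_max=1.5):
    """Return (admissible alphas, chosen alpha or None).

    ratios: hypothetical mapping alpha -> perplexity ratio.
    Assumed rule: choose the largest alpha whose ratio is <= ratio_max.
    """
    admissible = sorted(a for a, r in ratios.items() if r <= ratio_max)
    chosen = admissible[-1] if admissible else None
    return admissible, chosen


# Made-up calibration sweep for illustration.
ratios = {0.5: 1.02, 1.0: 1.10, 2.0: 1.31, 4.0: 1.78, 8.0: 3.40}
admissible, chosen = pick_alpha(ratios)
```

Here the values 4.0 and 8.0 exceed the ceiling of 1.5, so only the first three α values remain candidates.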

steerkit.viz.plot_logit_lens(probe, model, *, top_k=20, method=None, title=None)

Push the steering direction through the model's unembedding to get vocab logits, and render the top-K tokens as a horizontal bar chart.

A high-quality steering direction for "joy" should produce top tokens like "happy", "joyful", "delighted". If the top tokens look unrelated, the probe is likely broken. This makes the plot the cheapest interpretability sanity check.
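The core computation is a matrix-vector product followed by a top-K selection. The toy sketch below assumes an unembedding matrix laid out as one row per vocabulary token; the helper `logit_lens_top_k`, the vocabulary, and all weights are made up. Real models often apply a final normalization before the unembedding, and this page does not say whether steerkit does.

```python
def logit_lens_top_k(direction, unembedding, vocab, top_k=3):
    """Score each token by dot(direction, unembedding row); return the top_k pairs."""
    logits = [sum(d * w for d, w in zip(direction, row)) for row in unembedding]
    ranked = sorted(zip(vocab, logits), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Toy 3-dimensional "model" with a 5-token vocabulary; all values are made up.
vocab = ["happy", "joyful", "table", "sad", "the"]
unembedding = [
    [0.9, 0.1, 0.0],   # happy
    [0.8, 0.2, 0.1],   # joyful
    [0.0, 0.9, 0.1],   # table
    [-0.9, 0.1, 0.0],  # sad
    [0.0, 0.0, 0.5],   # the
]
joy_direction = [1.0, 0.0, 0.0]
top_tokens = logit_lens_top_k(joy_direction, unembedding, vocab, top_k=2)
```

The returned `(token, logit)` pairs are what a horizontal bar chart would display, highest logit first.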

steerkit.viz.plot_similarity_heatmap(source, *, method=None, title=None, cmap='RdBu_r')

Cosine-similarity heatmap between class direction vectors.

Accepts either:
  • a MultinomialProbe whose weights rows are per-class directions, or
  • a dict[name -> Probe] whose entries each carry a binary direction (typically GroupFit.best).

A diagonal of 1.0 is expected. Off-diagonals near 0 indicate orthogonal concepts, while off-diagonals near ±1 indicate redundancy (e.g., joy ≈ −sadness).
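To make the reading of the heatmap concrete, here is a minimal sketch of a pairwise cosine-similarity matrix over named direction vectors. The `cosine_similarity_matrix` helper and the 2-dimensional toy directions are illustrative assumptions and do not use steerkit's Probe types.

```python
import math


def cosine_similarity_matrix(directions):
    """Return (names, matrix) of pairwise cosine similarities."""
    names = list(directions)

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    matrix = [[cos(directions[r], directions[c]) for c in names] for r in names]
    return names, matrix


# Toy directions: "sadness" points opposite to "joy"; "formality" is orthogonal.
directions = {
    "joy": [1.0, 0.0],
    "sadness": [-2.0, 0.0],
    "formality": [0.0, 3.0],
}
names, sim = cosine_similarity_matrix(directions)
```

In this toy case the joy/sadness entry is −1 (redundant up to sign) and the joy/formality entry is 0 (orthogonal), matching the interpretation above. Vector length does not matter because cosine similarity normalizes it away.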

steerkit.viz.plot_cross_model_overlay(probes_per_model, *, by='auc_test_logistic', title=None, mark_best=True)

Overlay layer-selection curves from multiple models on a normalized-depth x-axis.

Useful for comparing where the same concept is most cleanly classified across models. This is a methodology-comparison plot, not a steering-vector-transfer claim. Each entry of probes_per_model maps model_label -> dict[int, Probe] (e.g. the per-layer fits returned by Probe.fit_all). Curves are aligned via Probe.normalized_depth so that models with different layer counts can be compared visually.

Parameters:
  • probes_per_model (dict[str, dict[int, Probe]], required): Mapping from model label to per-layer Probe.fit_all results.
  • by (str, default 'auc_test_logistic'): Metric to plot, for example auc_test_logistic or cohens_d_logistic.
  • title (str | None, default None): Optional figure title.
  • mark_best (bool, default True): If True, mark each model's best layer with a larger hollow dot.
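The alignment idea can be illustrated with plain dictionaries. The sketch below assumes normalized depth is `layer / (n_layers - 1)`, so the first layer maps to 0 and the last to 1. How Probe.normalized_depth is actually defined is not stated here, so treat this formula, the `normalized_curve` helper, and the metric values as assumptions.

```python
def normalized_curve(metric_by_layer, n_layers):
    """Return (points, best) where points are (depth, value) with depth in [0, 1].

    Assumes depth = layer / (n_layers - 1); steerkit's definition may differ.
    """
    points = [
        (layer / (n_layers - 1), value)
        for layer, value in sorted(metric_by_layer.items())
    ]
    # The point a "mark_best" option would highlight: the highest metric value.
    best = max(points, key=lambda point: point[1])
    return points, best


# Made-up per-layer metric values for two models of different depth.
small = {0: 0.60, 6: 0.90, 11: 0.80}   # 12-layer model
large = {0: 0.55, 16: 0.92, 31: 0.85}  # 32-layer model
small_points, small_best = normalized_curve(small, n_layers=12)
large_points, large_best = normalized_curve(large, n_layers=32)
```

Both curves now share an x-axis from 0 to 1, so the best layers (about 0.55 and 0.52 depth here) can be compared directly despite the different layer counts.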

steerkit.viz.plot_token_scores(scores, *, title=None, figsize=None, color_pos='tab:red', color_neg='tab:blue', mark_response_start=True)

Render per-token probe scores as a horizontal bar chart.

Parameters:
  • scores (TokenScores, required): A TokenScores from Probe.score_tokens(...).
  • title (str | None, default None): Optional figure title; defaults to one mentioning the layer + method.
  • figsize (tuple[float, float] | None, default None): Matplotlib figure size; defaults scale with the number of tokens.
  • color_pos (str, default 'tab:red'): Bar color for positive scores (concept-active positions).
  • color_neg (str, default 'tab:blue'): Bar color for negative scores.
  • mark_response_start (bool, default True): When True and scores.response_start > 0, draws a horizontal divider between the prompt and response tokens.

Returns the Figure without calling plt.show() or plt.close(); the caller decides whether to display, save, or close it.
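The sign-based coloring and the prompt/response divider described in the parameters can be sketched as a small layout step. The snippet is illustrative only: the `bar_layout` helper is hypothetical, a score of exactly zero is treated as negative, and the divider is placed halfway between the last prompt bar and the first response bar. None of these details are confirmed by this page.

```python
def bar_layout(tokens, scores, response_start=0, color_pos="tab:red",
               color_neg="tab:blue", mark_response_start=True):
    """Return (rows, divider) for a per-token horizontal bar chart.

    rows: list of (token, score, color), one per bar.
    divider: position between prompt and response bars, or None.
    """
    # Assumption: zero scores get the negative color.
    colors = [color_pos if score > 0 else color_neg for score in scores]
    rows = list(zip(tokens, scores, colors))
    divider = None
    if mark_response_start and response_start > 0:
        # Assumption: the divider sits between bar indices
        # response_start - 1 and response_start.
        divider = response_start - 0.5
    return rows, divider


# Made-up tokens and scores; the response starts at token index 2.
rows, divider = bar_layout(
    ["Q", ":", "Hi", "!"], [0.1, -0.2, 0.8, -0.05], response_start=2
)
```

With matplotlib, `rows` would feed a horizontal bar call and `divider` a horizontal line drawn only when it is not None.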