Concept gallery
A linear probe is only as good as the concept you point it at.
Good contrast pairs share the same user prompt and differ mainly in the target behaviour. If positives also change topic, length, sentiment, and factual content all at once, the probe may latch onto one of those confounds instead of the concept.
Run `steerkit lint-pairs --pairs your.jsonl` before sweeping; see the CLI reference for the checks.
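As a rough illustration of the confound problem, the sketch below loads a couple of pairs and compares average response length on each side. The field names (`prompt`, `positive`, `negative`), the whitespace tokenisation, and the 1.5 ratio threshold are assumptions made for this example; they are not steerkit's documented schema, and this is not a reimplementation of `lint-pairs`.

```python
import json
from statistics import mean

# Hypothetical records: the field names are illustrative, not steerkit's schema.
JSONL = """\
{"prompt": "How do I reverse a list in Python?", "positive": "Great question! Use reversed(xs) or xs[::-1].", "negative": "Use reversed(xs) or xs[::-1]."}
{"prompt": "What is the capital of France?", "positive": "What a thoughtful thing to ask! It is Paris.", "negative": "It is Paris."}
"""


def load_pairs(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def length_ratio(pairs):
    """Mean positive length over mean negative length, in whitespace tokens."""
    pos = mean(len(p["positive"].split()) for p in pairs)
    neg = mean(len(p["negative"].split()) for p in pairs)
    return pos / neg


def is_length_skewed(pairs, max_ratio=1.5):
    """Flag datasets where one side is much longer on average (threshold is arbitrary)."""
    ratio = length_ratio(pairs)
    return ratio > max_ratio or ratio < 1 / max_ratio


pairs = load_pairs(JSONL)
print(f"length ratio: {length_ratio(pairs):.2f}")
print("skewed:", is_length_skewed(pairs))
```

Even though each pair here differs only by a short preface, the positives come out roughly twice as long as the negatives, so a probe trained on them could partly be a length detector.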
Bundled concepts
| Concept | File | What it is |
|---|---|---|
| Sycophancy | `sycophancy.jsonl` | Validating preface ("Great question!") before answering, vs. answering directly. Used by the headline walkthrough + showcase figures. |
| Verbosity | `verbosity.jsonl` | Long, qualified, example-heavy answers vs. concise one-or-two-sentence answers. Watch for `LENGTH_SKEW`. |
| Formality | `formality.jsonl` | Formal business-tone vs. casual conversational. Orthogonal to verbosity, so it pairs well for composition. |
| Refusal | `refusal_pairs.jsonl` | Canonical no-help refusals ("I can't help with that.") vs. helpful answers. Walkthrough at `examples/case_studies/refusal_walkthrough.ipynb`. |
| Emotion (joy / sadness / anger) | `emotion.group.json` | A `ConceptGroup` with three mutually exclusive concepts; demonstrates `sweep()` on a multi-class group + multinomial diagnostic. |
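The formality row describes the concept as orthogonal to verbosity. One way to sanity-check a claim like that for any two probe directions is their cosine similarity: when it is near zero, a weighted sum of the two directions keeps each coefficient independently recoverable. The sketch below uses made-up 4-dimensional vectors in plain Python; the vectors, the `compose` helper, and the weights are hypothetical and do not come from steerkit.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; 0.0 means the directions are orthogonal."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def compose(directions, weights):
    """Weighted sum of direction vectors (one simple way to combine them)."""
    dim = len(directions[0])
    return [sum(w * d[i] for d, w in zip(directions, weights)) for i in range(dim)]


def coefficient(vector, direction):
    """Least-squares coefficient of `vector` along `direction`."""
    return dot(vector, direction) / dot(direction, direction)


# Toy directions chosen to be exactly orthogonal; real probe directions will not be.
formality = [1.0, 0.0, 1.0, 0.0]
verbosity = [0.0, 2.0, 0.0, -1.0]

combined = compose([formality, verbosity], [0.5, 1.0])

print("cosine:", cosine(formality, verbosity))
print("formality coefficient:", coefficient(combined, formality))
print("verbosity coefficient:", coefficient(combined, verbosity))
```

With these toy vectors the two coefficients read back as the weights that went in. For real directions with non-zero cosine, the coefficients would bleed into each other, which is one reason to check before composing.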
Other concepts to try
Difficulty: easy = generates cleanly with a teacher and probes well at one or two layers; medium = needs careful concept-description authoring; hard = signal might not be cleanly linear, or the concept is contested in the literature.
| Concept | Positive vs. Negative | Difficulty |
|---|---|---|
| Hedging | "I think...", "perhaps...", "I'm not sure but..." vs. confident assertions | easy |
| First-person | "I" / "my" pervasive vs. impersonal third-person | easy |
| Question-asking | Ends with a follow-up question vs. statement only | easy |
| Markdown-heavy | Headers + bullets + bold vs. flowing prose | easy |
| Optimism / pessimism | Hopeful framings vs. resigned framings | medium |
| Empathetic acknowledgement | "That sounds hard..." vs. immediately solving | medium |
| Confidence / uncertainty | Authoritative vs. tentative | medium |
| Calibration / honesty about uncertainty | "I don't know" vs. confident confabulation | medium |
| Step-by-step reasoning ("CoT") | Visible chain-of-thought vs. final answer | medium |
| Self-identification as AI | Mentions being an LLM/model vs. responds as a person | medium |
| Medical / legal advice tone | Clinical / disclaimed vs. layperson | medium |
| Truthfulness (TruthfulQA-style) | Honest vs. confident-and-wrong | hard |
| Sandbagging / underperforming | Capability hiding vs. full-effort answers | hard |
| Power-seeking / instrumental reasoning | Goal-directed vs. neutral | hard |
Authoring a new concept
- Write a behaviour-specific concept description ("uses warm emotional language" beats "nice").
- Pick a single neutral reference instruction shared across all pairs.
- Generate or hand-write 30–100 contrast pairs: same prompt, two responses.
- Run `steerkit lint-pairs --pairs your.jsonl` (use `--strict` in CI).
- Sweep at small scale; check `plot_layer_selection` for clean held-out separation before scaling up.
- Test steering on prompts not in the dataset.
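To make "clean held-out separation" concrete, here is a minimal sketch on synthetic data. It fits a difference-of-means probe per layer on a training split and scores it on held-out examples. Everything here is an assumption for illustration: the activations are random numbers, the three "layers" and their signal strengths are invented, and difference-of-means is a stand-in technique; steerkit's own probe and `plot_layer_selection` may work differently.

```python
import random


def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_direction(pos, neg):
    """Difference-of-means probe: a direction plus a midpoint threshold."""
    mu_p, mu_n = mean_vec(pos), mean_vec(neg)
    direction = [a - b for a, b in zip(mu_p, mu_n)]
    midpoint = [(a + b) / 2 for a, b in zip(mu_p, mu_n)]
    threshold = sum(d * m for d, m in zip(direction, midpoint))
    return direction, threshold


def accuracy(direction, threshold, pos, neg):
    """Fraction of examples falling on the correct side of the threshold."""
    def score(x):
        return sum(d * xi for d, xi in zip(direction, x))

    hits = sum(score(x) > threshold for x in pos)
    hits += sum(score(x) <= threshold for x in neg)
    return hits / (len(pos) + len(neg))


def fake_layer(rng, n, dim, shift):
    """Synthetic activations: positives centred at +shift, negatives at -shift."""
    pos = [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]
    neg = [[rng.gauss(-shift, 1.0) for _ in range(dim)] for _ in range(n)]
    return pos, neg


rng = random.Random(0)
layer_shift = {0: 0.0, 1: 0.2, 2: 2.0}  # hypothetical layers: none, weak, strong signal

held_out_acc = {}
for layer, shift in layer_shift.items():
    pos, neg = fake_layer(rng, n=60, dim=8, shift=shift)
    # Fit on the first 40 examples per class, evaluate on the remaining 20.
    direction, threshold = fit_direction(pos[:40], neg[:40])
    held_out_acc[layer] = accuracy(direction, threshold, pos[40:], neg[40:])

best_layer = max(held_out_acc, key=held_out_acc.get)
for layer, acc in held_out_acc.items():
    print(f"layer {layer}: held-out accuracy {acc:.2f}")
print("best layer:", best_layer)
```

The point of the split is that accuracy is measured only on examples the probe never saw; a layer that looks good on training pairs but sits near chance on held-out ones is not worth scaling up.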
If you build a clean dataset that fills a gap, open a PR.