more_nnsight.git
Clone (read-only): git clone http://git.guha-anderson.com/git/more_nnsight.git
@@ -0,0 +1,127 @@
+# more-nnsight
+
+This library adds one abstraction on top of NNSight: `SavedActivation`, a
+container that holds activation slices keyed by their model path and token
+position.
+
+NNSight already lets you save and patch individual activations. The problem
+starts when you need several at once — across layers, token positions, or
+prompts. You end up juggling loose tensors and remembering which one came from
+where. NNSight also requires that you access modules in layer-major order
+within a trace: you cannot touch layer 2, then layer 5, then go back to layer
+2. `SavedActivation` keeps activations together, enforces canonical ordering
+internally, and gives you batch-slicing, arithmetic, and patching operations
+over the whole set.
+
+## API
+
+```python
+from more_nnsight import SavedActivation, save_activations, updates
+```
+
+Call `save_activations` inside an NNSight trace to capture activations. Paths
+follow the model's attribute structure; the final bracket is the token
+position.
+
+```python
+with model.trace() as tracer:
+    with tracer.invoke(prompts):
+        saved = save_activations(model, [
+            "model.transformer.h[2].output[10]",
+            "model.transformer.h[3].output[-1]",
+        ])
+```
+
+Use `[:]` to expand over a `ModuleList` — `"model.transformer.h[:].output[10]"`
+becomes one key per layer. Each saved tensor has shape `(batch, hidden)`.
+
+Values come back by string or by attribute traversal:
+
+```python
+saved.get("model.transformer.h[2].output[10]")
+saved.values.transformer.h[2].output[10]
+saved.keys()  # list of saved path strings
+```
+
+You can slice the batch dimension, take the mean, or do arithmetic across all
+saved paths at once:
+
+```python
+saved.slice[0:3]  # first three batch rows
+saved.mean()      # (1, hidden) per path
+direction = positive.mean() - negative.mean()  # +, -, scalar * all work
+```
+
+`subset` narrows to specific paths; `union` merges two disjoint sets:
+
+```python
+focused = saved.subset(["model.transformer.h[2].output[10]"])
+combined = patch_a.union(patch_b)
+```
+
+To write stored activations into a later forward pass, use `apply`:
+
+```python
+with model.trace() as tracer:
+    with tracer.invoke(corrupted_prompts):
+        saved.apply(model)
+```
+
+When the replacement depends on the live activation, use `updates` instead.
+It is a generator that walks paths in layer-major order and yields three
+values per path: `key` (the path string), `current` (the activation from the
+current forward pass at that path and token position), and `update` (a
+callback that writes a new tensor back to the same location). At each step,
+you call `update(new_value)` with whatever you want to write. This lets you
+express per-layer logic — for example, adding a scaled steering vector at one
+layer while replacing outright at another:
+
+```python
+with model.trace() as tracer:
+    with tracer.invoke(prompts):
+        for key, current, update in updates(model, saved.keys()):
+            if key == "model.transformer.h[2].output[-1]":
+                update(current + 2.0 * saved.get(key))
+            else:
+                update(saved.get(key))
+```
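+
+`updates` also works when the new value is computed purely from the live
+activation. Here is a minimal mean-ablation sketch (it assumes `current`
+supports standard PyTorch tensor ops such as `mean` and `expand_as` inside
+the trace): each saved position is overwritten with its batch mean.
+
+```python
+with model.trace() as tracer:
+    with tracer.invoke(prompts):
+        for key, current, update in updates(model, saved.keys()):
+            # Collapse the batch to its mean at this path/position,
+            # then broadcast back to (batch, hidden) before writing.
+            update(current.mean(dim=0, keepdim=True).expand_as(current))
+```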
+
+You can also build a `SavedActivation` directly from tensors you already have:
+
+```python
+patch = SavedActivation.from_pairs(
+    ("model.transformer.h[2].output[10]", tensor_a),
+    ("model.transformer.h[3].output[9]", tensor_b),
+)
+```
+
+## Typical workflows
+
+**Activation patching** — save from a clean run, replay into a corrupted run:
+
+```python
+with model.trace() as tracer:
+    with tracer.invoke([clean_prompt]):
+        patch = save_activations(model, ["model.transformer.h[2].output[9]"])
+
+with model.trace() as tracer:
+    with tracer.invoke([corrupted_prompt]):
+        patch.apply(model)
+        logits = model.lm_head.output.save()
+```
+
+**Steering** — compute a direction from contrastive prompts, add it to a
+neutral run:
+
+```python
+prompts = positive_prompts + negative_prompts + [neutral_prompt]
+path = "model.transformer.h[5].output[-1]"
+
+with model.trace() as tracer:
+    with tracer.invoke(prompts):
+        saved = save_activations(model, [path])
+        pos = saved.slice[0:3].mean()
+        neg = saved.slice[3:6].mean()
+        neutral = saved.slice[6]
+
+steered = neutral + 1.5 * (pos - neg)
+```
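+
+## Other models
+
+Nothing here is GPT-2-specific: paths follow whatever attribute structure the
+model actually has. A Qwen-style decoder stack, for example, exposes its
+repeated blocks under `model.model.layers`:
+
+```python
+save_activations(model, ["model.model.layers[:].output[2]"])
+```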
@@ -1,431 +0,0 @@
-# SavedActivation
-
-`SavedActivation` is for the point where plain NNSight starts to get awkward:
-you are no longer saving one activation, but several, and now you have to keep
-track of which tensor came from which layer and position. It keeps those saves
-together in one object, addressed by the same paths you used to create them, so
-operations like subsetting, averaging, and patching stay attached to the
-activations themselves instead of turning into bookkeeping. This file documents
-the API and shows the equivalent direct NNSight code.
-
-## Imports
-
-```python
-from more_nnsight import SavedActivation, save_activations, updates
-```
-
-## Core Rule
-
-`save_activations(...)` must be called inside an already-active NNSight
-`trace`/`invoke` context.
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        saved = save_activations(model, ["model.transformer.h[2].output[10]"])
-```
-
-This matches ordinary NNSight usage: activation saves only work inside a trace.
-
-## Path Syntax
-
-Paths follow the model's real attribute/index structure.
-
-Examples:
-
-- `model.transformer.h[2].output[10]`
-- `model.transformer.h[3].output[-1]`
-- `model.transformer.h[:].output[10]`
-- `model.model.layers[:].output[2]`
-
-Everything before the final bracket names the activation tensor. The final
-bracket gives the token position to save. Intermediate `[:]` syntax expands
-over repeated blocks such as GPT-2 `transformer.h[:]` or Qwen `model.layers[:]`.
-
-For example, `model.transformer.h[2].output[10]` means "take layer 2, take its
-output, and save token position 10", producing a tensor of shape
-`(batch_size, hidden_size)`.
-
-## Saving Activations
-
-### Single path
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        saved = save_activations(model, ["model.transformer.h[2].output[10]"])
-```
-
-Direct NNSight equivalent:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        direct = model.transformer.h[2].output[:, 10, :].save()
-```
-
-Equivalent access:
-
-```python
-saved.get("model.transformer.h[2].output[10]") == direct
-```
-
-### Multiple paths
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        saved = save_activations(
-            model,
-            [
-                "model.transformer.h[2].output[10]",
-                "model.transformer.h[3].output[-1]",
-            ],
-        )
-```
-
-Direct NNSight equivalent:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        layer_2 = model.transformer.h[2].output[:, 10, :].save()
-        layer_3 = model.transformer.h[3].output[:, -1, :].save()
-```
-
-### All layers with `[:]`
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        saved = save_activations(model, ["model.transformer.h[:].output[10]"])
-```
-
-Direct NNSight equivalent:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        direct = [
-            model.transformer.h[layer].output[:, 10, :].save()
-            for layer in range(len(model.transformer.h))
-        ]
-```
-
-The saved keys become concrete:
-
-```python
-saved.keys()
-# [
-#     "model.transformer.h[0].output[10]",
-#     "model.transformer.h[1].output[10]",
-#     ...
-# ]
-```
-
-## Accessing Saved Values
-
-Saved values are exposed under `.values`.
-
-```python
-tensor = saved.values.transformer.h[2].output[10]
-```
-
-Equivalent string lookup:
-
-```python
-tensor = saved.get("model.transformer.h[2].output[10]")
-```
-
-`saved.keys()` returns the saved path strings, and `saved.get(path)` returns one
-saved tensor. Missing paths raise immediately.
-
-## Building From Explicit Pairs
-
-If you already have tensors and want to build a `SavedActivation` directly, use
-`SavedActivation.from_pairs(...)`:
-
-```python
-patch = SavedActivation.from_pairs(
-    ("model.model.layers[34].output[10]", tensor_a),
-    ("model.model.layers[35].output[9]", tensor_b),
-)
-```
-
-Each entry is `(path, value)`. The paths are parsed the same way as
-`save_activations(...)`, the internal order is canonicalized, and duplicate
-paths raise an error.
-
-## Subsetting by Path
-
-```python
-focused = saved.subset(["model.transformer.h[2].output[10]"])
-```
-
-This keeps only the listed saved activations and drops the rest. In direct
-NNSight, you would usually do this by manually building a smaller Python
-structure.
-
-## Union
-
-Use `union` when you want to combine different saved paths into one
-`SavedActivation`:
-
-```python
-combined = layer_2_patch.union(layer_3_patch)
-```
-
-This is different from `+`:
-
-- `a + b` means "add values on the same paths"
-- `a.union(b)` means "combine different paths into one object"
-
-`union` requires the two objects to have disjoint keys. If the same saved path
-appears in both, it raises an error.
-
-## Batch Slicing
-
-Use bracket syntax on `.slice`:
-
-```python
-first = saved.slice[0]
-first_two = saved.slice[0:2]
-mixed = saved.slice[0:2, 5, 8:10]
-```
-
-This slices the batch dimension of every saved tensor.
-
-Direct NNSight equivalent for one path:
-
-```python
-saved.get("model.transformer.h[2].output[10]")[0:2]
-```
-
-The point is that the same batch selection is applied across every saved
-activation at once.
-
-## Mean Over Batch
-
-```python
-mean_saved = saved.mean()
-```
-
-This reduces each saved tensor from shape `(batch_size, hidden_size)` to
-`(1, hidden_size)`.
-
-Direct NNSight equivalent for one path:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        direct_mean = model.transformer.h[2].output[:, 10, :].mean(dim=0, keepdim=True).save()
-```
-
-If `saved.mean()` is called inside the active trace, the reduction happens
-there before the reduced value is saved. That is more memory-efficient than
-saving the full batch and averaging later.
-
-## Arithmetic
-
-If two `SavedActivation` objects have the same keys, you can combine them:
-
-```python
-direction = positive.mean() - negative.mean()
-steered = neutral + 1.5 * direction
-```
-
-Supported operations are `a + b`, `a - b`, `scalar * a`, and `a * scalar`.
-They are applied elementwise across matching saved tensors.
-
-Direct NNSight equivalent for one path:
-
-```python
-direction = positive_tensor.mean(dim=0, keepdim=True) - negative_tensor.mean(dim=0, keepdim=True)
-steered = neutral_tensor + 1.5 * direction
-```
-
-## Applying Saved Activations
-
-You can patch a later run with:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(corrupted_prompts):
-        saved.apply(model)
-```
-
-This writes each stored tensor back into the live traced activation at its
-saved path and token position.
-
-Direct NNSight equivalent for one path:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(corrupted_prompts):
-        model.transformer.h[2].output[:, 10, :] = saved_tensor
-```
-
-`SavedActivation.apply(model)` performs that assignment for every saved key.
-
-## Layer-Major Updates
-
-When you need to read the current activation and write back an updated value in
-the same invoke, use `updates(model, keys)`.
-
-```python
-for key, current_value, update in updates(model, saved.keys()):
-    update(current_value + 2.0 * saved.get(key))
-```
-
-This is the interleaving-safe pattern for multi-layer current-pass updates. It
-walks the keys in canonical layer-major order and gives you:
-
-- `key`: the saved-activation path string
-- `current_value`: the live activation slice from the current forward pass
-- `update(new_value)`: a callback that writes a new value back to that same
-  slice
-
-`updates(...)` requires concrete saved keys in canonical order. In practice,
-`saved.keys()` is the intended input.
-
-Direct NNSight equivalent for one path:
-
-```python
-current = model.transformer.h[2].output[:, 10, :]
-model.transformer.h[2].output[:, 10, :] = current + 2.0 * saved_tensor
-```
-
-Use `updates(...)` when the new value depends on the current forward-pass
-activation. Use `saved.apply(model)` when you just want to replay stored
-activations unchanged.
-
-`save_activations(...)` and `updates(...)` should usually happen in different
-invocations. Saving layer 2 and then revisiting layer 2 after later layers have
-already been touched violates NNSight's interleaving rules.
-
-## `saved.save()`
-
-`SavedActivation.save()` is different from `save_activations(...)`.
-
-- `save_activations(...)` captures activation values
-- `saved.save()` registers the `SavedActivation` object itself with NNSight
-
-New `SavedActivation` objects created inside the trace, such as the result of
-`save_activations(...)`, `saved.mean()`, `saved.subset(...)`, or arithmetic,
-are registered automatically by the library, so the usual pattern works:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        saved = save_activations(model, ["model.transformer.h[2].output[10]"])
-        mean_saved = saved.mean()
-```
-
-Both `saved` and `mean_saved` remain usable after trace exit. You only need
-`.save()` if you want to register an existing `SavedActivation` object
-yourself inside the trace.
-
-## Activation Steering Example
-
-Single forward pass:
-
-```python
-positive_prompts = [
-    "The movie was absolutely wonderful and I felt",
-    "The dinner was excellent and I left feeling",
-    "The vacation was amazing and it made me feel",
-]
-negative_prompts = [
-    "The movie was absolutely terrible and I felt",
-    "The dinner was awful and I left feeling",
-    "The vacation was horrible and it made me feel",
-]
-neutral_prompt = "The day was long and by the end I felt"
-prompts = positive_prompts + negative_prompts + [neutral_prompt]
-
-path = "model.transformer.h[5].output[-1]"
-
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        saved = save_activations(model, [path])
-        positive = saved.slice[0:3].mean()
-        negative = saved.slice[3:6].mean()
-        neutral = saved.slice[6]
-
-direction = positive - negative
-steered = neutral + 1.5 * direction
-```
-
-Direct NNSight equivalent for one path:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        full = model.transformer.h[5].output[:, -1, :].save()
-
-positive = full[0:3].mean(dim=0, keepdim=True)
-negative = full[3:6].mean(dim=0, keepdim=True)
-neutral = full[6:7]
-direction = positive - negative
-steered = neutral + 1.5 * direction
-```
-
-The difference is that `SavedActivation` keeps the same pattern workable when
-you are carrying several saved paths at once instead of a single tensor.
-
-## Patching Example
-
-```python
-clean_prompt = "After John and Mary went to the store, Mary gave a bottle of milk to"
-corrupted_prompt = "After John and Mary went to the store, John gave a bottle of milk to"
-patch_paths = [
-    "model.transformer.h[0].output[9]",
-    "model.transformer.h[1].output[9]",
-    "model.transformer.h[2].output[9]",
-    "model.transformer.h[3].output[9]",
-]
-
-with model.trace() as tracer:
-    with tracer.invoke([clean_prompt]):
-        clean_saved = save_activations(model, patch_paths)
-
-focused_patch = clean_saved.subset(["model.transformer.h[2].output[9]"])
-
-with model.trace() as tracer:
-    with tracer.invoke([corrupted_prompt]):
-        focused_patch.apply(model)
-        logits = model.lm_head.output.save()
-```
-
-Direct NNSight equivalent for one path:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke([clean_prompt]):
-        clean_layer_2 = model.transformer.h[2].output[:, 9, :].save()
-
-with model.trace() as tracer:
-    with tracer.invoke([corrupted_prompt]):
-        model.transformer.h[2].output[:, 9, :] = clean_layer_2
-        logits = model.lm_head.output.save()
-```
-
-If you want to modify the current activation instead of replacing it outright,
-use `updates(...)` in the later run:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke([corrupted_prompt]):
-        for key, current_value, update in updates(model, focused_patch.keys()):
-            update(current_value + 2.0 * focused_patch.get(key))
-        logits = model.lm_head.output.save()
-```
-
-## Other Models
-
-Use the model's real path names. For example, a Qwen-style decoder stack can
-use:
-
-```python
-save_activations(model, ["model.model.layers[:].output[2]"])
-```