more_nnsight.git

Clone (read-only): git clone http://git.guha-anderson.com/git/more_nnsight.git

Commit message
Moved SKILL to README

Author
Arjun Guha <a.guha@northeastern.edu>
Date
2026-03-31 15:23:13 -0400
Commit
12de53f84f05e152ecf8422554ded0f8df6d0f9b
README.md
index e69de29..e6b903d 100644
--- a/README.md
+++ b/README.md
@@ -0,0 +1,127 @@
+# more-nnsight
+
+This library adds one abstraction on top of NNSight: `SavedActivation`, a
+container that holds activation slices keyed by their model path and token
+position.
+
+NNSight already lets you save and patch individual activations. The problem
+starts when you need several at once — across layers, token positions, or
+prompts. You end up juggling loose tensors and remembering which one came from
+where. NNSight also requires that you access modules in layer-major order
+within a trace: you cannot touch layer 2, then layer 5, then go back to layer
+2. `SavedActivation` keeps activations together, enforces canonical ordering
+internally, and gives you batch, arithmetic, and patching operations over the
+whole set.
+
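+Concretely, the ordering constraint looks like this inside a single invoke (a
+sketch of the rule described above; `model` is assumed to be a GPT-2-style
+NNSight model):
+
+```python
+with model.trace() as tracer:
+    with tracer.invoke(prompts):
+        a = model.transformer.h[2].output.save()  # fine
+        b = model.transformer.h[5].output.save()  # fine: a later layer
+        # Accessing model.transformer.h[2] again here would violate the
+        # layer-major rule: layer 2 has already been passed.
+```
+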
+## API
+
+```python
+from more_nnsight import SavedActivation, save_activations, updates
+```
+
+Call `save_activations` inside an NNSight trace to capture activations. Paths
+follow the model's attribute structure; the final bracket is the token position.
+
+```python
+with model.trace() as tracer:
+    with tracer.invoke(prompts):
+        saved = save_activations(model, [
+            "model.transformer.h[2].output[10]",
+            "model.transformer.h[3].output[-1]",
+        ])
+```
+
+Use `[:]` to expand over a `ModuleList` — `"model.transformer.h[:].output[10]"`
+becomes one key per layer. Each saved tensor has shape `(batch, hidden)`.
+
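+For example, with a 12-layer GPT-2-style stack (assumed here), the expansion
+yields one concrete key per layer:
+
+```python
+with model.trace() as tracer:
+    with tracer.invoke(prompts):
+        saved = save_activations(model, ["model.transformer.h[:].output[10]"])
+
+saved.keys()
+# ["model.transformer.h[0].output[10]",
+#  "model.transformer.h[1].output[10]",
+#  ...
+#  "model.transformer.h[11].output[10]"]
+```
+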
+Values come back by string or by attribute traversal:
+
+```python
+saved.get("model.transformer.h[2].output[10]")
+saved.values.transformer.h[2].output[10]
+saved.keys()  # list of saved path strings
+```
+
+You can slice the batch dimension, take the mean, or do arithmetic across all
+saved paths at once:
+
+```python
+saved.slice[0:3]                                # first three batch rows
+saved.mean()                                    # (1, hidden) per path
+direction = positive.mean() - negative.mean()   # +, -, scalar * all work
+```
+
+`subset` narrows to specific paths; `union` merges two disjoint sets:
+
+```python
+focused = saved.subset(["model.transformer.h[2].output[10]"])
+combined = patch_a.union(patch_b)
+```
+
+To write stored activations into a later forward pass, use `apply`:
+
+```python
+with model.trace() as tracer:
+    with tracer.invoke(corrupted_prompts):
+        saved.apply(model)
+```
+
+When the replacement depends on the live activation, use `updates` instead.
+It is a generator that walks paths in layer-major order and yields three values
+per path: `key` (the path string), `current` (the activation from the current
+forward pass at that path and token position), and `update` (a callback that
+writes a new tensor back to the same location). At each step, you call
+`update(new_value)` with whatever you want to write. This lets you express
+per-layer logic — for example, adding a scaled steering vector at one layer
+while replacing outright at another:
+
+```python
+with model.trace() as tracer:
+    with tracer.invoke(prompts):
+        for key, current, update in updates(model, saved.keys()):
+            if key == "model.transformer.h[2].output[-1]":
+                update(current + 2.0 * saved.get(key))
+            else:
+                update(saved.get(key))
+```
+
+You can also build a `SavedActivation` directly from tensors you already have:
+
+```python
+patch = SavedActivation.from_pairs(
+    ("model.transformer.h[2].output[10]", tensor_a),
+    ("model.transformer.h[3].output[9]", tensor_b),
+)
+```
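+
+Paths are parsed the same way as in `save_activations`, the internal order is
+canonicalized, and duplicate paths raise an error.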
+
+## Typical workflows
+
+**Activation patching** — save from a clean run, replay into a corrupted run:
+
+```python
+with model.trace() as tracer:
+    with tracer.invoke([clean_prompt]):
+        patch = save_activations(model, ["model.transformer.h[2].output[9]"])
+
+with model.trace() as tracer:
+    with tracer.invoke([corrupted_prompt]):
+        patch.apply(model)
+        logits = model.lm_head.output.save()
+```
+
+**Steering** — compute a direction from contrastive prompts, add it to a
+neutral run:
+
+```python
+prompts = positive_prompts + negative_prompts + [neutral_prompt]
+path = "model.transformer.h[5].output[-1]"
+
+with model.trace() as tracer:
+    with tracer.invoke(prompts):
+        saved = save_activations(model, [path])
+        pos = saved.slice[0:3].mean()  # mean over the 3 positive prompts
+        neg = saved.slice[3:6].mean()  # mean over the 3 negative prompts
+        neutral = saved.slice[6]       # the single neutral prompt
+
+steered = neutral + 1.5 * (pos - neg)
+```
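+
+To apply the direction, write `steered` back into a run on the neutral prompt
+alone; a minimal sketch, assuming `steered` still carries the single saved
+path from above:
+
+```python
+with model.trace() as tracer:
+    with tracer.invoke([neutral_prompt]):
+        steered.apply(model)
+        logits = model.lm_head.output.save()
+```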
SKILL.md
deleted file mode 100644
index 45650d2..0000000
--- a/SKILL.md
+++ /dev/null
@@ -1,431 +0,0 @@
-# SavedActivation
-
-`SavedActivation` is for the point where plain NNSight starts to get awkward:
-you are no longer saving one activation, but several, and now you have to keep
-track of which tensor came from which layer and position. It keeps those saves
-together in one object, addressed by the same paths you used to create them, so
-operations like subsetting, averaging, and patching stay attached to the
-activations themselves instead of turning into bookkeeping. This file documents
-the API and shows the equivalent direct NNSight code.
-
-## Imports
-
-```python
-from more_nnsight import SavedActivation, save_activations, updates
-```
-
-## Core Rule
-
-`save_activations(...)` must be called inside an already-active NNSight
-`trace`/`invoke` context.
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        saved = save_activations(model, ["model.transformer.h[2].output[10]"])
-```
-
-This matches ordinary NNSight usage: activation saves only work inside a trace.
-
-## Path Syntax
-
-Paths follow the model's real attribute/index structure.
-
-Examples:
-
-- `model.transformer.h[2].output[10]`
-- `model.transformer.h[3].output[-1]`
-- `model.transformer.h[:].output[10]`
-- `model.model.layers[:].output[2]`
-
-Everything before the final bracket names the activation tensor. The final
-bracket gives the token position to save. Intermediate `[:]` syntax expands
-over repeated blocks such as GPT-2 `transformer.h[:]` or Qwen `model.layers[:]`.
-
-For example, `model.transformer.h[2].output[10]` means "take layer 2, take its
-output, and save token position 10", producing a tensor of shape
-`(batch_size, hidden_size)`.
-
-## Saving Activations
-
-### Single path
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        saved = save_activations(model, ["model.transformer.h[2].output[10]"])
-```
-
-Direct NNSight equivalent:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        direct = model.transformer.h[2].output[:, 10, :].save()
-```
-
-Equivalent access:
-
-```python
-assert torch.equal(saved.get("model.transformer.h[2].output[10]"), direct)
-```
-
-### Multiple paths
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        saved = save_activations(
-            model,
-            [
-                "model.transformer.h[2].output[10]",
-                "model.transformer.h[3].output[-1]",
-            ],
-        )
-```
-
-Direct NNSight equivalent:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        layer_2 = model.transformer.h[2].output[:, 10, :].save()
-        layer_3 = model.transformer.h[3].output[:, -1, :].save()
-```
-
-### All layers with `[:]`
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        saved = save_activations(model, ["model.transformer.h[:].output[10]"])
-```
-
-Direct NNSight equivalent:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        direct = [
-            model.transformer.h[layer].output[:, 10, :].save()
-            for layer in range(len(model.transformer.h))
-        ]
-```
-
-The saved keys become concrete:
-
-```python
-saved.keys()
-# [
-#   "model.transformer.h[0].output[10]",
-#   "model.transformer.h[1].output[10]",
-#   ...
-# ]
-```
-
-## Accessing Saved Values
-
-Saved values are exposed under `.values`.
-
-```python
-tensor = saved.values.transformer.h[2].output[10]
-```
-
-Equivalent string lookup:
-
-```python
-tensor = saved.get("model.transformer.h[2].output[10]")
-```
-
-`saved.keys()` returns the saved path strings, and `saved.get(path)` returns one
-saved tensor. Missing paths raise immediately.
-
-## Building From Explicit Pairs
-
-If you already have tensors and want to build a `SavedActivation` directly, use
-`SavedActivation.from_pairs(...)`:
-
-```python
-patch = SavedActivation.from_pairs(
-    ("model.model.layers[34].output[10]", tensor_a),
-    ("model.model.layers[35].output[9]", tensor_b),
-)
-```
-
-Each entry is `(path, value)`. The paths are parsed the same way as
-`save_activations(...)`, the internal order is canonicalized, and duplicate
-paths raise an error.
-
-## Subsetting by Path
-
-```python
-focused = saved.subset(["model.transformer.h[2].output[10]"])
-```
-
-This keeps only the listed saved activations and drops the rest. In direct
-NNSight, you would usually do this by manually building a smaller Python
-structure.
-
-## Union
-
-Use `union` when you want to combine different saved paths into one
-`SavedActivation`:
-
-```python
-combined = layer_2_patch.union(layer_3_patch)
-```
-
-This is different from `+`:
-
-- `a + b` means "add values on the same paths"
-- `a.union(b)` means "combine different paths into one object"
-
-`union` requires the two objects to have disjoint keys. If the same saved path
-appears in both, it raises an error.
-
-## Batch Slicing
-
-Use bracket syntax on `.slice`:
-
-```python
-first = saved.slice[0]
-first_two = saved.slice[0:2]
-mixed = saved.slice[0:2, 5, 8:10]
-```
-
-This slices the batch dimension of every saved tensor.
-
-Direct NNSight equivalent for one path:
-
-```python
-saved.get("model.transformer.h[2].output[10]")[0:2]
-```
-
-The point is that the same batch selection is applied across every saved
-activation at once.
-
-## Mean Over Batch
-
-```python
-mean_saved = saved.mean()
-```
-
-This reduces each saved tensor from shape `(batch_size, hidden_size)` to
-`(1, hidden_size)`.
-
-Direct NNSight equivalent for one path:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        direct_mean = model.transformer.h[2].output[:, 10, :].mean(dim=0, keepdim=True).save()
-```
-
-If `saved.mean()` is called inside the active trace, the reduction happens
-there before the reduced value is saved. That is more memory-efficient than
-saving the full batch and averaging later.
-
-## Arithmetic
-
-If two `SavedActivation` objects have the same keys, you can combine them:
-
-```python
-direction = positive.mean() - negative.mean()
-steered = neutral + 1.5 * direction
-```
-
-Supported operations are `a + b`, `a - b`, `scalar * a`, and `a * scalar`.
-They are applied elementwise across matching saved tensors.
-
-Direct NNSight equivalent for one path:
-
-```python
-direction = positive_tensor.mean(dim=0, keepdim=True) - negative_tensor.mean(dim=0, keepdim=True)
-steered = neutral_tensor + 1.5 * direction
-```
-
-## Applying Saved Activations
-
-You can patch a later run with:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(corrupted_prompts):
-        saved.apply(model)
-```
-
-This writes each stored tensor back into the live traced activation at its
-saved path and token position.
-
-Direct NNSight equivalent for one path:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(corrupted_prompts):
-        model.transformer.h[2].output[:, 10, :] = saved_tensor
-```
-
-`SavedActivation.apply(model)` performs that assignment for every saved key.
-
-## Layer-Major Updates
-
-When you need to read the current activation and write back an updated value in
-the same invoke, use `updates(model, keys)`.
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        for key, current_value, update in updates(model, saved.keys()):
-            update(current_value + 2.0 * saved.get(key))
-```
-
-This is the interleaving-safe pattern for multi-layer current-pass updates. It
-walks the keys in canonical layer-major order and gives you:
-
-- `key`: the saved-activation path string
-- `current_value`: the live activation slice from the current forward pass
-- `update(new_value)`: a callback that writes a new value back to that same
-  slice
-
-`updates(...)` requires concrete saved keys in canonical order. In practice,
-`saved.keys()` is the intended input.
-
-Direct NNSight equivalent for one path:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        current = model.transformer.h[2].output[:, 10, :]
-        model.transformer.h[2].output[:, 10, :] = current + 2.0 * saved_tensor
-```
-
-Use `updates(...)` when the new value depends on the current forward-pass
-activation. Use `saved.apply(model)` when you just want to replay stored
-activations unchanged.
-
-`save_activations(...)` and `updates(...)` should usually happen in different
-invocations. Saving layer 2 and then revisiting layer 2 after later layers have
-already been touched violates NNSight's interleaving rules.
-
-## `saved.save()`
-
-`SavedActivation.save()` is different from `save_activations(...)`.
-
-- `save_activations(...)` captures activation values
-- `saved.save()` registers the `SavedActivation` object itself with NNSight
-
-New `SavedActivation` objects created inside the trace, such as the result of
-`save_activations(...)`, `saved.mean()`, `saved.subset(...)`, or arithmetic,
-are registered automatically by the library, so the usual pattern works:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        saved = save_activations(model, ["model.transformer.h[2].output[10]"])
-        mean_saved = saved.mean()
-```
-
-Both `saved` and `mean_saved` remain usable after trace exit. You only need
-`.save()` if you want to register an existing `SavedActivation` object
-yourself inside the trace.
-
-## Activation Steering Example
-
-Single forward pass:
-
-```python
-positive_prompts = [
-    "The movie was absolutely wonderful and I felt",
-    "The dinner was excellent and I left feeling",
-    "The vacation was amazing and it made me feel",
-]
-negative_prompts = [
-    "The movie was absolutely terrible and I felt",
-    "The dinner was awful and I left feeling",
-    "The vacation was horrible and it made me feel",
-]
-neutral_prompt = "The day was long and by the end I felt"
-prompts = positive_prompts + negative_prompts + [neutral_prompt]
-
-path = "model.transformer.h[5].output[-1]"
-
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        saved = save_activations(model, [path])
-        positive = saved.slice[0:3].mean()
-        negative = saved.slice[3:6].mean()
-        neutral = saved.slice[6]
-
-direction = positive - negative
-steered = neutral + 1.5 * direction
-```
-
-Direct NNSight equivalent for one path:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke(prompts):
-        full = model.transformer.h[5].output[:, -1, :].save()
-
-positive = full[0:3].mean(dim=0, keepdim=True)
-negative = full[3:6].mean(dim=0, keepdim=True)
-neutral = full[6:7]
-direction = positive - negative
-steered = neutral + 1.5 * direction
-```
-
-The difference is that `SavedActivation` keeps the same pattern workable when
-you are carrying several saved paths at once instead of a single tensor.
-
-## Patching Example
-
-```python
-clean_prompt = "After John and Mary went to the store, Mary gave a bottle of milk to"
-corrupted_prompt = "After John and Mary went to the store, John gave a bottle of milk to"
-patch_paths = [
-    "model.transformer.h[0].output[9]",
-    "model.transformer.h[1].output[9]",
-    "model.transformer.h[2].output[9]",
-    "model.transformer.h[3].output[9]",
-]
-
-with model.trace() as tracer:
-    with tracer.invoke([clean_prompt]):
-        clean_saved = save_activations(model, patch_paths)
-
-focused_patch = clean_saved.subset(["model.transformer.h[2].output[9]"])
-
-with model.trace() as tracer:
-    with tracer.invoke([corrupted_prompt]):
-        focused_patch.apply(model)
-        logits = model.lm_head.output.save()
-```
-
-Direct NNSight equivalent for one path:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke([clean_prompt]):
-        clean_layer_2 = model.transformer.h[2].output[:, 9, :].save()
-
-with model.trace() as tracer:
-    with tracer.invoke([corrupted_prompt]):
-        model.transformer.h[2].output[:, 9, :] = clean_layer_2
-        logits = model.lm_head.output.save()
-```
-
-If you want to modify the current activation instead of replacing it outright,
-use `updates(...)` in the later run:
-
-```python
-with model.trace() as tracer:
-    with tracer.invoke([corrupted_prompt]):
-        for key, current_value, update in updates(model, focused_patch.keys()):
-            update(current_value + 2.0 * focused_patch.get(key))
-        logits = model.lm_head.output.save()
-```
-
-## Other Models
-
-Use the model's real path names. For example, a Qwen-style decoder stack can
-use:
-
-```python
-save_activations(model, ["model.model.layers[:].output[2]"])
-```