Clone (read-only): git clone http://git.guha-anderson.com/git/agent-snapshot.git
# Agent Snapshot
Agent Snapshot is a Linux-only command-line tool for running another program under
`ptrace` and recording the filesystem state that program observes or changes. It
is intended for compact snapshots of command execution: enough information to
reconstruct relevant file state without copying the whole filesystem.
The tool traces the launched process and children created with `fork`, `vfork`,
or `clone`. It records paths seen through filesystem-related syscalls such as
opens, stats, directory traversal, renames, deletes, and truncation. For each
recorded path, the snapshot stores a before-state captured at first observation
and an after-state captured when the traced process tree exits.
## What Is Saved
Agent Snapshot saves state for paths the traced program actually observes or
mutates. It does not scan the filesystem before launch.
At a high level, the snapshot includes:
- Paths opened for reading or writing.
- Paths checked for existence or metadata.
- Directories traversed by the program.
- Files created, renamed, truncated, modified, or deleted.
- Before and after metadata for observed paths.
- Content blobs for files whose contents are needed for reconstruction.
- Git repository roots and commit hashes for Git-tracked files.
The snapshot is intentionally compact. Clean Git-tracked files that are only read
are represented by Git metadata instead of being copied into the snapshot. Files
owned by another user and not writable by the current user are treated as part of
the external system environment and are not recorded.
There are important exceptions:
- If the traced program writes a regular file, Agent Snapshot saves its
after-state blob even if the file is clean and Git-tracked when the program
exits.
- Dirty Git-tracked files, untracked files, and Git-ignored files under a Git
repo are copied when their contents are needed.
- Deleted files are represented with tombstones.
- Any path inside a `.git` directory is ignored.
- Paths explicitly listed in the ignore configuration are ignored.
- The ignore configuration file itself is ignored.
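The capture rules above can be read as a small decision function. The Python sketch below is illustrative only: `needs_content_blob` is a hypothetical name, its inputs mirror the manifest's `operations` and `git` fields, and it assumes the ignore and ownership filters have already been applied. The handling of existence-only observations and of files outside any Git repository is an assumption, not documented behavior.

```python
from typing import Optional


def needs_content_blob(operations: list[str], git: Optional[dict]) -> bool:
    """Guess whether a regular file's contents would be copied into the snapshot."""
    # A written regular file gets an after-state blob even if it ends up clean.
    if "write" in operations:
        return True
    # Existence checks and directory traversal do not need contents (assumption).
    if "read" not in operations:
        return False
    # Clean Git-tracked reads are represented by Git metadata alone.
    if git and git.get("tracked") and not git.get("dirty"):
        return False
    # Dirty, untracked, and Git-ignored files are copied. Files outside any
    # repository are assumed to be copied as well.
    return True
```

Under these assumptions, a clean tracked file that is only read yields `False`, while the same file yields `True` once the traced program writes to it.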
Agent Snapshot does not currently aim to save every possible source of process
behavior. It does not snapshot environment variables, process limits, network
state, complete directory entry listings, or arbitrary non-filesystem resources.
Syscall decoding currently targets Linux x86_64 only.
The manifest is always written as UTF-8 JSON. Existing valid UTF-8 strings are
preserved. Non-UTF-8 path bytes are converted through Latin-1 before writing so
the manifest remains readable JSON, but byte-exact path restoration for such
names is not yet represented separately in the snapshot format.
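To see why byte-exact restoration is lost, consider the following Python sketch of a UTF-8-first, Latin-1-fallback decoder. The function name `path_for_manifest` is hypothetical, and decoding the whole path at once (rather than per component) is an assumption about how the tool applies the fallback.

```python
import json


def path_for_manifest(raw: bytes) -> str:
    """Decode a raw path as UTF-8 when valid, otherwise fall back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never fails.
        return raw.decode("latin-1")


utf8_name = b"/tmp/caf\xc3\xa9"  # "café" encoded as UTF-8
latin1_name = b"/tmp/caf\xe9"    # "café" encoded as Latin-1 (invalid UTF-8)

# Both byte strings produce the same manifest string, so the original bytes
# cannot be recovered from the manifest alone.
same = path_for_manifest(utf8_name) == path_for_manifest(latin1_name)

# ensure_ascii=False keeps the text as readable UTF-8 JSON.
manifest_text = json.dumps({"path": path_for_manifest(latin1_name)}, ensure_ascii=False)
```

Two distinct on-disk names collapse to one manifest string here, which is the ambiguity the paragraph above describes.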
## Snapshot Format
A snapshot is a directory bundle:
```text
snapshot-dir/
manifest.json
blobs.parquet
```
`blobs.parquet` is created only when at least one blob is captured. It is a
Parquet table with `key` and binary `content` columns
(Snappy-compressed column chunks). Rows are written in bounded row groups during
capture so the tool does not keep every blob in memory.
`manifest.json` contains:
- `command`: command and arguments that were launched.
- `exit_status`: exit status recorded for the launched command.
- `start_cwd`: working directory where Agent Snapshot was launched.
- `uid` and `gid`: user and group running Agent Snapshot.
- `git_repositories`: Git repositories observed by the traced program.
- `files`: per-path records.
Each file record contains:
- `path`: absolute path.
- `operations`: observed capabilities such as `read`, `write`, `existence`,
`directory`, or `delete`.
- `before`: state captured the first time the path was observed.
- `after`: state captured after the traced process tree exited.
- `git`: Git classification for the path when applicable.
Metadata records include whether the path exists, file type, mode, mtime,
and optionally a `blob` key. Blob keys are state-qualified absolute paths such
as `before:/repo/input.txt` or `after:/repo/generated.txt`; payloads for those
keys are stored in `blobs.parquet` as described above.
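A minimal Python sketch of the blob key convention follows. The helper names `make_blob_key` and `split_blob_key` are hypothetical and are not part of the tool. The sketch assumes the state prefix is always `before` or `after` and that the first colon separates it from the path.

```python
def make_blob_key(state: str, path: str) -> str:
    """Build a state-qualified key such as 'after:/repo/generated.txt'."""
    if state not in ("before", "after"):
        raise ValueError(f"unknown state: {state!r}")
    if not path.startswith("/"):
        raise ValueError(f"path must be absolute: {path!r}")
    return f"{state}:{path}"


def split_blob_key(key: str) -> tuple[str, str]:
    """Split a blob key back into (state, absolute path)."""
    # Split on the first colon only, so colons inside the path survive.
    state, sep, path = key.partition(":")
    if not sep or state not in ("before", "after"):
        raise ValueError(f"not a blob key: {key!r}")
    return state, path
```

Splitting on the first colon only means a path that itself contains a colon still round-trips through the key.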
Clean Git-tracked reads typically have no blob:
```json
{
"path": "/repo/file.txt",
"operations": ["read"],
"before": {
"exists": true,
"type": "file",
"mode": 33188,
"mtime": 1770000000
},
"after": {
"exists": true,
"type": "file",
"mode": 33188,
"mtime": 1770000000
},
"git": {
"in_repo": true,
"root": "/repo",
"head": "abc123...",
"relative_path": "file.txt",
"tracked": true,
"dirty": false,
"ignored": false
}
}
```
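The `mode` value 33188 in the record above looks like a raw `st_mode` integer: it equals octal `0o100644`, a regular file with `rw-r--r--` permissions. Assuming that interpretation holds, Python's standard `stat` module can decode it. The helper name `describe_mode` is hypothetical.

```python
import stat


def describe_mode(mode: int) -> dict:
    """Break an st_mode-style integer into file type and permission bits."""
    return {
        "is_regular_file": stat.S_ISREG(mode),
        "permissions": oct(stat.S_IMODE(mode)),
        "symbolic": stat.filemode(mode),
    }


info = describe_mode(33188)
# info["permissions"] is "0o644" and info["symbolic"] is "-rw-r--r--"
```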
Captured file contents appear as blob references:
```json
{
"path": "/repo/generated.txt",
"operations": ["write"],
"before": {
"exists": false
},
"after": {
"exists": true,
"type": "file",
"mode": 33188,
"mtime": 1770000001,
"blob": "after:/repo/generated.txt"
}
}
```
Deleted files are represented by an after-state tombstone:
```json
{
"path": "/repo/deleted.txt",
"operations": ["delete"],
"before": {
"exists": true,
"type": "file",
"blob": "..."
},
"after": {
"exists": false,
"tombstone": true
}
}
```
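The record shapes above are enough to summarize a snapshot without touching `blobs.parquet`. The following Python sketch is one possible reader, not part of the tool: `load_manifest` and `summarize_files` are hypothetical names, and the sketch relies only on the `files`, `path`, `before`, `after`, `exists`, and `blob` fields shown in the examples.

```python
import json


def load_manifest(snapshot_dir: str) -> dict:
    """Read manifest.json from a snapshot directory bundle."""
    with open(f"{snapshot_dir}/manifest.json", encoding="utf-8") as handle:
        return json.load(handle)


def summarize_files(manifest: dict) -> dict:
    """List created paths, deleted paths, and every referenced blob key."""
    created, deleted, blob_keys = [], [], []
    for record in manifest.get("files", []):
        before = record.get("before", {})
        after = record.get("after", {})
        # Absent before and present after: the traced program created it.
        if not before.get("exists") and after.get("exists"):
            created.append(record["path"])
        # Present before and absent after: the traced program deleted it.
        if before.get("exists") and not after.get("exists"):
            deleted.append(record["path"])
        for state in (before, after):
            if "blob" in state:
                blob_keys.append(state["blob"])
    return {"created": created, "deleted": deleted, "blob_keys": blob_keys}
```

For a snapshot written to `snapshot-dir`, `summarize_files(load_manifest("snapshot-dir"))` would return the three lists.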
## Usage
Build with Dune:
```bash
dune build src/ocaml/agent_snapshot.exe
```
The executable is written to:
```text
_build/default/src/ocaml/agent_snapshot.exe
```
The project depends on OCaml, Dune, Yojson, Camomile, and the `ocaml-git`
package. Install `ocaml-git` with opam; for a local checkout next to this repo,
run `opam install ../../homebox/ocaml-git` from the repository root.
On first use, if `ignore.json` is missing, Agent Snapshot creates the config
directory and writes that file using the default path list in the
[Configuration](#configuration) section. You can edit the file before or after
the first run.
Run a command under Agent Snapshot:
```bash
dune exec -- agent-snapshot --snapshot-dir snapshot-dir command arg1 arg2
```
For example:
```bash
dune exec -- agent-snapshot --snapshot-dir snapshot-python /usr/bin/python3 script.py
```
`--output` remains available as an alias for `--snapshot-dir`.
<!-- Restore is temporarily disabled while the bare command form is the default.
Restore captured final-state blobs and tombstones in place:
```bash
dune exec -- agent-snapshot restore snapshot-dir
```
Restore only applies files that have blobs and tombstones. Clean Git-tracked
files and reconstructable system files are represented in the manifest but are
not rewritten by restore.
-->
## Configuration
Snapshot runs read an ignore configuration file. If it is missing, Agent
Snapshot creates it with the default JSON array shown below and continues. If
the file exists but is not a JSON array of strings, Agent Snapshot aborts before
launching the traced command.
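The validation rule above (a JSON array whose elements are all strings) can be sketched in Python as follows. The function name `parse_ignore_config` and the error message are hypothetical. The tool itself is written in OCaml, so this shows only the check, not the actual implementation.

```python
import json


def parse_ignore_config(text: str) -> list[str]:
    """Return the ignore entries, or raise ValueError for any other shape."""
    data = json.loads(text)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, list) or not all(isinstance(item, str) for item in data):
        raise ValueError("ignore.json must be a JSON array of strings")
    return data
```

A caller would treat any `ValueError` as the signal to abort before launching the traced command.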
The config path is:
```text
$XDG_CONFIG_HOME/agent-snapshot/ignore.json
```
If `XDG_CONFIG_HOME` is unset, the fallback is:
```text
$HOME/.config/agent-snapshot/ignore.json
```
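The lookup order above can be sketched in Python as follows. The function name `ignore_config_path` is hypothetical. Treating an empty `XDG_CONFIG_HOME` the same as an unset one is an assumption; the text above only covers the unset case.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def ignore_config_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the ignore.json location from XDG_CONFIG_HOME or HOME."""
    env = os.environ if env is None else env
    base = env.get("XDG_CONFIG_HOME")
    if not base:
        # Fallback when XDG_CONFIG_HOME is unset (or empty, by assumption).
        base = os.path.join(env["HOME"], ".config")
    return Path(base) / "agent-snapshot" / "ignore.json"
```

Passing an explicit mapping instead of reading `os.environ` makes the resolution easy to exercise in tests.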
The file is a JSON list of file or directory paths:
```json
[
"$HOME/.cache",
"$HOME/.claude",
"$HOME/.codex",
"$HOME/.cursor",
"$XDG_CONFIG_HOME/agent-snapshot/ignore.json",
"/tmp/scratch-output",
"/proc",
"/dev",
"/usr",
"/bin"
]
```
Entries may begin with `$HOME` or `$XDG_CONFIG_HOME`. These prefixes are expanded
before matching. If `XDG_CONFIG_HOME` is unset, `$XDG_CONFIG_HOME` expands to
`$HOME/.config`.
An ignored path suppresses both exact matches and descendants. For example,
ignoring `/tmp/work` also ignores `/tmp/work/output.txt`.
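Prefix expansion and descendant matching can be sketched together in Python. The names `expand_entry` and `is_ignored` are hypothetical. Matching on path component boundaries (so that `/tmp/work` does not match `/tmp/workshop`) is an assumption drawn from the word "descendants", not confirmed behavior.

```python
import os
from typing import Iterable, Mapping


def expand_entry(entry: str, env: Mapping[str, str]) -> str:
    """Expand a leading $HOME or $XDG_CONFIG_HOME in an ignore entry."""
    home = env["HOME"]
    xdg = env.get("XDG_CONFIG_HOME") or os.path.join(home, ".config")
    for name, value in (("$XDG_CONFIG_HOME", xdg), ("$HOME", home)):
        # Only a whole leading component counts as a prefix.
        if entry == name or entry.startswith(name + "/"):
            return value + entry[len(name):]
    return entry


def is_ignored(path: str, ignored_roots: Iterable[str]) -> bool:
    """True if path equals an ignored root or lies beneath one."""
    path = os.path.normpath(path)
    for root in ignored_roots:
        root = os.path.normpath(root)
        if path == root or path.startswith(root.rstrip("/") + "/"):
            return True
    return False
```

With `HOME=/home/u` and `XDG_CONFIG_HOME` unset, the entry `$XDG_CONFIG_HOME/agent-snapshot/ignore.json` expands to `/home/u/.config/agent-snapshot/ignore.json`.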
The ignore file itself is always ignored, even if it is read by the traced
program.
## Tests
Run the test suite with `uv`:
```bash
uv run pytest
```