agent-snapshot.git
Clone (read-only): git clone http://git.guha-anderson.com/git/agent-snapshot.git
# Agent Snapshot

Agent Snapshot is a Linux-only command-line tool for running another program under
`ptrace` and recording the filesystem state that program observes or changes. It
is intended for compact snapshots of command execution: enough information to
reconstruct relevant file state without copying the whole filesystem.

The tool traces the launched process and children created with `fork`, `vfork`,
or `clone`. It records paths seen through filesystem-related syscalls such as
opens, stats, directory traversal, renames, deletes, and truncation. For each
recorded path, the snapshot stores a before-state captured at first observation
and an after-state captured when the traced process tree exits.

## What Is Saved

Agent Snapshot saves state for paths the traced program actually observes or
mutates. It does not scan the filesystem before launch.

At a high level, the snapshot includes:

- Paths opened for reading or writing.
- Paths checked for existence or metadata.
- Directories traversed by the program.
- Files created, renamed, truncated, modified, or deleted.
- Before and after metadata for observed paths.
- Content blobs for files whose contents are needed for reconstruction.
- Git repository roots and commit hashes for Git-tracked files.

The snapshot is intentionally compact. Clean Git-tracked files that are only read
are represented by Git metadata instead of copied into the snapshot. Files owned
by another user and not writable by the current user are treated as part of the
external system environment and are not copied.

There are important exceptions:

- If the traced program writes a regular file, Agent Snapshot saves its
  after-state blob even if the file is clean and Git-tracked when the program
  exits.
- Dirty Git-tracked files, untracked files, and Git-ignored files under a Git
  repo are copied when their contents are needed.
- Deleted files are represented with tombstones.
- Any path inside a `.git` directory is ignored.
- Paths explicitly listed in the ignore configuration are ignored.
- The ignore configuration file itself is ignored.

Agent Snapshot currently does not aim to save every possible source of process
behavior. It does not snapshot environment variables, process limits, network
state, complete directory entry listings, or arbitrary non-filesystem resources.
It is also currently focused on Linux x86_64 syscall decoding.

Non-UTF-8 pathnames are a known limitation: the current JSON manifest stores
paths as JSON strings, and the JSON library rejects invalid UTF-8.

## Snapshot Format

A snapshot is a directory bundle:

```text
snapshot-dir/
  manifest.json
  blobs/
    <content-digest>
```

`manifest.json` contains:

- `format_version`: snapshot format version.
- `command`: command and arguments that were launched.
- `exit_status`: recorded command status field.
- `start_cwd`: working directory where Agent Snapshot was launched.
- `uid` and `gid`: user and group running Agent Snapshot.
- `git_repositories`: Git repositories observed by the traced program.
- `files`: per-path records.

Each file record contains:

- `path`: absolute path.
- `operations`: observed capabilities such as `read`, `write`, `existence`,
  `directory`, or `delete`.
- `before`: state captured the first time the path was observed.
- `after`: state captured after the traced process tree exited.
- `git`: Git classification for the path when applicable.

Metadata records include whether the path exists, file type, mode, size, mtime,
and optionally a `blob` digest. Blob files live under `blobs/` and are addressed
by digest. The digest is currently an internal content-addressing key, not a
cryptographic integrity guarantee.
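
Because blobs are addressed by digest under `blobs/`, a bundle can be checked for dangling references with a few lines of Python. The following is a minimal sketch, not part of Agent Snapshot itself. It assumes `files` is a JSON array of records and that each blob is stored in a file named exactly by its digest. The function names `referenced_blobs` and `missing_blobs` are hypothetical.

```python
import json
from pathlib import Path


def referenced_blobs(manifest: dict) -> set[str]:
    """Collect every blob digest named in a before- or after-state."""
    digests = set()
    # Assumption: `files` is a list of per-path records.
    for record in manifest.get("files", []):
        for state_name in ("before", "after"):
            state = record.get(state_name) or {}
            if "blob" in state:
                digests.add(state["blob"])
    return digests


def missing_blobs(snapshot_dir: Path) -> set[str]:
    """Return digests the manifest references but blobs/ does not contain."""
    manifest = json.loads((snapshot_dir / "manifest.json").read_text())
    blob_dir = snapshot_dir / "blobs"
    # Assumption: each blob file is named exactly by its digest.
    return {d for d in referenced_blobs(manifest) if not (blob_dir / d).is_file()}
```

Calling `missing_blobs(Path("snapshot-dir"))` returns an empty set when every referenced blob is present. This only confirms that the files exist; it does not verify their contents.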

Clean Git-tracked reads typically have no blob:

```json
{
  "path": "/repo/file.txt",
  "operations": ["read"],
  "before": {
    "exists": true,
    "type": "file",
    "mode": 33188,
    "size": 12,
    "mtime": 1770000000
  },
  "after": {
    "exists": true,
    "type": "file",
    "mode": 33188,
    "size": 12,
    "mtime": 1770000000
  },
  "git": {
    "in_repo": true,
    "root": "/repo",
    "head": "abc123...",
    "relative_path": "file.txt",
    "tracked": true,
    "dirty": false,
    "ignored": false
  }
}
```

Captured file contents appear as blob references:

```json
{
  "path": "/repo/generated.txt",
  "operations": ["write"],
  "before": {
    "exists": false
  },
  "after": {
    "exists": true,
    "type": "file",
    "mode": 33188,
    "size": 18,
    "mtime": 1770000001,
    "blob": "0d88229adcb64ea7"
  }
}
```

Deleted files are represented by an after-state tombstone:

```json
{
  "path": "/repo/deleted.txt",
  "operations": ["delete"],
  "before": {
    "exists": true,
    "type": "file",
    "blob": "..."
  },
  "after": {
    "exists": false,
    "tombstone": true
  }
}
```

## Usage

Build with CMake:

```bash
cmake -S . -B build
cmake --build build --parallel
```

Create the required ignore configuration before running snapshots:

```bash
mkdir -p "${XDG_CONFIG_HOME:-$HOME/.config}/agent-snapshot"
printf '[]\n' > "${XDG_CONFIG_HOME:-$HOME/.config}/agent-snapshot/ignore.json"
```

Run a command under Agent Snapshot:

```bash
build/agent-snapshot --output snapshot-dir -- command arg1 arg2
```

For example:

```bash
build/agent-snapshot --output snapshot-python -- /usr/bin/python3 script.py
```

Restore captured final-state blobs and tombstones in place:

```bash
build/agent-snapshot restore snapshot-dir
```

Restore applies only to records that have an after-state blob or a tombstone.
Clean Git-tracked files and reconstructable system files are represented in the
manifest but are not rewritten by restore.

## Configuration

Snapshot runs require an ignore configuration file. If it is missing or not a
JSON array of strings, Agent Snapshot aborts before launching the traced command.

The config path is:

```text
$XDG_CONFIG_HOME/agent-snapshot/ignore.json
```

If `XDG_CONFIG_HOME` is unset, the fallback is:

```text
$HOME/.config/agent-snapshot/ignore.json
```

The file is a JSON list of file or directory paths:

```json
[
  "$HOME/.cache",
  "$XDG_CONFIG_HOME/agent-snapshot/ignore.json",
  "/tmp/scratch-output"
]
```

Entries may begin with `$HOME` or `$XDG_CONFIG_HOME`. These prefixes are expanded
before matching. If `XDG_CONFIG_HOME` is unset, `$XDG_CONFIG_HOME` expands to
`$HOME/.config`.

An ignored path suppresses both exact matches and descendants. For example,
ignoring `/tmp/work` also ignores `/tmp/work/output.txt`.

The ignore file itself is always ignored, even if it is read by the traced
program.

## Tests

Run the test suite with `uv`:

```bash
uv run pytest
```
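
As an illustration of the ignore rules described under Configuration, the sketch below expands the `$HOME` and `$XDG_CONFIG_HOME` prefixes and then matches a path against an entry exactly or as a descendant. It is not the tool's own matcher. The names `expand_entry` and `is_ignored` are hypothetical, and the sketch assumes a prefix only counts when it is the whole entry or is followed by `/`.

```python
from pathlib import PurePosixPath


def expand_entry(entry: str, env: dict) -> str:
    """Expand a leading $HOME or $XDG_CONFIG_HOME in an ignore entry."""
    home = env["HOME"]
    # Per the README: an unset XDG_CONFIG_HOME falls back to $HOME/.config.
    xdg = env.get("XDG_CONFIG_HOME", f"{home}/.config")
    for name, value in (("$XDG_CONFIG_HOME", xdg), ("$HOME", home)):
        # Assumption: the prefix must be the full entry or end at a "/".
        if entry == name or entry.startswith(name + "/"):
            return value + entry[len(name):]
    return entry


def is_ignored(path: str, entries: list[str], env: dict) -> bool:
    """True if path equals an expanded entry or lies beneath one."""
    target = PurePosixPath(path)
    for entry in entries:
        root = PurePosixPath(expand_entry(entry, env))
        if target == root or root in target.parents:
            return True
    return False
```

Passing `os.environ` as `env` applies the rules to the current environment. Matching on path components rather than string prefixes means that ignoring `/tmp/work` covers `/tmp/work/output.txt` but not `/tmp/workspace`.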