agent-snapshot.git

Clone (read-only): git clone http://git.guha-anderson.com/git/agent-snapshot.git

.gitignore (blob)
dune-project (blob)
pyproject.toml (blob)
README.md (blob)
src (tree)
test_programs (tree)
testdata (tree)
tests (tree)
uv.lock (blob)

README.md

# Agent Snapshot

Agent Snapshot is a Linux-only command-line tool for running another program under `ptrace` and recording the filesystem state that program observes or changes. It is intended for compact snapshots of command execution: enough information to reconstruct relevant file state without copying the whole filesystem.

The tool traces the launched process and children created with `fork`, `vfork`, or `clone`. It records paths seen through filesystem-related syscalls such as opens, stats, directory traversal, renames, deletes, and truncation. For each recorded path, the snapshot stores a before-state captured at first observation and an after-state captured when the traced process tree exits.

## What Is Saved

Agent Snapshot saves state for paths the traced program actually observes or mutates. It does not scan the filesystem before launch.

At a high level, the snapshot includes:

- Paths opened for reading or writing.
- Paths checked for existence or metadata.
- Directories traversed by the program.
- Files created, renamed, truncated, modified, or deleted.
- Before and after metadata for observed paths.
- Content blobs for files whose contents are needed for reconstruction.
- Git repository roots and commit hashes for Git-tracked files.

The snapshot is intentionally compact. Clean Git-tracked files that are only read are represented by Git metadata instead of copied into the snapshot. Files owned by another user and not writable by the current user are treated as part of the external system environment and are not recorded.

There are important exceptions:

- If the traced program writes a regular file, Agent Snapshot saves its after-state blob even if the file is clean and Git-tracked when the program exits.
- Dirty Git-tracked files, untracked files, and Git-ignored files under a Git repo are copied when their contents are needed.
- Deleted files are represented with tombstones.
- Any path inside a `.git` directory is ignored.
- Paths explicitly listed in the ignore configuration are ignored.
- The ignore configuration file itself is ignored.

Agent Snapshot currently does not aim to save every possible source of process behavior. It does not snapshot environment variables, process limits, network state, complete directory entry listings, or arbitrary non-filesystem resources. It is also currently focused on Linux x86_64 syscall decoding.

The manifest is always written as UTF-8 JSON. Existing valid UTF-8 strings are preserved. Non-UTF-8 path bytes are converted through Latin-1 before writing so the manifest remains readable JSON, but byte-exact path restoration for such names is not yet represented separately in the snapshot format.

## Snapshot Format

A snapshot is a directory bundle:

```text
snapshot-dir/
  manifest.json
  blobs.parquet
```

`blobs.parquet` is created only when at least one blob is captured. It is a Parquet table with `key` and binary `content` columns (Snappy-compressed column chunks). Rows are written in bounded row groups during capture so the tool does not keep every blob in memory.

`manifest.json` contains:

- `command`: command and arguments that were launched.
- `exit_status`: recorded command status field.
- `start_cwd`: working directory where Agent Snapshot was launched.
- `uid` and `gid`: user and group running Agent Snapshot.
- `git_repositories`: Git repositories observed by the traced program.
- `files`: per-path records.

Each file record contains:

- `path`: absolute path.
- `operations`: observed capabilities such as `read`, `write`, `existence`, `directory`, or `delete`.
- `before`: state captured the first time the path was observed.
- `after`: state captured after the traced process tree exited.
- `git`: Git classification for the path when applicable.

Metadata records include whether the path exists, file type, mode, mtime, and optionally a `blob` key.
Blob keys are state-qualified absolute paths such as `before:/repo/input.txt` or `after:/repo/generated.txt`; payloads for those keys are stored in `blobs.parquet` as described above.

Clean Git-tracked reads typically have no blob:

```json
{
  "path": "/repo/file.txt",
  "operations": ["read"],
  "before": {
    "exists": true,
    "type": "file",
    "mode": 33188,
    "mtime": 1770000000
  },
  "after": {
    "exists": true,
    "type": "file",
    "mode": 33188,
    "mtime": 1770000000
  },
  "git": {
    "in_repo": true,
    "root": "/repo",
    "head": "abc123...",
    "relative_path": "file.txt",
    "tracked": true,
    "dirty": false,
    "ignored": false
  }
}
```

Captured file contents appear as blob references:

```json
{
  "path": "/repo/generated.txt",
  "operations": ["write"],
  "before": {
    "exists": false
  },
  "after": {
    "exists": true,
    "type": "file",
    "mode": 33188,
    "mtime": 1770000001,
    "blob": "after:/repo/generated.txt"
  }
}
```

Deleted files are represented by an after-state tombstone:

```json
{
  "path": "/repo/deleted.txt",
  "operations": ["delete"],
  "before": {
    "exists": true,
    "type": "file",
    "blob": "..."
  },
  "after": {
    "exists": false,
    "tombstone": true
  }
}
```

## Usage

Build with Dune:

```bash
dune build src/ocaml/agent_snapshot.exe
```

The executable is written to:

```text
_build/default/src/ocaml/agent_snapshot.exe
```

The project depends on OCaml, Dune, Yojson, Camomile, and the `ocaml-git` package. Install the latter with opam (for a local checkout next to this repo, `opam install ../../homebox/ocaml-git` from the repository root).

On first use, if `ignore.json` is missing, Agent Snapshot creates the config directory and writes that file using the default path list in the [Configuration](#configuration) section. You can edit the file before or after the first run.
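
For scripts that need to locate `ignore.json` or predict which paths it suppresses, the lookup and matching rules from the [Configuration](#configuration) section can be sketched in Python. This is an illustration, not the tool's OCaml implementation: the function names `config_base`, `ignore_config_path`, `expand_entry`, and `is_ignored` are hypothetical, and treating an empty `XDG_CONFIG_HOME` the same as an unset one is an assumption.

```python
import os
from pathlib import Path


def config_base(env):
    """Return the directory that `$XDG_CONFIG_HOME` stands for."""
    # Assumption: an empty XDG_CONFIG_HOME is treated like an unset one.
    return env.get("XDG_CONFIG_HOME") or os.path.join(env["HOME"], ".config")


def ignore_config_path(env):
    """Return the expected location of ignore.json for this environment."""
    return Path(config_base(env)) / "agent-snapshot" / "ignore.json"


def expand_entry(entry, env):
    """Expand a leading `$HOME` or `$XDG_CONFIG_HOME` prefix, if present."""
    prefixes = (("$XDG_CONFIG_HOME", config_base(env)), ("$HOME", env["HOME"]))
    for name, value in prefixes:
        # Only expand when the variable is the whole first path component.
        if entry == name or entry.startswith(name + "/"):
            return value + entry[len(name):]
    return entry


def is_ignored(path, entries, env):
    """True if `path` equals an expanded entry or is a descendant of one."""
    for entry in entries:
        root = expand_entry(entry, env).rstrip("/")
        if path == root or path.startswith(root + "/"):
            return True
    return False
```

Passing the environment as a plain mapping (for example `dict(os.environ)`) keeps the sketch easy to try with different `HOME` and `XDG_CONFIG_HOME` values. Matching on `root + "/"` rather than a bare string prefix means an entry such as `/tmp/work` covers `/tmp/work/output.txt` but not a sibling like `/tmp/workshop`.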

Run a command under Agent Snapshot:

```bash
dune exec -- agent-snapshot --snapshot-dir snapshot-dir command arg1 arg2
```

For example:

```bash
dune exec -- agent-snapshot --snapshot-dir snapshot-python /usr/bin/python3 script.py
```

`--output` remains available as an alias for `--snapshot-dir`.

<!--
Restore is temporarily disabled while the bare command form is the default.

Restore captured final-state blobs and tombstones in place:

```bash
dune exec -- agent-snapshot restore snapshot-dir
```

Restore only applies files that have blobs and tombstones. Clean Git-tracked files and reconstructable system files are represented in the manifest but are not rewritten by restore.
-->

## Configuration

Snapshot runs read an ignore configuration file. If it is missing, Agent Snapshot creates it with the default JSON array shown below and continues. If the file exists but is not a JSON array of strings, Agent Snapshot aborts before launching the traced command.

The config path is:

```text
$XDG_CONFIG_HOME/agent-snapshot/ignore.json
```

If `XDG_CONFIG_HOME` is unset, the fallback is:

```text
$HOME/.config/agent-snapshot/ignore.json
```

The file is a JSON list of file or directory paths:

```json
[
  "$HOME/.cache",
  "$HOME/.claude",
  "$HOME/.codex",
  "$HOME/.cursor",
  "$XDG_CONFIG_HOME/agent-snapshot/ignore.json",
  "/tmp/scratch-output",
  "/proc",
  "/dev",
  "/usr",
  "/bin"
]
```

Entries may begin with `$HOME` or `$XDG_CONFIG_HOME`. These prefixes are expanded before matching. If `XDG_CONFIG_HOME` is unset, `$XDG_CONFIG_HOME` expands to `$HOME/.config`.

An ignored path suppresses both exact matches and descendants. For example, ignoring `/tmp/work` also ignores `/tmp/work/output.txt`.

The ignore file itself is always ignored, even if it is read by the traced program.

## Tests

Run the test suite with `uv`:

```bash
uv run pytest
```
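
When writing a test or script that inspects a snapshot bundle, the manifest layout described under Snapshot Format can be summarized with a few lines of Python. The sketch below is not part of the repository's test suite; `summarize_manifest` is a hypothetical helper, and it relies only on the `files`, `path`, `after`, `blob`, and `tombstone` fields shown in the examples above. It does not read `blobs.parquet`.

```python
import json
from pathlib import Path


def summarize_manifest(snapshot_dir):
    """Group recorded paths by how their after-state is represented."""
    manifest_path = Path(snapshot_dir) / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

    summary = {"blobs": [], "tombstones": [], "metadata_only": []}
    for record in manifest.get("files", []):
        after = record.get("after", {})
        if after.get("tombstone"):
            # Deleted path: the after-state is a tombstone.
            summary["tombstones"].append(record["path"])
        elif "blob" in after:
            # Captured contents: the payload is keyed in blobs.parquet.
            summary["blobs"].append(record["path"])
        else:
            # For example, a clean Git-tracked file that was only read.
            summary["metadata_only"].append(record["path"])
    return summary
```

Calling `summarize_manifest("snapshot-dir")` on a bundle returns three lists of paths, which is usually enough to assert that a traced command wrote or deleted the files a test expects.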