agent-snapshot.git

Clone (read-only): git clone http://git.guha-anderson.com/git/agent-snapshot.git

Add README

Author
Arjun Guha <a.guha@northeastern.edu>
Date
2026-05-02 20:32:28 -0400
Commit
0df5a2eb62acf66a9f7376d586636a2f01ee4b68
README.md
new file mode 100644
index 0000000..8133c42
--- /dev/null
+++ b/README.md
@@ -0,0 +1,240 @@
+# Agent Snapshot
+
+Agent Snapshot is a Linux-only command-line tool for running another program under
+`ptrace` and recording the filesystem state that program observes or changes. It
+is intended for compact snapshots of command execution: enough information to
+reconstruct relevant file state without copying the whole filesystem.
+
+The tool traces the launched process and children created with `fork`, `vfork`,
+or `clone`. It records paths seen through filesystem-related syscalls such as
+opens, stats, directory traversal, renames, deletes, and truncation. For each
+recorded path, the snapshot stores a before-state captured at first observation
+and an after-state captured when the traced process tree exits.
+
+## What Is Saved
+
+Agent Snapshot saves state for paths the traced program actually observes or
+mutates. It does not scan the filesystem before launch.
+
+At a high level, the snapshot includes:
+
+- Paths opened for reading or writing.
+- Paths checked for existence or metadata.
+- Directories traversed by the program.
+- Files created, renamed, truncated, modified, or deleted.
+- Before and after metadata for observed paths.
+- Content blobs for files whose contents are needed for reconstruction.
+- Git repository roots and commit hashes for Git-tracked files.
+
+The snapshot is intentionally compact. Clean Git-tracked files that are only read
+are represented by Git metadata instead of copied into the snapshot. Files owned
+by another user and not writable by the current user are treated as part of the
+external system environment and are not copied.
+
+There are important exceptions:
+
+- If the traced program writes a regular file, Agent Snapshot saves its
+  after-state blob even if the file is clean and Git-tracked when the program
+  exits.
+- Dirty Git-tracked files, untracked files, and Git-ignored files under a Git
+  repo are copied when their contents are needed.
+- Deleted files are represented with tombstones.
+- Any path inside a `.git` directory is ignored.
+- Paths explicitly listed in the ignore configuration are ignored.
+- The ignore configuration file itself is ignored.
+
+Agent Snapshot currently does not aim to save every possible source of process
+behavior. It does not snapshot environment variables, process limits, network
+state, complete directory entry listings, or arbitrary non-filesystem resources.
+It is also currently focused on Linux x86_64 syscall decoding.
+
+Non-UTF-8 pathnames are a known limitation: the current JSON manifest stores
+paths as JSON strings, and the JSON library rejects invalid UTF-8.
+
+## Snapshot Format
+
+A snapshot is a directory bundle:
+
+```text
+snapshot-dir/
+  manifest.json
+  blobs/
+    <content-digest>
+```
+
+`manifest.json` contains:
+
+- `format_version`: snapshot format version.
+- `command`: command and arguments that were launched.
+- `exit_status`: status recorded for the launched command.
+- `start_cwd`: working directory where Agent Snapshot was launched.
+- `uid` and `gid`: user and group running Agent Snapshot.
+- `git_repositories`: Git repositories observed by the traced program.
+- `files`: per-path records.
+
+Each file record contains:
+
+- `path`: absolute path.
+- `operations`: observed capabilities such as `read`, `write`, `existence`,
+  `directory`, or `delete`.
+- `before`: state captured the first time the path was observed.
+- `after`: state captured after the traced process tree exited.
+- `git`: Git classification for the path when applicable.
+
+Metadata records include whether the path exists, file type, mode, size, mtime,
+and optionally a `blob` digest. Blob files live under `blobs/` and are addressed
+by digest. The digest is currently an internal content-addressing key, not a
+cryptographic integrity guarantee.
+
+Clean Git-tracked reads typically have no blob:
+
+```json
+{
+  "path": "/repo/file.txt",
+  "operations": ["read"],
+  "before": {
+    "exists": true,
+    "type": "file",
+    "mode": 33188,
+    "size": 12,
+    "mtime": 1770000000
+  },
+  "after": {
+    "exists": true,
+    "type": "file",
+    "mode": 33188,
+    "size": 12,
+    "mtime": 1770000000
+  },
+  "git": {
+    "in_repo": true,
+    "root": "/repo",
+    "head": "abc123...",
+    "relative_path": "file.txt",
+    "tracked": true,
+    "dirty": false,
+    "ignored": false
+  }
+}
+```
+
+Captured file contents appear as blob references:
+
+```json
+{
+  "path": "/repo/generated.txt",
+  "operations": ["write"],
+  "before": {
+    "exists": false
+  },
+  "after": {
+    "exists": true,
+    "type": "file",
+    "mode": 33188,
+    "size": 18,
+    "mtime": 1770000001,
+    "blob": "0d88229adcb64ea7"
+  }
+}
+```
+
+Deleted files are represented by an after-state tombstone:
+
+```json
+{
+  "path": "/repo/deleted.txt",
+  "operations": ["delete"],
+  "before": {
+    "exists": true,
+    "type": "file",
+    "blob": "..."
+  },
+  "after": {
+    "exists": false,
+    "tombstone": true
+  }
+}
+```
+
+## Usage
+
+Build with CMake:
+
+```bash
+cmake -S . -B build
+cmake --build build --parallel
+```
+
+Create the required ignore configuration before running snapshots:
+
+```bash
+mkdir -p "${XDG_CONFIG_HOME:-$HOME/.config}/agent-snapshot"
+printf '[]\n' > "${XDG_CONFIG_HOME:-$HOME/.config}/agent-snapshot/ignore.json"
+```
+
+Run a command under Agent Snapshot:
+
+```bash
+build/agent-snapshot --output snapshot-dir -- command arg1 arg2
+```
+
+For example:
+
+```bash
+build/agent-snapshot --output snapshot-python -- /usr/bin/python3 script.py
+```
+
+Restore captured final-state blobs and tombstones in place:
+
+```bash
+build/agent-snapshot restore snapshot-dir
+```
+
+Restore applies only records that carry a blob or a tombstone. Clean Git-tracked
+files and reconstructable system files are represented in the manifest but are
+not rewritten by restore.
+
+## Configuration
+
+Snapshot runs require an ignore configuration file. If it is missing or not a
+JSON array of strings, Agent Snapshot aborts before launching the traced command.
+
+The config path is:
+
+```text
+$XDG_CONFIG_HOME/agent-snapshot/ignore.json
+```
+
+If `XDG_CONFIG_HOME` is unset, the fallback is:
+
+```text
+$HOME/.config/agent-snapshot/ignore.json
+```
+
+The file is a JSON list of file or directory paths:
+
+```json
+[
+  "$HOME/.cache",
+  "$XDG_CONFIG_HOME/agent-snapshot/ignore.json",
+  "/tmp/scratch-output"
+]
+```
+
+Entries may begin with `$HOME` or `$XDG_CONFIG_HOME`. These prefixes are expanded
+before matching. If `XDG_CONFIG_HOME` is unset, `$XDG_CONFIG_HOME` expands to
+`$HOME/.config`.
+
+An ignored path suppresses both exact matches and descendants. For example,
+ignoring `/tmp/work` also ignores `/tmp/work/output.txt`.
+
+The ignore file itself is always ignored, even if it is read by the traced
+program.
+
+## Tests
+
+Run the test suite with `uv`:
+
+```bash
+uv run pytest
+```
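
The README's "Snapshot Format" and "Usage" sections say that restore applies only blobs and tombstones. The following is a minimal Python sketch of that rule, not the tool's actual implementation. It assumes restore looks only at the `after` state of each record; the function name `restore_action` and the labels `write-blob`, `delete`, and `skip` are hypothetical. The three records are the examples from the README, and the `mode` value is decoded with the standard `stat` module.

```python
import stat


def restore_action(record: dict) -> str:
    """Classify one `files` record by what a restore pass would plausibly do.

    Assumption: only the `after` state matters for restore.
    """
    after = record.get("after", {})
    if after.get("exists") and "blob" in after:
        return "write-blob"   # final contents were captured in blobs/
    if after.get("tombstone"):
        return "delete"       # the path was removed by the traced program
    return "skip"             # e.g. a clean Git-tracked read, no blob stored


unchanged = {"exists": True, "type": "file", "mode": 33188,
             "size": 12, "mtime": 1770000000}

records = [
    {"path": "/repo/file.txt", "operations": ["read"],
     "before": dict(unchanged), "after": dict(unchanged)},
    {"path": "/repo/generated.txt", "operations": ["write"],
     "before": {"exists": False},
     "after": {"exists": True, "type": "file", "mode": 33188, "size": 18,
               "mtime": 1770000001, "blob": "0d88229adcb64ea7"}},
    {"path": "/repo/deleted.txt", "operations": ["delete"],
     "before": {"exists": True, "type": "file", "blob": "..."},
     "after": {"exists": False, "tombstone": True}},
]

actions = {r["path"]: restore_action(r) for r in records}
for path, action in actions.items():
    print(f"{action:10s} {path}")

# The `mode` field is a raw st_mode value: 33188 == 0o100644.
mode_text = stat.filemode(33188)
print(mode_text)
```

Under these assumptions, only `/repo/generated.txt` and `/repo/deleted.txt` would be touched by a restore, and `33188` decodes to a regular file with permissions `rw-r--r--`.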
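
The README's "Configuration" section describes prefix expansion and descendant matching for ignore entries. The sketch below illustrates those rules in Python under stated assumptions: the helper names `expand_entry` and `is_ignored` are hypothetical, a prefix is expanded only when it is the whole entry or is followed by `/`, and an empty `XDG_CONFIG_HOME` is treated like an unset one. The real tool may differ in these details.

```python
from pathlib import PurePosixPath


def expand_entry(entry: str, env: dict) -> str:
    """Expand a leading $HOME or $XDG_CONFIG_HOME in an ignore entry."""
    home = env["HOME"]
    # Fallback described in the README: $XDG_CONFIG_HOME -> $HOME/.config.
    xdg = env.get("XDG_CONFIG_HOME") or f"{home}/.config"
    for prefix, value in (("$XDG_CONFIG_HOME", xdg), ("$HOME", home)):
        if entry == prefix or entry.startswith(prefix + "/"):
            return value + entry[len(prefix):]
    return entry


def is_ignored(path: str, entries: list, env: dict) -> bool:
    """True if `path` equals an ignore entry or lies beneath one."""
    target = PurePosixPath(path)
    for entry in entries:
        ignored = PurePosixPath(expand_entry(entry, env))
        # Compare whole path components, so /tmp/work does not
        # accidentally match /tmp/workspace.
        if target == ignored or ignored in target.parents:
            return True
    return False


env = {"HOME": "/home/alice"}  # XDG_CONFIG_HOME deliberately unset
entries = [
    "$HOME/.cache",
    "$XDG_CONFIG_HOME/agent-snapshot/ignore.json",
    "/tmp/work",
]

print(is_ignored("/tmp/work/output.txt", entries, env))
print(is_ignored("/tmp/workspace/output.txt", entries, env))
print(is_ignored("/home/alice/.config/agent-snapshot/ignore.json", entries, env))
```

Matching on path components rather than raw string prefixes is a design choice of this sketch; it keeps sibling directories with a shared name prefix from being ignored by accident.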