Hamza Khaled Mahmoud — AI Engineering Notes

Feed: https://h19overflow.github.io/Portfolio/
Essential practical notes on AI systems, agents, tooling, orchestration, security, and evals.
Language: en

Agent Harness Loop
https://h19overflow.github.io/Portfolio/share/agent-harness-loop.html
Published: Mon, 15 Jun 2026 00:00:00 GMT
Category: Agent Core
An AI agent is not just a model call. It is a runtime that prepares context, calls the model, executes tools, streams output, handles provider quirks, and persists state.

State and Context Compression
https://h19overflow.github.io/Portfolio/share/state-and-context-compression.html
Published: Sun, 14 Jun 2026 00:00:00 GMT
Category: Agent Core
Long-running agents need durable state and controlled context shrinkage. Otherwise they cannot resume, search prior work, audit tool calls, recover after crashes, or stay under...

CLI Runtime UX
https://h19overflow.github.io/Portfolio/share/cli-runtime-ux.html
Published: Sat, 13 Jun 2026 00:00:00 GMT
Category: Agent Core
A responsive agent CLI is an event-driven terminal application, not a linear input() / print() script. Hermes uses prompt_toolkit for input/control and Rich for output...

Separate Memory Layers
https://h19overflow.github.io/Portfolio/share/separate-memory-layers.html
Published: Fri, 12 Jun 2026 00:00:00 GMT
Category: Memory
Memory is not one thing. A production agent needs at least two memory layers: 1. Curated durable memory: compact stable facts that should shape future behavior.

Cache Boundaries and Frozen Snapshots
https://h19overflow.github.io/Portfolio/share/cache-boundaries-and-frozen-snapshots.html
Published: Thu, 11 Jun 2026 00:00:00 GMT
Category: Memory
A memory write and a prompt mutation are not the same operation. Hermes lets the agent write durable memory immediately, but it does not let that write silently rewrite the active...

Fenced Recall and Background Sync
https://h19overflow.github.io/Portfolio/share/fenced-recall-and-background-sync.html
Published: Wed, 10 Jun 2026 00:00:00 GMT
Category: Memory
Retrieved memory is not the same as the user’s message, and it is not the same as durable conversation history. Hermes injects recalled memory into the API-facing copy of the...

Tools as Contracts
https://h19overflow.github.io/Portfolio/share/tools-as-contracts.html
Published: Tue, 09 Jun 2026 00:00:00 GMT
Category: Tooling
A tool is not just a function. A tool is a contract shown to the model. That contract includes: Every visible tool consumes prompt space and model attention. If too many tools...

Tool Registry, Discovery, and Dispatch
https://h19overflow.github.io/Portfolio/share/tool-registry-discovery-and-dispatch.html
Published: Mon, 08 Jun 2026 00:00:00 GMT
Category: Tooling
A scalable agent tool system needs decoupled registration and centralized execution. Hermes solves this with: 1. self-registering tool modules; 2. AST pre-scan before import;

Progressive Disclosure and Tool Search
https://h19overflow.github.io/Portfolio/share/progressive-disclosure-and-tool-search.html
Published: Sun, 07 Jun 2026 00:00:00 GMT
Category: Tooling
Every tool schema sent to the model costs tokens on every turn. A scalable tool ecosystem cannot expose every optional MCP/plugin tool directly all the time.

Skills as Procedures
https://h19overflow.github.io/Portfolio/share/skills-as-procedures.html
Published: Sat, 06 Jun 2026 00:00:00 GMT
Category: Skills
Many things agents need are not new actions. They are better ways to use existing actions. A terminal is a tool. Incident response is a skill.

Skill Mounting and Supply Chain
https://h19overflow.github.io/Portfolio/share/skill-mounting-and-supply-chain.html
Published: Fri, 05 Jun 2026 00:00:00 GMT
Category: Skills
Skills are prompt-affecting dependencies. Once skills can be installed, synced, viewed, invoked, or loaded from external directories, they become part of the agent supply chain.

Orchestration Boundaries
https://h19overflow.github.io/Portfolio/share/orchestration-boundaries.html
Published: Thu, 04 Jun 2026 00:00:00 GMT
Category: Orchestration
Subagents are not magic. They are concurrent workers with context, tools, credentials, time limits, logs, cleanup needs, and failure modes. If agents can recursively spawn agents...

Threading in Delegation
https://h19overflow.github.io/Portfolio/share/threading-in-delegation.html
Published: Wed, 03 Jun 2026 00:00:00 GMT
Category: Orchestration
Hermes uses threading because the core agent loop is synchronous, while delegated subagent work is often independent and I/O-bound. Threading lets Hermes fan out multiple subagents...

Trust Boundaries Across Agent Systems
https://h19overflow.github.io/Portfolio/share/trust-boundaries-across-agent-systems.html
Published: Tue, 02 Jun 2026 00:00:00 GMT
Category: Security
AI agents turn text into action. That means any text that can influence future model behavior, tool use, or subagent behavior is part of the security boundary.

Authentication and Credential Pools
https://h19overflow.github.io/Portfolio/share/authentication-and-credential-pools.html
Published: Mon, 01 Jun 2026 00:00:00 GMT
Category: Security
Agent auth is runtime infrastructure, not just environment variables. Long-running agents need credential status tracking, rotation, exhaustion handling, OAuth refresh recovery, and...

Advanced Evals for LLM Agents: Drift, Tool Use, and Task Fulfillment
https://h19overflow.github.io/Portfolio/share/advanced-evals-for-llm-agents-drift-tool-use-and-task-fulfillment.html
Published: Sun, 31 May 2026 00:00:00 GMT
Category: Evaluation
Last updated: 2026-06-15. This guide synthesizes the user's timestamped notes from Phil Hetzel's Braintrust talk, plus current public guidance from Braintrust, OpenAI Evals/agent...

Advanced Agent Evals Overview
https://h19overflow.github.io/Portfolio/share/advanced-agent-evals-overview.html
Published: Sat, 30 May 2026 00:00:00 GMT
Category: Evaluation
Advanced evals turn agent behavior into measurable, replayable, and improvable evidence. This folder is the learning breakdown: Agents are hard to evaluate because they are...

Eval Philosophy and Production Flywheel
https://h19overflow.github.io/Portfolio/share/eval-philosophy-and-production-flywheel.html
Published: Fri, 29 May 2026 00:00:00 GMT
Category: Evaluation
LLM and agent evals are decision-support systems, not perfect truth machines. They help decide whether a model, prompt, tool, or workflow change is safe to ship.

Score Vectors and Hard Gates
https://h19overflow.github.io/Portfolio/share/score-vectors-and-hard-gates.html
Published: Thu, 28 May 2026 00:00:00 GMT
Category: Evaluation
Do not compress agent quality into one number. Use a score vector plus hard gates. A single average hides dangerous failures. Example: a candidate improves overall...

Model Behavior Evals
https://h19overflow.github.io/Portfolio/share/model-behavior-evals.html
Published: Wed, 27 May 2026 00:00:00 GMT
Category: Evaluation
Model behavior evals check whether the model output follows the behavioral contract: valid format, correct content, grounded claims, safe response, and appropriate style.

Tool and Trajectory Evals
https://h19overflow.github.io/Portfolio/share/tool-and-trajectory-evals.html
Published: Tue, 26 May 2026 00:00:00 GMT
Category: Evaluation
For agents, the path matters. A correct-looking final answer is not enough if the agent used the wrong tool, passed bad arguments, ignored tool output, or mutated state incorrectly.

Task Fulfillment and State Evals
https://h19overflow.github.io/Portfolio/share/task-fulfillment-and-state-evals.html
Published: Mon, 25 May 2026 00:00:00 GMT
Category: Evaluation
A task is successful only when the user's real goal is satisfied and the external world is in the correct state. Agents can produce polished final messages while failing the real task:

Drift, Regression, and Enhancement
https://h19overflow.github.io/Portfolio/share/drift-regression-and-enhancement.html
Published: Sun, 24 May 2026 00:00:00 GMT
Category: Evaluation
A change is not an enhancement unless it improves target behavior without introducing hidden regressions in safety, tools, state, cost, latency, or important slices.

Datasets, Scorers, and Judges
https://h19overflow.github.io/Portfolio/share/datasets-scorers-and-judges.html
Published: Sat, 23 May 2026 00:00:00 GMT
Category: Evaluation
The quality of an eval is limited by the quality of its dataset and scorer. A good dataset captures real risks; a good scorer checks the actual contract.

Traces, CI, and Production Monitoring
https://h19overflow.github.io/Portfolio/share/traces-ci-and-production-monitoring.html
Published: Fri, 22 May 2026 00:00:00 GMT
Category: Evaluation
If you cannot replay or inspect a run, you cannot reliably debug it or turn it into a regression test. Without traces and CI integration: Capture enough trace data to reconstruct...
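
The "Score Vectors and Hard Gates" entry in this feed recommends a score vector plus hard gates rather than one averaged number. The snippet below is a minimal sketch of that idea, not the method from the notes: the metric names, the threshold, and the helpers `mean_score` and `check_hard_gates` are all hypothetical, chosen only to show how an average can improve while a gate fails.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GateResult:
    """Outcome of checking a score vector against hard gates."""
    passed: bool
    failures: List[str]


def mean_score(scores: Dict[str, float]) -> float:
    """Collapse a score vector into a single average (the lossy view)."""
    return sum(scores.values()) / len(scores)


def check_hard_gates(scores: Dict[str, float], gates: Dict[str, float]) -> GateResult:
    """Fail if any gated metric is below its minimum, regardless of the average.

    A metric missing from `scores` is treated as 0.0, so it fails its gate.
    """
    failures = [
        name for name, minimum in gates.items()
        if scores.get(name, 0.0) < minimum
    ]
    return GateResult(passed=not failures, failures=failures)


if __name__ == "__main__":
    # Hypothetical numbers: the candidate has a better average but worse safety.
    baseline = {"task_success": 0.80, "tool_accuracy": 0.85, "safety": 0.99}
    candidate = {"task_success": 0.95, "tool_accuracy": 0.90, "safety": 0.90}
    gates = {"safety": 0.98}  # assumed minimum, for illustration only

    print("average improved:", mean_score(candidate) > mean_score(baseline))
    print("candidate gates:", check_hard_gates(candidate, gates))
```

With these illustrative values the candidate's average is higher than the baseline's, yet it is rejected because the gated metric fell below its minimum. That is the kind of failure a single average would hide.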