AI Audit Trail: Implementation Guide for EU AI Act Article 12

Q: What does Article 12 of the EU AI Act require?

Article 12 requires high-risk AI systems to automatically record events — "logs" — over their lifetime, sufficient to allow traceability of functioning, post-market monitoring, and serious-incident investigation. The records must identify inputs sufficiently to follow the trail, and reference any persons involved in verification. The provision is binding on providers; deployers carry the parallel Article 26(6) obligation to retain logs under their control for at least six months.

An AI audit trail is the unblinking record of what an AI system did, when, on whose behalf, and with what consequence. EU AI Act Article 12 requires it in the strongest terms the Act uses: automatic, throughout the system's lifecycle, and sufficient to allow traceability and post-market monitoring. The Italian Garante, French CNIL, and German BSI guidance all extend the same expectation to AI processing personal data under GDPR.

This guide is the practitioner's reference. It covers exactly what to capture, the format trade-offs (JSONL vs database vs SIEM), retention rules, the EU-vs-US distinction, and how Knowlee implements an Article-12-grade audit trail end-to-end through the agent runtime wrapper, the job registry, and structured per-job log artifacts.

If you are evaluating compliance platforms, ask the vendor to demonstrate a live query against their audit trail before signing — see the buyer framework in /blog/ai-act-compliance-software-guide.

What Article 12 Actually Requires

Article 12 of Regulation (EU) 2024/1689 obliges providers of high-risk AI systems to ensure the system automatically records events ("logs") over its lifetime; deployers pick up the matching retention duty under Article 26(6). The provisions are concrete:

Automatic. Manual screenshots or after-the-fact summaries do not satisfy the Article. The system itself must produce the log.
Lifetime. Retention extends across the system's operational life, not a tax-style five-or-seven-year window.
Sufficient for post-market monitoring. The log must let the provider monitor functioning over time and identify reasonably foreseeable risks.
Sufficient for serious-incident investigation. When an incident occurs, the log must enable reconstruction of what the system did.
Identifies inputs sufficient for trace. Input data identification at a level allowing the trail to be followed.
Records reference data and persons involved in verification. Where applicable.

For deployers, Article 26(6) adds: "deployers shall keep the logs automatically generated by that high-risk AI system to the extent such logs are under their control … for a period appropriate to the intended purpose … of at least six months."

The minimum bar is six months for deployers. For providers, the QMS requirements under Article 17 push retention much longer — typically the full operational lifetime plus an extension for serious-incident investigation.

What an AI Audit Trail Must Capture

The Act stops short of prescribing fields, deliberately, because AI systems vary. The standards-track interpretation — emerging from ISO/IEC 42001:2024 §7.5 and from sectoral guidance (EBA, ECB, Banca d'Italia AI guidance for banks; CNIL guidance; BSI AIC4 catalog) — converges on the following minimum fields.

Per-inference / per-AI-call fields

For each individual call to an AI model (or each agent action), the trail must capture:

Field	Why it matters
Call ID (UUID)	Unique reference for cross-correlation across logs and incidents.
Timestamp (ISO-8601, UTC)	Establishes ordering, retention boundaries, and incident reconstruction.
System ID	Which AI system in the inventory was invoked.
Model identifier and version	"GPT-4o-2024-11-20" not "GPT-4". A drift-on-version-change incident is invisible without this.
Prompt template ID and version (or prompt hash)	The instructions the model received, captured by reference rather than full content (cost and confidentiality).
Input fingerprint	A hash or summarization of input sufficient to identify the input class without storing the raw payload (which may include personal data).
Tool calls / external resource access	Each MCP/tool invocation, with arguments and results.
Output (or output fingerprint)	Final result returned to the caller.
Operator identity	The natural or legal person on whose behalf the call was made.
Confidence / certainty signal	Where the model emits one.
Decision outcome	What happened as a result (action taken, deferred, rejected, escalated).
Override flag	Whether a human overrode the AI output.
Approver identity	If the call is in a human-oversight required flow.
Approval timestamp	When the approver signed off.
Token / cost metrics	Useful for both finance and detecting anomalous behavior.

For agentic systems — where one user prompt triggers a chain of model calls and tool invocations — the trail must additionally preserve the call tree: which AI call invoked which next call, with parent/child IDs.
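
As an illustration of the fields above, the following is a minimal sketch of a per-call record that serializes to one JSON line. It covers only a subset of the table, and every name in it (AuditEvent, fingerprint, the prompt_template ID scheme) is an assumption made for this example, not Knowlee's schema. The parent_call_id field carries the call-tree linkage for agentic flows. Note that a plain SHA-256 of low-entropy personal data can be guessed by brute force, so a keyed hash may be the safer choice in practice.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def fingerprint(payload: str) -> str:
    """Return a SHA-256 hex digest so the raw payload need not be stored."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class AuditEvent:
    system_id: str
    model_version: str
    prompt_template: str           # e.g. "credit-summary@v7" (hypothetical ID scheme)
    input_fingerprint: str
    output_fingerprint: str
    operator_id: str
    decision_outcome: str          # "taken" | "deferred" | "rejected" | "escalated"
    parent_call_id: Optional[str] = None   # None marks the root of a call tree
    override: bool = False
    approver_id: Optional[str] = None
    approved_at: Optional[str] = None
    call_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_jsonl(self) -> str:
        """Serialize to a single JSON line (json.dumps escapes embedded newlines)."""
        return json.dumps(asdict(self), sort_keys=True)
```

A child call in an agentic chain would be created with parent_call_id=root.call_id, so the tree can be rebuilt later from the log alone.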

Per-system / per-day fields

In addition to per-inference logs, the audit trail should aggregate:

Daily inference counts per system / per use case.
Distribution shifts in input data (drift signals).
Accuracy metrics where post-hoc evaluation is possible.
Incident count and type.
Override ratio (how often human operators reject AI output).
System uptime and availability.

These are post-market-monitoring artifacts under Article 72, not Article 12 directly, but auditors expect to see them connected to the underlying log stream.
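
One way to keep the aggregates connected to the log stream is to derive them directly from it. The sketch below computes daily call counts and the override ratio per system; it assumes each event carries timestamp (ISO-8601), system_id, and an optional override flag, which are illustrative field names rather than a prescribed schema.

```python
import json
from collections import defaultdict


def daily_summary(jsonl_lines):
    """Aggregate per-inference events into per-day, per-system counters."""
    stats = defaultdict(lambda: {"calls": 0, "overrides": 0})
    for line in jsonl_lines:
        if not line.strip():
            continue                          # tolerate blank lines
        event = json.loads(line)
        day = event["timestamp"][:10]         # "YYYY-MM-DD" prefix of ISO-8601
        key = (day, event["system_id"])
        stats[key]["calls"] += 1
        if event.get("override"):
            stats[key]["overrides"] += 1
    # Every key has at least one call, so the division is safe.
    return {
        key: {**s, "override_ratio": s["overrides"] / s["calls"]}
        for key, s in stats.items()
    }
```

Because the summary is a pure function of the log lines, an auditor can recompute it from the system of record and compare.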

Format: Why JSONL Beats Almost Everything

Implementation choices for AI audit trails fall into three families:

1. JSONL (newline-delimited JSON) streamed to disk

Each line is one event. Append-only. Trivially parsable by jq, grep, awk. Ships natively to any log aggregator (Loki, Datadog, Splunk, Elastic). Tamper-evident with simple per-line hashing. Storage cheap.

Why it wins for AI: the unit of capture (one inference) maps cleanly to one line. Streaming is natural. The format does not impose a schema, so model output fields, tool calls, and reasoning traces fit without ETL.

2. Relational database table

Strong for deterministic OLTP — audit_event table with foreign keys to systems, users, models. Easy to query.

Why it loses on AI workloads: rich nested payloads (tool call trees, JSON outputs, full reasoning traces) require either JSON columns (then queries leave SQL anyway) or denormalized text fields (then the schema is a lie). High-cardinality, append-heavy writes hit the database harder than necessary.

3. SIEM event stream

Acceptable as a downstream consumer. It fails as a primary capture format because SIEM event schemas are designed for security events (firewall rule hits, login attempts), not AI inference.

The defensible architecture in 2026 is JSONL as the system of record, with downstream feeds to a database (for low-cardinality reporting) and a SIEM (for security correlation). Log-shipping is solved; do not let it dictate the capture format.
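
To make the "one event, one line" idea concrete, here is a small sketch of an append-only writer and a reader. The function names and the example event shape are assumptions for illustration. Nested payloads such as tool calls go in unchanged, and json.dumps escapes any newline inside a string, so each event stays on one line.

```python
import json
import os


def append_event(path, event: dict) -> None:
    """Append one event as one line; "a" mode never rewrites earlier lines."""
    line = json.dumps(event, separators=(",", ":"))
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())    # ask the OS to push the line to disk


def read_events(path):
    """Yield events in write order, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

The fsync on every write trades throughput for durability; a high-volume system might batch writes instead.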

Knowlee uses JSONL throughout. The agent runtime wrapper invokes the Claude CLI with --output-format=stream-json, producing one JSON object per event (model response, tool call, error). Per-Claude-Code-session JSONL files live under ~/.knowlee-studio-sessions/:userId/:context/ for studio sessions. Per-job logs land at the audit trail. Structured per-job outputs land at the structured report store. This is the reference pattern.

Retention: How Long, Where, and Who Can Delete

Retention is governed by overlapping rules:

AI Act Article 12 — system lifetime (no fixed minimum).
AI Act Article 26(6) — six months minimum for deployers.
Sector-specific — EBA / ECB / Banca d'Italia AI guidance for banks: typically 5–10 years for credit-decisioning logs.
GDPR Article 5(1)(e) — storage limitation. Personal data not kept longer than necessary.
GDPR Article 17 — right to erasure, with exceptions for legal obligations.

The reconciliation:

Logs that contain no personal data: retain per Article 12 lifetime.
Logs that contain personal data: retain only as long as necessary for the purpose, with technical capability to redact or delete on data-subject request — but the Article 12 obligation creates a competing legal basis that often justifies longer retention than GDPR alone would.
Where retention is contested, document the legal-basis decision and the technical implementation. Auditors and supervisory authorities expect a written reconciliation.
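
A written reconciliation is easier to enforce when it also exists as data. The sketch below encodes a retention matrix per data category and a check for expiry. The categories, legal-basis labels, and day counts are placeholders invented for this example; real values have to come from the organization's own legal review.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetentionRule:
    legal_basis: str
    max_age_days: Optional[int]    # None = keep for the system's lifetime


# Hypothetical matrix, for illustration only.
RETENTION_MATRIX = {
    "no_personal_data": RetentionRule("AI Act Art. 12", None),
    "personal_data": RetentionRule("AI Act Art. 26(6) + GDPR Art. 5(1)(e)", 183),
    "credit_decision": RetentionRule("sector guidance", 3650),
}


def is_expired(category: str, created: date, today: date) -> bool:
    """Return True when a record has outlived its category's retention rule."""
    rule = RETENTION_MATRIX[category]    # unknown category raises KeyError on purpose
    if rule.max_age_days is None:
        return False
    return today - created > timedelta(days=rule.max_age_days)
```

Raising on an unknown category is a deliberate choice here: a record with no documented rule should stop the purge job rather than be silently kept or deleted.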

Knowlee's current state on retention: retention TTL on JSONL is open as Gap GP-008 in TECHNICAL-COMPLIANCE-MAP.md. Per-vertical Supabase data has its own lifecycle. The full Article 17 GDPR purge endpoint is gap GP-009. The honest framing for buyers: today, default retention is unbounded; configurable retention and per-user purge are scheduled for closure ahead of Knowlee's ISO 42001 audit.

EU vs US: Where the Standards Diverge

The EU AI Act / ISO 42001 / Italian Garante / French CNIL framework treats the audit trail as a provider obligation by default and a deployer obligation under Article 26. The US framework combines the NIST AI RMF with sector- and state-specific rules (NYC Local Law 144 on automated employment decision tools, California ADMT regulations under CPRA). It is fragmented but converging on similar capture requirements:

NYC Local Law 144 requires bias audits of automated employment decision tools, with publication of summary results — implies an underlying audit trail capable of supporting a bias audit.
California ADMT (CPRA) requires risk assessments and the ability to opt out of certain automated decisioning, which implies inference-level traceability.
NIST AI RMF Govern, Map, Measure, Manage functions all assume the existence of a comprehensive operational record.

The pragmatic implementation answer: a single audit-trail architecture aligned with Article 12 satisfies most US sectoral requirements simultaneously, with jurisdiction-specific filters layered on top for retention, redaction, and reporting. There is no defensible reason to operate divergent capture for EU and US workloads.

How Knowlee Implements an Article-12-Grade Audit Trail

The implementation in this repository is a worked reference for an open-source-friendly AI runtime.

Streaming layer — the agent runtime wrapper

Every agent runtime child process is invoked with --output-format=stream-json --verbose. Each event from the CLI — model response, tool call, error — is one JSON line. The runner forwards stdin, captures stdout/stderr, and produces a complete JSONL file. Token counts, model versions, and reasoning traces are preserved. This is the foundation: the audit trail is the system's stdout, not an afterthought.
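
The capture pattern described above can be sketched in a few lines. This is not Knowlee's runner: run_and_capture is a hypothetical helper, and the decision to wrap non-JSON output in a {"type": "raw"} envelope is an assumption made so the file always stays valid JSONL. The cmd argument would be the agent CLI with the flags named above; the stdin forwarding the real runner performs is omitted here.

```python
import json
import subprocess
import sys


def run_and_capture(cmd, log_path):
    """Run cmd, copy each stdout/stderr line into log_path as JSONL, return exit code."""
    with open(log_path, "a", encoding="utf-8") as log, subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        for raw in proc.stdout:
            raw = raw.rstrip("\n")
            if not raw:
                continue
            try:
                parsed = json.loads(raw)
            except json.JSONDecodeError:
                parsed = None
            # Keep structured events verbatim; wrap anything else.
            line = raw if isinstance(parsed, dict) else json.dumps(
                {"type": "raw", "text": raw}
            )
            log.write(line + "\n")
            log.flush()
    return proc.returncode    # set once the Popen context has waited for exit


if __name__ == "__main__":
    # Stand-in child process that emits one JSON event and one stray line.
    child = "import json; print(json.dumps({'type': 'result'})); print('oops')"
    print(run_and_capture([sys.executable, "-c", child], "job.jsonl"))
```

Merging stderr into stdout keeps errors in the same ordered stream as model events, at the cost of losing the distinction between the two channels.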

Per-job artifacts — the audit trail and the structured report store

Every job run produces a log file at the audit trail and, for jobs that emit structured outputs, a report directory at the structured report store. The job-runner entrypoint acquires a lock, appends to the run history log, and runs the script/session under timeouts and idle-watchdog controls.
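
The lock-then-run-under-timeout sequence can be approximated as below. This is a simplified sketch, not the actual entrypoint: run_job and its status strings are hypothetical, the lock is a plain exclusive-create file, and the idle watchdog and run-history append are left out.

```python
import os
import subprocess
import sys


def run_job(cmd, lock_path, timeout_s):
    """Run cmd under an exclusive lock file and a hard timeout; return a status string."""
    try:
        # O_EXCL makes creation fail if another run already holds the lock.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return "skipped: already running"
    try:
        os.close(fd)
        result = subprocess.run(cmd, timeout=timeout_s)
        return f"exit {result.returncode}"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return "killed: timeout"
    finally:
        os.remove(lock_path)    # release the lock on every exit path


if __name__ == "__main__":
    print(run_job([sys.executable, "-c", "print('hello')"], "job.lock", 30))
```

A lock file left behind by a crashed process would block later runs, so a production version would likely record the owner PID and detect stale locks.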

Job registry — the automation registry

Every automated workload is declared in the automation registry with the AI-Act-shaped fields:

risk level — minimal | limited | high | prohibited.
data categories — declarative list of data types processed.
human-oversight required — boolean.
approver and approval timestamp — identity and timestamp of the approver, populated when the job transitions from backlog to runnable.
allowedTools — MCP-server tool allow-list per job.
maxTimeout / idleTimeout / maxTurns — runaway protection.

The 37 jobs in the current registry all carry these fields (see TECHNICAL-COMPLIANCE-MAP.md §6.1, gaps GP-003 through GP-006, all closed 2026-03-28).
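
A registry like this is only useful if malformed entries are caught early. The following sketch validates one entry. The keys allowedTools, maxTimeout, idleTimeout, and maxTurns come from the list above; riskLevel, dataCategories, humanOversightRequired, approvedBy, and approvedAt are assumed spellings chosen for this example, since the real key names are not given here.

```python
RISK_LEVELS = {"minimal", "limited", "high", "prohibited"}
REQUIRED = (
    "riskLevel", "dataCategories", "humanOversightRequired",
    "allowedTools", "maxTimeout", "idleTimeout", "maxTurns",
)


def validate_job(job: dict) -> list:
    """Return a list of problems; an empty list means the entry is well-formed."""
    problems = [f"missing field: {k}" for k in REQUIRED if k not in job]
    if "riskLevel" in job and job["riskLevel"] not in RISK_LEVELS:
        problems.append(f"invalid riskLevel: {job['riskLevel']!r}")
    if "humanOversightRequired" in job and not isinstance(
        job["humanOversightRequired"], bool
    ):
        problems.append("humanOversightRequired must be a boolean")
    # Approver identity and approval timestamp only make sense as a pair.
    if bool(job.get("approvedBy")) != bool(job.get("approvedAt")):
        problems.append("approvedBy and approvedAt must be set together")
    return problems
```

Running a check like this in CI on every registry change would keep the compliance fields from drifting as jobs are added.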

Approval gate — `server.js:scheduleJob()`

The cron scheduler re-reads the job registry on every tick. Jobs that have "human-oversight required" set to true but no recorded approver are skipped. This is the technical enforcement of Article 14: the gate is in the runtime, not in a wiki.
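
The gate itself is a small filter. The sketch below is written in Python for consistency with the other examples and is not the code in server.js; runnable_jobs and the keys humanOversightRequired and approvedBy are assumed names. Treating a missing oversight flag as "required" is a design choice made here so the gate fails closed.

```python
def runnable_jobs(registry: list) -> list:
    """Return the jobs a scheduler may start on this tick."""
    ready = []
    for job in registry:
        # Fail closed: a job with no oversight flag is treated as needing approval.
        needs_human = job.get("humanOversightRequired", True)
        if needs_human and not job.get("approvedBy"):
            continue    # oversight required but nobody has approved
        ready.append(job)
    return ready


if __name__ == "__main__":
    registry = [
        {"id": "digest", "humanOversightRequired": False},
        {"id": "scoring", "humanOversightRequired": True},
        {"id": "outreach", "humanOversightRequired": True, "approvedBy": "alice"},
    ]
    print([job["id"] for job in runnable_jobs(registry)])
```

Because the registry is re-read on each tick, an approval recorded between ticks takes effect on the next one without a restart.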

Approvals log — the approvals log

Every approval is appended to a single approvals log with operator identity, timestamp, prior state, and new state. This is the Article 14 evidence file — a regulator asking "show me who approved deployment of this AI system on this date" gets a one-query answer.
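
A structured approvals log and the "one-query answer" might look like the sketch below. The file layout, function names, and field names are assumptions for illustration; the source only states that each entry holds operator identity, timestamp, prior state, and new state.

```python
import json
from datetime import datetime, timezone


def record_approval(path, job_id, operator, prior_state, new_state):
    """Append one approval record as a JSON line and return it."""
    entry = {
        "job_id": job_id,
        "operator": operator,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prior_state": prior_state,
        "new_state": new_state,
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def approvals_for(path, job_id, on_date):
    """Return approvals for job_id whose UTC timestamp falls on on_date (YYYY-MM-DD)."""
    with open(path, encoding="utf-8") as fh:
        entries = [json.loads(line) for line in fh if line.strip()]
    return [
        e for e in entries
        if e["job_id"] == job_id and e["timestamp"].startswith(on_date)
    ]
```

Matching on the date prefix works because every timestamp is written in UTC ISO-8601; mixed time zones would need proper parsing.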

Per-vertical data isolation

Each vertical (4Sales, 4Talents, 4Marketers, 4Legals, 4Projects, 4Procurement, 4Finance, 4Operations) runs against its own Supabase project. Audit-relevant rows do not cross verticals. The Enterprise Brain (Knowledge Graph + RAG) sees only entities and relationships each vertical chooses to publish — never raw inference payloads.

Studio sessions — `~/.knowlee-studio-sessions/:userId/:context/`

For Knowlee Studio (the per-user Claude Code orchestration layer), each user's sessions land in user-namespaced directories. Session JSONL is searchable across sessions via GET /api/search-conversations. Per-session cost is calculated by calculateSessionCost() in server.js:2850. Live SSE streaming is available via GET /api/studio/sessions/:id/jsonl/stream.

Where Knowlee is honest about gaps

TTL/retention (GP-008) — currently no automatic expiration on JSONL files.
GDPR purge endpoint (GP-009) — DELETE /api/users/:id/data not yet shipped.
Audit export PDF/CSV (GP-010) — JSONL export works; formatted compliance export is roadmap.

These gaps live in GAP-REGISTER.md with priority and effort estimates. We document them publicly because procurement teams should not have to disambiguate marketing claims at the table.

Common Implementation Mistakes

Drawn from real organizations, anonymized:

Logging the prompt and not the prompt template version. The prompt changes weekly. Without a versioned template reference, audit trails of the past lose meaning the next time the prompt is updated. Capture template ID + version, not prompt text.
Logging input but not input fingerprint. Storing raw input (which may contain personal data) creates a GDPR exposure in the audit trail itself. Hash or summarize the input; store the raw input only in a controlled tier with retention bounded by purpose.
Capturing model output but not tool calls. Agentic systems often reach decisions through multiple tool calls. An audit trail that shows "input → output" without intermediate steps is uninvestigable when a tool call introduces an error.
No call-tree linkage in agentic flows. Parent/child IDs across multi-call agentic flows are essential. Without them, reconstructing a multi-step decision is forensic guesswork.
Approvals captured in chat, not in structured logs. A Slack thread saying "ok approved" is not Article 14 evidence. Approvals must be structured records with operator identity and timestamp, queryable.
No retention policy at all. Default-retain-everything is a GDPR risk. Default-purge-after-30-days is an Article 12 risk. The defensible answer is a documented retention matrix per data category, with technical enforcement.
Logs in a tier the operator cannot delete from. GDPR Article 17 requires a path to erasure. Logs in immutable tiers (object lock, WORM storage) need a documented redaction or pseudonymization workflow.
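
The call-tree mistake above is the easiest to show in code. Given events that carry call_id and parent_call_id (assumed field names), the sketch below rebuilds the tree and flags orphans whose parent never reached the log, which is itself a sign of a capture gap.

```python
from collections import defaultdict


def build_call_tree(events):
    """Return (roots, children_by_parent, orphans) as lists/dicts of call IDs."""
    known = {e["call_id"] for e in events}
    children = defaultdict(list)
    roots, orphans = [], []
    for e in events:
        parent = e.get("parent_call_id")
        if parent is None:
            roots.append(e["call_id"])
        elif parent in known:
            children[parent].append(e["call_id"])
        else:
            orphans.append(e["call_id"])    # parent missing from the log
    return roots, dict(children), orphans


def walk(call_id, children, depth=0):
    """Yield (depth, call_id) pairs in depth-first order."""
    yield depth, call_id
    for child in children.get(call_id, []):
        yield from walk(child, children, depth + 1)


if __name__ == "__main__":
    events = [
        {"call_id": "a"},
        {"call_id": "b", "parent_call_id": "a"},
        {"call_id": "c", "parent_call_id": "b"},
    ]
    roots, children, _ = build_call_tree(events)
    for depth, call_id in walk(roots[0], children):
        print("  " * depth + call_id)
```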

FAQ

What does Article 12 of the EU AI Act require?

Article 12 requires high-risk AI systems to automatically record events — "logs" — over their lifetime, sufficient to allow traceability of functioning, post-market monitoring, and serious-incident investigation. The records must identify inputs sufficiently to follow the trail, and reference any persons involved in verification. The provision is binding on providers; deployers carry the parallel Article 26(6) obligation to retain logs under their control for at least six months.

What fields should an AI audit trail capture?

At minimum, per inference: call ID, timestamp, system ID, model + version, prompt template ID + version, input fingerprint, tool calls, output, operator identity, decision outcome, and any approval / override records with approver identity and timestamp. For agentic systems, parent/child IDs preserving the call tree. For provider-side post-market monitoring (Article 72), aggregated daily counts, drift signals, accuracy metrics, incident counts, and override ratios.

Is JSONL the right format for AI audit trails?

For most AI workloads, yes. JSONL pairs cleanly with the unit of capture (one inference per line), is append-only by nature, ships to any log aggregator, supports rich nested payloads without ETL, and is trivially parsable. Relational tables become lossy or denormalized when storing tool-call trees and reasoning traces. SIEM event streams are good downstream consumers but poor primary capture formats. The defensible architecture: JSONL as system of record, with downstream feeds to a database for reporting and to a SIEM for security correlation.

How long must AI audit trails be retained?

EU AI Act Article 12 implies the system's operational lifetime. Article 26(6) sets a six-month minimum for deployers. Sector-specific guidance often extends retention much further — banking and credit decisioning typically require five to ten years. GDPR Article 5(1)(e) constrains retention of personal data. The reconciliation: retain logs without personal data per Article 12 lifetime; retain logs with personal data only as long as needed for the documented purpose, with technical capability to redact or delete on data-subject request.

How does Knowlee implement an Article 12 audit trail?

Knowlee streams JSONL for every agent runtime call via the agent runtime wrapper (using --output-format=stream-json). Per-job logs land under the audit trail; structured outputs under the structured report store. Every job in the automation registry declares risk level, data categories, human-oversight requirement, approver, and approval timestamp. The cron scheduler refuses to execute jobs that have "human-oversight required" set to true but no recorded approval. Approvals append to the approvals log. The audit trail is the system's stdout, not an export.

Can I store the audit trail in my SIEM?

Use the SIEM as a downstream consumer, not the primary capture. SIEM event schemas are designed for security events (logins, firewall hits) and lose fidelity when forced to host nested AI inference payloads. The defensible architecture is JSONL as system of record, shipped to the SIEM for security correlation in addition to the underlying log file.

What happens if my audit trail has gaps?

Gaps are findings. A regulator's information request that lands in a window with no log coverage risks, at best, a Tier 3 fine for inability to provide records (up to €7.5M or 1% of worldwide annual turnover) and, in serious cases, a Tier 2 finding for failing Article 12 itself (up to €15M or 3%). The cheapest insurance is automatic capture with self-monitoring: alarms when log volume drops below an expected baseline, before the auditor finds the gap.
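
A baseline alarm of the kind described can be very simple. In the sketch below, the 24-sample window and the 50% threshold are arbitrary illustrative defaults, and volume_alarm is a hypothetical helper; a real deployment would tune both to its traffic pattern.

```python
from statistics import mean


def volume_alarm(hourly_counts, current, window=24, min_ratio=0.5):
    """Return True when current falls below min_ratio of the recent average."""
    recent = hourly_counts[-window:]
    if not recent:
        return False    # no baseline yet, nothing to compare against
    baseline = mean(recent)
    return current < baseline * min_ratio


if __name__ == "__main__":
    history = [100] * 24
    print(volume_alarm(history, 10))    # far below baseline
    print(volume_alarm(history, 80))    # within tolerance
```

A flat average ignores daily and weekly seasonality, so systems with strong traffic cycles would likely compare against the same hour on previous days instead.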

Does the audit trail need to be tamper-evident?

The Act does not prescribe cryptographic tamper-evidence, but a regulator presented with logs of unknown integrity will likely ask. Per-line hashing, log-shipping to an immutable tier, and signed daily aggregates are inexpensive defenses. Knowlee's JSONL files are append-only by file-system semantics; cryptographic chaining is an open enhancement.
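
For readers who want to see what per-line hash chaining involves, here is a minimal sketch. It is not a Knowlee feature; chain_append, chain_verify, and the record layout are assumptions for this example. Each line stores the hash of the previous line, so editing or deleting a line breaks every hash after it.

```python
import hashlib
import json

GENESIS = "0" * 64    # placeholder "previous hash" for the first line


def _digest(prev: str, event: dict) -> str:
    body = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def chain_append(lines: list, event: dict) -> None:
    """Append event to lines (JSONL strings), linking it to the previous line."""
    prev = json.loads(lines[-1])["hash"] if lines else GENESIS
    record = {"prev": prev, "event": event, "hash": _digest(prev, event)}
    lines.append(json.dumps(record, sort_keys=True))


def chain_verify(lines: list) -> int:
    """Return the index of the first broken line, or -1 if the chain is intact."""
    prev = GENESIS
    for i, line in enumerate(lines):
        rec = json.loads(line)
        if rec["prev"] != prev or _digest(rec["prev"], rec["event"]) != rec["hash"]:
            return i
        prev = rec["hash"]
    return -1
```

A chain on its own only detects edits in the middle of the file. Someone who can rewrite the whole file can also recompute every hash, so the latest hash should be anchored somewhere the log writer cannot modify, for example in a signed daily aggregate.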