Skip to content

Hardening batch: fail-loudly invariants, audit visibility, migration rails, test depth#18

Open
Anmolnoor wants to merge 13 commits into
mainfrom
feat/fcli-hardening
Open

Hardening batch: fail-loudly invariants, audit visibility, migration rails, test depth#18
Anmolnoor wants to merge 13 commits into
mainfrom
feat/fcli-hardening

Conversation

@Anmolnoor

Copy link
Copy Markdown
Owner

Summary

Implements plans/fcli-hardening-roadmap.md — the staged fix plan from the 2026-06-10 full-project review. The theme: fcli was stricter about what gets in (policy, approval, validation) than about noticing when its own internals fail. This batch closes that gap, one reviewable commit per stage.

Gates on every commit: pytest (460 → 542 tests), ruff check + format, mypy strict, foundation doctor.

Stages

  1. Executor invariants fail loudly — all 16 asserts replaced with typed ExecutorInvariantError → FAILED results; survives python -O.
  2. Plan-action validation — closed the stray question/explanation cross-payload holes; failures route through plan repair.
  3. One source of truth for git mutation subcommands (models/git.py); planner and guardrails can no longer diverge.
  4. Audit-trail failures visible — sink failure counting, disable-after-3 circuit breaker (observer + compose_event_sink), writer crashes poison the session to write_truncated, phraser fallbacks log a reason category.
  5. Migration safety rails — pre-migration .pre-v<target>.bak backup, row-count validation before dropping the source table, HistoryMigrationError naming the backup.
  6. Diff-applier strictness — parse-time rejection of count-mismatched and no-op hunks; bare-context and newline-normalized matching kept but surfaced via FileMutationResult.leniency_notes.
  7. Test depth — 41 new tests: isolated test_planner.py, Codex failure paths, live-rendering edges.
  8. Orchestrator slimmingevaluated and descoped with evidence (helpers' only callers are internal; ctor params are deliberate DI seams). See roadmap.
  9. Docs/plans hygiene — status headers everywhere, CHANGELOG, TECHNICAL.md.

Bugs found and fixed along the way

  • Hallucinated capability id escaped the repair loop as an unwrapped ValueError and crashed the turn (found by stage 7 tests).
  • Model text parsed as Rich markup → MarkupError crash / styling injection in the live detail panel and both chat renderers (the latter found only by live smoke testing).
  • Piped non-TTY runs hard-aborted ("Aborted.", exit 1) at approval prompts → now graceful PENDING_APPROVAL stop; TTY Ctrl-C still aborts.
  • Successful self-repairing turns reported "stopped: tool failed" + "no verification command ran" → custom test scripts count as verification, latest verification attempt wins across iterations, and recovered turns read "Executed N actions, recovered from M earlier failures".

Verification

Beyond the suite, smoke-tested hard against the real Codex provider (gpt-5.5, ChatGPT login) in a scratch git workspace: file creation, fix-and-verify replan loop (3 iterations, verification "passed"), typed git stage+commit through the approval gate, graceful gap decline, non-TTY scope-block, empty-request guard, markup-injection probe (crashed before the fix, renders literally after), piped-approval probe (aborted before the fix, pending-approval after).

Review note: the roadmap file (plans/fcli-hardening-roadmap.md) carries a status header per stage including what was rejected on inspection and why — start there.

🤖 Generated with Claude Code

Smoke Test and others added 12 commits June 10, 2026 16:58
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
python -O strips assert statements, so the executor's 16 invariant guards
(kind/payload narrowing and file/git service wiring) could silently vanish
in optimized runs. Replace them with a _require helper raising
ExecutorInvariantError; ActionExecutor.execute() converts violations into
typed FAILED ExecutionResults instead of interpreter crashes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The kind/payload validator already enforced payload presence but accepted
a stray question payload on EXPLANATION/SHELL/TOOL_CALL actions and a
stray explanation on QUESTION actions. Every cross-kind payload is now
rejected at validation time; failures route through the existing plan
repair loop.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Planner and guardrails each maintained an identical 16-entry set; a
divergence would let the planner permit what policy blocks. The set now
lives once in models/git.py with both services aliasing it, plus an
identity test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Event sinks must never break a turn, but their failures were invisible:
the observer swallowed sink exceptions, compose_event_sink warned per
event forever on a flapping sink, and a crash escaping
EventLogWriter.write_event left the sessions index claiming a complete
log. Sink failures are now counted and warned about once, a sink is
disabled after 3 consecutive failures, writer crashes poison the index
row to write_truncated, and gap-phraser fallbacks log their reason with
a category instead of failing silently.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Schema migrations ran automatically against the user's real history DB
with no backup and no post-check, and the v6 executescript rebuild
dropped the source table without validating the copy. Migrations now
back up the DB file first (keeping the newest backup), the v6 rebuild
validates row counts before dropping anything, and failures raise
HistoryMigrationError naming the backup path with the original data
intact.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The unified-diff applier silently tolerated bare context lines and
always compared with trailing newlines stripped, and never checked a
hunk's declared source count against its body. Declared-count
mismatches and no-op hunks are now parse-time DIFF_REJECTED errors;
bare-context and newline-normalized matching remain as deliberate
tolerances but are reported through FileMutationResult.leniency_notes
so they land in the trace.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
41 new tests where coverage was thinnest: PlannerService unit tests
(prompt injection, validation, repair retries), Codex adapter failure
paths mapped to ProviderErrorCodes, and live renderer edge cases
(narrow widths, spinner identity, markup/ANSI payloads).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A plan naming a nonexistent capability id raised a plain ValueError
from CapabilityRegistry.resolve that escaped both the plan-repair loop
and the orchestrator's PlanningError handling, crashing the turn; it is
now wrapped as PlanningError so the model gets a repair retry. And
render_detail_panel passed user request text to Rich as raw markup,
allowing styling injection and a MarkupError crash mid-Live.update; it
is now rendered literally via Text().

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Mark completed roadmaps (fixes roadmap, v0.1, v2 stages, v4) with status
headers, add the hardening-batch CHANGELOG section, and document the
write_truncated semantics, sink circuit-breaker, and pre-migration
backups in TECHNICAL.md. Stage 8 (orchestrator slimming) is recorded as
evaluated-and-descoped: the helpers' only callers live inside the
orchestrator, and the constructor params are deliberate DI seams.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Found by the codex smoke test: an assistant message starting with a
stray Rich closing tag (e.g. '[/bold]') crashed the concise renderer's
console.print and the verbose Assistant panel with MarkupError after
the turn had already completed. Model-generated text now renders
literally via Text() in both paths, matching the earlier detail-panel
fix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two findings from the codex smoke test. First, a piped one-shot run
hitting an approval prompt died with a bare 'Aborted.' (exit 1) when
typer.confirm got EOF; the prompt callback now returns None on non-TTY
EOF and ApprovalService resolves the action as PENDING, producing the
graceful loop stop. A TTY Ctrl-C still aborts. Second, a turn that
failed, repaired itself, and completed naturally reported 'stopped:
tool failed' plus 'no verification command ran': custom test scripts
now classify as verification, cross-iteration verification is decided
by the latest attempt instead of worst-wins, and the summary only uses
stop framing for abnormal stops — recovered turns read 'Executed N
actions, recovered from M earlier failures'.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Anmolnoor

Copy link
Copy Markdown
Owner Author

Pushed one more fix found by interactive use: the live status line's ?-toggle keypress reader kept reading stdin during approval prompts, eating the typed y so typer.confirm fell back to the default n and the turn stopped. pause() now fully releases stdin (reader thread stopped, termios restored, type-ahead flushed) and resume() reinstalls the reader — pinned with a pty-backed regression test. Affected all mid-turn prompts (approvals, agent questions, scope grants) on a real TTY. 543 tests green.

On a real TTY the live status line's '?'-toggle reader kept its
byte-at-a-time stdin loop running while approval/question prompts were
shown — renderer.pause() only stopped the Rich Live widget. The reader
raced typer.confirm for every byte, swallowed the user's 'y', and the
prompt resolved to its default 'n', blocking the action and stopping
the turn. pause() now tears down the reader (thread stopped, termios
restored, type-ahead flushed) and resume() reinstalls it. Pinned by a
pty-backed regression test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Anmolnoor Anmolnoor force-pushed the feat/fcli-hardening branch from 5a5540b to e03d417 Compare June 11, 2026 07:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant