
feat(data-retention): granular PII redaction stages (input + block outputs)#5272

Open
TheodoreSpeaks wants to merge 19 commits into staging from feat/pii-granular-redaction

Conversation

@TheodoreSpeaks

Collaborator

Summary

  • Add two execution-altering PII redaction stages alongside the existing log redaction: redact the workflow input before execution, and mask every block output in-flight before the next block reads it
  • Per-stage policy (entity types + language) for each of Logs / Workflow input / Block outputs; resolved most-specific-wins per workspace, with full back-compat for existing logs-only rules
  • In-flight stages fail-fast (abort the run) on a Presidio error instead of scrubbing or leaking; the logs stage keeps scrub-to-marker
  • Reuse the shared HTTP → Presidio path; block-output redaction runs before payload compaction so offloaded large values are still masked
  • Settings UI: chip-tabs across the three stages, language-first picker with the entity grid filtered to that language's recognizers, and a confirmation before removing a workspace override
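
The per-stage resolution described in the bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the resolver from this PR: the rule shapes, the `'en'` default language, and the names `StagePolicy`, `StoredRule`, `normalize`, and `resolveStagePolicies` are all hypothetical, and "most-specific-wins" is assumed to mean that a workspace override replaces the organization rule as a whole.

```typescript
type Stage = 'logs' | 'input' | 'blockOutputs'

interface StagePolicy {
  enabled: boolean
  entityTypes: string[]
  language: string
}

// Hypothetical stored shapes: a legacy flat rule (logs-only) or a per-stage rule.
type StoredRule =
  | { entityTypes: string[]; language?: string }
  | { stages: Partial<Record<Stage, StagePolicy>> }

const DISABLED: StagePolicy = { enabled: false, entityTypes: [], language: 'en' }

// Normalize either stored shape into a full per-stage record.
function normalize(rule: StoredRule): Record<Stage, StagePolicy> {
  if ('stages' in rule) {
    return {
      logs: rule.stages.logs ?? DISABLED,
      input: rule.stages.input ?? DISABLED,
      blockOutputs: rule.stages.blockOutputs ?? DISABLED,
    }
  }
  // Back-compat: a flat rule only ever applies to the logs stage.
  return {
    logs: {
      enabled: rule.entityTypes.length > 0,
      entityTypes: rule.entityTypes,
      language: rule.language ?? 'en', // assumed default
    },
    input: DISABLED,
    blockOutputs: DISABLED,
  }
}

// Most-specific-wins (assumed): a workspace rule, when present, beats the org rule.
function resolveStagePolicies(
  orgRule: StoredRule | null,
  workspaceRule: StoredRule | null
): Record<Stage, StagePolicy> {
  const winner = workspaceRule ?? orgRule
  return winner
    ? normalize(winner)
    : { logs: DISABLED, input: DISABLED, blockOutputs: DISABLED }
}
```

With this shape, an existing flat rule keeps redacting logs only, and the two execution-altering stages stay off until a per-stage rule turns them on.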

Type of Change

  • New feature

Testing

Tested manually. Added unit tests for resolver back-compat, redactObjectStrings and its failure modes, and the contract schema. bun run lint, check:api-validation:strict, and check:migrations origin/staging all pass.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel Bot commented Jun 29, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Ready | Preview, Comment | Jul 1, 2026 9:18pm |


@cursor

cursor Bot commented Jun 29, 2026


PR Summary

High Risk
Changes execution-time data (input, outputs, streams, memory) and log persistence for PII; misconfiguration or Presidio failures can abort runs or alter workflow results, though fail-fast and stored-rule semantics aim to prevent leaks.

Overview
Introduces three independent PII redaction stages (Logs, Workflow input, and Block outputs), each with its own entity types and language, while legacy flat rules still map to logs-only. Resolution stays most-specific-wins per workspace; enabled stages require at least one entity type (no “redact all”).

Runtime: Workflow input is masked before execution; block outputs are masked before compaction, downstream blocks, agent memory, and child workflows. Streaming blocks can drain without forwarding raw chunks when block-output redaction is on. In-flight stages use onFailure: 'throw'; log persist keeps scrub-to-marker and now hydrates/masks/re-stores large-value refs under the logs policy. Masking at run/persist time follows stored rules, not the feature flag (fail-safe).
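
The difference between the two failure modes can be illustrated with a small sketch. The real `redactObjectStrings` signature is not shown on this page, so everything below is hypothetical: `redactStringsSketch`, the injected `mask` function (standing in for the HTTP call to Presidio), and the `SCRUB_MARKER` value are assumptions made for illustration.

```typescript
type OnFailure = 'throw' | 'scrub'

interface RedactOptions {
  // Stand-in for the Presidio round trip; injected so the sketch is self-contained.
  mask: (text: string) => Promise<string>
  onFailure: OnFailure
}

// Assumed placeholder; the actual scrub marker is not shown in this PR page.
const SCRUB_MARKER = '[REDACTION_FAILED]'

async function redactStringsSketch(value: unknown, opts: RedactOptions): Promise<unknown> {
  if (typeof value === 'string') {
    try {
      return await opts.mask(value)
    } catch (err) {
      // In-flight stages abort the run; the logs stage swaps in a marker instead.
      if (opts.onFailure === 'throw') throw err
      return SCRUB_MARKER
    }
  }
  if (Array.isArray(value)) {
    return Promise.all(value.map((item) => redactStringsSketch(item, opts)))
  }
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {}
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      out[key] = await redactStringsSketch(child, opts)
    }
    return out
  }
  // Numbers, booleans, null, and undefined pass through unchanged.
  return value
}
```

Under `'throw'`, a single failed string rejects the whole call, so the caller can abort before any unmasked value reaches the next block. Under `'scrub'`, the value is dropped in favor of the marker, which suits log persistence where losing a field is preferable to storing raw PII.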

Presidio & throughput: New /analyze_batch and /anonymize_batch endpoints; app-side masking uses shared byte/count chunking and bounded concurrent HTTP batches (configurable PII_MASK_CHUNK_CONCURRENCY), with the old total-size scrub ceiling removed for large payloads.
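
A rough sketch of the chunk-then-bounded-concurrency pattern described above, under stated assumptions: `chunkStrings` and `mapWithConcurrency` are hypothetical helper names, the limits are illustrative parameters, and string length stands in for byte size. The PR's shared chunking code is not shown on this page.

```typescript
// Split strings into batches bounded by item count and total size.
// Size is approximated by string length here (an assumption of this sketch).
function chunkStrings(items: string[], maxCount: number, maxSize: number): string[][] {
  const chunks: string[][] = []
  let current: string[] = []
  let size = 0
  for (const item of items) {
    const full = current.length >= maxCount || size + item.length > maxSize
    if (current.length > 0 && full) {
      chunks.push(current)
      current = []
      size = 0
    }
    current.push(item) // an oversized item still gets a chunk of its own
    size += item.length
  }
  if (current.length > 0) chunks.push(current)
  return chunks
}

// Run one async task per input with at most `concurrency` in flight, preserving order.
async function mapWithConcurrency<T, R>(
  inputs: T[],
  concurrency: number,
  task: (input: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(inputs.length)
  let next = 0
  async function worker(): Promise<void> {
    while (next < inputs.length) {
      const index = next++
      results[index] = await task(inputs[index])
    }
  }
  const workerCount = Math.max(1, Math.min(concurrency, inputs.length))
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results
}
```

In this arrangement each chunk would become one batch request, and the `concurrency` argument would come from a setting such as `PII_MASK_CHUNK_CONCURRENCY`. Bounding the worker count keeps a large payload from opening an unbounded number of HTTP requests at once.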

Config: pii-granular-redaction gates enabling input/blockOutputs on the data-retention API and in settings (stage tabs, per-stage language + entity grid, remove-override confirm). PII_GRANULAR_REDACTION env + contract/schema updates support the new shape.
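
The two validation rules mentioned in this summary (granular stages are rejected while the flag is off, and an enabled stage needs at least one entity type) could look roughly like the sketch below. `GatedStage`, `StageRequest`, and `validateStages` are hypothetical names, and the error strings are invented for illustration; the route's actual checks are not shown on this page.

```typescript
type GatedStage = 'logs' | 'input' | 'blockOutputs'

interface StageRequest {
  enabled: boolean
  entityTypes: string[]
}

// Returns an error message, or null when the requested configuration is acceptable.
function validateStages(
  stages: Partial<Record<GatedStage, StageRequest>>,
  granularFlagOn: boolean
): string | null {
  for (const stage of ['logs', 'input', 'blockOutputs'] as const) {
    const req = stages[stage]
    if (!req || !req.enabled) continue // disabled or absent stages are always allowed
    if (stage !== 'logs' && !granularFlagOn) {
      return `${stage} requires the pii-granular-redaction flag`
    }
    if (req.entityTypes.length === 0) {
      return `${stage} needs at least one entity type`
    }
  }
  return null
}
```

Note that this gate only covers what can be saved. Per the Runtime paragraph above, masking at run and persist time follows the stored rules rather than the flag.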

Reviewed by Cursor Bugbot for commit 965eb65. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread apps/sim/executor/execution/block-executor.ts
@greptile-apps

greptile-apps Bot commented Jun 29, 2026

Contributor

Greptile Summary

This PR adds granular PII redaction stages for workflow execution and logs. The main changes are:

  • Per-stage redaction policy for input, block outputs, and logs.
  • Execution-time masking before workflow input and block outputs are consumed.
  • Batched Presidio HTTP masking and large-value log redaction support.
  • Settings UI updates for stage tabs, language filtering, and override removal confirmation.
  • API, contract, resolver, and test updates for the new policy shape.

Confidence Score: 4/5

This is close, but the restore path should be fixed before merging.

  • Restored offloaded block outputs can bypass the new block-output masking.
  • A paused or run-from-block workflow can materialize raw PII after the policy is enabled.
  • The API and resolver fixes for empty stages look consistent with the updated policy model.

apps/sim/lib/workflows/executor/execution-core.ts

Security Review

Restored large-value refs can still expose raw PII after block-output redaction is enabled for a paused or run-from-block workflow.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/lib/workflows/executor/execution-core.ts | Adds execution-time policy resolution and restore masking, but restored offloaded block outputs can still be hydrated as raw data. |
| apps/sim/lib/billing/retention.ts | Resolves stored PII rules into per-stage effective policies. |
| apps/sim/executor/execution/block-executor.ts | Masks block outputs before compaction and buffers streaming output when block-output redaction is enabled. |
| apps/sim/lib/logs/execution/pii-redaction.ts | Adds reusable object-string redaction with scrub or throw failure handling. |
| apps/sim/lib/logs/execution/pii-large-values.ts | Adds hydrate, mask, and re-store handling for offloaded values in log payloads. |

Reviews (14): Last reviewed commit: "feat(data-retention): gate granular PII ..." | Re-trigger Greptile

Comment thread apps/sim/lib/workflows/executor/execution-core.ts Outdated
Comment thread apps/sim/executor/execution/block-executor.ts
Comment thread apps/sim/lib/workflows/executor/execution-core.ts
@TheodoreSpeaks

Collaborator Author

@greptile review

Comment thread apps/sim/lib/workflows/executor/execution-core.ts Outdated
@TheodoreSpeaks

Collaborator Author

@greptile review

Comment thread apps/sim/executor/execution/block-executor.ts Outdated
Comment thread apps/sim/lib/workflows/executor/execution-core.ts Outdated
@TheodoreSpeaks

Collaborator Author

@greptile review

@TheodoreSpeaks

Collaborator Author

@greptile review

…redaction

# Conflicts:
#	apps/sim/ee/data-retention/components/data-retention-settings.tsx
Comment thread apps/sim/app/api/organizations/[id]/data-retention/route.ts
Comment thread apps/sim/ee/data-retention/components/data-retention-settings.tsx
Comment thread apps/sim/lib/workflows/executor/execution-core.ts
@TheodoreSpeaks

Collaborator Author

@greptile review

Comment on lines +678 to +682
// Limitation: this walks inline strings only — values offloaded to
// large-value storage are still refs here and are not re-masked. In the
// normal flow that is safe (a run with the stage on masks before offload);
// the gap is the narrow case of a run that offloaded a large value while
// the stage was OFF and is resumed after the stage is turned ON.

Contributor


P1 security Large values bypass masking

When block-output redaction is enabled after a workflow already offloaded large block outputs, this restore path only masks inline strings in the snapshot. The offloaded payloads stay behind large-value refs. On resume or run-from-block, downstream blocks can still read the raw restored payload, and log persistence can skip the large-value scrub because block-output redaction is now enabled. This leaves raw PII reachable from prior block outputs after the stage is turned on.

abortSignal: ctx.abortSignal,
// Propagate in-flight block-output redaction into child workflows so
// nested blocks mask outputs too (recurses: each child forwards it).
piiBlockOutputRedaction: ctx.piiBlockOutputRedaction,


Child workflows skip input redaction

Medium Severity

The new workflow-input PII stage runs only in executeWorkflowCore on top-level processedInput. Nested child runs are started with a direct Executor and pass childWorkflowInput unchanged. Only the block-output policy is forwarded on the context, so when the input stage is on and block outputs are off, mapped or explicit child input can execute and produce downstream state without in-flight input masking.


Reviewed by Cursor Bugbot for commit 8f86d77. Configure here.

@TheodoreSpeaks

Collaborator Author

@greptile review

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 6e9587a. Configure here.

Comment thread apps/sim/lib/logs/execution/logger.ts Outdated
Comment on lines +689 to +692
snapshot.state.blockStates = await redactObjectStrings(
  snapshot.state.blockStates,
  blockOutputOpts
)

Contributor


P1 security Refs stay unmasked

When a paused run or run-from-block snapshot contains a large-value ref that was created before block-output redaction was enabled, this call only masks inline strings. Large-value refs are treated as opaque by redactObjectStrings, so the ref still points at the original offloaded bytes. The later warm-up step can materialize that raw value for downstream blocks, letting them read or send unredacted PII even though the block-output stage is enabled.

@TheodoreSpeaks

Collaborator Author

@greptile review

Comment on lines +689 to +692
snapshot.state.blockStates = await redactObjectStrings(
  snapshot.state.blockStates,
  blockOutputOpts
)

Contributor


P1 security Refs stay raw

This restore path still only masks inline strings. When a paused run or run-from-block snapshot contains a large-value ref created before block-output redaction was enabled, redactObjectStrings leaves the ref untouched. The later warm-up can materialize that original offloaded value for downstream blocks, so the resumed workflow can read raw PII even though block-output redaction is now enabled. This path needs to hydrate, mask, and re-store restored refs before downstream state can use them.

@waleedlatif1 waleedlatif1 deleted the branch staging July 1, 2026 05:43
@waleedlatif1 waleedlatif1 reopened this Jul 1, 2026
Comment on lines +689 to +693
  snapshot.state.blockStates = await redactObjectStrings(
    snapshot.state.blockStates,
    blockOutputOpts
  )
}

Contributor


P1 security Large refs remain raw

This restore path still leaves old offloaded block outputs unmasked. It only runs redactObjectStrings over restored blockStates, and that redactor treats large-value refs as opaque, so a paused run or run-from-block snapshot created before block-output redaction was enabled can still point at raw stored bytes. When the restored state is warmed and downstream blocks read that ref, they can receive the original PII even though the block-output stage is enabled. The restore path needs to hydrate, mask, and re-store those refs before exposing the state to execution.

… (env-tunable), remove request timeouts, sync large-value walk
…daction flag

- New pii-granular-redaction feature flag (fallback PII_GRANULAR_REDACTION),
  layered on pii-redaction, gating the execution-altering input + block-output stages
- Route returns piiGranularRedactionEnabled and rejects enabling granular stages when off
- UI shows only the Logs stage tab unless the flag is on; clamps active stage
- Drop the per-search Select all toggle; add a Deselect all action to the PII section header
@TheodoreSpeaks

Collaborator Author

@greptile review

Comment on lines +688 to +698
if (snapshot.state?.blockStates) {
  snapshot.state.blockStates = await redactObjectStrings(
    snapshot.state.blockStates,
    blockOutputOpts
  )
}
if (runFromBlock?.sourceSnapshot?.blockStates) {
  runFromBlock.sourceSnapshot.blockStates = await redactObjectStrings(
    runFromBlock.sourceSnapshot.blockStates,
    blockOutputOpts
  )

Contributor


P1 security Large refs stay raw

This restore path masks only inline strings. redactObjectStrings leaves large-value refs untouched, and the snapshot warm-up runs after this block. When a paused run or run-from-block snapshot contains a ref created before block-output redaction was enabled, this code keeps the ref pointing at the original offloaded value. The later warm-up can materialize raw PII into blockStates, so downstream blocks can read unmasked data even though block-output redaction is enabled. The restore path needs to hydrate, mask, and re-store these refs before execution can use the restored state.
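
The "hydrate, mask, and re-store" step requested here could take roughly the following shape. This is a sketch under loud assumptions, not the fix adopted in this PR: the ref format (`__largeValueRef`), the `LargeValueStore` interface, and the `remaskRefs` helper are all hypothetical, and `mask` stands in for the Presidio call.

```typescript
// Hypothetical shapes: the real large-value ref format and storage API are not shown here.
interface LargeValueRef {
  __largeValueRef: string
}

interface LargeValueStore {
  load(key: string): Promise<string>
  save(value: string): Promise<string> // resolves to the new key
}

function isRef(value: unknown): value is LargeValueRef {
  return (
    typeof value === 'object' &&
    value !== null &&
    typeof (value as LargeValueRef).__largeValueRef === 'string'
  )
}

// Walk restored state; for each ref, hydrate the payload, mask it, and store
// the masked copy under a new key so the old raw bytes are no longer referenced.
async function remaskRefs(
  value: unknown,
  store: LargeValueStore,
  mask: (text: string) => Promise<string>
): Promise<unknown> {
  if (isRef(value)) {
    const raw = await store.load(value.__largeValueRef)
    const masked = await mask(raw) // a failure propagates, matching fail-fast in-flight stages
    return { __largeValueRef: await store.save(masked) }
  }
  if (Array.isArray(value)) {
    return Promise.all(value.map((item) => remaskRefs(item, store, mask)))
  }
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {}
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      out[key] = await remaskRefs(child, store, mask)
    }
    return out
  }
  return value // inline strings are left to the existing inline redaction pass
}
```

Running a pass like this over the restored `blockStates` before the warm-up step would mean any ref that warm-up materializes already points at masked bytes.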
