Add structured logging to the preprocessing document-processing pipeline

Brain review

Azure DevOps · ai/work-packet-575ce8f821-attempt-1 · daytona

Human review

Plan an inspection-first, minimal-surface change in ITP-Agent-Runtime to add structured, step-level logging within the preprocessing pipeline under src/agentic_workflow/helpers/preprocessing/. Focus on the code paths that (1) route by file suffix, (2) choose OCR versus e-invoice XML parsing, (3) converge into process_document, and (4) persist outputs to S3. Because SocratiCode context failed and no detector evidence or review discussion is available, the coding agent should first identify the exact modules/functions and existing logging conventions before proposing targeted edits.
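As a rough illustration of points (1) and (2), a suffix router that logs its decision might look like the sketch below. Everything here is an assumption until the inspection step confirms the real code: the logger name, the `select_branch` function, the `event` field, and the `.xml` suffix are hypothetical. Only the pdf/jpg/png/tif list comes from the plan. The sketch uses the standard library `logging` module purely as a stand-in for whatever mechanism the repository already uses.

```python
import logging
from pathlib import Path

# Hypothetical logger name; the repo's own logger or wrapper should be reused.
logger = logging.getLogger("preprocessing")

# Assumption: OCR suffixes are taken from the plan, ".xml" is a guess for e-invoices.
OCR_SUFFIXES = {".pdf", ".jpg", ".png", ".tif"}
XML_SUFFIXES = {".xml"}


def select_branch(filename: str) -> str:
    """Return "ocr" or "einvoice" for a filename and log the decision once."""
    suffix = Path(filename).suffix.lower()
    if suffix in OCR_SUFFIXES:
        branch = "ocr"
    elif suffix in XML_SUFFIXES:
        branch = "einvoice"
    else:
        # Log before raising so the existing rejection behavior stays unchanged.
        logger.warning(
            "unsupported_suffix",
            extra={"event": "unsupported_suffix", "suffix": suffix},
        )
        raise ValueError(f"unsupported suffix: {suffix!r}")
    # Note: "filename" cannot be passed via extra= because it collides with a
    # built-in LogRecord attribute; only the normalized suffix is logged here.
    logger.info(
        "branch_selected",
        extra={"event": "branch_selected", "suffix": suffix, "branch": branch},
    )
    return branch
```

Logging at this single decision point keeps the file surface small and gives both branches a shared, traceable starting event.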

medium risk

· SocratiCode/codegraph context failed due to a dirty local clone, so repository structure and exact file targets must be confirmed manually before implementation.
· No detector evidence, approved action proposal details, or review discussion were available, so the plan relies primarily on the work packet summary and may miss repository-specific constraints.
· Structured logging can inadvertently expose sensitive document metadata or payload content; a human reviewer should verify field selection and redaction strategy.
· Logging changes in a document-processing pipeline can affect observability volume, downstream parsing of logs, and operational dashboards; event names and fields should be validated by reviewers.
· S3 persistence logging may expose storage identifiers if implemented carelessly; reviewers should confirm acceptable metadata boundaries.
· Dashboard review is required because all code changes and PR creation are gated before sandbox execution.
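One way to make the field-selection and storage-identifier risks reviewable is an explicit allow-list applied before anything is logged. The sketch below is only an assumption about how that could look: `ALLOWED_FIELDS`, `safe_fields`, and `key_prefix` are hypothetical names, and the actual field set is for the human reviewer to decide.

```python
from typing import Any, Dict, Mapping

# Hypothetical allow-list; the real set of loggable fields needs reviewer sign-off.
ALLOWED_FIELDS = frozenset(
    {
        "document_id",
        "correlation_id",
        "suffix",
        "branch",
        "bucket",
        "key_prefix",
        "duration_ms",
        "status",
    }
)


def safe_fields(fields: Mapping[str, Any]) -> Dict[str, Any]:
    """Keep only allow-listed keys; anything else (OCR text, payloads) is dropped."""
    return {key: value for key, value in fields.items() if key in ALLOWED_FIELDS}


def key_prefix(s3_key: str) -> str:
    """Return the directory part of an S3 key so the object name is not logged."""
    head, sep, _ = s3_key.rpartition("/")
    return head + sep
```

An allow-list fails closed: a field added to the pipeline later is not logged until someone deliberately adds it to the list.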

Expected files

src/agentic_workflow/helpers/preprocessing/__init__.py

Implementation steps

1. Inspect src/agentic_workflow/helpers/preprocessing/ to map the preprocessing flow end to end: locate the entrypoint that receives documents, the suffix-based router, the OCR handlers for pdf/jpg/png/tif, the e-invoice XML parsing path, the process_document convergence point, and the S3 persistence function(s).
2. Identify the project’s current logging standard before changing anything: determine whether the repo uses Python logging, structlog, a custom logger wrapper, or context-bound logger helpers, and reuse that existing mechanism instead of introducing a new logging framework.
3. Document the exact event points to instrument with the smallest file surface possible. At minimum include: preprocessing_started, suffix_detected/routed, branch_selected (ocr vs einvoice), ocr_started/completed or xml_parse_started/completed, process_document_started/completed, s3_persist_started/completed, and failure events around each major step.
4. Define a consistent structured field set for all events, using only non-sensitive metadata already available in the pipeline. Prefer fields such as document_id/correlation_id if one already exists, filename or normalized suffix, selected_processing_branch, MIME type hints if already known, target bucket/key prefix if non-secret, duration_ms where easy to compute, and status/result. Avoid logging raw document contents, OCR text, invoice payloads, secrets, or full signed URLs.
5. Add logging at the suffix router so each incoming file emits a traceable event showing the detected suffix and selected downstream path. Ensure unsupported or unexpected suffixes also emit structured warning/error logs before existing exception handling or rejection behavior continues unchanged.
6. Add logging around branch selection between OCR-capable document formats and XML e-invoice parsing. If the branch decision is split across multiple helper functions, prefer instrumenting the shared decision point once rather than duplicating logs in every low-level helper.
7. Add step-level logging inside the OCR path and the e-invoice parsing path. Capture start/completion/failure at the outer helper boundary to avoid excessive log noise, unless the code already has wrapper functions where one additional structured event per step is natural.
8. Add logging at or immediately around process_document so the convergence of both branches can be traced consistently regardless of source format. If process_document already receives normalized payloads, include the branch/source metadata in the log context where possible.
9. Add structured logging around S3 persistence calls to record that persistence was attempted and whether it succeeded, while redacting or omitting sensitive bucket or object details if current standards require that. Preserve existing exception propagation and retry behavior.
10. If there is repeated logging boilerplate across preprocessing helpers, factor it only minimally (for example, via existing helper utilities or a tiny local helper function in the same module) without broad refactoring. Prefer direct, explicit instrumentation over architectural changes.
11. Review exception handling paths in preprocessing helpers and ensure failures emit structured error logs with exception information using the repository’s standard pattern, while preserving existing control flow and return contracts.
12. Update or add focused tests only where the repository already has a clear pattern for asserting logs or behavior. If log assertion patterns do not exist, prioritize behavior-preserving tests around the instrumented paths and note log verification as a manual review point.
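The started/completed/failed pattern in steps 3, 7, 9, and 10 could be factored into one tiny local helper along the lines of the sketch below. It assumes the repository uses the standard library `logging` module, which step 2 still has to confirm; `log_step`, the logger name, and the `event` and `duration_ms` field names are hypothetical. The helper re-raises every exception, so existing control flow and retry behavior are preserved.

```python
import logging
import time
from contextlib import contextmanager
from typing import Any, Iterator

# Hypothetical logger name; reuse the repo's existing logger if one exists.
logger = logging.getLogger("preprocessing")


def _elapsed_ms(start: float) -> float:
    """Milliseconds elapsed since a time.perf_counter() reading."""
    return round((time.perf_counter() - start) * 1000, 1)


@contextmanager
def log_step(step: str, **fields: Any) -> Iterator[None]:
    """Emit <step>_started, then <step>_completed or <step>_failed.

    Field names passed in must not collide with built-in LogRecord
    attributes such as "filename", "module", or "message".
    """
    logger.info("%s_started", step, extra={**fields, "event": f"{step}_started"})
    start = time.perf_counter()
    try:
        yield
    except Exception:
        # logger.exception attaches the active traceback to the record.
        logger.exception(
            "%s_failed",
            step,
            extra={
                **fields,
                "event": f"{step}_failed",
                "duration_ms": _elapsed_ms(start),
            },
        )
        raise  # keep existing propagation and retry behavior unchanged
    logger.info(
        "%s_completed",
        step,
        extra={
            **fields,
            "event": f"{step}_completed",
            "duration_ms": _elapsed_ms(start),
        },
    )
```

A call site would then wrap an existing helper without changing it, for example `with log_step("s3_persist", document_id=doc_id, branch="ocr"): ...`, where `doc_id` stands for whatever identifier the pipeline already carries.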

Test plan

· Run a local inspection of unit/integration tests related to preprocessing and document processing to determine the narrowest existing test targets.
· Execute the relevant preprocessing test subset, especially coverage for suffix routing, OCR document formats, XML e-invoice handling, and any S3 persistence mocks/stubs.
· If the repo has pytest log capture or structured log assertions, add/execute focused tests verifying that representative OCR and XML inputs emit the expected step-level events without leaking payload data.
· Manually exercise, in sandbox only, one OCR-supported document (e.g. pdf) and one XML e-invoice through the preprocessing entrypoint and confirm logs show: suffix routing, branch selection, process_document convergence, and S3 persistence.
· Validate failure-path observability by triggering at least one controlled unsupported-suffix or parser/OCR error scenario and confirming a structured error event is emitted while existing failure behavior remains unchanged.
· Check that no secrets or document contents appear in emitted logs and that log volume remains bounded to step-level events rather than per-page or per-token noise unless already standard.
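If the repository turns out to use pytest with standard library logging, the event and leak checks above could be supported by two small helpers like the ones sketched here. Both function names are hypothetical, and the `event` attribute assumes events are attached to records through `extra=`, which has not been confirmed for this codebase.

```python
import logging
from typing import Iterable, List


def event_names(records: Iterable[logging.LogRecord]) -> List[str]:
    """Collect the assumed "event" attribute, falling back to the message."""
    return [getattr(record, "event", record.getMessage()) for record in records]


def leaked_values(
    records: Iterable[logging.LogRecord], forbidden: Iterable[str]
) -> List[str]:
    """Return forbidden strings found in any record's message or attributes."""
    needles = list(forbidden)
    hits: List[str] = []
    for record in records:
        # vars(record) covers both built-in attributes and extra= fields.
        haystack = " ".join(
            [record.getMessage(), *(str(value) for value in vars(record).values())]
        )
        hits.extend(needle for needle in needles if needle in haystack)
    return hits
```

In a pytest test these would be applied to `caplog.records`, for example asserting that `leaked_values(caplog.records, [known_ocr_text])` is empty after running a representative document, where `known_ocr_text` is a string the test fixture is known to contain.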

Repositories

· ITP-Agent-Runtime