Distributed Fraud Review Pipeline
A distributed fraud-review pipeline, currently delivered as a functional POC with asynchronous processing, concurrent consumers, persistence, and a baseline resilience model.
Overview
This project builds an end-to-end system for suspicious listing review in transactional marketplaces. The current focus is a robust POC baseline, with production hardening as the next evolution step.
Problem
Fraud and prohibited-item detection workflows often break under volume, latency pressure, and operational complexity. A simple CRUD backend is not enough when cases must be ingested asynchronously, prioritized, reviewed, and audited at scale.
Constraints
- The project must deliver a working distributed POC quickly, without over-engineering the architecture.
- The system must include at least three specialized services with clear input and output contracts.
- The pipeline must support concurrent consumers, retries, and a dead-letter strategy for predictable failures.
- The current scope prioritizes a complete and traceable flow over advanced autoscaling and deep production tooling.
- Initial observability must still capture baseline signals such as throughput, latency, error rate, and queue backlog.
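The baseline signals above can be captured with a minimal in-process recorder. This is only a sketch with illustrative names; in the real service, Spring Boot Actuator and Micrometer would expose equivalent metrics:

```java
import java.util.concurrent.atomic.LongAdder;

// Minimal baseline-metrics sketch (illustrative; Actuator/Micrometer would
// provide this in the actual service). Tracks throughput, error rate, and
// mean latency for processed messages.
public class BaselineMetrics {
    final LongAdder processed = new LongAdder();
    final LongAdder errors = new LongAdder();
    final LongAdder totalLatencyMs = new LongAdder();

    void record(long latencyMs, boolean failed) {
        processed.increment();
        totalLatencyMs.add(latencyMs);
        if (failed) errors.increment();
    }

    double errorRate() {
        long n = processed.sum();
        return n == 0 ? 0.0 : (double) errors.sum() / n;
    }

    double meanLatencyMs() {
        long n = processed.sum();
        return n == 0 ? 0.0 : (double) totalLatencyMs.sum() / n;
    }

    public static void main(String[] args) {
        var m = new BaselineMetrics();
        m.record(20, false);
        m.record(40, true);
        System.out.printf("errorRate=%.2f meanLatencyMs=%.1f%n",
                m.errorRate(), m.meanLatencyMs());
    }
}
```

Queue backlog would come from the broker itself (for SQS, the approximate number of visible messages), not from in-process counters.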
Approach
The current implementation focuses on delivering a traceable asynchronous pipeline (ingestion, screening, persistence, and query APIs) with queue-based processing, basic retry and DLQ behavior, and minimum viable operational visibility. The roadmap then hardens this baseline with stronger idempotency, advanced observability, and production-like AWS operations.
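The asynchronous flow above can be sketched with an in-memory queue standing in for SQS and a pool of concurrent screening consumers. Class, queue, and method names here are illustrative assumptions, not the project's actual API:

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in for the SQS-backed pipeline: events are ingested
// onto a queue and drained by a pool of concurrent screening consumers.
public class PipelineSketch {
    static final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    static final AtomicInteger screened = new AtomicInteger();

    static void ingest(String listingId) {
        queue.offer(listingId); // producer side: asynchronous hand-off
    }

    static void startConsumers(int n, ExecutorService pool, CountDownLatch done) {
        for (int i = 0; i < n; i++) {
            pool.submit(() -> {
                String id;
                // Poll until the queue stays empty for the timeout window.
                while ((id = queue.poll(100, TimeUnit.MILLISECONDS)) != null) {
                    screened.incrementAndGet(); // screening step would run here
                }
                done.countDown();
                return null;
            });
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List.of("l-1", "l-2", "l-3", "l-4").forEach(PipelineSketch::ingest);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch done = new CountDownLatch(2);
        startConsumers(2, pool, done);
        done.await();
        pool.shutdown();
        System.out.println("screened=" + screened.get());
    }
}
```

The same shape holds against SQS: the hand-off is asynchronous, and consumer count scales independently of the producer.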
Key Decisions
Start with queue-based processing before introducing streaming complexity
A queue-first design keeps the first version understandable while still validating distributed boundaries, asynchronous flow, and failure handling. It creates a stable baseline for later evaluation of stream-oriented evolution.
Alternatives considered:
- Adopt a stream-first architecture from day one
- Keep processing synchronous between services
Split the platform into specialized services with explicit contracts
Separating ingestion, screening, and review/query responsibilities improves maintainability and makes bottlenecks easier to locate during scale and failure tests.
Alternatives considered:
- A single modular monolith
- Only two services with shared responsibilities
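As a sketch of what "explicit contracts" means here, the hand-off between stages can be expressed as immutable message types. The field names and scoring rule below are assumptions for illustration, not the project's actual schema:

```java
import java.time.Instant;

// Hypothetical message contracts between pipeline stages; field names are
// illustrative. Records give immutable, explicitly-typed payloads.
public class Contracts {
    // Output of the ingestion service, input to screening.
    public record SuspiciousListingEvent(String eventId, String listingId,
                                         String reason, Instant reportedAt) {}

    // Output of the screening service, input to review/query.
    public record ScreeningResult(String eventId, int riskScore, String verdict) {}

    static ScreeningResult screen(SuspiciousListingEvent e) {
        // Placeholder scoring rule: real screening logic is out of scope here.
        int score = e.reason().contains("counterfeit") ? 90 : 30;
        return new ScreeningResult(e.eventId(), score,
                score >= 70 ? "ESCALATE" : "MONITOR");
    }

    public static void main(String[] args) {
        var event = new SuspiciousListingEvent(
                "e-1", "l-42", "counterfeit brand", Instant.now());
        System.out.println(screen(event));
    }
}
```

Keeping each stage's input and output types separate is what makes bottlenecks and schema changes locatable per service.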
Treat retries, DLQ, and idempotency as first-class concerns
Distributed pipelines fail in predictable ways. Encoding retry policy, dead-letter handling, and duplicate protection early avoids silent data corruption and improves confidence under load.
Alternatives considered:
- Best-effort retries without a DLQ
- Manual recovery only
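A minimal in-memory sketch of this decision, bounded retries, a dead-letter list, and duplicate protection, is below. In practice SQS supplies the equivalent via redrive policies and consumer-side deduplication; everything here is illustrative:

```java
import java.util.*;
import java.util.function.Consumer;

// Illustrative retry/DLQ/idempotency sketch. In the real pipeline, SQS
// redrive policies would park exhausted messages, and a persistent store
// would back the processed-ID check.
public class ResilienceSketch {
    final Set<String> processedIds = new HashSet<>();   // duplicate protection
    final List<String> deadLetters = new ArrayList<>(); // DLQ stand-in

    // Attempt the handler up to maxAttempts times; exhausted messages
    // go to the DLQ instead of being silently dropped.
    void process(String messageId, Consumer<String> handler, int maxAttempts) {
        if (!processedIds.add(messageId)) {
            return; // already processed: idempotent no-op
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(messageId);
                return; // success
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    deadLetters.add(messageId); // park for later inspection
                }
            }
        }
    }

    public static void main(String[] args) {
        var r = new ResilienceSketch();
        r.process("m-1", id -> {}, 3);                                 // succeeds
        r.process("m-1", id -> { throw new RuntimeException(); }, 3);  // duplicate: skipped
        r.process("m-2", id -> { throw new RuntimeException(); }, 3);  // exhausts retries
        System.out.println("dlq=" + r.deadLetters);
    }
}
```

The point of the sketch is the ordering of concerns: deduplicate first, retry with a bound, and never lose a failed message.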
Plan ECS/Fargate and Terraform as the production-hardening deployment target
Keeping cloud hardening as a follow-up milestone preserves delivery speed in the current scope while maintaining a clear path toward production-like operations.
Alternatives considered:
- Stay local-only with Docker Compose
- Use managed serverless components exclusively
Tech Stack
- Java 21
- Spring Boot
- Spring Web
- Spring Data JPA
- Spring Boot Actuator
- PostgreSQL
- Amazon SQS
- Docker
- Docker Compose
- AWS ECS/Fargate (planned)
- Terraform (planned)
- OpenTelemetry (planned)
- CloudWatch (planned)
Result & Impact
The current implementation already validates the core distributed flow and gives a concrete baseline for performance, failure handling, and service boundaries. This creates a measurable foundation for production hardening.
Learnings
- A strict POC scope prevents over-engineering and accelerates end-to-end validation.
- Clear service boundaries matter more than service count when scaling distributed systems.
- Reliability mechanisms such as retry and DLQ need to exist even at the POC stage.
- Baseline metrics are required early to guide hardening priorities.
- Production-like cloud delivery should be a deliberate evolution, not an assumption.
Current Status
The current implementation focuses on a complete asynchronous flow: suspicious-event ingestion, initial risk screening, persistence, and review-oriented query access. The priority is proving architecture and execution with minimum viable resilience.
The next iteration is a hardening step with stronger operational controls, broader observability, and infrastructure maturity. This approach keeps trade-offs explicit between fast delivery and production readiness.
Why This Problem Space
The selected context is marketplace fraud review, where volume, latency, and analyst workflow quality all interact. This creates a realistic environment to test queueing behavior, consumer concurrency, backlog management, and failure recovery.
The goal is not to build the largest system possible, but a plausible one that demonstrates engineering judgment from architecture through operation.