Distributed Fraud Review Pipeline
A distributed fraud-review pipeline, currently delivered as a functional POC with asynchronous processing, concurrent consumers, persistence, and a baseline resilience model.
Overview
This project builds an end-to-end system for suspicious listing review in transactional marketplaces. The current focus is a robust POC baseline, with production hardening as the next evolution step.
Problem
Fraud and prohibited-item detection workflows often break under volume, latency pressure, and operational complexity. A simple CRUD backend is not enough when cases must be ingested asynchronously, prioritized, reviewed, and audited at scale.
Constraints
- The project must deliver a working distributed POC quickly, without over-engineering the architecture.
- The system must include at least three specialized services with clear input and output contracts.
- The pipeline must support concurrent consumers, retries, and a dead-letter strategy for predictable failures.
- The current scope prioritizes a complete and traceable flow over advanced autoscaling and deep production tooling.
- Initial observability must still capture baseline signals such as throughput, latency, error rate, and queue backlog.
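The baseline signals above can be captured with a minimal in-process recorder. This is only a sketch with illustrative names; in the real service, Spring Boot Actuator and Micrometer would expose equivalent metrics:

```java
import java.util.concurrent.atomic.LongAdder;

// Minimal baseline-metrics sketch (illustrative; Actuator/Micrometer would
// provide this in the actual service). Tracks throughput, error rate, and
// mean latency for processed messages.
public class BaselineMetrics {
    final LongAdder processed = new LongAdder();
    final LongAdder errors = new LongAdder();
    final LongAdder totalLatencyMs = new LongAdder();

    void record(long latencyMs, boolean failed) {
        processed.increment();
        totalLatencyMs.add(latencyMs);
        if (failed) errors.increment();
    }

    double errorRate() {
        long n = processed.sum();
        return n == 0 ? 0.0 : (double) errors.sum() / n;
    }

    double meanLatencyMs() {
        long n = processed.sum();
        return n == 0 ? 0.0 : (double) totalLatencyMs.sum() / n;
    }

    public static void main(String[] args) {
        var m = new BaselineMetrics();
        m.record(20, false);
        m.record(40, true);
        System.out.printf("errorRate=%.2f meanLatencyMs=%.1f%n",
                m.errorRate(), m.meanLatencyMs());
    }
}
```

Queue backlog would come from the broker itself (for SQS, the approximate number of visible messages), not from in-process counters.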
Approach
The current implementation focuses on delivering a traceable asynchronous pipeline (ingestion, screening, persistence, and query APIs) with queue-based processing, basic retry and DLQ behavior, and minimum viable operational visibility. The roadmap then hardens this baseline with stronger idempotency, advanced observability, and production-like AWS operations.
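The asynchronous flow above can be sketched with an in-memory queue standing in for SQS and a pool of concurrent screening consumers. Class, queue, and method names here are illustrative assumptions, not the project's actual API:

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in for the SQS-backed pipeline: events are ingested
// onto a queue and drained by a pool of concurrent screening consumers.
public class PipelineSketch {
    static final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    static final AtomicInteger screened = new AtomicInteger();

    static void ingest(String listingId) {
        queue.offer(listingId); // producer side: asynchronous hand-off
    }

    static void startConsumers(int n, ExecutorService pool, CountDownLatch done) {
        for (int i = 0; i < n; i++) {
            pool.submit(() -> {
                String id;
                // Poll until the queue stays empty for the timeout window.
                while ((id = queue.poll(100, TimeUnit.MILLISECONDS)) != null) {
                    screened.incrementAndGet(); // screening step would run here
                }
                done.countDown();
                return null;
            });
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List.of("l-1", "l-2", "l-3", "l-4").forEach(PipelineSketch::ingest);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch done = new CountDownLatch(2);
        startConsumers(2, pool, done);
        done.await();
        pool.shutdown();
        System.out.println("screened=" + screened.get());
    }
}
```

The same shape holds against SQS: the hand-off is asynchronous, and consumer count scales independently of the producer.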
Key Decisions
Start with queue-based processing before introducing streaming complexity
A queue-first design keeps the first version understandable while still validating distributed boundaries, asynchronous flow, and failure handling. It creates a stable baseline for later evaluation of stream-oriented evolution.
Alternatives considered:
- Adopt a stream-first architecture from day one
- Keep processing synchronous between services
Split the platform into specialized services with explicit contracts
Separating ingestion, screening, and review/query responsibilities improves maintainability and makes bottlenecks easier to locate during scale and failure tests.
Alternatives considered:
- A single modular monolith
- Only two services with shared responsibilities
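As a sketch of what "explicit contracts" means here, the hand-off between stages can be expressed as immutable message types. The field names and scoring rule below are assumptions for illustration, not the project's actual schema:

```java
import java.time.Instant;

// Hypothetical message contracts between pipeline stages; field names are
// illustrative. Records give immutable, explicitly-typed payloads.
public class Contracts {
    // Output of the ingestion service, input to screening.
    public record SuspiciousListingEvent(String eventId, String listingId,
                                         String reason, Instant reportedAt) {}

    // Output of the screening service, input to review/query.
    public record ScreeningResult(String eventId, int riskScore, String verdict) {}

    static ScreeningResult screen(SuspiciousListingEvent e) {
        // Placeholder scoring rule: real screening logic is out of scope here.
        int score = e.reason().contains("counterfeit") ? 90 : 30;
        return new ScreeningResult(e.eventId(), score,
                score >= 70 ? "ESCALATE" : "MONITOR");
    }

    public static void main(String[] args) {
        var event = new SuspiciousListingEvent(
                "e-1", "l-42", "counterfeit brand", Instant.now());
        System.out.println(screen(event));
    }
}
```

Keeping each stage's input and output types separate is what makes bottlenecks and schema changes locatable per service.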
Treat retries, DLQ, and idempotency as first-class concerns
Distributed pipelines fail in predictable ways. Encoding retry policy, dead-letter handling, and duplicate protection early avoids silent data corruption and improves confidence under load.
Alternatives considered:
- Best-effort retries without a DLQ
- Manual recovery only
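A minimal in-memory sketch of this decision, bounded retries, a dead-letter list, and duplicate protection, is below. In practice SQS supplies the equivalent via redrive policies and consumer-side deduplication; everything here is illustrative:

```java
import java.util.*;
import java.util.function.Consumer;

// Illustrative retry/DLQ/idempotency sketch. In the real pipeline, SQS
// redrive policies would park exhausted messages, and a persistent store
// would back the processed-ID check.
public class ResilienceSketch {
    final Set<String> processedIds = new HashSet<>();   // duplicate protection
    final List<String> deadLetters = new ArrayList<>(); // DLQ stand-in

    // Attempt the handler up to maxAttempts times; exhausted messages
    // go to the DLQ instead of being silently dropped.
    void process(String messageId, Consumer<String> handler, int maxAttempts) {
        if (!processedIds.add(messageId)) {
            return; // already processed: idempotent no-op
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(messageId);
                return; // success
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    deadLetters.add(messageId); // park for later inspection
                }
            }
        }
    }

    public static void main(String[] args) {
        var r = new ResilienceSketch();
        r.process("m-1", id -> {}, 3);                                 // succeeds
        r.process("m-1", id -> { throw new RuntimeException(); }, 3);  // duplicate: skipped
        r.process("m-2", id -> { throw new RuntimeException(); }, 3);  // exhausts retries
        System.out.println("dlq=" + r.deadLetters);
    }
}
```

The point of the sketch is the ordering of concerns: deduplicate first, retry with a bound, and never lose a failed message.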
Plan ECS/Fargate and Terraform as the production-hardening deployment target
Keeping cloud hardening as a follow-up milestone preserves delivery speed in the current scope while maintaining a clear path toward production-like operations.
Alternatives considered:
- Stay local-only with Docker Compose
- Use managed serverless components exclusively
Tech Stack
- Java 21
- Spring Boot
- Spring Web
- Spring Data JPA
- Spring Boot Actuator
- PostgreSQL
- Amazon SQS
- Docker
- Docker Compose
- AWS ECS/Fargate (planned)
- Terraform (planned)
- OpenTelemetry (planned)
- CloudWatch (planned)
Result & Impact
The current implementation already validates the core distributed flow and gives a concrete baseline for performance, failure handling, and service boundaries. This creates a measurable foundation for production hardening.
Learnings
- A strict POC scope prevents over-engineering and accelerates end-to-end validation.
- Clear service boundaries matter more than service count when scaling distributed systems.
- Reliability mechanisms such as retry and DLQ need to exist even at the POC stage.
- Baseline metrics are required early to guide hardening priorities.
- Production-like cloud delivery should be a deliberate evolution, not an assumption.
Current Status
The current implementation focuses on a complete asynchronous flow: suspicious-event ingestion, initial risk screening, persistence, and review-oriented query access. The priority is proving architecture and execution with minimum viable resilience.
The next iteration is a hardening step with stronger operational controls, broader observability, and infrastructure maturity. This approach keeps trade-offs explicit between fast delivery and production readiness.
Why This Problem Space
The selected context is marketplace fraud review, where volume, latency, and analyst workflow quality all interact. This creates a realistic environment to test queueing behavior, consumer concurrency, backlog management, and failure recovery.
The goal is not to build the largest system possible, but a plausible one that demonstrates engineering judgment from architecture through operation.