Distributed Fraud Review Pipeline

Backend Engineer · 2026 · In progress · 3 min read

Currently building a distributed fraud-review pipeline; the present milestone is a functional POC with asynchronous processing, concurrent consumers, persistence, and a baseline resilience model.

Overview

This project builds an end-to-end system for suspicious listing review in transactional marketplaces. The current focus is a robust POC baseline, with production hardening as the next evolution step.

Problem

Fraud and prohibited-item detection workflows often break under volume, latency pressure, and operational complexity. A simple CRUD backend is not enough when cases must be ingested asynchronously, prioritized, reviewed, and audited at scale.

Constraints

  • The project must deliver a working distributed POC quickly, without over-engineering the architecture.
  • The system must include at least three specialized services with clear input and output contracts.
  • The pipeline must support concurrent consumers, retries, and a dead-letter strategy for predictable failures.
  • The current scope prioritizes a complete and traceable flow over advanced autoscaling and deep production tooling.
  • Initial observability must still capture baseline signals such as throughput, latency, error rate, and queue backlog.
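To make the last constraint concrete, here is a minimal sketch of the baseline signals using only JDK types; in the actual service these counters would back Micrometer meters exposed through Spring Boot Actuator, and the class and field names here are illustrative, not the project's real code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.LongAdder;

// Minimal stand-ins for the baseline signals; in the service these would be
// Micrometer meters exposed through Spring Boot Actuator.
class BaselineSignals {
    final LongAdder processed = new LongAdder();      // throughput numerator
    final LongAdder errors = new LongAdder();         // error-rate numerator
    final LongAdder totalLatencyMs = new LongAdder(); // feeds average latency
    final BlockingQueue<String> queue;                // backlog = queue.size()

    BaselineSignals(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    void recordSuccess(long latencyMs) {
        processed.increment();
        totalLatencyMs.add(latencyMs);
    }

    void recordError() {
        errors.increment();
    }

    double avgLatencyMs() {
        long n = processed.sum();
        return n == 0 ? 0.0 : (double) totalLatencyMs.sum() / n;
    }

    int backlog() {
        return queue.size();
    }
}
```

Capturing these four signals early is what later lets the hardening phase be data-driven rather than guesswork.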

Approach

The current implementation focuses on delivering a traceable asynchronous pipeline (ingestion, screening, persistence, and query APIs) with queue-based processing, basic retry and DLQ behavior, and minimum viable operational visibility. The roadmap then hardens this baseline with stronger idempotency, advanced observability, and production-like AWS operations.
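The consumer side of this pipeline can be sketched with an in-memory stand-in for the SQS-backed stage: several workers drain a shared queue concurrently, with a poison-pill message per worker to signal shutdown. This is an illustration of the concurrency model, not the project's actual SQS client code.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// In-memory stand-in for the queue-backed stage: N concurrent consumers
// drain a shared queue; one poison pill per worker signals shutdown.
class ConcurrentConsumers {
    static final String STOP = "__stop__";

    static int drain(BlockingQueue<String> queue, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger handled = new AtomicInteger();
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String msg = queue.take();   // long-poll analogue
                        if (STOP.equals(msg)) return;
                        handled.incrementAndGet();   // screening would happen here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return handled.get();
    }
}
```

Because each worker blocks on `take()`, adding consumers scales throughput without changing producer code, which is exactly the property the POC needs to validate before any streaming migration.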

Key Decisions

Start with queue-based processing before introducing streaming complexity

Reasoning:

A queue-first design keeps the first version understandable while still validating distributed boundaries, asynchronous flow, and failure handling. It creates a stable baseline for later evaluation of stream-oriented evolution.

Alternatives considered:
  • Adopt stream-first architecture from day one
  • Keep synchronous processing between services

Split the platform into specialized services with explicit contracts

Reasoning:

Separating ingestion, screening, and review/query responsibilities improves maintainability and makes bottlenecks easier to locate during scale and failure tests.

Alternatives considered:
  • Single modular monolith
  • Only two services with shared responsibilities
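Explicit contracts between services can be modeled cheaply with Java records. The field names and the scoring rule below are hypothetical placeholders, not the project's actual schema; the point is that the boundary between ingestion and screening is a typed message, not a shared database table.

```java
import java.time.Instant;

// Illustrative input/output contracts between the ingestion and screening
// services; field names are hypothetical, not the project's real schema.
record SuspiciousListingEvent(String listingId, String sellerId,
                              String reason, Instant reportedAt) {}

record ScreeningResult(String listingId, int riskScore, String verdict) {

    static ScreeningResult screen(SuspiciousListingEvent event) {
        // Toy scoring rule standing in for the real screening logic.
        int score = "counterfeit".equals(event.reason()) ? 90 : 40;
        return new ScreeningResult(event.listingId(), score,
                score >= 70 ? "ESCALATE" : "AUTO_REVIEW");
    }
}
```

Records give immutable, value-based messages for free on Java 21, which keeps the contracts honest: a service cannot quietly mutate a payload it received.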

Treat retries, DLQ, and idempotency as first-class concerns

Reasoning:

Distributed pipelines fail in predictable ways. Encoding retry policy, dead-letter handling, and duplicate protection early avoids silent data corruption and improves confidence under load.

Alternatives considered:
  • Best-effort retries without DLQ
  • Manual recovery only
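The retry/DLQ/idempotency policy can be sketched in memory: a message is retried up to a maximum attempt count, then routed to a dead-letter queue, and processed IDs are remembered so redeliveries become no-ops. In the real pipeline SQS supplies the redrive mechanics; this sketch only models the policy, and its names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.function.Consumer;

// Policy sketch: retry up to maxAttempts, then dead-letter; remember
// processed IDs so duplicate deliveries are skipped (idempotency).
class ReliableHandler {
    final int maxAttempts;
    final Queue<String> deadLetter = new ArrayDeque<>();
    final Set<String> processed = new HashSet<>();   // idempotency key store

    ReliableHandler(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    void handle(String messageId, Consumer<String> work) {
        if (processed.contains(messageId)) return;   // duplicate delivery: skip
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                work.accept(messageId);
                processed.add(messageId);            // mark only after success
                return;
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) deadLetter.add(messageId);
            }
        }
    }
}
```

Marking a message as processed only after the work succeeds is the detail that prevents the silent data corruption mentioned above: a crash mid-handling leaves the ID unmarked, so redelivery retries rather than skips.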

Plan ECS/Fargate and Terraform as the production-hardening deployment target

Reasoning:

Keeping cloud hardening as a follow-up milestone preserves delivery speed in the current scope while maintaining a clear path toward production-like operations.

Alternatives considered:
  • Stay local-only with Docker Compose
  • Use managed serverless components exclusively

Tech Stack

  • Java 21
  • Spring Boot
  • Spring Web
  • Spring Data JPA
  • Spring Boot Actuator
  • PostgreSQL
  • Amazon SQS
  • Docker
  • Docker Compose
  • AWS ECS/Fargate (planned)
  • Terraform (planned)
  • OpenTelemetry (planned)
  • CloudWatch (planned)

Result & Impact

The current implementation already validates the core distributed flow and gives a concrete baseline for performance, failure handling, and service boundaries. This creates a measurable foundation for production hardening.

Learnings

  • A strict POC scope prevents over-engineering and accelerates end-to-end validation.
  • Clear service boundaries matter more than service count when scaling distributed systems.
  • Reliability mechanisms such as retry and DLQ need to exist even at the POC stage.
  • Baseline metrics are required early to guide hardening priorities.
  • Production-like cloud delivery should be a deliberate evolution, not an assumption.

Current Status

The current implementation focuses on a complete asynchronous flow: suspicious-event ingestion, initial risk screening, persistence, and review-oriented query access. The priority is proving architecture and execution with minimum viable resilience.

The next iteration is a hardening step with stronger operational controls, broader observability, and infrastructure maturity. This approach keeps trade-offs explicit between fast delivery and production readiness.

Why This Problem Space

The selected context is marketplace fraud review, where volume, latency, and analyst workflow quality all interact. This creates a realistic environment to test queueing behavior, consumer concurrency, backlog management, and failure recovery.

The goal is not to build the largest system possible, but to build a plausible one that demonstrates engineering judgment from architecture through operation.