I had idempotency, a message queue, and retries. I thought I was finally building something production-ready. Then I started thinking about the “in-between” failures.

What if the worker starts a charge, the provider succeeds, but the worker crashes before it can update the database?

The payment stays stuck in processing forever. The user was charged, but my system thinks it’s still “happening.” This is an orphaned payment, and it’s a nightmare to debug because, from the outside, it looks like nothing went wrong.

The Processing Lease

I solved this by treating the processing status as a lease. When a worker starts, it stamps the payment with a processing_started_at timestamp.

If a payment has been in processing for more than 2 minutes, I assume the worker died. I wrote a “sweeper” that finds these stale leases and resets them to failed so they can be picked up by the retry logic.

UPDATE payments
SET status = 'failed', next_retry_at = datetime('now', '+1 minute')
WHERE status = 'processing'
  AND processing_started_at <= datetime('now', '-2 minutes');
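
Taking the lease in the first place is a single atomic UPDATE. Here's a rough sketch, not the exact repo code: it assumes Go's database/sql, SQLite, and the column names that already appear in this post. The rows-affected check is what guarantees two workers can never claim the same payment:

// Sketch: atomically claim a payment before calling the provider.
// Assumes database/sql with SQLite and the schema used above.
func claimPayment(db *sql.DB, paymentRef string) (bool, error) {
	res, err := db.Exec(`
		UPDATE payments
		SET status = 'processing', processing_started_at = datetime('now')
		WHERE payment_ref = ?
		  AND status IN ('pending', 'failed')
		  AND (next_retry_at IS NULL OR next_retry_at <= datetime('now'))`,
		paymentRef)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return false, err
	}
	// Zero rows means another worker already holds the lease,
	// or this payment's retry timer hasn't expired yet.
	return n == 1, nil
}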

The Golden Rule: Commit Before You Charge

This was the most important lesson I learned. I used to do this:

tx, _ := db.Begin()
// ... update status to processing ...
provider.Charge() // DANGEROUS: external call inside a DB transaction
tx.Commit()

This is a disaster. If provider.Charge() succeeds but tx.Commit() fails (maybe the DB is slow or the connection drops), the transaction rolls back. My database says the payment is still pending, so the system tries to charge the user again. Double charge.

I had to flip the logic entirely:

  1. Update status to processing and COMMIT the transaction immediately.
  2. Call the provider OUTSIDE of any transaction.
  3. Update status to completed.

If step 3 fails, the sweeper will eventually see a stale processing record and reset it. Because the provider itself is idempotent (using the same key), the retry will simply return the existing success. The user is never charged twice.
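
Putting the three steps together, the worker looks roughly like this. It's a sketch under the same assumptions as the lease snippet above; the Provider interface and the error handling are illustrative stand-ins, not the repo's actual types:

// Sketch of the flipped flow. Provider is a stand-in interface.
type Provider interface {
	Charge(paymentRef, idempotencyKey string) error
}

func processPayment(db *sql.DB, p Provider, paymentRef, idemKey string) error {
	// 1. Take the lease, which COMMITS before we touch the network.
	claimed, err := claimPayment(db, paymentRef)
	if err != nil {
		return err
	}
	if !claimed {
		return nil // another worker owns this payment
	}

	// 2. Call the provider OUTSIDE of any transaction.
	if err := p.Charge(paymentRef, idemKey); err != nil {
		// Record the failure and hand it back to the retry logic.
		_, dbErr := db.Exec(`
			UPDATE payments
			SET status = 'failed', next_retry_at = datetime('now', '+1 minute')
			WHERE payment_ref = ?`, paymentRef)
		if dbErr != nil {
			return dbErr // the sweeper will rescue the stale lease anyway
		}
		return err
	}

	// 3. Mark success. If we crash before this runs, the sweeper resets
	//    the stale lease, and the retried Charge is a no-op at the
	//    provider thanks to the idempotency key.
	_, err = db.Exec(`
		UPDATE payments SET status = 'completed'
		WHERE payment_ref = ?`, paymentRef)
	return err
}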

Distributed Tracing (Actually seeing the mess)

As the system grew, logs weren’t enough. I couldn’t easily tell which API request led to which worker error.

I added OpenTelemetry to pass a Trace-ID through RabbitMQ. Now, when I look at a trace in Grafana, I can see the entire journey across different services:

Client
  |
  | POST /charge (Idempotency-Key)
  v
API
  |-- Store pending payment & Outbox entry (Transaction)
  |-- (Outbox Poller) Publish message (payment_ref, Trace-ID)
  v
RabbitMQ
  v
Worker
  |-- Acquire lease (status=processing, timestamp)
  |-- Call provider.Charge()
  |   |-- Success -> Update DB status=completed
  |   +-- Failure -> Set next_retry_at, status=failed
  v
Sweeper (periodic)
  |-- Find stale processing leases
  |-- Reset to failed for retry
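
The propagation itself is only a few lines of glue. Here's a sketch of what it can look like, assuming the rabbitmq/amqp091-go client and a global OpenTelemetry propagator configured elsewhere; the publishWithTrace helper and the "payments" routing key are illustrative, not the repo's actual names:

import (
	"context"

	amqp "github.com/rabbitmq/amqp091-go"
	"go.opentelemetry.io/otel"
)

// amqpHeaderCarrier adapts AMQP headers to OpenTelemetry's
// TextMapCarrier so the trace context rides along with the message.
type amqpHeaderCarrier amqp.Table

func (c amqpHeaderCarrier) Get(key string) string {
	if v, ok := c[key].(string); ok {
		return v
	}
	return ""
}

func (c amqpHeaderCarrier) Set(key, value string) { c[key] = value }

func (c amqpHeaderCarrier) Keys() []string {
	keys := make([]string, 0, len(c))
	for k := range c {
		keys = append(keys, k)
	}
	return keys
}

// Publisher side: inject the current span's context into the headers.
func publishWithTrace(ctx context.Context, ch *amqp.Channel, body []byte) error {
	headers := amqp.Table{}
	otel.GetTextMapPropagator().Inject(ctx, amqpHeaderCarrier(headers))
	return ch.PublishWithContext(ctx, "", "payments", false, false,
		amqp.Publishing{Headers: headers, Body: body})
}

// Worker side: extract it so the consumer's spans join the same trace.
func traceContextFrom(msg amqp.Delivery) context.Context {
	return otel.GetTextMapPropagator().Extract(
		context.Background(), amqpHeaderCarrier(msg.Headers))
}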

Final Thoughts

Building a payment processor isn’t about the happy path. The real engineering happens in the 1% of the time when the network dies, the worker crashes, or the database is locked.

There’s more to it, of course. A real system would need circuit breakers to stop hammering a failing provider, rate limiting to stay upright, and reconciliation jobs to cross-reference the database with the provider’s records every night.
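
A nightly reconciliation pass could be as small as the sketch below. Everything in it is hypothetical: LookupCharge and the created_at column are stand-ins for whatever the provider API and schema actually expose.

// Hypothetical reconciliation sketch; LookupCharge and created_at
// are illustrative assumptions, not the repo's actual API or schema.
type chargeLookup interface {
	LookupCharge(ctx context.Context, paymentRef string) (bool, error)
}

func reconcile(ctx context.Context, db *sql.DB, p chargeLookup) error {
	rows, err := db.Query(`
		SELECT payment_ref FROM payments
		WHERE status = 'completed'
		  AND created_at >= datetime('now', '-1 day')`)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var ref string
		if err := rows.Scan(&ref); err != nil {
			return err
		}
		charged, err := p.LookupCharge(ctx, ref)
		if err != nil {
			return err
		}
		if !charged {
			// The DB says completed but the provider disagrees:
			// flag it for a human instead of guessing.
			log.Printf("reconciliation mismatch: %s", ref)
		}
	}
	return rows.Err()
}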

The goal isn’t just to make it work; it’s to make it fail gracefully without losing anyone’s money.

The final code is at github.com/oreoluwa-bs/dinero.