My first attempt at a payment processor was synchronous and dangerous. If anything timed out or failed midway, I ended up with inconsistent data or double charges.
To fix it, I had to stop trying to do everything inside the HTTP request. I moved to a model where the API just records the intent to pay, and a separate worker handles the actual execution.
Step 1: Idempotency (The “Don’t Double Charge” Fix)
The first thing I needed was an Idempotency-Key: a unique string the client sends with each request. If I see the same key twice, I return the existing result instead of starting a new charge.
I added a unique index to my payments table and updated the handler to check for it first:
idemKey := r.Header.Get("Idempotency-Key")
if idemKey == "" {
	http.Error(w, "missing Idempotency-Key header", http.StatusBadRequest)
	return
}

existing, err := s.store.GetPaymentByIdempotency(ctx, idemKey)
if err == nil {
	// We've seen this key before: return the stored result
	// instead of starting a new charge.
	writeJSON(w, http.StatusOK, existing)
	return
}
Now, retries from the client are safe.
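One subtlety: the lookup alone is racy. Two requests with the same key can both miss the check and try to create a payment, and it's the unique index that catches the loser. Here's a minimal sketch of the insert path, assuming Postgres via pgx (the error-code check and the CreatePayment call on the store are my assumptions, not necessarily the repo's):

// needs "errors" and "github.com/jackc/pgx/v5/pgconn"
payment, err := s.store.CreatePayment(ctx, params)
var pgErr *pgconn.PgError
if errors.As(err, &pgErr) && pgErr.Code == "23505" { // unique_violation
	// Another request with the same key won the race; return its row.
	existing, _ := s.store.GetPaymentByIdempotency(ctx, idemKey)
	writeJSON(w, http.StatusOK, existing)
	return
}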
Step 2: The Outbox Pattern and RabbitMQ
I decided to use a background worker to handle the actual provider call. But this introduced a new problem: what if I save the payment to the database, but my server crashes before it can send the message to RabbitMQ? The payment would stay “pending” forever.
This is where the Outbox Pattern comes in. Instead of publishing to RabbitMQ directly, I save the payment and a “message to be sent” into an outbox table in the same database transaction.
tx, _ := s.db.Begin()
defer tx.Rollback() // no-op once Commit has succeeded
qtx := s.queries.WithTx(tx) // generated queries bound to this transaction
// 1. Save payment as pending
payment, err := qtx.CreatePayment(ctx, params)
if err == nil {
	// 2. Save the intent to the outbox table
	err = qtx.CreateOutboxEvent(ctx, outboxParams)
}
if err == nil {
	err = tx.Commit()
}
Since they are in the same transaction, either both are saved or neither is. A separate background process then reads from the outbox and publishes to RabbitMQ. If the publish succeeds, it marks the outbox row as sent.
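Here's roughly what that relay looks like. This is a sketch rather than the repo's exact code: OutboxEvent, FetchUnsentOutboxEvents, and MarkOutboxEventSent are names I'm assuming, and it publishes with the amqp091-go client.

package relay

import (
	"context"
	"time"

	amqp "github.com/rabbitmq/amqp091-go"
)

// OutboxEvent is the assumed shape of a row in the outbox table.
type OutboxEvent struct {
	ID      int64
	Payload []byte
}

type OutboxStore interface {
	FetchUnsentOutboxEvents(ctx context.Context, limit int32) ([]OutboxEvent, error)
	MarkOutboxEventSent(ctx context.Context, id int64) error
}

type Relay struct {
	store OutboxStore
	ch    *amqp.Channel
}

// Run polls the outbox and forwards unsent rows to RabbitMQ.
func (r *Relay) Run(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			events, err := r.store.FetchUnsentOutboxEvents(ctx, 100)
			if err != nil {
				continue // transient DB error; try again next tick
			}
			for _, ev := range events {
				err := r.ch.PublishWithContext(ctx, "", "payments",
					false, false, amqp.Publishing{
						ContentType: "application/json",
						Body:        ev.Payload,
					})
				if err != nil {
					break // RabbitMQ is unavailable; retry next tick
				}
				// Mark as sent only after the publish succeeds. A crash
				// between these two steps re-sends the event, so the
				// consumer has to tolerate at-least-once delivery.
				_ = r.store.MarkOutboxEventSent(ctx, ev.ID)
			}
		}
	}
}

Note the ordering: publish first, then mark as sent. That's what makes delivery at-least-once instead of at-most-once, and it's also why the idempotency work from step 1 matters on the consumer side too.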
The client gets a 202 Accepted immediately and can poll a GET endpoint later to see whether the payment finished. This keeps my API responsive and means I'm no longer holding connections open while waiting on slow providers.
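In handler terms that's just (a sketch; the Location header and the payment.ID field are my assumptions):

// Point the client at the status endpoint it can poll.
w.Header().Set("Location", "/payments/"+payment.ID)
writeJSON(w, http.StatusAccepted, payment)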
Step 3: Handling Provider Failures
Payment providers fail. A lot. My mock provider is deliberately set to fail 50% of the time, just to make sure the retry logic actually works.
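The mock is nothing clever; it boils down to a coin flip. A sketch (the MockProvider type and Charge signature are my assumptions):

// needs "context", "errors", and "math/rand"
func (p *MockProvider) Charge(ctx context.Context, amountCents int64) error {
	if rand.Intn(2) == 0 { // fail half the time
		return errors.New("mock provider: transient failure")
	}
	return nil
}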
When the worker picks up a message and the provider fails, I don’t want to just give up. I implemented a simple retry poller with backoff. If a charge fails, I update the DB with a next_retry_at timestamp. A background process sweeps the DB every few seconds and re-queues anything that’s ready for another shot.
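The bookkeeping for that is small. A sketch, assuming an attempts counter, a next_retry_at column, and a ScheduleRetry store method (all names mine):

// On a failed charge: back off exponentially before the next attempt.
attempt := payment.Attempts
if attempt > 8 {
	attempt = 8 // cap the shift so the delay tops out at ~4m16s
}
delay := time.Duration(1<<attempt) * time.Second // 1s, 2s, 4s, ...
s.store.ScheduleRetry(ctx, payment.ID, time.Now().Add(delay))

// The sweeper then runs every few seconds and re-queues anything
// that's due, conceptually:
//   UPDATE payments SET status = 'processing'
//   WHERE status = 'failed' AND next_retry_at <= now()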
The New State Machine
The flow looks more like this now:
pending → processing → completed OR failed (which triggers a retry).
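If I want the code to enforce those transitions, one option is a simple allow-list (a sketch; the repo may do this differently):

type Status string

const (
	StatusPending    Status = "pending"
	StatusProcessing Status = "processing"
	StatusCompleted  Status = "completed"
	StatusFailed     Status = "failed"
)

// allowed maps each state to the states it may move to.
var allowed = map[Status][]Status{
	StatusPending:    {StatusProcessing},
	StatusProcessing: {StatusCompleted, StatusFailed},
	StatusFailed:     {StatusProcessing}, // a retry re-enters processing
}

func canTransition(from, to Status) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}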
By moving to this async model, I’ve solved the timeout issues. Even if the client disconnects, the worker keeps going. If the worker crashes, the outbox or the retry poller eventually catches it.
But there’s still a catch. What happens if the worker crashes exactly after charging the user but before updating the database? That’s the final 1% of failures I’ll tackle in the last post.
The code for this version is at github.com/oreoluwa-bs/dinero/tree/resilient-approach.