Viet Le | May 18 2026

The Network is a Detail: Engineering a Fault-Tolerant Sync Engine

Built with Next.js, TypeScript, tRPC, Prisma, PostgreSQL, IndexedDB, Zustand, Server-Sent Events, PM2, Nginx

Making a web app load offline is a solved problem. You cache static assets, drop in a Service Worker, and route to a fallback page. It's a great trick for read-only content.

But the moment a user edits their data while offline — especially financial data — the problem space changes entirely. You're no longer just managing UI state. You're managing consistency across time and devices, which means making decisions that distributed systems engineers deal with at scale: what happens when two clients write the same record? Whose version wins? How do you guarantee a write is never silently lost?

When I built CardLedger.io, my goal was zero-latency UX. If a user logs a $200 Charizard while on a subway with a dropping connection, the app needs to feel instant. More importantly, that mutation can't silently fail. That takes optimistic UI, offline persistence, and reliable multi-device synchronization once the connection returns.

I could have outsourced this to a service like Supabase. Instead, I built a custom local-first sync engine using IndexedDB, Zustand, Server-Sent Events (SSE), and Postgres. Here's exactly how I engineered it.


1. The Client: Decoupling UI from the Network

In a standard React app, a mutation looks like this: show a spinner, await the database response, update the UI with the returned record.

This completely breaks offline. To sever the UI's reliance on the network, I implemented two things together: client-side ID generation and the Outbox Pattern.

When a user adds a card, the Zustand store instantly generates a crypto.randomUUID() and updates the UI optimistically. The user never sees a spinner. The app assumes success. Because the ID is generated on the client before the request fires, the optimistic UI entry and the eventual server record share the same ID from the start — there's no reconciliation step, no ID swap, no flicker when the server responds.
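
To make the ID flow concrete, here is a minimal sketch that assumes a plain Map in place of the Zustand store. CardEntry, CardStore, and addCardOptimistically are illustrative names, not CardLedger's actual code; the ID and clock sources are injectable only so the behavior is easy to verify.

```typescript
// Hypothetical shapes for illustration; the real store is a Zustand store.
interface CardEntry {
  id: string;
  name: string;
  priceCents: number;
  updatedAt: number; // client clock, ms since epoch
}

type CardStore = Map<string, CardEntry>;

function addCardOptimistically(
  store: CardStore,
  input: { name: string; priceCents: number },
  newId: () => string = () => crypto.randomUUID(),
  now: () => number = () => Date.now(),
): CardEntry {
  // The ID exists before any request is sent, so the server can persist the
  // row under the same key and no ID swap is needed when it responds.
  const entry: CardEntry = { id: newId(), ...input, updatedAt: now() };
  store.set(entry.id, entry); // the UI reads from the store immediately
  return entry; // the caller sends this same payload in the background
}
```

The returned entry is what would be handed to the background mutation, so the optimistic row and the server row are keyed identically from the first render.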

Simultaneously, the mutation fires in the background. Standard browser APIs like navigator.onLine are notoriously unreliable for catching network drops mid-flight, so instead I detect failure at the tRPC layer:

// Inside collectionStore.ts
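// A TRPCClientError that carries no `data` never received a structured
// response from the server. Errors that do carry `data` (validation, auth)
// were rejected by the server and must not be queued for retry.
const isNetworkError =
    error?.data?.code === 'FETCH_ERROR' ||
    (error?.name === 'TRPCClientError' && error?.data == null) ||
    !navigator.onLine;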

if (isNetworkError) {
    // Server unreachable — save to IndexedDB outbox
    set((state) => ({
        offline_mutations: [...state.offline_mutations, mutationRecord]
    }));
}

If the fetch drops, the payload goes into an offline_mutations array persisted in IndexedDB. The user's flow is entirely uninterrupted. Mutations flush sequentially when the device reconnects — order matters here, because an edit can't flush before the add it depends on.
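
The sequential flush can be sketched as follows. This is an assumption-laden illustration: QueuedMutation, SendResult, and flushOutbox are hypothetical names, and the send callback stands in for whatever tRPC call actually delivers a mutation.

```typescript
// Hypothetical record shape; the article does not show the real one.
interface QueuedMutation {
  id: string;
  kind: "add" | "edit" | "delete";
  clientTimestamp: number;
}

type SendResult = "ok" | "network-error";

// Sends queued mutations oldest-first and stops at the first network failure,
// so an edit is never delivered ahead of the add it depends on.
// Returns whatever is still undelivered.
async function flushOutbox(
  queue: readonly QueuedMutation[],
  send: (m: QueuedMutation) => Promise<SendResult>,
): Promise<QueuedMutation[]> {
  for (let i = 0; i < queue.length; i++) {
    const result = await send(queue[i]);
    if (result === "network-error") {
      // Keep the failed item and everything queued after it.
      return queue.slice(i);
    }
  }
  return []; // everything was delivered
}
```

Stopping at the first failure, rather than skipping ahead, is what preserves ordering: the remaining items are retried in the same order on the next reconnect.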


2. The Server: Conflict Resolution Across Devices

Queuing mutations offline is the easy part. Safely pushing them to the server when a device reconnects is where things get interesting.

Consider the scenario: a user edits a card on their desktop at T=1050, then their phone reconnects and tries to push an older offline edit for the same card from T=1000. Without a conflict strategy, the latest-to-arrive mutation wins — which is arbitrary and wrong.

Last-Write-Wins with client timestamps

When the mutation queue flushes, each mutation carries the clientTimestamp of when the action originally occurred. The server compares this against the record's current updatedAt. If the record in the database is newer than the incoming write, the write is skipped and the server returns a success response flagged as ignored, rather than an error:

// Inside collectionRouter.ts — updateEntry
if (entry.updatedAt.getTime() > clientTimestamp) {
    return { success: true, ignored: true, message: 'Stale write ignored' };
}

The conflict rule is intentionally simple: the latest mutation wins, regardless of type. If Device A deletes a card at T=1000 and Device B edits it at T=1050, the edit wins — the card survives. This is a deliberate product decision: a user who edited a card more recently almost certainly still wants it. "Delete always wins" is a common default in the literature, but it's the wrong default for a collection app where an accidental deletion is far more likely than a deliberate deletion made before a deliberate edit.
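
The rule is small enough to state as a pure function. The sketch below uses hypothetical names (TimedWrite, lastWriteWins) and reduces each write to its type and timestamp; it mirrors the strict greater-than comparison from updateEntry, so a tie goes to the incoming write.

```typescript
// Hypothetical reduction of a mutation to the two fields the rule looks at.
interface TimedWrite {
  kind: "edit" | "delete";
  clientTimestamp: number;
}

// Returns the write that should be reflected in the database.
// The write type is deliberately never consulted, only the timestamp.
function lastWriteWins(stored: TimedWrite, incoming: TimedWrite): TimedWrite {
  return stored.clientTimestamp > incoming.clientTimestamp ? stored : incoming;
}

const del: TimedWrite = { kind: "delete", clientTimestamp: 1000 };
const edit: TimedWrite = { kind: "edit", clientTimestamp: 1050 };

// The outcome is the same whichever device reaches the server first.
console.log(lastWriteWins(del, edit).kind); // edit
console.log(lastWriteWins(edit, del).kind); // edit
```

Because the result depends only on timestamps, the arrival order of the two devices stops mattering, which is the property the desktop/phone scenario needs.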

Tombstoning deletes

Hard deletes break multi-device sync. If Device A deletes a card while Device B is offline, Device B's next sync query finds nothing and silently keeps the card — a ghost record that should no longer exist.

Instead, deletions write a deletedAt timestamp and leave the row intact. When Device B syncs, it receives the tombstone and actively removes the card from its local state. The delete propagates correctly regardless of when each device reconnects. The LWW rule applies here too: if the tombstone's timestamp loses to a more recent edit on another device, the deletion is ignored and the card survives.
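
On the receiving device, applying a pulled batch might look like the sketch below. SyncedRow and applyPulledRows are hypothetical names, and a plain Map again stands in for the local store; the only field taken from the article is deletedAt.

```typescript
// Hypothetical row shape returned by a pull.
interface SyncedRow {
  id: string;
  name: string;
  updatedAt: number;
  deletedAt: number | null; // non-null marks a tombstone
}

// Applies one pulled batch to the local map. A tombstone removes the card;
// a live row inserts or overwrites it.
function applyPulledRows(
  local: Map<string, SyncedRow>,
  rows: readonly SyncedRow[],
): void {
  for (const row of rows) {
    if (row.deletedAt !== null) {
      local.delete(row.id);
    } else {
      local.set(row.id, row);
    }
  }
}
```

With a hard delete there would be no row for the pull to return, so the delete branch could never run and the ghost card would stay in the map.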


3. Real-Time Sync: SSE over a Dedicated VPS

To keep multiple open sessions in sync without forcing a manual refresh, the server needs to push notifications to the client when the database changes.

Polling every few seconds wastes resources and drains mobile batteries. WebSockets are bidirectional and operationally heavier than what I needed, and unlike EventSource they leave reconnection logic to the application. Since I only required unidirectional signaling ("your data changed, come fetch the delta"), Server-Sent Events were the right fit.

Why SSE forced an infrastructure change

I originally ran Postgres on Neon's serverless platform, primarily for cost. But SSE requires a persistent, long-lived connection — and serverless environments tear down connections between requests. I migrated to a dedicated VPS, which turned out to be the right call for cost reasons anyway, but the architectural need for SSE made it inevitable regardless.

The other constraint is PgBouncer. Many managed Postgres providers put PgBouncer in front of the database. In transaction pooling mode, it hands the server connection back to the pool after every transaction. LISTEN registers a notification subscription on one specific server connection, so once that connection goes back to the pool the client is no longer attached to it and silently stops receiving notifications. To hold a permanent LISTEN, you need a dedicated connection that never gets returned to a pool. On the VPS, while my standard tRPC mutations route through PgBouncer, the Node.js SSE listener connects directly to raw Postgres on port 5432, bypassing the pooler entirely to maintain its unbroken subscription.

The sync tower

The Node.js sync server runs co-located on the same VPS as Postgres, managed by PM2. This matters for latency: when a mutation fires pg_notify, the notification travels over localhost to the Node.js listener rather than across a network hop. By the time the browser receives the SSE push, the only meaningful latency is the Node.js event loop tick and the one-way network trip to the client.

The listener is straightforward — a single persistent pg client holding a permanent LISTEN connection:

await pgClient.query('LISTEN sync_channel');

pgClient.on('notification', (msg) => {
    const updatedUserId = msg.payload;
    const userConnections = clients.get(updatedUserId);

    if (userConnections) {
        userConnections.forEach(res => {
            res.write(`data: {"type": "SYNC_REQUIRED"}\n\n`);
        });
    }
});

When any mutation commits, Prisma fires a raw notify:

await prisma.$executeRaw`SELECT pg_notify('sync_channel', ${ctx.user.id})`

Active connections are stored as a Map<userId, Set<Response>> — multiple tabs on the same account each hold their own SSE connection, all registered under the same key, and a single pg_notify fans out to all of them simultaneously.
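
A minimal sketch of that registry, under stated assumptions: SseSink stands in for the HTTP response object (only write is needed here), and register and notifyUser are hypothetical helper names rather than the actual server.js functions.

```typescript
// Stand-in for the HTTP response; only `write` is used by the fan-out.
interface SseSink {
  write(chunk: string): void;
}

const clients = new Map<string, Set<SseSink>>();

// Registers one open stream and returns a cleanup function, which should run
// when the underlying request closes.
function register(userId: string, sink: SseSink): () => void {
  let sinks = clients.get(userId);
  if (!sinks) {
    sinks = new Set();
    clients.set(userId, sinks);
  }
  sinks.add(sink);
  return () => {
    const current = clients.get(userId);
    if (!current) return;
    current.delete(sink);
    if (current.size === 0) clients.delete(userId); // don't leak empty sets
  };
}

// Writes one SSE frame to every open stream for the user.
// Returns how many streams were notified.
function notifyUser(userId: string): number {
  const sinks = clients.get(userId);
  if (!sinks) return 0;
  sinks.forEach((sink) => sink.write('data: {"type": "SYNC_REQUIRED"}\n\n'));
  return sinks.size;
}
```

The cleanup function matters as much as the fan-out: without it, closed tabs would stay in the Set and every notification would write to dead responses.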

Nginx as the SSE gateway

The sync server runs on port 8080 and is exposed publicly via sync.cardledger.io through Nginx. SSE has a few requirements that Nginx doesn't satisfy by default:

proxy_buffering off;
proxy_cache off;
chunked_transfer_encoding off;
proxy_read_timeout 86400s;
proxy_send_timeout 86400s;
proxy_http_version 1.1;
proxy_set_header Connection '';

proxy_buffering off is the critical one — Nginx buffers proxy responses by default and flushes them in chunks, which completely defeats SSE. Events need to reach the browser the moment they're written; buffering turns a real-time stream into a delayed batch.

proxy_read_timeout 86400s lets the stream sit silent for up to 24 hours before Nginx closes it. The timeout applies between two successive reads from the upstream, not to the total lifetime of the connection. The default is 60 seconds, which would cut off any stream that stays quiet for a minute and force clients to reconnect constantly. This works in tandem with the 30-second heartbeat in server.js: the heartbeat keeps the connection from appearing idle to any intermediary with a shorter timeout of its own.

Setting Connection to an empty string clears the hop-by-hop header, so Nginx doesn't send its default Connection: close to the upstream sync server.
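
The heartbeat itself is not shown here, so the following is only a sketch of one common way to write it, assuming it sends an SSE comment line. HeartbeatTarget, Every, and startHeartbeat are hypothetical names; the scheduler is injectable so the timing can be exercised without real timers.

```typescript
// Stand-in for the HTTP response; only `write` is needed.
interface HeartbeatTarget {
  write(chunk: string): void;
}

// Schedules `fn` repeatedly and returns a function that stops it.
type Every = (fn: () => void, ms: number) => () => void;

const everyInterval: Every = (fn, ms) => {
  const handle = setInterval(fn, ms);
  return () => clearInterval(handle);
};

function startHeartbeat(
  target: HeartbeatTarget,
  intervalMs: number = 30_000,
  every: Every = everyInterval,
): () => void {
  // A line starting with ":" is an SSE comment. EventSource ignores it, but
  // the bytes still reset idle timers on proxies along the path.
  return every(() => target.write(": heartbeat\n\n"), intervalMs);
}
```

The returned stop function should be called when the client disconnects, otherwise the interval keeps writing to a closed response.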


4. The Local/Server Boundary: SSE as a Global Event Bus

Syncing raw database rows into IndexedDB solves the problem for scalar data. If a user changes a card's price from $10 to $20, the local Zustand store instantly recalculates the "Total Value" summary — no network required.

But not all state is a simple scalar. CardLedger's historical portfolio chart aggregates thousands of price points over time. When a user changes a card's variant from "Normal" to "Holo," the local client knows the current Holo price — but it has no way to reconstruct six months of historical data. That recalculation has to happen server-side via tRPC.

The naive solution is to invalidate the query directly in the UI component after the mutation commits:

await updateEntry(id, { variant: 'Holo' });
utils.collection.getPortfolioHistory.invalidate();

This is a subtle but serious anti-pattern in a local-first app. If the device is offline, the local write succeeds instantly — but the invalidate() call fires against an unreachable server, silently fails, and leaves the chart permanently stale even after reconnection. The UI layer has no business knowing whether the network is up.

The fix was already sitting in the architecture: every mutation — whether it happens instantly online or flushes from the outbox hours later — eventually triggers a pg_notify from Postgres. That made the SSE listener the natural place to own cache invalidation:

eventSource.onmessage = (event) => {
    // The payload arrives as a JSON string in event.data
    const data = JSON.parse(event.data);
    if (data.type === 'SYNC_REQUIRED') {
        useCollectionStore.getState().pullChanges();
        utils.collection.getPortfolioHistory.invalidate();
    }
};

This creates a clean boundary. The UI stays pure — components push optimistic updates to the local store and never touch the network layer directly. Offline mutations queue safely, and the chart holds its last known state rather than breaking. The moment the device reconnects, the outbox flushes, Postgres broadcasts, the SSE listener catches it, and the chart recalculates to reflect reality. Eventual consistency falls out naturally from the infrastructure that was already there.


5. The Edge Cases That Actually Took Time

The happy path took a few days. The edge cases took longer. Three are worth calling out specifically.

The Thundering Herd

If a user rapidly deletes 10 cards, Postgres fires 10 pg_notify events in quick succession. Without intervention, the client would execute 10 simultaneous pullChanges calls. I added a 150ms debouncer on the SSE listener — it waits for rapid events to settle before firing a single batched pull.
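
A trailing debounce of this kind can be sketched in a few lines. Timers, realTimers, and debounce are illustrative names, not the app's actual helper; the timer functions are injectable so the burst behavior can be checked without waiting on real time.

```typescript
// Minimal timer abstraction so the debounce can run against fake timers.
interface Timers {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

const realTimers: Timers = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

// Returns a function that runs `fn` once, `waitMs` after the last call in a burst.
function debounce(
  fn: () => void,
  waitMs: number,
  timers: Timers = realTimers,
): () => void {
  let pending: { handle: unknown } | null = null;
  return () => {
    if (pending) timers.clear(pending.handle); // restart the quiet period
    pending = {
      handle: timers.set(() => {
        pending = null;
        fn();
      }, waitMs),
    };
  };
}
```

Wired up as something like debounce(() => pullChanges(), 150), ten notifications inside the window collapse into a single pull once the burst goes quiet.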

The Dirty Flag Lock

Network requests take time. If an SSE ping arrives while a pull is already in flight, the new ping can't just be dropped — the change it's signaling might not be included in the in-flight request. To prevent dropped syncs, I implemented a lightweight state machine using two flags: isPulling and pendingPull. A ping that arrives during an active pull sets pendingPull = true. When the pull finishes, it checks the flag and immediately kicks off another sweep if needed.
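
The two-flag machine can be written as a pair of pure transitions, which makes the rules easy to check in isolation. This is a sketch with hypothetical names (PullState, Transition, onPing, onPullFinished), not the store's actual code; the caller is assumed to start a network pull whenever startPull is true.

```typescript
interface PullState {
  isPulling: boolean;
  pendingPull: boolean;
}

interface Transition {
  state: PullState;
  startPull: boolean; // true when the caller should start a network pull
}

const idleState: PullState = { isPulling: false, pendingPull: false };

// An SSE ping arrived.
function onPing(state: PullState): Transition {
  if (state.isPulling) {
    // A pull is in flight and may have missed this change: remember it.
    return { state: { isPulling: true, pendingPull: true }, startPull: false };
  }
  return { state: { isPulling: true, pendingPull: false }, startPull: true };
}

// The in-flight pull completed.
function onPullFinished(state: PullState): Transition {
  if (state.pendingPull) {
    // At least one ping arrived mid-flight: run exactly one follow-up sweep.
    return { state: { isPulling: true, pendingPull: false }, startPull: true };
  }
  return { state: idleState, startPull: false };
}
```

Any number of pings during a pull collapse into one follow-up sweep, so the client never runs pulls concurrently and never drops a signal.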

The Transaction Cursor Problem

This one is subtle. pullChanges uses a cursor approach: fetch everything with updatedAt > lastSynced. The assumption is that updatedAt reflects when a transaction commits. It doesn't. Postgres assigns it when the transaction starts, because now() returns the transaction's start time rather than the commit time.

So: Transaction A (editing a Charizard) starts at T=1000 and is slow. Transaction B (editing a Blastoise) starts at T=1050, is fast, commits first, and advances the client's cursor to T=1050.

Transaction A eventually commits with updatedAt = 1000, which is now behind the cursor. The client's next pull starts from T=1050 and misses the Charizard edit entirely. Because of this out-of-order race, the user's mutation is permanently skipped and never appears in the UI. The fix is to rewind the cursor slightly on every pull:

const safeCursor = lastSynced > 5000 ? lastSynced - 5000 : 0;

Rolling the cursor back 5 seconds on every pull creates an overlapping lookback window, so a slow transaction that commits behind the cursor is still caught on the next sweep. The tradeoff is re-fetching some changes the client has already applied, but since applying an already-known change to the Zustand map is idempotent, the duplicates are harmless. The 5-second window is conservative: it covers any transaction that commits within 5 seconds of starting, and one that lags further behind would indicate a deeper problem.
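
The race and the fix can be replayed in a small simulation. Everything below is a sketch under assumptions: ChangeRow, selectChanges, and mergeRows are hypothetical stand-ins for the server query and the local merge, and the timestamps are in milliseconds so the 5000 ms lookback is meaningful.

```typescript
interface ChangeRow {
  id: string;
  updatedAt: number; // ms, assigned at transaction start
  price: number;
}

const LOOKBACK_MS = 5000;

// Same arithmetic as the one-liner above, written with Math.max.
function safeCursor(lastSynced: number, lookbackMs: number = LOOKBACK_MS): number {
  return Math.max(0, lastSynced - lookbackMs);
}

// Stand-in for the server query: committed rows newer than the cursor.
function selectChanges(committed: readonly ChangeRow[], cursor: number): ChangeRow[] {
  return committed.filter((row) => row.updatedAt > cursor);
}

// Keyed upsert: applying the same row twice leaves the map unchanged.
function mergeRows(local: Map<string, ChangeRow>, rows: readonly ChangeRow[]): void {
  for (const row of rows) local.set(row.id, row);
}

const blastoise: ChangeRow = { id: "blastoise", updatedAt: 10050, price: 30 };
const charizard: ChangeRow = { id: "charizard", updatedAt: 10000, price: 200 };

// Blastoise committed first and moved the cursor to 10050.
// Charizard then commits with an older updatedAt.
const committed = [blastoise, charizard];
console.log(selectChanges(committed, 10050).length);             // 0: edit missed
console.log(selectChanges(committed, safeCursor(10050)).length); // 2: edit caught
```

The second query returns the already-applied Blastoise row as well, which is the duplicate the idempotent merge absorbs.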


Conclusion

Hand-rolling a sync engine meant treating client-side state with the same care as a transactional database. The interesting problems weren't in the happy path — they were in the gaps: the delete that arrives after the edit, the cursor that races past a slow commit, the cache invalidation that silently fails while the device is underground.

None of those problems are unique to CardLedger. They show up anywhere you stop asking "is the network available?" and start asking "what happens when it isn't?" The moment you make that shift, the network stops being a dependency and starts being a detail — one your app handles quietly, in the background, while the user logs their Charizard.