← all writing

Nº02 // WRITING

The N+1 that Redis couldn't save

There’s a particular flavor of disappointment that comes from adding a cache and watching nothing happen. You expect the graph to drop off a cliff. Instead it sits there, flat and indifferent, like the cache isn’t even wired in. That was me, a couple of weeks ago, staring at a tenant-directory search in one of my own commercial Booknetic addons — a marketplace-style listing UI I sell at addons.itahir.com — wondering why a perfectly healthy Redis object cache was doing absolutely nothing for the one page that needed it.

The search itself is mundane in the best way: a shortcode renders a directory, you type a few words, the form does a full reload, and you get a grid of matching listings. No per-keystroke fanciness, no autocomplete to debounce. Just a classic server-rendered search. And it was slow enough that I noticed, which on a page like this means it was slow enough that customers would too.

Redis was there. It just wasn’t invited.

The first instinct — mine, and probably yours — is “the object cache should catch this.” Redis was up. WordPress was using it as a persistent object cache. Other parts of the admin felt snappy. So why was this page hammering the database on every load?

Because the hot path never asked the cache a single question.

This is the part worth generalizing, because it’s not about any one platform. A whole class of WordPress plugins — including data-heavy ones with their own tables — issue raw $wpdb queries directly. WordPress’s object cache only accelerates code that goes through its caching APIs, or through query layers that themselves cache. Hand-rolled SQL bypasses all of that by design. The query runs, the result comes back, and the object cache never sees the round trip. You can install the fastest Redis in the world and it will faithfully cache the things it’s handed and remain blissfully unaware of everything it isn’t.

So lesson one, the unglamorous one: an object cache only helps code that asks it. If your hot path is hand-rolled SQL, no amount of memory-store tuning will save you. You have to fix the query first.

Profiling the page, not guessing about it

When I actually profiled the request instead of theorizing, the shape was almost comically textbook.

The page ran one search query to find matching listings, and then — for each result row — fired off several more queries to assemble the card you see in the grid: location, hours, media, top services, the listing’s type, the tenant record behind it. Roughly six extra queries per row. Innocent enough at one row. But this is a directory; the whole point is that it returns many rows.

The numbers scaled exactly the way an N+1 scales — which is to say, linearly and unforgivingly:

11 results  →  77 queries
37 results  → 235 queries

That’s the signature. Query count rising in lockstep with row count is the fingerprint of a loop doing per-row lookups. None of those queries were individually slow. They were slow in aggregate, and they were slow because of how many of them there were, and the database connection overhead of doing the same trivial lookup over and over.

The search query itself had a second, separate problem. It joined against a keywords table to do the matching, and that join column had no index. EXPLAIN told the whole story in its terse little way:

type: ALL          -- full table scan, no usable index
rows: 496          -- reading everything to find a handful
Extra: Using where; Using join buffer (Block Nested Loop)

type: ALL plus a Block Nested Loop join is the database politely informing you it has given up on being clever and is just going to read the entire table into a buffer and grind through it. On top of that, a GROUP BY in the query forced a temporary table and a filesort — Using temporary; Using filesort — which is the optimizer materializing intermediate results to disk-or-memory and then sorting them, all so it can deduplicate rows it shouldn’t have had to scan in the first place.

So: a full scan feeding a nested-loop join feeding a temp table feeding a filesort, and then a per-row loop multiplying the whole thing by the result count. Two independent performance bugs stacked on the same request.

The fix that actually moved the needle

Here’s the part I want to be honest about: I shipped the low-risk fix first, on purpose, and it was the right call.

The N+1 is the bigger, scarier number, but fixing it means rewriting a loop and changing how data flows through the page — real code changes with real regression surface. The join, on the other hand, was bleeding from a wound I could close with three indexes and zero application code.

Three targeted indexes:

-- foreign-key join columns on the keywords link table
ADD INDEX idx_lp (landing_page_id);
ADD INDEX idx_kw (keyword_id);

-- the status + tenant pair the listing lookup filters on
ADD INDEX idx_status_tenant (status, tenant_id);

The first two give the join an actual access path. The third is a composite that matches the exact (status, tenant_id) predicate the listing query filters on, so the database can seek instead of scan. All three went on tiny tables, applied as online DDL — instant, no lock drama, safe to run on a live site.

The EXPLAIN flipped exactly the way you hope:

lpk:  type ALL, rows 496, Block Nested Loop
  →   type ref, key idx_lp, rows 4

ALL became ref. Four hundred ninety-six rows became four. The Block Nested Loop vanished because there was finally an index to drive the join. The temporary table and filesort disappeared along with it. End to end, the search dropped from roughly 62ms to roughly 35ms — a little better than half — for the cost of three lines of DDL and zero risk to the application logic.

That’s the kind of fix I love. Not because it’s elegant, but because it’s cheap and it’s safe and it ships today.

What I deliberately left for next time

The indexes fixed the query. They did not fix the N+1, and I want to be clear-eyed that the bigger win is still sitting on the table.

The N+1 is a code change, and it’s the obvious one: instead of looping over results and firing six queries per row, collect the IDs from the result set and fetch everything in batched IN(...) queries — one query for all the locations, one for all the media, one for all the services, and so on — then stitch them back together in PHP. That alone should take the 37-result case from ~235 queries down to something like 12–15. The query count stops tracking the row count, which is the entire point of killing an N+1.

And only after that — only at the layer I actually own, my own addon’s code — does caching finally earn its keep. The plan is a transient keyed on a hash of the normalized search params: same search terms, same filters, same cached result set. With Redis behind the transient API, that assembled result set lands in the object cache the platform’s raw queries never touched, because now it’s my code doing the asking. The one carve-out is date or availability filters — those are inherently volatile and time-sensitive, so I skip the cache entirely when a date is set rather than serve someone a stale calendar.

That ordering matters, and it’s the whole thesis. I didn’t reach for the cache first. I reached for it last, after the query was sane and the N+1 was batched, because a cache layered over a broken query just memorizes a slow answer. Fix the shape of the work, then cache the result of work worth caching.

Redis couldn’t save this page. It was never going to. The page wasn’t asking it anything.