
Nº02 // WRITING

Everything I self-host

I run four environments across three machines: one production box that earns money, one staging clone that lives on the same metal, one multi-tenant QA/demo host (which got owned; more on that below), and one always-on droplet for an autonomous-agent workload. What follows is the actual configuration, not a tidy idealised version. The gaps are named on purpose.

The production box, tuned for a 15 GB ceiling

example is an Ubuntu 24.04 server (203.0.113.10) with 15 GB of RAM running Virtualmin + nginx + PHP-FPM 8.3 + MariaDB 10.11. It hosts example.com, site-a, site-b, site-c and a Booknetic SaaS install. With that many pools sharing 15 GB, the defaults will kill you.

The headline change was flipping PHP-FPM from pm = dynamic to pm = ondemand. Idle pools then cost zero — children spawn only when a request actually arrives. I dropped pm.max_children from 24 to 15 (24 was the OOM risk on this box), added pm.max_requests = 500 to recycle workers and bound memory leaks, and pm.process_idle_timeout = 10s to reap idle children fast. Per-pool memory_limit is 256M everywhere — except site-c, whose price-list generator genuinely OOM’d at 256M, so that one pool gets 512M. It’s a per-pool override, not a global bump; I’d rather give one greedy site more headroom than inflate every pool to cover its worst case.
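A back-of-envelope way to see why 24 children was risky: treat the worst case for one pool as pm.max_children × memory_limit. This is a sketch under a pessimistic assumption (every child hits its limit at once), and memory_limit caps what a script allocates rather than a worker's true resident size. The function name is hypothetical.

```shell
# Upper-bound estimate for one PHP-FPM pool, in MB.
# Assumption: every child reaches memory_limit simultaneously (pessimistic).
fpm_worst_case_mb() {
  children=$1
  limit_mb=$2
  echo $(( children * limit_mb ))
}

fpm_worst_case_mb 24 256   # old setting        -> 6144
fpm_worst_case_mb 15 256   # new setting        -> 3840
fpm_worst_case_mb 15 512   # site-c's override  -> 7680
```

On a 15 GB box, 6144 MB for a single busy pool is roughly 40% of RAM before MariaDB and Redis get any, which is why the cap and ondemand matter more than any one pool's limit.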

The single biggest database win was embarrassingly simple. In /etc/mysql/mariadb.conf.d/99-tuning.cnf I raised innodb_buffer_pool_size from the default 128M to 3G. 128M for a WooCommerce workload is absurd — most of your hot indexes don’t even fit, so every query that should hit RAM goes to disk. OPcache in /etc/php/8.3/fpm/conf.d/99-opcache-tuning.ini gets opcache.memory_consumption = 256 and opcache.max_accelerated_files = 32000, because WordPress + WooCommerce + Booknetic together ship an astonishing number of PHP files and the default file cap silently leaves half of them uncached. I also added a 4 GB /swapfile (persisted in /etc/fstab) with vm.swappiness = 10 so the kernel treats swap as a safety net under a traffic spike, not a first resort.
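To check whether a file cap is actually big enough, one option is to count the PHP files a docroot ships and compare that to opcache.max_accelerated_files. A minimal sketch; both function names are hypothetical, and it counts files on disk, not files OPcache has really loaded.

```shell
# Count PHP files under a docroot (an upper bound on what OPcache may cache).
count_php_files() {
  find "$1" -type f -name '*.php' | wc -l | tr -d ' '
}

# Succeeds when the file count fits under the given cap.
fits_opcache_cap() {
  root=$1
  cap=$2
  [ "$(count_php_files "$root")" -le "$cap" ]
}
```

Usage would look like `fits_opcache_cap /home/example/public_html 32000 || echo "raise opcache.max_accelerated_files"`. If several pools share one OPcache, sum the counts across their docroots before comparing.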

Object caching is Redis with maxmemory 512mb and allkeys-lru eviction, and here’s the trick I like best: one Redis DB per site, set via WP_REDIS_DATABASE in each wp-config.php. site-a=0, site-b=1, example=2, site-c=3, staging=4, next site=5. If every site shared a single Redis database, one site’s FLUSHDB would nuke every other site’s cache; numbered DBs give you cheap namespacing without the overhead of running five separate Redis instances.
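Keeping that mapping in one place makes it harder to hand out the same DB number twice. A small sketch: the numbers are the ones listed above, while the function names and the /home/<user>/public_html docroot layout are assumptions for illustration.

```shell
# Site -> Redis DB index, copied from the mapping above.
redis_db_for() {
  case "$1" in
    site-a)  echo 0 ;;
    site-b)  echo 1 ;;
    example) echo 2 ;;
    site-c)  echo 3 ;;
    staging) echo 4 ;;
    *) echo "unknown site: $1" >&2; return 1 ;;
  esac
}

# Print (not run) the WP-CLI command that pins a site to its DB.
# Assumption: docroots follow /home/<user>/public_html.
print_redis_pin() {
  db=$(redis_db_for "$1") || return 1
  echo "wp --path=/home/$1/public_html config set WP_REDIS_DATABASE $db --raw"
}
```

`print_redis_pin site-c` prints the command rather than running it, so the output can be reviewed before anything touches wp-config.php.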

Monitoring without exposing the daemon

Netdata is installed but bound to 127.0.0.1:19999 only — the raw port is never on the public internet. To make it reachable I created monitor.example.com as its own Virtualmin domain: nginx reverse-proxies to the loopback Netdata, gated behind HTTP Basic Auth (user monitor) with a wildcard Cloudflare Origin cert. That loopback-bind-then-front-it pattern is house style across the whole fleet. The honest status: the Cloudflare DNS record and Telegram alerting (bot token + chat ID for the ~80% health alarms) are still pending.
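The loopback-only claim is easy to verify from the listening-socket table. A sketch that reads `ss -ltnH` output on stdin, where column 4 is the local address:port; the function name is hypothetical.

```shell
# Reads `ss -ltnH` output on stdin; fails if anything listens on the given
# port at a non-loopback address (0.0.0.0, [::], *, or a public IP).
only_loopback() {
  awk -v port="$1" '
    $4 ~ (":" port "$") && $4 !~ /^(127\.|\[::1\])/ { bad = 1 }
    END { exit bad + 0 }
  '
}
```

Usage: `ss -ltnH | only_loopback 19999 && echo "netdata is loopback-only"`. The same check works for any other daemon that is meant to sit behind a proxy or a tunnel.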

While we’re being honest: this box has no UFW yet, SSH still allows password auth, and the origin’s 80/443 aren’t restricted to Cloudflare IPs, so anyone hitting the bare IP bypasses CF entirely. These are deferred trade-offs, not things I forgot. What is in place is fail2ban (sshd/webmin/usermin/mail jails) and a Virtualmin backup that runs hourly to Dropbox. And the password-auth “gap” once saved me: after I overwrote ~/.ssh/id_ed25519 with a fresh key and locked myself out, ssh-copy-id only worked because password auth was still on. The recovery hatch and the security hole turned out to be the exact same door.

When Cloudflare silently broke scheduled jobs

This is my favourite war story because nothing was “broken” in the obvious place. On 2026-05-18 around 20:25, time-based WhatsApp and email reminders on example.com stopped firing. Twilio creds were valid — 886 successful sends in the prior week — so on the surface it looked like a Twilio or WhatsApp problem. It wasn’t.

Like a lot of WordPress stacks, the scheduled side of this site leans on a server-side loopback request — the kind of pattern wp-cron and Action Scheduler use, where PHP fires an HTTP call back to its own wp-cron.php / admin-ajax.php to run due work. Around that date, Cloudflare’s Bot Fight Mode got toggled on and started returning HTTP 403 with cf-mitigated: challenge on both of those endpoints. A cURL loopback can’t solve a JS challenge, so anything that depended on a self-issued HTTP call simply never ran. Event-driven work kept happening — because that fires synchronously inside the real visitor’s browser request, which passes the challenge. That asymmetry — scheduled jobs dead, instant ones alive — is exactly why it masqueraded as a messaging bug.
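The quickest way to confirm this failure mode is to make the same request the loopback makes and look at the response headers. A sketch that reads headers on stdin and checks for the 403 plus cf-mitigated: challenge combination described above; the function name is hypothetical.

```shell
# Reads HTTP response headers on stdin; succeeds if Cloudflare challenged
# the request (status 403 together with `cf-mitigated: challenge`).
is_cf_challenged() {
  headers=$(cat)
  printf '%s\n' "$headers" | grep -qiE '^HTTP/[0-9.]+ 403' &&
  printf '%s\n' "$headers" | grep -qi '^cf-mitigated:[[:space:]]*challenge'
}
```

Run from the origin itself, so the request takes the same path as the loopback: `curl -s -o /dev/null -D - https://example.com/wp-cron.php | is_cf_challenged && echo "loopback is being challenged"`.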

The fix is to stop relying on a Cloudflare-proxied HTTP loopback for scheduling at all. I disabled WordPress’s HTTP-triggered cron (define('DISABLE_WP_CRON', true)) and drive the due work from a real per-minute system cron in the example user’s crontab, invoking WP-CLI directly so nothing ever leaves the box over HTTP:

* * * * * /usr/bin/flock -n /tmp/wpcron.lock /usr/local/bin/wp --path=/home/example/public_html cron event run --due-now >/dev/null 2>&1

The flock -n guard keeps overlapping minute-runs from stacking on top of each other. When you cut over, verify the work actually fired rather than trusting any single “last run” timestamp — I check the application’s own log table by id, since a CLI-triggered run can leave a UI-facing “ran at” marker unchanged. The proper root-cause fix — a Cloudflare WAF skip rule for */wp-cron.php and */admin-ajax.php scoped to the origin egress IP — is still on the list. The generalizable lesson: any Cloudflare-proxied site that leans on a loopback wp-cron or Action Scheduler can silently stall the moment Bot Fight Mode flips. If a migrated WooCommerce site’s scheduled jobs ever look stuck, check this first.
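What flock -n buys in that cron line can be shown in isolation. A small demo, assuming util-linux flock is installed; the wrapper name and the temp lock file are hypothetical, not part of the real crontab.

```shell
# Run a command only if nobody else holds the lock. With -n, flock gives up
# immediately (non-zero exit) instead of queueing behind the current holder.
run_exclusive() {
  lockfile=$1
  shift
  flock -n "$lockfile" "$@"
}

# Demo: hold the lock on fd 9, then try to take it again.
demo_lock=$(mktemp)
exec 9>"$demo_lock"
flock -n 9
run_exclusive "$demo_lock" echo "ran" || echo "skipped: previous run still active"
exec 9>&-                      # closing the fd releases the lock
run_exclusive "$demo_lock" echo "ran"
rm -f "$demo_lock"
```

The first attempt is skipped and the second runs, which is the behaviour you want from a per-minute job: a slow run makes the next minute a no-op rather than a pile-up.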

Staging on the same box, isolated at every layer

staging.example.com lives on the same physical server but as a separate Virtualmin domain, which buys full isolation per layer: its own unix user staging (/home/staging), its own FPM pool (/etc/php/8.3/fpm/pool.d/17792676891117477.conf, 256M, 64M upload/post), its own MySQL DB and creds, its own Redis DB 4, and its own nginx vhost. Access to the whole vhost is gated with auth_basic against /etc/nginx/monitor.htpasswd (the same monitor creds as the monitor domain), with .well-known/ carved out via auth_basic off so ACME can still validate for TLS renewal.

The clone is a five-step runbook: rsync the docroot; mysqldump ... | mysql staging; wp config set the new creds plus WP_REDIS_DATABASE=4; wp search-replace https://example.com https://staging.example.com --skip-columns=guid (127 replacements last run); and finally chown -R staging:staging.

And here’s the trap that cost me real time: wp search-replace updates the database but does NOT invalidate Redis. After the clone, siteurl/home in the DB read staging.example.com, but home_url() still returned the cached example.com value, so redirect_canonical 301’d every single staging request straight back to production. The fix is one line — redis-cli -n 4 FLUSHDB — and it’s now a hard rule: always flush the cloned site’s Redis DB after a search-replace.

The most important hardening is the email kill-switch, because the number-one way staging humiliates you is by emailing real customers. The mu-plugin at /home/staging/public_html/wp-content/mu-plugins/staging-disable-email.php returns true from the pre_wp_mail filter (short-circuiting the entire send) and, belt-and-suspenders, clears all recipients and attachments on phpmailer_init. Combined with wp option update blog_public 0 and a Disallow: / robots.txt, staging can’t email anyone or get indexed. The catch is that every re-clone wipes out all three safeguards: the rsync overwrites the mu-plugin and the Redis DB setting in wp-config.php, and the database import resets blog_public. All three have to be re-applied (and the FLUSHDB re-run) every single time, or the clone quietly re-arms itself to act like production.
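Because those safeguards have to be restored after every re-clone, they are a natural candidate for one script. A dry-run-first sketch: the docroot, Redis DB 4 and the blog_public setting come from the setup above, while the $KEEP location for a pristine copy of the mu-plugin and the function names are assumptions.

```shell
# Re-arm staging safeguards after a re-clone. Prints the commands by default;
# set RUN=1 to execute them.
# Assumption: a pristine copy of the mu-plugin lives outside the docroot at
# $KEEP (hypothetical path), so the rsync cannot touch it.
DOCROOT=/home/staging/public_html
KEEP=/home/staging/keep/staging-disable-email.php

step() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

rearm_staging() {
  step install -D -m 0644 "$KEEP" "$DOCROOT/wp-content/mu-plugins/staging-disable-email.php" &&
  step wp --path="$DOCROOT" config set WP_REDIS_DATABASE 4 --raw &&
  step wp --path="$DOCROOT" option update blog_public 0 &&
  step redis-cli -n 4 FLUSHDB
}
```

The flush goes last on purpose, so nothing cached before the config and option changes survives. Running `rearm_staging` with no environment set only prints the four commands, which makes it safe to eyeball before `RUN=1 rearm_staging`.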

The compromise that proved the isolation

panel.itahir.com is an OVH VPS (213.32.21.187), Ubuntu under Webmin/Virtualmin, hosting Booknetic QA sites, demo environments, and n8n automation — each tenant under its own Linux user (n8n, sizinzaman, dev, saas, bktest, qatest1/2, env1/2…). Webmin (:10000) and Usermin (:20000) are still publicly exposed, which is a noted gap that turns out to matter here.

On 2026-05-20 the /home/test/ site was compromised through a vulnerable ElementsKit page-builder plugin — a known RCE vector. The attacker hid compiled binaries in a fake double-vendor path mimicking phpseclib: wp-content/plugins/elementskit/libs/composer/vendor/build/vendor/src/phpseclib/.../Reductions/.tmp. Running as the unprivileged test user, it opened 600+ outbound HTTPS connections doing WordPress credential-stuffing, which tripped an OVH abuse report (ticket +CQBQBDPPTK.1fb6). Containment was blunt:

sudo pkill -9 -u test -f "elementskit/libs/composer/vendor/build"
find /home -type d -path "*vendor/build/vendor*"
# then: deleted /home/test/ and the test user entirely

The server-wide audit came back clean: no root compromise, no spread to other tenants, no rootkit, no backdoor SSH keys. The reason it stayed contained is precisely the per-user Virtualmin isolation — the blast radius was one site because the attacker could only ever be test. That’s the architectural argument for per-tenant Linux users, validated the hard way. The lessons are unglamorous: patch hygiene on WP plugins matters more than the OS hardening you fuss over, exposed Webmin/Usermin is attack surface you want closed, and vendor/build/vendor mimicking phpseclib is a concrete IOC worth grepping for across any shared host.
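The IOC sweep is the same find used during containment, wrapped so the scan root is a parameter. A minimal sketch; the function name is hypothetical, and an empty result is not proof that a host is clean.

```shell
# List directories whose path contains the nested vendor/build/vendor
# pattern seen in this compromise.
find_fake_vendor_dirs() {
  find "$1" -type d -path '*vendor/build/vendor*' 2>/dev/null
}
```

Usage: `find_fake_vendor_dirs /home`. Any hit deserves a manual look, since some legitimate build tooling could plausibly produce a similar path.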

The droplet where I did security properly

The cortex droplet (DigitalOcean, 159.89.9.22, Ubuntu 24.04.3, 2 vCPU / 3.8 GB / 116 GB) is the deliberate counterpoint. I provisioned it specifically so an experimental autonomous-agent + Docker workload would NOT share a box with production tenants — and after the panel compromise, that reasoning felt earned rather than paranoid.

It’s SSH-key only. UFW allows SSH and nothing else. The board UI binds 127.0.0.1 and you reach it through a tunnel (ssh -L 7878:127.0.0.1:7878) — the same loopback-first instinct as Netdata, but a tunnel instead of nginx+Basic-Auth. fail2ban is on. The bootstrap (scripts/install-droplet.sh) installs Docker 29.x and a 2 GB swap. Worker containers run claude as the non-root node user, writing deliverables to /work/task-<id>; critically, the host itself doesn’t even have the claude CLI — claude only ever runs inside a sandboxed container, authed by an injected CLAUDE_CODE_OAUTH_TOKEN from ~/.config/cortex/worker-token (generated on the Mac and piped over SSH, never pasted anywhere).
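"SSH-key only" is worth asserting rather than assuming, since an sshd_config drop-in can quietly turn passwords back on. A sketch that reads the effective config from `sshd -T` (which prints lowercase keys) on stdin; the function name is hypothetical, and treating keyboard-interactive as a password path is my assumption about what "key only" should cover.

```shell
# Reads `sshd -T` output on stdin. Succeeds only if password logins are off.
# Assumption: keyboard-interactive counts as a password path too, so it must
# also be "no" when the key is present in the output.
is_key_only() {
  awk '
    $1 == "passwordauthentication"       && $2 != "no" { bad = 1 }
    $1 == "kbdinteractiveauthentication" && $2 != "no" { bad = 1 }
    $1 == "passwordauthentication" { seen = 1 }
    END { exit ((bad || !seen) ? 1 : 0) }
  '
}
```

Usage: `sudo sshd -T | is_key_only && echo "key-only"`. Input with no passwordauthentication line at all fails the check, so an empty or truncated dump cannot pass by accident.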

The honest quirk: the repo at /root/cortex was deployed via tar-over-ssh, not git clone, so there’s no .git on the box — updates mean re-tarring the changed files until I set up the deploy key described in docs/DEPLOY.md. Laid side by side, the fleet is a security-maturity gradient: the droplet does it right (UFW, key-only, loopback UI, non-root sandbox, minimal host); example is pragmatic with named gaps; panel is the lesson I paid for. Self-hosting isn’t about dragging every box to the same bar — it’s about knowing which box deserves which bar, and being honest about the rest.