May 14, 2026 · ~12 minute read · Storage & power on local LLM inference
What two RTX 3090s, an 8-cell cold-start sweep, and a power-cap experiment taught me about where the seconds and watts actually go. Three assumptions the data didn't support, one 12× ratio that surprised me, and a 250 W power cap that gives back 36% of GPU power for an 11% throughput cost.
April 2026 · ~18 minute read · Hashnode
Standing up vLLM nightly and llama.cpp on the same 3090 with the same Qwen3.6-27B model, and discovering that two inference engines on identical hardware yield a 4× difference in usable context. Hybrid Mamba-attention architecture accounting, a quantization comparison, and the prompt-cache mechanics behind the gap.