devinfo.dev

inspiration

Merging Is Not Training

Model merging combines two or more fine-tuned LLMs into a single model without any gradient updates. No data. No compute budget. No training run. The result inherits capabilities from every source model — if you pick the right algorithm.

June 12, 2026

#model-merging #mergekit #fine-tuning #inference

inspiration

The Tokenizer Is the Bug

Every LLM failure starts with the same invisible step: tokenization. It runs before inference, produces no logs, and degrades outputs silently. Most debugging sessions end at the model. They should start at the tokenizer.

June 11, 2026

#tokenization #llm #inference #engineering

inspiration

GGUF Is a Container, Not Just Weights

Every self-hosted AI practitioner downloads .gguf files. Few understand what they are. GGUF is not a weight dump — it is a self-contained container that carries the model, the tokenizer, the quantization scheme, and the chat template in a single file. That design decision changed how open-source models are distributed.

June 10, 2026

#gguf #inference #llama-cpp #self-hosted

inspiration

Flash Attention Is an IO Problem

Standard attention is slow not because of arithmetic — it is slow because of memory traffic. Flash Attention solves the IO problem, not the compute problem. That distinction matters for how you think about every inference optimization that follows.

June 9, 2026

#inference #attention #transformers #gpu

whitepaper

Evals Are Not Optional

Benchmark scores are not evaluations. Contamination is widespread, Goodhart's Law is in effect, and the gap between a leaderboard number and production behaviour stays open until you build a real eval pipeline. This paper defines what evals are, why the major benchmarks are unreliable in isolation, and how to build an evaluation practice that actually catches failures.

June 8, 2026

#evals #benchmarks #llm #engineering

inspiration

Prefix Caching Is Free Throughput

Automatic Prefix Caching in vLLM reuses already-computed KV cache blocks across requests that share identical prefixes — delivering 30–50% throughput gains and up to 10x latency reduction at zero engineering cost beyond a single configuration flag.

June 8, 2026

#inference #vllm #performance #prefix-caching

inspiration

Continuous Batching: The Throughput Multiplier

Static batching wastes GPU cycles waiting for the slowest request. Continuous batching fixes this at the scheduler level — and the gains are not marginal.

June 7, 2026

#inference #vllm #throughput #self-hosted

inspiration

LoRA Is Not Fine-Tuning

LoRA does not update your model. It adds a thin, low-rank correction on top — and that distinction changes how you think about deployment, switching, and scale.
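
As a rough illustration of the "low-rank correction on top" idea, the sketch below computes an effective weight as the frozen base plus a scaled product of two small matrices. The function names, the tiny matrices, and the `alpha / r` scaling convention are illustrative assumptions, not code taken from the article or from any particular library.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * (B @ A) without modifying W.

    W: d_out x d_in frozen base weights
    B: d_out x r, A: r x d_in (the low-rank adapter factors)
    """
    r = len(A)                      # adapter rank = number of rows in A
    scale = alpha / r
    delta = matmul(B, A)            # d_out x d_in correction of rank <= r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Hypothetical 3x3 base weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[0.5], [0.0], [-0.5]]
A = [[1.0, 2.0, 0.0]]

effective = lora_effective_weight(W, A, B, alpha=2.0)

# The base stays untouched, so swapping adapters means swapping only A and B.
full_params = len(W) * len(W[0])                 # d_out * d_in
adapter_params = len(A) * (len(W) + len(W[0]))   # r * (d_out + d_in)
```

At realistic layer sizes the adapter parameter count is a small fraction of the base, which is why one base model can plausibly serve many adapters.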

June 5, 2026

#lora #fine-tuning #inference #model-adaptation

inspiration

Steering Is Not Prompting

Prompts influence what a model says. Activation steering changes what the model is, mid-inference. They are not the same tool.

June 4, 2026

#inference #mechanistic-interpretability #activation-steering #llm-internals

inspiration

The KV Cache Is Your Real Memory Budget

The KV cache — not the model weights — is what limits how many tokens you can generate and how many requests you can serve. Understanding it changes how you provision hardware and tune inference.
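
A back-of-the-envelope sizing sketch may make the budget concrete. It assumes a standard decoder that stores one key and one value vector per layer, per KV head, per token; the model dimensions below are hypothetical and chosen only to keep the arithmetic readable.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache one token occupies: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_bytes(num_tokens, **model_dims):
    """Total KV cache bytes for num_tokens cached tokens (summed over all requests)."""
    return num_tokens * kv_cache_bytes_per_token(**model_dims)


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 cache (2 bytes).
dims = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)

per_token = kv_cache_bytes_per_token(**dims)   # 131072 bytes = 128 KiB
total = kv_cache_bytes(8192, **dims)           # 8192 tokens -> exactly 1 GiB
print(f"{per_token / 1024:.0f} KiB per token, {total / 2**30:.1f} GiB for 8192 tokens")
```

Because the total scales with the number of cached tokens across every in-flight request, concurrency and context length draw from the same pool.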

June 3, 2026

#inference #kv-cache #memory #llm-serving

inspiration

Attention Sinks: The Tokens That Hold Everything Together

Transformers quietly route a disproportionate share of attention to their first tokens — not because those tokens are important, but because softmax needs somewhere to put mass. Understanding this changes how you think about KV cache design.

June 2, 2026

#inference #transformers #kv-cache #attention

whitepaper

Fine-Tuning, RAG, or Prompting: An Engineering Decision

Three techniques can improve LLM output quality: prompt engineering, retrieval-augmented generation, and fine-tuning. Each solves a different problem. Choosing the wrong one wastes months and produces worse results than the right one done simply.

June 1, 2026

#fine-tuning #rag #prompt-engineering #llm

inspiration

The Tool Is Not the Model

A language model does not execute functions. It describes them. The execution lives elsewhere — in your code, your runtime, your responsibility.

June 1, 2026

#tool-use #function-calling #agents #llm

inspiration

Embeddings Are Not Optional

Every RAG pipeline, semantic search index, and similarity feature runs on embeddings. The generation model gets the credit. The embedding model does the work.

May 31, 2026

#embeddings #rag #local-inference #vector-search

inspiration

Temperature Is Not Creativity

Temperature is a probability reshaper, not a creativity dial. Calling it a creativity parameter is a category error — one that leads to misconfigured systems and wasted inference budget.
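
The "probability reshaper" framing can be sketched in a few lines: temperature divides the logits before softmax, which sharpens or flattens the distribution without changing the ranking of tokens. The logits below are made-up values for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]                     # hypothetical next-token logits

sharp = softmax_with_temperature(logits, 0.5)   # mass concentrates on the top token
flat = softmax_with_temperature(logits, 2.0)    # mass spreads toward the tail
```

Both outputs keep the same token ordering; only the amount of probability mass on the lower-ranked tokens changes.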

May 30, 2026

#inference #sampling #llm #engineering

inspiration

Retrieval Is the Weakest Link

RAG systems fail at retrieval, not generation. Engineers blame the LLM. The problem is upstream.

May 29, 2026

#rag #retrieval #embeddings #ai-engineering

inspiration

Prompt Caching Is Free Money

Every time your app resends the same system prompt, you pay to compute it again. Prompt caching eliminates that cost by reusing precomputed KV tensors across requests. It requires no code changes and delivers up to 90% input token savings.

May 28, 2026

#inference #optimization #cost #llm

booklet

From Free Tier to Sovereignty: Running Inference on Cloud ARM Instances

Free tier cloud compute promises self-hosted AI. The reality is capacity lotteries, region lock-in, and silent deprecation. This booklet documents what actually works, what does not, and how to build an inference setup that survives policy changes.

May 27, 2026

#cloud #arm #oci #sovereignty #inference #self-hosted

booklet

Ollama Beyond Defaults: Custom Model Paths on Windows and WSL

Ollama assumes default paths. When your models live elsewhere, the documentation stops helping. This booklet covers every configuration path for Windows native, WSL2, and cross-boundary access.

May 27, 2026

#ollama #windows #wsl #self-hosted #inference

booklet

OpenCode with Local Models: Pointing Your Coding Agent at Your Own Inference

OpenCode is a terminal-first AI coding agent. It expects cloud APIs by default. This booklet shows how to wire it to Ollama, vLLM, or any OpenAI-compatible local endpoint — and what breaks when you do.

May 27, 2026

#opencode #ollama #coding-agent #local-inference #self-hosted

inspiration

Structured Outputs Are a Contract

Constrained generation is not a convenience feature. It is a systems boundary — a contract between your model and every downstream component that consumes its output.

May 27, 2026

#structured-outputs #constrained-decoding #inference #llm-engineering

booklet

The LocalLLM Engine Stack: One API, Multiple Backends, Zero Lock-in

A single OpenAI-compatible endpoint that routes across Ollama, llama.cpp, and FreeLLMAPI with automatic failover. This booklet documents the architecture, routing logic, and deployment of the localllm-engine.

May 27, 2026

#localllm-engine #inference #routing #self-hosted #architecture

inspiration

The Model Is Not the Agent

An LLM does not call tools. It requests them. The loop is the agent — and most broken agents are broken loops, not broken models.

May 26, 2026

#tool-use #agents #llm #engineering

whitepaper

Choosing Your Inference Engine: llama.cpp, Ollama, and vLLM

llama.cpp, Ollama, and vLLM are not interchangeable. They solve different problems at different scales. This paper maps the architectural differences, performance characteristics, and deployment tradeoffs to help you pick the right engine for your workload — and understand why the wrong choice costs you in ways that are hard to undo.

May 25, 2026

#inference #llm #vllm #llama.cpp #ollama #self-hosted

inspiration

Speculative Decoding: The Free Tokens

Speculative decoding cuts inference latency 2–3x without changing a single output token. The gain is real. So is the catch.

May 25, 2026

#inference #llm #optimization #latency

inspiration

Quantization Is a Design Decision

Quantization is not just compression. It is a tradeoff you are making about accuracy, speed, and memory — and it belongs in your architecture docs, not your deployment scripts.

May 24, 2026

#quantization #inference #llm #systems

inspiration