FFT-LLM-Openweight - Local LLM Deployment and Optimization

FFT-LLM-Openweight deploys and optimizes open-weight models — Qwen, DeepSeek, Llama, Mistral — on local GPUs. It makes evidence-based choices about quantization (AWQ, GPTQ, FP8, GGUF), serving stack (vLLM, Ollama, llama.cpp, TGI), and inference configuration (CUDA graphs, speculative decoding, KV cache). It is the hands-on counterpart to fft-ml-architect’s strategic guidance.

  • Serving stacks: vLLM primary, Ollama for rapid iteration, llama.cpp for CPU/edge, TGI when appropriate.
  • Quantization: AWQ, GPTQ, FP8, FP16, GGUF — trade-offs grounded in benchmarked throughput and quality.
  • Model families: Qwen 2.5/3/3.5, DeepSeek-R1/V3, Llama 3, Mistral, Mixtral.
  • GPU optimization: CUDA graphs, tensor parallelism, memory-utilization tuning, compilation modes.
  • KV cache strategy: FP8 KV cache, PagedAttention, context-length trade-offs.
  • Benchmarking: throughput, TTFT, per-token latency, quality evaluation on real workloads.
  • Hardware awareness: RTX 40/50 series, A100/H100, SM compute capabilities and driver constraints.
  • Production patterns: OpenAI-compatible endpoints, streaming, concurrent request handling.
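
The quantization and KV cache trade-offs listed above can be roughed out before any benchmark is run. The sketch below is a back-of-the-envelope estimate, not a measurement. The helper names `weight_gib` and `kv_cache_gib`, the nominal 27-billion parameter count, and the layer/head dimensions are all illustrative assumptions, not the configuration of any particular model. Real checkpoints are somewhat larger than the weight estimate because quantization scales, zero-points, and layers kept in higher precision add overhead. GGUF is left out because its bits per weight depend on the quant type chosen.

```python
GIB = 2**30

# Nominal bits per weight. AWQ and GPTQ are assumed to be 4-bit here.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "AWQ-4bit": 4, "GPTQ-4bit": 4}


def weight_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GiB, ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / GIB


def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int, num_seqs: int = 1) -> float:
    """Approximate KV cache size in GiB.

    The leading 2 accounts for storing both keys and values per layer.
    """
    elems = 2 * num_layers * num_kv_heads * head_dim * seq_len * num_seqs
    return elems * bytes_per_elem / GIB


if __name__ == "__main__":
    # Hypothetical model: 27e9 parameters, 64 layers, 8 KV heads, head_dim 128.
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"weights {name:>9}: {weight_gib(27e9, bits):6.2f} GiB")
    for name, nbytes in (("FP16", 2), ("FP8", 1)):
        size = kv_cache_gib(64, 8, 128, seq_len=16384, bytes_per_elem=nbytes)
        print(f"KV cache {name} @ 16k tokens, 1 sequence: {size:.2f} GiB")
```

Under these assumed dimensions, an FP8 KV cache halves the per-sequence cache cost relative to FP16, and the cost grows linearly with both context length and the number of concurrent sequences. In vLLM the corresponding knobs are typically `--max-model-len`, `--gpu-memory-utilization`, and `--kv-cache-dtype`. Confirm the exact behaviour against the installed version.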

When to use:
  • Deploying a new open-weight model to local or on-premise GPU infrastructure.
  • Benchmarking throughput and quality to pick between quantization options.
  • Diagnosing inference slowdowns, OOMs, or kernel crashes on specific GPUs.
  • Choosing between vLLM, Ollama, and llama.cpp for a given workload and hardware.
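
For the benchmarking use case, the three latency figures named earlier (TTFT, per-token latency, throughput) can be derived from the arrival times of streamed tokens. The following is a minimal sketch under stated assumptions: `StreamStats` and `summarize_stream` are hypothetical names, timestamps are assumed to come from a monotonic clock such as `time.perf_counter()`, and each streamed chunk is treated as one token, which is not guaranteed for every server.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamStats:
    ttft_s: float         # time to first token
    inter_token_s: float  # mean gap between consecutive tokens
    tokens_per_s: float   # end-to-end throughput, including TTFT


def summarize_stream(request_start: float,
                     token_times: Sequence[float]) -> StreamStats:
    """Summarize one streamed response from its token arrival times."""
    if not token_times:
        raise ValueError("no tokens received")
    first, last = token_times[0], token_times[-1]
    n = len(token_times)
    # With a single token there is no gap to average over.
    inter = (last - first) / (n - 1) if n > 1 else 0.0
    total = last - request_start
    throughput = n / total if total > 0 else float("inf")
    return StreamStats(first - request_start, inter, throughput)


if __name__ == "__main__":
    # Synthetic timestamps in seconds, for illustration only.
    print(summarize_stream(0.0, [0.5, 0.6, 0.7, 0.8]))
```

Collecting these per request and comparing distributions across quantization options on the same prompts gives a like-for-like latency picture. Quality still has to be evaluated separately on the real workload.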
"Benchmark Qwen 3.5-27B AWQ vs FP8 on a high-end GPU for a clinical reasoning workload"
"Deploy an embeddings model on the secondary GPU alongside the orchestrator and reranker"
"Diagnose why the vLLM container OOMs at 16k context and propose configuration fixes"