FFT-LLM-Openweight - Local LLM Deployment and Optimization
Overview
FFT-LLM-Openweight deploys and optimizes open-weight models — Qwen, DeepSeek, Llama, Mistral — on local GPUs. It makes evidence-based choices about quantization (AWQ, GPTQ, FP8, GGUF), serving stack (vLLM, Ollama, llama.cpp, TGI), and inference configuration (CUDA graphs, speculative decoding, KV cache). It is the hands-on counterpart to fft-ml-architect’s strategic guidance.
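As a rough illustration of what those choices look like in practice, the sketch below loads an AWQ-quantized model through vLLM's Python API. The model tag, context length, and memory fraction are illustrative assumptions, not a benchmarked recommendation.

```python
from vllm import LLM, SamplingParams

# Illustrative only: model tag and numeric values are assumptions chosen to
# show where each lever lives, not a tuned or recommended configuration.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",  # hypothetical AWQ checkpoint
    quantization="awq",                    # weight quantization scheme
    gpu_memory_utilization=0.90,           # fraction of VRAM vLLM may claim
    max_model_len=16384,                   # context length the KV cache must cover
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Explain KV cache paging in two sentences."], params)
for out in outputs:
    print(out.outputs[0].text)
```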
Capabilities
- Serving stacks: vLLM as the primary engine, Ollama for rapid iteration, llama.cpp for CPU/edge, TGI when its feature set fits the deployment.
- Quantization: AWQ, GPTQ, FP8, FP16, GGUF — trade-offs grounded in benchmarked throughput and quality.
- Model families: Qwen 2.5/3/3.5, DeepSeek-R1/V3, Llama 3, Mistral, Mixtral.
- GPU optimization: CUDA graphs, tensor parallelism, memory-utilization tuning, compilation modes.
- KV cache strategy: FP8 KV cache, PagedAttention, context-length trade-offs (a configuration sketch follows this list).
- Benchmarking: throughput, TTFT, per-token latency, quality evaluation on real workloads (a measurement sketch also follows this list).
- Hardware awareness: RTX 40/50 series, A100/H100, SM compute capabilities and driver constraints.
- Production patterns: OpenAI-compatible endpoints, streaming, concurrent request handling.
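To make the GPU-optimization and KV-cache items concrete, here is a hedged configuration sketch for a two-GPU box. The checkpoint and every numeric value are assumptions chosen to show which knobs exist, not tuned settings.

```python
from vllm import LLM

# Assumed two-GPU machine; all values below are placeholders for illustration.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",  # example checkpoint, not a recommendation
    tensor_parallel_size=2,       # shard weights across both GPUs
    kv_cache_dtype="fp8",         # FP8 KV cache to fit longer contexts in VRAM
    gpu_memory_utilization=0.85,  # fraction of each GPU's memory vLLM may claim
    max_model_len=32768,          # context length the paged KV cache must cover
    enforce_eager=False,          # leave CUDA graphs enabled
)
```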
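For the benchmarking and production-endpoint items, a minimal sketch of measuring TTFT and inter-token latency against a streaming, OpenAI-compatible server; the base URL, model tag, and prompt are assumptions.

```python
import time
from openai import OpenAI

# Assumes a local OpenAI-compatible server (e.g. vLLM) is already running here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
first_token_at = None
chunk_times = []

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",  # hypothetical model tag
    messages=[{"role": "user", "content": "List three KV-cache trade-offs."}],
    max_tokens=128,
    stream=True,
)

for chunk in stream:
    now = time.perf_counter()
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = now  # time to first token (TTFT)
        chunk_times.append(now)

ttft_ms = (first_token_at - start) * 1000
per_token_ms = (chunk_times[-1] - first_token_at) / max(len(chunk_times) - 1, 1) * 1000
print(f"TTFT: {ttft_ms:.1f} ms, avg inter-token latency: {per_token_ms:.1f} ms")
```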
When to Use
- Deploying a new open-weight model to local or on-premise GPU infrastructure.
- Benchmarking throughput and quality to pick between quantization options.
- Diagnosing inference slowdowns, OOMs, or kernel crashes on specific GPUs (a rough KV-cache sizing check is sketched after this list).
- Choosing between vLLM, Ollama, and llama.cpp for a given workload and hardware.
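For the OOM-diagnosis case, a back-of-the-envelope KV-cache sizing check is often the first step; the model dimensions below are assumptions for a hypothetical 32-layer model with grouped-query attention.

```python
# Rough per-sequence KV-cache size; all model dimensions here are assumptions.
num_layers = 32
num_kv_heads = 8       # KV heads under grouped-query attention, not query heads
head_dim = 128
context_len = 16_384
bytes_per_elem = 2     # FP16/BF16 KV cache; 1 for FP8

# Keys and values (factor of 2), for every token, at every layer.
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem
print(f"KV cache per sequence at {context_len} tokens: {kv_bytes / 2**30:.2f} GiB")
```

Multiplying this by the number of concurrent sequences and comparing against the VRAM left after weights usually shows whether an FP8 KV cache or a lower context limit is the cheaper fix.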
Example Prompts
- “Benchmark Qwen 3.5-27B AWQ vs FP8 on an RTX 5090 for our clinical reasoning workload”
- “Deploy an embeddings model on the secondary GPU alongside the orchestrator and reranker”
- “Diagnose why the vLLM container OOMs at 16k context and propose configuration fixes”