FFT-LLM-Openweight - Local LLM Deployment and Optimization
Overview
FFT-LLM-Openweight deploys and optimizes open-weight models (Qwen, DeepSeek, Llama, Mistral) on local GPUs. It makes evidence-based choices about quantization (AWQ, GPTQ, FP8, GGUF), serving stack (vLLM, Ollama, llama.cpp, TGI), and inference configuration (CUDA graphs, speculative decoding, KV cache). It is the hands-on counterpart to fft-ml-architect’s strategic guidance.
Capabilities
- Serving stacks: vLLM primary, Ollama for rapid iteration, llama.cpp for CPU/edge, TGI when appropriate.
- Quantization: AWQ, GPTQ, FP8, GGUF, with FP16 as the unquantized baseline; trade-offs grounded in benchmarked throughput and quality.
- Model families: Qwen 2.5/3/3.5, DeepSeek-R1/V3, Llama 3, Mistral, Mixtral.
- GPU optimization: CUDA graphs, tensor parallelism, memory-utilization tuning, compilation modes.
- KV cache strategy: FP8 KV cache, PagedAttention, context-length trade-offs.
- Benchmarking: throughput, TTFT, per-token latency, quality evaluation on real workloads.
- Hardware awareness: RTX 40/50 series, A100/H100, SM compute capabilities and driver constraints.
- Production patterns: OpenAI-compatible endpoints, streaming, concurrent request handling.
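The benchmarking metrics listed above (throughput, TTFT, per-token latency) can all be derived from the arrival times of streamed tokens. The following is a minimal sketch, not the agent's actual harness: `LatencyStats` and `summarize_request` are hypothetical names, and the timestamps are assumed to come from whatever streaming client records them.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyStats:
    ttft_s: float               # time to first token, in seconds
    per_token_latency_s: float  # mean gap between consecutive output tokens
    throughput_tok_s: float     # output tokens per second over the whole request


def summarize_request(request_start: float, token_times: Sequence[float]) -> LatencyStats:
    """Summarize one streamed request.

    request_start: time the request was sent (seconds, any monotonic clock).
    token_times:   arrival time of each output token, in ascending order.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")

    n = len(token_times)
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start

    # Per-token latency covers the decode phase only, so it excludes TTFT.
    decode_time = token_times[-1] - token_times[0]
    per_token = decode_time / (n - 1) if n > 1 else 0.0

    throughput = n / total if total > 0 else float("inf")
    return LatencyStats(ttft, per_token, throughput)


# Illustrative timestamps (made up, not measured on any hardware).
stats = summarize_request(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(stats)
```

Keeping TTFT separate from per-token latency matters when comparing configurations, since prefill and decode tend to respond differently to quantization and batching choices.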
When to Use
- Deploying a new open-weight model to local or on-premise GPU infrastructure.
- Benchmarking throughput and quality to pick between quantization options.
- Diagnosing inference slowdowns, OOMs, or kernel crashes on specific GPUs.
- Choosing between vLLM, Ollama, and llama.cpp for a given workload and hardware.
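When diagnosing OOMs, a quick first check is how much memory the KV cache needs at the target context length. The sketch below uses the standard estimate for multi-head or grouped-query attention (one K and one V tensor per layer); it does not apply to architectures that compress the cache differently. The function name `kv_cache_bytes` and the model dimensions are illustrative assumptions, not values read from any specific checkpoint.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # 2 for FP16/BF16, 1 for an FP8 KV cache
) -> int:
    """Estimate KV cache size in bytes for standard MHA/GQA attention."""
    # Factor 2: each layer stores both a key tensor and a value tensor.
    return (
        2 * num_layers * num_kv_heads * head_dim
        * context_len * batch_size * bytes_per_elem
    )


GIB = 1024 ** 3

# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
fp16 = kv_cache_bytes(32, 8, 128, context_len=16_384, bytes_per_elem=2)
fp8 = kv_cache_bytes(32, 8, 128, context_len=16_384, bytes_per_elem=1)

print(f"FP16 KV cache at 16k context: {fp16 / GIB:.2f} GiB per sequence")
print(f"FP8  KV cache at 16k context: {fp8 / GIB:.2f} GiB per sequence")
```

The estimate is per sequence, so it scales linearly with the number of concurrent requests. Comparing it against the memory left after loading weights gives a rough sense of whether a shorter context, an FP8 KV cache, or lower concurrency is the more promising fix.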
Example Prompts
- "Benchmark Qwen 3.5-27B AWQ vs FP8 on a high-end GPU for a clinical reasoning workload"
- "Deploy an embeddings model on the secondary GPU alongside the orchestrator and reranker"
- "Diagnose why the vLLM container OOMs at 16k context and propose configuration fixes"