Run Large Language Models Entirely On The GPU With Delphi And Vulkan

Developer tinyBigGAMES has created VindexLLM, a GPU-accelerated LLM inference engine written entirely in Delphi. Instead of wrapping existing inference libraries, VindexLLM performs the complete transformer forward pass using Vulkan compute shaders, requiring only the Vulkan runtime included with modern GPU drivers.

The framework loads standard GGUF models, memory-maps model weights, executes attention and feed-forward layers on the GPU, and streams generated tokens back to Delphi applications—all without Python, CUDA toolkits, or external inference runtimes.
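
To give a feel for what memory-mapping a GGUF file involves, here is a minimal sketch in Python (used purely for illustration; it is not VindexLLM code and does not reflect its API). It maps a file read-only and decodes the fixed 24-byte GGUF header, assuming the version 2 or later layout in which the two counts are 64-bit little-endian integers. The function name `read_gguf_header` is hypothetical.

```python
import mmap
import struct

# Fixed GGUF header (assumed v2+ layout): 4-byte magic, uint32 version,
# uint64 tensor count, uint64 metadata key/value count, all little-endian.
HEADER_FORMAT = "<4sIQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(path):
    """Memory-map a file read-only and decode its GGUF header (hypothetical helper)."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded lazily by the OS,
        # so even a multi-gigabyte model is not read into memory up front.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            header = mm[:HEADER_SIZE]  # copies only the first 24 bytes
    magic, version, tensor_count, kv_count = struct.unpack(HEADER_FORMAT, header)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

The point of mapping rather than reading is that tensor data further into the file can be addressed directly by offset, and the operating system decides which pages actually need to be resident.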

VindexLLM includes native support for interactive chat, persistent SQLite-backed memory, Retrieval-Augmented Generation (RAG), configurable sampling, and streaming token generation. It also introduces TurboQuant (TQ3), a custom 3-bit quantization format designed to dramatically reduce KV cache memory while maintaining inference quality.
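
To see why the bit width of the KV cache matters, the back-of-the-envelope calculation below compares an idealized 16-bit cache with an idealized 3-bit one. It is a sketch under stated assumptions, written in Python for illustration only: the model dimensions are hypothetical round numbers rather than the published configuration of any Gemma 3 model, and the 3-bit figure ignores whatever per-block scales or metadata a real format such as TQ3 may store.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_value):
    """Size of a KV cache in bytes, ignoring any per-block metadata."""
    # One key vector and one value vector per layer, per KV head, per token.
    values = 2 * n_layers * n_kv_heads * head_dim * context_len
    # Round up to whole bytes.
    return (values * bits_per_value + 7) // 8


# Hypothetical dimensions, chosen only to make the arithmetic concrete.
N_LAYERS = 26
N_KV_HEADS = 4
HEAD_DIM = 256
CONTEXT_LEN = 8192

fp16_bytes = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT_LEN, 16)
q3_bytes = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT_LEN, 3)

print(f"16-bit cache: {fp16_bytes / 2**20:.0f} MiB")
print(f" 3-bit cache: {q3_bytes / 2**20:.0f} MiB")
```

Whatever dimensions are plugged in, the idealized ratio is 16/3, or roughly 5.3 times less memory. Because the cache grows linearly with context length, the saving matters most for long conversations.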

Currently supporting Gemma 3 GGUF models, VindexLLM demonstrates what’s possible when modern LLM inference is implemented natively in Object Pascal instead of relying on external AI runtimes.

Explore how VindexLLM brings native GPU-powered LLM inference to Delphi using Vulkan.
