

Run Large Language Models Entirely On The GPU With Delphi And Vulkan

VindexLLM, from developer tinyBigGAMES, is a GPU-accelerated LLM inference engine written entirely in Delphi. Instead of wrapping existing inference libraries, it performs the complete transformer forward pass in Vulkan compute shaders and requires only the Vulkan runtime included with modern GPU drivers.

The framework loads standard GGUF models, memory-maps model weights, executes attention and feed-forward layers on the GPU, and streams generated tokens back to Delphi applications, all without Python, CUDA toolkits, or external inference runtimes.
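To make the "loads GGUF and memory-maps weights" step more concrete, here is a minimal sketch in Python (not Delphi, and not VindexLLM's own code) that memory-maps a file and decodes the fixed-size GGUF header: the 4-byte magic `GGUF`, a 32-bit version, and two 64-bit counts, as used by GGUF version 2 and later. The function names `read_gguf_header`, `write_synthetic_header`, and `demo` are hypothetical, and the demo uses a synthetic header rather than a real model file.

```python
import mmap
import os
import struct
import tempfile

GGUF_MAGIC = b"GGUF"
# After the magic: version (uint32), tensor count (uint64), metadata KV count (uint64).
# Little-endian layout as used by GGUF v2 and later; v1 used 32-bit counts.
HEADER_FORMAT = "<IQQ"
HEADER_SIZE = len(GGUF_MAGIC) + struct.calcsize(HEADER_FORMAT)  # 24 bytes


def read_gguf_header(path):
    """Memory-map the file read-only and decode only the fixed-size header."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        if len(mm) < HEADER_SIZE:
            raise ValueError("file too small to hold a GGUF header")
        if mm[:4] != GGUF_MAGIC:
            raise ValueError("missing GGUF magic bytes")
        version, tensor_count, kv_count = struct.unpack_from(HEADER_FORMAT, mm, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


def write_synthetic_header(path, version=3, tensor_count=2, kv_count=5):
    """Hypothetical helper: writes a header-only file for demonstration."""
    with open(path, "wb") as f:
        f.write(GGUF_MAGIC + struct.pack(HEADER_FORMAT, version, tensor_count, kv_count))


def demo():
    """Round-trip a synthetic header through a temporary file."""
    fd, path = tempfile.mkstemp(suffix=".gguf")
    os.close(fd)
    try:
        write_synthetic_header(path)
        return read_gguf_header(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Because the file is mapped rather than read, only the pages that are actually touched get loaded, which is why memory-mapping suits multi-gigabyte weight files. A real loader would go on to parse the metadata key-value pairs and tensor descriptors that follow this header.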

VindexLLM includes native support for interactive chat, persistent SQLite-backed memory, Retrieval-Augmented Generation (RAG), configurable sampling, and streaming token generation. It also introduces TurboQuant (TQ3), a custom 3-bit quantization format designed to dramatically reduce KV cache memory while maintaining inference quality.
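As a rough illustration of why a 3-bit KV cache format matters, the sketch below estimates cache size at 16 bits versus 3 bits per stored value. The model dimensions are hypothetical and are not taken from Gemma 3 or VindexLLM, and the calculation ignores any per-block scale or metadata overhead, since the source does not describe TQ3's actual layout.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Estimate KV cache size: one key and one value vector per layer, head, and token."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # factor 2 = keys + values
    return values * bits_per_value // 8


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=8192)

fp16 = kv_cache_bytes(bits_per_value=16, **shape)
q3 = kv_cache_bytes(bits_per_value=3, **shape)

if __name__ == "__main__":
    print(fp16 // 2**20, "MiB at 16 bits")  # 1024 MiB
    print(q3 // 2**20, "MiB at 3 bits")     # 192 MiB
```

Under these assumptions the cache shrinks by a factor of 16/3, or roughly 5.3x. A real quantized format would give back some of that saving to the scale factors it stores alongside the packed values.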

Currently supporting Gemma 3 GGUF models, VindexLLM demonstrates what’s possible when modern LLM inference is implemented natively in Object Pascal instead of relying on external AI runtimes.

Explore how VindexLLM brings native GPU-powered LLM inference to Delphi using Vulkan.
