From Custom Kernels to Multimodal Streams: The New Frontier of Agentic Hardware and Software
The landscape of AI agent construction is undergoing a dual-track evolution. On one end, high-level orchestration libraries are moving away from simple text-based interactions toward complex, multimodal message streams. On the other, low-level hardware optimization is becoming more accessible, allowing developers to write custom kernels that squeeze every drop of performance out of their silicon.
For the modern agent builder, understanding these two ends of the stack—the software abstraction and the hardware execution—is no longer optional. Recent updates to the llm library and the introduction of Pallas for JAX represent a significant shift in how we build, optimize, and deploy AI agents on local hardware.
The Software Layer: Orchestrating Multimodal Agents
For a long time, the industry standard for interacting with Large Language Models (LLMs) was the “text-in, text-out” paradigm. You sent a string, and you received a string. However, as Simon Willison notes in the release of llm 0.32a0, this model is no longer sufficient for the “frontier models” of today [1].
Moving Beyond Simple Prompts
The latest refactor of the llm library highlights a fundamental shift: prompts are no longer just strings; they are sequences of messages. This reflects the reality of how modern agents operate—through multi-turn conversations where the context includes not just text, but a history of interactions, system instructions, and tool outputs [1].
This message-based approach is critical for agent builders because it allows for:
- Conversational State Management: Treating inputs as a sequence of turns makes it easier to implement “memory” within an agentic loop.
- Multimodal Integration: Modern agents must process more than just text. The updated llm architecture now natively supports “attachments,” allowing builders to pass images, audio, and video directly into the model stream [1].
- Structured Outputs and Tool Use: To be “agentic,” a model must do more than talk; it must act. The inclusion of schemas for JSON output and dedicated tool-calling support ensures that agents can interact with external APIs and local file systems reliably [1].
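To make the "sequence of messages" idea concrete, here is a minimal sketch in plain Python. It is not the llm library's API: the class names (`TextPart`, `AttachmentPart`, `ToolCallPart`, `Message`), the `text_only` helper, and the file name `photo.jpg` are all hypothetical, chosen only to show how a conversation can carry differently typed parts.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union


@dataclass
class TextPart:
    text: str


@dataclass
class AttachmentPart:
    path: str       # hypothetical local file path
    mime_type: str  # e.g. "image/jpeg"


@dataclass
class ToolCallPart:
    name: str                  # hypothetical tool name
    arguments: Dict[str, Any]  # JSON-style arguments for the tool


Part = Union[TextPart, AttachmentPart, ToolCallPart]


@dataclass
class Message:
    role: str  # "system", "user", "assistant", or "tool"
    parts: List[Part] = field(default_factory=list)


def text_only(messages: List[Message]) -> str:
    """Flatten a conversation to the old text-only view, dropping non-text parts."""
    return "\n".join(
        f"{m.role}: {p.text}"
        for m in messages
        for p in m.parts
        if isinstance(p, TextPart)
    )


conversation = [
    Message("system", [TextPart("You are a helpful vision assistant.")]),
    Message("user", [
        TextPart("What is in this image?"),
        AttachmentPart("photo.jpg", "image/jpeg"),
    ]),
    Message("assistant", [ToolCallPart("describe_image", {"path": "photo.jpg"})]),
]

print(text_only(conversation))
```

Flattening with `text_only` drops the image and the tool call, which is the information a plain string prompt cannot represent.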
The Hardware Implications of Multimodal Streams
This shift in software abstraction has direct consequences for your hardware “rig.” When an agent processes video or high-resolution images as part of its “thought process,” each attachment is typically encoded into many additional tokens, so the context grows and the VRAM (Video RAM) requirements on your GPU can rise sharply.
While a standard text-based agent might run comfortably on a 12GB or 16GB card, a multimodal agent utilizing the latest message-stream abstractions will benefit from the 24GB found on an RTX 3090/4090 or the massive 48GB+ buffers on professional-grade hardware like the RTX 6000 Ada. The ability to handle “differently typed parts” in a single response stream means your hardware must be ready to decode and process various data formats simultaneously without bottlenecking the inference engine [1].
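A rough back-of-the-envelope estimate shows where the memory goes. The sketch below assumes a hypothetical 7B-parameter model stored in 16-bit precision with 32 layers, 32 key/value heads, and a head dimension of 128, and it assumes each image costs 576 tokens. All of these figures are illustrative assumptions, not measurements of any particular model, and the estimate ignores activations and framework overhead.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Memory for the model weights alone (2 bytes per parameter for 16-bit)."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int = 2) -> int:
    """Key/value cache size; the leading 2 covers keys plus values."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical model shape (assumption, see the lead-in).
weights = weight_bytes(7_000_000_000)

# Text-only conversation: 2,048 tokens of context.
text_only_kv = kv_cache_bytes(32, 32, 128, n_tokens=2_048)

# Same conversation plus 10 images at an assumed 576 tokens each.
multimodal_kv = kv_cache_bytes(32, 32, 128, n_tokens=2_048 + 10 * 576)

print(f"weights:       {weights / GIB:.2f} GiB")
print(f"text-only KV:  {text_only_kv / GIB:.2f} GiB")
print(f"multimodal KV: {multimodal_kv / GIB:.2f} GiB")
```

Under these assumptions the cache grows linearly with the number of tokens, so every attachment adds a fixed amount of memory on top of the weights.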
The Hardware Layer: Low-Level Optimization with Pallas
While high-level libraries like llm handle how we talk to models, tools like Pallas determine how efficiently those models run on our hardware. For builders using the JAX ecosystem, Pallas represents a breakthrough in making custom kernel programming accessible to those who are not yet CUDA or Triton experts [2].
What is Pallas?
Pallas is an extension for JAX designed to bridge the gap between high-level mathematical operations and low-level hardware execution. Traditionally, writing custom kernels—the specialized code that tells a GPU or TPU exactly how to handle a specific mathematical operation—required deep knowledge of hardware architecture and specialized languages like CUDA or C++.
Pallas changes this by allowing developers to write these kernels in Python using JAX’s familiar array syntax, which is then compiled into efficient low-level code [2]. This is particularly relevant for agent builders who are pushing the boundaries of what local hardware can do.
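As an illustration of what such a kernel looks like, here is a minimal element-wise addition written with `jax.experimental.pallas`. It is a sketch rather than an optimized kernel: the names `add_kernel` and `add_vectors` are made up for this example, and `interpret=True` is set on the assumption that you want it to run on a machine without a supported GPU or TPU.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl


def add_kernel(x_ref, y_ref, o_ref):
    # Kernel arguments are references (Refs) to array buffers.
    # Reading x_ref[...] loads the values; assigning to o_ref[...] stores the result.
    o_ref[...] = x_ref[...] + y_ref[...]


def add_vectors(x, y):
    return pl.pallas_call(
        add_kernel,
        # Describe the output so Pallas can allocate it.
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        # Interpreter mode: runs the kernel on CPU for experimentation.
        # Remove this argument to compile for a supported GPU or TPU.
        interpret=True,
    )(x, y)


x = jnp.arange(8, dtype=jnp.float32)
y = jnp.ones(8, dtype=jnp.float32)
result = add_vectors(x, y)
print(result)
```

The kernel body reads like ordinary array code, but it is written against references to buffers rather than immutable arrays, which is what lets you control when data is loaded and stored.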
Why Custom Kernels Matter for Agents
You might wonder why an agent builder needs to care about kernels. The answer lies in latency and efficiency:
- Memory Bandwidth Optimization: Agents often perform repetitive tasks or require specialized attention mechanisms. Custom kernels can optimize how data moves between the GPU’s global memory and its fast on-chip SRAM, reducing the bottlenecks that slow down agent response times.
- Specialized Operations: As agents become more specialized (e.g., agents for scientific computing or real-time signal processing), they may require mathematical operations that aren’t efficiently covered by standard libraries. Pallas allows builders to implement these operations without sacrificing performance [2].
- Hardware Portability: Pallas is designed to work across different hardware backends, including GPUs and TPUs. This is vital for builders who might develop on a local NVIDIA rig but deploy on cloud-based TPU clusters [2].
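The memory-bandwidth point can be sketched with simple arithmetic. The example below compares a chain of element-wise operations run as separate kernels, where each one reads its input from global memory and writes its output back, against a single fused kernel that reads once and writes once. The function names, the tensor size, and the 900 GB/s bandwidth figure are illustrative assumptions, and the model ignores caches and compute time.

```python
def bytes_moved_unfused(n_elements: int, n_ops: int, bytes_per_element: int = 2) -> int:
    """Each separate kernel reads one tensor and writes one tensor."""
    return n_ops * 2 * n_elements * bytes_per_element


def bytes_moved_fused(n_elements: int, bytes_per_element: int = 2) -> int:
    """A fused kernel reads the input once and writes the final result once."""
    return 2 * n_elements * bytes_per_element


def transfer_time_ms(n_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Lower bound on time spent moving data at the given bandwidth."""
    return n_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


n = 64 * 1024 * 1024  # hypothetical tensor with 64Mi 16-bit elements
unfused = bytes_moved_unfused(n, n_ops=3)
fused = bytes_moved_fused(n)

print(f"unfused: {unfused / 1e6:.0f} MB, {transfer_time_ms(unfused, 900):.3f} ms")
print(f"fused:   {fused / 1e6:.0f} MB, {transfer_time_ms(fused, 900):.3f} ms")
print(f"traffic reduction: {unfused / fused:.0f}x")
```

In this simplified model, fusing three operations cuts global-memory traffic by a factor of three, which is the kind of saving a custom kernel is meant to capture.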
Synthesizing the Stack: The Agent Builder’s Perspective
The convergence of multimodal software abstractions and accessible hardware optimization creates a powerful new toolkit. The table below compares the “old way” of building with the “new way” enabled by these advancements.
Comparison: Traditional vs. Modern Agent Architecture
| Feature | Traditional Approach | Modern Approach (llm 0.32a0 + Pallas) |
|---|---|---|
| Input Format | Single text string | Sequence of multimodal messages [1] |
| Output Type | Unstructured text | Structured JSON & typed parts [1] |
| Hardware Control | Standard library kernels | Custom JAX kernels via Pallas [2] |
| Agent Capability | Text-only chat | Vision, Audio, Tool-use, Reasoning [1] |
| Optimization | Generic (Out of Box) | Hardware-specific (Custom Kernels) [2] |
Practical Application for Local Rigs
If you are building a local rig for these new workflows, your focus should shift toward asymmetric workloads.
The llm library’s move toward “attachments” means your system benefits from a fast NVMe storage backend that can feed image and video data to the GPU without delay [1]. Simultaneously, the ability to write custom kernels via Pallas means that even older hardware can see a performance “second wind.” By optimizing the specific kernels used by your agent’s most frequent tasks, you may recover some of the performance that would otherwise require newer silicon [2].
For instance, if your agent spends 80% of its time summarizing local PDF documents, a custom Pallas kernel optimized for the specific attention patterns of your local model could significantly reduce the “time to first token,” making the agent feel much more responsive and “alive.”
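Before investing in a custom kernel, it helps to measure the metric you want to improve. The sketch below times how long a token stream takes to yield its first item. `fake_token_stream` is a hypothetical stand-in for a real model's streaming output, with a short sleep imitating prompt processing, so the numbers it produces say nothing about any actual model.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_token_stream(prefill_delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stream: sleeps to imitate prompt processing, then yields tokens."""
    time.sleep(prefill_delay_s)
    for token in ["The", " report", " covers", " three", " topics."]:
        yield token


def time_to_first_token(stream: Iterable[str]) -> Tuple[str, float]:
    """Return the first token and the seconds elapsed before it arrived."""
    start = time.perf_counter()
    first = next(iter(stream))
    return first, time.perf_counter() - start


first_token, ttft_s = time_to_first_token(fake_token_stream())
print(f"first token {first_token!r} after {ttft_s * 1000:.1f} ms")
```

Running the same measurement before and after a kernel change gives a direct check on whether the optimization affected responsiveness.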
Conclusion: The Era of the Full-Stack Builder
The release of llm 0.32a0 and the rise of Pallas signal that the “black box” era of AI is ending. We are entering an era where the builder has granular control over both the high-level conversational flow and the low-level hardware execution.
By leveraging message-based, multimodal abstractions, you can build agents that see, hear, and act with precision [1]. By mastering tools like Pallas, you ensure that those agents run with maximum efficiency on the hardware you’ve worked hard to assemble [2]. For the AgentRigs community, the message is clear: the best builds are those that optimize for both the logic of the agent and the physics of the silicon. As orchestration becomes more complex, the ability to dive deep into the kernel level will be the ultimate competitive advantage for independent developers.
Sources & Further Reading
- [1] LLM 0.32a0 Release Notes (Simon Willison’s Blog): A comprehensive look at the major refactor of the LLM library, detailing the shift to message sequences and the new support for multimodal attachments and tool-calling.
- [2] Pallas for Beginners (Hugging Face): An introductory guide to writing custom hardware kernels using JAX and Pallas, explaining how to bridge the gap between high-level Python code and raw GPU/TPU performance.