AI & Tech · May 21, 2026

Multi-Stream LLMs: new paper on parallelizing/separating prompts, thinking, I/O

Points: 19 · Comments: 1

Hacker News · May 21 · 1 min read · Single source

The gist

5-point summary · 1 min

Language models now drive autonomous agents, but the core of these systems has not changed much since early instruction-tuned models like ChatGPT.
Even advanced AI agents function on message exchange formats, successively exchanging messages with users, systems, with itself (i.e. chain-of-thought) and tools in a single stream of computation.
This bottleneck to a single stream in chat models leads to a number of limitations: the agent cannot act (generate output) while reading, and in reverse, cannot react to new information while writing.
Similarly, the agent cannot act while thinking and cannot think while reading or acting on information.
The paper proposes instruction-tuning for multiple parallel streams instead, one per role: every forward pass of the language model then simultaneously reads from multiple input streams and generates tokens in multiple output streams, all of which causally depend on earlier timesteps.

May 2026

Source

Abstract: The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI agents function on message exchange formats, successively exchanging messages with users, systems, with itself (i.e. chain-of-thought) and tools in a single stream of computation. This bottleneck to a single stream in chat models leads to a number of limitations: the agent cannot act (generate output) while reading, and in reverse, cannot react to new information while writing. Similarly, the agent cannot act while thinking and cannot think while reading or acting on information. In this work, we show that models can be unblocked by switching from instruction-tuning for sequential message formats to instruction-tuning for multiple, parallel streams of computation, splitting each role into a separate stream. Every forward pass of the language model then simultaneously reads from multiple input streams and generates tokens in multiple output streams, all of which causally depend on earlier timesteps. We argue that this data-driven change remedies a number of usability limitations as outlined above, improves model efficiency through parallelization, improves model security through better separation of concerns and can further improve model monitorability.

Comments: Preprint, 37 pages. Code at this https URL
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as: arXiv:2605.12460 [cs.LG] (or arXiv:2605.12460v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.12460 (arXiv-issued DOI via DataCite, pending registration)
Submission history: From Jonas Geiping. [v1] Tue, 12 May 2026 17:47:41 UTC (871 KB)
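To make the abstract's central idea concrete, here is a toy sketch of a lockstep loop in which each step reads from several input streams and writes to several output streams, using only earlier timesteps as context. It is not the paper's implementation: the stream names, the `BLANK` filler token, `run_multistream`, and the hand-written `toy_step` stand-in for a model are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

BLANK = "_"  # hypothetical filler token for a stream that is idle at a timestep

INPUT_STREAMS = ("user", "tool")          # streams the model reads (assumed names)
OUTPUT_STREAMS = ("thinking", "output")   # streams the model writes (assumed names)

# One dict per timestep, mapping stream name -> token.
History = List[Dict[str, str]]


def run_multistream(
    step: Callable[[History, Dict[str, str]], Dict[str, str]],
    inputs: Dict[str, List[str]],
) -> History:
    """Advance all streams in lockstep, one stubbed forward pass per timestep."""
    n_steps = max(len(tokens) for tokens in inputs.values())
    history: History = []
    for t in range(n_steps):
        # Tokens arriving on each input stream at this timestep.
        arriving = {
            name: inputs[name][t] if t < len(inputs[name]) else BLANK
            for name in INPUT_STREAMS
        }
        # The step sees earlier timesteps plus the current inputs and emits
        # one token for every output stream at once.
        emitted = step(history, arriving)
        history.append({**arriving, **emitted})
    return history


def toy_step(history: History, arriving: Dict[str, str]) -> Dict[str, str]:
    """Stand-in for a model: notes what it reads and answers one step later."""
    thinking = f"saw:{arriving['user']}" if arriving["user"] != BLANK else BLANK
    # The output stream echoes the previous user token, so the loop is already
    # writing while the next user token is still being read.
    previous = history[-1]["user"] if history else BLANK
    output = previous.upper() if previous != BLANK else BLANK
    return {"thinking": thinking, "output": output}


if __name__ == "__main__":
    trace = run_multistream(toy_step, {"user": ["fix", "the", "bug"], "tool": []})
    for t, row in enumerate(trace):
        print(t, row)
```

In this sketch the `output` stream starts emitting at the second timestep while `user` tokens are still arriving, which is the "act while reading" behavior that the abstract says a single message stream cannot provide. A real system would replace `toy_step` with a model forward pass; how the paper encodes and trains on such streams is not covered here.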
