    How to Make Your AI App Faster and More Interactive with Response Streaming

By ProfitlyAI · March 26, 2026 · 9 Mins Read


In my latest posts, I've talked quite a lot about prompt caching, and caching in general, and how it can improve your AI app in terms of cost and latency. However, even for a fully optimized AI app, sometimes the responses are simply going to take a while to generate, and there's nothing we can do about it. When we request large outputs from the model or require reasoning or deep thinking, the model will naturally take longer to respond. Reasonable as that is, waiting longer to receive an answer can be frustrating for the user and degrade their overall experience with an AI app. Fortunately, a simple and easy way to mitigate this issue is response streaming.

Streaming means receiving the model's response incrementally, piece by piece, as it is generated, rather than waiting for the entire response to be generated and then displaying it to the user. Normally (without streaming), we send a request to the model's API, we wait for the model to generate the response, and once the response is complete, we get it back from the API in a single step. With streaming, however, the API sends back partial outputs while the response is being generated. This is a rather familiar concept, because most user-facing AI apps like ChatGPT, from the moment they first appeared, used streaming to show their responses to their users. But beyond ChatGPT and LLMs, streaming is used practically everywhere on the web and in modern applications, for example in live notifications, multiplayer games, or live news feeds. In this post, we're going to explore how we can integrate streaming into our own requests to model APIs and achieve a similar effect in custom AI apps.

There are several different mechanisms for implementing the concept of streaming in an application. For AI applications, however, two types of streaming are widely used. More specifically, these are:

• HTTP Streaming over Server-Sent Events (SSE): A relatively simple, one-way type of streaming, allowing only live communication from server to client.
• Streaming with WebSockets: A more advanced and complex type of streaming, allowing two-way live communication between server and client.

In the context of AI applications, HTTP streaming over SSE can support simple AI applications where we just need to stream the model's response for latency and UX reasons. However, as we move beyond simple request–response patterns into more advanced setups, WebSockets become particularly useful, as they allow live, bidirectional communication between our application and the model's API. For example, in code assistants, multi-agent systems, or tool-calling workflows, the client may need to send intermediate updates, user interactions, or feedback back to the server while the model is still generating a response. Nevertheless, for most simple AI apps where we just need the model to provide a response, WebSockets are usually overkill, and SSE is sufficient.

In the rest of this post, we'll take a closer look at streaming for simple AI apps using HTTP streaming over SSE.

    . . .

    What about HTTP Streaming Over SSE?

    HTTP Streaming Over Server-Sent Events (SSE) relies on HTTP streaming.

    . . .

HTTP streaming means that the server can send whatever it has to send in parts, rather than all at once. This is achieved by the server not terminating the connection to the client after sending a response, but rather leaving it open and pushing each additional event to the client as soon as it occurs.

For example, instead of getting the response in a single chunk:

Hello world!

we could get it in parts using raw HTTP streaming:

Hello

world

!

If we were to implement HTTP streaming from scratch, we would need to handle everything ourselves, including parsing the streamed text, managing any errors, and reconnecting to the server. In our example, using raw HTTP streaming, we would have to somehow explain to the client that 'Hello world!' is conceptually one event, and everything after it would be a separate event. Luckily, there are several frameworks and wrappers that simplify HTTP streaming, one of which is HTTP Streaming over Server-Sent Events (SSE).
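To make the framing problem concrete, here is a minimal sketch in plain Python, with a hypothetical generator of hard-coded byte chunks standing in for a real chunked HTTP response body. The point is what a raw-streaming client actually sees: just bytes, with no built-in notion of where one event ends and the next begins.

```python
def raw_chunks():
    """Stand-in for a chunked HTTP response body (hypothetical data)."""
    yield b"Hello"
    yield b" world"
    yield b"!"

def consume(stream):
    # Concatenate chunks as they arrive. Nothing in the protocol tells us
    # that "Hello world!" is one logical message; without our own framing
    # convention, the client cannot tell events apart.
    buffer = b""
    for chunk in stream:
        buffer += chunk
    return buffer.decode("utf-8")

print(consume(raw_chunks()))
```

Any event boundaries would have to be invented and agreed upon by both sides, which is exactly the gap SSE fills with a standard format.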

    . . .

So, Server-Sent Events (SSE) provide a standardized way to implement HTTP streaming by structuring server outputs into clearly defined events. This structure makes it much easier to parse and process streamed responses on the client side.

Each event typically includes:

• an id
• an event type
• a data payload

or more formally:

id: <unique-event-id>
event: <event-type>
data: <payload>

Our example using SSE might look something like this:

id: 1
event: message
data: Hello world!

But what is an event? Anything can qualify as an event: a single word, a sentence, or thousands of words. What actually counts as an event in our particular implementation is defined by the setup of the API or the server we're connected to.

On top of this, SSE comes with various other conveniences, like automatically reconnecting to the server if the connection is terminated. Another perk is that incoming stream messages are clearly tagged as text/event-stream, allowing the client to handle them correctly and avoid errors.
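To see what the client side gains from this structure, here is a rough sketch of parsing a text/event-stream payload by hand, assuming the standard SSE wire format in which fields are `name: value` lines and a blank line terminates an event. In practice you would use the browser's EventSource or an SSE client library rather than this hand-rolled parser.

```python
def parse_sse(raw: str):
    """Yield each SSE event as a dict of field -> value."""
    event = {}
    for line in raw.splitlines():
        if line == "":              # blank line: the current event is complete
            if event:
                yield event
                event = {}
            continue
        field, _, value = line.partition(":")
        event[field.strip()] = value.strip()
    if event:                       # flush a trailing event with no blank line
        yield event

stream = "id: 1\nevent: message\ndata: Hello world!\n\n"
for ev in parse_sse(stream):
    print(ev)
```

Because every event is delimited the same way, the client no longer has to guess where one message ends and the next begins.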

    . . .

    Roll up your sleeves

Frontier LLM APIs like OpenAI's API or Claude's API natively support HTTP streaming over SSE. This way, integrating streaming into your requests becomes relatively straightforward, as it can be done by changing a parameter in the request (e.g., enabling a stream=True parameter).

Once streaming is enabled, the API no longer waits for the complete response before replying. Instead, it sends back small parts of the model's output as they are generated. On the client side, we can iterate over these chunks and display them progressively to the user, creating the familiar ChatGPT typing effect.

So, let's do a minimal example of this using, as usual, the OpenAI API:

from openai import OpenAI

client = OpenAI(api_key="your_api_key")

stream = client.responses.create(
    model="gpt-4.1-mini",
    input="Explain response streaming in 3 short paragraphs.",
    stream=True,
)

full_text = ""

for event in stream:
    # only print the text delta as text parts arrive
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
        full_text += event.delta

print("\n\nFinal collected response:")
print(full_text)

In this example, instead of receiving a single completed response, we iterate over a stream of events and print each text fragment as it arrives. At the same time, we also collect the chunks into a full response, full_text, to use later if we want to.

    . . .

So, should I just slap stream=True on every request?

The short answer is no. As useful as it is, with great potential for significantly improving user experience, streaming is not a one-size-fits-all solution for AI apps, and we should use our discretion to evaluate where it should be implemented and where not.

More specifically, adding streaming to an AI app is very effective in setups where we expect long responses and we value, above all, the user experience and responsiveness of the app. A typical case would be consumer-facing chatbots.

On the flip side, for simple apps where we expect the responses to be short, adding streaming is unlikely to provide significant gains in user experience and doesn't make much sense. On top of this, streaming only makes sense in cases where the model's output is free text rather than structured output (e.g., JSON data).

Most importantly, the major drawback of streaming is that we are not able to review the full response before displaying it to the user. Remember, LLMs generate tokens one by one, and the meaning of the response forms as the response is generated, not in advance. If we make 100 requests to an LLM with the exact same input, we are going to get 100 different responses. That is to say, no one knows what a response will say before it is completed. Consequently, with streaming activated it is much more difficult to review the model's output before displaying it to the user, or to apply any guarantees on the produced content. We can always try to evaluate partial completions, but partial completions are harder to evaluate, as we have to guess where the model is going. Given that this evaluation needs to be performed in real time, and not just once but repeatedly on different partial responses of the model, the process becomes even more challenging. In practice, in such cases, validation is run on the entire output after the response is complete. However, the issue with this is that at that point it may already be too late, as we may have already shown the user inappropriate content that doesn't pass our validations.
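One common compromise, sketched below with a hypothetical blocklist standing in for a real moderation check, is to keep a rolling buffer of the partial response and re-check it as each chunk arrives, so the stream can be cut off as soon as a violation appears rather than after the full response has been shown:

```python
BLOCKED = {"forbidden"}  # hypothetical moderation rule, not a real list

def moderated_stream(chunks):
    """Yield chunks to the UI, stopping early if the rolling buffer fails the check."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Re-validate the partial response on every chunk.
        if any(word in buffer.lower() for word in BLOCKED):
            yield "[response withheld]"
            return  # stop streaming immediately
        yield chunk

print("".join(moderated_stream(["Hel", "lo the", "re!"])))
```

This doesn't solve the fundamental problem (some content may still slip through before the check trips, and simple substring checks are a crude stand-in for real evaluation), but it bounds how much unreviewed text the user can see.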

    . . .

On my mind

Streaming is a feature that doesn't have an actual impact on the AI app's capabilities, or on its associated cost and latency. Nevertheless, it can have a great impact on the way users perceive and experience an AI app. Streaming makes AI systems feel faster, more responsive, and more interactive, even when the time to generate the complete response remains exactly the same. That said, streaming is not a silver bullet. Different applications and contexts may benefit more or less from introducing it. Like many choices in AI engineering, it's less about what's possible and more about what makes sense for your specific use case.

    . . .

If you made it this far, you might find pialgorithms useful: a platform we've been building that helps teams securely manage organizational data in one place.

    . . .

Loved this post? Join me on 💌 Substack and 💼 LinkedIn

    . . .

All images by the author, unless otherwise mentioned.
