    How We Reduced LLM Costs by 90% with 5 Lines of Code

By ProfitlyAI · August 21, 2025 · 10 min read


You know that feeling when everything seems to be working just fine, until you look under the hood and realize your system is burning 10× more fuel than it needs to?

We had a client script firing off requests to validate our prompts, built with async Python code and running smoothly in a Jupyter notebook. Clean, simple, and fast. We ran it regularly to test our models and collect evaluation data. No red flags. No warnings.

But beneath that polished surface, something was quietly going wrong.

We weren't seeing failures. We weren't getting exceptions. We weren't even noticing slowness. But our system was doing far more work than it needed to, and we didn't realize it.

In this post, we'll walk through how we discovered the issue, what caused it, and how a simple structural change in our async code reduced LLM traffic and cost by 90%, with virtually no loss in speed or functionality.

Now, fair warning: reading this post won't magically slash your LLM costs by 90%. But the takeaway here is broader: small, overlooked design decisions, sometimes just a few lines of code, can lead to large inefficiencies. And being intentional about how your code runs can save you time, money, and frustration in the long run.

The fix itself may feel niche at first. It involves the subtleties of Python's asynchronous behavior: how tasks are scheduled and dispatched. If you're familiar with Python and async/await, you'll get more out of the code examples, but even if you're not, there's still plenty to take away. Because the real story here isn't just about LLMs or Python; it's about responsible, efficient engineering.

    Let’s dig in.

    The Setup

To automate validation, we use a predefined dataset and trigger our system through a client script. The validation focuses on a small subset of the dataset, so the client code only stops after receiving a certain number of responses.

Here's a simplified version of our client in Python:

import asyncio
from aiohttp import ClientSession
from tqdm.asyncio import tqdm_asyncio

URL = "http://localhost:8000/example"
NUMBER_OF_REQUESTS = 100
STOP_AFTER = 10

async def fetch(session: ClientSession, url: str) -> bool:
    async with session.get(url) as response:
        body = await response.json()
        return body["value"]

async def main():
    results = []

    async with ClientSession() as session:
        # Build one fetch coroutine per request in the dataset.
        tasks = [fetch(session, URL) for _ in range(NUMBER_OF_REQUESTS)]

        for future in tqdm_asyncio.as_completed(tasks, total=NUMBER_OF_REQUESTS, desc="Fetching"):
            response = await future
            if response is True:
                results.append(response)
                if len(results) >= STOP_AFTER:
                    print(f"\n✅ Stopped after receiving {STOP_AFTER} true responses.")
                    break

asyncio.run(main())

This script reads requests from a dataset, fires them concurrently, and stops as soon as we collect enough true responses for our evaluation. In production, the logic is more complex and based on the variety of responses we need, but the structure is the same.

Let's use a dummy FastAPI server to simulate real behavior:

import asyncio
import fastapi
import uvicorn
import random

app = fastapi.FastAPI()

@app.get("/example")
async def example():
    # Simulate a slow LLM call with a random delay.
    sleeping_time = random.uniform(1, 2)
    await asyncio.sleep(sleeping_time)
    return {"value": random.choice([True, False])}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Now let's fire up that dummy server and run the client. You'll see something like this in the client terminal:

    The progress bar stopped after receiving 10 responses

Can You Spot the Problem?


Nice! Fast, clean, and… wait, is everything working as expected?

On the surface, it looks like the client is doing the right thing: sending requests, getting 10 true responses, then stopping.

But is it?

Let's add a few print statements to our server to see what it's actually doing under the hood:

import asyncio
import fastapi
import uvicorn
import random

app = fastapi.FastAPI()

@app.get("/example")
async def example():
    print("Received a request")
    sleeping_time = random.uniform(1, 2)
    print(f"Sleeping for {sleeping_time:.2f} seconds")
    await asyncio.sleep(sleeping_time)
    value = random.choice([True, False])
    print(f"Returning value: {value}")
    return {"value": value}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Now re-run everything.

You'll start seeing logs like this:

Received a request
Sleeping for 1.11 seconds
Received a request
Sleeping for 1.29 seconds
Received a request
Sleeping for 1.98 seconds
...
Returning value: True
Returning value: False
Returning value: False
...

Take a closer look at the server logs. You'll notice something unexpected: instead of processing just the 14 requests we see in the progress bar, the server handles all 100. Even though the client stops after receiving 10 true responses, it still sends every request up front. As a result, the server has to process all of them.
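If you'd rather count than eyeball the logs, one option is to replace the /example endpoint in the dummy server with a version that numbers each request. This is a minimal sketch of our own (it assumes the app, asyncio, and random objects from the server snippet above), not part of the original setup:

import itertools

# Monotonically increasing request counter; fine here because the event loop is single-threaded.
request_counter = itertools.count(1)

@app.get("/example")
async def example():
    print(f"Received request #{next(request_counter)}")
    await asyncio.sleep(random.uniform(1, 2))
    return {"value": random.choice([True, False])}

With the original client, the counter should reach 100; with the semaphore version introduced below, it stops at roughly 20.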

It's an easy mistake to miss, especially because everything looks correct from the client's perspective: responses come in quickly, the progress bar advances, and the script exits early. But behind the scenes, all 100 requests are sent immediately, regardless of when we decide to stop listening. That results in 10× more traffic than needed, driving up costs, increasing load, and risking rate limits.

So the key question becomes: why is this happening, and how can we make sure we only send the requests we actually need? The answer turned out to be a small but powerful change.

The root of the issue lies in how the tasks are scheduled. In our original code, we create a list of 100 tasks:

tasks = [fetch(session, URL) for _ in range(NUMBER_OF_REQUESTS)]

for future in tqdm_asyncio.as_completed(tasks, total=NUMBER_OF_REQUESTS, desc="Fetching"):
    response = await future

When you pass a list of coroutines to as_completed, Python immediately wraps each coroutine in a Task and schedules it on the event loop. This happens before you start iterating over the loop body. Once a coroutine becomes a Task, the event loop starts running it in the background right away.

as_completed itself doesn't control concurrency; it simply waits for tasks to finish and yields them one by one in the order they complete. Think of it as an iterator over completed futures, not a traffic controller. That means by the time you start looping, all 100 requests are already in flight. Breaking out after 10 true results stops you from processing the rest, but it doesn't stop them from being sent.
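Here's a minimal standalone illustration of that scheduling behavior, using plain asyncio and a hypothetical worker coroutine of our own (no aiohttp or server needed):

import asyncio

async def worker(i: int) -> int:
    # Every worker announces when the event loop actually starts running it.
    print(f"worker {i} started")
    await asyncio.sleep(0.1)
    return i

async def main():
    coros = [worker(i) for i in range(5)]
    results = []
    # as_completed wraps all five coroutines in Tasks up front, so all five
    # "started" lines print even though we break after collecting two results.
    for future in asyncio.as_completed(coros):
        results.append(await future)
        if len(results) >= 2:
            break
    print(f"Collected {len(results)} results, but every worker started running.")

asyncio.run(main())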

To fix this, we introduced a semaphore to limit concurrency. The semaphore acts as a lightweight lock inside fetch so that only a fixed number of requests can start at the same time. The rest stay paused, waiting for a slot. Once we hit our stopping condition, the paused tasks never acquire the lock, so they never send their requests.

Here's the adjusted version:

import asyncio
from aiohttp import ClientSession
from tqdm.asyncio import tqdm_asyncio

URL = "http://localhost:8000/example"
NUMBER_OF_REQUESTS = 100
STOP_AFTER = 10

async def fetch(session: ClientSession, url: str, semaphore: asyncio.Semaphore) -> bool:
    # Only a limited number of fetches can hold the semaphore at once;
    # the rest wait here and never send a request until a slot frees up.
    async with semaphore:
        async with session.get(url) as response:
            body = await response.json()
            return body["value"]

async def main():
    results = []
    semaphore = asyncio.Semaphore(int(STOP_AFTER * 1.5))

    async with ClientSession() as session:
        tasks = [fetch(session, URL, semaphore) for _ in range(NUMBER_OF_REQUESTS)]

        for future in tqdm_asyncio.as_completed(tasks, total=NUMBER_OF_REQUESTS, desc="Fetching"):
            response = await future
            if response:
                results.append(response)
                if len(results) >= STOP_AFTER:
                    print(f"\n✅ Stopped after receiving {STOP_AFTER} true responses.")
                    break

asyncio.run(main())

With this change, we still define 100 requests upfront, but only a small group is allowed to run at the same time: 15 in this example. If we reach our stopping condition early, the event loop stops before launching more requests. This keeps the behavior responsive while cutting out unnecessary calls.

Now the server logs show only around 20 "Received a request / Returning value" entries. On the client side, the progress bar looks just like the original.

    The progress bar stopped after receiving 10 responses

With this change in place, we saw immediate impact: a 90% reduction in request volume and LLM cost, with no noticeable degradation in user experience. It also improved throughput across the team, reduced queuing, and eliminated rate-limit issues with our LLM providers.

This small structural adjustment made our validation pipeline dramatically more efficient without adding much complexity to the code. It's a good reminder that in async systems, control flow doesn't always behave the way you expect unless you're explicit about how tasks are scheduled and when they should run.

Bonus Insight: Closing the Event Loop

If we had run the original client code without asyncio.run, we might have noticed the problem earlier.
For example, if we had used manual event-loop management like this:

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()

Python would have printed warnings such as:

Task was destroyed but it is pending!

These warnings appear when the program exits while there are still unfinished async tasks scheduled on the loop. If we had seen a screen full of them, it likely would have raised a red flag much sooner.

So why didn't we see that warning when using asyncio.run()?

Because asyncio.run() takes care of cleanup behind the scenes. It doesn't just run your coroutine and exit; it also cancels any remaining tasks, waits for them to finish, and only then shuts down the event loop. This built-in safety net prevents those "pending task" warnings from showing up, even when your code quietly launched more tasks than it needed to.

In contrast, when you manually close the loop with loop.close() after run_until_complete(), any leftover tasks that haven't been awaited are still hanging around. Python detects that you're forcefully shutting down the loop while work is still scheduled, and warns you about it.

This isn't to say that every async Python program should avoid asyncio.run() or always use loop.run_until_complete() with a manual loop.close(). But it does highlight something important: you should be aware of what tasks are still running when your program exits. At the very least, it's a good idea to monitor or log any pending tasks before shutdown.
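One lightweight way to do that is to check asyncio.all_tasks() right before your entry point returns. The helper below is our own illustrative sketch (shutdown_report is a hypothetical name, and it assumes the main() client from the examples above), not something from the original pipeline:

import asyncio

async def shutdown_report() -> None:
    # List every task that is still scheduled, excluding the one running this check.
    pending = [
        t for t in asyncio.all_tasks()
        if t is not asyncio.current_task() and not t.done()
    ]
    if pending:
        print(f"⚠️ {len(pending)} task(s) still pending at shutdown:")
        for task in pending:
            print(f"  - {task.get_name()}")

async def run_client() -> None:
    await main()             # the client coroutine from the examples above
    await shutdown_report()  # surface leftover work before asyncio.run() silently cancels it

asyncio.run(run_client())

Note that with either client some fetch tasks will still be pending here, since all 100 are created up front; the difference is that with the semaphore, the pending ones never acquired a slot and so never sent their requests.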

Final Thoughts

By stepping back and rethinking the control flow, we were able to make our validation process dramatically more efficient, not by adding more infrastructure, but by using what we already had more carefully. A few changed lines of code led to a 90% cost reduction with almost no added complexity. It resolved rate-limit errors, reduced system load, and allowed the team to run evaluations more frequently without causing bottlenecks.

It's an important reminder that "clean" async code doesn't always mean efficient code; being intentional about how we use system resources matters. Responsible, efficient engineering is about more than just writing code that works. It's about designing systems that respect time, money, and shared resources, especially in collaborative environments. When you treat compute as a shared asset instead of an infinite pool, everyone benefits: systems scale better, teams move faster, and costs stay predictable.

So whether you're making LLM calls, launching Kubernetes jobs, or processing data in batches, pause and ask yourself: am I only using what I actually need?

Sometimes, the answer and the improvement are just one line of code away.



