    4 Techniques to Optimize Your LLM Prompts for Cost, Latency and Performance

By ProfitlyAI | October 29, 2025 | 9 min read


LLMs are capable of automating a wide variety of tasks. Since the launch of ChatGPT in 2022, we have seen more and more AI products on the market that utilize LLMs. However, there are still plenty of improvements to be made in the way we use them. Improving your prompt with an LLM prompt improver and utilizing cached tokens are, for example, two simple techniques you can use to greatly improve the performance of your LLM application.

In this article, I'll discuss several specific techniques you can apply to the way you create and structure your prompts, which will reduce latency and cost and also improve the quality of your responses. The goal is to present these techniques so you can immediately implement them in your own LLM application.

This infographic highlights the main contents of this article. I'll discuss four different techniques to greatly improve the performance of your LLM application with regard to cost, latency, and output quality. I'll cover the use of cached tokens, placing the user question at the end, using prompt optimizers, and building your own customized LLM benchmarks. Image by Gemini.

Why you should optimize your prompt

In a lot of cases, you might have a prompt that works with a given LLM and yields good enough results. However, in many cases, you haven't spent much time optimizing the prompt, which leaves a lot of potential on the table.

I argue that by using the specific techniques I present in this article, you can easily both improve the quality of your responses and reduce costs, without much effort. Just because a prompt and an LLM work doesn't mean the combination is performing optimally, and in many cases, you can see great improvements with very little effort.

Specific techniques to optimize

In this section, I'll cover the specific techniques you can use to optimize your prompts.

Always keep static content early

The first technique I'll cover is to always keep static content early in your prompt. By static content, I mean content that remains the same across multiple API calls.

The reason you should keep the static content early is that all the large LLM providers, such as Anthropic, Google, and OpenAI, utilize cached tokens. Cached tokens are tokens that have already been processed in a previous API request and can therefore be processed more cheaply and quickly. It varies from provider to provider, but cached input tokens are usually priced at around 10% of regular input tokens.

Cached tokens are tokens that have already been processed in a previous API request, and that can be processed more cheaply and faster than regular tokens

This means that if you send the same prompt twice in a row, the input tokens of the second prompt will only cost one tenth of the input tokens of the first prompt. This works because the LLM providers cache the processing of those input tokens, which makes processing your new request cheaper and faster.
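To make the savings concrete, here is a rough back-of-the-envelope calculation. The prices and token counts below are made-up placeholders, so plug in your own provider's rates:

# Rough cost illustration -- all numbers are placeholder assumptions.
input_price_per_1k = 0.0025                      # assumed price per 1K regular input tokens
cached_price_per_1k = 0.1 * input_price_per_1k   # cached tokens at roughly 10% of that

static_tokens = 4000    # long static system prompt (+ document content)
dynamic_tokens = 200    # user question that changes per request

first_request = (static_tokens + dynamic_tokens) / 1000 * input_price_per_1k
later_requests = (static_tokens / 1000 * cached_price_per_1k
                  + dynamic_tokens / 1000 * input_price_per_1k)

print(f"Input cost, first request:  ${first_request:.4f}")   # ~$0.0105
print(f"Input cost, later requests: ${later_requests:.4f}")  # ~$0.0015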


In practice, you take advantage of input-token caching by keeping variables at the end of the prompt.

For example, if you have a long system prompt with a question that varies from request to request, you should structure it something like this:

prompt = f"""
{long_static_system_prompt}

{user_prompt}
"""

    For instance:

prompt = f"""
You are a document expert ...
You should always answer in this format ...
If a user asks about ... you should answer ...

{user_question}
"""

Here we put the static content of the prompt first, before we place the variable content (the user question) last.


In some scenarios, you'll want to feed in document contents. If you're processing a lot of different documents, you should keep the document content towards the end of the prompt:

# if processing different documents
prompt = f"""
{static_system_prompt}
{variable_prompt_instruction_1}
{document_content}
{variable_prompt_instruction_2}
{user_question}
"""

However, suppose you're processing the same documents multiple times. In that case, you can make sure the document's tokens are also cached, by ensuring no variables are placed earlier in the prompt:

# if processing the same documents multiple times
prompt = f"""
{static_system_prompt}
{document_content}  # keep this before any variable instructions
{variable_prompt_instruction_1}
{variable_prompt_instruction_2}
{user_question}
"""

Note that cached tokens usually only kick in if the first 1024 tokens are identical across two requests. For example, if your static system prompt in the example above is shorter than 1024 tokens, you won't get any cached tokens at all.
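If you want to check whether your static prefix actually clears that threshold, you can count the tokens locally before sending any requests. A minimal sketch using the tiktoken library; the encoding name is an assumption, so pick the one that matches your model:

import tiktoken

# Assumed encoding; use the encoding that matches your model.
enc = tiktoken.get_encoding("cl100k_base")

static_prefix = "You are a document expert ..."  # your static system prompt
num_tokens = len(enc.encode(static_prefix))

print(f"Static prefix is {num_tokens} tokens")
if num_tokens < 1024:
    print("Likely too short for prompt caching to kick in")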

# do NOT do this
prompt = f"""
{variable_content}  # <-- this removes all usage of cached tokens
{static_system_prompt}
{document_content}
{variable_prompt_instruction_1}
{variable_prompt_instruction_2}
{user_question}
"""

Your prompts should always be built up with the most static content first (the content that varies the least from request to request), followed by the most dynamic content (the content that varies the most from request to request):

1. If you have a long system and user prompt without any variables, keep that first and add the variables at the end of the prompt.
2. If you're fetching text from documents, for example, and processing the same document more than once, keep the document content before any variable instructions so its tokens are cached as well.

Whether it is document contents or simply a long static prompt, structure it so that the unchanging prefix can make use of caching.
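To confirm that caching is actually kicking in, you can inspect the usage metadata your provider returns. The sketch below assumes the OpenAI Python SDK; the exact field names (such as prompt_tokens_details.cached_tokens) can differ between providers and SDK versions, so treat it as a starting point rather than a definitive recipe:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

long_static_system_prompt = "You are a document expert ..."  # static instructions, ideally > 1024 tokens
user_question = "What is the refund policy?"                 # varies per request

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": long_static_system_prompt},  # static content first
        {"role": "user", "content": user_question},                # variable content last
    ],
)

usage = response.usage
cached = getattr(usage.prompt_tokens_details, "cached_tokens", 0)
print(f"Input tokens: {usage.prompt_tokens}, of which cached: {cached}")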

Question at the end

Another technique you should use to improve LLM performance is to always put the user question at the end of your prompt. Ideally, you organize it so that your system prompt contains all the general instructions, and the user prompt consists solely of the user question, as shown below:

system_prompt = "<general instructions>"

user_prompt = f"{user_question}"

In Anthropic's prompt engineering docs, they state that including the user question at the end can improve performance by up to 30%, especially when you're working with long contexts. Placing the question at the end makes it clearer to the model which task it's trying to achieve, and will in many cases lead to better results.
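When you have a long document in the context, the same principle applies: put the document first and the actual question at the very end, so the last thing the model reads is the task it needs to solve. A minimal sketch in the same f-string style as above (the variable names are placeholders):

system_prompt = "You are a careful document analyst. Answer using only the provided document."

# document_content stays the same across requests; user_question varies.
user_prompt = f"""
<document>
{document_content}
</document>

{user_question}
"""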

Using a prompt optimizer

A lot of the time, when humans write prompts, they end up messy and inconsistent, include redundant content, and lack structure. Thus, you should always feed your prompt through a prompt optimizer.

The simplest prompt optimizer is to ask an LLM to improve this prompt {prompt}, which will give you a more structured prompt with less redundant content, and so on.

An even better approach, however, is to use a dedicated prompt optimizer, such as the ones you can find in OpenAI's or Anthropic's consoles. These optimizers are LLMs specifically prompted and built to optimize your prompts, and will usually yield better results. Additionally, you should make sure to include:

• Details about the task you're trying to achieve
• Examples of tasks the prompt succeeded at, with the input and output
• Examples of tasks the prompt failed at, with the input and output

Providing this extra information will usually yield much better results, and you'll end up with a far better prompt. In many cases, you'll only spend around 10-15 minutes and end up with a much more performant prompt. This makes using a prompt optimizer one of the lowest-effort approaches to improving LLM performance.
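As a minimal sketch of the simplest variant (asking an LLM to improve your prompt while bundling in the task description plus one success and one failure example), something like the following could work; the task, examples, and model name are all placeholders:

from openai import OpenAI

client = OpenAI()

current_prompt = "Extract the total amount from the invoice ..."                 # your existing prompt
success_example = "Input: invoice_17.pdf -> Output: 1,240.50 EUR (correct)"      # where it worked
failure_example = "Input: invoice_23.pdf -> Output: N/A (should be 980.00 EUR)"  # where it failed

improve_request = f"""
Improve the following prompt: make it clearer, better structured, and remove redundant content.

Task: extract the total amount from invoice documents.

Current prompt:
{current_prompt}

Example where the prompt succeeded:
{success_example}

Example where the prompt failed:
{failure_example}
"""

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": improve_request}],
)
print(response.choices[0].message.content)  # the improved prompt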

Benchmark LLMs

The LLM you use will also significantly impact the performance of your LLM application. Different LLMs are good at different tasks, so you need to try out different LLMs in your specific application area. I recommend at least setting up access to the major LLM providers: Google Gemini, OpenAI, and Anthropic. Setting this up is quite simple, and switching your LLM provider takes a matter of minutes if you already have credentials in place. Additionally, you can consider testing open-source LLMs as well, though they usually require more effort.

You then have to set up a specific benchmark for the task you're trying to achieve and see which LLM works best. Additionally, you should regularly re-check model performance, since the big LLM providers frequently upgrade their models without necessarily releasing a new version. You should, of course, also be ready to test any new models coming out from the large providers.
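A minimal benchmark harness could look like the sketch below. The test cases, model names, and scoring function are placeholder assumptions; replace them with examples and a metric that actually reflect your task, and extend the loop to other providers by swapping in their clients:

from openai import OpenAI

client = OpenAI()

# Placeholder test set: (input, expected answer) pairs for your specific task.
test_cases = [
    ("What is the capital of France?", "Paris"),
    ("What is 12 * 8?", "96"),
]

models = ["gpt-4o-mini", "gpt-4o"]  # placeholder model names

def score(answer: str, expected: str) -> float:
    # Naive substring match; replace with a metric that fits your task.
    return 1.0 if expected.lower() in answer.lower() else 0.0

for model in models:
    total = 0.0
    for question, expected in test_cases:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        total += score(response.choices[0].message.content, expected)
    print(f"{model}: {total / len(test_cases):.0%} correct")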

Conclusion

In this article, I've covered four different techniques you can use to improve the performance of your LLM application. I discussed the use of cached tokens, placing the question at the end of the prompt, using prompt optimizers, and creating specific LLM benchmarks. These are all relatively simple to set up and can lead to a significant performance boost. I believe many similar and simple techniques exist, and you should always be on the lookout for them. These topics are usually covered in various blog posts, and Anthropic's is one of the blogs that has helped me improve LLM performance the most.

👉 Find me on socials:

    📩 Subscribe to my newsletter

    🧑‍💻 Get in touch

    🔗 LinkedIn

    🐦 X / Twitter

    ✍️ Medium

