    HNSW at Scale: Why Your RAG System Gets Worse as the Vector Database Grows

    By ProfitlyAI | January 7, 2026


    If you use a modern vector database (Neo4j, Milvus, Weaviate, Qdrant, Pinecone), there is a very high chance that Hierarchical Navigable Small World (HNSW) is already powering your retrieval layer. Quite possibly you did not choose it while building the database, nor did you tune it or even know it was there. And yet, HNSW is quietly deciding what your LLM sees as truth. It determines which document chunks are fed into your RAG pipeline, which memories your agent recalls, and ultimately, whether the model answers correctly or hallucinates confidently.

    As your vector database grows, retrieval quality degrades gradually:

    • No exceptions are raised
    • No errors are logged
    • Latency often looks perfectly fine

    But context quality deteriorates, and your RAG system becomes less reliable over time, even though the embedding model and distance metric remain unchanged.

    In this article, I demonstrate, using controlled experiments and real data, how HNSW affects retrieval quality as database size grows, why this degradation is worse than with flat search, and what you can realistically do about it in production RAG systems.

    Specifically, I'll:

    • Build a practical, reproducible use case to measure the effect of HNSW on RAG retrieval quality using Recall@k.
    • Show that, for fixed HNSW settings, recall degrades faster than flat search as the corpus grows.
    • Discuss practical tuning strategies for balancing recall and latency beyond simply increasing HNSW's ef_search.

    What is HNSW?

    HNSW is a graph-based algorithm for Approximate Nearest Neighbor (ANN) search. It organizes data into multiple layers of connected neighbors and uses this graph structure to speed up search.

    HNSW illustration

    Each vector is connected to a limited number of neighbors in each layer. During a search, HNSW performs a greedy search through these layers, and the number of neighbors checked at each layer is bounded (controlled by M and ef_search), which makes the search roughly logarithmic in the number of vectors. Compared to flat search, where time complexity is O(N), HNSW search has a time complexity of O(log N), which means the time required for a search grows very slowly as the database grows. We will see this in the results of our use case.

    Parameters of the HNSW index

    1. Build-time parameters: M and ef_construction. These can only be set before building the index.

    M defines the maximum number of connections (neighbors) that each vector (node) can have in each layer of the graph. A higher M means more connections, making the graph denser and potentially increasing recall, but at the cost of more memory and slower indexing.

    ef_construction controls the size of the candidate set used during construction of the graph. Essentially, it governs how thoroughly the graph is built during indexing. A higher value of ef_construction means more candidates are considered before making each connection, which leads to a higher-quality graph and better recall, at the cost of increased memory and slower indexing.

    For a general-purpose RAG application, typical values of M are in the range of 12 to 48 and ef_construction between 64 and 200.
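
    To make these parameters concrete, here is a minimal FAISS sketch of building an HNSW index. The toy data and the 512-dimension assumption (the output size of CLIP ViT-B/32) are mine; M=16 and ef_construction=100 mirror the settings used in the experiments below.

    import numpy as np
    import faiss

    # Toy corpus: 10,000 random 512-dim vectors standing in for real embeddings
    xb = np.random.rand(10_000, 512).astype("float32")

    M = 16                                # max neighbors per node per layer
    index = faiss.IndexHNSWFlat(xb.shape[1], M)
    index.hnsw.efConstruction = 100       # candidate list size while building the graph
    index.add(xb)                         # the graph is built as vectors are added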

    2. Query-time parameter: ef_search

    This defines the number of candidate nodes (vectors) to explore during a query (i.e., during the search for nearest neighbors). It controls how thorough the search is by determining how many candidates are evaluated before the result is returned. A higher value of ef_search means the search explores more candidates, leading to better recall but potentially slower queries.
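
    Unlike M and ef_construction, ef_search can be changed at any time without rebuilding the index. Continuing the sketch above (the query vector and the ef values here are illustrative):

    xq = np.random.rand(1, 512).astype("float32")   # one query vector

    for ef in (10, 40, 160):
        index.hnsw.efSearch = ef                    # higher ef: better recall, slower queries
        distances, ids = index.search(xq, 10)       # top-10 approximate neighbors
        print(ef, ids[0][:5])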

    What is Recall@k?

    Recall@k is a key metric for measuring the accuracy of vector search and RAG systems. It measures the ability of the retriever to find the relevant chunks for a user query within the top k results. It matters because if the retriever misses the chunks containing the information required to answer the question (low recall), the LLM cannot possibly generate an accurate answer in the response-synthesis step, no matter how powerful it is.

    \[ \text{Recall}@k = \frac{\text{relevant items retrieved in top } k}{\text{total number of relevant items in the corpus}} \]

    In practice, this is a difficult metric to measure because the denominator (the full set of ground-truth documents) is not easily known for a real-life production system. What we will do instead is design a use case where the ground truth (a single, specific vector index) is unique and known, and Recall@k will measure the average number of times it is retrieved in the top-k results over a large number of sample queries.

    For instance, Recall@5 will measure the average number of times the ground-truth index appeared in the top-5 retrievals over 500 queries.
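
    With exactly one ground-truth item per query, Recall@k reduces to a simple hit rate averaged over queries. A minimal sketch (the function and variable names are illustrative):

    def recall_at_k(retrieved_ids, ground_truth_id, k):
        """1 if the ground-truth id appears in the top-k retrievals, else 0."""
        return int(ground_truth_id in retrieved_ids[:k])

    # Averaged over all sample queries:
    # recall_5 = sum(recall_at_k(ids, gt, 5)
    #                for ids, gt in zip(all_results, all_ground_truths)) / len(all_ground_truths)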

    For a RAG system, an acceptable range for Recall@5 is roughly 70-90% and for Recall@10 is 80-95%, and we will see that our use case falls within these ranges for the Flat index.

    Use Case

    To test HNSW, we need a vector database with a sufficiently large number of vectors (> 100,000). There does not seem to be such a large public dataset available consisting of document chunks and associated queries for which a particular chunk can be treated as ground truth. And even if there were, natural language can be ambiguous, so it is difficult to say confidently which chunks in the corpus should be considered relevant for a query (the denominator in the Recall@k formula). Creating such a curated dataset would require finding a large number of documents, chunking and embedding them, then creating queries for the chunks. That would be a resource-intensive process.

    Instead, let's re-imagine our RAG problem as: "given a short caption (query), we want to retrieve the most relevant images from the dataset".

    For this approach, I used the publicly available LAION-Aesthetics dataset. To access it, you will need to be logged in to Hugging Face and agree to the stated terms. Details about the dataset are available on the LAION website here. It contains a huge number of rows with URLs to images together with a text caption. They look like the following:

    LAION-Aesthetics

    I downloaded a subset of rows and generated 200,000 CLIP embeddings of the images to build the vector database. The text captions of the images can conveniently be used as queries for RAG. Each caption has exactly one image vector as the ground truth, so the denominator of Recall@k is precisely known for all queries. Also, the CLIP embeddings of an image and its caption are never an exact match, so there is enough "fuzziness" in retrievals, similar to a purely document-based RAG, where a text query is used to retrieve relevant document chunks using a distance metric. This will be evident when we see the Recall@k charts in the next sections.
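
    The embedding step itself is not shown in this article; here is a rough sketch of how the image vectors could be generated with open_clip (file handling and batching are simplified and illustrative):

    import numpy as np
    import torch
    import open_clip
    from PIL import Image

    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k"
    )
    model.eval()

    def embed_image(path):
        """Return a normalized 512-dim CLIP image embedding for one image file."""
        image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            feats = model.encode_image(image)
            feats /= feats.norm(dim=-1, keepdim=True)
        return feats.squeeze(0).cpu().numpy().astype("float32")

    # embeddings = np.stack([embed_image(p) for p in image_paths])
    # np.save("embeddings.npy", embeddings)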

    Measuring Recall@k for Flat vs HNSW

    We adopt the following approach:

    1. Embeddings of 200k images are stored as a .npy file.
    2. From the LAION dataset, 500 captions (queries) are randomly selected and embedded using CLIP. The selected query indices also form the ground truth, as they correspond to the unique image for each query.
    3. The database is built in increments of 50,000 vectors, so four iterations of size 50k, 100k, 150k and 200k vectors. Both Flat and HNSW indexes are built. HNSW is built with M=16 and ef_construction=100.
    4. Recall@k is calculated for k = 1, 5, 10, 15 and 20, based on whether the ground-truth indices are included in the top-k results.
    5. First, the Recall@k values are calculated for each of the query vectors and averaged over the number of samples (500).
    6. Then, average Recall@k values are calculated for HNSW ef_search values of 10, 20, 40, 80 and 160.
    7. Finally, five charts are drawn, one for each of the Recall@k values. Each chart depicts the evolution of Recall@k as database size grows for the Flat index and for the different ef_search values of HNSW.
    The code can be viewed here:
    import pandas as pd
    import numpy as np
    import faiss
    import torch
    import open_clip
    import os
    import random
    import matplotlib.pyplot as plt

    def evaluate_subset(size, embeddings_all, df_all, query_vectors_all, eval_indices_all, ef_search_values):
        # Subset embeddings
        embeddings = embeddings_all[:size]
        dimension = embeddings.shape[1]

        # Build indices in memory for this subset size
        index_flat = faiss.IndexFlatL2(dimension)
        index_flat.add(embeddings)

        index_hnsw = faiss.IndexHNSWFlat(dimension, 16)
        index_hnsw.hnsw.efConstruction = 100
        index_hnsw.add(embeddings)

        num_samples = len(eval_indices_all)
        results = []

        ks = [1, 5, 10, 15, 20]

        # Evaluate Flat
        flat_recalls = {k: 0 for k in ks}
        for i, qv in enumerate(query_vectors_all):
            _, I = index_flat.search(qv, max(ks))
            target = eval_indices_all[i]
            for k in ks:
                if target in I[0][:k]:
                    flat_recalls[k] += 1

        flat_res = {"Setting": "Flat"}
        for k in ks:
            flat_res[f"R@{k}"] = flat_recalls[k] / num_samples
        results.append(flat_res)

        # Evaluate HNSW with different efSearch values
        for ef in ef_search_values:
            index_hnsw.hnsw.efSearch = ef
            hnsw_recalls = {k: 0 for k in ks}
            for i, qv in enumerate(query_vectors_all):
                _, I = index_hnsw.search(qv, max(ks))
                target = eval_indices_all[i]
                for k in ks:
                    if target in I[0][:k]:
                        hnsw_recalls[k] += 1

            hnsw_res = {"Setting": f"HNSW (ef={ef})", "ef": ef}
            for k in ks:
                hnsw_res[f"R@{k}"] = hnsw_recalls[k] / num_samples
            results.append(hnsw_res)

        return results

    def format_table(size, results):
        ks = [1, 5, 10, 15, 20]
        lines = []
        lines.append(f"\nDatabase Size: {size}")
        lines.append("=" * 80)
        header = f"{'Index/efSearch':<20}"
        for k in ks:
            header += f" | {'R@' + str(k):<8}"
        lines.append(header)
        lines.append("-" * 80)
        for row in results:
            line = f"{row['Setting']:<20}"
            for k in ks:
                line += f" | {row[f'R@{k}']:<8.2f}"
            lines.append(line)
        lines.append("=" * 80)
        return "\n".join(lines)

    def main(n):
        dataset_path = r"C:\database\laion_final.parquet"
        embeddings_path = r"C:\database\embeddings.npy"
        results_dir = r"C:\results"

        db_sizes = [50000, 100000, 150000, 200000]
        ef_search_values = [10, 20, 40, 80, 160]
        num_samples = n
        output_txt = os.path.join(results_dir, f"eval_results_{num_samples}.txt")
        output_png = os.path.join(results_dir, f"recall_vs_dbsize_{num_samples}.png")

        if not os.path.exists(dataset_path) or not os.path.exists(embeddings_path):
            print("Error: Dataset or embeddings not found.")
            return

        os.makedirs(results_dir, exist_ok=True)

        # Load all data once
        print("Loading base data...")
        df_all = pd.read_parquet(dataset_path)
        embeddings_all = np.load(embeddings_path).astype('float32')

        # Load CLIP model once
        print("Loading CLIP model (ViT-B-32)...")
        model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
        tokenizer = open_clip.get_tokenizer('ViT-B-32')
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model.to(device)
        model.eval()

        # Use samples valid for all subsets
        eval_indices = random.sample(range(min(db_sizes)), num_samples)
        print(f"Sampling {num_samples} queries for consistent evaluation...")

        # Generate query vectors
        query_vectors = []
        for idx in eval_indices:
            text = df_all.iloc[idx]['TEXT']
            text_tokens = tokenizer([text]).to(device)
            with torch.no_grad():
                text_features = model.encode_text(text_tokens)
                text_features /= text_features.norm(dim=-1, keepdim=True)
                query_vectors.append(text_features.cpu().numpy().astype('float32'))

        all_output_text = []
        # Accumulate all results for plotting
        # structure: { 'R@1': { 'Flat': [val1, val2...], 'ef=10': [val1, val2...] }, ... }
        ks = [1, 5, 10, 15, 20]
        plot_data = {f"R@{k}": {"Flat": []} for k in ks}
        for ef in ef_search_values:
            for k in ks:
                plot_data[f"R@{k}"][f"HNSW ef={ef}"] = []

        for size in db_sizes:
            print(f"Evaluating with database size: {size}...")
            results = evaluate_subset(size, embeddings_all, df_all, query_vectors, eval_indices, ef_search_values)
            table_str = format_table(size, results)

            # Print to screen
            print(table_str)
            all_output_text.append(table_str)

            # Accumulate for plot
            for row in results:
                label = row["Setting"]
                if label == "Flat":
                    for k in ks:
                        plot_data[f"R@{k}"]["Flat"].append(row[f"R@{k}"])
                else:
                    ef = row["ef"]
                    for k in ks:
                        plot_data[f"R@{k}"][f"HNSW ef={ef}"].append(row[f"R@{k}"])

        # Save text results
        with open(output_txt, "w", encoding="utf-8") as f:
            f.write("\n".join(all_output_text))
        print(f"\nFinal results saved to {output_txt}")

        # Create individual plots for each k
        for k in ks:
            plt.figure(figsize=(10, 6))
            k_key = f"R@{k}"

            for label, values in plot_data[k_key].items():
                linestyle = '--' if label == "Flat" else '-'
                marker = 'o' if label == "Flat" else 's'
                plt.plot(db_sizes, values, label=label, linestyle=linestyle, marker=marker)

            plt.title(f"Recall@{k} vs Database Size")
            plt.xlabel("Database Size")
            plt.ylabel("Recall")
            plt.grid(True)
            plt.legend()

            output_png = os.path.join(results_dir, f"recall_vs_dbsize_{k}.png")
            plt.tight_layout()
            plt.savefig(output_png)
            plt.close()
            print(f"Plot saved to {output_png}")

    if __name__ == "__main__":
        main(500)
    

    And the results are the following:

    Recall vs database size for k = 5
    Recall vs database size for k = 1, 10, 15, 20

    Observations

    1. For the Flat index (dotted line), Recall@5 and Recall@10 are in the range of 0.70-0.85, as can be expected of real-life RAG applications.
    2. The Flat index provides the highest Recall@k across all database sizes and forms a benchmark upper bound for HNSW.
    3. At any given database size, Recall@k increases with higher k. So for a database size of 100k vectors, Recall@20 > Recall@15 > Recall@10 > Recall@5 > Recall@1. This is understandable, since with a higher k there is a greater chance that the ground-truth index is present in the retrieved set.
    4. Both Flat and HNSW deteriorate consistently as the database size grows. This is because high-dimensional vector spaces become increasingly crowded as the number of vectors grows.
    5. HNSW performance improves for higher ef_search values.
    6. As the database size approaches 200k, HNSW appears to degrade faster than Flat search.

    Does HNSW degrade faster than Flat search?

    To view the relative performance of Flat vs HNSW indexes as database size grows, a slightly different approach is adopted:

    1. The database index construction and query selection process remain the same as before.
    2. Instead of considering the ground truth, we calculate the overlap between the Flat index results and each of the HNSW ef_search results for a given retrieval count (k).
    3. Five charts are drawn, one for each k value, showing the evolution of overlap as database size grows. For a perfect match with the Flat index, the HNSW line will show a score of 1. More importantly, if the HNSW results degrade more than the Flat index, the line will have a negative slope; otherwise it will be horizontal or have a positive slope.
    The code can be viewed here:
    import pandas as pd
    import numpy as np
    import faiss
    import torch
    import open_clip
    import os
    import random
    import matplotlib.pyplot as plt
    import time

    def evaluate_subset_compare(size, embeddings_all, df_all, query_vectors_all, ef_search_values):
        # Subset embeddings
        embeddings = embeddings_all[:size]
        dimension = embeddings.shape[1]

        # Build indices in memory for this subset size
        index_flat = faiss.IndexFlatL2(dimension)
        index_flat.add(embeddings)

        index_hnsw = faiss.IndexHNSWFlat(dimension, 16)
        index_hnsw.hnsw.efConstruction = 100
        index_hnsw.add(embeddings)

        num_samples = len(query_vectors_all)
        results = []

        ks = [1, 5, 10, 15, 20]
        max_k = max(ks)

        # 1. Evaluate Flat once for this subset
        flat_times = []
        flat_results_all = []
        for qv in query_vectors_all:
            start_t = time.perf_counter()
            _, I_flat_all = index_flat.search(qv, max_k)
            flat_times.append(time.perf_counter() - start_t)
            flat_results_all.append(I_flat_all[0])

        avg_flat_time_ms = (sum(flat_times) / num_samples) * 1000

        # 2. Evaluate HNSW relative to Flat
        for ef in ef_search_values:
            index_hnsw.hnsw.efSearch = ef

            hnsw_times = []
            # Track intersection counts for each k
            overlap_counts = {k: 0 for k in ks}
            for i, qv in enumerate(query_vectors_all):
                # HNSW top-max_k
                start_t = time.perf_counter()
                _, I_hnsw_all = index_hnsw.search(qv, max_k)
                hnsw_times.append(time.perf_counter() - start_t)

                # Flat result was already pre-calculated
                I_flat_all = flat_results_all[i]

                for k in ks:
                    set_flat = set(I_flat_all[:k])
                    set_hnsw = set(I_hnsw_all[0][:k])
                    intersection = set_flat.intersection(set_hnsw)
                    overlap_counts[k] += len(intersection) / k

            avg_hnsw_time_ms = (sum(hnsw_times) / num_samples) * 1000

            hnsw_res = {
                "Setting": f"HNSW (ef={ef})",
                "ef": ef,
                "FlatTime_ms": avg_flat_time_ms,
                "HNSWTime_ms": avg_hnsw_time_ms
            }
            for k in ks:
                # Average over all queries
                hnsw_res[f"R@{k}"] = overlap_counts[k] / num_samples
            results.append(hnsw_res)

        return results

    def format_all_tables(db_sizes, ef_search_values, all_results):
        ks = [1, 5, 10, 15, 20]
        lines = []

        # 1. Create one table for each Recall@k
        for k in ks:
            k_label = f"R@{k}"
            lines.append(f"\nTable: {k_label} (HNSW overlap with Flat)")
            lines.append("=" * (20 + 12 * len(db_sizes)))

            # Header
            header = f"{'ef_search':<18}"
            for size in db_sizes:
                header += f" | {size:<9}"
            lines.append(header)
            lines.append("-" * (20 + 12 * len(db_sizes)))

            # Rows (ef values)
            for ef in ef_search_values:
                row_str = f"{ef:<18}"
                for size in db_sizes:
                    # Find the result for this size and ef
                    val = 0
                    for r in all_results[size]:
                        if r.get('ef') == ef:
                            val = r.get(k_label, 0)
                            break
                    row_str += f" | {val:<9.2f}"
                lines.append(row_str)
            lines.append("=" * (20 + 12 * len(db_sizes)))

        # 2. Create search-time table
        lines.append("\nTable: Average Search Time (ms)")
        lines.append("=" * (20 + 12 * len(db_sizes)))
        header = f"{'Index Setting':<18}"
        for size in db_sizes:
            header += f" | {size:<9}"
        lines.append(header)
        lines.append("-" * (20 + 12 * len(db_sizes)))

        # Flat row
        row_flat = f"{'Flat Index':<18}"
        for size in db_sizes:
            # Flat time is the same for all ef at a given size, so just take any
            t = all_results[size][0]['FlatTime_ms']
            row_flat += f" | {t:<9.4f}"
        lines.append(row_flat)

        # HNSW rows
        for ef in ef_search_values:
            row_str = f"HNSW (ef={ef:<3})"
            for size in db_sizes:
                t = 0
                for r in all_results[size]:
                    if r.get('ef') == ef:
                        t = r.get('HNSWTime_ms', 0)
                        break
                row_str += f" | {t:<9.4f}"
            lines.append(row_str)
        lines.append("=" * (20 + 12 * len(db_sizes)))

        return "\n".join(lines)

    def main(n):
        dataset_path = r"C:\database\laion_final.parquet"
        embeddings_path = r"C:\database\embeddings.npy"
        results_dir = r"C:\results"

        db_sizes = [50000, 100000, 150000, 200000]
        ef_search_values = [10, 20, 40, 80, 160]
        num_samples = n
        output_txt = os.path.join(results_dir, f"compare_results_{num_samples}.txt")
        output_png_prefix = "compare_vs_dbsize"

        if not os.path.exists(dataset_path) or not os.path.exists(embeddings_path):
            print("Error: Dataset or embeddings not found.")
            return

        os.makedirs(results_dir, exist_ok=True)

        # Load all data once
        print("Loading base data...")
        df_all = pd.read_parquet(dataset_path)
        embeddings_all = np.load(embeddings_path).astype('float32')

        # Load CLIP model once
        print("Loading CLIP model (ViT-B-32)...")
        model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
        tokenizer = open_clip.get_tokenizer('ViT-B-32')
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model.to(device)
        model.eval()

        # Use queries from the first 50k rows
        eval_indices = random.sample(range(min(db_sizes)), num_samples)
        print(f"Sampling {num_samples} queries...")

        # Generate query vectors
        query_vectors = []
        for idx in eval_indices:
            text = df_all.iloc[idx]['TEXT']
            text_tokens = tokenizer([text]).to(device)
            with torch.no_grad():
                text_features = model.encode_text(text_tokens)
                text_features /= text_features.norm(dim=-1, keepdim=True)
                query_vectors.append(text_features.cpu().numpy().astype('float32'))

        all_results_data = {}
        ks = [1, 5, 10, 15, 20]
        plot_data = {f"R@{k}": {} for k in ks}
        for ef in ef_search_values:
            for k in ks:
                plot_data[f"R@{k}"][f"ef={ef}"] = []

        for size in db_sizes:
            print(f"Evaluating with database size: {size}...")
            results = evaluate_subset_compare(size, embeddings_all, df_all, query_vectors, ef_search_values)
            all_results_data[size] = results

            # Accumulate for plot
            for row in results:
                ef = row["ef"]
                for k in ks:
                    plot_data[f"R@{k}"][f"ef={ef}"].append(row[f"R@{k}"])

        # Format pivoted tables
        final_output_text = format_all_tables(db_sizes, ef_search_values, all_results_data)
        print(final_output_text)

        # Save text results
        with open(output_txt, "w", encoding="utf-8") as f:
            f.write(final_output_text)
        print(f"\nFinal results saved to {output_txt}")

        # Create individual plots for each k
        for k in ks:
            plt.figure(figsize=(10, 6))
            k_key = f"R@{k}"

            for label, values in plot_data[k_key].items():
                plt.plot(db_sizes, values, label=label, marker='s')

            plt.title(f"HNSW vs Flat Overlap Recall@{k} vs Database Size")
            plt.xlabel("Database Size")
            plt.ylabel("Overlap Ratio")
            plt.grid(True)
            plt.legend()

            output_png = os.path.join(results_dir, f"{output_png_prefix}_{k}.png")
            plt.tight_layout()
            plt.savefig(output_png)
            plt.close()
            print(f"Plot saved to {output_png}")

    if __name__ == "__main__":
        main(500)
    

    And the results are the following:

    Flat vs HNSW index overlap for k = 5
    Flat vs HNSW index overlap for k = 1, 10, 15, 20

    Observations

    1. In all cases, the lines have a negative slope, indicating that HNSW degrades faster than the Flat index as the database grows.
    2. Higher ef_search values degrade more slowly than lower values, which fall quite sharply.
    3. Higher ef_search values have significant overlap (>90%) with the benchmark Flat search, compared to the lower values.

    Recall-latency trade-off

    We know that HNSW is faster than Flat search. To see this in action, I also measured the average latency in the code of the previous section. Here are the average search times (in ms):

    Database size     50,000    100,000   150,000   200,000
    Flat Index        5.1440    9.3850    14.8843   18.4100
    HNSW (ef=10)      0.0851    0.0742    0.0763    0.0768
    HNSW (ef=20)      0.1159    0.0876    0.0959    0.0983
    HNSW (ef=40)      0.1585    0.1366    0.1415    0.1493
    HNSW (ef=80)      0.2508    0.2262    0.2398    0.2417
    HNSW (ef=160)     0.4613    0.3992    0.4140    0.4064

    Observations

    1. HNSW is orders of magnitude faster than Flat search, which is the primary reason it is the search algorithm of choice for almost all vector databases.
    2. The time taken by Flat search increases almost linearly with database size (O(N) complexity).
    3. For a given ef_search value (a row), HNSW search time stays nearly constant at this scale (up to 200k vectors).
    4. As ef_search increases within a column, HNSW search time increases significantly. For instance, the time taken for ef=160 is about 3x that of ef=40.

    Tuning the RAG pipeline

    The above analysis shows that while HNSW is indeed the option to adopt in production for latency reasons, ef_search needs to be tuned periodically to maintain the latency-recall balance as the database grows. Some best practices to adopt are as follows:

    1. Given the difficulty of measuring Recall@k in a production database, keep a test-case repository of ground-truth document chunks and queries that can be run at regular intervals to check retrieval quality. Start with the most frequent user queries and the chunks that should be recalled for them.
    2. Another, indirect way to verify recall quality is to use a powerful LLM to judge the quality of the retrieved context. Instead of asking "Did we get the very best documents for the user query?", which is hard to answer precisely for a large database, we can ask the slightly weaker question "Does the retrieved context actually contain the answer to the user's question?" and let the judge LLM respond to that.
    3. Collect user feedback in production. User ratings of a response, along with any manual corrections, can be used as a trigger for performance tuning.
    4. While tuning ef_search, start with a conservatively high value, measure Recall@k, then reduce it until latency is acceptable (see the sketch after this list).
    5. Measure recall at the top_k that the RAG actually uses, usually between 3 and 10. Consider relaxing top_k to 15 or 20 and letting the LLM decide which chunks in the given context to use during the synthesis step. Assuming the context does not become too large to fit in the LLM's context window, such an approach enables high recall with a moderate ef_search value, thereby keeping latency low.
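
    A minimal sketch of the tuning loop from point 4, assuming a FAISS HNSW index, a set of test queries with known ground truth, and a latency budget (all names and thresholds here are illustrative):

    import time

    def tune_ef_search(index, queries, ground_truths, k=5,
                       latency_budget_ms=5.0, ef_values=(256, 160, 80, 40, 20, 10)):
        """Start from a high ef_search and step down until average latency fits the budget."""
        report = []
        for ef in ef_values:                        # highest (best recall) values first
            index.hnsw.efSearch = ef
            hits, start = 0, time.perf_counter()
            for qv, gt in zip(queries, ground_truths):
                _, ids = index.search(qv, k)
                hits += int(gt in ids[0])
            latency_ms = (time.perf_counter() - start) / len(queries) * 1000
            report.append({"ef": ef, "recall": hits / len(queries), "latency_ms": latency_ms})
            if latency_ms <= latency_budget_ms:     # first ef that fits keeps recall as high as possible
                break
        return report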

    Hybrid RAG pipeline

    HNSW tuning via ef_search cannot fix the issue of falling recall with growing database size beyond a point. That is because vector search, even with a flat index, becomes noisy when too many vectors are packed close together in the N-dimensional space (N being the number of dimensions output by the embedding model). As the charts in the section above show, recall drops by 10%+ as the database grows from 50k to 200k vectors. The reliable way to maintain recall is to use metadata filtering (e.g., via a knowledge graph) to identify candidate document ids and run retrieval only over those. I discuss this in detail in my article GraphRAG in Practice: How to Build Cost-Efficient, High-Recall Retrieval Systems.
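
    As a rough illustration of metadata pre-filtering (not the GraphRAG pipeline from that article), one can first narrow the candidate ids with a metadata predicate and then run an exact search over just that subset. The metadata column and function below are hypothetical:

    import numpy as np

    def filtered_search(query_vec, embeddings, metadata, allowed_topics, k=5):
        """Exact L2 search restricted to rows whose (hypothetical) 'topic' field matches the filter."""
        candidate_ids = np.array(
            [i for i, m in enumerate(metadata) if m["topic"] in allowed_topics]
        )
        if candidate_ids.size == 0:
            return np.array([], dtype=int), np.array([])
        subset = embeddings[candidate_ids]
        dists = np.linalg.norm(subset - query_vec, axis=1)   # distances over the filtered subset only
        order = np.argsort(dists)[:k]
        return candidate_ids[order], dists[order]            # original row ids and their distances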

    Key Takeaways

    • HNSW is the default retrieval algorithm in most vector databases, but it is rarely tuned or monitored in production RAG systems.
    • Retrieval quality degrades silently as the vector database grows, even when latency remains stable.
    • For the same corpus size, Flat search consistently achieves higher Recall@k than HNSW, serving as a useful upper bound for evaluation.
    • HNSW recall degrades faster than Flat search for fixed ef_search values as database size increases.
    • Increasing ef_search improves recall, but latency grows quickly, creating a sharp recall-latency trade-off.
    • Simply tuning HNSW parameters is insufficient at scale; vector search itself becomes noisy in dense embedding spaces.
    • Hybrid RAG pipelines using metadata filters (SQL, graphs, inverted indexes) are the most reliable way to maintain recall at scale.

    Conclusion

    HNSW has earned its place as the backbone of modern vector databases, not because it is perfectly accurate, but because it is fast enough to make large-scale semantic search practical.

    However, in RAG systems, speed without recall is a false optimization.

    This article shows that as vector databases grow, retrieval quality deteriorates quietly, especially under approximate search, while latency metrics remain deceptively stable. The result is a system that appears healthy from an infrastructure perspective but gradually feeds weaker context to the LLM, increasing hallucinations and reducing answer quality.

    The answer is not to abandon HNSW, nor to arbitrarily increase ef_search.

    Instead, production-grade RAG systems should:

    • Measure retrieval quality explicitly and regularly.
    • Treat Flat search as a recall baseline.
    • Continuously rebalance recall and latency.
    • And, ultimately, move toward hybrid retrieval architectures that narrow the search space before vector similarity is applied.

    If your RAG system's answers are getting worse as your data grows, the problem may not be your LLM, your prompts, or your embeddings, but the retrieval algorithm you never realized you were relying on.

    Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI

    Images used in this article are synthetically generated. The LAION-Aesthetics dataset is used under the CC-BY 4.0 license. Figures and code created by the author.



