
    Optimizing Vector Search: Why You Should Flatten Structured Data 

    By ProfitlyAI · January 29, 2026


    When ingesting structured data into a RAG system, engineers often default to embedding raw JSON directly into a vector database. In reality, however, this intuitive approach leads to dramatically poor performance. Modern embedding models are based on the BERT architecture, which is essentially the encoder part of a Transformer, and are trained on enormous unstructured text corpora with the main objective of capturing semantic meaning. They can deliver incredible retrieval performance on natural language, but raw JSON does not look like natural language. Consequently, even though embedding JSON may seem like an intuitively simple and elegant solution, using a generic embedding model on JSON objects yields results far from peak performance.

    Deep dive

    Tokenization

    The first step is tokenization, which takes the text and splits it into tokens, which are generally sub-word units. Modern embedding models use Byte-Pair Encoding (BPE) or WordPiece tokenization algorithms. These algorithms are optimized for natural language, breaking words into common sub-components. When a tokenizer encounters raw JSON, it struggles with the high frequency of non-alphanumeric characters. For example, "usd": 10, is not treated as a key-value pair; instead, it is fragmented into:

    • The quotes ("), colon (:), and comma (,)
    • The tokens usd and 10

    This creates a low signal-to-noise ratio. In natural language, almost all words contribute to the semantic "signal", whereas in JSON (and other structured formats) a significant share of tokens is "wasted" on structural syntax that carries zero semantic value.
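
    As a quick illustration (a minimal sketch, not code from the original article), you can inspect how the WordPiece tokenizer of all-MiniLM-L6-v2, the embedding model used later in this experiment, breaks up a small JSON fragment:

    from transformers import AutoTokenizer

    # Tokenizer of the embedding model used later in the experiment
    tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

    # The key-value pair is split into punctuation marks and sub-word pieces
    # rather than being kept as one semantic unit
    print(tok.tokenize('"usd": 10,'))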

    Attention calculation

    The core strength of Transformers lies in the attention mechanism, which allows the model to weigh the importance of tokens relative to one another.

    In the sentence The price is 10 US dollars or 9 euros, attention can easily link the value 10 to the concept price, because this relationship is well represented in the model's pre-training data and the model has seen the linguistic pattern millions of times. In the raw JSON, on the other hand:

    "worth": {
      "usd": 10,
      "eur": 9,
     }

    the model encounters structural syntax it was never primarily optimized to "read". Without the linguistic connectors, the resulting vector fails to capture the true intent of the data, because the relationships between keys and values are obscured by the format itself.

    Mean Pooling

    The final step in producing a single embedding representation of the document is mean pooling. Mathematically, the final embedding E is the centroid of all token vectors (e1, e2, ..., en) in the document:

    Mean pooling calculation: converting a sequence of n token embeddings into a single vector representation by averaging their values. Image by author.
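
    Written out explicitly, the formula the figure describes is simply the average of the token embeddings:

    E = \frac{1}{n} \sum_{i=1}^{n} e_i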

    This is where the JSON tokens become a mathematical liability. If 25% of the tokens in the document are structural markers (braces, quotes, colons), the final vector is heavily influenced by the "meaning" of punctuation. The vector is effectively pulled away from its true semantic center in the vector space by these noise tokens. When a user submits a natural language query, the distance between the "clean" query vector and the "noisy" JSON vector grows, directly hurting the retrieval metrics.

    Flatten it

    Now that we know about these limitations of JSON, we need to figure out how to work around them. The most general and straightforward approach is to flatten the JSON and convert it into natural language.

    Let's consider a typical product object:

    {
      "skuId": "123",
      "description": "This is a test product used for demonstration purposes",
      "quantity": 5,
      "price": {
        "usd": 10,
        "eur": 9
      },
      "availableDiscounts": ["1", "2", "3"],
      "giftCardAvailable": "true",
      "category": "demo product"
      ...
    }

    This is a simple object with some attributes like description, and so on. Let's apply the tokenizer to it and see how it looks:

    Tokenization of the raw JSON. Notice the high density of distinct tokens for syntax (braces, quotes, colons) that contribute noise rather than meaning. Screenshot by author using the OpenAI Tokenizer.

    Now, let's convert it into text to make the embedding model's job easier. To do that, we can define a template and substitute the JSON values into it. For example, this template could be used to describe the product:

    Product with SKU {skuId} belongs to the category "{category}"
    Description: {description}
    It has a quantity of {quantity} available
    The price is {price.usd} US dollars or {price.eur} euros
    Available discount ids include {availableDiscounts as comma-separated list}
    Gift cards are {giftCardAvailable ? "available" : "not available"} for this product
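
    In code, the substitution could look like the following sketch (a hypothetical helper written for the demo object above, not part of the original experiment):

    def flatten_demo_product(p: dict) -> str:
        # Fill the template above from the product dict
        discounts = ", ".join(p["availableDiscounts"])
        gift = "available" if p["giftCardAvailable"] == "true" else "not available"
        return (
            f'Product with SKU {p["skuId"]} belongs to the category "{p["category"]}"\n'
            f'Description: {p["description"]}\n'
            f'It has a quantity of {p["quantity"]} available\n'
            f'The price is {p["price"]["usd"]} US dollars or {p["price"]["eur"]} euros\n'
            f'Available discount ids include {discounts}\n'
            f'Gift cards are {gift} for this product'
        )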

    So the final result will look like:

    Product with SKU 123 belongs to the category "demo product"
    Description: This is a test product used for demonstration purposes
    It has a quantity of 5 available
    The price is 10 US dollars or 9 euros
    Available discount ids include 1, 2, and 3
    Gift cards are available for this product

    And apply the tokenizer to it:

    Tokenization of the flattened text. The resulting sequence is shorter (14% fewer tokens) and composed primarily of semantically meaningful words. Screenshot by author using the OpenAI Tokenizer.

    Not only does it have 14% fewer tokens now, but it is also in a much clearer form that carries the semantic meaning and the required context.
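
    If you want to reproduce the token-count comparison, a minimal sketch using the OpenAI tokenizer (via the tiktoken package) could look like this; raw_json_str and flattened_text are assumed to hold the two representations shown above, and the exact counts depend on the tokenizer:

    import tiktoken

    # Count tokens for both representations; exact numbers vary by tokenizer
    enc = tiktoken.get_encoding("cl100k_base")
    n_json = len(enc.encode(raw_json_str))
    n_flat = len(enc.encode(flattened_text))
    print(n_json, n_flat, f"reduction: {1 - n_flat / n_json:.0%}")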

    Let's measure the results

    Note: Full, reproducible code for this experiment is available in the Google Colab notebook [1].

    Now let's try to measure retrieval performance for both options. To keep it simple, we will focus on the standard retrieval metrics Recall@k, Precision@k, and MRR, and will use a generic embedding model (all-MiniLM-L6-v2) together with the Amazon ESCI dataset, sampling 5,000 random queries and their 3,809 associated products.
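
    For reference, these metrics have standard definitions; here is a minimal sketch (not the notebook's exact code) for a single query, where retrieved is the ranked list of product ids returned by the index and relevant is the set of ids labelled relevant for that query:

    def recall_at_k(retrieved, relevant, k=10):
        # Share of the relevant products that appear in the top-k results
        hits = len(set(retrieved[:k]) & set(relevant))
        return hits / len(relevant)

    def precision_at_k(retrieved, relevant, k=10):
        # Share of the top-k results that are relevant
        hits = len(set(retrieved[:k]) & set(relevant))
        return hits / k

    def reciprocal_rank(retrieved, relevant):
        # MRR is this value averaged over all queries
        for rank, product_id in enumerate(retrieved, start=1):
            if product_id in relevant:
                return 1.0 / rank
        return 0.0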

    all-MiniLM-L6-v2 is a popular choice: it is small (22.7M parameters) yet fast and accurate, which makes it a good fit for this experiment.

    For the dataset, a version of Amazon ESCI is used, specifically milistu/amazon-esci-data [3], which is available on Hugging Face and contains a collection of Amazon products and search queries.

    The flattening function used for text conversion is:

    def flatten_product(product):
        # Concatenate the key product fields into a single natural language string
        return (
            f"Product {product['product_title']} from brand {product['product_brand']}"
            f" and product id {product['product_id']}"
            f" and description {product['product_description']}"
        )

    A sample of the raw JSON data is:

    {
      "product_id": "B07NKPWJMG",
      "title": "RoWood 3D Puzzles for Adults, Wooden Mechanical Gear Kits for Teens Kids Age 14+",
      "description": "<p> <strong>Specifications</strong><br /> Model Number: Rowood Treasure box LK502<br /> Average build time: 5 hours<br /> Total Pieces: 123<br /> Model weight: 0.69 kg<br /> Box weight: 0.74 KG<br /> Assembled size: 100*124*85 mm<br /> Box size: 320*235*39 mm<br /> Certificate: EN71,-1,-2,-3,ASTMF963<br /> Recommended Age Range: 14+<br /> <br /> <strong>Contents</strong><br /> Plywood sheets<br /> Metal Spring<br /> Illustrated instructions<br /> Accessories<br /> <br /> <strong>MADE FOR ASSEMBLY</strong><br /> -Follow the instructions provided in the booklet and assemble the 3d puzzle with some exciting and engaging fun. Feel the pride of self-creation getting this beautiful wooden work like a pro.<br /> <strong>GLORIFY YOUR LIVING SPACE</strong><br /> -Revive the enigmatic charm and cheer your parties and get-togethers with an experience that is unique and interesting.<br /> <br />",
      "brand": "RoWood",
      "color": "Treasure Box"
    }

    For the vector search, two FAISS indexes are created: one for the flattened text and one for the JSON-formatted text. Both indexes are flat, which means they compare the query against every stored entry instead of using an Approximate Nearest Neighbour (ANN) structure. This is important to ensure that the retrieval metrics are not affected by ANN approximation.

    import faiss

    # 384 is the embedding dimension of all-MiniLM-L6-v2
    D = 384
    index_json = faiss.IndexFlatIP(D)
    index_flatten = faiss.IndexFlatIP(D)
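
    The rest of the pipeline is straightforward; here is a minimal sketch under the assumption that flattened_texts, json_texts, and queries are lists built from the dataset (the exact code is in the notebook [1]):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    # Normalized embeddings make the inner product equivalent to cosine similarity
    index_flatten.add(model.encode(flattened_texts, normalize_embeddings=True))
    index_json.add(model.encode(json_texts, normalize_embeddings=True))

    # Retrieve the top-10 products for each query from the flattened index
    query_emb = model.encode(queries, normalize_embeddings=True)
    scores, ids = index_flatten.search(query_emb, 10)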

    To reduce the dataset, a random sample of 5,000 queries was chosen, and all of the corresponding products were embedded and added to the indexes. The collected metrics are as follows:

    Comparison of the two indexing approaches using the all-MiniLM-L6-v2 embedding model on the Amazon ESCI dataset. The flattened approach consistently yields higher scores across all key retrieval metrics (Precision@10, Recall@10, and MRR). Image by author.

    And the relative performance change of the flattened version is:

    Converting the structured JSON to natural language text resulted in significant gains, including a 19.1% increase in Recall@10 and a 27.2% increase in MRR (Mean Reciprocal Rank), confirming the superior semantic representation of the flattened data. Image by author.

    The analysis confirms that embedding raw structured data into a generic vector space is a suboptimal approach, and that the simple preprocessing step of flattening structured data consistently delivers a significant improvement in retrieval metrics (boosting Recall@k and Precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is critical for reaching peak performance of a semantic retrieval/RAG system.

    References

    [1] Full experiment code: https://colab.research.google.com/drive/1dTgt6xwmA6CeIKE38lf2cZVahaJNbQB1?usp=sharing
    [2] all-MiniLM-L6-v2 model: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
    [3] Amazon ESCI dataset, version used: https://huggingface.co/datasets/milistu/amazon-esci-data (original dataset: https://www.amazon.science/code-and-datasets/shopping-queries-dataset-a-large-scale-esci-benchmark-for-improving-product-search)
    [4] FAISS: https://ai.meta.com/tools/faiss/


