
    LLMs + Pandas: How I Use Generative AI to Generate Pandas DataFrame Summaries

    By ProfitlyAI | June 3, 2025 | 12 min read


    If you work with datasets and are looking for quick insights without too much manual grind, you’ve come to the right place.

    In 2025, datasets often contain millions of rows and hundreds of columns, which makes manual analysis next to impossible. Local Large Language Models can transform your raw DataFrame statistics into polished, readable reports in seconds, or minutes at worst. This approach eliminates the tedious process of analyzing data by hand and writing executive reports, especially if the data structure doesn’t change.

    Pandas handles the heavy lifting of data extraction while LLMs convert your technical outputs into presentable reports. You’ll still need to write functions that pull key statistics from your datasets, but it’s a one-time effort.

    This guide assumes you have Ollama installed locally. If you don’t, you can still use third-party LLM vendors, but I won’t explain how to connect to their APIs.

    Table of contents:

    • Dataset Introduction and Exploration
    • The Boring Part: Extracting Summary Statistics
    • The Cool Part: Working with LLMs
    • What You Could Improve

    Dataset Introduction and Exploration

    For this guide, I’m using the MBA admissions dataset from Kaggle. Download it if you want to follow along.

    The dataset is licensed under the Apache 2.0 license, which means you can use it freely for both personal and commercial projects.

    To get started, you’ll need a few Python libraries installed on your system.

    Image 1 – Required Python libraries and versions (image by author)

    Once you have everything installed, import the required libraries in a new script or a notebook:

    import pandas as pd
    from langchain_ollama import ChatOllama
    from typing import Literal

    Dataset loading and preprocessing

    Start by loading the dataset with Pandas. This snippet loads the CSV file, prints basic information about the dataset shape, and shows how many missing values exist in each column:

    df = pd.read_csv("data/MBA.csv")

    # Basic dataset info
    print(f"Dataset shape: {df.shape}\n")
    print("Missing value stats:")
    print(df.isnull().sum())
    print("-" * 25)
    df.sample(5)

    Image 2 – Basic dataset statistics (image by author)

    Since data cleaning isn’t the main focus of this article, I’ll keep the preprocessing minimal. The dataset only has a couple of missing values that need attention:

    df["race"] = df["race"].fillna("Unknown")
    df["admission"] = df["admission"].fillna("Deny")

    That’s it! Let’s see how to go from this to a meaningful report next.

    The Boring Part: Extracting Summary Statistics

    Even with all the advances in AI capability and availability, you probably don’t want to send your entire dataset to an LLM provider. There are a couple of good reasons why.

    It could consume way too many tokens, which translates directly to higher costs. Processing large datasets can take a long time, especially when you’re running models locally on your own hardware. You might also be dealing with sensitive data that shouldn’t leave your organization.
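To get a feel for the token argument, here is a minimal sketch that compares a raw CSV dump with a short statistical summary. The 4-characters-per-token ratio is only a rough rule of thumb (real tokenizers vary by model), and the row, the row count, and the summary text are hypothetical placeholders rather than values from the MBA dataset.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the chars-per-token ratio is an assumption."""
    return max(1, round(len(text) / chars_per_token))


# Hypothetical stand-ins: a raw CSV dump versus a short summary string.
header = "gender,international,gpa,major,race,gmat,work_exp,work_industry,admission\n"
row = "Female,False,3.30,Business,Unknown,620,3,Consulting,Admit\n"
n_rows = 1000  # arbitrary row count, for illustration only
raw_csv = header + row * n_rows
summary = "Total Applications: 1,000\nGender Distribution: ...\n"

raw_tokens = estimate_tokens(raw_csv)
summary_tokens = estimate_tokens(summary)
print(f"Raw CSV: ~{raw_tokens:,} tokens")
print(f"Summary: ~{summary_tokens:,} tokens")
```

The exact numbers don't matter much. The point is that the raw dump grows with the number of rows, while the summary stays roughly the same size.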

    Some manual work is still the way to go.

    This approach requires you to write a function that extracts key elements and statistics from your Pandas DataFrame. You’ll have to write this function from scratch for different datasets, but the core idea transfers easily between projects.

    The get_summary_context_message() function takes in a DataFrame and returns a formatted multi-line string with a detailed summary. Here’s what it includes:

    • Total application count and gender distribution
    • International vs domestic applicant breakdown
    • GPA and GMAT score quartile statistics
    • Admission rates by academic major (sorted by rate)
    • Admission rates by work industry (top 8 industries)
    • Work experience analysis with categorical breakdowns
    • Key insights highlighting top-performing categories

    Here’s the complete source code for the function:

    def get_summary_context_message(df: pd.DataFrame) -> str:
        """
        Generate a comprehensive summary report of MBA admissions dataset statistics.

        This function analyzes MBA application data to provide detailed statistics on
        applicant demographics, academic performance, professional backgrounds, and
        admission rates across various categories. The summary includes gender and
        international status distributions, GPA and GMAT score statistics, admission
        rates by academic major and work industry, and work experience impact analysis.

        Parameters
        ----------
        df : pd.DataFrame
            DataFrame containing MBA admissions data with the following expected columns:
            - 'gender', 'international', 'gpa', 'gmat', 'major', 'work_industry', 'work_exp', 'admission'

        Returns
        -------
        str
            A formatted multi-line string containing comprehensive MBA admissions
            statistics.
        """
        # Basic application statistics
        total_applications = len(df)

        # Gender distribution
        gender_counts = df["gender"].value_counts()
        male_count = gender_counts.get("Male", 0)
        female_count = gender_counts.get("Female", 0)

        # International status
        international_count = (
            df["international"].sum()
            if df["international"].dtype == bool
            else (df["international"] == True).sum()
        )

        # GPA statistics
        gpa_data = df["gpa"].dropna()
        gpa_avg = gpa_data.mean()
        gpa_25th = gpa_data.quantile(0.25)
        gpa_50th = gpa_data.quantile(0.50)
        gpa_75th = gpa_data.quantile(0.75)

        # GMAT statistics
        gmat_data = df["gmat"].dropna()
        gmat_avg = gmat_data.mean()
        gmat_25th = gmat_data.quantile(0.25)
        gmat_50th = gmat_data.quantile(0.50)
        gmat_75th = gmat_data.quantile(0.75)

        # Major analysis - admission rates by major
        major_stats = []
        for major in df["major"].unique():
            major_data = df[df["major"] == major]
            admitted = len(major_data[major_data["admission"] == "Admit"])
            total = len(major_data)
            rate = (admitted / total) * 100
            major_stats.append((major, admitted, total, rate))

        # Sort by admission rate (descending)
        major_stats.sort(key=lambda x: x[3], reverse=True)

        # Work industry analysis - admission rates by industry
        industry_stats = []
        for industry in df["work_industry"].unique():
            if pd.isna(industry):
                continue
            industry_data = df[df["work_industry"] == industry]
            admitted = len(industry_data[industry_data["admission"] == "Admit"])
            total = len(industry_data)
            rate = (admitted / total) * 100
            industry_stats.append((industry, admitted, total, rate))

        # Sort by admission rate (descending)
        industry_stats.sort(key=lambda x: x[3], reverse=True)

        # Work experience analysis
        work_exp_data = df["work_exp"].dropna()
        avg_work_exp_all = work_exp_data.mean()

        # Work experience for admitted students
        admitted_students = df[df["admission"] == "Admit"]
        admitted_work_exp = admitted_students["work_exp"].dropna()
        avg_work_exp_admitted = admitted_work_exp.mean()

        # Work experience ranges analysis
        def categorize_work_exp(exp):
            if pd.isna(exp):
                return "Unknown"
            elif exp < 2:
                return "0-1 years"
            elif exp < 4:
                return "2-3 years"
            elif exp < 6:
                return "4-5 years"
            elif exp < 8:
                return "6-7 years"
            else:
                return "8+ years"

        # Note: this adds a helper column to the DataFrame that was passed in
        df["work_exp_category"] = df["work_exp"].apply(categorize_work_exp)
        work_exp_category_stats = []

        for category in ["0-1 years", "2-3 years", "4-5 years", "6-7 years", "8+ years"]:
            category_data = df[df["work_exp_category"] == category]
            if len(category_data) > 0:
                admitted = len(category_data[category_data["admission"] == "Admit"])
                total = len(category_data)
                rate = (admitted / total) * 100
                work_exp_category_stats.append((category, admitted, total, rate))

        # Build the summary message
        summary = f"""MBA Admissions Dataset Summary (2025)

    Total Applications: {total_applications:,} people applied to the MBA program.

    Gender Distribution:
    - Male applicants: {male_count:,} ({male_count/total_applications*100:.1f}%)
    - Female applicants: {female_count:,} ({female_count/total_applications*100:.1f}%)

    International Status:
    - International applicants: {international_count:,} ({international_count/total_applications*100:.1f}%)
    - Domestic applicants: {total_applications-international_count:,} ({(total_applications-international_count)/total_applications*100:.1f}%)

    Academic Performance Statistics:

    GPA Statistics:
    - Average GPA: {gpa_avg:.2f}
    - 25th percentile: {gpa_25th:.2f}
    - 50th percentile (median): {gpa_50th:.2f}
    - 75th percentile: {gpa_75th:.2f}

    GMAT Statistics:
    - Average GMAT: {gmat_avg:.0f}
    - 25th percentile: {gmat_25th:.0f}
    - 50th percentile (median): {gmat_50th:.0f}
    - 75th percentile: {gmat_75th:.0f}

    Major Analysis - Admission Rates by Academic Background:"""

        for major, admitted, total, rate in major_stats:
            summary += (
                f"\n- {major}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
            )

        summary += (
            "\n\nWork Industry Analysis - Admission Rates by Professional Background:"
        )

        # Show top 8 industries by admission rate
        for industry, admitted, total, rate in industry_stats[:8]:
            summary += (
                f"\n- {industry}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
            )

        summary += "\n\nWork Experience Impact on Admissions:\n\nOverall Work Experience Comparison:"
        summary += (
            f"\n- Average work experience (all applicants): {avg_work_exp_all:.1f} years"
        )
        summary += f"\n- Average work experience (admitted students): {avg_work_exp_admitted:.1f} years"

        summary += "\n\nAdmission Rates by Work Experience Range:"
        for category, admitted, total, rate in work_exp_category_stats:
            summary += (
                f"\n- {category}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
            )

        # Key insights
        best_major = major_stats[0]
        best_industry = industry_stats[0]

        summary += "\n\nKey Insights:"
        summary += (
            f"\n- Highest admission rate by major: {best_major[0]} at {best_major[3]:.1f}%"
        )
        summary += f"\n- Highest admission rate by industry: {best_industry[0]} at {best_industry[3]:.1f}%"

        if avg_work_exp_admitted > avg_work_exp_all:
            summary += f"\n- Admitted students have slightly more work experience on average ({avg_work_exp_admitted:.1f} vs {avg_work_exp_all:.1f} years)"
        else:
            summary += "\n- Work experience shows minimal difference between admitted and all applicants"

        return summary

    Once you’ve defined the function, simply call it and print the results:

    print(get_summary_context_message(df))

    Image 3 – Extracted findings and statistics from the dataset (image by author)
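The per-major and per-industry loops in the function above repeat the same pattern: filter by category, count admits, compute a rate. As a possible alternative, here is a minimal sketch that does the same with a Pandas groupby. The helper name admission_rate_by() and the toy DataFrame are hypothetical and not part of the original code; only the "admission" column and the "Admit" label come from the article.

```python
import pandas as pd


def admission_rate_by(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Admitted count, total count, and admission rate (%) per category."""
    is_admitted = df["admission"].eq("Admit")
    # Rows with a missing category are dropped by groupby's default behavior.
    stats = is_admitted.groupby(df[column]).agg(admitted="sum", total="count")
    stats["rate"] = stats["admitted"] / stats["total"] * 100
    return stats.sort_values("rate", ascending=False)


# Toy data for illustration only, not taken from the MBA dataset.
toy = pd.DataFrame(
    {
        "major": ["STEM", "STEM", "Business", "Business", "Humanities"],
        "admission": ["Admit", "Deny", "Deny", "Deny", "Admit"],
    }
)
rates = admission_rate_by(toy, "major")
print(rates)
```

The same helper could be called with "work_industry" to cover the second loop. Whether this reads better than the explicit loops is a matter of taste.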

    Now let’s move on to the fun part.

    The Cool Part: Working with LLMs

    This is where things get interesting and your manual data extraction work pays off.

    Python helper function for working with LLMs

    If you have decent hardware, I strongly recommend using local LLMs for simple tasks like this. I use Ollama and the latest version of the Mistral model for the actual LLM processing.

    Image 4 – Available Ollama models (image by author)

    If you want to use something like ChatGPT through the OpenAI API, you can still do that. You’ll just need to modify the function below to set up your API key and return the appropriate instance from LangChain.

    Whichever option you choose, a call to get_llm() with a test message shouldn’t return an error:

    def get_llm(model_name: str = "mistral:latest") -> ChatOllama:
        """
        Create and configure a ChatOllama instance for local LLM inference.

        This function initializes a ChatOllama client configured to connect to a
        local Ollama server. The client is set up with deterministic output
        (temperature=0) for consistent responses across multiple calls with the
        same input.

        Parameters
        ----------
        model_name : str, optional
            The name of the Ollama model to use for chat completions.
            Must be a valid model name that is available on the local Ollama
            installation. Default is "mistral:latest".

        Returns
        -------
        ChatOllama
            A configured ChatOllama instance ready for chat completions.
        """
        return ChatOllama(
            model=model_name, base_url="http://localhost:11434", temperature=0
        )


    print(get_llm().invoke("test").content)

    Image 5 – LLM test message (image by author)
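If you'd rather switch between a local model and a hosted one, the helper could be extended along these lines. This is only a sketch under a few assumptions: the function name get_llm_for() is hypothetical, the OpenAI branch assumes the langchain-openai package is installed and the OPENAI_API_KEY environment variable is set, and "gpt-4o-mini" is just a placeholder model name.

```python
from typing import Literal, Optional


def get_llm_for(
    provider: Literal["ollama", "openai"] = "ollama",
    model_name: Optional[str] = None,
):
    """Return a LangChain chat model for the chosen provider (sketch)."""
    if provider == "ollama":
        # Imported lazily so only the provider you use must be installed.
        from langchain_ollama import ChatOllama

        return ChatOllama(
            model=model_name or "mistral:latest",
            base_url="http://localhost:11434",
            temperature=0,
        )
    if provider == "openai":
        # Assumption: langchain-openai is installed and OPENAI_API_KEY is set.
        from langchain_openai import ChatOpenAI

        return ChatOpenAI(model=model_name or "gpt-4o-mini", temperature=0)
    raise ValueError(f"Unsupported provider: {provider!r}")
```

Both chat model classes expose the same invoke() method, so something like get_llm_for("openai").invoke("test").content should behave like the Ollama version, provided the credentials are in place.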

    Summarization prompt

    This is where you can get creative and write ultra-specific instructions for your LLM. I’ve decided to keep things light for demonstration purposes, but feel free to experiment here.

    There is no single right or wrong prompt.

    Whatever you do, make sure to include the format arguments in curly brackets, since these values are filled in dynamically later:

    SUMMARIZE_DATAFRAME_PROMPT = """
    You are an expert data analyst and data summarizer. Your job is to take in complex datasets
    and return user-friendly descriptions and findings.

    You were given this dataset:
    - Name: {dataset_name}
    - Source: {dataset_source}

    This dataset was analyzed in a pipeline before it was given to you.
    These are the findings returned by the analysis pipeline:

    <context>
    {context}
    </context>

    Based on these findings, write a detailed report in {report_format} format.
    Give the report a meaningful title and separate findings into sections with headings and subheadings.
    Output only the report in {report_format} and nothing else.

    Report:
    """

    Summarization Python function

    With the prompt and the get_llm() function declared, the only thing left is to connect the dots. The get_report_summary() function takes in arguments that fill the format placeholders in the prompt, then invokes the LLM with that prompt to generate a report.

    You can choose between Markdown and HTML formats:

    def get_report_summary(
        dataset: pd.DataFrame,
        dataset_name: str,
        dataset_source: str,
        report_format: Literal["markdown", "html"] = "markdown",
    ) -> str:
        """
        Generate an AI-powered summary report from a pandas DataFrame.

        This function analyzes a dataset and generates a comprehensive summary report
        using a large language model (LLM). It first extracts statistical context
        from the dataset, then uses an LLM to create a human-readable report in the
        specified format.

        Parameters
        ----------
        dataset : pd.DataFrame
            The pandas DataFrame to analyze and summarize.
        dataset_name : str
            A descriptive name for the dataset that will be included in the
            generated report for context and identification.
        dataset_source : str
            Information about the source or origin of the dataset.
        report_format : {"markdown", "html"}, optional
            The desired output format for the generated report. Options are:
            - "markdown" : Generate report in Markdown format (default)
            - "html" : Generate report in HTML format

        Returns
        -------
        str
            A formatted summary report.

        """
        context_message = get_summary_context_message(df=dataset)
        prompt = SUMMARIZE_DATAFRAME_PROMPT.format(
            dataset_name=dataset_name,
            dataset_source=dataset_source,
            context=context_message,
            report_format=report_format,
        )
        return get_llm().invoke(input=prompt).content

    Using the function is simple: just pass in the dataset, its name, and its source. The report format defaults to Markdown:

    md_report = get_report_summary(
        dataset=df, 
        dataset_name="MBA Admissions (2025)",
        dataset_source="https://www.kaggle.com/datasets/taweilo/mba-admission-dataset"
    )
    print(md_report)

    Image 6 – Final report in Markdown format (image by author)

    The HTML report is just as detailed, but could use some styling. Maybe you could ask the LLM to handle that as well!

    Image 7 – Final report in HTML format (image by author)

    What You Could Improve

    I could have easily turned this into a 30-minute read by optimizing every detail of the pipeline, but I kept it simple for demonstration purposes. You don’t have to (and shouldn’t) stop here though.

    Here are the things you can improve to make this pipeline even more powerful:

    • Write a function that saves the report (Markdown or HTML) directly to disk. This way you can automate the entire process and generate reports on a schedule without manual intervention.
    • In the prompt, ask the LLM to add CSS styling to the HTML report to make it look more presentable. You could even provide your company’s brand colors and fonts to maintain consistency across all your data reports.
    • Expand the prompt to follow more specific instructions. You might want reports that focus on specific business metrics, follow a particular template, or include recommendations based on the findings.
    • Expand the get_llm() function so it can connect both to Ollama and to other vendors like OpenAI, Anthropic, or Google. This gives you the flexibility to switch between local and cloud-based models depending on your needs.
    • Do literally anything in the get_summary_context_message() function, since it serves as the foundation for all context data provided to the LLM. This is where you can get creative with feature engineering, statistical analysis, and data insights that matter to your specific use case.
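As a starting point for the first item on the list, here is a minimal sketch of a function that writes a report string to disk. The name save_report(), the dated file naming scheme, and the demo content are all assumptions made for illustration. In practice you would pass in the string returned by get_report_summary().

```python
import tempfile
from datetime import date
from pathlib import Path


def save_report(report: str, output_dir: str, report_format: str = "markdown") -> Path:
    """Write a report to <output_dir>/report_<YYYY-MM-DD>.<ext> and return its path."""
    extensions = {"markdown": ".md", "html": ".html"}
    if report_format not in extensions:
        raise ValueError(f"Unsupported report format: {report_format!r}")

    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    file_name = f"report_{date.today().isoformat()}{extensions[report_format]}"
    path = out_dir / file_name
    path.write_text(report, encoding="utf-8")
    return path


# Demo with a temporary directory and placeholder report content.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_report("# Placeholder report", tmp_dir)
    print(f"Saved report as {saved_path.name}")
```

Because the file name only contains the date, a second run on the same day overwrites the earlier file. Adding a timestamp or the dataset name to the file name would avoid that.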

    I hope this minimal example has set you on the right track to automate your own data reporting workflows.


