
    EDA in Public (Part 1): Cleaning and Exploring Sales Data with Pandas

By ProfitlyAI | December 12, 2025 | 12 min read


Welcome to the beginning of a new learning journey that I'm calling "EDA in Public." For those who know me, I believe one of the best ways to learn something is to tackle a real-world problem and share the whole messy process, including mistakes, victories, and everything in between. If you've been looking to level up your Pandas and data analysis skills, this is the series for you.

We're going to act as data analysts for a fictional, mid-sized e-commerce company I'll call NovaShop. They handed us a raw, messy sales CSV and asked a simple question: "How are we doing?"

The goal of Part 1 is foundational: we'll clean this tricky e-commerce dataset, explore its basic structure, and master the core EDA skills in Pandas that every data scientist uses daily. The series is structured to take you from beginner (Part 1) to advanced data analyst (Part 3), so feel free to jump in wherever you are.

Before we jump into the code, let's nail down our core motivation. For NovaShop, we need to answer some simple but powerful questions: Which products drive the most revenue? Which countries generate the most sales? Let's find out.

Dataset Overview: Unpacking the Sales Data

To start our analysis for NovaShop, we'll be using the UCI Online Retail Dataset. It's a great, highly realistic, unpolished dataset that captures all transactions for a UK-based, non-store online retailer between late 2010 and late 2011.

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, which allows sharing and adaptation of the dataset for any purpose, provided that appropriate credit is given.

The dataset contains over half a million rows, and it's full of the sorts of anomalies you encounter in the real world: missing values, negative numbers, and inconsistent text formatting. That's exactly what we want!

Here are the eight key columns we'll be working with, and what they tell us from a business perspective:

• InvoiceNo: the invoice number, a unique 6-digit number assigned to each transaction. If it starts with 'C', the transaction is a cancellation (a return).
• StockCode: the product code, a unique 5-digit code assigned to each distinct product.
• Description: the name of the item. This often needs cleaning (extra spaces, inconsistent casing).
• Quantity: the number of units sold in the transaction. It can be negative for returns.
• InvoiceDate: the date and time of the transaction. This is essential for time-series analysis later on.
• UnitPrice: the price per unit in Sterling (GBP). It can occasionally be 0 or negative due to errors or free items.
• CustomerID: a unique 5-digit number assigned to each registered customer. Crucially, this is often missing, meaning many transactions come from guests.
• Country: the name of the country where the customer resides. Useful for segmenting international sales.

Let's import the dataset into Pandas, take a quick peek at the first few rows, and see how many rows we're dealing with.

import pandas as pd
import numpy as np

df = pd.read_csv('Online Retail.csv')
df.shape

    Output:

    (541909, 8)

That's a lot of rows. I'll want to slim this down a bit.

Data Loading and Sampling: Dealing with Volume

For Part 1, we're going to take a 10% random sample of the full dataset. This gives us a much more manageable size, around 54,000 rows, while preserving the characteristics of the original data. This is a common and practical technique when dealing with very large datasets or when prototyping.

# Data loading and sampling
FILE_PATH = 'Online Retail.csv'
SAMPLE_FRACTION = 0.1  # We will sample 10% of the data
full_df = pd.read_csv(FILE_PATH, encoding='unicode_escape')

# Take a random 10% sample for faster processing in Part 1
df = full_df.sample(frac=SAMPLE_FRACTION, random_state=42).reset_index(drop=True)

Now, let's check the shape again using df.shape.
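The call is the same one-liner as before:

# Re-check the dimensions after sampling
df.shape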

    Output:

    (54191, 8)

Perfect! Now we can begin.

Initial Data Inspection: Getting Our Hands Dirty

My first step is to figure out what we're cleaning. I need to answer these questions:

• What does the dataset contain?
• What are the data types and non-null counts?
• What do the numbers look like?

Visual Check: df.head() and df.tail()

Looking at the first and last few rows (the exact calls are shown after this list), I could confirm a few things:

    • Cancellations & Returns: I occurred to note some InvoiceNo values beginning with ‘C’, and the corresponding Amount is adverse. This confirms returns are included, and for income evaluation, I’ll have to separate or exclude them.
    • Missingness: I may visually see NaN values within the CustomerID column, confirming the mass of visitor transactions.
    • Inconsistent Textual content: I checked the Description column for whitespaces. Based mostly on the small pattern I bought, I couldn’t actually inform. But it surely doesn’t harm to handle them after I go into knowledge cleansing. I may additionally maintain the capitalisation constant. It’s at all times greatest follow to strip all main/trailing whitespace from all string columns to stop refined errors when grouping.
    • I additionally observed some capitalisation inconsistencies within the Nation column. I noticed a rustic named EIRE. It in all probability means Eire, may want to alter that.
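For reference, the visual check is just these two calls (their printed tables are omitted here):

# Peek at the first and last five rows of the sample
df.head()
df.tail()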

What Are the Data Types and Non-Null Counts? (.info())

The next essential step is to inspect the structure using the .info() method. This tells me the data type of each column and, most importantly, how many non-null (not missing) values we have.
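The call itself takes no arguments:

# Column dtypes, non-null counts, and approximate memory usage
df.info()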

    Key Findings

• Missing values: after taking our 10% sample, I could see a huge gap in CustomerID. Roughly 25% of the customer IDs are missing. I also noticed a few hundred missing values in Description that I'll need to handle.
• Data types: the InvoiceDate column is still listed as an object (string) type, so I'll have to convert it to a proper Pandas datetime object. CustomerID is also currently a float, most likely because it contains those NaN values! That's a small detail to remember if I ever want to use it as a true integer ID (see the sketch after this list).
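If I ever do need CustomerID as a true integer, Pandas' nullable integer dtype keeps the NaNs intact. A minimal sketch, not applied in Part 1:

# Sketch (not applied here): convert CustomerID to a nullable integer dtype
df['CustomerID'] = df['CustomerID'].astype('Int64')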

    What Do the Numbers Look Like? (.describe())

Next, I used .describe() to get a quick statistical summary of the numerical columns (Quantity and UnitPrice).
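The call is a one-liner; by default it summarizes only the numeric columns:

# Statistical summary (count, mean, std, min, quartiles, max) of numeric columns
df.describe()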

• Quantity: the minimum quantity is -2472. This confirms that returns exist, but the scale of that minimum suggests an extreme outlier transaction. For basic sales analysis, I'll have to filter out the negative values and probably the extreme positive and negative outliers too (see the sketch after this list).
• UnitPrice: the minimum price is 0. This means some items were given away for free or are placeholder entries. Since a product should have a positive price in a normal sale, it's best practice to filter out any rows where UnitPrice is zero or less before calculating revenue.
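For the outlier part, one common rule of thumb is the 1.5 x IQR fence. Here is a sketch of how that check could look; Part 1 itself only filters out the negative values later on:

# Sketch: count Quantity values outside the 1.5 * IQR fences
q1, q3 = df['Quantity'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['Quantity'] < q1 - 1.5 * iqr) | (df['Quantity'] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} rows fall outside the IQR fences.")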

Based on these quick inspections, this data is far from clean. We have massive missingness, incorrect data types, highly problematic negative/zero values in our core numerical columns, and text inconsistencies to deal with.

Handling Missing Values

Now that we know where our data is missing, we can start discussing strategies for handling null values. I'm mostly concerned with Description and CustomerID.

Missing Descriptions

The Description tells us what product was sold, and without it a transaction is meaningless for product-level analysis. In our sample, fewer than 1% of the rows are missing a description, so dropping those rows entirely makes sense.

# Drop rows where the Description is missing
df.dropna(subset=['Description'], inplace=True)
# Check the remaining null count
df['Description'].isnull().sum()

    Output:

    np.int64(0)

Alright, good! All the null descriptions are gone.

Missing Customer IDs

The missing CustomerIDs (around 25% of our sample) are a much bigger deal. If I drop them all, I'll lose almost a quarter of our sales data, which would give NovaShop a dangerously skewed view of its total revenue.

For simple revenue and product analysis (the goal of Part 1), I don't really need the CustomerID. We can proceed with the analysis on all rows even without an ID. We only need the ID if we plan on doing customer segmentation (like RFM analysis), which I'll save for Part 2!
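A quick sanity check confirms the scale of the gap without dropping anything:

# Share of rows with a missing CustomerID
missing_share = df['CustomerID'].isnull().mean()
print(f"{missing_share:.1%} of rows have no CustomerID.")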

Data Type Conversion and Duplicate Removal

Now that we've handled the missing descriptions, the next two quick issues are dealing with fully duplicated rows and fixing our essential InvoiceDate column.

Fixing the Invoice Date Type

Remember how .info() showed InvoiceDate as a string (object)? We need to fix this right away so Pandas knows how to sort and aggregate our data chronologically.

# Convert InvoiceDate from object (string) to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

Removing Duplicates

If two rows are identical across all columns, they're duplicates and will artificially inflate our sales numbers. Let's check for them and remove them.

# Count fully duplicated rows
num_duplicates = df.duplicated().sum()
print(f"Found {num_duplicates} fully duplicated rows.")

    Output:

Found 62 fully duplicated rows.

Let's drop them:

# Remove duplicates, keeping the first occurrence
df.drop_duplicates(inplace=True)
print(f"Dataframe size after removing duplicates: {len(df)} rows.")

    Output:

Dataframe size after removing duplicates: 53980 rows.

    Filtering Out Returns and Errors

Our .describe() summary showed some serious red flags: negative quantities (returns) and zero or negative unit prices (errors and free items). For Part 1, we want to calculate net revenue from actual sales, so we must filter this noise out.

Handling Negative Quantities and Prices

We will keep only the transactions where:

• Quantity is strictly greater than zero (filtering out all returns and cancellations).
• UnitPrice is strictly greater than zero (filtering out errors and free items).
# Filter: keep only transactions where Quantity is positive (i.e., sales, not returns)
df = df[df['Quantity'] > 0]

# Filter: keep only transactions where UnitPrice is positive (i.e., not free or an error)
df = df[df['UnitPrice'] > 0]
print(f"Dataframe size after filtering: {len(df)} rows.")

Output:

Dataframe size after filtering: 52933 rows.

Cleaning Text Columns

Finally, let's tackle the text standardization issues we spotted visually, like the EIRE country name and any potentially hidden whitespace in Description.

• Strip whitespace: we use .str.strip() to remove leading/trailing spaces from both Description and Country.
• Standardize country names: we manually map inconsistencies like 'EIRE' to 'Ireland'.
# Cleaning text columns
# Clean the Description and Country columns
df['Description'] = df['Description'].str.strip()
df['Country'] = df['Country'].str.strip()

# Handle specific country name inconsistencies
# EIRE is a common inconsistency in this dataset for Ireland
df['Country'] = df['Country'].replace('EIRE', 'Ireland')
print("Text columns cleaned and standardized.")

    Function Engineering & First Perception

The data is now clean, so we can finally start asking the fun questions. The single most important metric for any sales dataset is revenue. Since our original data only has Quantity and UnitPrice, we need to engineer the Revenue column ourselves.

Feature Engineering: Creating the Revenue Column

Revenue for a transaction is simply the number of items sold multiplied by the price of each item.

df['Revenue'] = df['Quantity'] * df['UnitPrice']

First Insight: Which Countries Drive Revenue?

Let's use our clean data to answer NovaShop's question: "Which countries are driving our sales?"
We'll use a powerful three-step combination:

• Group by the Country column.
• Aggregate (sum) the Revenue within each group.
• Sort the results from highest revenue to lowest.
# Group by Country, sum the Revenue, and sort for the top 10
top_countries = df.groupby('Country')['Revenue'].sum().sort_values(ascending=False).head(10)
print("\n--- Top 10 Countries by Revenue (GBP) ---")
print(top_countries)

    Output:

--- Top 10 Countries by Revenue (GBP) ---
Country
United Kingdom    941268.661
Netherlands        27435.830
EIRE               26066.000
France             23645.330
Germany            22389.510
Australia          12429.990
Spain               5600.900
Switzerland         5483.890
Hong Kong           3597.850
Belgium             3593.510
Name: Revenue, dtype: float64

The output typically shows an overwhelming dominance by the United Kingdom, which is expected since the company is UK-based. The next several countries give us a quick roadmap of where NovaShop's international focus should be.
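As a quick follow-up sketch (not part of the original walkthrough), excluding the domestic market makes the international ranking easier to read:

# Sketch: rank revenue for international markets only
intl = df[df['Country'] != 'United Kingdom']
print(intl.groupby('Country')['Revenue'].sum().sort_values(ascending=False).head(10))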

    Conclusion

We did it. We took a raw, half-a-million-row dataset, sampled it down for manageability, battled missing values, fixed data types, filtered out errors, and calculated our first key business metric. The hard, foundational work is done.

Here's a quick recap of what we covered in Part 1:

• Data vetting: we used .head(), .info(), and .describe() to identify crucial issues like the negative prices/quantities, the missing customer IDs, and the incorrect datetime format.
• Data cleaning: we systematically removed nulls and duplicates, converted InvoiceDate to a proper datetime object, and filtered out non-sale transactions (returns and free items).
• Feature engineering: we created the all-important Revenue column.
• First insight: we generated the Top 10 Countries by Revenue for NovaShop, giving them their first actionable data point from the clean dataset.

The clean dataset is now primed for more sophisticated analysis. In Part 2, we're going to dive deep into product analysis and time-series decomposition. We'll identify which items are the true money-makers and analyse how sales volume changes hour by hour and month by month.
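As a small teaser, the datetime conversion we did earlier already makes that kind of time-series question a one-liner. A sketch of what Part 2 will build on:

# Sketch: total revenue by hour of day, using the converted InvoiceDate
hourly_revenue = df.groupby(df['InvoiceDate'].dt.hour)['Revenue'].sum()
print(hourly_revenue)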

I'm excited to keep going! If you enjoyed this article, feel free to let me know on any of these channels. Your feedback means a lot to me.

Medium | LinkedIn | Twitter | YouTube


