    Abstract Classes: A Software Engineering Concept Data Scientists Must Know To Succeed

    Whether you're a graduate planning to enter data science, a professional looking for a career change, or a manager responsible for establishing best practices, this article is for you.

    Data science attracts a wide variety of different backgrounds. From my professional experience, I've worked with colleagues who were once:

    • Nuclear physicists
    • Post-docs researching gravitational waves
    • PhDs in computational biology
    • Linguists

    just to name a few.

    It's wonderful to be able to meet such a diverse set of backgrounds, and I've seen such a variety of minds lead to the growth of a creative and effective data science function.

    However, I've also seen one big downside to this variety:

    Everyone has had different levels of exposure to key software engineering concepts, resulting in a patchwork of coding skills.

    As a result, I've seen work produced by some data scientists that is brilliant, but is:

    • Unreadable — you have no idea what they're trying to do.
    • Flaky — it breaks the moment someone else tries to run it.
    • Unmaintainable — code quickly becomes obsolete or breaks easily.
    • Un-extensible — code is single-use and its behaviour can't be extended.

    which ultimately dampens the impact their work can have and creates all sorts of issues down the line.

    So, in a series of articles, I plan to outline some core software engineering concepts that I've tailored to be essentials for data scientists.

    They're simple concepts, but the difference between knowing them and not knowing them clearly draws the line between amateur and professional.

    Abstract Art, photo by Steve Johnson on Unsplash

    Today's concept: abstract classes

    Abstract classes are an extension of class inheritance, and they can be a very useful tool for data scientists if used appropriately.

    If you need a refresher on class inheritance, see my article on it here.

    As we did for class inheritance, I won't bother with a formal definition. Looking back to when I first started coding, I found it hard to decipher the vague and abstract (no pun intended) definitions out there on the internet.

    It's much easier to illustrate it by going through a practical example.

    So, let's go straight into an example that a data scientist is likely to encounter, to show how abstract classes are used and why they're useful.

    Example: Preparing data for ingestion into a feature generation pipeline

    Photo by Scott Graham on Unsplash

    Let's say we're a consultancy that specialises in fraud detection for financial institutions.

    We work with a number of different clients, and we have a set of features that carry a consistent signal across different client projects because they embed domain knowledge gathered from subject matter experts.

    So it makes sense to build these features for every project, even if they're dropped during feature selection or are replaced with bespoke features built for that client.

    The challenge

    We data scientists know that working across different projects/environments/clients means that the input data for each is never the same:

    • Clients may provide different file types: CSV, Parquet, JSON, tar, to name a few.
    • Different environments may require different sets of credentials.
    • Almost certainly, each dataset has its own quirks, so each requires different data cleaning steps.

    Therefore, you might think that we would need to build a new feature generation pipeline for every client.

    How else would you handle the intricacies of each dataset?

    No, there is a better way

    Given that:

    • We know we're going to be building the same set of useful features for every client,
    • We can build one feature generation pipeline that can be reused for every client,
    • Thus, the only new problem we need to solve is cleaning the input data.

    Thus, our problem can be formulated into the following stages:

    Image by author. Blue circles are datasets, yellow squares are pipelines.
    • Data cleaning pipeline
      • Responsible for handling any unique cleaning and processing that is required for a given client, in order to format the dataset into a standardised schema dictated by the feature generation pipeline.
    • The feature generation pipeline
      • Implements the feature engineering logic, assuming the input data follows a fixed schema, to output our useful set of features.

    Given a fixed input data schema, building the feature generation pipeline is trivial.

    Therefore, we have boiled our problem down to the following:

    How can we ensure the quality of the data cleaning pipelines such that their outputs always adhere to the downstream requirements?

    The real problem we're solving

    Our problem of 'ensuring the output always adheres to downstream requirements' is not just about getting code to run. That's the easy part.

    The hard part is designing code that is robust to a myriad of external, non-technical factors such as:

    • Human error
      • People naturally forget small details or prior assumptions. They may build a data cleaning pipeline whilst overlooking certain requirements.
    • Leavers
      • Over time, your team inevitably changes. Your colleagues may have knowledge that they assumed to be obvious, and therefore they never bothered to document it. Once they've left, that knowledge is lost. Only through trial and error, and hours of debugging, will your team ever recover that knowledge.
    • New joiners
      • Meanwhile, new joiners have no knowledge of prior assumptions that were once deemed obvious, so their code usually requires a lot of debugging and rewriting.

    This is where abstract classes really shine.

    Input data requirements

    We mentioned that we can fix the schema for the feature generation pipeline's input data, so let's define this for our example.

    Let's say that our pipeline expects to read in parquet files containing the following columns:

    row_id:
        int, a unique ID for every transaction.
    timestamp:
        str, in ISO 8601 format. The timestamp at which the transaction was made.
    amount:
        int, the transaction amount denominated in pennies (for our US readers, the equivalent will be cents).
    direction:
        str, the direction of the transaction, one of ['OUTBOUND', 'INBOUND']
    account_holder_id:
        str, unique identifier for the entity that owns the account the transaction was made on.
    account_id:
        str, unique identifier for the account the transaction was made on.

    Let's also add in a requirement that the dataset must be ordered by timestamp.

    The abstract class

    Now, time to define our abstract class.

    An abstract class is essentially a blueprint from which we can inherit to create child classes, otherwise known as 'concrete' classes.

    Let's spec out the different methods we will need for our data cleaning blueprint.

    import os
    from abc import ABC, abstractmethod

    import polars as pl  # used by the pre-defined save and validate methods below

    class BaseRawDataPipeline(ABC):
        def __init__(
            self,
            input_data_path: str | os.PathLike,
            output_data_path: str | os.PathLike
        ):
            self.input_data_path = input_data_path
            self.output_data_path = output_data_path

        @abstractmethod
        def transform(self, raw_data):
            """Transform the raw data.

            Args:
                raw_data: The raw data to be transformed.
            """
            ...

        @abstractmethod
        def load(self):
            """Load in the raw data."""
            ...

        def save(self, transformed_data):
            """Save the transformed data."""
            ...

        def validate(self, transformed_data):
            """Validate the transformed data."""
            ...

        def run(self):
            """Run the data cleaning pipeline."""
            ...

    You can see that we have imported the ABC class from the abc module, which allows us to create abstract classes in Python.

    Image by author. Diagram of the abstract class and concrete class relationships and methods.

    Pre-defined behaviour

    Image by author. The methods to be pre-defined are circled in red.

    Let's now add some pre-defined behaviour to our abstract class.

    Remember, this behaviour will be made available to all child classes that inherit from this class, so this is where we bake in behaviour that you want to enforce for all future projects.

    For our example, the behaviours that need fixing across all projects are all related to how we output the processed dataset.

    1. The run method

    First, we define the run method. This is the method that will be called to run the data cleaning pipeline.

        def run(self):
            """Run the data cleaning pipeline."""
            inputs = self.load()
            output = self.transform(inputs)
            self.validate(output)
            self.save(output)

    The run method acts as a single point of entry for all future child classes.

    This standardises how any data cleaning pipeline will be run, which enables us to build new functionality around any pipeline without worrying about the underlying implementation.

    You can imagine how incorporating such pipelines into some orchestrator or scheduler will be easier if all pipelines are executed through the same run method, as opposed to having to handle many different names such as run, execute, process, fit, transform, and so on.
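
    As a minimal sketch of that idea, imagine two hypothetical concrete pipelines (the class names and file paths below are placeholders invented for illustration): the orchestrating code can loop over them without caring which is which.

    # Hypothetical concrete subclasses of BaseRawDataPipeline; names and paths
    # are placeholders for illustration only.
    pipelines = [
        ClientARawDataPipeline("client_a_raw.csv", "client_a_clean.parquet"),
        ClientBRawDataPipeline("client_b_raw.json", "client_b_clean.parquet"),
    ]

    # The orchestrating code only relies on the shared run() entry point.
    for pipeline in pipelines:
        pipeline.run()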

    2. The save method

    Next, we fix how we output the transformed data.

        def save(self, transformed_data: pl.LazyFrame):
            """Save the transformed data to parquet."""
            transformed_data.sink_parquet(
                self.output_data_path,
            )

    We're assuming we'll use `polars` for data manipulation, and the output is saved as `parquet` files as per our specification for the feature generation pipeline.

    3. The validate method

    Finally, we populate the validate method, which will check that the dataset adheres to our expected output format before saving it down.

        @property
        def output_schema(self):
            return dict(
                row_id=pl.Int64,
                timestamp=pl.String,
                amount=pl.Int64,
                direction=pl.Categorical,
                account_holder_id=pl.Categorical,
                account_id=pl.Categorical,
            )

        def validate(self, transformed_data):
            """Validate the transformed data."""
            schema = transformed_data.collect_schema()
            assert self.output_schema == dict(schema), (
                f"Expected {self.output_schema} but got {schema}"
            )

    We've created a property called output_schema. This ensures that all child classes will have it available, whilst preventing it from being accidentally removed or overridden, as could happen if it were defined in, for example, __init__.
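
    As a side note, the schema comparison above does not enforce the other requirement we stated earlier, namely that the dataset must be ordered by timestamp. A minimal sketch of how validate could be extended to cover it, still assuming polars, might look like this (note the ordering check materialises the timestamp column, which could be costly for very large datasets):

        def validate(self, transformed_data: pl.LazyFrame):
            """Validate the transformed data's schema and ordering."""
            schema = transformed_data.collect_schema()
            assert self.output_schema == dict(schema), (
                f"Expected {self.output_schema} but got {schema}"
            )

            # Sketch of the ordering requirement: the timestamp column must be sorted.
            timestamps = transformed_data.select("timestamp").collect().to_series()
            assert timestamps.is_sorted(), "Expected the dataset to be ordered by timestamp"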

    Project-specific behaviour

    Image by author. Project-specific methods that need to be overridden are circled in red.

    In our example, the load and transform methods are where project-specific behaviour will be held, so we leave them blank in the base class; the implementation is deferred to the future data scientist responsible for writing this logic for the project.

    You will also notice that we have used the abstractmethod decorator on the transform and load methods. This decorator requires these methods to be defined by a child class. If a user forgets to define them, an error will be raised to remind them to do so, as the short sketch below shows.
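
    As a minimal sketch of that enforcement (the child class and file paths here are hypothetical), forgetting to implement transform means Python refuses to instantiate the class at all:

    # Hypothetical child class that forgets to implement transform().
    class IncompletePipeline(BaseRawDataPipeline):
        def load(self):
            ...

    # Raises roughly:
    # TypeError: Can't instantiate abstract class IncompletePipeline
    # with abstract method 'transform'
    pipeline = IncompletePipeline("raw.csv", "clean.parquet")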

    Let's now move on to an example project where we can define the transform and load methods.

    Example project

    The client on this project sends us their dataset as CSV files with the following structure:

    event_id: str
    unix_timestamp: int
    user_uuid: int
    wallet_uuid: int
    payment_value: float
    country: str

    We learn from them that:

    • Each transaction is uniquely identified by the combination of event_id and unix_timestamp
    • The wallet_uuid is the equivalent identifier for the 'account'
    • The user_uuid is the equivalent identifier for the 'account holder'
    • The payment_value is the transaction amount, denominated in pounds sterling (or dollars).
    • The CSV file is separated by | and has no header.

    The concrete class

    Now, we implement the load and transform functions to handle the unique complexities outlined above in a child class of BaseRawDataPipeline.

    Remember, these methods are all that need to be written by the data scientists working on this project. All the aforementioned methods are pre-defined, so they need not worry about them, reducing the amount of work your team has to do.

    1. Loading the data

    The load function is quite simple:

    class Project1RawDataPipeline(BaseRawDataPipeline):

        def load(self):
            """Load in the raw data.

            Note:
                As per the client's specification, the CSV file is separated
                by `|` and has no header, so we assign the agreed column names here.
            """
            return pl.scan_csv(
                self.input_data_path,
                separator="|",
                has_header=False,
                # the file has no header row, so name the columns explicitly
                new_columns=[
                    "event_id",
                    "unix_timestamp",
                    "user_uuid",
                    "wallet_uuid",
                    "payment_value",
                    "country",
                ],
            )

    We use polars' scan_csv method to stream the data, with the appropriate arguments to handle the CSV file structure for our client.

    2. Transforming the data

    The transform method is also simple for this project, since we don't have any complex joins or aggregations to perform, so we can fit it all into a single function.

    class Project1RawDataPipeline(BaseRawDataPipeline):

        ...

        def transform(self, raw_data: pl.LazyFrame):
            """Transform the raw data.

            Args:
                raw_data (pl.LazyFrame):
                    The raw data to be transformed. Must contain the following columns:
                        - 'event_id'
                        - 'unix_timestamp'
                        - 'user_uuid'
                        - 'wallet_uuid'
                        - 'payment_value'

            Returns:
                pl.LazyFrame:
                    The transformed data.

                    Operations:
                        1. row_id is constructed by concatenating event_id and unix_timestamp
                        2. account_id and account_holder_id are renamed from wallet_uuid and user_uuid
                        3. amount is converted from payment_value. Source data
                        denomination is in £/$, so we need to convert to p/cents.
            """

            # select only the columns we need
            DESIRED_COLUMNS = [
                "event_id",
                "unix_timestamp",
                "user_uuid",
                "wallet_uuid",
                "payment_value",
            ]
            df = raw_data.select(DESIRED_COLUMNS)

            df = df.select(
                # concatenate event_id and unix_timestamp
                # to get a unique identifier for each row.
                pl.concat_str(
                    [
                        pl.col("event_id"),
                        pl.col("unix_timestamp")
                    ],
                    separator="-"
                ).alias('row_id'),

                # convert unix timestamp to ISO format string
                pl.from_epoch("unix_timestamp", "s").dt.to_string("iso").alias("timestamp"),

                # wallet_uuid identifies the account, user_uuid the account holder
                pl.col("wallet_uuid").alias("account_id"),
                pl.col("user_uuid").alias("account_holder_id"),

                # convert from £ to p
                # OR convert from $ to cents
                (pl.col("payment_value") * 100).alias("amount"),
            )

            return df

    Thus, by overriding these two methods, we've implemented all we need for our client project.

    We know the output conforms to the requirements of the downstream feature engineering pipeline, so we automatically have assurance that our outputs are correct.

    No debugging required. No problem. No fuss.
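
    For completeness, here is a minimal sketch of how the project team might run the finished pipeline; the file paths are hypothetical placeholders.

    # Hypothetical paths for illustration only.
    pipeline = Project1RawDataPipeline(
        input_data_path="data/project1/transactions.csv",
        output_data_path="data/project1/transactions_clean.parquet",
    )
    pipeline.run()  # load -> transform -> validate -> save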

    Final summary: Why use abstract classes in data science pipelines?

    Abstract classes offer a powerful way to bring consistency, robustness, and improved maintainability to data science projects. By using abstract classes as in our example, our data science team sees the following benefits:

    1. No need to worry about compatibility

    By defining a clear blueprint with abstract classes, the data scientist only needs to focus on implementing the load and transform methods specific to their client's data.

    As long as these methods conform to the expected input/output types, compatibility with the downstream feature generation pipeline is guaranteed.

    This separation of concerns simplifies the development process, reduces bugs, and accelerates delivery for new projects.

    2. Easier to document

    The structured format naturally encourages in-line documentation through method docstrings.

    This proximity of design decisions and implementation makes it easier to communicate assumptions, transformations, and nuances for each client's dataset.

    Well-documented code is easier to read, maintain, and hand over, reducing the knowledge loss caused by team changes or turnover.

    3. Improved code readability and maintainability

    With abstract classes enforcing a consistent interface, the resulting codebase avoids the pitfalls of unreadable, flaky, or unmaintainable scripts.

    Each child class adheres to a standardised method structure (load, transform, validate, save, run), making the pipelines more predictable and easier to debug.

    4. Robustness to human factors

    Abstract classes help reduce risks from human error, teammates leaving, or onboarding new joiners by embedding essential behaviours in the base class. This ensures that critical steps are never skipped, even if individual contributors are unaware of all downstream requirements.

    5. Extensibility and reusability

    By isolating client-specific logic in concrete classes while sharing common behaviour in the abstract base, it becomes easy to extend pipelines for new clients or projects. You can add new data cleaning steps or support new file formats without rewriting the entire pipeline, as in the sketch below.
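
    For example, supporting a hypothetical new client whose extract arrives as Parquet only requires another concrete class. The column names below are invented purely for illustration, and the shared run, validate, and save behaviour is inherited unchanged.

    class Project2RawDataPipeline(BaseRawDataPipeline):
        """Hypothetical second client: Parquet extract, different column names."""

        def load(self):
            return pl.scan_parquet(self.input_data_path)

        def transform(self, raw_data: pl.LazyFrame):
            # Rename the client's columns to the standardised schema.
            return raw_data.select(
                pl.col("txn_id").alias("row_id"),
                pl.col("created_at").alias("timestamp"),
                pl.col("value_minor_units").alias("amount"),
                pl.col("flow").alias("direction"),
                pl.col("customer_id").alias("account_holder_id"),
                pl.col("account_ref").alias("account_id"),
            )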

    In summary, abstract classes level up your data science codebase from ad-hoc scripts to scalable, maintainable, production-grade code. Whether you're a data scientist, a team lead, or a manager, adopting these software engineering concepts will significantly increase the impact and longevity of your work.

    Related articles:

    If you enjoyed this article, then take a look at some of my other related articles.

    • Inheritance: A software engineering concept data scientists must know to succeed (here)
    • Encapsulation: A software engineering concept data scientists must know to succeed (here)
    • The Data Science Tool You Need For Efficient ML-Ops (here)
    • DSLP: The data science project management framework that transformed my team (here)
    • How to stand out in your data scientist interview (here)
    • An Interactive Visualisation For Your Graph Neural Network Explanations (here)
    • The New Best Python Package for Visualising Network Graphs (here)


