From Configuration to Orchestration: Building an ETL Workflow with AWS Is No Longer a Struggle

By ProfitlyAI | June 19, 2025

AWS continues to lead the cloud industry with a whopping 32% share, thanks to its early market entry, robust technology and comprehensive service offerings. However, many users find AWS challenging to navigate, and this discontent has led more companies and organisations to pick its competitors, Microsoft Azure and Google Cloud Platform.

Despite its steeper learning curve and less intuitive interface, AWS remains the top cloud service thanks to its reliability, hybrid cloud support and the broadest range of services. More importantly, choosing the right strategies can significantly reduce configuration complexity, streamline workflows, and boost performance.

In this article, I'll introduce an efficient way to set up a complete ETL pipeline with orchestration on AWS, based on my own experience. It should also give you a refreshed view of producing data with AWS, or make the configuration feel like less of a struggle if this is your first time using AWS for these tasks.

Strategy for Designing an Efficient Data Pipeline

AWS has the most comprehensive ecosystem with its vast range of services. Building a production-ready data warehouse on AWS requires at least the following services:

• IAM – although this service isn't part of the workflow itself, it is the foundation for accessing all the other services.
• AWS S3 – data lake storage
• AWS Glue – ETL processing
• Amazon Redshift – data warehouse
• CloudWatch – monitoring and logging

You also need access to Airflow if you want to schedule more complex dependencies and perform advanced retries for error handling, although Redshift can handle some basic cron jobs.

To make your work easier, I highly recommend installing an IDE (Visual Studio Code or PyCharm, or of course your own favourite IDE). An IDE dramatically improves your efficiency for complex Python code, local testing/debugging, version-control integration and team collaboration. In the next section, I'll show the step-by-step configuration.

Initial Setup

Here are the initial configuration steps:

• Launch a virtual environment in your IDE.
• Install dependencies – basically, install the libraries that will be used later on.
pip install apache-airflow==2.7.0 boto3 pandas pyspark sqlalchemy
• Install the AWS CLI – this lets you write scripts to automate various AWS operations and makes managing AWS resources more efficient.
• AWS configuration – run aws configure and enter your IAM user credentials when prompted:
  • AWS Access Key ID: from your IAM user.
  • AWS Secret Access Key: from your IAM user.
  • Default region: us-east-1 (or your preferred region)
  • Default output format: json
• Integrate Airflow – here are the steps:
  • Initialize Airflow
  • Create DAG files in Airflow
  • Run the web server at http://localhost:8080 (login: admin/admin)
  • Open another terminal tab and start the scheduler
export AIRFLOW_HOME=$(pwd)/airflow
airflow db init                      # initialize the Airflow metadata database
airflow users create \
  --username admin \
  --password admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email [email protected]
airflow webserver --port 8080        # run the web server
airflow scheduler                    # start the scheduler (in a separate terminal)
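
Once the CLI is configured, a quick sanity check helps before building anything. The short boto3 sketch below (not part of the original setup; it only lists resources, so it is safe to run) confirms that the core services from the list above (S3, Glue, Redshift and CloudWatch) are reachable with your credentials:

import boto3

# List a few resources from each core service to confirm credentials and region work
s3 = boto3.client('s3')
glue = boto3.client('glue')
redshift = boto3.client('redshift')
logs = boto3.client('logs')

print("S3 buckets:", [b['Name'] for b in s3.list_buckets()['Buckets']])
print("Glue jobs:", [j['Name'] for j in glue.get_jobs().get('Jobs', [])])
print("Redshift clusters:", [c['ClusterIdentifier'] for c in redshift.describe_clusters().get('Clusters', [])])
print("CloudWatch log groups:", [g['logGroupName'] for g in logs.describe_log_groups().get('logGroups', [])])

If any of these calls fails with an authentication or permission error, revisit the IAM user's policies before moving on.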

Development Workflow: COVID-19 Data Case Study

I'm using JHU's public COVID-19 dataset (CC BY 4.0 licensed) for demonstration purposes. You can refer to the data here.

The chart below shows the workflow from data ingestion to loading the data into Redshift tables in the development environment.

Development workflow created by author

Data Ingestion

In the first step of ingesting data into AWS S3, I processed the data by melting it into long format and converting the date format. I saved the data in Parquet format to improve storage efficiency, enhance query performance and reduce storage costs. The code for this step is below:

import pandas as pd
from datetime import datetime
import os
import boto3
import sys

def process_covid_data():
    try:
        # Load raw data
        url = "https://github.com/CSSEGISandData/COVID-19/raw/master/archived_data/archived_time_series/time_series_19-covid-Confirmed_archived_0325.csv"
        df = pd.read_csv(url)

        # --- Data Processing ---
        # 1. Melt to long format
        df = df.melt(
            id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
            var_name='date_str',
            value_name='confirmed_cases'
        )

        # 2. Convert dates (JHU format: MM/DD/YY) and drop rows that fail to parse
        df['date'] = pd.to_datetime(
            df['date_str'],
            format='%m/%d/%y',
            errors='coerce'
        )
        df = df.dropna(subset=['date'])

        # 3. Save as partitioned Parquet
        output_dir = "covid_processed"
        df.to_parquet(
            output_dir,
            engine='pyarrow',
            compression='snappy',
            partition_cols=['date']
        )

        # 4. Upload to S3
        s3 = boto3.client('s3')
        total_files = 0

        for root, _, files in os.walk(output_dir):
            for file in files:
                local_path = os.path.join(root, file)
                s3_path = os.path.join(
                    'raw/covid/',
                    os.path.relpath(local_path, output_dir)
                )
                s3.upload_file(
                    Filename=local_path,
                    Bucket='my-dev-bucket',
                    Key=s3_path
                )
            total_files += len(files)

        print(f"Successfully processed and uploaded {total_files} Parquet files")
        print(f"Data covers from {df['date'].min()} to {df['date'].max()}")
        return True

    except Exception as e:
        print(f"Error: {str(e)}", file=sys.stderr)
        return False

if __name__ == "__main__":
    process_covid_data()

After running the Python code, you should be able to see the Parquet files in the S3 bucket, under the 'raw/covid/' folder.

Screenshot by author
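
If you prefer to verify the upload from code rather than the console, a short boto3 listing does the job. This is just a convenience sketch using the same bucket name as above; note that list_objects_v2 returns at most 1,000 keys per call:

import boto3

# List the uploaded Parquet files under the raw/covid/ prefix
s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='my-dev-bucket', Prefix='raw/covid/')
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'])
print("Objects found:", resp.get('KeyCount', 0))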

ETL Pipeline Development

AWS Glue is mainly used for ETL pipeline development. Although it can also be used for data ingestion even before the data has landed in S3, its strength lies in processing data once it is in S3 for data warehousing purposes. Here's the PySpark script for the data transformation:

# transform_covid.py
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import *

glueContext = GlueContext(SparkContext.getOrCreate())
df = glueContext.create_dynamic_frame.from_options(
    "s3",
    {"paths": ["s3://my-dev-bucket/raw/covid/"]},
    format="parquet"
).toDF()

# Add transformations here
df_transformed = df.withColumn("load_date", current_date())

# Write to processed zone
df_transformed.write.parquet(
    "s3://my-dev-bucket/processed/covid/",
    mode="overwrite"
)

Screenshot by author
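
Note that the production DAG shown later passes --input_path and --output_path as script arguments. If you take that route, a small variation of the script (a sketch using Glue's getResolvedOptions; the argument names are the ones the DAG passes) reads the paths instead of hardcoding them:

# transform_covid_prod.py (hypothetical name): same transformation, with S3 paths taken from job arguments
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import current_date

args = getResolvedOptions(sys.argv, ['input_path', 'output_path'])

glueContext = GlueContext(SparkContext.getOrCreate())
df = glueContext.create_dynamic_frame.from_options(
    "s3",
    {"paths": [args['input_path']]},
    format="parquet"
).toDF()

df.withColumn("load_date", current_date()).write.parquet(
    args['output_path'],
    mode="overwrite"
)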

The next step is to load the data into Redshift. In the Redshift console, click on "Query editor v2" on the left-hand side, where you can edit your SQL and run the Redshift COPY.

-- Create a table covid_data in the dev schema
CREATE TABLE dev.covid_data (
    "Province/State" VARCHAR(100),
    "Country/Region" VARCHAR(100),
    "Lat" FLOAT8,
    "Long" FLOAT8,
    date_str VARCHAR(100),
    confirmed_cases FLOAT8
)
DISTKEY("Country/Region")
SORTKEY(date_str);

-- COPY data into Redshift
COPY dev.covid_data (
    "Province/State",
    "Country/Region",
    "Lat",
    "Long",
    date_str,
    confirmed_cases
)
FROM 's3://my-dev-bucket/processed/covid/'
IAM_ROLE 'arn:aws:iam::your-account-id:role/RedshiftLoadRole'
REGION 'your-region'
FORMAT PARQUET;

You'll then see the data successfully loaded into the data warehouse.

Screenshot by author

Pipeline Automation

The easiest way to automate your data pipeline is to schedule jobs in the Redshift query editor v2 by creating a stored procedure (I have a more detailed introduction to SQL stored procedures; you can refer to this article).

CREATE OR REPLACE PROCEDURE dev.run_covid_etl()
AS $$
BEGIN
  TRUNCATE TABLE dev.covid_data;
  COPY dev.covid_data
  FROM 's3://my-dev-bucket/processed/covid/'
  IAM_ROLE 'arn:aws:iam::your-account-id:role/RedshiftLoadRole'
  REGION 'your-region'
  FORMAT PARQUET;
END;
$$ LANGUAGE plpgsql;

Screenshot by author
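
If you would rather trigger the procedure from a script (for example a lightweight cron job) than from the console scheduler, the Redshift Data API is another option. Here is a minimal sketch; the cluster identifier, database name and database user are placeholders for your own values:

import boto3

# Call the stored procedure through the Redshift Data API
client = boto3.client('redshift-data')
resp = client.execute_statement(
    ClusterIdentifier='my-dev-cluster',   # placeholder
    Database='dev',                       # placeholder
    DbUser='awsuser',                     # placeholder (or use SecretArn instead)
    Sql='CALL dev.run_covid_etl();'
)
print("Statement id:", resp['Id'])

# Check the status later with describe_statement
print("Status:", client.describe_statement(Id=resp['Id'])['Status'])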

Alternatively, you can run Airflow for scheduled jobs (this assumes a redshift_dev connection has been configured in Airflow).

from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator

default_args = {
    'owner': 'data_team',
    'depends_on_past': False,
    'start_date': datetime(2023, 1, 1),
    'retries': 2
}

with DAG(
    'redshift_etl_dev',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False
) as dag:

    run_etl = RedshiftSQLOperator(
        task_id='run_covid_etl',
        redshift_conn_id='redshift_dev',
        sql='CALL dev.run_covid_etl()',
    )

Production Workflow

An Airflow DAG is powerful for orchestrating your entire ETL pipeline when there are many dependencies, and it is also good practice in a production environment.

After creating and testing your ETL pipeline, you can automate your tasks in the production environment using Airflow.

Production workflow created by author

Here is a checklist of the key preparation steps to support a successful deployment in Airflow:

• Create the S3 bucket my-prod-bucket
• Create the Glue job prod_covid_transformation in the AWS console (or from code, as sketched below)
• Create the Redshift stored procedure prod.load_covid_data()
• Configure Airflow
• Configure SMTP for email notifications in airflow.cfg
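
Most of these steps can be clicked through in the console, but they can also be scripted. As an example, here is a hedged boto3 sketch for creating the Glue job; the IAM role ARN, script location and worker settings are assumptions you should replace with your own:

import boto3

# Register the production Glue job programmatically (values below are placeholders)
glue = boto3.client('glue')
glue.create_job(
    Name='prod_covid_transformation',
    Role='arn:aws:iam::your-account-id:role/GlueServiceRole',               # placeholder role
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-prod-bucket/scripts/transform_covid.py',  # placeholder path
        'PythonVersion': '3'
    },
    GlueVersion='4.0',
    NumberOfWorkers=2,
    WorkerType='G.1X'
)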

The deployment of the data pipeline in Airflow then looks like this:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator
from airflow.operators.email import EmailOperator

# 1. DAG CONFIGURATION
default_args = {
    'owner': 'data_team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'start_date': datetime(2023, 1, 1)
}

# 2. DATA INGESTION FUNCTION
def load_covid_data():
    import pandas as pd

    url = "https://github.com/CSSEGISandData/COVID-19/raw/master/archived_data/archived_time_series/time_series_19-covid-Confirmed_archived_0325.csv"
    df = pd.read_csv(url)

    df = df.melt(
        id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
        var_name='date_str',
        value_name='confirmed_cases'
    )
    df['date'] = pd.to_datetime(df['date_str'], format='%m/%d/%y')

    # Writing Parquet directly to S3 requires the s3fs package
    df.to_parquet(
        's3://my-prod-bucket/raw/covid/',
        engine='pyarrow',
        partition_cols=['date']
    )

# 3. DAG DEFINITION
with DAG(
    'covid_etl',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False
) as dag:

    # Task 1: Ingest data
    ingest = PythonOperator(
        task_id='ingest_data',
        python_callable=load_covid_data
    )

    # Task 2: Transform with Glue
    transform = GlueJobOperator(
        task_id='transform_data',
        job_name='prod_covid_transformation',
        script_args={
            '--input_path': 's3://my-prod-bucket/raw/covid/',
            '--output_path': 's3://my-prod-bucket/processed/covid/'
        }
    )

    # Task 3: Load to Redshift
    load = RedshiftSQLOperator(
        task_id='load_data',
        sql="CALL prod.load_covid_data()"
    )

    # Task 4: Notification
    notify = EmailOperator(
        task_id='send_email',
        to='your-email-address',
        subject='ETL Status: {{ ds }}',
        html_content='ETL job completed: <a href="{{ ti.log_url }}">View Logs</a>'
    )

    # Run the tasks in order
    ingest >> transform >> load >> notify

My Final Thoughts

Although some users, especially those who are new to the cloud and looking for simple solutions, tend to be daunted by AWS's high barrier to entry and overwhelmed by its huge selection of services, it is worth the time and effort, and here are the reasons:

• The process of configuration, and the designing, building and testing of the data pipelines, gives you a deep understanding of a typical data engineering workflow. These skills will benefit you even if you build your projects with other cloud services, such as Azure, GCP and Alibaba Cloud.
• The mature ecosystem that AWS has and the vast array of services it offers allow users to customise their data architecture strategies and enjoy more flexibility and scalability in their projects.

Thanks for reading! I hope this article is helpful for building your cloud-based data pipeline!


