Starting My Data Engineering Project
Date: October 06, 2025
Why I Started
After a year of working across data validation, automation, and analytics roles, I wanted to push beyond dashboards and reports into how data actually moves. I wanted to understand ingestion, transformation, and orchestration: the backbone of every reliable data system. This project is my first full attempt to build an end‑to‑end data engineering workflow using AWS and Databricks.
What I’m Learning
I’m learning how real data pipelines are structured and maintained. A few key lessons so far:
- Designing schemas for traceability and evolution
- Understanding partitioning and file formats like Parquet and Delta
- Managing permissions and access in AWS S3
- Writing PySpark transformations that scale
- Implementing Airflow for orchestration and job scheduling
- Keeping everything version‑controlled, documented, and reproducible
These skills are teaching me how to think like a systems engineer rather than a data consumer.
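To make the schema and partitioning points above concrete, here is a minimal PySpark sketch of the pattern I'm practicing. The bucket, paths, and column names are placeholders I invented for the example, not the project's actual layout, and the Delta format assumes a Databricks runtime or the delta-spark package.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("schema-partitioning-sketch").getOrCreate()

# Declaring the schema up front keeps ingestion traceable and makes schema
# evolution a deliberate change rather than a silent one.
schema = StructType([
    StructField("record_id", StringType(), nullable=False),
    StructField("category", StringType(), nullable=True),
    StructField("year", IntegerType(), nullable=True),
])

df = spark.read.csv("s3://example-bucket/raw/records.csv", header=True, schema=schema)

# Partitioning by a low-cardinality column lets downstream queries prune
# whole directories instead of scanning every file.
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("year")
   .save("s3://example-bucket/bronze/records"))
```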
The Data
I’m working with the Toronto Licensed Cats and Dogs dataset, an open dataset published by the City of Toronto. It includes pet license information such as animal type, breed, and neighborhood. It’s perfect for testing data ingestion patterns because it’s large enough to require partitioning and updates, but still human‑readable for validation.
My pipeline currently ingests this dataset from a local source into AWS S3, transforms it using PySpark in Databricks, and writes Delta tables into Bronze and Silver layers for structured analysis.
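Here is a rough sketch of what that Bronze-to-Silver step looks like in a Databricks notebook. The table paths and column names (LICENCE_NUMBER, ANIMAL_TYPE, BREED_NAME) are my working assumptions for illustration; the published dataset's schema may differ.

```python
from pyspark.sql import SparkSession, functions as F

# Databricks notebooks provide `spark`; getOrCreate() keeps the sketch
# self-contained when run elsewhere.
spark = SparkSession.builder.getOrCreate()

bronze = spark.read.format("delta").load("s3://example-bucket/bronze/licensed_pets")

silver = (
    bronze
    .dropDuplicates(["LICENCE_NUMBER"])                        # assumed unique key
    .withColumn("animal_type", F.initcap(F.trim(F.col("ANIMAL_TYPE"))))
    .withColumn("breed_name", F.initcap(F.trim(F.col("BREED_NAME"))))
    .filter(F.col("animal_type").isin("Cat", "Dog"))           # drop malformed rows
)

(silver.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save("s3://example-bucket/silver/licensed_pets"))
```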
What’s Next
Next steps include:
- Building the Gold layer to produce cleaned, enriched outputs ready for analytics tools
- Integrating Airflow orchestration to schedule data refreshes automatically (a rough sketch follows this list)
- Adding data quality checks and alerts for validation failures
- Experimenting with dbt for SQL‑based transformations and testing
- Creating a Power BI or Tableau dashboard connected to the Gold layer
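None of this is built yet, but here is a hedged sketch of what the Airflow piece might look like, using Airflow 2.x syntax. The DAG id, schedule, and task bodies are placeholders; the real tasks would trigger the Databricks jobs that refresh each layer rather than print messages.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def refresh_layer(layer: str) -> None:
    # Placeholder: in practice this would call the Databricks Jobs API (or the
    # Databricks provider's operators) to run the notebook for the given layer.
    print(f"Refreshing the {layer} layer")


with DAG(
    dag_id="licensed_pets_pipeline",   # hypothetical name
    start_date=datetime(2025, 10, 1),
    schedule="@daily",                 # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    bronze = PythonOperator(task_id="refresh_bronze", python_callable=refresh_layer, op_args=["bronze"])
    silver = PythonOperator(task_id="refresh_silver", python_callable=refresh_layer, op_args=["silver"])
    gold = PythonOperator(task_id="refresh_gold", python_callable=refresh_layer, op_args=["gold"])

    bronze >> silver >> gold
```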
How I Plan to Elevate My Abilities
I plan to formalize what I’m learning through certification and more complex builds. Specifically:
- Earning the Databricks Certified Data Engineer Associate credential
- Expanding into AWS Glue and Athena for serverless querying
- Experimenting with streaming data ingestion to learn real‑time concepts
- Contributing to open datasets or pipelines on GitHub for public visibility
Each step builds on the same foundation: reliability, scalability, and reproducibility. My goal isn’t just to finish this project, but to treat it as the baseline for everything that comes next.
Repository: GitHub – Licensed Pets Data Pipeline
Connect: LinkedIn – Sayeem Mahfuz