Snowflake Data Pipeline

Flowchart of the Snowflake data pipeline architecture
Power BI dashboard
Snowflake raw table
Power Query Star Schema
Power BI dashboard

Project Information

  • Category: Data Engineering & Business Intelligence
  • Platform: Snowflake, Power BI
  • Project Type: Academic / Professional Portfolio
  • Scope: End-to-end data pipeline: Kaggle → Snowflake → Power BI

End-to-End Snowflake Pipeline

Overview:

This project implements a complete data pipeline ingesting public datasets from Kaggle, storing them in Snowflake raw tables, transforming into a Star Schema via Snowflake views and Power Query, and generating dashboards in Power BI for analytical insights.

Objectives:

  • Design a scalable ETL pipeline with Python and Snowflake
  • Create a dimensional Star Schema for reporting
  • Build interactive dashboards in Power BI
  • Ensure data accuracy and consistency through validation checks
  • Provide analytics-ready datasets for business decisions

Architecture:

  • Extraction: Kaggle datasets downloaded via Python scripts
  • Loading: Raw tables in Snowflake (ANALYTICS_REPO schema)
  • Transformation: Snowflake views create dimensions and fact tables (Star Schema)
  • Visualization: Power BI dashboards consuming the Star Schema

Fact & Dimension Tables:

  • Fact_Sales
  • Dim_Date
  • Dim_Customer
  • Dim_Product
  • Dim_AgeGroup

Technology Stack:

  • Kaggle, Python, Snowflake, Power Query, Power BI
  • Pandas, json, kagglehub
  • Data validation and version control (Git/GitHub)

Key Learnings:

  • End-to-end data engineering workflow
  • Dimensional modelling for BI
  • Data validation and accuracy checks
  • Building professional dashboards for decision-making

Comparison of Transformation Approaches:

Approach Advantages Disadvantages
Snowflake Views - No storage duplication, lightweight for static datasets
- Easy to maintain and adjust logic without reloading data
- Ideal for datasets without incremental updates
- Slightly slower for very large queries
- Not suitable for frequently updated transactional data
- Limited optimization for complex joins or aggregations
Snowflake Tables - Persistent storage, optimized for large-scale reporting
- Supports incremental loading and indexes
- Best for frequently updated or transactional datasets
- Requires storage space and management
- Updates require ETL jobs or scripts
- Less flexible for rapid schema changes
Power Query (Desktop / ETL) - Very user-friendly, visual interface
- Quick prototyping and column type adjustments
- Good for small/medium static datasets
- Limited scalability for large datasets
- Local processing may be slower
- Transformations tied to Excel/Power BI environment

Note: In this project, a Snowflake view was chosen instead of a table because the dataset is static and does not require incremental updates. This allows for a lighter, easier-to-maintain solution without unnecessary storage duplication, while still enabling Star Schema transformation for Power BI dashboards.

Full source code available on https://github.com/fmulato/SNOWFLAKE_PROJECT