Creation of an automated pipeline that extracts and transforms movie data from Wikipedia, Kaggle, and MovieLens. The cleaned data is loaded into a PostgreSQL database.

Mishkanian/Movies-ETL

Movie Data - ETL in a Single Function

Project Overview

This project builds an automated pipeline that takes in new movie data, applies the appropriate transformations, and loads the results into existing tables. The existing code is refactored into a single function that performs the entire ETL (extract, transform, load) process.
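As a rough illustration of the single-function idea, the sketch below runs all three stages in one call. It is not the project's actual code: the function name `extract_transform_load`, the field names (`imdb_id`, `director`, `userId`, and so on), the table columns, and the use of an in-memory SQLite connection in place of PostgreSQL are all assumptions made so the example stays self-contained.

```python
import csv
import io
import json
import sqlite3


def extract_transform_load(wiki_json, kaggle_csv, ratings_csv, conn):
    """Run extract, transform, and load in one call (illustrative sketch).

    All field and column names here are hypothetical, and `conn` is a
    SQLite connection standing in for the PostgreSQL database.
    """
    # Extract: parse the three raw sources into lists of dicts.
    wiki_movies = json.loads(wiki_json)
    kaggle_movies = list(csv.DictReader(io.StringIO(kaggle_csv)))
    ratings = list(csv.DictReader(io.StringIO(ratings_csv)))

    # Transform: keep Wikipedia entries that carry an IMDb ID, then
    # join them to the Kaggle metadata on that ID.
    wiki_by_id = {m["imdb_id"]: m for m in wiki_movies if m.get("imdb_id")}
    movies = [
        (row["imdb_id"], int(row["id"]), row["title"],
         wiki_by_id[row["imdb_id"]].get("director"))
        for row in kaggle_movies
        if row["imdb_id"] in wiki_by_id
    ]
    # Convert the rating fields from strings to numeric types.
    clean_ratings = [
        (int(r["userId"]), int(r["movieId"]), float(r["rating"]))
        for r in ratings
    ]

    # Load: reuse the tables if they already exist and replace their rows,
    # so running the function twice does not duplicate data.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies "
        "(imdb_id TEXT, kaggle_id INTEGER, title TEXT, director TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ratings "
        "(user_id INTEGER, movie_id INTEGER, rating REAL)"
    )
    conn.execute("DELETE FROM movies")
    conn.execute("DELETE FROM ratings")
    conn.executemany("INSERT INTO movies VALUES (?, ?, ?, ?)", movies)
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", clean_ratings)
    conn.commit()
    return len(movies), len(clean_ratings)
```

Keeping the three stages in one function means a new batch of data can be processed with a single call, at the cost of a longer function body; the real pipeline may well split the transform step into helpers.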

Resources and Software

  • Resources:
    • Wikipedia Movie Data
    • Kaggle Movie Data: This dataset exceeds GitHub's maximum file size, so it must be downloaded directly from Kaggle.
      • Of the files in the zip archive downloaded from Kaggle, this project uses only "movies_metadata.csv" and "ratings.csv".
  • Python 3.7
  • pgAdmin 4.50
  • PostgreSQL v13

Results

The new ETL function performs correctly, and the data is successfully added to a PostgreSQL database, as shown in the images below. The final code can be viewed here.

movie

ratings

These outputs were created with the following queries:

SELECT COUNT(*) AS "Number of Movies row" FROM movies;
SELECT COUNT(*) AS "Number of Ratings row" FROM ratings;
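The same row-count check can also be scripted rather than run by hand. The sketch below is one possible way to do it, not part of the original project: the helper name `count_rows` is hypothetical, and an in-memory SQLite database stands in for PostgreSQL so the example runs without a server.

```python
import sqlite3

# Tables this sketch is allowed to count; taken from the queries above.
ALLOWED_TABLES = {"movies", "ratings"}


def count_rows(conn, table):
    """Return the number of rows in `table` (hypothetical helper)."""
    # Table names cannot be passed as bound parameters, so check the
    # name against an allow-list before placing it in the SQL text.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table!r}")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


if __name__ == "__main__":
    # Stand-in database with a few made-up rows, for demonstration only.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE movies (title TEXT)")
    demo.execute("CREATE TABLE ratings (rating REAL)")
    demo.executemany("INSERT INTO movies VALUES (?)", [("A",), ("B",)])
    demo.executemany("INSERT INTO ratings VALUES (?)", [(4.0,), (3.5,), (5.0,)])
    for name in sorted(ALLOWED_TABLES):
        print(name, count_rows(demo, name))
```

Against the real database, the connection would come from a PostgreSQL driver instead of `sqlite3`, but the counting query itself would stay the same.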

Author: Michael Mishkanian

For all questions and inquiries, please contact me on LinkedIn.
