
Movie-Recommendation-Engines

Writing recommendation systems for movies, and performing data analysis on movie datasets to gain valuable insights.

The Cold Start Problem

The cold start problem refers to the situation a recommendation system faces when it does not have enough information to make a decent recommendation, either because there are too few users, because the items are new, or because the platform has only just launched.

Thus, the system cannot draw inferences about users or items for which it has not yet gathered sufficient information.
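One common way to cope with a cold user is to fall back to non-personalised, popularity-based suggestions until enough interactions have been gathered. The following is a minimal sketch of that idea, not the code used in this project: the threshold `MIN_INTERACTIONS`, the function names, and the `(user, item)` tuple layout are all assumptions made for illustration.

```python
from collections import Counter

MIN_INTERACTIONS = 5  # hypothetical threshold, not taken from the project


def popular_items(interactions, k):
    """Return the ids of the k items with the most interactions."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(k)]


def is_cold_user(user, interactions, min_interactions=MIN_INTERACTIONS):
    """A user counts as 'cold' when too few of their interactions are known."""
    seen = sum(1 for u, _ in interactions if u == user)
    return seen < min_interactions


def recommend(user, interactions, personalised, k=3,
              min_interactions=MIN_INTERACTIONS):
    """Use the personalised model when possible, popularity otherwise.

    `personalised` is any callable (user, k) -> list of item ids; it stands
    in for whatever trained recommender is available.
    """
    if is_cold_user(user, interactions, min_interactions):
        return popular_items(interactions, k)
    return personalised(user, k)
```

A brand new user has no interactions at all, so `recommend` returns the most popular items for them; once the user crosses the threshold, the personalised callable takes over.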

Types of problems

New Community

New Item

New User

Movie Recommendation

The movie recommendation use case suffers from popularity bias. That is, a minority of the movies in a dataset have been viewed, rated, or favorited by a lot of people, while most movies have not been viewed, rated, or favorited by enough people to be recommended.

Popularity Bias

A handful of items receive a large number of interactions, while most of the items receive only a fraction of them. FEW INTERACTIONS => POOR RECOMMENDATIONS.
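A quick way to see how strong this bias is in a given dataset is to measure what share of all interactions lands on the most popular slice of the catalogue. The sketch below assumes the interactions are available as a flat list with one item id per rating or view; the function name `head_share` and the 10% default are illustrative choices, not part of the project.

```python
from collections import Counter


def head_share(item_ids, head_fraction=0.1):
    """Fraction of all interactions that fall on the most popular items.

    item_ids: one entry per interaction (for example, one per rating).
    head_fraction: share of the catalogue treated as the popular 'head'.
    """
    counts = sorted(Counter(item_ids).values(), reverse=True)
    if not counts:
        return 0.0
    # Always include at least one item in the head.
    head_size = max(1, int(len(counts) * head_fraction))
    return sum(counts[:head_size]) / sum(counts)
```

A value close to 1.0 means that almost all interactions belong to a small head of popular movies, which leaves very little signal for the rest of the catalogue.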

Zipf’s Law

Zipf’s law (/zɪf/, not /tsɪpf/ as in German) is an empirical law formulated using mathematical statistics that refers to the fact that for many types of data studied in the physical and social sciences, the rank-frequency distribution is an inverse relation. The Zipfian distribution is one of a family of related discrete power law probability distributions. It is related to the zeta distribution, but is not identical.

The same relationship occurs in many other rankings of human created systems, such as the ranks of mathematical expressions or ranks of notes in music, and even in uncontrolled environments, such as the population ranks of cities in various countries, corporation sizes, income rankings, ranks of number of people watching the same TV channel, and so on. The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913. Empirically, a data set can be tested to see whether Zipf’s law applies by checking the goodness of fit of an empirical distribution to the hypothesized power law distribution with a Kolmogorov–Smirnov test.
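For interaction data, a rough first check is to fit a straight line to log(frequency) against log(rank): under Zipf's law the frequency of the item at rank r is proportional to 1/r^s, so the slope of that line estimates -s. The sketch below does this with an ordinary least-squares fit in plain Python. It is only an informal slope check and does not replace the Kolmogorov–Smirnov goodness-of-fit test mentioned above; the function name `zipf_slope` is made up for illustration.

```python
import math
from collections import Counter


def zipf_slope(item_ids):
    """Least-squares slope of log(frequency) against log(rank).

    item_ids: one entry per interaction. A slope near -1 is consistent
    with classic Zipf behaviour.
    """
    freqs = sorted(Counter(item_ids).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct items to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

Applied to the number of ratings per movie, a slope well below zero is another way of seeing the popularity bias described earlier.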

Information Retrieval (Scraping)

Example results from bs4 processing:

function lockScroll() {
    var lockX = window.scrollX;
    var lockY = window.scrollY;

    function lockIt() {
        window.scrollTo(lockX, lockY);
        return false;
    }

    window.addEventListener("scroll", lockIt, false);
    return {
        stop: function () {
            window.removeEventListener("scroll", lockIt, false);
        }
    }
}
window.addEventListener("load", function () {
    $('#ResultsScrollable').bind("scroll", function () {
        if ($(this).scrollTop() + $(this).innerHeight() >= $(this)[0].scrollHeight) {
            var locker = lockScroll();
            var loadBtn = document.getElementById('loadMoreJobs');
            if (loadBtn) loadBtn.click();
            locker.stop();
        }
    });
});

Problem Statement

This project is divided into two parts:

Approach - Workflow

The problem was divided into several steps:

  1. Data Collection: Data was collected from the MovieLens website and through a script that queried for data from various TMDB Endpoints.
  2. Data Wrangling: The datasets were loaded into a dataframe and explored. Null values were filled in wherever appropriate and polluted values were discarded or wrangled.
  3. EDA: Extensive data visualisation and summary statistics were used to extract insights and patterns from the various datasets. The history, facts and trivia behind movies were narrated through data.
  4. Machine Learning: A Gradient Boosting Classifier and a Gradient Boosting Regressor were trained on our feature-engineered dataset to predict movie success and revenue respectively. Their feature importances were noted to gain insights into which factors influence the revenue of a movie relative to its budget.
  5. Recommendation Systems: Four different recommendation systems were built using various ideas and algorithms such as IMDB’s Weighted Rating, Content Based Filtering and Collaborative Filtering.
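As an illustration of the first idea in step 5, the commonly cited form of IMDB's weighted rating is WR = (v / (v + m)) * R + (m / (v + m)) * C, where v is the number of votes for a movie, R its average rating, m a minimum-votes cutoff, and C the mean rating across all movies. The sketch below is an assumption about how such a score could be computed, not the notebook's actual code: the value of `min_votes`, the tuple layout, and the use of an unweighted mean for C are all illustrative.

```python
def weighted_rating(votes, avg_rating, min_votes, global_mean):
    """IMDB-style weighted rating: shrinks low-vote movies toward the mean."""
    v, R, m, C = votes, avg_rating, min_votes, global_mean
    return (v / (v + m)) * R + (m / (v + m)) * C


def rank_movies(movies, min_votes):
    """Sort (title, votes, avg_rating) tuples by weighted rating, best first.

    The global mean C is taken as the plain average of the per-movie
    ratings, which is an assumption made to keep the sketch short.
    """
    global_mean = sum(avg for _, _, avg in movies) / len(movies)
    return sorted(
        movies,
        key=lambda mv: weighted_rating(mv[1], mv[2], min_votes, global_mean),
        reverse=True,
    )
```

With this score, a movie rated 10.0 by only two people no longer outranks a movie rated 8.0 by a thousand people, because the low-vote movie is pulled strongly toward the global mean.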

Final Results

A Gradient Boosting Regressor and a Gradient Boosting Classifier were built to predict film revenue and success, with scores of 0.84 and 0.88 respectively.

In addition, four recommendation engines were built based on different ideas and algorithms:

Repository Structure

  1. movies_eda.ipynb: The Jupyter notebook that contains the EDA and narrates the Story of Film.
  2. movies_recommender.ipynb: The Jupyter notebook containing the code for the recommendation engines.
  3. scrapers: The folder containing all the scrapers used to gather data from TMDB.

Extension - Currently working on: