What makes a successful film? Predicting a film’s revenue and user rating with machine learning

Ryan Anderson
Aug 6, 2019

The Movie DB (TMDB) provides an API for film data, and the data can be downloaded from here. I set out to find out whether, knowing only things I could know before a film was released, I could predict the rating and revenue of the film. What parameters best predict a good or top grossing film? Which cast or actors predict them?

Summary

I trained a model on a randomized 90% of the movies, and then tested it on the remaining 10%. For these test movies:

  • It was a simple challenge to get a very good prediction of film revenue. R² = 0.77. In layman's terms, knowing only facts about the film before release, the model can make a certifiably good prediction — enough for a cinema to decide ahead of time whether to show a film for an extended period of time, for instance.
  • It was much more difficult to predict film rating, but I could do a fair bit better than if I had just predicted an average rating for each movie, getting an R² of 0.53.
  • Do you even know Denny Caira? Film crew turned out to be the difference between a bad and a good film rating prediction, as well as the biggest difference between a well rated and a poorly rated movie, much more so than actors.
  • I had some fun, too. Scroll down for a list of actors most associated with high rated and top grossing films.

The data

The data is well labeled, but I won’t get into too many details. To summarise:

  • TMDB is community built, with data often provided by the public, so not everything is present or very accurate. For instance, over 900 revenue values were missing.
  • I ignored some non-useful variables, such as film title and homepage. Obviously these can’t be used to predict the success of a movie.
  • Some variables were discarded for other reasons: production_country, because I felt that the information therein would be stored in production_company. Original_language, because I felt that that column would mostly be covered by spoken_languages, with a few exceptions. Popularity, because obviously that was measured after the film was released.

The variables used for input were:

  • budget
  • a list of film genres
  • release date — split up into year and day of the year
  • a list of spoken languages
  • runtime
  • a list of production companies
  • a list of cast members
  • a list of crew members
  • keywords — a list of user assigned keywords. Admittedly some of these would only be known after the movie was released, but these did not give away too much. A typical keyword would be ‘based on a novel.’

The variables used for model prediction were:

  • User vote (akin to IMDb rating, referred to as ‘rating’ throughout)
  • User-reported box office revenue (referred to as ‘revenue’ throughout)

Data preparation

Source file: data_prep.py

Problem: revenue data is not good enough

  • I removed zero revenue rows, resulting in 900 rows lost. Not great, but I can’t predict revenue without revenue.
  • I adjusted revenue for inflation. Initially, I thought this wouldn’t make such a difference, but it actually improved R² by 0.02.
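The two cleaning steps above could be sketched roughly as follows. This is not the code from data_prep.py. The function names, the row layout and the `PRICE_INDEX` values are all assumptions made for illustration, and the index numbers are placeholders rather than real inflation data.

```python
# Hypothetical price index per release year (placeholder numbers, NOT real CPI data).
PRICE_INDEX = {1990: 50.0, 2000: 75.0, 2019: 100.0}
TARGET_YEAR = 2019


def adjust_for_inflation(revenue, year, index=PRICE_INDEX, target_year=TARGET_YEAR):
    """Express a revenue figure in target-year money."""
    return revenue * index[target_year] / index[year]


def clean_revenue(rows):
    """Drop rows with missing (zero) revenue, then inflation-adjust the rest."""
    cleaned = []
    for row in rows:
        if not row["revenue"]:  # zero or None: revenue was never reported
            continue
        adjusted = dict(row)
        adjusted["revenue_adjusted"] = adjust_for_inflation(row["revenue"], row["year"])
        cleaned.append(adjusted)
    return cleaned


# Made-up rows for illustration only.
films = [
    {"title": "A", "year": 1990, "revenue": 1_000_000},
    {"title": "B", "year": 2000, "revenue": 0},  # dropped: no revenue reported
    {"title": "C", "year": 2019, "revenue": 3_000_000},
]
cleaned = clean_revenue(films)
```

A real run would swap the placeholder index for a published consumer price index covering every release year in the data.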

Problem: How should I represent release date?

  • I decided to separate the variable into year and day of the year. Year, because revenue would definitely correlate with world population and societal patterns. Day of the year, as we know film revenue can correlate with a Christmas or summer release. This paid off, as day of the year turned out to be a top-30 variable for predicting revenue.
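A minimal sketch of that split, using only the standard library. It assumes the release date arrives as a 'YYYY-MM-DD' string, and `split_release_date` is a name made up for this example.

```python
from datetime import date


def split_release_date(iso_date):
    """Turn a 'YYYY-MM-DD' string into (year, day_of_year)."""
    parsed = date.fromisoformat(iso_date)
    return parsed.year, parsed.timetuple().tm_yday


# An arbitrary December date: late in the year, close to the Christmas window.
year, day_of_year = split_release_date("2009-12-10")
```

Keeping day of the year as a single number lets a tree-based model carve out seasonal windows (summer, Christmas) with simple thresholds.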

A much bigger problem: Many columns are JSON lists of ‘columns’

  • Some columns had lists stored inside: each of genre, keywords, production company, spoken languages, cast and crew was actually a list of genres, keywords etc. These can’t be processed by any machine learning libraries I know of.
  • I had to create a new library to transform these lists into columns for my model, a process known as encoding categorical features.
  • This created a new issue: there are far too many cast, crew and keywords for my poor computer to handle. I had to limit this per input column. This is not great for my model, as I now take only the most common 500 cast members (actors) as opposed to all actors, the top 100 keywords, and the top 100 film studios. This could be improved by hosting the solution on the cloud and throwing more power at the model training, or by being more patient.

Rows of JSON sets of actors, such as one row here…

[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, .... 

… would be transformed into a much more model-friendly set of columns with 1s and 0s.
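The transformation described above can be sketched as follows. This is not the library written for this project, just a minimal standard-library illustration. The helper names `top_names` and `multi_hot` are assumptions, and the sample rows are trimmed, made-up cells that only keep the "name" and "order" fields.

```python
import json
from collections import Counter


def top_names(json_cells, limit):
    """Return the `limit` most frequent 'name' values across all rows."""
    counts = Counter()
    for cell in json_cells:
        # A set, so that a name is counted at most once per film.
        counts.update({entry["name"] for entry in json.loads(cell)})
    return [name for name, _ in counts.most_common(limit)]


def multi_hot(json_cells, columns):
    """One row of 1s and 0s per film: 1 if the name appears in that film's list."""
    rows = []
    for cell in json_cells:
        present = {entry["name"] for entry in json.loads(cell)}
        rows.append([1 if name in present else 0 for name in columns])
    return rows


# Made-up, trimmed cast cells for three films.
cast_column = [
    '[{"name": "Sam Worthington", "order": 0}, {"name": "Zoe Saldana", "order": 1}]',
    '[{"name": "Zoe Saldana", "order": 0}]',
    '[{"name": "Sam Worthington", "order": 0}, {"name": "Zoe Saldana", "order": 2}]',
]
columns = top_names(cast_column, limit=2)   # keep only the most common names
encoded = multi_hot(cast_column, columns)   # one 0/1 column per kept name
```

The `limit` argument is where the per-column cap comes in: names outside the most common ones simply get no column.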

Testing my model’s success

I chose to go with a vanilla R-squared (R²) indication of success. This is the default option for data scientists tackling regression problems, and is simply a measure of how much better my model is than just predicting the mean rating or revenue for each film. R² is:

  • negative if the model performs worse than just picking the mean. The more negative, the worse.
  • zero if the model just picks the mean movie rating or revenue for each movie.
  • above zero if it performs better than the mean, with “1” being the perfect model.
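The three cases above follow directly from the usual definition of R²: one minus the model's squared error divided by the squared error of always predicting the mean. A small sketch with made-up ratings (the function name `r_squared` is chosen for this example):

```python
def r_squared(actual, predicted):
    """R² = 1 - (squared error of the model) / (squared error of predicting the mean)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


ratings = [5.0, 6.0, 7.0, 8.0]  # made-up ratings for illustration
mean_rating = sum(ratings) / len(ratings)

perfect = r_squared(ratings, ratings)                        # model is exactly right
baseline = r_squared(ratings, [mean_rating] * len(ratings))  # always predicts the mean
worse = r_squared(ratings, [8.0, 7.0, 6.0, 5.0])             # worse than the mean
```

Here `perfect` comes out as 1, `baseline` as 0 and `worse` as a negative number, matching the three bullets.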

Naturally, at some point you choose to stop when your solution is good enough. What counts as good enough differs per problem you’re solving, but generally:

  • 0.6–0.9 indicates a good model
  • Anything above 0.9 is too good to be true, pointing to some unfair input variables or overfitting. For instance, I accidentally included row count in my first model run, and since the data was sorted by revenue, my model almost perfectly predicted revenue.
  • 0–0.6 means you’re at least picking something up, but the model may not be good enough to use for important business decisions, for instance.

Predicting film rating

Source file: film_rating_with_cast_best_regressor.py

For my model selection, I ran the data through a hyperparameter grid search using the XGBoost regressor library. I tried several other libraries in the grid search, including random forest regressors and a terribly performing neural network. The grid search greatly improved on the performance of the vanilla XGBoost regressor, a library which comes highly recommended for speed and accuracy.
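For readers unfamiliar with the idea, a grid search simply tries every combination of hyperparameter values and keeps the one that scores best on data held back from fitting. The sketch below shows that loop together with a 90/10 split, on synthetic data. To stay dependency-free it uses a one-variable ridge regression as a stand-in model rather than XGBoost, so `fit_linear`, the parameter names and the grid values are all assumptions for illustration, not the settings used in this project.

```python
import itertools
import random


def r_squared(actual, predicted):
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def fit_linear(points, penalty, fit_intercept):
    """Stand-in model: one-variable ridge regression (not XGBoost)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx = sum(xs) / len(xs) if fit_intercept else 0.0
    my = sum(ys) / len(ys) if fit_intercept else 0.0
    weight = sum((x - mx) * (y - my) for x, y in points) / (
        sum((x - mx) ** 2 for x in xs) + penalty
    )
    return lambda x: my + weight * (x - mx)


def score(model, points):
    return r_squared([y for _, y in points], [model(x) for x, _ in points])


def grid_search(fit_part, validation, grid):
    """Try every parameter combination; keep the best validation R²."""
    best_score, best_params = float("-inf"), None
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        candidate = score(fit_linear(fit_part, **params), validation)
        if candidate > best_score:
            best_score, best_params = candidate, params
    return best_params, best_score


rng = random.Random(0)
data = [(float(x), 3.0 * x + rng.gauss(0, 1)) for x in range(1, 101)]  # synthetic
rng.shuffle(data)
train, test = data[:90], data[90:]              # the 90/10 split
fit_part, validation = train[:80], train[80:]   # tune on part of the training data

grid = {"penalty": [0.0, 10.0, 100000.0], "fit_intercept": [True, False]}
best_params, validation_score = grid_search(fit_part, validation, grid)
final_model = fit_linear(train, **best_params)  # refit on all training data
test_score = score(final_model, test)           # report on untouched test data
```

The test rows are only touched once, at the very end, so the reported score is not flattered by the tuning itself. Library implementations typically replace the single validation slice with cross-validation.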

Naturally, accurately predicting a movie rating from purely movie metadata is a bit of a pipe dream. There are a lot of variables that one won’t see in the metadata, such as the quality of script, or whether the role brought out the best in Johnny Depp.

That said, the best result I got was R² of 0.53. By machine learning standards, this is OK, but nothing to write home about. 53% of the variance beyond the average rating was explained by the model. In other words, it was missing out on a lot, but still clearly predicting most movies that were better or worse than average.

Interestingly, this figure shows an intuitive quick win. The model just has to be ‘tilted’ in order to provide better predictions.

But instead, I decided to analyse what features were most involved in the model’s success:

What are the variables most associated with film rating?

An output from the XGBoost library provides the importance of features (input variables) it uses for prediction. This must be taken with a slight pinch of salt, given that the model itself has not made perfect predictions. However, the output provides a very clear story:

Here, we can see only around 200 of the input variables held any importance at all. The rest were essentially discarded by the algorithm. That’s OK! In future, with a better computer, I would simply pick more input variables (crew and cast) and crunch them through an analysis (LDA?) in advance to pick out those with no correlation to film rating.

In text form, the variables most associated with film rating were as follows. Disclaimer: These can just as well be NEGATIVELY affecting the rating, as you might pick up from a couple of the names (horror, teenager). The algorithm just returns the ones with the largest effect on its predictor.

('Drama', 0.02771771)
('Film runtime', 0.017870583) - !
('Horror', 0.015099976)
('Animation', 0.010213515)
('John Lasseter', 0.0099559575) - of Pixar fame
('Family', 0.009091541)
('Comedy', 0.009024642)
('Harvey Weinstein', 0.009003568)
('Whoopi Goldberg', 0.008995796) - ?!
('Bill Murray', 0.008862046)
('Action', 0.008832617)
('Documentary', 0.008824027)
('Morgan Creek Productions', 0.008456202)
('Franchise Pictures', 0.008374982)
('Hans Zimmer', 0.008047262)
('DreamWorks Animation', 0.007945064)
('Hospital', 0.007892966)
('Janet Hirshenson', 0.007849025)
('Jason Friedberg', 0.007827318)
('en', 0.0077783377) - English movies
('Teenager', 0.0077319876)
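A list like the one above can be produced by pairing each input column name with its importance score and sorting. The sketch assumes the scores arrive as a plain list aligned with the column names, which is how the `feature_importances_` attribute of XGBoost's scikit-learn style regressor is laid out. The helper name `top_features` is made up for this example.

```python
def top_features(names, importances, k):
    """Pair each column name with its importance and return the k largest."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Three values copied from the list above, deliberately out of order.
names = ["Horror", "Drama", "Film runtime"]
importances = [0.015099976, 0.02771771, 0.017870583]

top_two = top_features(names, importances, k=2)
```

The scores only rank how much the model leaned on each column. They carry no sign, which is why the disclaimer about negative effects applies.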

Predicting film revenue — an easier task

Source file: film_revenue_with_cast_best_regressor.py

As one might expect, this would be an easier task, given obvious factors such as:

  • A film’s budget is probably a good indicator of whether it was targeted as a box-office hit
  • Many high revenue films are superhero movies

Lo and behold, the same method as earlier returned an R² of 0.77 for this prediction. In other words, one can build a very good prediction of a film’s revenue based purely on inputs known before the film goes public. This has real world consequences: for instance, a cinema could use this to predict how long they’d like to run a film for, ahead of time.

This looks a lot better than my rating predictions. Sure, there are outliers, but they are fairly evenly spaced above and below the prediction line.

What are the variables most associated with film revenue?

This list will be less of a surprise. Again, though, the same disclaimer applies. Variables can be negatively affecting revenue, and this model is not perfect. The list confirms the strong connection between budget and revenue. After all, why would one make films without getting a return on the investment?

Unsurprisingly, superhero movies and Pixar movies make a strong appearance here, with their keywords, studios, genres and crew dominating the list. Surprisingly, one production manager, Denny Caira, is a bigger predictor than budget. This man clearly has made a name for himself in the industry!

('Denny Caira', 0.037734445)
('Film Budget', 0.03122561)
('Adventure', 0.025690554)
('James Cameron', 0.024247296)
('Pixar Animation Studios', 0.022682142)
('David B. Nowell', 0.022539908)
('marvel comic', 0.022318095)
('Terry Claborn', 0.01921264)
('John Williams', 0.015954955)
('3d', 0.014985539)
('Animation', 0.013459805)
('John Ratzenberger', 0.013009616)
('Christopher Boyes', 0.012793044)
('Fantasy', 0.012175937)
('Gwendolyn Yates Whittle', 0.011877648)
('Lucasfilm', 0.011471849)
('Christopher Lee', 0.011401703)
('superhero', 0.010956859)
('Jim Cummings', 0.010577998)
('John Lasseter', 0.010427481)
('Drama', 0.010378849)

Bonus: which actors are most associated with…

Source file: film_actors_to_ratings.py, variables modified for revenue.

Note: for this, I ran the algorithm through more actors than before, and did not include other variables such as crew or budget. This is purely about the correlation between an actor and film success. That’s why the names do not exactly correspond with previous lists.

Which actors are most associated with film rating?

('Robert Duvall', 0.011352766) - of Godfather fame
('Morgan Freeman', 0.010981469)
('Scarlett Johansson', 0.010919917)
('Paul Giamatti', 0.0108840475)
('Helena Bonham Carter', 0.010548236)
('Jim Broadbent', 0.010294276)
('Harrison Ford', 0.010112257)
('Leonardo DiCaprio', 0.010015999)
('Mark Ruffalo', 0.009964598)
('Matthew Lillard', 0.00989507)
('Ian Holm', 0.009870403)
('Timothy Spall', 0.009850885)
('Philip Seymour Hoffman', 0.009718503)
('Rachel McAdams', 0.00953982)
('Emily Watson', 0.009512347)
('Alan Rickman', 0.009455477)
('Keira Knightley', 0.009296855)
('Eddie Marsan', 0.009277014)
('Stan Lee', 0.0092619965)
('Emma Thompson', 0.009148427)
('Edward Norton', 0.00904271)

Which actors are most associated with film revenue?

Obviously, Stan Lee does not make a movie producer rich. He simply cameoed in all of the Marvel movies. This list shows correlation of actor with top-grossing (generally superhero) movies more than it shows who causes a movie to do well.

('Stan Lee', 0.04299625)
('Hugo Weaving', 0.030377517) - "You hear that, Mr. Anderson? That is the sound of inevitability. That is the sound of profit"
('John Ratzenberger', 0.024940673) - In every Pixar film
('Frank Welker', 0.018594962)
('Alan Rickman', 0.01844035)
('Gary Oldman', 0.018401919)
('Geoffrey Rush', 0.018003061)
('Christopher Lee', 0.017147299)
('Robbie Coltrane', 0.015522939)
('Ian McKellen', 0.015420574)
('Timothy Spall', 0.0151223475)
('Zoe Saldana', 0.014832611)
('Stellan Skarsgård', 0.014798376)
('Maggie Smith', 0.014290353)
('Will Smith', 0.01418642)
('Tom Cruise', 0.013842676)
('Jeremy Renner', 0.013476725)
('Alan Tudyk', 0.013410641)
('Judi Dench', 0.01316438)
('Leonardo DiCaprio', 0.01244637)
('Liam Neeson', 0.012093888)

On including crew

Initially, when building the models, I did not include crew members in the analysis. This was a massive oversight. Including just the top 200 producers, writers and directors in the model improved the R² on revenue prediction from 0.68 to 0.77.

More impressively, this improved rating prediction from an R² of 0.19 to 0.53, an astounding improvement just by adding one type of variable. More than 30% of the variance in rating is explained by crew members.

Room for improvement

My method was not perfect. I discarded a fair amount of useful data, and took shortcuts. If I were to go for the best possible solution, especially to try to improve my rating prediction, I would:

  • Include ALL data on actors, keywords and genres. This would need a lot more processing power, but would likely help my model pick out many of the outliers, especially those enigmatic actors and directors who aren’t just revenue churning machines.
  • Retune the model hyperparameters for rating prediction. My XGBoost grid search took almost a day to run so, being lazy, I ran it only on revenue. Retuning for rating would slightly improve that prediction.
  • Include zero revenue films in my rating prediction. I was too lazy to change the data preparation phase per prediction.
  • Do a PCA or LDA to eliminate any obviously uncorrelated variables.
  • Run my XGBoost model through a more extreme parameter grid, picking even better parameters.
  • Explore a neural network solution, given the sheer size of this problem.
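As a rough illustration of eliminating obviously uncorrelated variables, the sketch below uses a plain per-column correlation filter. This is a simpler technique swapped in for the example, not PCA or LDA. The column names, ratings and threshold are made up, and a filter like this only catches linear, single-variable relationships.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def keep_correlated(columns, target, threshold):
    """Keep only columns whose |correlation| with the target reaches the threshold."""
    return {
        name: values
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    }


ratings = [8.0, 7.5, 4.0, 3.5]      # made-up ratings
columns = {
    "crew_a": [1, 1, 0, 0],         # lines up with the high ratings
    "keyword_b": [1, 0, 1, 0],      # barely related to rating
    "actor_c": [0, 0, 0, 0],        # never appears: constant column
}
kept = keep_correlated(columns, ratings, threshold=0.3)
```

Only `crew_a` survives here. Dropping columns this way before training would free up room to start from a larger pool of cast and crew.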

Conclusion

Overall, I am pleasantly surprised with the performance of my model. I did not expect to get such good predictions, especially that of film rating. I learnt a bit about the film industry in the process, especially how much crew matter in making a good film.

Links, code and tools

My site: https://rian-van-den-ander.github.io/

Code: GitHub

Tools: Python, xgboost

Ryan Anderson

Professional Data Scientist | MBA. I offer consulting: My passion is to use DS/ML/AI/IA to help us be better humans. ryanandersonds.com / Bluesky: ryands