The Oscars are an interesting challenge when it comes to data predictions for 2 main reasons. The first is that you have a completely new slate of films each year. If you were predicting football results, a sensible starting point may be the team's place in the league table, or its current form. This approach won't work here.
And the second is that you only have a binary result. A film either wins or it doesn't. You have no idea how close the voting is. Whilst we may think that voting between La La Land and Moonlight was close - there is no data to prove this. For all we know, everyone voted for Moonlight in an absolute landslide. Maybe La La Land wasn't even second choice?
So, I started to think about this problem in a different way. What do we know about the people who vote on the awards? Whilst the make-up of the voters has been a well-discussed topic in recent years, the key is that the Academy is made up of branches relating to different disciplines (such as producing, writing, acting etc.), and these branches vote to determine the nominees in their discipline. A lot of these branches overlap with Guilds (such as the Screen Actors Guild), which hold their own awards ceremonies.
Another factor is "Oscar buzz". Insiders seem to roughly know which films will be nominated ahead of time, so there must be some signals that indicate which ones they will be. I suppose the best method here would be to be a Hollywood insider, but as you may have gathered, I am not one! The best substitute I have found is 'rival' awards shows such as the Golden Globes, which often seem to be in a rush to be the first ceremony to crown the eventual winner.
So other awards are the main data source I used in my model. And the logic behind the model is as follows - awards that historically nominate the eventual Oscar winner will be given more credence in the prediction than those awards that are more hit-and-miss. The model does this separately for each category - some awards are good predictors of the acting race, but horrible on the technical side, and vice-versa.
The technique I use for this model is called Linear Discriminant Analysis. This is a technique best used to allocate objects into different groups based on the objects' attributes. In this case the groups are Winners and Nominees, and the attributes are how each film fared in other awards.
The technique then looks to create 2 discriminant functions, one for each group, each built from a weighting factor for every attribute. To get this weighting factor, we look at both how frequently the attribute predicts the correct grouping and how it correlates with the other attributes. For example, award shows A and B may both predict the overall winner 50% of the time, but if they are always correct in the same years, we don't get any extra information from B. However, if B still predicts at 50%, but gets it correct when A is wrong - we suddenly have a very strong predictor.
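To make the A and B example concrete, here is a small sketch in Python. The track records are made up purely for illustration, and the `summarize` helper is my own naming. It is not the weighting calculation itself, just a way of showing why two awards with the same hit rate can carry very different amounts of combined information.

```python
import numpy as np

# Made-up track records over 8 years: 1 = the award picked the eventual Oscar winner.
a = np.array([1, 1, 1, 1, 0, 0, 0, 0])
b_redundant = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # right in exactly the years A is right
b_complement = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # right in exactly the years A is wrong


def summarize(first, second):
    """Compare two awards' track records (hypothetical helper)."""
    return {
        "first_hit_rate": float(first.mean()),
        "second_hit_rate": float(second.mean()),
        # Correlation between the two records: +1 means B repeats A, -1 means B fills A's gaps.
        "correlation": float(np.corrcoef(first, second)[0, 1]),
        # Share of years in which at least one of the two awards was right.
        "either_correct": float(np.maximum(first, second).mean()),
    }


for name, b in [("redundant B", b_redundant), ("complementary B", b_complement)]:
    print(name, summarize(a, b))
```

In both cases each award is right half the time, but with the redundant B the pair never covers more than half the years, whereas with the complementary B one of the two is right every year. Knowing which one to trust in a given year is a separate problem, but this is the kind of overlap the correlation part of the weighting is there to account for.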
After we have created these discriminant functions, we then apply them to each of this year's films. This gives each film 2 scores - a Winner score and a Nominee score. If the Winner score is higher than the Nominee score, then our model thinks we have a winner.
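As a rough sketch of this scoring step (not the exact code behind the model), the textbook form of Linear Discriminant Analysis can be written in a few lines of Python with NumPy. It assumes a covariance matrix shared across groups and uses the share of each group in the history as its prior. The function names, the two attributes (won award A, won award B) and the tiny history below are all invented for illustration.

```python
import numpy as np


def fit_discriminant_functions(X, y):
    """Return {group: (weights, constant)}, one linear discriminant function per group."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    groups = np.unique(y)
    n, p = X.shape
    means, priors = {}, {}
    scatter = np.zeros((p, p))
    for g in groups:
        rows = X[y == g]
        means[g] = rows.mean(axis=0)
        priors[g] = len(rows) / n
        centered = rows - means[g]
        scatter += centered.T @ centered
    pooled_cov = scatter / (n - len(groups))  # covariance shared by all groups
    inv_cov = np.linalg.pinv(pooled_cov)      # pseudo-inverse in case attributes are collinear
    functions = {}
    for g in groups:
        weights = inv_cov @ means[g]
        constant = -0.5 * means[g] @ weights + np.log(priors[g])
        functions[str(g)] = (weights, constant)
    return functions


def score_film(functions, attributes):
    """Apply every group's discriminant function to one film's attributes."""
    x = np.asarray(attributes, dtype=float)
    return {g: float(x @ w + c) for g, (w, c) in functions.items()}


# Invented history: columns are [won award A, won award B] for past Oscar nominees.
X_history = [[1, 1], [1, 0], [1, 1], [0, 1],           # past Oscar winners
             [0, 0], [0, 0], [1, 0], [0, 0],           # past nominees that lost
             [0, 1], [0, 0], [0, 0], [0, 0]]
y_history = ["Winner"] * 4 + ["Nominee"] * 8

functions = fit_discriminant_functions(X_history, y_history)
swept_both = score_film(functions, [1, 1])   # a film that won both precursor awards
won_neither = score_film(functions, [0, 0])  # a film that won neither
print(swept_both, won_neither)
```

With this toy history, the film that won both precursor awards gets a higher Winner score than Nominee score, and the film that won neither gets the reverse, which is the comparison described above.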
A drawback of Discriminant Analysis is that there is nothing to say it has to predict exactly one winner in each category. Most of the time this isn't a problem, but occasionally it predicts no winners, or indeed 2 winners. In these cases I compare the difference between each film's Winner and Nominee scores: with 2 predicted winners I take the film whose Winner score is furthest ahead, and with no predicted winner I take the film whose Winner score falls short by the least. This ISN'T statistically sound, as Discriminant Analysis is only meant to assign objects to groups, not to say which object is the most significant in each group. However, as this occurrence is rare, I've judged that a flawed but simple fix is a better use of effort than bringing in a new technique for rare outliers. But if you disagree, do let me know.
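One way to read this tie-break is as a single rule: whichever case we are in, pick the film with the largest (Winner score minus Nominee score). The short Python sketch below assumes that reading. The function name, film names and scores are placeholders, not real model output.

```python
def pick_single_winner(scores):
    """scores maps film -> (winner_score, nominee_score); return the film with the largest margin."""
    margins = {film: winner - nominee for film, (winner, nominee) in scores.items()}
    return max(margins, key=margins.get)


# Two films clear the bar (Winner score > Nominee score): take the one furthest ahead.
two_winners = {"Film A": (2.0, 1.0), "Film B": (3.5, 1.5), "Film C": (0.25, 1.0)}

# No film clears the bar: take the one that falls short by the least.
no_winners = {"Film A": (0.5, 1.0), "Film B": (0.0, 1.5), "Film C": (0.75, 1.0)}

print(pick_single_winner(two_winners), pick_single_winner(no_winners))
```

The same caveat applies to the code as to the method: the margin is being used as a ranking, which is more than Discriminant Analysis strictly promises.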
I've also looked to use more attributes than just other award shows. I've included various other data sources over the years, such as production cost, box office, total nominations, years since first/last nomination, run-time, critics' review scores etc. But over time I've dropped these as they proved either outdated (e.g. box office in times of Covid and streaming) or added little/no predictive power. Only the total number of nominations remains for some categories.