Moodiest fanbases in the NBA

I won’t get too technical in this post, but all the code can be found in this repo. If you have any questions, feel free to hit me up on Twitter!

Three weeks ago, before the beer virus made us all stay indoors, I was sitting at a party in Berlin with one of my best childhood friends. For some reason (which was me boasting) we started talking about my current side project: Deep Learning.

He doesn’t have any background in statistics or machine learning, and it’s always a cool exercise to explain projects without any technical vocabulary. At least that’s what recruiters like to hear.

So here I was, rambling on about how “my model” would see a bunch of tweets and make connections that would then help it understand the sentiment of a specific tweet that you give to it. Claiming we could do something that would let you know when certain NBA fanbases were very pessimistic about their team, while others were flying high.

At the time though, I was nowhere near having that model. So now, with some extra free time on my hands, I thought I’d try to build a classifier for tweets surrounding NBA teams using the fastai library.

Walk it, walk it like I talk it – Migos

The Goal

Without ever having to read through any tweets of any team at any time, knowing how a team’s fanbase felt at any given point in the last year.

To achieve this, we need a few key ingredients. 

First: we need tweets, and we need a model that classifies them.
Say we have this example tweet:

Our model should probably say that it’s quite negative.

Now, in the past, NLP models used to be quite bad at understanding context. They might learn that certain words are bad, but wouldn’t understand that context can make them mean the opposite.

Transfer learning has been a real game changer in NLP, but also in other areas like image recognition.
The idea of transfer learning is that we take another model, which was trained for a different (but related) purpose, and use it as the basis for our classifier.

Building a language model

We can imagine that for a sentiment classifier it’s vitally important to understand context as well as the general concept of a language. Therefore a great basis is to first create a language model that we can then build from.
A language model is mostly used for predicting the next word in any given sentence and in order to do that well, it will have to learn quite a bit about the language itself as well as the context of the given sentence.
It’s what you see on your phone, when you start a sentence and it automatically suggests how you might want to continue. 

Behind those sometimes more, sometimes less helpful suggestions for the next word is a language model.
Since there is a variety of applications for them, there are already pretty good pre-trained language models out there.
One of them can be used through the fast.ai library and is based on the WikiText103 dataset, which is a huge subset of articles from Wikipedia. Since the language for explaining quantum physics on Wikipedia and arguing about the NBA draft on Twitter differs a bit, we should not use that language model out of the box. Instead we retrain the later layers of that network with about 250,000 tweets.
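In fastai, that retraining step can be sketched roughly like this. This is a sketch under assumptions, not the repo’s exact code: the file name tweets.csv, the column name, and the hyperparameters are all illustrative, and it needs the fastai library plus the tweet data to actually run.

```python
import pandas as pd
from fastai.text.all import TextDataLoaders, language_model_learner, AWD_LSTM

# ~250,000 unlabelled NBA tweets (assumed file and column name)
df = pd.read_csv('tweets.csv')

# Dataloaders for language modelling: the target is simply the next token
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True, valid_pct=0.1)

# AWD_LSTM comes pre-trained on WikiText-103; we fine-tune it on tweets
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 2e-2)    # first train only the newly added head
learn.unfreeze()
learn.fit_one_cycle(3, 2e-3)    # then fine-tune the later layers

learn.save_encoder('tweet_enc') # keep the encoder for the classifier later
```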

The result is a model that should be able to predict the next word in a sentence, based on what it’s seen before. To check how well our model is doing, we can simply give it some words and see how it strings together a sentence, similarly to the gif above. In this case I use “is the greatest” to check whether the language model picked up on the recurring discussion about who the greatest basketball player of all time is.

Nice! That seems to work. It’s especially good to see how the model automatically uses the correct pronoun. And it’s also either generating tweets relating to the NBA or general twitter babble.

Having validated our language model, we can take its encoder, which is the part that encapsulates the understanding of the language (as a mathematical representation), and throw away the part that predicts the next word – the decoder – since we won’t need it in our classifier.

Training the classifier

Bear with me, we’re almost there.
In order to train a classifier, we now pass the encoder as well as the vocabulary from our language model into a new learner and feed it a dataset of automatically labelled tweets. These come from the sentiment140 dataset, which holds over 1.6 million tweets that were labelled according to the emojis they contain.
This might cause some problems where emojis are used ironically, but for a first experiment it should be good enough.
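With fastai, this step looks roughly like the sketch below. The names are assumptions: sentiment140_subset.csv with text and label columns, dls_lm holding the language model’s dataloaders, and an encoder saved as 'tweet_enc'; the training schedule is illustrative, not the repo’s.

```python
import pandas as pd
from fastai.text.all import TextDataLoaders, text_classifier_learner, AWD_LSTM

df_labelled = pd.read_csv('sentiment140_subset.csv')  # assumed file name

dls_clas = TextDataLoaders.from_df(
    df_labelled, text_col='text', label_col='label',
    text_vocab=dls_lm.vocab,      # reuse the language model's vocabulary
)
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('tweet_enc')   # the language understanding we trained
learn.fit_one_cycle(1, 2e-2)
```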

After a bit of training (you can follow the training process more closely on GitHub) we’re getting an accuracy of about 83%, which seems good enough to get a general overview of a team’s sentiment.
Whenever we give our model a tweet now, it will come back with a probability of the tweet being either negative or positive.
The tweet This is amazing for example would return the numbers 0.009, 0.991.
Our model is therefore quite certain, with a probability of 99.1%, that this tweet is positive, while it only places a probability of 0.9% on the tweet being negative.
Pretty encouraging.
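Turning that pair of probabilities into a verdict is simple thresholding. A toy helper (not the fastai API, just the logic) could look like this:

```python
# Map the model's (negative, positive) probability pair to a label plus
# the model's confidence in that label. Toy helper for illustration.
def verdict(p_neg, p_pos, threshold=0.5):
    """Return the predicted label and the probability assigned to it."""
    label = "positive" if p_pos >= threshold else "negative"
    confidence = max(p_neg, p_pos)
    return label, confidence

print(verdict(0.009, 0.991))  # ('positive', 0.991)
```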

Happy Birthday and RIP: Words that make our model certain

It’s now time to finally look at the real thing.
Since our NBA tweets are unlabelled, we don’t really know whether the accuracy from our training set holds up for the tweets we are predicting on.
There are two options now: either I label a few thousand tweets myself to get a stable estimate of the performance, or I do the obvious lazy thing:
look at the tail ends of our predictions and check whether they seem sensible.
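Inspecting the tail ends is just sorting by the predicted probability and slicing both ends. A minimal sketch with made-up predictions standing in for the real ones:

```python
# Sort tweets by their predicted probability of being negative and
# inspect both extremes. The (tweet, p_negative) pairs are toy data.
preds = [
    ("RIP Houston Rockets", 0.998),
    ("Happy Birthday my fellow", 0.004),
    ("Sixers have a great nucleus.", 0.480),
    ("AWFUL", 0.995),
]

by_negativity = sorted(preds, key=lambda t: t[1], reverse=True)
most_negative = by_negativity[:2]   # tail end: most negative tweets
most_positive = by_negativity[-2:]  # tail end: most positive tweets

print([t for t, _ in most_negative])  # ['RIP Houston Rockets', 'AWFUL']
```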

The following top 10 negative tweets according to our model all had a probability of more than 99% of being negative:

1 RIP Houston Rockets  …

2 AWFUL

3 No knees and no jumper

4 Sucks to have such awful opinions

5 Dennis Smith Jr. @Dennis1SmithJr set to play tonight vs the Atlanta Hawks at 8 pm EST his first preseason game of the season after being out with a strained lower back

6 Rip lakers

7 RIP Washington Wizards

8 RIP MUSCALAUER

9 RIP Washington wizards

10 Rip Washington wizards 

They do seem pretty negative. Especially the abbreviation “RIP” seems to be a key indicator for the classifier to give a negative label. This might also be influenced by having our training set automatically labelled based on the emojis in a tweet. I would guess that many RIP tweets come along with sad emojis.
The tweet regarding Dennis Smith Jr. on the other hand doesn’t necessarily seem super negative and I personally wouldn’t be able to tell which words (or word combinations) made the classifier think it was negative.

There is a similar pattern for the top 10 most positive tweets.

1 Congrats to your nba champion Utah Jazz! Royce O’Neale mvp

2 Fireworks at Disney Orlando’s Magic Kingdom

3 Rodney Hood, a future Sacramento King

4 Happy Birthday thanks for playing for the love of the game SEERED enjoy Chicago

5 incredible Washington Wizards GIF; thank you

6 "You're welcome" - Phoenix Suns

7 Happy birthday to the best Celtics reporter at the Boston Sports Journal, for sure

8 Happy birthday Houston rockets

9 Relive The Toronto Raptors' Historic NBA Championship Run   via @YouTube

10 Happy Birthday my fellow 

For the positive tweets “Happy Birthday” seems to be a very strong indicator, which makes a lot of sense, considering that I can’t really imagine anybody wishing somebody a happy birthday and then telling them to f*** off in the same tweet.
Two of the tweets are probably mislabelled, but it’s easy to see why.
Saying that somebody is going to be a future king is probably mostly positive, except for when claiming that Rodney Hood is going to be a future Sacramento King, which is just speculating about his next team.
So this is probably rather a neutral tweet, but we get where the model is coming from.
“Fireworks at Disney Orlando’s Magic Kingdom” is also just a neutral description of what’s happening. However I would be quite sure that not many sentences that include the words Fireworks, Magic and Kingdom are negative.

When can the model not decide?

The model makes the right decisions on both ends of the spectrum, but what kinds of tweets does it have trouble deciding on?
We would expect these to be rather neutral tweets, or possibly press releases, article headlines, etc. I randomly drew 10 tweets for which the model was quite unsure (probability for either class hovering around 50%).
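Drawing those uncertain tweets amounts to filtering for predictions inside a band around 50% and sampling from that pool. A toy sketch (the band width of ±5 percentage points is my choice, the predictions are made up):

```python
import random

# Toy predictions: tweet -> probability of being negative
preds = {
    "Marcus Smart: Team USA camp a huge chance": 0.51,
    "RIP Washington Wizards": 0.99,
    "Happy Birthday Houston rockets": 0.01,
    "Yeah I realized lmfao": 0.47,
}

# Keep only tweets the model is on the fence about
uncertain = [t for t, p in preds.items() if 0.45 <= p <= 0.55]

random.seed(42)  # reproducible draw
sample = random.sample(uncertain, k=min(10, len(uncertain)))
```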

1 Marcus Smart: Team USA camp a “huge” chance for Boston Celtics teammates to build chemistry  …

2 Buddy Hield 'trusts in God' as Sacramento Kings seek return to NBA playoffs - Sports Spectrum  …

3 Even more forgery, counterfeit, copy, sham, fraud, hoax, imitation, mock-up, dummy, reproduction, lookalike, likeness. #fake

4 Yeah I realized lmfao

5 They were based in Charlotte, North Carolina and called the Charlotte Hornets. Then the team moved to New Orleans and became... the New Orleans Pelicans. . Happens all the time in the us. LA Dodgers were originally from Brooklyn, NY.

6 Revenge Game: Troy Daniels plays Houston tonight.  He played 22 games in his career for the Rockets.

7 Sixers have a great nucleus. — watching Indiana Pacers vs Philadelphia 76ers

8 RT @cavs: 
***
RT @MrCavalier34: Cavs giving Mavs some of their own medicine from outside , Clarkson HOT
#Cleveland #CAVS #AllForOne
#LeBronJames #StriveForGreatness
#NBA #NBAAllStar #TeamLeBron …

9 Wow Portland beat my Nuggets just to get swept???? Really??? Denver would have done better

10 So you’re going to just straight up ignore the @PelicansNBA ? Right.

Most of these sound fine and can be considered neutral. Tweet number 3 though is quite concerning, since it literally only contains negative words. Seriously, it would be hard to find any tweet with a higher ratio of negative to overall words.
In general it’s noticeable that these tweets are on average a fair bit longer than the positive or negative tweets from above. This might stem from the fact that our training set is from 2009, when Twitter only allowed 140 characters per tweet, while we predict on tweets from 2019, when 280 characters are allowed.

Sentiment across fanbases

Let’s stop looking at individual tweets. The whole idea of this model was to be able to judge sentiment across an entire fanbase without having to look at a single tweet.

Happiest Fans 2019

Team                       % of positive tweets
Toronto Raptors            69.38%
Orlando Magic              64.66%
Utah Jazz                  63.08%
Charlotte Hornets          62.36%
Memphis Grizzlies          61.93%

Most Negative Fans 2019

Team                       % of positive tweets
Los Angeles Clippers       47.99%
Minnesota Timberwolves     47.57%
New York Knicks            47.24%
Brooklyn Nets              46.94%
Houston Rockets            43.75%

Having Toronto top off the list of most positive tweets is definitely a great sign, considering they won their (and Canada’s) first NBA title last year!
The Houston Rockets on the other hand severely underperformed in the 2018/2019 season and then blew up their team in the summer, so it seems sensible that they might end up at the bottom of this list.
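The aggregation behind these tables is just counting: group the predicted labels by team and take the share of positive ones. A toy sketch with made-up labels (the real numbers come from roughly a year of tweets per team):

```python
from collections import defaultdict

# Toy (team, is_positive) predictions
labels = [
    ("Toronto Raptors", 1), ("Toronto Raptors", 1), ("Toronto Raptors", 0),
    ("Houston Rockets", 0), ("Houston Rockets", 1),
]

counts = defaultdict(lambda: [0, 0])  # team -> [positive, total]
for team, pos in labels:
    counts[team][0] += pos
    counts[team][1] += 1

share = {team: pos / total for team, (pos, total) in counts.items()}
print(round(share["Toronto Raptors"], 2))  # 0.67
```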

How does the sentiment change over time?

Of course fanbases often swing from negative to positive depending on how well their team is performing at a given moment. The nice thing here is that we can easily spot the franchises whose fans seemed moodiest.
To do this, we can simply check each team’s average month-to-month change in sentiment. To make these changes more meaningful, I decided to get a bit more data and include the teams’ winning percentages on a month-to-month basis as well.
That data is readily available thanks to the Basketball Reference Web Scraper.
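That comparison boils down to two numbers per team: the average absolute month-to-month change in sentiment (the “moodiness”) and the Pearson correlation between sentiment and winning percentage. A toy sketch with made-up monthly series:

```python
import statistics

# Toy monthly series for one team (made-up values)
sentiment = [0.55, 0.60, 0.52, 0.48, 0.58]  # share of positive tweets
win_pct   = [0.50, 0.65, 0.45, 0.40, 0.60]  # monthly winning percentage

# "Moodiness": average absolute month-to-month change in sentiment
deltas = [abs(b - a) for a, b in zip(sentiment, sentiment[1:])]
moodiness = sum(deltas) / len(deltas)

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(round(moodiness, 3), round(pearson(sentiment, win_pct), 2))
```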

For most of these teams the winning percentage and the percentage of positive tweets seem to correlate strongly, or at least follow each other.
But it’s definitely not perfect. It is interesting, however, to check out the teams that made it into the playoffs, which were played from mid April till June (although none of these teams made it until June). The Celtics, Bucks, Rockets and Trail Blazers all made it into the playoffs, with the Rockets and Celtics even hoping to reach the finals. Of course their winning percentage will take a hit, since it’s not as easy to win in the playoffs as it is during the regular season. For all teams we can clearly see that the sentiment drops when they get eliminated.
Another interesting case are the Washington Wizards, who lost their key players to injury during the winter, letting their winning percentage drop a lot and – in very similar movements – the positivity of their fanbase.

Overall the results seem quite promising. It is important to note though that the correlation between winning percentage and sentiment is quite low, at only 0.2.
So maybe we’re just looking at charts and making up stories that fit (a good description of Data Science, to be honest). On the other hand, I do think there is a bit more to it. It might be that training on an automatically labelled set of tweets is not perfect, and also that sentiment is not mainly influenced by the winning percentage. But the former can easily be improved by retraining the model with another dataset, while the latter can be investigated deeper by adding other team-specific factors.
The main thing is though, that at the next party I can boast that I have actually built a model that can classify tweets…definitely worth a week of work.
