Tag Archives: data science

Twitter traffic jams in Washington, created by… John Oliver

Summary: In the first week of June, 20% of the Tweets about traffic, delays and congestion by people around the Washington Beltway were caused by John Oliver’s “Last Week Tonight” segment about Net Neutrality.

At work, we are always exploring a wide range of sensors to obtain useful insights that can be used to make work and routine activities faster, more efficient and less risky. One of our Alpha Tests examines the use of “arrays” of highly targeted Twitter sensors to detect early indications of traffic congestion, accidents and other sources of delays. Specifically, we are training our system to determine whether Twitter is a good traffic sensor (by good, in “data science speak,” we mean whether we can train a model for traffic detection that has a good balance of precision and recall, and hence a good F1 Score). To do this, I set up a test bed around the nation’s second-worst commuter corridor: the Washington DC Beltway (my own backyard).
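For readers less steeped in data science speak, here is a minimal sketch (in Python, with made-up counts) of the F1 calculation used to judge whether a sensor like this is “good”:

```python
# Minimal sketch of the scoring metric: precision, recall, and F1 computed
# from hand-labeled tweets. The counts below are illustrative, not real data.

def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Example: of 120 tweets flagged as "real traffic congestion", 90 actually
# were (30 false alarms), and the model missed 30 genuine congestion tweets.
print(f1_score(true_positives=90, false_positives=30, false_negatives=30))  # 0.75
```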

Earlier this month our array of geographic Twitter sensors picked up an interesting surge in highly localized tweets about traffic-related congestion and delays. This was not the expected “bad commuter day” kind of surge. The number of topic- and geographically-related tweets seen on June 3rd was more than double the expected number for a Tuesday in June around the Beltway; the number seen during lunchtime was almost 5x normal.

So what was the cause? Before answering, it is worth taking a step back.

The folks at Twitter have done a wonderful job: not only do they let you fetch tweets based on topics, hashtags and geographies, they have also added some great machine learning-driven processing to screen out likely spammers and suspect accounts. Nevertheless, Twitter data, like all sensor data, is messy. It is common to see tweets with words spelled wrong, words used out of context, or tweets that are simply nonsensical. In addition, people frequently repeat the same tweets throughout the day (a tactic to raise social media exposure) and do lots of other things that you must train the machine to account for.
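As a rough illustration (not our production pipeline), a first-pass cleanup for that “repeat the same tweet all day” behavior might look something like the sketch below; the tweet field names are assumed for the example:

```python
import re
from datetime import timedelta

def normalize(text: str) -> str:
    """Lowercase, strip URLs and collapse whitespace so near-duplicate tweets compare equal."""
    text = re.sub(r"https?://\S+", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def drop_repeats(tweets, window=timedelta(hours=6)):
    """Drop copies of the same (user, normalized text) seen again within the window.

    `tweets` is an iterable of dicts with (assumed) keys: user_id, text, created_at (datetime).
    """
    last_seen = {}
    for tweet in sorted(tweets, key=lambda t: t["created_at"]):
        key = (tweet["user_id"], normalize(tweet["text"]))
        if key in last_seen and tweet["created_at"] - last_seen[key] < window:
            continue  # a repeat within the window: skip it
        last_seen[key] = tweet["created_at"]
        yield tweet
```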

That’s why we use a Lambda Architecture to process our streaming sensor data (I’ll write about why everyone, from marketers to DevOps staff, should be excited about Lambda Architectures in a future post). As such, not only do we use Complex Event Processing (via Apache Storm) to detect patterns as they happen; we also keep a permanent copy of all raw data that we can explore to discover new patterns and improve our machine learning models.
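To make the idea concrete, here is a toy illustration of the two Lambda paths, assuming each tweet arrives as a simple Python dict; the real system does this with Storm topologies and a distributed store, not a single function:

```python
import json

def handle_tweet(tweet: dict, raw_log, window_counts: dict) -> None:
    """Toy illustration of the two Lambda-Architecture paths fed from one stream.

    Batch layer: append the untouched raw tweet so it can be re-processed later,
    when new patterns are discovered and the model is retrained.
    Speed layer: maintain a cheap in-memory aggregate used for real-time detection.
    """
    # Batch layer: keep every raw record, append-only.
    raw_log.write(json.dumps(tweet) + "\n")

    # Speed layer: incremental per-hour tweet count (timestamp format is assumed).
    hour = tweet["created_at"][:13]          # e.g. "2014-06-03T12"
    window_counts[hour] = window_counts.get(hour, 0) + 1
```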

That is exactly what we did as soon as we detected the surge. Here is what we found: the cause of the traffic- and congestion-related Twitter surge around the Beltway was… John Oliver:

  1. In the back half of June 1st’s episode of “Last Week Tonight” (HBO, 11pm ET), John Oliver had an interesting 13-minute segment on Net Neutrality. In this segment he encouraged people to visit the FCC website and comment on this topic.
  2. Seventeen hours later, the FCC tweeted that “[they were] experiencing technical difficulties with [their] comment system due to heavy traffic.” They tweeted a similar message 74 minutes later.
  3. This triggered a wave of re-tweets and comments about the outage in many places. Interestingly, this wave was delayed around the Beltway. It surged the next day, just before lunchtime in DC, and continued throughout the afternoon. The two spikes were at lunchtime and just after work. Evidently, people are not re-tweeting while working. The timing of the spikes also reveals some interesting patterns in how people around DC use Twitter.
  4. By 4am on Wednesday the surge was over. People around the Beltway were back to their normal tweeting about traffic, construction, delays, lights, outages and other items confounding their commute.

Of course, as soon as we saw the new pattern, we adjusted our model to account for it. However, we thought it would be interesting to show in a simple graph how much “traffic on traffic, delays and congestion” Mr. Oliver induced in the geography around the Beltway over a 36-hour period. Over the first week of June, one out of every five tweets about traffic, delays and congestion by people around the Beltway was not about commuter traffic, but about FCC website traffic caused by John Oliver:

Tweets from people geographically Tweeting around the Washington Beltway on traffic, congestion, delays and related frustration for the first week of June. (Click to enlarge.)

Obviously, a simple count of tweets is a gross measure. To really use Twitter as a sensor, one needs to factor in many other variables: text vs. hashtags, tweets vs. mentions and re-tweets, the software client used to send the tweet (e.g., HootSuite is less likely to be a good source of accurate commuter traffic data), the number of followers the tweeter has (not a simple linear weighting) and much more. However, the simple count is a useful first-order visualization. It also makes for interesting “water-cooler conversation.”
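As a hypothetical sketch only (the weights, client names and field names below are invented for illustration, not values our model learned), a per-tweet weighting along those lines might look like:

```python
import math

def traffic_signal_weight(tweet: dict) -> float:
    """Illustrative, hand-set weighting of one tweet as evidence of live traffic.
    A real model would learn these factors rather than hard-code them."""
    weight = 1.0
    if tweet.get("is_retweet"):
        weight *= 0.3       # retweets mostly echo old information
    if tweet.get("client") in {"Hootsuite", "Buffer"}:
        weight *= 0.5       # scheduled-post clients are weaker evidence of a live commute
    if any(tag in tweet.get("hashtags", []) for tag in ("traffic", "beltway")):
        weight *= 1.5       # explicit topical hashtags
    # Follower count: dampened (log), not linear, so huge accounts don't dominate.
    weight *= 1.0 + 0.1 * math.log1p(tweet.get("follower_count", 0))
    return weight
```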

Who Won Sochi? Wrangling Olympic Medal Count Data

I have always been a big fan of the Olympics (albeit I like the Summer Games better, given my interest in Track & Field, Fencing and Soccer). However, something that has always bothered me is the concept of the Medal Count. For years I have seen countries listed as “winning” because their medal count was higher, even though several countries “below” them often had many more Gold medals. Shouldn’t a Gold medal count for more than a Silver (and much more than a Bronze)? What would you rather have as an athlete: three Gold medals or four Bronzes?

Evidently, I am not the only one debating this point. Googling “value of olympic medals for rank count” yielded a range of debates on the first page alone (Bleacher Report, USA Today, The New Republic, the Washington Post and even Bicycling.com). Wikipedia even has an entry on this debate.

This year, however, I noticed that throughout the games Google’s medal count stats page (Google “olympic medal count”) was not ranking countries by absolute medal count. For quite a while Norway and Germany were on top, even when they did not have the highest total number of medals, because they had more Gold medals than anyone else. Clearly Google was using a different weighting than “all medals are alike.” Not a surprise given their background in data.

I started to wonder what type of weighting they were using. In 1984 (when the Olympics were in Los Angeles) a bunch of gaming companies came out with various Olympic games. Konami’s standup arcade game Track & Field was widely popular (and highly abusive to trackballs). The game I used to play the most (thanks to hacking it) was Epyx’s Summer (and Winter) Games. This game had the “real” challenge of figuring out who won the Olympics, as it was a head-to-head multi-player game (someone had to win). It used the 5:3:1 Medal Weighting Model to determine this: each Gold medal was worth 5 points, each Silver 3 points, each Bronze 1. I wondered if Google was using this model, so I decided to wrangle the data and find out.

Data processing

I used Google’s Sochi Olympic Medal Count as my source of data, as this had both the medal counts and Google’s ranking of winners (I got this via their Russian site so I could get final results; there were 26 countries that won at least one Olympic medal).

Of course, by the end of the Olympics it was a bit less interesting, as Russia had both the most medals and the highest rank. However, I still wanted to figure out their weighting as a curious exercise. I built a model that calculated ranks for various Medal Weighting Model (MWM) approaches and calculated the absolute value of each rank-error delta from Google’s ranking. I computed both the sum of these errors (Total Rank Error, or TRE) and highlighted any non-zero error, enabling me to quickly see where a given MWM weighting diverged.
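Here is a condensed sketch of that TRE calculation, with an illustrative three-country slice of a medal table (the real model used all 26 medal-winning countries and Google’s full ranking):

```python
def rank_by_weights(medals, weights):
    """Rank countries by weighted medal points. weights = (gold, silver, bronze)."""
    scores = {c: sum(w * n for w, n in zip(weights, counts)) for c, counts in medals.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {country: rank for rank, country in enumerate(ordered, start=1)}

def total_rank_error(medals, weights, reference_ranks):
    """Sum of |rank under this weighting - reference (Google) rank| over all countries."""
    candidate = rank_by_weights(medals, weights)
    return sum(abs(candidate[c] - reference_ranks[c]) for c in medals)

# Illustrative (gold, silver, bronze) counts and reference ranks, not the full table.
medals = {"RUS": (13, 11, 9), "NOR": (11, 5, 10), "USA": (9, 7, 12)}
google_rank = {"RUS": 1, "NOR": 2, "USA": 3}
print(total_rank_error(medals, (1, 1, 1), google_rank))   # "Bob Costas" model -> 2
print(total_rank_error(medals, (5, 3, 1), google_rank))   # Epyx 5:3:1 model -> 0
```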

Trying out a few random models

The first model I tried was the “Bob Costas Model,” where every medal is the same (1:1:1). This was clearly different from Google’s, as it had a TRE of 72. I then tried the Epyx 5:3:1 model… no dice: this one had a TRE of 35 (better than Bob, but not great). I tried a few other mathematical series:

  • Fibonacci: 0,1,1 (TRE=50); 1,1,2 (TRE=42); and 1,2,3 (TRE=43)
  • Fibonacci Prime (TRE=54)
  • Abundant Numbers (TRE=54)
  • Prime Numbers: (TRE=42)
  • Lucas Numbers (TRE=28)
  • Geometric Sequence (TRE=23)
  • Weird Numbers (TRE=2)
  • Happy Numbers (TRE=39)

I then tried logical sequences, such as the lowest ratios where a Silver is worth more than a Bronze and a Gold is worth more than both (TRE=31). Still no luck.

Getting more systematic

I decided to get more systematic and visualize the TRE for different MWM weights. I kept to Whole Number weights, operating under the general principle that each medal is worth a whole number of points (true in most sports, but not in things like Diving, Figure Skating and Gymnastics; nevertheless, I wanted to keep things simple).

I first looked at the influence of the Gold weight, using WGOLD:1:1 and varying WGOLD from 1 upwards. This clearly showed a rapid decay in TRE that flattened out at a TRE of 2 once a Gold was worth 13x a single Silver or Bronze medal:

Rapid decay in TRE as Gold medals gain higher weighting

This reinforced that Gold was King, but that Silver was better than Bronze by some value (not surprising). I then kept WGOLD at 13 and started to reduce WBRONZE. I found an interesting result: as soon as I made Bronze worth any value smaller than Silver (even ε = 0.001), I got zero TRE (a complete match to Google’s ranking). However, I could not imagine a scoring system of 13:1:<1 (or 13:1:0.99); it was just too geeky. As such, I tried different approaches, all with Whole Number ratios of Gold:Silver:Bronze. The lowest ratios I found with zero TREs were the following:

  • Gold=21, Silver=2, Bronze=1
  • Gold=29, Silver=3, Bronze=1
  • Gold=40, Silver=4, Bronze=1
  • Gold=45, Silver=5, Bronze=1

TRE never went to zero when Bronze was given zero weight. Of these models, 40:4:1 had the most symmetry (10:1 gold-to-silver and 4:1 silver-to-bronze), so I used that as my approximation of Google’s Olympic Rank MWM (it did have zero TRE for all medal winners).
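For completeness, the whole-number sweep can be expressed as a short loop on top of the total_rank_error helper sketched earlier; run against the full Sochi table and Google’s ranks, it should recover the zero-TRE ratios listed above:

```python
def zero_tre_weightings(medals, reference_ranks, max_gold=50, max_silver=10):
    """Enumerate whole-number weightings with Gold > Silver > Bronze >= 1 whose
    ranking exactly matches the reference ranking (TRE == 0).

    Relies on total_rank_error() from the earlier sketch; the search bounds
    are arbitrary caps chosen for illustration.
    """
    matches = []
    for gold in range(2, max_gold + 1):
        for silver in range(1, min(gold, max_silver + 1)):
            for bronze in range(1, silver):
                if total_rank_error(medals, (gold, silver, bronze), reference_ranks) == 0:
                    matches.append((gold, silver, bronze))
    return matches  # per the text, should include (21, 2, 1), (29, 3, 1), (40, 4, 1), (45, 5, 1)
```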

So who won?

I figured I would look at the Top Five Ranked Countries over various models:

Demonstration of how easy it is to add a Grading Curve to the rankings. The higher the TRE, the more underweighted winning Gold medals (i.e., truly winning events) is. The country in bold is the one that benefits most from the Grading Curve.

Obviously, Russia is the all-around winner, as they won the most medals, the most Golds and the most Silvers (making this exercise a bit less interesting than it was about a week ago). However, it will be fun to apply this in 2016.

And at least Mr Putin is happy.