Social Media

Social Media, CrowdSourcing, Sharing Economies and User-generated Content

After Moving to Slack: Inbox Zero (at least twice per week)

TL;DR version: Moving my entire Engineering organization to Slack and adopting a ChatOps collaboration mindset has reduced email volume by over 95% and now enables us to resolve issues in a quarter of the time.

I have always been a fan of using automation, web hooks, and chat-assisted operations to streamline work and enable collaboration across different locations. However, this traditionally required some Engineering and Operations (i.e., DevOps) investment in setting up collaboration servers, programming bots, etc. Slack has changed this, virtually eliminating the technical friction of moving to a ChatOps model. Here is our story of how Slack enabled us to move to a ChatOps environment with far less email, faster response times, and greater overall productivity.

In late 2013, I joined a new company that, at the time, had no one on staff with a DevOps-style background. After several years of chat-integrated operations, it felt like hitting the brakes. One night a few weeks later, I saw a Tweet by @mparca on February 9 about this great new service called Slack. I took a quick look and realized I could achieve everything that HipChat and Hubot did with a simple push-button SaaS service. My initial setup (our organization, some channels, and hooks to GitHub) took less than 10 minutes. As it was free (for many features), I did not even need to process a Purchase Order (even better). I was off to setting up ChatOps at a new place.

Initially, things went pretty slowly. At the time, most of our tech stack and tools were hosted on-premises. Our chat tool of choice was Skype (not a hook-friendly app). I got a few people to move to Slack, but not many.

Over time, as we implemented a full continuous integration and deployment chain, we added more and more hooks into Slack. First came a move to Atlassian (Cloud). Next Jenkins. Next Sentry. Then Ansible. Then Icinga. Then came custom RTM scripts for more complex things, such as letting our Data Scientists know they have left an idle PySpark context running for more than eight hours. What made this so easy was that everything but the custom RTM scripts could be done in less than five mouse-clicks (it is very helpful that so many collaboration and monitoring tools have enabled web hooks).
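
To give a feel for how small such a custom alert can be, here is a minimal sketch of an idle-PySpark notification. It is an illustration only, not our actual script: it uses a Slack incoming webhook (a plain JSON POST) rather than the RTM API, and the webhook URL, the function names, and the way idle hours are measured are all assumptions.

```python
import json
import urllib.request

# Hypothetical placeholder; a real URL comes from Slack's incoming webhook setup.
WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE/EXAMPLE/EXAMPLE"
IDLE_LIMIT_HOURS = 8  # alert only when a context has been idle longer than this


def build_idle_alert(owner, app_name, idle_hours):
    """Return a Slack message payload, or None if the context is within the limit."""
    if idle_hours <= IDLE_LIMIT_HOURS:
        return None
    text = (
        f"{owner}: your PySpark context `{app_name}` has been idle for "
        f"{idle_hours:.1f} hours. Please stop it if it is no longer needed."
    )
    return {"text": text}


def post_to_slack(payload, url=WEBHOOK_URL):
    """POST the payload to the webhook as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A scheduler would call `build_idle_alert` for each running context and pass any non-empty result to `post_to_slack`; how idle time is detected depends on the cluster and is left out here.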

As we added more hooks and started to bring more people onboard, I noticed an interesting shift. People joining our team began to just use Slack to communicate. One developer would come across an interesting new open source repo or article and share it with the rest in #general. Developers with DevOps privileges would jump on issues as soon as they saw a Sentry alert in #prod (saving the need to even text or phone the on-call engineer). Some people were even answering questions in code review while they were doing things like waiting at the airport to depart on vacation.

Today our Engineering Teams (Product, Hardware, Software, Data, and Ops) all primarily use Slack for communications. Most even use it in favor of texting. We Slack each other tickets that are ready for work, UAT, or release. We use group chats to answer questions about stories, designs, bugs, and more. We use Slack channels integrated with GitHub for better code reviews. We use Slack to facilitate pair programming (and pair testing). Slack is now our default tool for issue diagnosis, as sharing log messages, code snippets, and JSON is much clearer thanks to native markdown.

We achieve this with the following channel model:

  • One channel for each environment (so we can let people know if we are about to add nodes, run a load test, etc.). We have our respective Jenkins, Ansible, Icinga and Sentry hooks tied into each environment channel as well.
  • One channel for each code repository (to see PRs and conduct code reviews). We have aligned our JIRA projects with these to integrate tickets as well.
  • We have some basic team channels for more focused group conversations.
  • We also have a #rm channel for a simple-to-read log of what was released and when.
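
The channel model above amounts to a small routing rule. The sketch below shows one way to express it; the event fields (`kind`, `environment`, `repo`) and the `#general` fallback are assumptions made for illustration, not a description of how the Slack integrations are configured.

```python
# Hook sources that report into the per-environment channels.
ENVIRONMENT_SOURCES = {"jenkins", "ansible", "icinga", "sentry"}
# Event kinds that belong in the per-repository channels.
REPOSITORY_SOURCES = {"pull_request", "jira"}


def channel_for(event):
    """Pick a channel name for an event dict (hypothetical event shape)."""
    kind = event["kind"]
    if kind in ENVIRONMENT_SOURCES:
        return "#" + event["environment"]  # e.g. "#prod"
    if kind in REPOSITORY_SOURCES:
        return "#" + event["repo"]  # one channel per code repository
    if kind == "release":
        return "#rm"  # simple log of what was released and when
    return "#general"  # assumed fallback for everything else
```

For example, `channel_for({"kind": "sentry", "environment": "prod"})` returns `"#prod"`.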

As our organization moved to this model, life and work got easier in some rather visceral ways:

First, I have been able to dial down my notifications to only ping me when one of four things happens: I get a call, I get a text, I get a direct Slack message, or there is a public Slack message in a mission-critical channel. I no longer get endless interruptions, making me more productive at work (and more attentive in meetings and at home). If my phone does ping, it means there is something very important, which actually lets me react to these issues faster.

Second, my email volume is down over 95%. The bulk of the emails I now get are related to true business questions (vs. endless status messages and FYIs). As a result, I can answer email faster and now hit “Inbox Zero” at least twice a week, all while managing a 24x7xForever SaaS Engineering organization with follow-the-sun development and operations spanning California, Washington, Europe, and Asia.

Our full embrace of Slack did not happen overnight. It evolved organically over a period of about 18 months, a natural rate of adoption for organizational change. Because it was organic, we did not have to institute policies that forced usage. Instead, we allowed our teams to naturally adopt Slack in ways that made work easier. I hope more organizations can make this transition, as everyone could benefit from less email and fewer interruptions.

Natural adoption of Slack over other forms of communication would not have happened if the usability were not as good as it is. One of my favorite features is how well Slack detects when I am no longer at my desk: if I walk down the hall, my phone chirps on key Slack messages; when I sit back down, my phone stops and my laptop takes over. This happens within seconds.

Oh BTW, we do all of this with the baseline free Slack account. That’s one less excuse to not give it a try.

PS – Want to work in an environment like this? Check us out.

Twitter traffic jams in Washington, created by… John Oliver

Summary: In the first week of June, 20% of the Tweets about traffic, delays and congestion by people around the Washington Beltway were caused by John Oliver’s “Last Week Tonight” segment about Net Neutrality.

At work, we are always exploring a wide range of sensors to obtain useful insights that can be used to make work and routine activities faster, more efficient and less risky. One of our Alpha Tests is examining the use of “arrays” of highly targeted Twitter sensors to detect early indications of traffic congestion, accidents and other sources of delays. Specifically, we are determining whether Twitter can serve as a good traffic sensor (by good, in “data science speak,” we mean that we can train a model for traffic detection that has a good balance of precision and recall, and hence a good F1 Score). To do this, I set up a test bed around the nation’s second-worst commuter corridor: the Washington DC Beltway (my backyard).
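
For readers less familiar with these terms, the following sketch shows how precision, recall and the F1 Score relate. The function name and the counts in the example are illustrative assumptions, not figures from our test bed.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 from counts of labeled tweets."""
    predicted = true_positives + false_positives  # tweets the model flagged as traffic
    actual = true_positives + false_negatives  # tweets that really were about traffic
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With 8 correctly flagged tweets, 2 false alarms and 8 missed traffic tweets, precision is 0.8, recall is 0.5 and F1 is roughly 0.62: the harmonic mean penalizes a model that is strong on one measure but weak on the other.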

Earlier this month our array of geographic Twitter sensors picked up an interesting surge in highly localized tweets about traffic-related congestion and delays. This was not an expected “bad commuter-day”-like surge. The number of topic- and geographically-related tweets seen on June 4th was more than double the expected number for a Tuesday in June around the Beltway; the number seen during lunchtime was almost 5x normal.

So what was the cause? Before answering, it is worth taking a step back.

The folks at Twitter have done a wonderful job, not only allowing you to fetch tweets based on topics, hashtags and geographies but also adding some great machine learning-driven processing to screen out likely spammers and suspect accounts. Nevertheless, Twitter data, like all sensor data, is messy. It is common to see Tweets with words spelled wrong, words used out of context, or Tweets that are simply nonsensical. In addition, people frequently repeat the same tweets throughout the day (a tactic to raise social media exposure) and do lots of other things that you must train the machine to account for.
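
As one small example of that clean-up, the sketch below drops repeated tweets from the same account after light normalization. It is a simplified illustration with hypothetical function names, not our production logic, which also has to deal with misspellings and context.

```python
import re


def normalize(text):
    """Lowercase, strip links and punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # links often differ between repeats
    text = re.sub(r"[^a-z0-9#@\s]", "", text)  # keep hashtags and mentions
    return " ".join(text.split())


def drop_repeats(tweets):
    """Keep the first occurrence of each (user, normalized text) pair.

    `tweets` is an iterable of (user, text) tuples.
    """
    seen = set()
    kept = []
    for user, text in tweets:
        key = (user, normalize(text))
        if key in seen:
            continue
        seen.add(key)
        kept.append((user, text))
    return kept
```

The same text from two different accounts is kept, since two people reporting the same jam is a signal rather than noise.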

That’s why we use a Lambda Architecture to process our streaming sensor data (I’ll write about why everyone, from marketers to DevOps staff, should be excited about Lambda architectures in a future post). As such, not only do we use Complex Event Processing (via Apache Storm) to detect patterns as they happen; we also keep a permanent copy of all raw data that we can explore to discover new patterns and improve our machine learning models.
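
The core idea can be shown with a deliberately tiny, in-memory sketch: every raw event is kept, a real-time view is updated as events arrive, and the view can be rebuilt from the raw data when the model improves. The class and method names are hypothetical, and nothing here uses or represents Apache Storm.

```python
class TrafficPipeline:
    """Toy illustration of keeping raw data alongside a real-time view."""

    def __init__(self, classifier):
        self.classifier = classifier  # function: tweet text -> True if traffic-related
        self.raw_events = []  # permanent, append-only copy of everything seen
        self.realtime_count = 0  # view updated as events arrive

    def ingest(self, text):
        """Store the raw tweet, then update the real-time view."""
        self.raw_events.append(text)
        if self.classifier(text):
            self.realtime_count += 1

    def rebuild(self, classifier):
        """Replay all raw data through an improved classifier."""
        self.classifier = classifier
        self.realtime_count = sum(1 for text in self.raw_events if classifier(text))
        return self.realtime_count
```

Because the raw tweets are never discarded, a classifier that later learns to exclude a new kind of false positive can be applied to history as well as to new data.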

That is exactly what we did as soon as we detected the surge. Here is what we found: the cause of the traffic- and congestion-related Twitter surge around the Beltway was… John Oliver:

  1. In the back half of June 1st’s episode of “Last Week Tonight” (HBO, 11pm ET), John Oliver had an interesting 13-minute segment on Net Neutrality. In this segment he encouraged people to visit the FCC website and comment on this topic.
  2. Seventeen hours later, the FCC tweeted that “[they were] experiencing technical difficulties with [their] comment system due to heavy traffic.” They tweeted a similar message 74 minutes later.
  3. This triggered a wave of re-tweets and comments about the outage in many places. Interestingly, this wave was delayed in the Beltway. It surged the next day, just before lunchtime in DC, and continued throughout the afternoon. The two spikes were at lunchtime and just after work. Evidently, people are not re-tweeting while working. The timing of the spikes also reveals some interesting behavior patterns in Twitter use in DC.
  4. By 4am on Wednesday the surge was over. People around the Beltway were back to their normal tweeting about traffic, construction, delays, lights, outages and other items confounding their commute.

Of course, as soon as we saw the new pattern, we adjusted our model to account for it. However, we thought it would be interesting to show in a simple graph how much “traffic on traffic, delays and congestion” Mr. Oliver induced in the geography around the Beltway for a 36-hour period. Over the first week of June, one out of every five Tweets about traffic, delays and congestion by people around the Beltway was not about commuter traffic, but instead about FCC website traffic caused by John Oliver:

Figure: Tweets from people around the Washington Beltway on traffic, congestion, delays and related frustration for the first week of June.

Obviously, a simple count of tweets is a gross measure. To really use Twitter as a sensor, one needs to factor in many other variables: use of text vs. hashtags; tweets vs. mentions and re-tweets; the software client used to send the tweet (e.g., HootSuite is less likely to be a good source for accurate commuter traffic data); the number of followers the tweeter has (not a simple linear weighting); and much more. However, the raw count is a simple first-order visualization. It also makes interesting “water-cooler conversation.”
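
To illustrate what moving beyond a raw count might look like, here is a sketch of a weighted count. Every number in it is a made-up assumption: the client weights, the re-tweet discount, and the logarithmic follower scaling (including its direction) are placeholders that a real model would have to learn from labeled data.

```python
import math

# Hypothetical weights: scheduling tools are assumed to be weaker traffic signals.
CLIENT_WEIGHT = {"HootSuite": 0.3}
RETWEET_DISCOUNT = 0.5  # assumed: a re-tweet counts for less than an original tweet


def tweet_weight(client, followers, is_retweet):
    """Return an illustrative weight for one tweet."""
    weight = CLIENT_WEIGHT.get(client, 1.0)
    if is_retweet:
        weight *= RETWEET_DISCOUNT
    # Non-linear follower scaling: 0 followers -> 1.0, 990 followers -> 3.0.
    weight *= math.log10(10 + followers)
    return weight


def weighted_count(tweets):
    """Sum the weights of (client, followers, is_retweet) tuples."""
    return sum(tweet_weight(*tweet) for tweet in tweets)
```

The point is only the shape of the calculation: each tweet contributes a weight instead of a flat 1, so a surge driven by re-tweets or scheduled posts moves the total less than a surge of original tweets.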