A Practitioner’s Guide: Best NoSQL databases to solve 9 real-time transaction challenges

There are a lot of articles out there praising the features and performance of one NoSQL database over another. However, as a practitioner of the principle “pick the right tool for the job”, I thought I would write an article on picking the right NoSQL database for the transactional challenge you are facing—whether you are a startup, mid-sized enterprise or Fortune-500 company.

Before I get started, here are some caveats. First, I have no affiliation with any the companies who provide, or serve as custodians of, the databases outlined in this post. (I have, however, used them all.) Second, I am a big proponent in open source software. This is NOT based on a philosophical bent, but instead decades of experience scaling platforms for hundreds of millions of transactions per day. Third, I am big believer in ecosystems. When you big a technology with a good ecosystem, you get many benefits: you can hire people who no it, you can find lots of tools and libraries to enhance your use of it, and you benefit from patches to solve problems others have already found for your. Ecosystem size factors pretty heavily in my recommendations, as you may not want to bet your company on an untried technology.

Finally, I should address the elephant in the room. You will probably have someone in your company say something like this:

SQL can do anything NoSQL can do. It’s too risky to use it. We lose ACID compliance. We have a huge learning curve, etc.

Yes, relational technology is great for many uses. However, there are many situations where NoSQL technology can do things bigger, faster, at lower cost, and with less effort. Not just by a little bit, but often factors of a 1000x or more. Simply using RDBMS technology for every challenge is akin to using one hammer to for any type of nail, screw, peg, etc.

With this out of the way, let’s get started.

Challenge 1: Search (and Wildcard Searches)

You want to create a place where people can search for content in your site (and handle the idiosyncrasies of misspelling, grammar, and interchange of words like “one” for “1”). Similarly, you want to allow back office and enterprise application users to perform wildcard searches (such as all orders that contain “Nike”) as quickly as they can search for content on Google.

To solve these problems, use a search engine (which is technically a NoSQL database of inverted indices paired with lots scoring and fast spell checking functions). My favorite is ElasticSearch(ES). It is really fast and gives you some interesting capabilities you can use for specialized search (see Challenge #: Recommendation).  A second choice is Apache SOLR. SOLR is quite a bit slower than ES. However, it is included natively in many NoSQL distributions (HortonWorks, Cloudera, DataStax, etc.) If you already have gone through the cost of implementing these it makes sense to stick with Apache SOLR to get more value out of your investment.

BTW, never give users the option of wildcard or text searches in non-Search databases. It is a performance and scalability nightmare.

Challenge 2: Managing Versioned Pages of Information

You have a content management system for writers, journalists, etc. and want to maintain versions of their articles. You have an information management system for life sciences such as electronic medical records (EMR) or eClinical systems, and need to keep a copy of each version of a medical records page (or a clinical case report form, a.k.a. CRF).

To solve these problems use a document-oriented database. Instead of tracking associations of records (or paragraphs) to data to versions, you can simple store the whole document (often in JSON, preserving markup and annotation data) as a single entry for each version. My favorite is Amazon’s DynamoDB, based on its ease of setup and scalability. A close second for me is MongoDB (especially if you have a requirement for on-premise management). MongoDB would benefit by making it easier to setup multi-node clusters with encrypted data transfer. This still takes too much DevOps work.

Caveat: While Document DBs are great for storing pages as documents, I would not use them to present documents as pages to high-volume end users (e.g., for a content management system). Rather than present the ‘current published’ version of content from a database you should use some sort of cache. The easiest solution to setup is Amazon’s CloudFront CDN. However, if your scale and team are big enough Varnish is a more cost-effective solution.

Challenge 3: Managing Streams of Data, Events, and Actions

You want to manage long streams of information. You want to store all the events in a customer’s life cycle (for customer lifetime value management) and that activity is frequent. You want to store sensor events and GPS position reads for an asset (e.g., movement of freight in a reefer container). This challenge is getting more frequent given the explosion of data for mobile, sensors and social sharing events

The best tool for this is a wide-column database (very different from OLAP columnar analytic databases). Wide column databases bring two advantages: they let you store sparse information very efficiently (imaging storing all the topics a person could Tweet about or all the variables a sensor could capture). Second – with some good engineering work – they scale like mad and let you fetch this information back on 1/1000th of the time as traditional relational databases.

All the major columnar databases are children of Google’s BigTable database (that tells you something). My favorite wide-column database is Cassandra, especially if you want to use it for real-time streaming analytics and complex event processing. A far second choice for me is HBase (IMHO, HBase is essentially Cassandra–but with overly complicated ops management). However, if you already have a large bundled Hadoop installation from a major provider (e.g., HortonWorks, Cloudera, MapR) you probably already have HBase installed. HBase would also be my first choice if you company or groups is primarily a batch analytics shop (i.e., a big MapReduce data warehousing shop).

Challenge 4: Recommendation Engines

You want to recommend the best thing given other things your customer has viewed, liked or purchased. You want to recommend the best business to connect with based on given customers relationships with similar businesses. You want to find potential new customers based on their similarity to existing customers.

In reality, the best tools for recommendation are machine learning based. However, graph databases can make things easier machine learning algorithms or more basic recommendation features (such “users who liked this also like these”) to get the right information. Graph databases are still a bit of nascent market. The biggest leader is Neo4J. However, if you already have a Cassandra, HBase or (I am not sure why) Oracle Berkley DB installation you could simply install Titan to use this data. This would be my number one choice as Cassandra and HBase are fantastic ways to store very sparse info on customer preferences (e.g., all the things the viewed, liked or bought) thanks to their ability to support hundreds of thousands of columns per row—and read quickly along a column down a single column across many, many rows.

If you are just getting your feet wet in the recommendation engine space, you can also use ElasticSearch’s boost feature to highlight content for recommendation based on item scores. (These scores can be simple post-transaction calculations or fully robust machine learning driven scores). I once used ES to setup a recommendation engine in less than two weeks that actually beat a Google Search result. ES (and Apache SOLR) also have the “more like this” recommendation feature available out-of-the-box.

Challenge 5: Relationship- and Attribute-based Exploration

You want to create a site that users can browse to find “things like this” or “things related to that.” One example is searching for restaurants based on confluence of style, ingredients or similarity to restaurants I like (this could be equally applied to wine exploration on Lot18 or art exploration on Art.sy). I very popular use of this is browsing sports statistics, such as using Pro-Football-Reference.com for Fantasy Football team assembly.

For these challenges I would go immediately to a graph database. Out of the box, Neo4J would be my first choice. If I already had a big Hadoop or Cassandra installation, I would go to Titan.

Challenge 6: Knowledge Base Exploration

You want to create knowledge base to find the right “how to” content to address a problem, e.g., fix my account, figure out how to perform a feature in an app, etc. You could do this from customer facing site, call center, support desk, etc.

I spent many years working with companies who spent tens of millions of dollars using exotic database solutions to solve these problems. However, none of them worked as well as Search Engines. Search Engines are fast, scalable and handle all the vagaries of user behavior (such as spelling and grammar errors, use of “one” vs. “1”, etc.). In addition, thanks to Google, people are now “trained” to use Search to find answers to their questions.

ElasticSearch’s boost feature makes an amazing knowledge base manager. Search finds the relevant content and attributes such as views, votes, helpfulness ratings, etc. drive the boost score to raise answers up or down. The best example for this is StackOverflow (the developers crib notes for just about any problem).

Challenge 7: Sorted Catalog Browsing

You want to let users search a catalog of items, then switch to sort by things like price and rating or switch to browser by category and sub-category. You want to allow users to look at broad categories and drill into sub-categories to find items to buy. Basically how we all start looking for something at Amazon.com.

The starter technology for this is Search (again ElasticSearch is my recommendation of choice). You can use ES’s aggregation features to search and browser by category. However, as you get really big (i.e., millions of items in your catalog) or you wish to allow users to pivot from search results into structured browsing or explicit assignment of SKU items to categories you would add a columnar store database such as Cassandra or HBase. To see this action, search on ‘shirts’ at Amazon. Notice the phrase ‘Choose a Department to Sort’? This forces you to move from a search result to a columnar-based query based on nested keys. (To be completely clear, Amazon uses its own home-built tech for this, not Cassandra or even the stuff they expose on AWS for the rest of us to use. However, the principles are similar).

Challenge 8: Capturing Logs for Analysis

You want to capture logs of information for later forensic analysis. This could be for debugging, performance analysis, security and penetration analysis, or simply for audit compliance.

My number one choice for this is Riak. It is fast and simple to use. (The Mozilla Foundation uses this to capture all crash report data). A second choice for this is MongoDB as it is already bundled with many log analysis applications. (However, Mongo is more costly to scale than Riak). Another choice is simply streaming this data to a file store such as Apache HDFS to later interrogate with HBase, Hive, PrestoDB, or ElasticSearch.

Challenge 9: State Machine and Session Management

Ok, this one a bit technical. However sometimes you need to keep track of info for rapid lookup for later. You may want to keep track of the pages a person recently viewed on your site. You may want to keep a pre-approval status. You might want to hold a ticket for concert in temporary lock before purchase. You may want to keep the state of a customer or item for complex rule processing (e.g., keep track that I am exercising for fitness target alerts).

The best types of databases for these challenges are key-value Stores. My favorite is Redis. It is ease to setup and super fast. However, Redis is optimized for in-memory option. It is less optimal if you need guaranteed persistence of the information you want to store for rapid lookup. If you need persistence (such as storing pre-calculated algorithm or pre-qualification models), and have no pre-existing infrastructure, I recommend MemcacheDB. However, if you already are using Cassandra, HBase or DynamoDB you can simply use that as rapid key-value store (with persistence). You may even find this more cost-effective than setting up a Redis cluster even if you do not need persistence

***

There are a lot of NoSQL technologies out there. The trick, like all things in life, is the pick the right tool for the job.

MR-NFL

Minority Report meets the NFL: Which College Football Program Is Most Likely to Lead to Future NFL Arrests?

A few weeks ago I published a deep-dive analysis of USA Today’s NFL Arrest Database.  While I received many comments (mostly via email or Twitter DM), two rose to the top:

  1. College is a formative experience. Did the college the player attended affect the likelihood of arrest (and criminal charge)?
  2. Many towns have very active football programs. Did the town or high school the player attended drive specific outcomes?

Are we getting into Minority Report territory?

The more we look at variables that could be used to to classify future criminal behavior (e.g., does going to college X now indicate you will be arrested for Y seven years later?), the more we get into a world more like that depicted in the movie Minority Report. As such, we need to be really careful to ensure as compare “apples-to-apples” for all analysis.

This post will answer the first question. I am still processing the data on high schools before writing up the second.

The college that led to the most arrests: WVU

It is a bit tricky analyzing which college led to the most arrests. You cannot simply count arrests by NFL players and group them by college program. This would penalize colleges with great programs (a college with 200+ alumni in the NFL should have more alumni with arrests than a college with only 5 alumni). Similarly ,you cannot simply look at the ratio of arrested NFL alumni to total NFL alumni (as this would penalize a college with few alumni).

So how did I answer this question? I combined two factors. I overlayed the following:

  • Top 5% college programs with most alumni in the NFL
  • Top 5% college programs with most alumni arrested in the NFL

Here is a visualization of the result:

Spiral chart shows how many NFL arrests were from players from each respective college program. The seven schools highlighted were schools in BOTH the top 5% for NFL placement AND top 5% for NFL arrests.
Spiral chart shows how many NFL arrests were from players from each respective college program. The seven schools highlighted were schools in BOTH the top 5% for NFL placement AND top 5% for NFL arrests.

West Virginia University not only had the most arrested NFL alumni. It also had the most arrested NFL alumni in comparison to all other top college programs. 

I took a look at the ratio of each of these school’s arrested NFL alumni per 125 total NFL alumni to get a “arrest per squad that made it to the NFL.” The results we interesting:

  • The average “Top 5%” team had 4.53 arrest per NFL squad
  • WVU had 18.06, nearly 4x the average arrest rate
  • WVU also had almost double the next highest arrest rate: University of Miami (FL) who had 9.87 arrests per NFL squad

Other Schools with Many Arrested NFL Alumni

As the spiral diagram shows above, WVU was not the only school in the “Top 5%” for both alumni who made it to the NFL and alumni arrested in the NFL. There were seven schools or made this list:

College/
University

NFL
Alumni

Arrests of
NFL Alumni

Arrests
per Squad

West Virginia

180

26

18.06

Miami (FL)

304

24

9.87

USC

470

23

6.12

Ohio State

409

22

6.72

Florida

278

19

8.54

Michigan

346

17

6.14

Georgia

281

16

7.12

Are these numbers really bad?  The answer is a definitive “Yes”.  Let’s take a look:

  • While these colleges represent 2% of all colleges  that have placed players in the NFL, they represent 20% of all future NFL arrests.
  • The average college team’s NFL alumni have an “arrest per squad” rate of 4.89
  • The average “arrest per squad” rate of the team with the highest placement of NFL players is even better: 4.53
  • These seven colleges have a much higher rate: 8.10. This is 118% higher than average arrest rate of the all other schools with the most success placing alumni in the NFL

What is the path from College to NFL team to type of arrest?

After look at my Sankey diagram of NFL team to  criminal charge to legal outcome, many people asked me if I could show a similar diagram leading from college to NFL team to criminal charge. Doing this for all 158 colleges with arrested NFL alumni would be unreadable. However, here is a Sankey diagram of the flow from the seven universities with the most NFL arrests:

Sankey diagram representing the "flow" from university to NFL team to arrest (for the "Top 7" programs highlighted above).
Sankey diagram representing the “flow” from university to NFL team to arrest (for the “Top 7″ programs highlighted above).

So, do specific college programs have a higher tendency for a type of crime?

My prior analysis  showed a strong correlation of criminal charge type by NFL team.  This led to people to ask me the following “smoking gun” question: does any college stand out as the “leader” in arrests of a particular criminal charge. The answer is No.

Yes, there is a college whose NFL alumni had the most arrests for criminal charge X. However, the numbers of arrest by criminal charge by college are so small that there are no statistically valid indicators that college team indicates any future criminal pattern. We should all be happy for that.

There might be a wider dimension than college that we could assess (e.g., conference, geographic area). However, college–in and of itself–is not a valid dimension to predict future criminal charge.

Some other interesting (and positive) insights

With all the attention on NFL arrests if it easy to overlook the positive. My analysis of colleges showed some strong positives as well.

Very successful college programs–in general–do not equate to high arrest rates:

  • The colleges with the highest success rate placing players have a 19% lower arrest rate than the average college program. Notre Dame, UCLA, UL Monroe, Wisconsin, Syracuse, Minnesota, Boston College, Mississippi, Baylor, Indiana, Northwestern, Northwestern State (LA), and Arizona stand out as schools with the lowest alumni arrest rates.
  • The most successful program, Notre Dame (with a whopping 536 alumni who made it to the NFL) had only three  alumni arrested. This corresponds to an arrest rate that is 86% lower than the average school.

Also, NFL players are arrested 1/3 less often than the average US population. Clearly emulating the the examples set at Notre Dame, UCLA, UL Monroe, Wisconsin, Syracuse, Minnesota, Boston College, Mississippi, Baylor, Indiana, Northwestern, Northwestern State (LA), and Arizona would lead to better outcomes for all.

More to follow later…

NFLsankeyBlogImage

Data Analysis of USAToday’s NFL Arrest database: 15 Surprising Insights

As soon as I learned that USA Today had released on open database of NFL player arrests (2000 to present), the data scientist in me thought, “I imagine there are some interesting patterns in there.” Rather than wondering, I downloaded it and dived right in.

The arrest data is easily readable, but lacks some important items (such as the age of the player at the time of arrest). As such, I decided to mash-up the data with two other sources: DOB, Height and Weight data from NFL.com and the strength and speed data from the NFL Combine. This would let me explore some of the more interesting (and potentially controversial) claims I heard in many TV interviews about the effect of increases in player size and strength had on aggression and crime.

My findings

Here are my findings from analyzing the data:

  1. Arrest frequency is NOT increasing. It is actually down from a really bad spate from 2006-2008
  2. NFL players, in general, have a one-third less likelihood of being arrested than everyday US residents. They have 15x the median US income and 3x the college graduation rate.
  3. However, many of those who are arrested are arrested many times throughout their career. 124 people were arrested more than once. One player was arrested 9 times. Sixty-five arrests were for multiple counts, across multiple criminal charges.
  4. Guilty verdicts (conviction, plea, or plea agreement) are the most common legal outcome. They occur almost 7x more frequently than Acquittals
  5. Nevertheless the most common action taken by NFL teams in response to arrests is “No Response.” This occurs 84% of the time
  6. Two-thirds of arrests occur off-season. However over 99% are arrest of players under contract. Free agent arrests are rare (although all of them later signed onto teams)
  7. Three teams (Minnesota, Cincinnati and Denver) have seen double the “normal” number of arrests per team
  8. Four criminal charges (DUI, Drugs, Domestic Violence and Assault) represent 60% of all arrests.
  9. Six charges (DUI, Drugs, Domestic Violence, Assault, Gun Charges and Disorderly Conduct) represent 80% of all arrests. Each of these has a single team with more arrests than any other.
  10. Of the most frequent charges, conviction rate varied enormously. DUIs had the highest conviction rate; Domestic Violence the lowest. While Domestic Violence pleas + convictions outcomes outnumbered acquittals 10:1, the vast majority of these cases were dropped or resolved in Diversion Programs
  11. The median arrested NFL player is: 25 years, 6 months old; is 6’2” tall, weighs 230 lbs., can run the 40-yd dash in 4.61 seconds and can bench press 225 lbs. 21times.
  12. While 88% of the arrests were of players under 30, age was not a factor (in arrest or criminal charge). The distribution of age at time of arrest virtually matched the distribution of ages across the NFL.
  13. There has been much talk in the media about the size of players and the potential impact on aggression. However, contrary to the opinions, neither height nor weight was a factor in likelihood of arrest or type of criminal charge.
  14. Unsurprisingly,  player speed was not a factor as well.
  15. However, analysis of player strength did show a pattern–not one about the strongest players, but about the least-strong. players It turns out those arrested for Sexual Assault stood out as the group with the lowest distribution of strength scores in the NFL Combine.

The data

  • 730 arrests between 2000 and the present (the database actually expanded by one entry a few days after launch to account for the arrest of Jonathan Dwyer)
  • These 730 arrests spanned 544 players (more on that below). Of these 544 players, 330 had publicly-available NFL Combine results
  • The arrests spanned 51 separate criminal charges (with some interesting concentrations, see below)

Sankey NFL Arrest “Flow”

The diagram at the top of this post is called a Sankey Diagram. This allows exploration of which teams had players arrested for each type of criminal charge then what were the distribution of outcomes for these charges. You can explore this chart, for all teams and all criminal charges in full at this page.

Deeper Dive

Sankeys allow exploration of broad pattern (such as which teams have the most Domestic Violence arrests) and interesting outliers (e.g., which team had a player arrested for Pimping; and what was the result). However, they do not make it easy to explore other dimensions of this data. The rest of this post takes a deeper dive into the data, exploring each of the 15 findings with different summaries and visualizations.

   Next: Arrest Frequency (and who was arrested nine times)