
A Practitioner’s Guide: Best NoSQL databases to solve 9 real-time transaction challenges

There are a lot of articles out there praising the features and performance of one NoSQL database over another. However, as a practitioner of the principle “pick the right tool for the job”, I thought I would write an article on picking the right NoSQL database for the transactional challenge you are facing—whether you are a startup, mid-sized enterprise or Fortune-500 company.

Before I get started, here are some caveats. First, I have no affiliation with any of the companies who provide, or serve as custodians of, the databases outlined in this post. (I have, however, used them all.) Second, I am a big proponent of open source software. This is NOT based on a philosophical bent, but instead on decades of experience scaling platforms for hundreds of millions of transactions per day. Third, I am a big believer in ecosystems. When you pick a technology with a good ecosystem, you get many benefits: you can hire people who know it, you can find lots of tools and libraries to enhance your use of it, and you benefit from patches that solve problems others have already found for you. Ecosystem size factors pretty heavily in my recommendations, as you may not want to bet your company on an untried technology.

Finally, I should address the elephant in the room. You will probably have someone in your company say something like this:

SQL can do anything NoSQL can do. It’s too risky to use it. We lose ACID compliance. We have a huge learning curve, etc.

Yes, relational technology is great for many uses. However, there are many situations where NoSQL technology can do things bigger, faster, at lower cost, and with less effort. Not just by a little bit, but often by factors of 1,000x or more. Simply using RDBMS technology for every challenge is akin to using one hammer for every type of nail, screw, peg, etc.

With this out of the way, let’s get started.

Challenge 1: Search (and Wildcard Searches)

You want to create a place where people can search for content in your site (and handle the idiosyncrasies of misspelling, grammar, and interchange of words like “one” for “1”). Similarly, you want to allow back office and enterprise application users to perform wildcard searches (such as all orders that contain “Nike”) as quickly as they can search for content on Google.

To solve these problems, use a search engine (which is technically a NoSQL database of inverted indices paired with lots of scoring and fast spell-checking functions). My favorite is ElasticSearch (ES). It is really fast and gives you some interesting capabilities you can use for specialized search (see Challenge 4: Recommendation Engines). A second choice is Apache SOLR. SOLR is quite a bit slower than ES. However, it is included natively in many NoSQL distributions (HortonWorks, Cloudera, DataStax, etc.). If you have already gone through the cost of implementing these, it makes sense to stick with Apache SOLR to get more value out of your investment.

BTW, never give users the option of wildcard or text searches in non-Search databases. It is a performance and scalability nightmare.
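To make the "inverted index" idea concrete, here is a minimal sketch in plain Python. It is a toy, not how ElasticSearch or SOLR are implemented: the sample documents, the `SYNONYMS` table and the function names are all assumptions made up for illustration. It shows why term lookups stay fast (one dictionary lookup per word) and how "one" and "1" can be treated as the same token.

```python
from collections import defaultdict

# Hypothetical sample documents; the IDs and text are illustrative only.
DOCS = {
    1: "Nike running shoes order",
    2: "Adidas running jacket order",
    3: "Nike one pair of socks",
}

# Toy synonym table so that "1" and "one" land on the same token.
SYNONYMS = {"1": "one"}


def tokenize(text):
    """Lowercase, split on whitespace, and normalize synonyms."""
    return [SYNONYMS.get(tok, tok) for tok in text.lower().split()]


def build_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every query token (AND search)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


INDEX = build_index(DOCS)
print(sorted(search(INDEX, "nike")))    # documents mentioning Nike
print(sorted(search(INDEX, "1 pair")))  # "1" is normalized to "one"
```

A real search engine layers relevance scoring, spell checking and much more on top of this structure, which is exactly why I would not try to rebuild it inside a general-purpose database.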

Challenge 2: Managing Versioned Pages of Information

You have a content management system for writers, journalists, etc. and want to maintain versions of their articles. You have an information management system for life sciences such as electronic medical records (EMR) or eClinical systems, and need to keep a copy of each version of a medical records page (or a clinical case report form, a.k.a. CRF).

To solve these problems, use a document-oriented database. Instead of tracking associations of records (or paragraphs) to data to versions, you can simply store the whole document (often in JSON, preserving markup and annotation data) as a single entry for each version. My favorite is Amazon’s DynamoDB, based on its ease of setup and scalability. A close second for me is MongoDB (especially if you have a requirement for on-premise management). MongoDB would benefit by making it easier to set up multi-node clusters with encrypted data transfer. This still takes too much DevOps work.
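As a rough illustration of "one whole document per version", here is a small in-memory sketch. The `VersionedStore` class is a hypothetical stand-in, not the DynamoDB or MongoDB API; in a real document database the pair (document ID, version number) would typically serve as the key.

```python
import json


class VersionedStore:
    """Toy in-memory stand-in for a document database (not a real client)."""

    def __init__(self):
        # doc_id -> list of JSON strings, one per saved version
        self._versions = {}

    def save(self, doc_id, document):
        """Append the whole document as a new version; return its number."""
        history = self._versions.setdefault(doc_id, [])
        history.append(json.dumps(document))
        return len(history)  # versions are numbered from 1

    def get(self, doc_id, version=None):
        """Fetch a specific version, or the latest when version is None."""
        history = self._versions[doc_id]
        raw = history[-1] if version is None else history[version - 1]
        return json.loads(raw)


# Hypothetical case report form saved twice
store = VersionedStore()
store.save("crf-001", {"title": "Visit 1", "body": "Draft text"})
store.save("crf-001", {"title": "Visit 1", "body": "Reviewed text"})
print(store.get("crf-001"))             # latest version
print(store.get("crf-001", version=1))  # original draft, untouched
```

Nothing is ever updated in place, so every earlier version stays retrievable exactly as it was saved.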

Caveat: While Document DBs are great for storing pages as documents, I would not use them to present documents as pages to high-volume end users (e.g., for a content management system). Rather than present the ‘current published’ version of content from a database, you should use some sort of cache. The easiest solution to set up is Amazon’s CloudFront CDN. However, if your scale and team are big enough, Varnish is a more cost-effective solution.

Challenge 3: Managing Streams of Data, Events, and Actions

You want to manage long streams of information. You want to store all the events in a customer’s life cycle (for customer lifetime value management) and that activity is frequent. You want to store sensor events and GPS position reads for an asset (e.g., movement of freight in a reefer container). This challenge is getting more frequent given the explosion of data from mobile, sensors and social sharing events.

The best tool for this is a wide-column database (very different from OLAP columnar analytic databases). Wide-column databases bring two advantages. First, they let you store sparse information very efficiently (imagine storing all the topics a person could Tweet about or all the variables a sensor could capture). Second, with some good engineering work, they scale like mad and let you fetch this information back in 1/1000th of the time of traditional relational databases.
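The sparse-storage point is easier to see in code. Below is a toy model of a wide-column table as a dictionary of rows, where each row only holds the columns that were actually written. The row keys, column naming scheme (timestamp, then a `|`, then the field name) and helper functions are assumptions for illustration; this is not the Cassandra or HBase API.

```python
from collections import defaultdict

# row key -> {column name -> value}; only columns that were written exist.
table = defaultdict(dict)


def write(row_key, column, value):
    """Store one cell; rows never pay for columns they do not use."""
    table[row_key][column] = value


def read_slice(row_key, start, end):
    """Return columns whose names sort within [start, end], in order."""
    row = table.get(row_key, {})
    return [(c, row[c]) for c in sorted(row) if start <= c <= end]


# Hypothetical reefer-container events; column names embed an ISO timestamp
write("container-42", "2015-01-01T10:00:00|temp_c", -18.5)
write("container-42", "2015-01-01T10:00:00|gps", "40.71,-74.00")
write("container-42", "2015-01-01T11:00:00|temp_c", -17.9)
write("container-7", "2015-01-01T10:30:00|door_open", True)

# Everything container-42 reported during the 10 o'clock hour
for column, value in read_slice(
    "container-42", "2015-01-01T10:00:00", "2015-01-01T10:59:59"
):
    print(column, value)
```

Because columns are kept in sorted order within a row, a time-range read becomes a contiguous slice, which is a large part of why these stores handle event streams so well. Real systems keep that order on disk instead of sorting on every read as this sketch does.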

All the major wide-column databases are children of Google’s BigTable database (that tells you something). My favorite wide-column database is Cassandra, especially if you want to use it for real-time streaming analytics and complex event processing. A far second choice for me is HBase (IMHO, HBase is essentially Cassandra, but with overly complicated ops management). However, if you already have a large bundled Hadoop installation from a major provider (e.g., HortonWorks, Cloudera, MapR), you probably already have HBase installed. HBase would also be my first choice if your company or group is primarily a batch analytics shop (i.e., a big MapReduce data warehousing shop).

Challenge 4: Recommendation Engines

You want to recommend the best thing given other things your customer has viewed, liked or purchased. You want to recommend the best business to connect with based on given customers relationships with similar businesses. You want to find potential new customers based on their similarity to existing customers.

In reality, the best tools for recommendation are machine learning based. However, graph databases can make it easier for machine learning algorithms or more basic recommendation features (such as “users who liked this also liked these”) to get the right information. Graph databases are still a bit of a nascent market. The biggest leader is Neo4J. However, if you already have a Cassandra, HBase or (I am not sure why) Oracle Berkeley DB installation, you could simply install Titan to use this data. This would be my number one choice, as Cassandra and HBase are fantastic ways to store very sparse info on customer preferences (e.g., all the things they viewed, liked or bought) thanks to their ability to support hundreds of thousands of columns per row and to read quickly down a single column across many, many rows.
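For the basic “users who liked this also liked these” feature, the underlying logic is a co-occurrence count. The sketch below does it in plain Python over a made-up `LIKES` dictionary; a graph database would express the same thing as a two-hop traversal (item to users to other items). The data and the function name are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical likes: user -> set of item IDs
LIKES = {
    "ann": {"shoes", "socks", "hat"},
    "bob": {"shoes", "socks"},
    "cat": {"shoes", "jacket"},
    "dan": {"hat"},
}


def also_liked(item, likes, top_n=3):
    """Rank other items by how many fans of `item` also liked them."""
    counts = Counter()
    for items in likes.values():
        if item in items:
            counts.update(items - {item})
    # Sort by count (descending), then by name so ties are deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


print(also_liked("shoes", LIKES))  # ['socks', 'hat', 'jacket']
```

This is deliberately naive: it ignores item popularity and recency, which is where machine learning driven scores start to earn their keep.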

If you are just getting your feet wet in the recommendation engine space, you can also use ElasticSearch’s boost feature to highlight content for recommendation based on item scores. (These scores can be simple post-transaction calculations or fully robust machine learning driven scores.) I once used ES to set up a recommendation engine in less than two weeks that actually beat a Google Search result. ES (and Apache SOLR) also have the “more like this” recommendation feature available out-of-the-box.

Challenge 5: Relationship- and Attribute-based Exploration

You want to create a site that users can browse to find “things like this” or “things related to that.” One example is searching for restaurants based on a confluence of style, ingredients or similarity to restaurants I like (this could be equally applied to wine exploration on Lot18 or art exploration on Art.sy). A very popular use of this is browsing sports statistics, such as using Pro-Football-Reference.com for Fantasy Football team assembly.

For these challenges I would go immediately to a graph database. Out of the box, Neo4J would be my first choice. If I already had a big Hadoop or Cassandra installation, I would go to Titan.

Challenge 6: Knowledge Base Exploration

You want to create a knowledge base to find the right “how to” content to address a problem, e.g., fix my account, figure out how to use a feature in an app, etc. You could do this from a customer-facing site, call center, support desk, etc.

I spent many years working with companies who spent tens of millions of dollars using exotic database solutions to solve these problems. However, none of them worked as well as Search Engines. Search Engines are fast, scalable and handle all the vagaries of user behavior (such as spelling and grammar errors, use of “one” vs. “1”, etc.). In addition, thanks to Google, people are now “trained” to use Search to find answers to their questions.

ElasticSearch’s boost feature makes an amazing knowledge base manager. Search finds the relevant content, and attributes such as views, votes, helpfulness ratings, etc. drive the boost score to raise answers up or down. The best example of this is StackOverflow (the developer’s crib notes for just about any problem).
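The idea of "relevance times boost" can be sketched in a few lines. To be clear, this is not ElasticSearch's actual scoring formula: the word-overlap relevance score, the logarithmic vote boost, and the sample articles are all assumptions chosen to keep the example small.

```python
import math

# Hypothetical knowledge base articles with community votes
ARTICLES = [
    {"title": "reset your account password", "votes": 120},
    {"title": "reset password on mobile app", "votes": 5},
    {"title": "update billing address", "votes": 300},
]


def relevance(query, title):
    """Count distinct query words that appear in the title (toy text score)."""
    title_words = set(title.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in title_words)


def rank(query, articles):
    """Order matching articles by text relevance multiplied by a vote boost."""
    scored = []
    for article in articles:
        rel = relevance(query, article["title"])
        if rel == 0:
            continue  # boosts never surface content that does not match
        boost = 1.0 + math.log1p(article["votes"])  # dampen runaway vote counts
        scored.append((rel * boost, article["title"]))
    scored.sort(reverse=True)
    return [title for _, title in scored]


print(rank("reset password", ARTICLES))
```

Note that the heavily voted billing article never appears for a password query: search decides what is relevant, and the boost only reorders what matched.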

Challenge 7: Sorted Catalog Browsing

You want to let users search a catalog of items, then switch to sort by things like price and rating or switch to browse by category and sub-category. You want to allow users to look at broad categories and drill into sub-categories to find items to buy. Basically how we all start looking for something at Amazon.com.

The starter technology for this is Search (again ElasticSearch is my recommendation of choice). You can use ES’s aggregation features to search and browse by category. However, as you get really big (i.e., millions of items in your catalog) or you wish to allow users to pivot from search results into structured browsing or explicit assignment of SKU items to categories, you would add a columnar store database such as Cassandra or HBase. To see this in action, search on ‘shirts’ at Amazon. Notice the phrase ‘Choose a Department to Sort’? This forces you to move from a search result to a columnar-based query based on nested keys. (To be completely clear, Amazon uses its own home-built tech for this, not Cassandra or even the stuff they expose on AWS for the rest of us to use. However, the principles are similar.)
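Here is a small sketch of the search-then-browse flow: find matching items, count them per category (the "facets" an aggregation would return), then pivot into one category with an explicit sort. The catalog, field names and helper functions are made up for illustration, and the in-memory filtering stands in for what a search engine or wide-column store would do at scale.

```python
from collections import Counter

# Hypothetical catalog entries
CATALOG = [
    {"name": "Oxford shirt", "category": "Men", "price": 45.0, "rating": 4.5},
    {"name": "Linen shirt", "category": "Women", "price": 38.0, "rating": 4.8},
    {"name": "Polo shirt", "category": "Men", "price": 30.0, "rating": 4.1},
    {"name": "Wool socks", "category": "Men", "price": 12.0, "rating": 4.9},
]


def find_items(term, catalog):
    """Return items whose name contains the term as a whole word."""
    return [i for i in catalog if term.lower() in i["name"].lower().split()]


def facet_counts(items):
    """Count hits per category, like a terms aggregation would."""
    return Counter(i["category"] for i in items)


def browse(items, category, sort_key):
    """Pivot from search hits into one category, sorted by a chosen field."""
    in_category = (i for i in items if i["category"] == category)
    return sorted(in_category, key=lambda i: i[sort_key])


hits = find_items("shirt", CATALOG)
print(facet_counts(hits))
print([i["name"] for i in browse(hits, "Men", "price")])
```

The second step is the interesting one: once the user picks a department, the query is keyed on (category, sort field) and no longer depends on free-text matching.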

Challenge 8: Capturing Logs for Analysis

You want to capture logs of information for later forensic analysis. This could be for debugging, performance analysis, security and penetration analysis, or simply for audit compliance.

My number one choice for this is Riak. It is fast and simple to use. (The Mozilla Foundation uses this to capture all crash report data). A second choice for this is MongoDB as it is already bundled with many log analysis applications. (However, Mongo is more costly to scale than Riak). Another choice is simply streaming this data to a file store such as Apache HDFS to later interrogate with HBase, Hive, PrestoDB, or ElasticSearch.

Challenge 9: State Machine and Session Management

OK, this one is a bit technical. However, sometimes you need to keep track of info for rapid lookup later. You may want to keep track of the pages a person recently viewed on your site. You may want to keep a pre-approval status. You might want to hold a ticket for a concert in a temporary lock before purchase. You may want to keep the state of a customer or item for complex rule processing (e.g., keep track that I am exercising for fitness target alerts).

The best types of databases for these challenges are key-value stores. My favorite is Redis. It is easy to set up and super fast. However, Redis is optimized for in-memory operation. It is less optimal if you need guaranteed persistence of the information you want to store for rapid lookup. If you need persistence (such as storing pre-calculated algorithm or pre-qualification models), and have no pre-existing infrastructure, I recommend MemcacheDB. However, if you are already using Cassandra, HBase or DynamoDB, you can simply use that as a rapid key-value store (with persistence). You may even find this more cost-effective than setting up a Redis cluster, even if you do not need persistence.
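The concert-ticket hold is a nice example of what a key-value store with expiring keys gives you. The sketch below is a toy, single-process `TTLStore` written for illustration; it is not the Redis client API, and it makes no attempt at the atomicity a real store provides across many clients. The injectable clock is an assumption added so the example runs instantly.

```python
import time


class TTLStore:
    """Toy key-value store with per-key expiry (single process only)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expiry time)
        self._clock = clock  # injectable so expiry can be simulated

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        """Return the value, or None if the key is missing or expired."""
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries
            return None
        return value

    def set_if_absent(self, key, value, ttl_seconds):
        """Take a temporary lock; return False if someone already holds it."""
        if self.get(key) is not None:
            return False
        self.set(key, value, ttl_seconds)
        return True


# Simulated clock: a one-element list we can advance by hand
now = [0.0]
store = TTLStore(clock=lambda: now[0])

print(store.set_if_absent("seat:A1", "user-1", 300))  # True: hold acquired
print(store.set_if_absent("seat:A1", "user-2", 300))  # False: seat is held
now[0] = 301.0                                        # five minutes pass
print(store.set_if_absent("seat:A1", "user-2", 300))  # True: hold expired
```

The useful property is that nobody has to remember to release the hold. If the buyer walks away, the key simply expires and the seat becomes available again.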


There are a lot of NoSQL technologies out there. The trick, like all things in life, is to pick the right tool for the job.

Using Web 2.0 to manage IFAQs (Interactive FAQs) for help desks

Why are FAQs applicable to the enterprise?

A lot of people think FAQs (especially if you pronounce this | faks |) are only for online digerati (i.e., online techies). However, at their root, FAQs are the best answers to the most common questions asked by your customers or staff. This is applied knowledge management for customer care and internal support—albeit at a manually intensive, low efficiency level. If you could make this knowledge easily accessible by your customers or staff, you would have far fewer calls to your contact center or help desk (saving you lots of money).

Why are FAQs better than other forms of knowledge management?

So we established that FAQs are nascent forms of knowledge management…so what? Why is this better than other forms of knowledge management (KM), such as inferential case management systems, expert artificial intelligence systems, Bayesian knowledge trees or recommendation engines?

Every year, customer support organizations pour millions of dollars into systems and approaches to automate knowledge management to achieve the triple goal of reducing contact center calls, call time and callback rate. Most of these programs do their job of managing knowledge but rarely achieve these reduction goals for a simple reason: they manage knowledge in a way that is optimal for IT, not in a way that is optimal for how human beings obtain knowledge. (If you don’t believe me, contact me to discuss over $200-million USD of real-world examples using ALL of the alternate KM techniques listed above.)

When people call you for help, they start by asking a question. Once they ask this question, they do not want to proceed down a tree of 19 other questions (the case-based version of “Twenty Questions”); they simply want an answer. If you have ever studied contact center traffic, you will find that on a given day or week, the large majority of people are calling about a limited set of items. If you can pull together the answers to these questions and make them easy to access, you can address the large majority of your customers’ (or staff members’) needs. Basically, this is management and provision of FAQs for your staff and customers.

If FAQs are better, why are enterprises not widely using them?

If you have ever managed a FAQ list, you know it is an intensely manual process:

  • First, you need to look at all your submitted questions
  • Next, you need to categorize them and combine similar questions into one
  • Then you need to write answers
  • Finally, you need to post all of this in an easy-to-use format

This is hard enough in a static world. It is next-to-impossible in a dynamic world where the question of the day or week is always changing—and at high volume. This is why few enterprises use FAQs for knowledge management and support.

Why IFAQs (a.k.a. FAQs 2.0) are the answer

Interactive FAQs allow enterprises to manage and scale use of FAQs to manage customer care and support knowledge. To see why, we first need to define what an IFAQ is:

I•FAQ |ˈī fak| (noun)

Abbreviation for: Interactive Frequently Asked Question(s)

Definition: Content organization in the form of questions and answers about the use or operation of a particular product or service that is managed using a two-way flow of information between those asking the questions and those providing the answers.

  1. Open IFAQs enable anyone to ask questions or provide answers. These fully leverage open network behaviors, both for maximum flow of information and maximum exposure to redirection
  2. Hybrid IFAQs enable anyone to ask questions but limit who is able to provide answers (usually to approved experts). Hybrid IFAQs respond more slowly but add the benefit of providing enterprise-certified answers

Both types of IFAQs are social networks. Managers of these networks can moderate their content to ensure user generated content remains focused on the core topics of the IFAQ

Alternate terminology: FAQ 2.0 |ˈfak toō pɔint ō|

With non-interactive knowledge management systems, the enterprise has to “guess” what knowledge its staff or customers need. Enterprises typically do this by analyzing past requests, then providing knowledge to their support staff after the fact.

IFAQs turn this around entirely:

  • First and foremost, customers drive the process. They submit questions directly to the enterprise. (The enterprise responds by answering these questions.)
  • This dynamic builds the true list of most frequently asked questions. As this occurs, customers can look at the answers to these common questions (instead of asking new questions)
  • It is adaptive in real-time. When customers have a new concern, they immediately ask these questions. As soon as the enterprise answers the new question, all other customers can see it. This can prevent call volume by detecting and publicly addressing new issues as soon as they emerge
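The "true list of most frequently asked questions" in the points above is, at its simplest, a running count of submitted questions after light normalization. The sketch below assumes exact matching after lowercasing and stripping punctuation; the sample questions and function names are made up, and a real service would need fuzzier matching to merge differently worded versions of the same question.

```python
import string
from collections import Counter

# Translation table that deletes all ASCII punctuation
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(question):
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    cleaned = question.lower().translate(_STRIP_PUNCT)
    return " ".join(cleaned.split())


def top_questions(questions, n=2):
    """Return the n most frequently asked questions with their counts."""
    counts = Counter(normalize(q) for q in questions)
    return counts.most_common(n)


# Hypothetical submissions from customers
submitted = [
    "How do I reset my password?",
    "how do i reset my password",
    "Where is my invoice?",
    "How do I reset my  password?!",
]
print(top_questions(submitted))
```

Because the counts update as each question arrives, a brand-new issue climbs the list on the day it appears, not after a quarterly review of call logs.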

What makes a good IFAQ service?

To be effective an IFAQ service must have the following features and characteristics:

  1. Have an easy-to-use, intuitive interface. If you do not provide this, your customers will not want to use it
  2. Support multimedia content. A picture can be more useful than a thousand words; a video contains thousands of pictures (i.e., frames). Allowing both questions and answers to contain pictures, videos, audio and documents will make them much more compelling (compelling enough to get someone to use these vs. picking up the phone).
  3. Be able to organize questions around topics and calls-to-action. If you do not provide this you will get the “big pile of questions and answers” that makes it impossible for people to find what they need. I call social calls-to-action “social campaigns”
  4. Enable customers to submit, rate, and comment upon content. If you do not do this, you will not know if your answers are liked, helpful or even correct. By letting users vote on your content, you can drive the best content to the top. By letting them expand upon it, you can learn better ways to present your answers to make them more effective.
  5. Provide rich moderation controls. You need to be able to manage who can ask questions and who can answer them. You also need to be able to edit and remove duplicate or incorrect content. You also need to enable your users to report inappropriate content (and automatically remove content when a set number of people report this). If you do not provide this, you will lose control of your network
  6. Support enterprise integration. You will want to add this onto your enterprise. That means you will want to be able to integrate with employee directories and your CRM or ITIL Management systems. This is the difference between making an amateur or proof-of-concept IFAQ and an enterprise one.
  7. Integrate business intelligence. If you cannot analyze and report on how the IFAQ is used, you cannot measure its enterprise value.
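Two of the features above, voting the best content to the top and automatically removing content once enough people report it, are simple to sketch. The `Answer` class, the `REPORT_THRESHOLD` value of 3, and the helper functions are hypothetical and exist only to illustrate the behavior; they do not describe any particular IFAQ product.

```python
from dataclasses import dataclass, field

REPORT_THRESHOLD = 3  # hypothetical: hide content after this many reporters


@dataclass
class Answer:
    text: str
    votes: int = 0
    reporters: set = field(default_factory=set)  # distinct user IDs

    @property
    def hidden(self):
        """True once enough distinct users have reported this answer."""
        return len(self.reporters) >= REPORT_THRESHOLD


def report(answer, user_id):
    """Record a report; repeat reports from one user count only once."""
    answer.reporters.add(user_id)


def visible_ranked(answers):
    """Return non-hidden answers, highest voted first."""
    shown = [a for a in answers if not a.hidden]
    return sorted(shown, key=lambda a: a.votes, reverse=True)


answers = [
    Answer("Use the reset link on the login page.", votes=12),
    Answer("Call support and wait on hold.", votes=2),
    Answer("Buy my unrelated product!", votes=0),
]
for user in ("u1", "u2", "u2", "u3"):
    report(answers[2], user)

print([a.text for a in visible_ranked(answers)])
```

Storing reporters as a set is a small design choice worth noting: it stops a single annoyed user from removing content by clicking "report" repeatedly.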

Essentially, these are all principles of delivering a robust, purpose-driven social network focused on ideation.

How close are we to this?

To quote the old “Six Million Dollar Man” show, “we have the technology; we can build [it].” However, it is not a question of having the technology; it is more a question of positioning its use. Positioned correctly, you not only can use this for customer care but also to drive revenue: Imagine answering a question with a description, instructional video and hyperlink to purchasing a product or service to perform this. That turns a cost of support into a sales lead. (If this sounds unreal, a quick demonstration example can be found here.)