
Big Web 2.0 Technology Challenges: Utility-class scaling of dynamic data

Managed scalability is one of those things that many do not appreciate having until they really need it, i.e., when their site gets really popular. Once this happens, they start screaming about the need for scalability (look at AOL’s “Access Crisis” or Twitter’s “Fail Whale” if you need an example). By this time it is too late: your reputation has taken a hit that is hard to forget, and the cost of emergency scaling is usually very high (often involving very large, expensive servers).

It is much more efficient to plan and design for scalability from the start. When you do this, you are ready to adjust and respond to traffic loads when they arrive. Of course, not many people will appreciate what you have achieved (ironic, isn’t it?).

Why the Focus on Scaling Dynamic Data?

Scaling static data is easy. By definition, it does not change very often (i.e., its read/write ratio is usually enormously high). The most common approach is to create cached copies of the static information, which provides highly reliable, low-cost scaling.
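As a minimal sketch of that idea (the store and function names here are hypothetical, not from any particular system): serve static content from an in-process cache with a time-to-live, and fall back to the origin only on a miss.

```python
import time

CACHE_TTL_SECONDS = 300          # static content changes rarely, so a long TTL is safe
_cache = {}                      # key -> (expires_at, value)

def fetch_from_origin(key):
    # Hypothetical origin lookup (database, file store, etc.)
    return f"content for {key}"

def get_static_content(key):
    """Return cached content if still fresh, otherwise refresh from the origin."""
    entry = _cache.get(key)
    now = time.time()
    if entry and entry[0] > now:
        return entry[1]                        # cache hit: no load on the origin
    value = fetch_from_origin(key)             # cache miss: one read against the origin
    _cache[key] = (now + CACHE_TTL_SECONDS, value)
    return value
```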

Scaling dynamic data has always been more difficult: these data elements change all the time. The two most common approaches involve either managing near-real-time data replicates or horizontally partitioning the data:

[Diagram: read-only replicates vs. horizontal partitioning of dynamic data]

Read-only replicates are useful when the amount of data per user is small (e.g., master file records, account profiles). However, their utility breaks down due to replication lag when the volume of data updates per user grows. In these situations the more complex horizontal partitioning model is more useful.
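To make those two models concrete, here is a rough Python sketch; the endpoint names are placeholders and the routing is simplified, but it shows reads fanning out across replicas in one model and each user's data pinned to a single shard in the other.

```python
import random

# Read-only replicate model: every write goes to the primary,
# while the (far more numerous) reads spread across replicas.
PRIMARY = "db-primary"
REPLICAS = ["db-replica-1", "db-replica-2", "db-replica-3"]

def route_query(is_write: bool) -> str:
    return PRIMARY if is_write else random.choice(REPLICAS)

# Horizontal partitioning model: all of a user's rows live on exactly one shard,
# chosen deterministically from the user ID.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for_user(user_id: int) -> str:
    return SHARDS[user_id % len(SHARDS)]

print(route_query(is_write=False))   # e.g., "db-replica-2"
print(shard_for_user(1138))          # "shard-2"
```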

Web 2.0 Needs Break Both of These Models

In Web 2.0, dynamic data pursues a much more distributed path:

[Diagram: Web 2.0 dynamic data flowing between many authors, moderators, and reading sites]

This essentially requires you to scale in three dimensions:

  1. Enable many, many people to author (insert and edit) data from many places
  2. Enable a small group from a single place to moderate (edit) data from many authors
  3. Enable a single site to display (in one place) the most recent, highest-rated, etc. data from many authors

Unfortunately, none of the single-approach scaling methods addresses all three dimensions:

All Data in One Big System: This makes it easy to look at data from all perspectives (authoring, moderating and publishing). However, it creates a single point of failure: when the system goes down, everyone gets a “Fail Whale”.

Caching or Read-only Replicates: This does not enable you to scale the high volume of data writes without using many primaries (which essentially moves you to a horizontally-partitioned model). Also, maintaining data integrity when replicating from multiple sources becomes a nightmare.

Horizontal Partitioning: This approach is perfect when I only need to access my own data. However, it falls apart when managing moderation and publication of live feeds: essentially, you have to pull data from many different places to create single views and feeds. This is hard for publication (read-only) and extremely difficult for moderation (reading and updating).
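One way to picture the publication problem is as a scatter-gather read: query every shard in parallel for its newest items and merge the partial results into a single feed. This is an illustrative sketch only (the per-shard query is a stand-in for real database calls), but it shows the cost of building one view over partitioned data.

```python
from concurrent.futures import ThreadPoolExecutor

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def recent_posts(shard: str, limit: int):
    # Stand-in for a real per-shard query returning (timestamp, post) pairs.
    return [(ts, f"{shard}-post-{ts}") for ts in range(limit)]

def site_feed(limit: int = 10):
    """Fan out one query per shard, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda shard: recent_posts(shard, limit), SHARDS)
    merged = [item for partial in partials for item in partial]
    merged.sort(key=lambda pair: pair[0], reverse=True)   # newest first
    return merged[:limit]

print(site_feed())
```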

What Model Should I Use?

There is no single “silver bullet” like there was in the Web 1.0 world. As a result you need to apply some engineering analysis:

  • Analyze the read/write ratios by dynamic content type (blog posts, forum thread posts, profiles, ideas, etc.); a quick sketch of this analysis follows the list
  • Examine how people use this data, e.g., what data they request at the same time
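As a starting point for the first of those bullets, something as simple as the following sketch will do: tally reads and writes per content type from your access logs and compute the ratio (the numbers below are invented purely for illustration).

```python
# Hypothetical counts pulled from access logs or database metrics.
traffic = {
    "blog_post":    {"reads": 500_000, "writes": 200},
    "forum_thread": {"reads": 120_000, "writes": 9_000},
    "profile":      {"reads":  80_000, "writes": 1_500},
}

for content_type, counts in traffic.items():
    ratio = counts["reads"] / counts["writes"]
    print(f"{content_type}: read/write ratio ~ {ratio:,.0f}")
```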

After you have done this, you will find you need to use a balanced recursive combination of multiple techniques:

  1. Use caching and read-only replicates for items with low dynamism, e.g., profiles, blog titles (not blog posts)
  2. Next, partition dynamic data by content type. This allows you to split load without interfering with the ability to easily manage moderation (moderation is usually performed by content type)
  3. Now denormalize your dynamic data using a noun-adjective model (I know, your DBA just started screaming…). This allows you to further split load without interfering with the ability to moderate content by type and status (a sketch of steps 2 and 3 follows this list)
  4. If this still does not give you enough scalability, you need to partition either at the physical level (something many database management systems do well) or at the logical level (most likely by user or content ID). Logical partitioning will require you to use a grid computing model involving multiple, parallel calls to different databases. This is not a trivial exercise, but it enables massive scaling (I have done this on both commodity Linux and expensive mainframe architectures, scaling to support tables with several billion rows of data).
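As a rough, non-proprietary sketch of steps 2 and 3 (the names, the schema, and my reading of the noun-adjective split are all illustrative assumptions, not the architecture referenced at the end of this post): writes are routed to a partition keyed by content type, and each item is split into a stable “noun” record and a write-hot “adjective” record, so moderation can still filter by content type and status.

```python
from dataclasses import dataclass

# Step 2: partition dynamic data by content type (moderation already works per type).
PARTITION_BY_CONTENT_TYPE = {
    "blog_post":  "db-blogs",
    "forum_post": "db-forums",
    "idea":       "db-ideas",
}

def partition_for(content_type: str) -> str:
    return PARTITION_BY_CONTENT_TYPE[content_type]

# Step 3 (one plausible reading of the noun-adjective model): keep the stable "noun"
# record separate from its fast-changing "adjective" record, so the write-hot
# adjectives can be scaled on their own while moderation still filters by
# content type and status.
@dataclass
class Noun:                  # the content item itself; written once, rarely updated
    content_id: int
    content_type: str
    author_id: int
    body: str

@dataclass
class Adjective:             # qualifiers that change constantly
    content_id: int
    content_type: str
    status: str              # pending / approved / rejected
    rating: float

def moderation_queue(adjectives, content_type):
    """A moderator's view needs only the adjective records for one content type."""
    return [a for a in adjectives
            if a.content_type == content_type and a.status == "pending"]
```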

Of course, now you are looking for an illustrative diagram showing how I have done this for high-volume, multi-tenant, cross-community social networking architectures. Unfortunately, if I shared this diagram I would be widely distributing a set of trade secrets. However, I am willing to discuss more details of this approach in environments less public than an open blog post (perhaps I will have a chance to do so at Wharton later this month).

Four big technology challenges in a Web 2.0 world

Web 2.0 created a REAL change in how we use information

A lot of people call Web 2.0 an empty buzzword (there are some Buzzword Bingo apps out there that play on this). However, the product changes that came with Web 2.0 drove real changes in how people use information.

To fully understand this, it makes sense to explain what I mean by Web 2.0. To do this, I am going to repeat the explanation I give when asked what Web 2.0 is (those of you who read my posting on Social Collaboration can skip this section, with my apologies).

Web 1.0 let organizations publish information that all of us could access easily, whenever we needed it, at our convenience. This entirely changed how we read the news, looked up movie times, and checked stock quotes or the weather. The problem with Web 1.0 was that it was biased towards making it easy for large organizations to share information. CNN could easily set up a web site to share news and opinion. However, if I wanted to share my information with many other people in this fashion, I had to set up my own web site, publish content, figure out how to control access to it, etc. This was too hard for the everyday person (who had non-technology things to worry about in his or her life).

Web 2.0 changed this by making it easy to share my views and information, and to control how I share them. Now I can use Facebook to share my vacation pictures with my friends (but limit my contact information to my professional colleagues). I do not have to build a web site, administer it, share the URL with my friends, and get them to bookmark and visit it. Instead, I can rely on the fact that they will visit Facebook as part of their normal life and see my updates there (the network effect). It makes things much easier for me.

Enterprise 2.0 and Government 2.0 do the same, but directly in support of the missions of public and private enterprises. They let stakeholders share information and views about how industry and government should work (instead of simply proclaiming, “I feel blue today.”) I think of this as Web 2.0 with the purpose of creating Measurable Enterprise Value (i.e., demonstrable business value or public benefit).

This has created new technology challenges

Web 2.0 effectively shifted how we need to manage the creation, access, update, sharing and archiving of information. It moved from a hub model to a true network model:

[Diagram: the shift from the Web 1.0 hub model to the Web 2.0 network model]

This shift is not trivial. It has created new challenges across a broad range of technology areas:

  1. Scaling dynamic data: In the Web 1.0 world, we used two different approaches to scaling static vs. dynamic data. These approaches do not work in a world where data can be viewed by everyone and can come from everywhere
  2. Moderating user-generated content: In the Web 1.0 world, user-generated data were viewed by relatively small populations. In a Web 2.0 world, they can be viewed by everyone. This presents significant moderation challenges and requires new balances between openness, speed of publishing, control and safety
  3. Intellectual property management: Who owns the data on a social network? Web 2.0 stresses every aspect of intellectual property, from ownership and revenue attribution to control and privacy
  4. Cross-platform media management: We live in a multimedia world. That means you have to manage both upload and download of media across hundreds of platforms (browser and operating system combinations). Anyone who has ever worked with CODECs can appreciate how difficult this is

Addressing these with elegant, high-scale, cost-efficient approaches is not a trivial prospect. However, it is as exciting and stimulating as the challenges I faced when helping to build out the Web 1.0 world. Over my next few posts, I will blog about how I have tackled each of these and what I have learned along the way.