Big SQL

What is Big SQL (or BigSQL)?

Big SQL is the combination of a SQL interface (for ad hoc processing) and parallel processing for handling “big” quantities of data. This can be accomplished in a variety of ways: SQL+Hadoop, SQL+Hadoop+Postgres, Postgres+Hadoop, and MySQL+ScaleDB. Each of these combines BIG (distribution and parallel processing in a scale-out architecture) with SQL.
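
To make the notion concrete, here is a rough sketch of the kind of ad hoc query a Big SQL system is built to serve. The SQL itself is ordinary; what makes it “Big” is that the engine transparently fans the scan and aggregation out across many nodes. The table and column names below are hypothetical and used only for illustration.

    -- Hypothetical ad hoc query over a very large clickstream table.
    -- In a Big SQL system, each node scans and pre-aggregates its own
    -- partition of the data in parallel, and the partial results are
    -- merged into the final answer. The SQL does not change.
    SELECT   country,
             COUNT(*)                AS page_views,
             COUNT(DISTINCT user_id) AS unique_visitors
    FROM     clickstream
    WHERE    event_date >= '2012-01-01'
    GROUP BY country
    ORDER BY page_views DESC;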

The phrase “Big Data” has gone a long way toward making Big synonymous with Hadoop. However, I take the more general perspective that Big = distributed scale-out architectures capable of handling large volumes of data. As traditional DBMSs increasingly add such parallel processing, they will naturally fall into the same category, so associating “big” exclusively with Hadoop is far too limiting.

While RESTful solutions like BigTable do not support the complete SQL language, they have added ad hoc interfaces that are a subset of SQL, in this case GQL. It appears that NoSQL is transitioning to “SomeSQL”. Of course, early implementations of SQL+Hadoop also support only a subset of the SQL language. In both cases, the products provide an ad hoc interface on top of distributed processing technology for handling large volumes of data, which qualifies them for the Big SQL club.
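
To illustrate what such a SQL subset looks like, the sketch below contrasts an ordinary SQL statement with a roughly equivalent GQL statement; the table, entity kind and property names are hypothetical. GQL deliberately omits features such as joins, GROUP BY and most aggregates, which is what makes it “SomeSQL” rather than full SQL.

    -- Full SQL (hypothetical table and columns):
    SELECT   *
    FROM     orders
    WHERE    status = 'open' AND total > 100
    ORDER BY total DESC
    LIMIT    20;

    -- Roughly equivalent GQL against a hypothetical Datastore entity kind.
    -- Note what is absent: no joins, no GROUP BY, no COUNT(*); it is a
    -- subset of SQL, not the whole language.
    SELECT   *
    FROM     Order
    WHERE    status = 'open' AND total > 100
    ORDER BY total DESC
    LIMIT    20;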

Explosive Growth of Data Drives Big SQL Market

The sheer volume of data is exploding, driving the need for more scalable methods of storing, retrieving and processing large amounts of data. This flood of data includes structured data from databases, sensors, click streams and location data, as well as unstructured data like email, HTML, social data and images. Not too long ago, most data was created by human-to-human interactions. Now large amounts of data come from human interactions with machines (click streams) and directly from machines (sensors). The bottom line is that traditional databases simply were not designed to handle this scale of data. This underlying trend in data growth is forcing companies to adopt new technologies like Hadoop to sift through the data. At the same time, companies like ScaleDB are using Hadoop-like approaches to build clusters capable of large-scale parallel queries.

Big SQL is an offshoot of the Big Data megatrend. It represents a blending of the two disciplines: the SQL interface of traditional databases and the distributed, parallel processing of Big Data systems. As described by database genius James Hamilton, modern workloads will increasingly span large clusters of low-cost, low-power machines operating on the data in parallel.

Corporate Adoption of Hadoop

While Hadoop has impressed many with its ability to process massive amounts of data, corporate adoption has been crippled by the following challenges:

  1. lack of an ad hoc structured interface
  2. lack of an ecosystem (tools, applications, availability of trained experts, etc.)
  3. lack of user-friendly deployment and management

Cutting-edge (and highly paid) engineers in Silicon Valley have pioneered Hadoop adoption, but broader corporate adoption requires that Hadoop fit into the corporate IT mold, which means addressing the three challenges above.

Software companies derive the bulk of their revenues from mainstream corporate adoption. As a result, there is a huge effort underway to remove the barriers to that adoption. The belief is that adding a SQL interface to Hadoop will address the first two challenges above. This belief should be proven out one way or the other very soon, as both technology leaders and new startups are just now delivering SQL interfaces to Hadoop.

Adding a SQL interface to Hadoop lets the system operate in an ad hoc fashion, versus its traditional role as a batch processing system. Without writing and compiling complex code, SQL developers can quickly and efficiently query the Hadoop system using familiar SQL tools. This creates a large pool of lower-cost developers able to exploit the power of the Hadoop system, and it goes a long way toward building the ecosystem Hadoop needs to achieve broad corporate adoption and cross the chasm.
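
As a rough sketch of what this looks like in practice, the query below uses Hive-style SQL, one example of the SQL-on-Hadoop interfaces discussed here; behind the scenes the engine compiles it into parallel jobs over files in HDFS, work that would otherwise require hand-written MapReduce code. The table and column names are hypothetical.

    -- Hypothetical HiveQL query over web logs stored in HDFS.
    -- The developer writes familiar SQL; the engine generates and
    -- schedules the parallel MapReduce jobs that scan, group and
    -- sort the data across the cluster.
    SELECT   referrer_domain,
             COUNT(*) AS visits
    FROM     web_logs
    WHERE    log_date BETWEEN '2012-06-01' AND '2012-06-30'
    GROUP BY referrer_domain
    ORDER BY visits DESC
    LIMIT    10;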

The Big SQL umbrella also includes using Hadoop (actually HDFS) as a distribution mechanism for back-end databases (Hadapt). In addition, some of the NoSQL offerings use MapReduce technology to parallelize their processing (BigTable, HBase, etc.). The number one request from NoSQL users is a SQL interface, and the NoSQL companies are responding with SQL-like interfaces that support a subset of the SQL language. In other words, they are morphing into “SomeSQL” key-value stores. These too fit into the Big SQL market.

The traditional DBMS companies are not sitting still. Oracle, IBM and Microsoft are already integrating Hadoop alongside their own proprietary parallel processing architectures. Oracle’s Exadata® already parallelizes queries across its storage nodes, just as ScaleDB does for MySQL. Look for more of this in response to the growing popularity of Hadoop-based systems.

Joseph Turian, among others, believes that we will see a blending of the traditional DBMS and the new Hadoop markets, as both of these technologies compete for a share of the Big SQL market. ScaleDB represents what may be the future response from database vendors as they add distributed parallel processing to their databases. Greenplum and Aster Data embraced Hadoop technology, albeit for analytics only, using the Postgres database. Look for a convergence of these technologies, making for a very exciting Big SQL space in the coming months and years.

Products in the Big SQL Market

The Big SQL market is fast-moving, with venture capitalists funding new companies while some of the biggest companies jump into the fray. SQL interfaces to Hadoop seem to be the hot market du jour, and that segment is quickly getting crowded. Other Big SQL sub-markets are still in their infancy, but they are starting to get attention. The following is a high-level categorization of the Big SQL market.

Figure 1: The Big SQL Market

Big SQL Partnerships

As the market continues to heat up and the upstarts begin to overlap with the traditional leaders in data management, you can expect more partnerships to develop in this space. We have already seen Greenplum/EMC create the DCA (Data Computing Appliance). Hortonworks has partnered with Teradata, Rackspace, Talend and Microsoft. Cloudera has partnered with Oracle, IBM, HP, SAP and MicroStrategy, among others. Some of these may be Barney partnerships (yes, the purple dinosaur who sang “I love you, you love me”: plenty of affection, but nothing comes of them), while others may be a way for the big guys to learn about and monitor this “Hadoop thing” in case it becomes something they need to embrace and extend; which it will be remains to be seen.

This is a fast-moving field. If you have comments or edits to the information on this page, please email them to mike [at] scaledb [dot] com.