Large Database

What is a Large Database?

The definition of a large database is always changing, as IT collects more data (e.g. click-stream data) and as hardware and software evolve to handle larger volumes. What was large 10 years ago is tiny today. The term Large Database can be quantified by a variety of criteria; see the table below:

Various Definitions of a Large Database

Small Database               | Medium Database          | Large Database
Fits in Memory               | Fits in Single Server    | Spread Over Multiple Servers
No DBA Required              | 1 DBA Required           | 2+ DBAs Required
< 10^5 Records               | 10^5 – 10^7 Records      | > 10^7 Records
< 10 GB of Data              | 10 GB – 40 GB of Data    | > 40 GB of Data
No Partitions                | Minimal Partitions       | Massively Sharded
Performance is not a problem | --                       | Performance is a Problem!

If I updated this list in 3-5 years, all of the fixed measurements would probably increase by a factor of 10X to 100X, maybe more. Probably the most relevant description of a “Large Database” is one that is causing you pain and forcing you to consider other options.

The Factors that Define a Large Database

The reason the term “Large Database” is a moving target is because technology continues to evolve. What is considered large when run from a hard disk may seem manageable when running on a Solid State Disk (SSD), when running in memory, or when running on an elastic database such as a NoSQL store or ScaleDB for MySQL. As a result, we can consider database size to be the interplay of the following four factors:

[Figure: large database considerations]

Data Volume: The amount of data as defined by the number of records, tables, terabytes/petabytes, etc.

Hardware: Running even a small database on an extremely constrained server will seem like a large (problematic) database.

Throughput: This is database lingo for usage. If you have a small database but it services 10 million concurrent users, it will seem like a large database. Or you may have a single client running billions of transactions; that too will seem large. Measurements of usage levels are considered throughput.

Software: This includes both the database management system (DBMS) you are using and the implementation of the database itself. That implementation may make extensive use of I/O-, network- or CPU-intensive processes such as joins or range scans. It might also depend on your use of optimizations such as indexes.

Your database is only as good as the weakest of these four factors, but you can also compensate for a weakness. For example, if your hardware has a small disk you can use compression. Or, if you are short of RAM, you’ll want a DBMS that uses RAM efficiently rather than one that must keep everything in memory.
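
For example, in MySQL’s InnoDB engine you can trade some CPU for disk space by creating tables with a compressed row format. The sketch below is a minimal illustration using the mysql-connector-python driver; the connection parameters and the events table are hypothetical placeholders, not a recommendation for any particular schema.

```python
# Minimal sketch: enabling InnoDB table compression to compensate for limited
# disk space. Connection parameters and the `events` table are hypothetical.
import mysql.connector

conn = mysql.connector.connect(
    host="db.example.com",   # hypothetical host
    user="app",
    password="secret",
    database="appdb",
)
cur = conn.cursor()

# ROW_FORMAT=COMPRESSED stores the table in compressed pages, trading some CPU
# on reads and writes for a smaller on-disk footprint.
cur.execute(
    """
    CREATE TABLE events (
        id BIGINT PRIMARY KEY,
        payload TEXT
    ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
    """
)

cur.close()
conn.close()
```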

Handling a Large Database: Scale-Up vs. Scale-Out

If you ask a DBA a general question, more often than not, the answer will be “It depends”. When you ask whether it is better to scale-up versus scale-out the answer is, of course, “It depends”.

Scaling-up has gotten a bad reputation for the simple reason that larger servers have a worse price/performance ratio than commodity machines. In other words, when purchasing high-end servers, the performance per dollar spent declines. So the obvious answer would seem to be scale-out. There are, however, two other considerations.

The first is that if scaling-up simply means purchasing more RAM or a faster disk (e.g. an SSD), it may be more cost-effective to simply upgrade your machine, which is a low-cost form of scaling up. The second is the costs other than the server hardware itself. These can include additional software licenses (database, application, etc.), rewriting the application to scale-out, or even networking gear, for example a new switch if your current one has no free ports. You should consider all of these when deciding how to scale your database.

Generally speaking, especially if you are using open source software, scaling out is the most cost-effective and most scalable option. You can usually get more out of a large collection of commodity servers than one large specialty server. Furthermore, if you are running your database in the cloud—where the default is scale-out—you will want a scale-out database architecture. See Cloud Considerations for Large Databases for more detail. Also, as mentioned above, consider the cost of rewriting your software to handle scaling out. This is addressed in the following section.

DBMS Architecture Considerations

When dealing with a large database, the architecture of the DBMS—in conjunction with database design, which we address later—determines its scaling profile. There is an axiom in the DBMS world: The less the database does, the faster it can do it. So let us look at the DBMS architectures from that perspective:

  1. NoSQL: SQL provides a powerful language for manipulating data. NoSQL describes a class of DBMS, also known as key-value stores, that trades much of that power for scalability. The application provides a key to the NoSQL database, which returns the value associated with that key. NoSQL pushes much of the processing to the application, which makes the database very scalable but shifts that load to the application. Unfortunately, there is no free lunch. For example, if you want to do a join, your application must pull all of the data into the application tier and do the join itself (a minimal application-side join is sketched after this list). If your application tier scales and performs well, this approach may work for you, but it can result in a lot of data shipping, as opposed to processing the data where it resides, which can carry a performance penalty depending upon the application. Pushing functionality into the application tier can also result in higher costs to code, debug and maintain your application. NoSQL also relaxes consistency—one of the ACID properties—but assuming that is not a problem for your application, NoSQL can provide nice horizontal scalability.

  2. Sharding: The basic principle behind sharding is to break a slow or unwieldy large database into a lot of quick little databases, each with dedicated compute and storage. Each database, or shard, is independent, meaning it has no knowledge of, or coordination with, the other shards. Because the shards have no dependencies on one another, any process that involves more than one shard must be handled in the application, much like the NoSQL example above. For example, a join or range scan requires loading the data into the application and having the application operate across that data, instead of letting the database handle that process. Sharding also requires that the application route each database request to the specific shard responsible for that data (a simple hash-based routing sketch appears after this list). As your data grows, it can suffer skew, forcing you to move data among shards, a process called re-sharding. Re-sharding can be quite painful. There are tools such as DBShards, ScaleBase and ScaleArc to facilitate the sharding process.

  3. Database Virtualization: Another way to scale your large database is database virtualization. Database virtualization provides the application with a single logical database, but the actual database calls are handled by an elastic cluster of databases, also known as the compute tier. This compute tier relies on a scalable tier of storage nodes, the storage tier.

    Database virtualization acts as a single database, when it is in fact running across an elastic cluster of database servers or virtual instances. Unlike NoSQL and sharding, database virtualization allows traditional database functions such as counts, range scans and joins to be handled seamlessly inside the database. Database virtualization also provides consistency, unlike NoSQL, which only provides “eventual” consistency. Database virtualization also offers function shipping, moving processing to the storage nodes to reduce network traffic and latency. This provides parallel processing similar to Hadoop’s MapReduce. ScaleDB is a leader in database virtualization.

  4. The Memory Balancing Act: Memory access can be roughly 100X faster than disk access, so you want to maximize your use of memory. Solid State Disks (SSDs) sit somewhere in the middle in terms of performance, making them another option when attempting to boost performance. Some large databases run exclusively in memory, meaning that both the data and the indexes must reside in memory. Other databases allow the data to reside on disk—ideally caching as much as possible in memory—but require that all indexes reside in memory. Still others maximize memory for both indexes and data, but allow them both to overflow onto disk.

    When selecting a DBMS for your large database, be sure to consider how it uses memory as well as the target platform. For example, if you plan to support a large database and want to run it in a public cloud that limits the physical RAM available, you may find that your data swamps an in-memory database, and it may even overwhelm a database that requires all indexes to reside in memory. Be sure to consider planned growth and RAM limitations when selecting a DBMS.

    If you are considering database virtualization, you will also want to explore local caching. When using database virtualization, some databases (like MySQL Cluster) do not provide caching of data in memory on the compute tier, while others (like ScaleDB) do support local caching. The benefit of local caching is that database requests served from a local cache are much faster than disk access, by a factor of 10-100, so caching at the compute tier can boost performance considerably (a simple cache-aside sketch appears after this list).
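
To make item 1 concrete, here is a minimal sketch of an application-side join against a key-value store. The two dicts stand in for hypothetical users and orders keyspaces; this illustrates the data shipping involved, not the API of any particular NoSQL product.

```python
# Application-side join sketch for a key-value store (see item 1 above).
# The two dicts stand in for data fetched from hypothetical "users" and
# "orders" keyspaces; with NoSQL, the join logic lives in your application.

users = {                       # key: user_id -> value: user record
    1: {"name": "Alice"},
    2: {"name": "Bob"},
}
orders = {                      # key: order_id -> value: order record
    101: {"user_id": 1, "total": 25.00},
    102: {"user_id": 2, "total": 40.00},
    103: {"user_id": 1, "total": 15.50},
}

def join_users_orders():
    # Roughly: SELECT u.name, o.total FROM users u JOIN orders o ...
    # Every order and every referenced user is shipped to the application tier.
    results = []
    for order in orders.values():
        user = users[order["user_id"]]      # one key lookup per order
        results.append({"name": user["name"], "total": order["total"]})
    return results

print(join_users_orders())
```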
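
For item 2, the sketch below shows hash-based request routing, where the application (not the database) picks the shard for each key, and why re-sharding is painful: changing the shard count remaps most keys. The shard names are hypothetical placeholders for real connections.

```python
# Hash-based shard routing sketch (see item 2 above). The application, not the
# database, decides which shard owns each customer_id. The shard names are
# hypothetical placeholders for real connections.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(customer_id, shards=SHARDS):
    # Stable hash of the shard key -> index into the shard list.
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]

print(shard_for(12345))   # every request for this customer goes to one shard

# Why re-sharding hurts: adding a shard changes the mapping for most keys,
# so their rows must physically move between databases.
moved = sum(
    1 for cid in range(10_000)
    if shard_for(cid) != shard_for(cid, SHARDS + ["shard-4"])
)
print(f"{moved} of 10000 keys would move when growing from 4 to 5 shards")
```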
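
And for the local caching discussed just above, here is a minimal cache-aside sketch. It is not ScaleDB’s or MySQL Cluster’s actual mechanism; a plain dict stands in for the storage tier, and the point is simply that repeated reads served from memory on the compute tier avoid a round trip to the storage nodes.

```python
# Minimal cache-aside sketch: serve repeated reads from memory on the compute
# tier instead of going back to the storage tier for every request.
# `storage_tier` is a plain dict standing in for the (slower) storage nodes.

storage_tier = {"user:1": {"name": "Alice"}, "user:2": {"name": "Bob"}}
local_cache = {}  # in-memory cache held by one compute node

def read(key):
    # 1. Serve from the local cache when possible (RAM, roughly 10-100x
    #    faster than a disk-backed storage-node read).
    if key in local_cache:
        return local_cache[key]
    # 2. Cache miss: fetch from the storage tier and remember the result.
    value = storage_tier[key]          # stand-in for a network + disk read
    local_cache[key] = value
    return value

def write(key, value):
    # Writes go to the storage tier; update the local copy so readers on this
    # compute node do not see stale data.
    storage_tier[key] = value          # stand-in for a storage-node write
    local_cache[key] = value

print(read("user:1"))   # miss: fetched from the storage tier
print(read("user:1"))   # hit: served from local memory
```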

DBAs: Worth Their Weight in Gold

A good DBA will have experience scaling databases. They can analyze the situation, identify the bottlenecks, and act to optimize scaling for your large database. The following sections highlight the kinds of considerations a DBA weighs.

Defining and Addressing Performance Bottlenecks

The first step in resolving problems associated with a large database is to diagnose the causes of the performance bottlenecks. As with all things database-related, there are tradeoffs. For example, if your database has a very high read/write ratio (many more reads than writes), indexes can dramatically improve read performance. However, if it has a very low read/write ratio (many more writes than reads), then updating the indexes on every write can reduce overall database performance. The same principle applies at the more granular table level as well. Whether or not to add indexes is just one of many such issues a DBA must consider when scaling a large database.
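
To make the tradeoff concrete, here is a small self-contained sketch. It uses SQLite (Python’s built-in sqlite3 module) purely as a stand-in for MySQL, with a hypothetical orders table: once the index exists, the query plan flips from a full table scan to an index lookup, but every subsequent write must also maintain that index.

```python
# Small illustration of the index tradeoff, using SQLite (standard library)
# as a stand-in for MySQL; the `orders` table is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(100_000)],
)

# Read-heavy workload: without an index, this filter is a full table scan.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchall())

# Adding the index turns the same query into an index lookup...
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchall())

# ...but every subsequent INSERT/UPDATE now also has to maintain the index,
# which is exactly the cost a write-heavy (low read/write ratio) table pays.
```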

Cloud Considerations for Large Databases

When running on a public cloud, there are finite limits that you must consider. You cannot simply stuff more RAM in a box or add an SSD like you would with an in-house server. These constraints limit the degree to which you can scale-up, and at some point they force you to scale-out. It is important to consider this as you plan for growth and select your DBMS. These limits include RAM, CPU, network bandwidth and storage (e.g. Amazon EBS volumes are limited to 1TB). If you are going to grow beyond the constraints of the largest instance on the cloud, you should be prepared to scale-out from the start.
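
One way to act on that advice is a back-of-the-envelope growth projection: given today’s data size and a growth rate, estimate when you will outgrow the largest instance or volume available. All of the numbers below are hypothetical placeholders.

```python
# Back-of-the-envelope check: when does projected data size exceed a cloud
# instance/volume limit? All numbers are hypothetical placeholders.

current_size_gb = 300          # data size today
monthly_growth_rate = 0.08     # 8% growth per month
instance_limit_gb = 1024       # e.g. a 1 TB volume limit

size = current_size_gb
months = 0
while size <= instance_limit_gb:
    size *= 1 + monthly_growth_rate
    months += 1

print(f"At {monthly_growth_rate:.0%} monthly growth, ~{months} months until "
      f"{current_size_gb} GB outgrows a {instance_limit_gb} GB limit; "
      f"plan to scale out before then.")
```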

Availability Considerations for Large Databases

Everyone would love to have a database that never fails, one that is highly-available. However, high-availability entails costs, both financially and in terms of performance. Highly-available DBMSs typically charge a premium for this capability. But putting financial costs aside, they also impose a performance penalty: high-availability means writing to more than one place and waiting for the slowest of those writes.

ScaleDB provides an optional configuration that allows writing to memory on two or more storage nodes (for high-availability) and then flushing to disk outside of the transaction. This mode actually increases performance, because the slowest portion of the transaction—writing to disk—is handled outside of the transaction, so it does not hold up the commit.
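
The tradeoff can be sketched as a toy latency model. This is an illustration of the general idea, not ScaleDB’s actual implementation, and all latencies are hypothetical: a commit that waits for every replica’s disk write pays for the slowest disk, while acknowledging once the data is in memory on each replica keeps the disk flush out of the commit path.

```python
# Toy latency model for the high-availability tradeoff described above.
# Not any product's actual implementation; all latencies are hypothetical.

MEMORY_WRITE_MS = 0.1   # write into a replica's memory
DISK_FLUSH_MS = 5.0     # flush that write to the replica's disk
replicas = 2

# Option A: commit only after every replica has written to disk.
# The transaction waits for the slowest replica (memory + disk on each).
sync_commit_ms = max([MEMORY_WRITE_MS + DISK_FLUSH_MS] * replicas)

# Option B: commit once the data is in memory on every replica; the disk
# flush happens afterwards, outside the transaction.
memory_commit_ms = max([MEMORY_WRITE_MS] * replicas)

print(f"commit after disk on {replicas} replicas:   {sync_commit_ms:.1f} ms")
print(f"commit after memory on {replicas} replicas: {memory_commit_ms:.1f} ms")
```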

Large Database Backup Considerations

When dealing with a large database, especially one that spans multiple servers, the backup complexity is compounded. Backup is a balance between your business’ data loss tolerance and its backup budget. If you can tolerate 24-hours of data loss, then a tape backup may prove sufficient and cost-effective. However, if you cannot handle any data loss, you may need to invest the money in a highly-available DBMS.

Another consideration when developing a backup plan for your large database is whether some tables can be backed up less frequently than others. This can lighten the load, compared to applying a one-size-fits-all backup plan across the entire database. You can also consider backing up only the diffs (changes in the data since the last backup) to speed up the process, but then you need to consider the impact this will have on recovery time.
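
Here is a minimal sketch of the “not every table needs the same schedule” idea, assuming the mysqldump client is installed and on the PATH; the database and table names are hypothetical. Hot tables are dumped daily, slow-changing reference tables only on Sundays.

```python
# Sketch: back up hot tables more often than slow-changing ones.
# Assumes the `mysqldump` client is installed and on the PATH; the database
# and table names are hypothetical placeholders.
import datetime
import subprocess

DB = "appdb"
DAILY_TABLES = ["orders", "payments"]        # change constantly
WEEKLY_TABLES = ["countries", "tax_rates"]   # rarely change

def dump_tables(tables, label):
    stamp = datetime.date.today().isoformat()
    outfile = f"{DB}-{label}-{stamp}.sql"
    with open(outfile, "wb") as f:
        # --single-transaction gives a consistent snapshot of InnoDB tables
        # without locking them for the duration of the dump.
        subprocess.run(
            ["mysqldump", "--single-transaction", DB, *tables],
            stdout=f,
            check=True,
        )
    return outfile

if __name__ == "__main__":
    print(dump_tables(DAILY_TABLES, "daily"))
    if datetime.date.today().weekday() == 6:   # Sundays only
        print(dump_tables(WEEKLY_TABLES, "weekly"))
```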

Finally, your backup plan should consider the time it takes to recover and go live after a failure. This period of time is known as the recovery window. The recovery window should allocate time for each of the following steps: (1) identify the failure; (2) plan the recovery; (3) execute the recovery plan; (4) wait for the recovery process to complete and the recovered system to go live. Keep in mind that as your database grows, the time required for step #4 will also grow, since it depends on the amount of data and the complexity of the database. Add the time for all of these steps and you have your recovery window. Since the system is down during the recovery window, this plan should be blessed by the business people in advance of your first failure. Remember, for transactional businesses down-time directly equates to lost revenue. If you find that the business has little or no tolerance for data loss and/or downtime, then you should seriously consider a highly-available database.
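
As a worked, entirely hypothetical example of adding up the steps, and of how step #4 grows with the database:

```python
# Hypothetical recovery-window arithmetic for the four steps above.
# All durations are made-up planning numbers.

identify_failure_min = 15
plan_recovery_min = 15
execute_plan_min = 30

# Step 4 scales with data volume: assume the restore streams at ~100 GB/hour.
database_size_gb = 500
restore_min = database_size_gb / 100 * 60

recovery_window_min = (identify_failure_min + plan_recovery_min
                       + execute_plan_min + restore_min)
print(f"Recovery window: {recovery_window_min / 60:.1f} hours "
      f"(step 4 alone: {restore_min / 60:.1f} hours)")

# Double the data and step 4 roughly doubles, which is why the business needs
# to sign off on this number before the first failure, not after.
```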

Large Database Conclusion

If you are dealing with a large database, or one that will become large over time, there are many considerations that will influence your selection of the DBMS and the design of the database. Is your challenge the hardware, DBMS, high volume of data, or high throughput? Any one of these things can make your database seem slow and unwieldy. Be sure to include business considerations in your database plan, including the business’ tolerance for data loss and downtime. Are you planning to run your database in the cloud? If so, plan to scale-out when you hit the limits of a single database server. And, of course, work with a top-notch DBA who can tune the database to extract maximum performance. If you want to minimize your large database scaling challenges, we hope that you will consider using ScaleDB. Regardless of which DBMS you use, we wish you good luck!