Effective Database Indexing: A Guide for Software Engineers

This article explores the vital role database schema design plays in the performance of backend systems. It delves into common pitfalls like underindexing, overindexing, and indexes on volatile columns, offering practical solutions such as thoughtful indexing, denormalization, and in-memory caching.

databases
data
software architecture

By Senne Dirkx

On Sun Jan 14 2024

[Cover image: a massive data warehouse building in a cyberpunk city]

If you are an application-level software engineer who often deals with backend systems, I can almost guarantee you have encountered performance issues where the root cause was poor database schema design. I’ve encountered several of those in my career. While navigating these challenges, I made a few mistakes. However, each misstep became a valuable lesson. Today, I’m excited to share these insights with you.

Databases and performance

Let’s get it out of the way: databases are complex. There are hundreds (thousands?) of different database system products available. Each one has its own problems, quirks and performance tradeoffs. There is a lot that can go wrong in a backend system that uses a database: sending inefficient SQL queries, using object-relational mappers (ORMs) in unintended ways (e.g. the N+1 query problem), allocating an inadequate amount of resources, and much more. But one of the harder types of problems to solve is poor database schema design.

Database schema design is an art form, in my opinion. There is so much that goes into it. You need to ask yourself the following questions when designing a schema for a feature: Which tables do I need and what should be the relations between them? Which tables contain which columns? What are the data types of those columns? What are the primary keys of those tables and what foreign keys am I using for the relations? Do I need other constraints to ensure the integrity of the data? And perhaps most importantly for performance: do I need indexes on my tables and, if so, on which columns?
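To make those questions concrete, here is a toy answer for a hypothetical users/orders feature, using Python’s built-in sqlite3 module. All names are illustrative, not taken from any real system:

```python
import sqlite3

# Toy schema for a hypothetical users/orders feature (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,            -- primary key
        email TEXT NOT NULL UNIQUE            -- integrity constraint
    );
    CREATE TABLE orders (
        id           INTEGER PRIMARY KEY,
        user_id      INTEGER NOT NULL REFERENCES users(id),   -- foreign key
        amount_cents INTEGER NOT NULL CHECK (amount_cents > 0),
        created_at   TEXT NOT NULL            -- ISO-8601 timestamp
    );
    -- And the performance question: index the expected access path.
    CREATE INDEX idx_orders_user_id ON orders (user_id);
""")
```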

This post focuses on the performance impact of indexes on database tables, because even in this small subset of database schema design problems, mistakes are far more common than you may think.

Indexing pitfalls

Underindexing

The most obvious pitfall (and the easiest to solve) is underindexing. This is where a database table does not have the necessary indexes on certain columns to function well. There will be prominent query patterns with WHERE clauses, JOIN clauses or ORDER BY clauses on columns without an index.

The solution here is simple. Document the queries that target the table and note the columns used in the clauses listed above. Carefully add indexes on these columns. However, you should also check whether the queries themselves can be improved. As we’ll see in the next section, overdoing indexes causes other issues.
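Here is a minimal sketch of that workflow, using Python’s built-in sqlite3 module and a made-up orders table. The query plan shows a full table scan on the WHERE column before the index exists, and an index search afterwards; the syntax for inspecting plans differs per database (e.g. EXPLAIN ANALYZE in PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (user_id, status) VALUES (?, ?)",
    [(i % 500, "open") for i in range(10_000)],
)

def plan(sql: str) -> str:
    # The last column of an EXPLAIN QUERY PLAN row describes the plan step.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

query = "SELECT * FROM orders WHERE user_id = 42"
print(plan(query))  # SCAN orders  (full table scan)

conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
print(plan(query))  # SEARCH orders USING INDEX idx_orders_user_id (user_id=?)
```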

Overindexing

Indexes can often magically improve query performance by several orders of magnitude. That’s why many engineers will just blindly add them where they think they will come in handy at some point in the future. This is probably a worse practice than just forgetting to add them. You will not believe how many unused or underutilized indexes exist in this world. Indexes are not magic and have two major downsides.

The first one is that they consume disk storage space. Depending on the underlying indexed column(s) and included columns, they can become massive. I have encountered systems where the storage space required for the indexes was nearly as large as the space required for the underlying data.

The second downside is that they slow down your writes to the table. A table with an index will have slower inserts, updates and deletes. With each row write, the database not only writes the row to the table but also to a strategic spot within each index. You may say “this speed decrease is negligible” and you would be wrong. It is definitely measurable, and especially on tables that are updated very frequently, it can become the leading cause of outages if you are not careful.

Both of these downsides translate to higher upkeep costs. Of course, if you need the indexes and they improve application performance, they will save you money. The solution here is therefore also quite simple: add indexes to a table if and only if you are certain you need them and can actually benefit from them.
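Most database systems can tell you which indexes go unused. As an illustration, assuming PostgreSQL and the psycopg2 driver (the connection string is a placeholder), the pg_stat_user_indexes view tracks how often each index has been scanned since statistics were last reset:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT schemaname, relname, indexrelname,
               pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
        FROM pg_stat_user_indexes
        WHERE idx_scan = 0                  -- never scanned since stats reset
        ORDER BY pg_relation_size(indexrelid) DESC
    """)
    for schema, table, index, size in cur.fetchall():
        print(f"{schema}.{table}: unused index {index} ({size})")
```

An idx_scan of zero makes an index a removal candidate, but be careful: indexes that back unique constraints, or that only serve rare but important jobs, can look unused in the statistics.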

Improper column order

In most cases, indexes on a single column suffice. However, there are certainly query patterns that benefit massively from composite indexes. These are indexes that span multiple columns. The order of columns in a composite index is crucial, and if chosen incorrectly, it can lead to suboptimal performance. If you have a composite index on column A and column B (in that order), then queries that only filter on column B will not be able to make use of this index.

The solution here is to analyze your query patterns to understand which combinations of columns are frequently used together, and to order the columns in composite indexes based on the most common query conditions. You may need additional indexes if some common queries filter only on a subset of the indexed columns and that subset is not a prefix of the composite index.
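A small demonstration of that prefix rule, again with sqlite3 and hypothetical table and column names: a composite index on (user_id, created_at) serves queries that filter on user_id, alone or together with created_at, but not queries that filter on created_at alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        user_id INTEGER,
        created_at TEXT,
        amount_cents INTEGER
    );
    CREATE INDEX idx_txn_user_time ON transactions (user_id, created_at);
""")

def plan(sql: str) -> str:
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Filter on the leading column: the composite index is usable. Typically:
# SEARCH transactions USING INDEX idx_txn_user_time (user_id=? AND created_at>?)
print(plan("SELECT * FROM transactions WHERE user_id = 7 AND created_at >= '2024-01-01'"))

# Filter only on the second column: full table scan. Typically: SCAN transactions
print(plan("SELECT * FROM transactions WHERE created_at >= '2024-01-01'"))
```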

Ignoring data distribution

We’ve already established that blindly adding indexes to columns is a terrible idea. I mentioned that it’s a great idea to add indexes only to columns that are very frequently used in WHERE, JOIN or ORDER BY (and other) clauses. However, it’s an even better idea to consider the data distribution of these columns as well.

Uneven data distribution across indexed columns can lead to poor index performance. If the data is not evenly distributed, the index might not effectively reduce the search space and only cause additional overhead.

A great example of this pitfall is an index on a timestamp column in a monetary transactions table. Let’s say there is a query that fetches old transactions for a user, defined as transactions that happened before the current month. Is it really a good idea to add an index to the timestamp column for this scenario? No, because over time the proportion of old transactions to total transactions will keep rising, and the index will become useless for reducing the search space.

It’s not trivial to solve data distribution issues like the example above. Regular monitoring and maintenance of the indexes and their statistics is advised. In many of these cases, improvements can be made with partial indexes or partitioning.
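One cheap monitoring trick is to measure the selectivity of the predicate yourself. A sketch, assuming the hypothetical transactions table from earlier and SQLite (the file name is a placeholder):

```python
import sqlite3
from datetime import date

conn = sqlite3.connect("app.db")  # placeholder; assumes the table exists
start_of_month = date.today().replace(day=1).isoformat()

total = conn.execute("SELECT count(*) FROM transactions").fetchone()[0]
old = conn.execute(
    "SELECT count(*) FROM transactions WHERE created_at < ?",
    (start_of_month,),
).fetchone()[0]

if total:
    print(f"'old transactions' predicate matches {old / total:.0%} of rows")
    # As this creeps toward 100%, an index on created_at no longer narrows
    # the search space and only adds write overhead.
```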

Index on volatile columns

Adding an index on a column that undergoes frequent updates can be dangerous to database performance. When you update a column that is part of an index, the index needs to be maintained and updated as well, because the row’s initial placement within the index quickly becomes outdated. This can be very resource-intensive. On top of that, it can lead to page splits, a fundamental cause of index fragmentation.

Indexes on UUID columns are notorious for index fragmentation. This is because the inserted values are randomly generated rather than ever-increasing, which makes it hard for some database engines to build efficient indexes on them. Regular index maintenance and rebuilding are strongly advised to mitigate this issue.

These indexes on volatile columns are not always avoidable. However, it’s very important to be aware of the behavior of database indexes in these scenarios. Plan scheduled maintenance windows for index rebuilding to avoid fragmentation. Some DBMSs provide features or tools to perform index maintenance during periods of low activity.
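As an example of what such maintenance can look like, assuming PostgreSQL 12+ and psycopg2 (the index name and connection string are placeholders): REINDEX ... CONCURRENTLY rebuilds a fragmented index without holding long write locks, which makes it a good fit for a low-activity window.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder connection string
conn.autocommit = True  # REINDEX ... CONCURRENTLY cannot run inside a transaction
with conn.cursor() as cur:
    # Rebuilds the index while still allowing concurrent reads and writes.
    cur.execute("REINDEX INDEX CONCURRENTLY idx_txn_user_time")
```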

General recommendations

In navigating these pitfalls, it becomes evident that thoughtful consideration is crucial when deciding where to apply indexes. It’s not just about the benefits they bring but also about understanding the potential drawbacks and finding a balance that optimizes the common query access patterns. Now, let’s delve into some general recommendations.

Index columns involved in joins

Complex join operations are often a recipe for disaster. It is therefore often (but not always) a good idea to place indexes on columns on which the joins happen.
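For illustration, a join between two made-up tables in sqlite3: with an index on the join column of the many side, the planner can probe the index for each outer row instead of repeatedly scanning the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount_cents INTEGER);
    CREATE INDEX idx_orders_user_id ON orders (user_id);
""")

for row in conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT u.email, o.amount_cents
    FROM users u JOIN orders o ON o.user_id = u.id
"""):
    print(row[3])
# A typical plan:
#   SCAN u
#   SEARCH o USING INDEX idx_orders_user_id (user_id=?)
```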

Denormalization

In some cases, it can be better to perform some form of denormalization of the database schema. This introduces redundancy and duplicate data, but allows you to avoid frequent complex joins and therefore need fewer indexes.

A (pretty drastic) example application of denormalization is to split a database table into a read-only table with the necessary (composite) indexes for fast lookups, and a write-heavy table with far fewer indexes or none at all. This is useful when your data is update-heavy only for a certain period of time and becomes read-only afterwards. The data is transferred from the write-heavy table to the read-only table in bulk operations, as sketched below. This keeps the write-heavy table small and, with no indexes to slow it down, writes stay fast. Meanwhile, the application can run efficient complex read queries against the read-only table that has the indexes.
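A sketch of that bulk hand-off, with hypothetical table names and sqlite3; the same idea translates to a scheduled job in any DBMS:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE txn_inbox (     -- write-heavy: no secondary indexes at all
        id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT, amount_cents INTEGER
    );
    CREATE TABLE txn_archive (   -- read-only: indexed for fast lookups
        id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT, amount_cents INTEGER
    );
    CREATE INDEX idx_archive_user_time ON txn_archive (user_id, created_at);
""")

def archive_settled(before: str) -> None:
    """Move rows older than `before` to the archive in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO txn_archive SELECT * FROM txn_inbox WHERE created_at < ?",
            (before,),
        )
        conn.execute("DELETE FROM txn_inbox WHERE created_at < ?", (before,))

archive_settled("2024-01-01")
```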

Partial indexes

If updates are concentrated on a subset of the data, consider using partial indexes. These indexes cover only a specific subset of the table, reducing the maintenance overhead during updates.
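Partial indexes are supported by, among others, PostgreSQL and SQLite. A sketch with a made-up status column: only the small "hot" subset of pending rows is indexed, and queries whose WHERE clause implies the predicate can still use the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT, amount_cents INTEGER
    )
""")
# Index only the small, frequently queried subset of rows.
conn.execute("""
    CREATE INDEX idx_txn_pending ON transactions (user_id)
    WHERE status = 'pending'
""")

print(conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM transactions WHERE user_id = 7 AND status = 'pending'"
).fetchone()[3])
# Typically: SEARCH transactions USING INDEX idx_txn_pending (user_id=?)
```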

In-memory cache

Caching is a great and (once set up) relatively easy-to-use mechanism to improve any data fetching operation. Tools like Redis and Memcached are exceptionally fast. A simple lookup in the cache costs you nearly nothing, and if it’s a hit, you can completely skip the (much slower) database query, improving the general performance of your application.

But like most things in life, there are some downsides. And with in-memory caching, one of them is… well, memory. Memory is a finite resource, and while an in-memory cache is advantageous, it’s not limitless. On top of that, the bigger the cache, the slower it tends to be.

The second downside is that your session state in the cache will not always be in perfect sync with the data in the database. Depending on the cache writing policy, it may take some time for cache or database items to be fully updated. You often want to give the cache items a relatively small lifetime, so that the cache is updated using the database’s state fairly regularly. Striking a balance between cache lifetime and consistency is essential, especially for applications requiring strong consistency.

Because there will probably be loads of sessions and a limited in-memory cache whose items have a short lifetime, cache misses will often occur, which means the application falls back to the (slower) database queries. It’s therefore crucial to complement caching with improvements to the underlying queries and database schema design.
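Putting the pattern together, here is a minimal cache-aside sketch, assuming the redis-py client, a Redis server on localhost, and the hypothetical transactions table from earlier; the key format and TTL are illustrative choices.

```python
import json
import sqlite3
import redis  # assumes the redis-py client and a Redis server on localhost

cache = redis.Redis(host="localhost", port=6379)
db = sqlite3.connect("app.db")  # placeholder database file

CACHE_TTL_SECONDS = 60  # a short lifetime keeps cache and database roughly in sync

def get_user_transactions(user_id: int) -> list:
    key = f"txns:{user_id}"            # illustrative key format
    cached = cache.get(key)
    if cached is not None:             # hit: skip the database entirely
        return json.loads(cached)
    rows = db.execute(                 # miss: fall back to the database
        "SELECT id, amount_cents FROM transactions WHERE user_id = ?",
        (user_id,),
    ).fetchall()
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(rows))  # expires automatically
    return rows
```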

Conclusion

In crafting an effective database schema, the pitfalls around index choices are clear – from underindexing to overindexing and everything in between. There’s no one-size-fits-all solution. Instead, success lies in continual analysis, adaptability and a commitment to creating resilient database schemas that evolve with the dynamic demands of our applications.