Exploring Memtables
In the realm of high-performance database management systems, Log-Structured Merge (LSM) Trees have risen to prominence for their ability to efficiently manage vast amounts of data. One of the key components of an LSM tree is the Memtable. Memtables are dynamic in-memory data structures that serve as the initial touchpoint for engaging with the underlying database storage engine.
At the outset, the storage engine employs memtables to buffer all of the incoming write requests. As the write requests accumulate and the memtable buffers fill up, all of the data residing in the memtables undergoes conversion into a format suitable for disk storage. Subsequently, this transformed data is transferred to secondary storage, where it takes the form of Sorted String Tables (SSTables).
Because memtables always hold the most recently written data, they also participate in every read operation that the storage engine performs. Before examining any of the SSTables located on disk, the storage engine always attempts to retrieve data from the currently available memtables first. It only resorts to the SSTables when no data can be located in the memtables.
In today's article, we will implement a fully functional Memtable component as part of a simple key-value database storage engine. In the process, we will apply knowledge from one of our previous articles, "Implementing a Skip List in Go", so please make sure to review that article before continuing further in order to fully comprehend the topic that we are about to discuss.
Without further ado, let's get started!
Laying the Groundwork
Let's begin by outlining the main components of our storage engine, starting with the main Memtable struct. In most commercial databases (Apache Cassandra and CockroachDB, for example), memtables are typically built on top of skip lists. This choice is rooted in extensive benchmarking results, which consistently highlight that skiplist-based memtables provide good overall performance for both read and write operations regardless of whether sequential or random access patterns are used (a factor that will become more apparent later, when we discuss SSTables).
Consequently, we can just wrap the skip list that we developed in our prior article, "Implementing a Skip List in Go", into a new structure called Memtable. Since the only constraint to the growth of our memtable would be the total amount of RAM available on the machine, it might be a good idea to establish an upper limit to signal when a memtable should stop expanding and be replaced by another memtable. This will also impact the on-disk file size of the SSTables that are eventually going to be created on secondary storage.
For this reason, we specify two additional attributes in our Memtable struct called sizeUsed and sizeLimit. Our goal is to ensure that sizeUsed never goes beyond sizeLimit. If it is about to do so, we simply trigger the creation of a new memtable to prevent the old one from crossing the limit.
We define some additional methods to help us in achieving this requirement:

- NewMemtable() to construct a new memtable initialized with an empty skip list and an externally specified size limit.
- HasRoomForWrite() to determine whether the memtable has adequate room for accommodating a specific key-value pair.
- Insert() to wrap the actual data insertion into the underlying skip list, while properly incrementing the sizeUsed counter.
Thus, our initial memtable implementation reads:
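(A minimal sketch follows; the skiplist package, its NewSkipList() constructor, and its Insert() method are assumptions carried over from the earlier article, and the import path is only a placeholder.)

// Package memtable wraps the skip list from the earlier article.
package memtable

import "example.com/lsm/skiplist" // placeholder import path

type Memtable struct {
	sl        *skiplist.SkipList
	sizeUsed  int // number of bytes currently occupied by key-value data
	sizeLimit int // upper bound that sizeUsed must never exceed
}

// NewMemtable constructs a memtable with an empty skip list and an
// externally specified size limit.
func NewMemtable(sizeLimit int) *Memtable {
	return &Memtable{
		sl:        skiplist.NewSkipList(),
		sizeLimit: sizeLimit,
	}
}

// HasRoomForWrite reports whether the given key-value pair still fits
// under the configured size limit.
func (m *Memtable) HasRoomForWrite(key, val []byte) bool {
	return m.sizeUsed+len(key)+len(val) <= m.sizeLimit
}

// Insert writes the key-value pair into the underlying skip list and
// accounts for the consumed space.
func (m *Memtable) Insert(key, val []byte) {
	m.sl.Insert(key, val)
	m.sizeUsed += len(key) + len(val)
}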
Moving on, we also outline our main DB structure. This structure represents our database storage engine, exposing a public API for inserting, deleting, and retrieving data:
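(A possible starting point; the memtableSizeLimit constant, the import path, and the exact shape of the memtables field are assumptions made for this sketch.)

package db

import "example.com/lsm/memtable" // placeholder import path

const memtableSizeLimit = 4 << 10 // 4 KiB per memtable (arbitrary for the demo)

type DB struct {
	memtables struct {
		mutable *memtable.Memtable   // the sole memtable accepting writes
		queue   []*memtable.Memtable // all memtables, oldest first
	}
}

// Open initializes the storage engine with a single empty memtable.
func Open() (*DB, error) {
	db := &DB{}
	m := memtable.NewMemtable(memtableSizeLimit)
	db.memtables.mutable = m
	db.memtables.queue = append(db.memtables.queue, m)
	return db, nil
}

// The public API, fleshed out in the sections that follow:
//
//	func (d *DB) Set(key, val []byte)
//	func (d *DB) Get(key []byte) ([]byte, error)
//	func (d *DB) Delete(key []byte)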
With both structures in place, we can continue working out the details of our implementation.
Inserting Data
To insert data, it is sufficient to instantiate a new DB struct and call its Set() method from our client code:
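(The package alias, import path, and sample keys below are made up for the demo.)

package main

import (
	"log"

	memdb "example.com/lsm/db" // placeholder import path for our engine
)

func main() {
	db, err := memdb.Open()
	if err != nil {
		log.Fatal(err)
	}
	db.Set([]byte("name"), []byte("Alice"))
	db.Set([]byte("city"), []byte("Berlin"))
}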
The Open() method initializes the memtable component of our storage engine. As you might have noticed, the db.memtables field holds two sub-fields: mutable and queue. The mutable field points to the most recent memtable, which is also the only memtable capable of fulfilling write requests, while queue holds pointers to all of the memtables currently in existence.
Remember that as older memtables fill up, we replace them with newer ones. Each memtable sits patiently in the queue, waiting for its turn to be flushed to disk in the form of an SSTable. Only the mutable memtable, however, is not at capacity and has room for more writes. All other memtables are full and are therefore read-only.
This becomes more apparent as we complete the implementation of the Set() method and define its related helper method, rotateMemtables(), listed below:
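(A sketch building on the structures assumed above.)

func (d *DB) Set(key, val []byte) {
	m := d.memtables.mutable
	if !m.HasRoomForWrite(key, val) {
		// The mutable memtable is about to cross its size limit,
		// so freeze it and start a fresh one.
		m = d.rotateMemtables()
	}
	m.Insert(key, val)
}

// rotateMemtables creates a new mutable memtable and appends it to the
// queue; the previous mutable memtable becomes read-only.
func (d *DB) rotateMemtables() *memtable.Memtable {
	m := memtable.NewMemtable(memtableSizeLimit)
	d.memtables.mutable = m
	d.memtables.queue = append(d.memtables.queue, m)
	return m
}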
This gives us a memtable component fully capable of handling data insertion and allows us to move on with our implementation.
Retrieving Data
The process of retrieving stored information involves traversing the memtable queue in reverse order (from the newest to the oldest memtable) to determine whether an item matching the provided search key is present in any of the underlying skip lists. If a match is identified, we stop looking through the remainder of the queue and simply return the value of the discovered data item.
The code below expresses this idea:
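(A sketch; the ErrKeyNotFound sentinel error is an assumed addition to the db package and requires the errors import.)

var ErrKeyNotFound = errors.New("key not found")

// Get searches the memtables from newest to oldest and returns the first
// match it encounters.
func (d *DB) Get(key []byte) ([]byte, error) {
	for i := len(d.memtables.queue) - 1; i >= 0; i-- {
		m := d.memtables.queue[i]
		if val, ok := m.Get(key); ok {
			return val, nil
		}
	}
	return nil, ErrKeyNotFound
}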
Since none of our earlier listings included the body of the Get() method from our Memtable struct, we are listing its implementation below:
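(The ([]byte, bool) shape of the skip list's lookup method is an assumption carried over from the earlier article.)

// Get looks the key up in the underlying skip list.
func (m *Memtable) Get(key []byte) ([]byte, bool) {
	return m.sl.Get(key)
}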
As you can see, the Get() method simply searches the underlying skip list and returns the corresponding search result (if one is found).
We can then use the following code in our client in order to search for a specific key:
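(Continuing the client snippet from the "Inserting Data" section; the fmt and log imports are assumed.)

val, err := db.Get([]byte("name"))
if err != nil {
	log.Fatal(err) // e.g. the key was not found
}
fmt.Println(string(val)) // prints "Alice" for the data inserted earlier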
This code will look into all of the available memtables to find a particular key-value pair.
Introducing Tombstones
Everything seems great so far, but how do we handle deletions? Remember that only the mutable memtable is open for modification, and all other entries in the memtable queue are treated as read-only.
If a key resides in the mutable memtable, we could theoretically delete it by simply removing the corresponding data item from the underlying skip list. However, if the same key is duplicated in one of the earlier memtables, we will be out of luck, and further retrieval requests will continue producing results, returning the older value recorded for that key (the one residing in the older memtable).
This is where tombstones come into play. Essentially, a tombstone is a marker indicating that older data records corresponding to a given key should be shadowed due to a requested deletion. So, instead of removing any actual data from any underlying skip lists, we simply insert a special tombstone record into the latest memtable, thus shadowing all previous values for the provided key.
Having to support tombstone records means that we need a way to distinguish between regular insertions and tombstones. To do this, we can augment the format of our data records so that instead of storing a simple key-value pair, we also somehow include metadata indicating the type of the record. This can be accomplished by modifying the value component of each key-value pair.
We can embed the metadata directly into the value component by using the following encoding schema:
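(One possible layout, assumed here for illustration; the exact schema may differ: a single meta byte is prepended to the raw value.)

// Assumed encoding of the value component:
//
//	+-----------+--------------------------------+
//	| 1 byte    | n bytes                        |
//	| meta/kind | raw value (empty for tombstone)|
//	+-----------+--------------------------------+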
We can then introduce a new set of structures to encapsulate the encoding operations:
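(A sketch of what these structures could look like; EncodedValue is named in the article, while the opKind, Encoder, and Decoder names are our own assumptions.)

type opKind byte

const (
	opKindSet    opKind = iota + 1 // a regular key-value insertion
	opKindDelete                   // a tombstone
)

// EncodedValue gives structured access to a value decoded from the skip list.
type EncodedValue struct {
	kind opKind
	val  []byte
}

func (v *EncodedValue) IsTombstone() bool { return v.kind == opKindDelete }
func (v *EncodedValue) Value() []byte     { return v.val }

// Encoder prepends the meta byte to a raw value.
type Encoder struct{}

func (e *Encoder) Encode(kind opKind, val []byte) []byte {
	buf := make([]byte, 0, len(val)+1)
	buf = append(buf, byte(kind))
	return append(buf, val...)
}

// Decoder splits an encoded value back into its meta byte and raw value.
type Decoder struct{}

func (d *Decoder) Decode(data []byte) *EncodedValue {
	return &EncodedValue{kind: opKind(data[0]), val: data[1:]}
}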
To make our memtables function with the new values of type EncodedValue rather than the plain []byte slices that we used before, we have to apply the following changes:
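(A sketch of the updated Memtable; giving it encoder and decoder fields is an assumption on our part.)

type Memtable struct {
	sl        *skiplist.SkipList
	sizeUsed  int
	sizeLimit int
	encoder   Encoder
	decoder   Decoder
}

// Insert now stores an encoded value marking a regular write.
func (m *Memtable) Insert(key, val []byte) {
	ev := m.encoder.Encode(opKindSet, val)
	m.sl.Insert(key, ev)
	m.sizeUsed += len(key) + len(ev)
}

// InsertTombstone records a deletion marker for the key.
func (m *Memtable) InsertTombstone(key []byte) {
	ev := m.encoder.Encode(opKindDelete, nil)
	m.sl.Insert(key, ev)
	m.sizeUsed += len(key) + len(ev)
}

// Get now returns a decoded value so that callers can detect tombstones.
func (m *Memtable) Get(key []byte) (*EncodedValue, bool) {
	raw, ok := m.sl.Get(key)
	if !ok {
		return nil, false
	}
	return m.decoder.Decode(raw), true
}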
We also have to modify the Get() method of our main DB structure, so that it starts recognizing tombstones:
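(A sketch building on the EncodedValue type introduced above.)

func (d *DB) Get(key []byte) ([]byte, error) {
	for i := len(d.memtables.queue) - 1; i >= 0; i-- {
		m := d.memtables.queue[i]
		ev, ok := m.Get(key)
		if !ok {
			continue // not in this memtable, try an older one
		}
		if ev.IsTombstone() {
			// A tombstone shadows every older value for this key.
			return nil, ErrKeyNotFound
		}
		return ev.Value(), nil
	}
	return nil, ErrKeyNotFound
}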
As you can see, the Get() method will state that a key is not found when, in fact, the key is present, but a tombstone shadows its prior values.
The Delete() method then becomes almost identical to the Set() method, the only difference being that we call m.InsertTombstone(key) instead of m.Insert(key, val):
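(A sketch mirroring the Set() method shown earlier.)

func (d *DB) Delete(key []byte) {
	m := d.memtables.mutable
	if !m.HasRoomForWrite(key, nil) {
		m = d.rotateMemtables()
	}
	m.InsertTombstone(key)
}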
To get rid of any repeating code, we can narrow this down to:
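(One way to factor out the shared logic; the helper's name is our own invention.)

// prepMemtableForKV returns a memtable that is guaranteed to have room
// for the given key-value pair, rotating the memtables if necessary.
func (d *DB) prepMemtableForKV(key, val []byte) *memtable.Memtable {
	m := d.memtables.mutable
	if !m.HasRoomForWrite(key, val) {
		m = d.rotateMemtables()
	}
	return m
}

func (d *DB) Set(key, val []byte) {
	m := d.prepMemtableForKV(key, val)
	m.Insert(key, val)
}

func (d *DB) Delete(key []byte) {
	m := d.prepMemtableForKV(key, nil)
	m.InsertTombstone(key)
}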
With that, we have fully working deletion using tombstones to shadow old records.
Flushing Memtables to Disk
Even with generous amounts of RAM available, we cannot store an unlimited amount of memtables in our memtable queue. At some point, we have to persist the accumulated data on disk to both keep the memory usage under control and, more importantly, to allow our data to survive through server restarts. This process is known as flushing.
When performing a flush, we traverse all of the read-only memtables residing in the memtable queue, starting from the oldest one and moving towards the newest. We don't touch the mutable memtable. During the traversal, we transform the data into a format that's more suitable for on-disk storage. Ultimately, the converted data is saved on disk in the form of SSTable files (binary files with an *.sst extension).
We will only briefly touch on the intricacies of SSTables in today's article, but we can say that SSTables use a structured binary format that maintains persisted data in sorted order, while facilitating quick and efficient searching.
We can initiate a flush when a specific threshold is exceeded. For instance, we can specify that if the combined size of all of the memtables exceeds a particular threshold in bytes, the data in all of the read-only memtables should be flushed to disk. Then, each time a new record is added to a memtable, we can compare the overall size of all existing memtables against this defined threshold and trigger a flush if necessary.
We can use the following code to express this intention:
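(A sketch; the maybeFlushMemtables name and the memtableFlushThreshold value are assumptions made for illustration.)

const memtableFlushThreshold = 8 << 10 // arbitrary example threshold (8 KiB)

func (d *DB) Set(key, val []byte) {
	m := d.prepMemtableForKV(key, val)
	m.Insert(key, val)
	d.maybeFlushMemtables() // Delete() receives the same call
}

// maybeFlushMemtables compares the combined size of all memtables against
// the configured threshold and triggers a flush once it is exceeded.
func (d *DB) maybeFlushMemtables() {
	var totalSize int
	for _, m := range d.memtables.queue {
		totalSize += m.Size()
	}
	if totalSize > memtableFlushThreshold {
		_ = d.flushMemtables() // error handling omitted in this sketch
	}
}

func (d *DB) flushMemtables() error {
	// Purposefully left empty for now; implemented later in the article.
	return nil
}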
This calls for a small change in our Memtable struct, namely the inclusion of a new method that makes it possible to obtain information about the current memtable size:
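// Size reports the number of bytes currently occupied by the memtable's data.
func (m *Memtable) Size() int {
	return m.sizeUsed
}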
You may have noticed in the listing above that the body of the flushMemtables() method, which performs the actual work, has been purposefully left empty. This is because we need to clarify a couple of things. First, we need to figure out how to scan our memtables sequentially. Second, we need to agree on the precise data format to use for building our SSTables.
The name SSTable stands for Sorted String Table and implies that data within an *.sst file is stored sorted by key in ascending order. The skip lists that our memtables are based upon already maintain their data sorted in ascending order, so converting their data to the agreed-upon SST data format while maintaining the correct order is a relatively simple task, assuming that we have a mechanism in place that allows us to scan the skip lists sequentially.
Unfortunately, the skip list implementation that we discussed in the original "Implementing a Skip List in Go" article doesn't provide such a mechanism, so we have to implement it. Knowing that, by definition, the very first level of a skip list (Level 0) contains all of its items makes the implementation quite straightforward: to scan the list sequentially, we just need to iterate over its first level.
Taking inspiration from the Iterator interface in Java, we end up with the following implementation:
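(A sketch placed in the same package as the skip list, so it can reach the unexported node fields; the head, forward, key, and val field names are assumptions about the earlier article's internals.)

// Iterator walks Level 0 of the skip list, which by definition contains
// every item, in ascending key order.
type Iterator struct {
	current *node
}

// HasNext reports whether there is at least one more item to visit.
func (i *Iterator) HasNext() bool {
	return i.current != nil
}

// Next returns the current item and advances the iterator along Level 0.
func (i *Iterator) Next() (key, val []byte) {
	key, val = i.current.key, i.current.val
	i.current = i.current.forward[0]
	return key, val
}

// Iterator positions a new iterator at the first real node of the list
// (the head node itself is a sentinel and holds no data).
func (sl *SkipList) Iterator() *Iterator {
	return &Iterator{current: sl.head.forward[0]}
}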
In addition to that, we define a new method in our Memtable struct that enables us to instantiate new iterators:
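// Iterator exposes sequential access to the memtable's contents.
func (m *Memtable) Iterator() *skiplist.Iterator {
	return m.sl.Iterator()
}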
Using this method, we can now scan any memtable sequentially in the following way:
i := m.Iterator()
for i.HasNext() {
	key, val := i.Next()
	// do something with "key" and "val".
}
With the sequential scanning mechanism in place, the only thing left to address is the data format we'll employ for on-disk data storage.
Writing SSTables
We would like to keep this article focused on memtables. With that in mind, we are going to employ a very basic data format for our SSTables. Each *.sst file will be a simple sequence of data blocks. Each data block represents a specific key-value pair and has the following structure:
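(An assumed layout used throughout the sketches below; the article's exact format may differ: the key and value lengths are written as uvarints, followed by the raw key and value bytes.)

//	+-------------+---------------+-----------+-------------+
//	| key length  | value length  | key       | value       |
//	| (uvarint)   | (uvarint)     | (n bytes) | (m bytes)   |
//	+-------------+---------------+-----------+-------------+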
To encapsulate the logic for transforming a memtable into an SSTable and persisting everything on disk, we introduce a new struct called sstable.Writer exporting the following methods:

- NewWriter() for constructing a new sstable.Writer instance, accepting a writable file descriptor that's wrapped for performing buffered I/O in 4 KiB data blocks.
- Process() for iterating a memtable sequentially and converting its data to SSTable data blocks (using the writeDataBlock() method, which performs the actual encoding).
- Close() for flushing any remaining data from the data buffers and forcing the kernel to persist all changes to disk before closing the open file handles.
The full code reads:
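(A sketch under the assumptions above; to keep it free of a dependency on the memtable package, Process() accepts the memtable's iterator rather than the memtable itself, which may differ from the article's actual signature.)

package sstable

import (
	"bufio"
	"encoding/binary"
	"io"
)

// syncCloser captures what we need from the supplied file; *os.File satisfies it.
type syncCloser interface {
	io.Writer
	Sync() error
	Close() error
}

// iterator is the minimal contract we assume the memtable iterator satisfies.
type iterator interface {
	HasNext() bool
	Next() (key, val []byte)
}

type Writer struct {
	file syncCloser
	bw   *bufio.Writer
}

// NewWriter wraps the supplied file for buffered I/O in 4 KiB blocks.
func NewWriter(file syncCloser) *Writer {
	return &Writer{
		file: file,
		bw:   bufio.NewWriterSize(file, 4096),
	}
}

// Process scans the memtable sequentially and emits one data block per
// key-value pair.
func (w *Writer) Process(it iterator) error {
	for it.HasNext() {
		key, val := it.Next()
		if err := w.writeDataBlock(key, val); err != nil {
			return err
		}
	}
	return nil
}

// writeDataBlock encodes a single key-value pair using the block layout
// sketched earlier.
func (w *Writer) writeDataBlock(key, val []byte) error {
	buf := make([]byte, 2*binary.MaxVarintLen64+len(key)+len(val))
	n := binary.PutUvarint(buf, uint64(len(key)))
	n += binary.PutUvarint(buf[n:], uint64(len(val)))
	n += copy(buf[n:], key)
	n += copy(buf[n:], val)
	_, err := w.bw.Write(buf[:n])
	return err
}

// Close flushes the remaining buffered data, forces it to stable storage,
// and closes the file handle.
func (w *Writer) Close() error {
	if err := w.bw.Flush(); err != nil {
		return err
	}
	if err := w.file.Sync(); err != nil {
		return err
	}
	return w.file.Close()
}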
We are now almost ready to implement our db.flushMemtables() method. The only thing left to solve beforehand is to abstract our data storage.
Abstracting the Data Storage
Our sstable.Writer is engineered to work with any abstraction satisfying the io.Writer interface. However, up to this point, we haven't produced any code for actually creating, writing, or reading files in the filesystem, and we really need this to be able to complete our implementation.
For that purpose, we introduce another new structure named storage.Provider. The storage.Provider manages the primary data folder of our storage engine. This folder contains all of the *.sst files produced during the lifespan of our database. Upon startup, the storage.Provider ensures that this folder exists and is accessible for reading and writing.
The storage.Provider also provides methods for preparing and opening new files for writing, listing available files, and opening existing files for reading. Instead of working with raw os.File descriptors, it introduces a lightweight FileMetadata data structure that abstracts away low-level information and only reports a number uniquely identifying the particular file inside the data folder managed by the storage.Provider, as well as its file type (this allows us to skip non-SST files if we discover any in the data folder).
To be able to generate file numbers, the storage.Provider maintains a global fileNum counter, ensuring that each subsequently generated *.sst file gets a unique identifier.
The complete implementation looks like this (it is a bit long, but hopefully clear enough):
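(A sketch under the stated assumptions; the method names and the zero-padded file naming scheme are our own choices, not necessarily the article's.)

package storage

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

type FileType int

const (
	FileTypeUnknown FileType = iota
	FileTypeSSTable
)

// FileMetadata hides low-level file details and only exposes a unique
// file number plus the file type.
type FileMetadata struct {
	fileNum  int
	fileType FileType
}

func (f *FileMetadata) FileNum() int    { return f.fileNum }
func (f *FileMetadata) IsSSTable() bool { return f.fileType == FileTypeSSTable }

type Provider struct {
	dataDir string
	fileNum int // global counter used to generate unique file numbers
}

// NewProvider makes sure the data folder exists and is accessible.
func NewProvider(dataDir string) (*Provider, error) {
	if err := os.MkdirAll(dataDir, 0755); err != nil {
		return nil, err
	}
	return &Provider{dataDir: dataDir}, nil
}

// ListFiles scans the data folder and reports metadata for each *.sst file.
func (p *Provider) ListFiles() ([]*FileMetadata, error) {
	files, err := os.ReadDir(p.dataDir)
	if err != nil {
		return nil, err
	}
	var meta []*FileMetadata
	for _, f := range files {
		name := f.Name()
		if !strings.HasSuffix(name, ".sst") {
			continue // skip anything that isn't an SSTable
		}
		num, err := strconv.Atoi(strings.TrimSuffix(name, ".sst"))
		if err != nil {
			continue
		}
		if num > p.fileNum {
			p.fileNum = num // keep the counter ahead of existing files
		}
		meta = append(meta, &FileMetadata{fileNum: num, fileType: FileTypeSSTable})
	}
	return meta, nil
}

// PrepareNewFile reserves a fresh file number for an upcoming SSTable.
func (p *Provider) PrepareNewFile() *FileMetadata {
	p.fileNum++
	return &FileMetadata{fileNum: p.fileNum, fileType: FileTypeSSTable}
}

func (p *Provider) path(meta *FileMetadata) string {
	return filepath.Join(p.dataDir, fmt.Sprintf("%06d.sst", meta.fileNum))
}

// OpenFileForWriting creates the file described by meta for writing.
func (p *Provider) OpenFileForWriting(meta *FileMetadata) (*os.File, error) {
	return os.OpenFile(p.path(meta), os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0644)
}

// OpenFileForReading opens an existing file described by meta for reading.
func (p *Provider) OpenFileForReading(meta *FileMetadata) (*os.File, error) {
	return os.Open(p.path(meta))
}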
To integrate the newly introduced storage.Provider with our main DB structure, we have to apply some small changes:
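(A sketch; the dataStorage field name is an assumption.)

type DB struct {
	dataStorage *storage.Provider
	memtables   struct {
		mutable *memtable.Memtable
		queue   []*memtable.Memtable
	}
}

// Open now receives the path to the data folder.
func Open(dirname string) (*DB, error) {
	dataStorage, err := storage.NewProvider(dirname)
	if err != nil {
		return nil, err
	}
	db := &DB{dataStorage: dataStorage}
	m := memtable.NewMemtable(memtableSizeLimit)
	db.memtables.mutable = m
	db.memtables.queue = append(db.memtables.queue, m)
	return db, nil
}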
This also alters the way we initialize our storage engine in our client code; we now have to pass a folder name to the Open() method:
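(Continuing the earlier client snippet; the folder name is arbitrary.)

db, err := memdb.Open("demo-data") // "demo-data" is the data folder
if err != nil {
	log.Fatal(err)
}
db.Set([]byte("name"), []byte("Alice"))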
With the sstable.Writer and storage.Provider finalized, we can finally implement the db.flushMemtables() method:
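(A sketch built on the assumed APIs above; note that we hand the writer the memtable's iterator, matching the Process() signature we sketched earlier.)

// flushMemtables persists every read-only memtable to its own SSTable,
// oldest first, and then removes the flushed memtables from the queue.
func (d *DB) flushMemtables() error {
	n := len(d.memtables.queue) - 1 // the last entry is the mutable memtable
	flushable := d.memtables.queue[:n]

	for i := 0; i < len(flushable); i++ {
		meta := d.dataStorage.PrepareNewFile()
		f, err := d.dataStorage.OpenFileForWriting(meta)
		if err != nil {
			return err
		}
		w := sstable.NewWriter(f)
		if err := w.Process(flushable[i].Iterator()); err != nil {
			return err
		}
		if err := w.Close(); err != nil {
			return err
		}
	}

	d.memtables.queue = d.memtables.queue[n:]
	return nil
}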
The method iterates all of the read-only memtables, prepares a new *.sst file on disk for each individual memtable, and then fills each file with content.
Searching SSTables
Now that we are successfully writing SSTables to disk, we need to have them included in the search pipeline. Otherwise, we won't be able to look up data that is persisted on disk. For that purpose, we can introduce a new field in our main DB structure named sstables, which can be used for storing references to all SSTables available on disk. By slightly modifying the flushMemtables() method that we just finished implementing, we can ensure that the newly introduced sstables field is always kept up to date with metadata identifying each newly created SSTable:
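(Only the delta is sketched here; the body of the flush loop stays exactly as before, which is why it is elided below.)

type DB struct {
	dataStorage *storage.Provider
	memtables   struct {
		mutable *memtable.Memtable
		queue   []*memtable.Memtable
	}
	// Metadata describing every SSTable currently available on disk.
	sstables []*storage.FileMetadata
}

func (d *DB) flushMemtables() error {
	n := len(d.memtables.queue) - 1
	flushable := d.memtables.queue[:n]

	for i := 0; i < len(flushable); i++ {
		meta := d.dataStorage.PrepareNewFile()
		// ... open the file and run the sstable.Writer exactly as before ...

		// Remember the newly created SSTable so reads can find it later.
		d.sstables = append(d.sstables, meta)
	}

	d.memtables.queue = d.memtables.queue[n:]
	return nil
}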
This allows us to introduce an sstable.Reader that lets us search SSTables for specific keys. The sstable.Reader simply iterates the data blocks of any loaded *.sst file and tries to find matches for a particular key. The implementation is quite straightforward:
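(A sketch matching the assumed data block layout; the ErrKeyNotFound sentinel in the sstable package is our own addition.)

package sstable

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"errors"
	"io"
)

// ErrKeyNotFound is returned when the searched key is absent from the SSTable.
var ErrKeyNotFound = errors.New("key not found in sstable")

type Reader struct {
	file io.Closer
	br   *bufio.Reader
}

// NewReader wraps the supplied file for buffered reads in 4 KiB blocks.
func NewReader(file io.ReadCloser) *Reader {
	return &Reader{
		file: file,
		br:   bufio.NewReaderSize(file, 4096),
	}
}

// Get scans the data blocks one by one until it finds searchKey or
// reaches the end of the file.
func (r *Reader) Get(searchKey []byte) ([]byte, error) {
	for {
		keyLen, err := binary.ReadUvarint(r.br)
		if err == io.EOF {
			return nil, ErrKeyNotFound
		}
		if err != nil {
			return nil, err
		}
		valLen, err := binary.ReadUvarint(r.br)
		if err != nil {
			return nil, err
		}
		buf := make([]byte, keyLen+valLen)
		if _, err := io.ReadFull(r.br, buf); err != nil {
			return nil, err
		}
		if bytes.Equal(buf[:keyLen], searchKey) {
			return buf[keyLen:], nil
		}
	}
}

func (r *Reader) Close() error {
	return r.file.Close()
}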
We then need to adapt our Get() method for scanning the SSTables right after it has finished searching all memtables without finding a matching result:
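(A sketch; it assumes the SSTables hold the same encoded values the memtables stored, so tombstones are checked here too, and it relies on the errors, memtable, sstable, and storage imports.)

func (d *DB) Get(key []byte) ([]byte, error) {
	// 1. Search the memtables, newest to oldest.
	for i := len(d.memtables.queue) - 1; i >= 0; i-- {
		m := d.memtables.queue[i]
		ev, ok := m.Get(key)
		if !ok {
			continue
		}
		if ev.IsTombstone() {
			return nil, ErrKeyNotFound
		}
		return ev.Value(), nil
	}

	// 2. Search the SSTables, newest to oldest.
	var dec memtable.Decoder // SSTables store the same encoded values
	for i := len(d.sstables) - 1; i >= 0; i-- {
		f, err := d.dataStorage.OpenFileForReading(d.sstables[i])
		if err != nil {
			return nil, err
		}
		r := sstable.NewReader(f)
		raw, err := r.Get(key)
		r.Close()
		if errors.Is(err, sstable.ErrKeyNotFound) {
			continue // not in this SSTable, try an older one
		}
		if err != nil {
			return nil, err
		}
		ev := dec.Decode(raw)
		if ev.IsTombstone() {
			return nil, ErrKeyNotFound
		}
		return ev.Value(), nil
	}

	return nil, ErrKeyNotFound
}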
Our search procedure is now complete and works with both memtables and SSTables.
Restoring the Database State
With the code that we have so far, you may observe that after restarting our database storage engine, data previously stored on disk becomes inaccessible. This is due to the fact that the db.sstables field is only being populated by the flushMemtables() procedure.
To address this issue, we simply need to scan the data folder after every restart and populate the db.sstables field with the metadata retrieved from the data folder. We can do this by introducing the method loadSSTables() and slightly modifying our db.Open() method as follows:
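(A sketch building on the storage.Provider assumed earlier.)

func Open(dirname string) (*DB, error) {
	dataStorage, err := storage.NewProvider(dirname)
	if err != nil {
		return nil, err
	}
	db := &DB{dataStorage: dataStorage}
	if err := db.loadSSTables(); err != nil {
		return nil, err
	}
	m := memtable.NewMemtable(memtableSizeLimit)
	db.memtables.mutable = m
	db.memtables.queue = append(db.memtables.queue, m)
	return db, nil
}

// loadSSTables scans the data folder and registers every *.sst file found
// there, so that previously persisted data is searchable after a restart.
func (d *DB) loadSSTables() error {
	meta, err := d.dataStorage.ListFiles()
	if err != nil {
		return err
	}
	for _, m := range meta {
		if !m.IsSSTable() {
			continue // ignore anything that isn't an SSTable
		}
		d.sstables = append(d.sstables, m)
	}
	return nil
}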
Now, every time we restart our storage engine, we will still be able to access the data stored in its *.sst files.
Final Words
We've covered a lot of ground, and as always you can find the full source code for this tutorial over here:
Yet there is so much more ground to cover and crucial questions to be answered:
Q: What happens to the data stored in the unflushed memtables if we abruptly terminate the database storage engine process?
A: We will unfortunately lose all of this data and currently have no recovery mechanism in place. We can resolve this by implementing a write-ahead log (WAL). We will cover this in a follow-up article, so stay tuned!
Q: What methods can we employ to optimize the search process within our SSTables and minimize their footprint on disk?
A: We traded off performance for simplicity. The sequential search that we are currently doing is far from ideal. We are also wasting a lot of disk space due to our inefficient data format. We will address all of these issues in a follow-up article specifically focused on SSTables. Stay tuned!
I hope you found some valuable information in this article. Please don't hesitate to get in touch if you have any questions or comments.
Until then, feel welcome to play around with the provided CLI:
References
I drew most of my inspiration for this article from tinkering with the Pebble storage engine created by Cockroach Labs.
You can find its source code at https://github.com/cockroachdb/pebble.