Definition

A Solid State Drive (SSD) is a storage device that, unlike an HDD, has no mechanical parts. It is based on NAND flash memory, a type of non-volatile memory made of transistors that retains data even when the power is off.

The device is composed of a controller and a number of NAND chips. The controller manages the communication between the NAND chips and the host system, while the NAND chips are the actual storage units where the data is stored. SSDs connect to the host system through the same interfaces used by HDDs (e.g., SATA), but they are much faster and more reliable than HDDs.

A NAND chip stores data in cells, which can be of different types. The most common types are:

  • SLC (Single Level Cell): each cell can store only one bit of data. This type of cell is the fastest and most reliable, but it is also the most expensive.
  • MLC (Multi Level Cell): each cell stores two bits of data. This type of cell is slower and less reliable than SLC, but it is also cheaper.
  • TLC (Triple Level Cell): each cell can store three bits of data. This type of cell is slower and less reliable than MLC, but it is also cheaper.

Other, less common types of cells also exist, such as QLC (Quad Level Cell, four bits per cell) and PLC (Penta Level Cell, five bits per cell). The number of bits stored in a cell is called the cell level, and it is an important factor affecting the performance and reliability of the SSD.
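A short Python sketch of why more bits per cell cost speed and reliability: a cell storing n bits must distinguish 2^n charge (threshold voltage) levels within roughly the same voltage window, so the margins between levels shrink. The dictionary only restates the cell types listed above; the function name is illustrative.

```python
# Bits stored per cell for each cell type mentioned in the text.
CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4, "PLC": 5}


def voltage_levels(bits_per_cell: int) -> int:
    """Number of distinct charge levels a cell must tell apart."""
    return 2 ** bits_per_cell


for name, bits in CELL_TYPES.items():
    # SLC -> 2 levels, MLC -> 4, TLC -> 8, QLC -> 16, PLC -> 32
    print(f"{name}: {bits} bit(s)/cell -> {voltage_levels(bits)} levels")
```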

Internal organization

An SSD is composed of a number of NAND chips. Each chip is organized into Pages and Blocks.

Definition

  • A Page contains multiple logical block addresses (LBAs), each usually 512 bytes to 4KB.
  • A Block typically contains multiple pages (usually 64) with a total capacity of 128-256KB.

The block (or erase block) is the smallest unit that can be erased: it consists of multiple pages and can be cleaned only as a whole, using the ERASE operation. The page is the smallest unit that can be read or written: it is a sub-unit of an erase block and holds the number of bytes that can be read or written in a single READ or PROGRAM operation. Each page can be in one of three states:

  • Empty (or ERASED): the page does not contain any data.
  • Dirty (or INVALID): the page contains data, but this data is no longer in use.
  • In use (or VALID): the page contains data that can be actually read.
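The three page states and the erase rule can be modelled as a small state machine. This is a toy Python sketch (the `Block` class and its methods are hypothetical, not a real controller API): PROGRAM moves an ERASED page to VALID, superseding its data elsewhere turns a VALID page INVALID, and only ERASE on the whole block makes pages writable again.

```python
from enum import Enum


class PageState(Enum):
    ERASED = "empty"
    VALID = "in use"
    INVALID = "dirty"


class Block:
    """Toy erase block: pages are programmed one by one, erased only together."""

    def __init__(self, pages_per_block: int = 4) -> None:
        self.pages = [PageState.ERASED] * pages_per_block

    def program(self, i: int) -> None:
        # A page can be programmed only if it has been erased first.
        if self.pages[i] is not PageState.ERASED:
            raise ValueError(f"page {i} must be erased before programming")
        self.pages[i] = PageState.VALID

    def invalidate(self, i: int) -> None:
        # The data was superseded elsewhere: the page is dirty but not reusable.
        if self.pages[i] is PageState.VALID:
            self.pages[i] = PageState.INVALID

    def erase(self) -> None:
        # ERASE acts on the whole block, never on a single page.
        self.pages = [PageState.ERASED] * len(self.pages)


block = Block()
block.program(0)
block.invalidate(0)   # programming page 0 again now would raise ValueError
block.erase()
block.program(0)      # allowed again after the block erase
```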

The Block/page terminology is different from the one used in HDDs. In HDDs, a block is a group of sectors, while in SSDs a block is a group of pages.

Performance

In terms of performance, SSDs are known to be much faster than HDDs: laboratory tests have shown that SSDs can be up to 15-20 times faster. However, the actual performance of an SSD depends on many factors.

Example

Let’s consider an SSD with:

  • page size of 4KB
  • block size of 5 pages
  • drive size of 1 block
  • read speed of 2 KB/s
  • write speed of 1 KB/s

If I want to write a 4KB file on a brand new SSD, the write operation takes 4 seconds (4KB at 1 KB/s) and involves only 1 page. If I then want to write an 8KB photo on the same SSD, the write operation takes 8 seconds and involves 2 pages, the second and the third.

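Now suppose the file written in the first page is no longer needed and I want to write a new 12KB file (3 pages). There are not enough empty pages: only the fourth and the fifth are free, and the first page, although it holds stale data, cannot be overwritten without erasing the whole block. So the controller needs to:

  1. read the block into cache
  2. delete the stale page from the cache
  3. write the new file in the cache
  4. erase the whole old block on the SSD (not just the page holding the old file)
  5. write the cache back to the SSD

The SSD read 12KB of data when it actually needed only 8KB (the photo), and then wrote 20KB, the entire block. Ignoring the erase time, which the example does not specify, reading 12KB at 2 KB/s takes 6 seconds and writing 20KB at 1 KB/s takes 20 seconds. The write should have taken 12 seconds, but it actually took 26 seconds, resulting in an effective write speed of about 0.46 KB/s, less than half the nominal 1 KB/s.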
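The arithmetic of the example can be checked with a few lines of Python. The constants restate the example's parameters; the erase time is ignored because the example gives no erase speed, and the function names are illustrative.

```python
PAGE_KB = 4          # page size from the example
PAGES_PER_BLOCK = 5  # block size from the example
READ_KB_S = 2.0      # read speed (KB/s)
WRITE_KB_S = 1.0     # write speed (KB/s)


def write_time(kb: float) -> float:
    """Seconds to program `kb` kilobytes into already empty pages."""
    return kb / WRITE_KB_S


def read_modify_write_time(read_kb: float) -> float:
    """Seconds to read `read_kb` into cache and rewrite the whole block.
    Erase time is not modelled (no erase speed is given)."""
    read_s = read_kb / READ_KB_S
    write_s = PAGES_PER_BLOCK * PAGE_KB / WRITE_KB_S
    return read_s + write_s


t_text = write_time(4)               # first file, page 1
t_photo = write_time(8)              # photo, pages 2-3
t_new = read_modify_write_time(12)   # 6 s read + 20 s write
print(t_text, t_photo, t_new, round(12 / t_new, 2))  # 4.0 8.0 26.0 0.46
```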

Flash Translation Layer (FTL)

Definition

The Flash Translation Layer (FTL) is a software (firmware) component, running on the SSD controller, that manages the mapping between the logical and physical sectors of an SSD, handles wear leveling, and performs garbage collection.

The FTL makes the SSD “look like” an HDD to the host system, because the physical layout of an SSD does not directly match the logical layout seen by the host. The FTL is in charge of the following operations:

  • Data allocation and address translation
  • Garbage collection: reclaims the pages that are no longer in use, marked as dirty (invalid), so they can be reused.
  • Wear leveling: this process is used to distribute the write operations evenly across the NAND chips, in order to increase the lifespan of the SSD.

Log-structured FTL

The Log-structured FTL is a type of FTL inspired by log-structured file systems. In this type of FTL, data is written sequentially to the NAND chips: each write goes to the next free page, like appending to a log. The mapping between logical and physical sectors is stored in a table called the Log, which is kept in the NAND chips and updated every time a new write operation is performed. Besides tracking the mapping between logical and physical sectors, it is also used to manage wear leveling.

Example

Assuming that the SSD has a page size of 4KB and a block size of 4 pages, the Log will be a table with 4 columns and one row per page of the SSD. Each row contains the following information:

  • Logical page number
  • Physical block number
  • Physical page number
  • Status (empty, dirty, in use)

Let’s consider the following write operations, each given as a (logical page, data) pair:

  • (100, a1)
  • (101, a2)
  • (2000, b1)
  • (2001, b2)
  • (100, c1)
  • (101, c2)

Initially, all the pages of the block are marked as INVALID, so the first operation, before any write, is an ERASE to clean the block. Then the FTL programs the pages in order, writing the data and updating the mapping information (the first four write operations above). The last two write operations are updates of logical pages 100 and 101: the FTL does not overwrite the old pages in place, but writes c1 and c2 to the next free pages (the first two pages of the next block, after erasing it), updates the mapping, and marks the old pages holding a1 and a2 as INVALID. These invalid pages are the garbage that must later be reclaimed.
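In a log-structured FTL an update is appended to a fresh page and the old copy is only marked INVALID, which is what later creates garbage. A toy Python model of the example, assuming two blocks of four pages; `LogFTL` and its fields are hypothetical names, and the ERASE is only recorded, not timed:

```python
from dataclasses import dataclass, field

PAGES_PER_BLOCK = 4  # block size from the example


@dataclass
class LogFTL:
    """Toy log-structured FTL: every write is appended to the next free page."""

    num_blocks: int
    mapping: dict = field(default_factory=dict)  # logical page -> (block, page)
    data: dict = field(default_factory=dict)     # (block, page) -> contents
    state: dict = field(default_factory=dict)    # (block, page) -> "VALID"/"INVALID"
    erased: set = field(default_factory=set)     # blocks erased so far
    head: tuple = (0, 0)                         # next (block, page) to program

    def write(self, lpn: int, value: str) -> None:
        block, page = self.head
        if block >= self.num_blocks:
            raise RuntimeError("no free pages left: garbage collection needed")
        if page == 0 and block not in self.erased:
            self.erased.add(block)  # ERASE the block before its first PROGRAM
        old = self.mapping.get(lpn)
        if old is not None:
            self.state[old] = "INVALID"  # the old copy becomes garbage
        self.data[(block, page)] = value
        self.state[(block, page)] = "VALID"
        self.mapping[lpn] = (block, page)
        # Advance the log head, moving to the next block when this one is full.
        page += 1
        if page == PAGES_PER_BLOCK:
            block, page = block + 1, 0
        self.head = (block, page)

    def read(self, lpn: int) -> str:
        return self.data[self.mapping[lpn]]


ftl = LogFTL(num_blocks=2)
for lpn, value in [(100, "a1"), (101, "a2"), (2000, "b1"),
                   (2001, "b2"), (100, "c1"), (101, "c2")]:
    ftl.write(lpn, value)
print(ftl.mapping)  # {100: (1, 0), 101: (1, 1), 2000: (0, 2), 2001: (0, 3)}
```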

Garbage collection

When an existing page is updated, the old data becomes obsolete. The old versions of the data are called garbage, and (sooner or later) these useless pages must be reclaimed for new writes to take place.

Definition

Garbage collection is the process of finding garbage blocks and reclaiming them.

It is a simple process for blocks that contain only garbage, but it is more complex for partially garbage blocks. In that case, the FTL needs to find suitable partial blocks, copy all the non-garbage (live) pages somewhere else, and erase the entire block so it can be used for new data.

The garbage collection process is very expensive, because it requires reading and rewriting live data. The ideal case is reclaiming a block that consists only of dead pages, but this is not always possible. Garbage collection is generally performed in the background, when the SSD is not busy, and involves four steps: read the block, copy its live data to a new block, update the mapping, and erase the old block so it can be reused.

The cost of the process depends on the amount of data that has to be migrated, the number of blocks that have to be erased and the number of blocks that have to be written. To alleviate the problem, the SSD can include more NAND capacity than it exposes to the host (overprovisioning), so that cleaning can be delayed, or it can run the process in the background, when the SSD is not busy.
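The cost of garbage collection is dominated by the live pages that must be copied before the erase. A minimal Python sketch, assuming a greedy victim-selection policy (pick the block with garbage and the fewest valid pages), which is a common choice but not one the text prescribes; the block contents are hypothetical:

```python
VALID, INVALID, ERASED = "V", "I", "E"


def pick_victim(blocks: dict, free_block: int) -> int:
    """Greedy policy (an assumption): the block with garbage and fewest live pages."""
    candidates = [b for b, pages in blocks.items()
                  if b != free_block and INVALID in pages]
    return min(candidates, key=lambda b: blocks[b].count(VALID))


def garbage_collect(blocks: dict, free_block: int) -> tuple:
    """Reclaim one block; returns (victim, number of live pages copied)."""
    assert all(p == ERASED for p in blocks[free_block]), "target must be erased"
    victim = pick_victim(blocks, free_block)
    size = len(blocks[victim])
    live = blocks[victim].count(VALID)
    # Read the victim and program its live pages into the free block
    # (the mapping-table update is omitted in this toy model).
    blocks[free_block] = [VALID] * live + [ERASED] * (size - live)
    # Erase the victim as a whole, so all its pages become writable again.
    blocks[victim] = [ERASED] * size
    return victim, live


blocks = {0: [VALID, INVALID, INVALID, VALID],
          1: [INVALID, INVALID, INVALID, VALID],
          2: [ERASED] * 4}
print(garbage_collect(blocks, free_block=2))  # (1, 1): one live page copied
```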

When performing background garbage collection, the SSD assumes it knows which pages are invalid and which are valid. However, most file systems don’t actually delete data: on Linux, for example, the “delete” function is unlink(), which removes the file’s metadata but not the data itself. The SSD therefore still considers those pages valid, and garbage collection can become very slow, because it keeps copying useless pages.

Note

In HDDs, garbage collection is not needed, because sectors can be overwritten in place: when a file is deleted, its sectors can simply be reused by the next write, without any erase step.

The SATA command TRIM informs the SSD that specific LBAs are invalid (their data is no longer needed), so they can be garbage collected without being copied. TRIM is supported by many operating systems: Windows 7 and later, Mac OS X Snow Leopard and later, and Linux kernel 2.6.33 and later.
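A small Python illustration of why TRIM helps, using hypothetical page numbers: without TRIM the pages of a file deleted with unlink() still look valid and are copied during garbage collection; once they are trimmed, only the truly live page is copied.

```python
# Logical pages whose latest copy lives in a victim block (None = INVALID page).
victim_block = [100, 101, None, 2000]
deleted_file = {100, 101}  # hypothetical pages of a file removed with unlink()


def pages_to_copy(block: list, trimmed: frozenset = frozenset()) -> int:
    """Live pages garbage collection must copy before erasing the block.
    Without TRIM the SSD cannot know that deleted pages are dead."""
    return sum(1 for lpn in block if lpn is not None and lpn not in trimmed)


without_trim = pages_to_copy(victim_block)                        # 100, 101, 2000
with_trim = pages_to_copy(victim_block, frozenset(deleted_file))  # only 2000
print(without_trim, with_trim)  # 3 1
```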

Mapping table size

The size of the mapping table is an important factor that affects the performance of the SSD. The mapping table is used to keep track of the mapping between the logical and physical sectors, and to manage the wear leveling.

A page-level mapping table can be too large to keep in DRAM: with a 1TB SSD and a 4-byte entry per 4KB page, the table would need 1GB of DRAM. To reduce the cost of the mapping table, the FTL can use different techniques, such as:

  • Block-level mapping table: the FTL maps at block granularity, which greatly reduces the size of the mapping table. However, this technique hurts performance when writing small files, because the FTL must read a large amount of live data from the old block and copy it into a new one.
  • Hybrid mapping table: the FTL maintains two tables, one at page level (for log blocks) and one at block level (for data blocks). When looking for a particular logical block, the FTL consults the page mapping table first and then the block mapping table.
  • Page mapping plus caching: cache the active part of the page-mapped FTL in the DRAM; if a given workload only accesses a small set of pages, the translations of those pages will be stored in the FTL memory. This solution provides high performance without high memory cost if the cache can contain the necessary working set.
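The 1GB figure and the saving of block-level mapping can be checked with a short Python sketch. The 256KB block (64 pages of 4KB) comes from the sizes given earlier; the hybrid lookup follows the order described above (page table, then block table), while the dictionaries are hypothetical stand-ins for the FTL's tables.

```python
KB, TB = 2 ** 10, 2 ** 40


def table_bytes(capacity: int, unit: int, entry_bytes: int = 4) -> int:
    """Size of a flat mapping table with one entry per mapping unit."""
    return capacity // unit * entry_bytes


page_level = table_bytes(1 * TB, 4 * KB)     # 2**28 entries -> 1 GiB
block_level = table_bytes(1 * TB, 256 * KB)  # 2**22 entries -> 16 MiB
print(page_level // 2 ** 20, block_level // 2 ** 20)  # 1024 16 (MiB)


def hybrid_lookup(lpn: int, page_map: dict, block_map: dict,
                  pages_per_block: int = 64) -> tuple:
    """Translate a logical page to (physical block, page) in a hybrid FTL."""
    # 1. Log blocks: recently written pages have a page-level entry.
    if lpn in page_map:
        return page_map[lpn]
    # 2. Data blocks: one entry per logical block; the page offset is kept.
    logical_block, offset = divmod(lpn, pages_per_block)
    return block_map[logical_block], offset


page_map = {130: (9, 0)}  # hypothetical: page 130 rewritten into log block 9
block_map = {2: 5}        # hypothetical: logical block 2 in physical block 5
print(hybrid_lookup(130, page_map, block_map))  # (9, 0)
print(hybrid_lookup(131, page_map, block_map))  # (5, 3)
```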

Wear leveling

Definition

Wear leveling is the process of distributing the write operations evenly across the NAND chips, in order to increase the lifespan of the SSD.

Erase and write operations are limited in NAND flash memory: each block can sustain only a limited number of program/erase cycles. An uneven (skewed) distribution of erase and write operations can significantly reduce the lifespan of the SSD. A log-structured FTL and garbage collection help spread erase and write operations. However, a block may hold cold data that is rarely overwritten; such a block is never reclaimed and therefore does not share the wear. The FTL must periodically read all the live data out of such blocks and rewrite it to other blocks. This process increases write amplification and decreases the performance of the SSD.
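A minimal Python sketch of the two ideas, assuming per-block erase counters and a hypothetical skew threshold (real FTL policies are vendor-specific): new writes go to the least-worn free block, and a block whose erase count lags far behind is flagged as holding cold data to migrate.

```python
def pick_block_for_write(free_blocks: list, erase_count: dict) -> int:
    """Dynamic wear leveling: hand out the least-worn free block."""
    return min(free_blocks, key=lambda b: erase_count[b])


def static_migration_needed(erase_count: dict, threshold: int) -> tuple:
    """Static wear leveling trigger: if the least-worn block lags too far behind
    the most-worn one, it probably holds cold data that should be moved out."""
    coldest = min(erase_count, key=erase_count.get)
    hottest = max(erase_count, key=erase_count.get)
    return erase_count[hottest] - erase_count[coldest] > threshold, coldest


erase_count = {0: 120, 1: 3, 2: 118, 3: 95}  # hypothetical erase counters
print(pick_block_for_write([0, 2, 3], erase_count))  # 3
print(static_migration_needed(erase_count, 100))     # (True, 1)
```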

Comparison between HDDs and SSDs

The filesystem is the same for both HDDs and SSDs, but the way the data is accessed is different. In HDDs, there is a 1:1 mapping between the sector read or written and the physical sector on the disk. In SSDs, while the filesystem still uses sectors, the drive accesses flash only through READ and PROGRAM (page granularity) and ERASE (block granularity) operations, and a logical sector can be stored in any page of any block. So, in SSDs there is a mismatch between logical and physical sectors.

The SSD controller is in charge of managing the mapping between logical and physical sectors. This mapping is maintained by the FTL (Flash Translation Layer) in a mapping table, which is stored in the NAND chips and updated every time a new write operation is performed. The same table is also used to manage wear leveling.

Definitions

There are two important metrics used to evaluate the reliability and endurance of an SSD:

  • Unrecoverable Bit Error Rate (UBER): a metric for the rate of occurrence of data errors, equal to the number of data errors per bits read
  • Endurance rating: Terabytes Written (TBW), the total number of terabytes that can be written to an SSD while it still meets its requirements, before it is likely to fail.

A NAND flash memory cell can sustain between 3,000 and 100,000 program/erase cycles during its lifetime. Once the limit is reached, the cell can no longer reliably retain data (it “forgets” new data written to it) and becomes useless.

A typical TBW value for a consumer-grade SSD is between 60 and 150 terabytes of data written. This means that to exceed, for example, a TBW of 70 terabytes, a user would have to write about 190 GB every day for a year, or fill two thirds of their SSD with new files every day for a whole year.
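The TBW arithmetic and the UBER definition can be written as a short Python sketch. Decimal units (1 TB = 1000 GB) are assumed, and the error count in the last line is purely illustrative, not a specification value.

```python
def gb_per_day(tbw_tb: float, years: float = 1.0) -> float:
    """Daily writes (decimal GB) needed to reach `tbw_tb` terabytes in `years`."""
    return tbw_tb * 1000 / (365 * years)


def uber(errors: int, bits_read: int) -> float:
    """Unrecoverable Bit Error Rate: data errors per bit read."""
    return errors / bits_read


print(round(gb_per_day(70)))  # 192 GB/day to write 70 TB in one year
print(uber(1, 10 ** 15))      # 1e-15 for one error in 10**15 bits (illustrative)
```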