BSD Fast File System

From Computer History Wiki
Jump to: navigation, search

Until the creation of the Berkeley BSD Fast File System for BSD 4.1, all prior Unixes had used basically the file system designed for the early PDP-11 versions of Unix. While very simple (in on-disk structure, and thus implementation), it had some performance, functionality and robustness issues. The FFS fixed these, so successfully that the FFS and its descendants are still the file system of most Unix descendants today.

Issues

The performance issue were the most critical, but there were lesser functionality and robustness issue as well. The FFS fixed most of these issues (although without duplication of most meta-data, and all data, it was still somewhat susceptible to hardware failures such as head crashes).

Performance

The original Unix file system was too simple to be able to get good performance out of the disks of the era.

One factor was the small block size (which was to some extent ameliorated in systems prior to the FFS by using larger block sizes). Another was the fact that all the inodes were at the start of the disk, necessitating seeks (of average length of half the disk size) to get to the data. The last was the extreme non-locality of data placement, which was an endemic (albeit un-intended) feature of the earlier system.

Once the system had been running for a while, free blocks were scattered fairly randomly across the disk, and the blocks comprising a new file were thus widely scattered; reading the file resulted in a great deal of head motion (seeks). There was no cure for this state of affairs, other than dumping and re-building the disk - which would only solve the problem temporarily.

Functionality

Although the original file system did include so-called hard links, those were non-optimal in a lot of ways; among other issue, they were restricted to a single logical volume.

Another problematic limitation was the length of file names; although all ASCII characters could be use, the maximum length of any element (level) of a file name was only 14 characters.

Robustness

The old file system had only one copy of all the meta-data of a volume, including the super block. While some of the meta-data (such as the free list) could be re-constructed from other meta-data (for the free list, by scanning all the inodes), most could not.

Changes

Changes (some very major) were made in a number of areas, to attempt to deal with those issues:

Performance

The most fundamental changes to the file system were needed to increase the performance; in particular, to prevent the fragmentation which rendered high performance impossible.

The 'block size' was increased substantially, although to prevent excessive wastage of space when storing small files, all blocks were divisible into fragments - which are actually the true 'block size' of the new file system; FFS 'blocks' are better thought of as a method to aggregate contiguous groups of fragments into a unit. When possible, fragments were coalesced back into a block.

Instead of the old layout, with all the inodes at the start of the disk, the disk was divided into cylinder groups, each of which contains a header containing information specific to that particular cyclinder group, a tranche of inodes (along with associated meta-data, such as free inode and fragment bit maps), and a number of data blocks. The cylinder-group header records (among other data) the numbers of free inodes, data blocks, and fragments in that cylinder group, to help the system select a cylinder group, when allocating files.

Generally, an attempt is made to allocate all the storage for a file in the same cylinder group as its inode, so that seeks to retrieve the data are minimized. The blocks of a file are also placed on the disk in rotationally-optimized locations.

Both free inodes and free data space are tracked by bit maps; free data space is tracked in terms of fragments - a free block shows up as 4 contiguous free fragments. (The inode free map is apparently purely a performance improver, to avoid having to scan the array of inodes, looking for a free one.) These are both stored following the cylinder-group header (although technically the entire free inode bit-map is stored inside the cylinder-group header structure).

Each cylinder group is created with a generous allocation of inodes - enough inodes, in the typical case, to provide one inode for roughly every three fragments. (The number of inodes per cyclinder group is a fundamental constant, since the inode free map is part of the cylinder group header structure.)

To allow tuning of the file system to various applications, many of the 'constants' of the file system (such as the number of fragments in a block) were actually configuration parameters which could be set at the time the file-system was created; these values were stored in the 'super block'. So a particular FFS partition might be optimized for fast access, or for minimum wasted space.

The physical parameters of the disk (the number of sectors per track, and the number of tracks per cylinder) were also stored there; they were used by the file system to calculate the sector numbers (on the disk) where physical cylinders started.

The full scheme is far too complex to be fully explained here; consult the paper on the FFS for the details. It not only performed well, but over time the anti-fragmentation mechanisms built into it kept data contiguous.

Robustness

Since the file system would be incomprehensible if the configuration values stored in the super-block were lost, a copy of the super-block was stored at the start of each cylinder group. (In the FFS, little to none of the data in the super-block changed while the system was operating, so it was not necessary to update all the copies whenever things changed on the disk.)

Furthermore, the location of the copy is staggered in each cylinder group, so that there were on different platters (of a multi-platter disk pack), so that a single head crash could not take out all the copies.

Since the cylinder-group meta-data was stored immediately following the super-block copy, and the inodes following that, that information too was not all on a single platter, and vulnerable to a single hardware failure.

Both super-blocks and cylinder group headers contain a 'magic word'; this provides both robustness (since on reading what is thought to be a cylinder group header, it can be checked to make sure it really is, and there has not been some coding/etc error), and also the ability to locate them (by a search of the disks for blocks with the appropriate value in the appropriate location).

Functionality

A number of changes were made to the FFS to increase the functionality - unlike the performance improvements, which were generally invisible to the users, these were.

Long directory entries

In the original Unix file system, directories contained arrays of fixed-format directory entries, each of which contained an inode number, and (shortish) fixed-length name.

The FFS changed this so that directory entries also contained an 'entry length' field and a 'name length' field; the name was made variable length. From an entry, the start of the next entry could be found using the entry length field.

The separate name length field allowed deletion to be an easy process; the unused space of the now-empty entry was added to the previous entry.

Symbolic links, etc

When the inode format was changed, the number of inode types was increased. The old file system allowed for only 4 types (file, directory, and block and character devices).

By getting rid of the separate 'allocated' bit (an un-allocated inode is now indicated by a 0 type field), the number of types was increased to 7, and two new types were added: symbolic links, and 'sockets' (used by the networking code).

Implementation details

To start with, it is necessary to understand that the FFS has no less than four different 'block' sizes:

  • the hardware block size (512 bytes on DEC hardware), hereinafter 'sector' size;
  • the 'block device I/O block' size (normally statically defined as 2048 bytes - i.e. 4 sectors), which apparently was used for disk buffer allocation (hence the existence of the bdbtofsb() macro), but is not used in the file system anywhere;
  • the 'fragment' (minimum allocatable amount) size, also called the "file system block [number]", a per-disk configuration parameter;
  • and the 'block' (maximum allocatable unit) size, also per-disk configurable.

Alas, the code is not always as clear as it could/should be as to which of these senses is in use in a particular structure element, etc. Note especially that disk addresses (e.g. 'block' numbers in inodes) are usually given in terms of fragment numbers.

The disk starts with a hardware primary (1 sector) and secondary (15 sectors) bootstrap. The first copy of the super-block is immediately after that, at block 16.

From there on out, addresses may vary, depending on things such as the block-size parameter. A set of macros (cgbase(), cgstart(), etc) are provided to give the disk address (in units of fragments) of each part of a cylinder group; cgbase() indicates where the cylinder group starts, and cgstart() indicates where, within the cylinder group, the meta-data of that cylinder group resides.

For a file system with a fragment size of 2 sectors, and a block size of 8 sectors, the copy of the super-block for the first cylinder group is at sector 32, the cylinder group header for that cylinder-group is at sector 48, and the inodes start at sector 56.

Inodes contained the block numbers of the first 12 blocks, so that all the data of files up to a fairly good size could be reached directly; all except the last one were necessarily full blocks, but the last one might contain only one or more fragments. The number of blocks, and the number of tailing fragments, could easily be determined from the file's size.

Similarly to the old file system, for large files, an inode also held the block numbers of a single indirect block (i.e. a block containing an array of block numbers), a double-indirect block (i.e. that block contained the block numbers of a set of indirect blocks), and a triple-indirect block.

External links

Further reading

Marshall K. McKusick, William N. Joy, Samuel J. Leffler, Robert S. Fabry, A Fast File System for UNIX