Difference between revisions of "Repairing early UNIX file systems"

From Computer History Wiki

Revision as of 11:48, 4 March 2023

Repairing a damaged early UNIX file system is not trivial. Early versions of UNIX had almost no tools for automagically fixing file system corruption. To repair one by hand, one needs to:

  • understand how the file system is arranged (see the link below, but it's pretty simple);
  • understand what the few available tools (dcheck, icheck, clri) do;
  • dive in!

Different versions of UNIX came with different tools, but the V5 and V6 tools can generally be used on each other's disks. V7 used a different file system format from V5/V6 (the formats are very similar, but block numbers are 16 bits on V5/V6, and 32 bits on V7); using V5/V6 tools on a V7 disk will thus trash the disk.
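To make the 16-bit block numbers concrete, here is a minimal Python sketch that decodes a V6-style on-disk inode. The 32-byte field layout and the 'allocated' bit are assumptions recalled from the V6 sources, not taken from this page, and the function name is hypothetical.

```python
import struct

# Assumed V6 on-disk inode, 32 bytes, little-endian PDP-11 words:
# mode(2) nlink(1) uid(1) gid(1) size0(1) size1(2) addr[8](16) atime(4) mtime(4)
V6_INODE = struct.Struct("<HBBBBH8H4s4s")
IALLOC = 0o100000  # assumed 'inode is allocated' bit in the mode word


def decode_v6_inode(raw):
    """Decode one 32-byte V6-style inode (times are left undecoded)."""
    mode, nlink, uid, gid, size0, size1, *rest = V6_INODE.unpack(raw)
    return {
        "allocated": bool(mode & IALLOC),
        "nlink": nlink,
        "size": (size0 << 16) | size1,  # 24-bit size: high byte, then low word
        "addr": list(rest[:8]),         # eight 16-bit block numbers
    }
```

Since each entry in 'addr' is a 16-bit word, a tool that reads inodes this way cannot make sense of a disk whose block numbers are wider.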

The tools are:

  • check (V5)
  • icheck (V6) - ensures that all blocks are assigned to at most one file, and that all unused blocks are in the free list
  • dcheck (V6) - ensures that the reference count in each inode matches the number of directory entries for it
  • clri (V6) - zeroes the contents of an inode
  • fcheck (V6) - a product of CMU, it included both the icheck and dcheck functionality, along with other improvements
  • fsck (V7) - a later descendant of fcheck
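The kind of comparison dcheck makes can be sketched in Python as follows. This is an illustration only, not the tool's actual code: it assumes the V6 directory entry format (a 2-byte i-number followed by a 14-byte name, with i-number 0 marking an empty slot), and both function names are hypothetical.

```python
import struct
from collections import Counter

DIRENT = struct.Struct("<H14s")  # assumed V6 directory entry: i-number + name


def count_dir_refs(dir_bytes, counts=None):
    """Count, per i-number, the entries found in one directory's raw contents."""
    counts = Counter() if counts is None else counts
    for off in range(0, len(dir_bytes) - DIRENT.size + 1, DIRENT.size):
        inum, _name = DIRENT.unpack_from(dir_bytes, off)
        if inum != 0:  # i-number 0 marks an unused slot
            counts[inum] += 1
    return counts


def link_mismatches(nlinks, counts):
    """Return {inum: (link count, entries found)} where the two disagree."""
    inums = sorted(set(nlinks) | set(counts))
    return {
        i: (nlinks.get(i, 0), counts.get(i, 0))
        for i in inums
        if nlinks.get(i, 0) != counts.get(i, 0)
    }
```

One would call count_dir_refs once per directory, passing the same counter each time, and then compare the totals against the link counts read from the inodes.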

The documentation for 'icheck' contains the following warning: "Notice also that the words in the super-block which indicate the size of the free list and of the i-list are believed. If the super-block has been curdled these words will have to be patched."

Manual patching and consistency issues

As indicated, some problems can only be repaired by manual patching. 'db' is the best tool for this, so one will want to study the 'db' syntax before starting. (The '!' command is the one to examine. It's probably a good idea to practice on an ordinary file first; one can use 'od' to see the results of one's attempts.) The limited addressing capability of 'db' may also require using 'dd' to extract a copy of the block on which one wants to operate, and then using 'dd' again to put the repaired block back.
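The extract, patch and put-back cycle can be rehearsed on an ordinary file. The Python sketch below only illustrates the arithmetic involved (byte offset = block number × block size); it assumes 512-byte blocks and 16-bit little-endian words, and the helper names are hypothetical. It is not a substitute for 'db' or 'dd' on the real system.

```python
BLOCK_SIZE = 512  # assumed block size of these early file systems


def read_block(path, blkno):
    """Return the contents of block number blkno of the file at path."""
    with open(path, "rb") as f:
        f.seek(blkno * BLOCK_SIZE)
        return f.read(BLOCK_SIZE)


def write_block(path, blkno, data):
    """Overwrite block number blkno in place; data must be one full block."""
    if len(data) != BLOCK_SIZE:
        raise ValueError("data must be exactly one block")
    with open(path, "r+b") as f:
        f.seek(blkno * BLOCK_SIZE)
        f.write(data)


def patch_word(block, offset, value):
    """Return a copy of block with the 16-bit word at byte offset replaced."""
    if offset % 2 or not 0 <= offset <= len(block) - 2:
        raise ValueError("offset must be an even byte offset inside the block")
    return block[:offset] + value.to_bytes(2, "little") + block[offset + 2:]
```

A repair would then look like write_block(path, n, patch_word(read_block(path, n), offset, value)), with the result inspected afterwards, much as one would use 'od'.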

If using 'db', one will have to use the non-raw version of the disk, since the raw version can only read/write complete blocks. ('Raw' devices use the device controller's DMA capability to transfer block contents directly to and from buffers in the process's address space.) Having made changes to the 'buffered' device, one must then judiciously use 'sync' to flush the updated blocks out to the 'physical' disk.

Contrariwise, there are cases where a tool has operated on the 'raw' device (such as use of 'icheck -s' to re-build the free list), and one has to ensure that the system does not overwrite one's repair by flushing its 'bad', in-core buffered copy of the changed block(s) out to the disk. In such cases, one will have to stop the machine as soon as the operation is complete, and re-boot it. (There are some corner cases where data is stored elsewhere in the operating system, such as when one is patching the inode of an open file, but these are ignored for the moment, as they are rarely encountered.)

Many of these issues can be bypassed if one has another disk to boot from; the damaged disk can then be examined and repaired at leisure, without it being 'mounted' (so the system will not have its own ideas of what the contents of the disk are). If patching by hand, which requires use of the non-raw disk, one still has to flush those changes out, but there will be no issues with the system having a contrary idea of the state of the disk.

Error types

This section describes the common error types, including those which can be safely ignored, and how to fix those which cannot. A damaged file system may contain more than one of these; in such cases, repair them one at a time (the worst one first), and check after each repair, since the repair may have created other problems (e.g. lost blocks after an operation which requires clearing an inode).

Lost blocks

A block which is not in any file, or the free list, can be safely ignored temporarily; it can be recaptured later by re-building the free list.

Duplicate blocks

A block being assigned to several different files generally means that the contents of all those files are likely damaged. A block appearing in both the free list and a file is also likely to have caused damage, but is easier to repair; a block appearing in the free list two or more times is similarly easy to repair. Both of these latter problems should be fixed ASAP; continuing to use the disk is likely to cause further damage to the contents.

The latter cases can be repaired by re-building the free list ('-s' to check/icheck). The 'easy' way to fix the first case is to: i) copy the second file to somewhere else, ii) delete the original of the second file, iii) re-build the free list (because the duplicate block will now be in both the first file, and the free list), iv) examine both files, and see which one has the smashed contents.
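The idea behind re-building the free list can be sketched in Python as follows: the existing free list is ignored, and every data block that no file claims is treated as free. This is a simplified model, not the tool's actual code; the real tools write a chained free list through the super-block, whereas this returns a plain list. The function and parameter names are hypothetical, and block number 0 is assumed to mean 'no block'.

```python
def rebuild_free_list(first_data_block, fs_size, file_blocks):
    """Return the block numbers that no file claims.

    file_blocks maps i-number -> list of block numbers used by that file.
    """
    used = set()
    for blocks in file_blocks.values():
        used.update(b for b in blocks if b != 0)  # 0 is assumed to mean 'no block'
    return [b for b in range(first_data_block, fs_size) if b not in used]
```

Because the old free list plays no part, a block that was both in a file and in the free list, or in the free list twice, ends up in exactly one place, and lost blocks are recaptured.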

(Note that check/icheck will not tell you which file was the first to claim a duplicate block, because it has already forgotten that by the time it discovers the second claimant - it only keeps a bit array of 'used' blocks.)
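A small Python model shows why only the later claimant can be named. It is an illustrative sketch, not the tools' code: one flag is kept per block (a byte here, where the real tools pack bits), the names are hypothetical, and out-of-range block numbers are simply skipped.

```python
def find_duplicates(file_blocks, fs_size):
    """Return (block, i-number) pairs for blocks already claimed earlier.

    file_blocks maps i-number -> list of block numbers used by that file.
    """
    used = bytearray(fs_size)  # one 'used' flag per block, nothing else
    dups = []
    for inum in sorted(file_blocks):
        for b in file_blocks[inum]:
            if not 0 < b < fs_size:
                continue  # 'no block' or out of range; not handled in this sketch
            if used[b]:
                dups.append((b, inum))  # the first claimant was never recorded
            else:
                used[b] = 1
    return dups
```

With files {3: [10, 11], 8: [11, 12]}, the result names block 11 and inode 8 only; inode 3 has to be found by other means.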

Link count too high

An inode with a link count higher than the number of links to it can be safely ignored temporarily. This, and similar errors below, will generally have to be corrected by hand.

Link count too low

An inode with a link count lower than the number of links to it will not cause an immediate problem (as long as none of the directory entries are deleted), but it should be repaired as soon as possible: if entries are deleted, the link count can reach zero, and the inode be freed, while other directory entries still point to it.

Link count zero

An inode which is marked as 'allocated', but has a zero link count, has several possible explanations; in general, it can be safely ignored temporarily. The most likely, if on the root file system, is that the system was stopped with pipes open. Such inodes may be cleared with 'clri', and any lost blocks retrieved with 'icheck -s'. Otherwise, the inode was likely 'lost' as a result of a directory being smashed. A link to it must be created by hand, and the link count adjusted by hand; the contents of the file can then be examined, to see what it was.

Unallocated inode

An inode which is not marked as 'allocated', but which has links to it, is problematic. The exact solution depends on the state of the contents, but an 'easy' repair is to i) delete all the directory entries which link to it, ii) use 'clri' to zero the inode, iii) re-build the free list to re-capture any 'lost' blocks.
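The inode error types described above can be summarised as a small decision function. This Python sketch is only a reading aid; the function name, argument names and labels are hypothetical, and 'entries' stands for the number of directory entries found that point at the inode.

```python
def classify_inode(allocated, nlink, entries):
    """Map an inode's state onto the error types described in this section."""
    if not allocated:
        # Links to an inode that is not marked 'allocated' are the problem case.
        return "unallocated inode" if entries > 0 else "free"
    if nlink == 0 and entries == 0:
        return "link count zero"  # e.g. a pipe left open when the system stopped
    if nlink > entries:
        return "link count too high"
    if nlink < entries:
        return "link count too low"
    return "ok"
```

For example, classify_inode(True, 1, 2) gives "link count too low", the case that should be repaired promptly.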

External links