Concepts Of Linux Programming – Files and Filesystem
The file is the most basic and fundamental abstraction in Linux. Linux follows the everything-is-a-file philosophy (although not as strictly as some other systems, such as Plan 9).Consequently, much interaction occurs via reading of and writing to files, even when the object in question is not what you would consider a normal file.
In order to be accessed, a file must first be opened. Files can be opened for reading, writing, or both. An open file is referenced via a unique descriptor, a mapping from the metadata associated with the open file back to the specific file itself. Inside the Linux kernel, this descriptor is handled by an integer (of the C type int) called the file descriptor, abbreviated fd. File descriptors are shared with user space, and are used directly by user programs to access files. A large part of Linux system programming consists of opening, manipulating, closing, and otherwise using file descriptors.
What most of us call “files” are what Linux labels regular files. A regular file contains bytes of data, organized into a linear array called a byte stream. In Linux, no further organization or formatting is specified for a file. The bytes may have any values, and they may be organized within the file in any way. At the system level, Linux does not enforce a structure upon files beyond the byte stream. Some operating systems, such as VMS, provide highly structured files, supporting concepts such as records. Linux does not.
Any of the bytes within a file may be read from or written to. These operations start at a specific byte, which is one’s conceptual “location” within the file. This location is called the file position or file offset. The file position is an essential piece of the metadata that the kernel associates with each open file. When a file is first opened, the file position is zero. Usually, as bytes in the file are read from or written to, byte-by-byte, the file position increases in kind. The file position may also be set manually to a given value, even a value beyond the end of the file. Writing a byte to a file position beyond the end of the file will cause the intervening bytes to be padded with zeros. While it is possible to write bytes in this manner to a position beyond the end of the file, it is not possible to write bytes to a position before the beginning of a file. Such a practice sounds nonsensical, and, indeed, would have little use. The file position starts at zero; it cannot be negative. Writing a byte to the middle of a file overwrites the byte previously located at that offset. Thus, it is not possible to expand a file by writing into the middle of it. Most file writing occurs at the end of the file. The file position’s maximum value is bounded only by the size of the C type used to store it, which is 64 bits on a modern Linux system.
The size of a file is measured in bytes and is called its length. The length, in other words, is simply the number of bytes in the linear array that make up the file. A file’s length can be changed via an operation called truncation. A file can be truncated to a new size smaller than its original size, which results in bytes being removed from the end of the file. Confusingly, given the operation’s name, a file can also be “truncated” to a new size larger than its original size. In that case, the new bytes (which are added to the end of the file) are filled with zeros. A file may be empty (that is, have a length of zero), and thus contain no valid bytes. The maximum file length, as with the maximum file position, is bounded only by limits on the sizes of the C types that the Linux kernel uses to manage files. Specific filesystems, however, may impose their own restrictions, imposing a smaller ceiling on the maximum length.
A single file can be opened more than once, by a different or even the same process. Each open instance of a file is given a unique file descriptor. Conversely, processes can share their file descriptors, allowing a single descriptor to be used by more than one process. The kernel does not impose any restrictions on concurrent file access. Multiple processes are free to read from and write to the same file at the same time. The results of such concurrent accesses rely on the ordering of the individual operations, and are generally unpredictable. User-space programs typically must coordinate amongst themselves to ensure that concurrent file accesses are properly synchronized.
Although files are usually accessed via filenames, they actually are not directly associated with such names. Instead, a file is referenced by an inode (originally short for information node), which is assigned an integer value unique to the filesystem (but not necessarily unique across the whole system). This value is called the inode number, often abbreviated as i-number or ino. An inode stores metadata associated with a file, such as its modification timestamp, owner, type, length, and the location of the file’s data—but no filename! The inode is both a physical object, located on disk in Unix-style filesystems, and a conceptual entity, represented by a data structure in the Linux kernel.
Directories and links
Accessing a file via its inode number is cumbersome (and also a potential security hole), so files are always opened from user space by a name, not an inode number. Directories are used to provide the names with which to access files. A directory acts as a mapping of human-readable names to inode numbers. A name and inode pair is called a link. The physical on-disk form of this mapping—for example, a simple table or a hash—is implemented and managed by the kernel code that supports a given filesystem. Conceptually, a directory is viewed like any normal file, with the difference that it contains only a mapping of names to inodes. The kernel directly uses this mapping to perform name-to-inode resolutions.
When a user-space application requests that a given filename be opened, the kernel opens the directory containing the filename and searches for the given name. From the filename, the kernel obtains the inode number. From the inode number, the inode is found. The inode contains metadata associated with the file, including the on-disk location of the file’s data.
Initially, there is only one directory on the disk, the root directory. This directory is usually denoted by the path /. But, as we all know, there are typically many directories on a system. How does the kernel know whichdirectory to look in to find a given filename?
As mentioned previously, directories are much like regular files. Indeed, they even have associated inodes. Consequently, the links inside of directories can point to the inodes of other directories. This means directories can nest inside of other directories, forming a hierarchy of directories. This, in turn, allows for the use of the pathnames with which all Unix users are familiar—for example,/home/blackbeard/concorde.png.
When the kernel is asked to open a pathname like this, it walks each directory entry (called a dentry inside of the kernel) in the pathname to find the inode of the next entry. In the preceding example, the kernel starts at /, gets the inode for home, goes there, gets the inode for blackbeard, runs there, and finally gets the inode for concorde.png. This operation is called directory or pathname resolution. The Linux kernel also employs a cache, called the dentry cache, to store the results of directory resolutions, providing for speedier lookups in the future given temporal locality.
A pathname that starts at the root directory is said to be fully qualified, and is called an absolute pathname. Some pathnames are not fully qualified; instead, they are provided relative to some other directory (for example, todo/plunder). These paths are called relative pathnames. When provided with a relative pathname, the kernel begins the pathname resolution in the current working directory. From the current working directory, the kernel looks up the directory todo. From there, the kernel gets the inode for plunder. Together, the combination of a relative pathname and the current working directory is fully qualified.
Although directories are treated like normal files, the kernel does not allow them to be opened and manipulated like regular files. Instead, they must be manipulated using a special set of system calls. These system calls allow for the adding and removing of links, which are the only two sensible operations anyhow. If user space were allowed to manipulate directories without the kernel’s mediation, it would be too easy for a single simple error to corrupt the filesystem.
Conceptually, nothing covered thus far would prevent multiple names resolving to the same inode. Indeed, this is allowed. When multiple links map different names to the same inode, we call them hard links.
Hard links allow for complex filesystem structures with multiple pathnames pointing to the same data. The hard links can be in the same directory, or in two or more different directories. In either case, the kernel simply resolves the pathname to the correct inode. For example, a specific inode that points to a specific chunk of data can be hard linked from /home/bluebeard/treasure.txtand /home/blackbeard/to_steal.txt.
Deleting a file involves unlinking it from the directory structure, which is done simply by removing its name and inode pair from a directory. Because Linux supports hard links, however, the filesystem cannot destroy the inode and its associated data on every unlink operation. What if another hard link existed elsewhere in the filesystem? To ensure that a file is not destroyed until all links to it are removed, each inode contains a link count that keeps track of the number of links within the filesystem that point to it. When a pathname is unlinked, the link count is decremented by one; only when it reaches zero are the inode and its associated data actually removed from the filesystem.
Hard links cannot span filesystems because an inode number is meaningless outside of the inode’s own filesystem. To allow links that can span filesystems, and that are a bit simpler and less transparent, Unix systems also implementsymbolic links (often shortened to symlinks).
Symbolic links look like regular files. A symlink has its own inode and data chunk, which contains the complete pathname of the linked-to file. This means symbolic links can point anywhere, including to files and directories that reside on different filesystems, and even to files and directories that do not exist. A symbolic link that points to a nonexistent file is called a broken link.
Symbolic links incur more overhead than hard links because resolving a symbolic link effectively involves resolving two files: the symbolic link and then the linked-to file. Hard links do not incur this additional overhead—there is no difference between accessing a file linked into the filesystem more than once and one linked only once. The overhead of symbolic links is minimal, but it is still considered a negative.
Symbolic links are also more opaque than hard links. Using hard links is entirely transparent; in fact, it takes effort to find out that a file is linked more than once! Manipulating symbolic links, on the other hand, requires special system calls. This lack of transparency is often considered a positive, as the link structure is explicitly made plain, with symbolic links acting more as shortcutsthan as filesystem-internal links.
Special files are kernel objects that are represented as files. Over the years, Unix systems have supported a handful of different special files. Linux supports four: block device files, character device files, named pipes, and Unix domain sockets. Special files are a way to let certain abstractions fit into the filesystem, continuing the everything-is-a-file paradigm. Linux provides a system call to create a special file.
Device access in Unix systems is performed via device files, which act and look like normal files residing on the filesystem. Device files may be opened, read from, and written to, allowing user space to access and manipulate devices (both physical and virtual) on the system. Unix devices are generally broken into two groups: character devices and block devices. Each type of device has its own special device file.
A character device is accessed as a linear queue of bytes. The device driver places bytes onto the queue, one by one, and user space reads the bytes in the order that they were placed on the queue. A keyboard is an example of a character device. If the user types “peg,” for example, an application would want to read from the keyboard device the p, the e, and, finally, the g, in exactly that order. When there are no more characters left to read, the device returns end-of-file (EOF). Missing a character, or reading them in any other order, would make little sense. Character devices are accessed via character device files.
A block device, in contrast, is accessed as an array of bytes. The device driver maps the bytes over a seekable device, and user space is free to access any valid bytes in the array, in any order—it might read byte 12, then byte 7, and then byte 12 again. Block devices are generally storage devices. Hard disks, floppy drives, CD-ROM drives, and flash memory are all examples of block devices. They are accessed via block device files.
Named pipes (often called FIFOs, short for “first in, first out”) are aninterprocess communication (IPC) mechanism that provides a communication channel over a file descriptor, accessed via a special file. Regular pipes are the method used to “pipe” the output of one program into the input of another; they are created in memory via a system call and do not exist on any filesystem. Named pipes act like regular pipes but are accessed via a file, called a FIFO special file. Unrelated processes can access this file and communicate.
Sockets are the final type of special file. Sockets are an advanced form of IPC that allow for communication between two different processes, not only on the same machine, but even on two different machines. In fact, sockets form the basis of network and Internet programming. They come in multiple varieties, including the Unix domain socket, which is the form of socket used for communication within the local machine. Whereas sockets communicating over the Internet might use a hostname and port pair for identifying the target of communication, Unix domain sockets use a special file residing on a filesystem, often simply called a socket file.
Filesystems and namespaces
Linux, like all Unix systems, provides a global and unified namespace of files and directories. Some operating systems separate different disks and drives into separate namespaces—for example, a file on a floppy disk might be accessible via the pathname A:\plank.jpg, while the hard drive is located atC:\. In Unix, that same file on a floppy might be accessible via the pathname/media/floppy/plank.jpg or even via /home/captain/stuff/plank.jpg, right alongside files from other media. That is, on Unix, the namespace is unified.
A filesystem is a collection of files and directories in a formal and valid hierarchy. Filesystems may be individually added to and removed from the global namespace of files and directories. These operations are calledmounting and unmounting. Each filesystem is mounted to a specific location in the namespace, known as a mount point. The root directory of the filesystem is then accessible at this mount point. For example, a CD might be mounted at/media/cdrom, making the root of the filesystem on the CD accessible at/media/cdrom. The first filesystem mounted is located in the root of the namespace, /, and is called the root filesystem. Linux systems always have a root filesystem. Mounting other filesystems at other mount points is optional.
Filesystems usually exist physically (i.e., are stored on disk), although Linux also supports virtual filesystems that exist only in memory, and network filesystems that exist on machines across the network. Physical filesystems reside on block storage devices, such as CDs, floppy disks, compact flash cards, or hard drives. Some such devices are partionable, which means that they can be divided up into multiple filesystems, all of which can be manipulated individually. Linux supports a wide range of filesystems—certainly anything that the average user might hope to come across—including media-specific filesystems (for example, ISO9660), network filesystems (NFS), native filesystems (ext4), filesystems from other Unix systems (XFS), and even filesystems from non-Unix systems (FAT).
The smallest addressable unit on a block device is the sector. The sector is a physical attribute of the device. Sectors come in various powers of two, with 512 bytes being quite common. A block device cannot transfer or access a unit of data smaller than a sector and all I/O must occur in terms of one or more sectors.
Likewise, the smallest logically addressable unit on a filesystem is the block. The block is an abstraction of the filesystem, not of the physical media on which the filesystem resides. A block is usually a power-of-two multiple of the sector size. In Linux, blocks are generally larger than the sector, but they must be smaller than the page size (the smallest unit addressable by the memory management unit, a hardware component). Common block sizes are 512 bytes, 1 kilobyte, and 4 kilobytes.
Historically, Unix systems have only a single shared namespace, viewable by all users and all processes on the system. Linux takes an innovative approach and supports per-process namespaces, allowing each process to optionally have a unique view of the system’s file and directory hierarchy. By default, each process inherits the namespace of its parent, but a process may elect to create its own namespace with its own set of mount points and a unique root directory.