NAME¶
aufs - advanced multi layered unification filesystem. version 3.12-20131111
DESCRIPTION¶
Aufs is a stackable unification filesystem such as Unionfs, which unifies
several directories and provides a merged single directory. In the early days,
aufs was entirely re-designed and re-implemented Unionfs Version 1.x series.
After many original ideas, approaches and improvements, it becomes totally
different from Unionfs while keeping the basic features. See Unionfs Version
1.x series for the basic features. Recently, Unionfs Version 2.x series begin
taking some of same approaches to aufs's.
MOUNT OPTIONS¶
At mount-time, the order of interpreting options is,
- \(bu
- simple flags, except xino/noxino and udba=notify
- \(bu
- branches
- \(bu
- xino/noxino
- \(bu
- udba=notify
At remount-time, the options are interpreted in the given order, e.g. left to
right.
- \(bu
- create or remove whiteout-base(.wh..wh.aufs) and whplink-dir(.wh..wh.plnk)
if necessary
- br:BRANCH[:BRANCH ...] (dirs=BRANCH[:BRANCH ...])
- Adds new branches. (cf. Branch Syntax).
Aufs rejects the branch which is an ancestor or a descendant of another
branch. It is called overlapped. When the branch is loopback-mounted
directory, aufs also checks the source fs-image file of loopback device.
If the source file is a descendant of another branch, it will be rejected
too.
After mounting aufs or adding a branch, if you move a branch under another
branch and make it descendant of another branch, aufs will not work
correctly. By default, aufs prohibits such operation internally, but there
left a way to do. (cf. Branch Syntax).
- [ add | ins ]:index:BRANCH
- Adds a new branch. The index begins with 0. Aufs creates
whiteout-base(.wh..wh.aufs) and whplink-dir(.wh..wh.plnk) if necessary.
If there is the same named file on the lower branch (larger index), aufs
will hide the lower file. You can only see the highest file. You will be
confused if the added branch has whiteouts (including diropq), they may or
may not hide the lower entries. (cf. DIAGNOSTICS).
Even if a process have once mapped a file by mmap(2) with MAP_SHARED and the
same named file exists on the lower branch, the process still refers the
file on the lower(hidden) branch after adding the branch. If you want to
update the contents of a process address space after adding, you need to
restart your process or open/mmap the file again. (cf. Branch
Syntax).
- del:dir
- Removes a branch. Aufs does not remove whiteout-base(.wh..wh.aufs) and
whplink-dir(.wh..wh.plnk) automatically. For example, when you add a RO
branch which was unified as RW, you will see whiteout-base or whplink-dir
on the added RO branch.
If a process is referencing the file/directory on the deleting branch (by
open, mmap, current working directory, etc.), aufs will return an error
EBUSY. In this case, a script ‘aubusy’ (in aufs-util.git and
aufs2-util.git) is useful to identify which process (and which file) makes
the branch busy.
- mod:BRANCH
- Modifies the permission flags of the branch. Aufs creates or removes
whiteout-base(.wh..wh.aufs) and/or whplink-dir(.wh..wh.plnk) if necessary.
If the branch permission is been changing ‘rw’ to
‘ro’, and a process is mapping a file by mmap(2) on the
branch, the process may or may not be able to modify its mapped memory
region after modifying branch permission flags. Additioanlly when you
enable CONFIG_IMA (in linux-2.6.30 and later), IMA may produce some wrong
messages. But this is equivalent when the filesystem is changed
‘ro’ in emergency. (cf. Branch Syntax).
- append:BRANCH
- equivalent to ‘add:(last index + 1):BRANCH’. (cf. Branch
Syntax).
- prepend:BRANCH
- equivalent to ‘add:0:BRANCH.’ (cf. Branch Syntax).
- xino=filename
- Use external inode number bitmap and translation table. When
CONFIG_AUFS_EXPORT is enabled, external inode generation table too. It is
set to <FirstWritableBranch>/.aufs.xino by default, or
/tmp/.aufs.xino. Comma character in filename is not allowed.
The files are created per an aufs and per a branch filesystem, and unlinked.
So you cannot find this file, but it exists and is read/written frequently
by aufs. When the specified file already exists, then mount(8) returns an
error. (cf. External Inode Number Bitmap, Translation Table and Generation
Table).
If you enable CONFIG_SYSFS, the path of xino files are not shown in
/proc/mounts (and /etc/mtab), instead it is shown in
<sysfs>/fs/aufs/si_<id>/xi_path. Otherwise, it is shown in
/proc/mounts unless it is not the default path.
- noxino
- Stop using external inode number bitmap and translation table.
If you use this option, Some applications will not work correctly. (cf.
External Inode Number Bitmap, Translation Table and Generation
Table).
- trunc_xib
- Truncate the external inode number bitmap file. The truncation is done
automatically when you delete a branch unless you do not specify
‘notrunc_xib’ option. (cf. External Inode Number Bitmap,
Translation Table and Generation Table).
- notrunc_xib
- Stop truncating the external inode number bitmap file when you delete a
branch. (cf. External Inode Number Bitmap, Translation Table and
Generation Table).
- trunc_xino_path=BRANCH | itrunc_xino=INDEX
- Truncate the external inode number translation table per branch. The
branch can be specified by path or index (its origin is 0). Sometimes the
size of a xino file for tmpfs branch grows very big. If you don't like
such situation, try "mount -o remount,trunc_xino_path=BRANCH
/your/aufs" (or itrunc_xino=INDEX). It will shrink the xino file for
BRANCH. These options are one time actions. So the size may grow again. In
order to make it work automatically when necessary, try trunc_xino option.
These options are already implemented, but its design is not fixed (cf.
External Inode Number Bitmap, Translation Table and Generation
Table).
- trunc_xino | notrunc_xino
- Enable (or disable) the automatic truncation of xino files. The truncation
is done by discarding the internal "hole" (unused blocks). The
default is notrunc_xino. These options are already implemented, but its
design is not fixed (cf. External Inode Number Bitmap, Translation Table
and Generation Table).
TODO: costomizable two values for upper limit
- create_policy | create=CREATE_POLICY
-
- copyup_policy | copyup | cpup=COPYUP_POLICY
- Policies to select one among multiple writable branches. The default
values are ‘create=tdp’ and ‘cpup=tdp’.
link(2) and rename(2) systemcalls have an exception. In aufs, they try
keeping their operations in the branch where the source exists. (cf.
Policies to Select One among Multiple Writable Branches).
- dio
- Enable Direct I/O support (including Linux AIO), and always make open(2)
with O_DIRECT success. But if your branch filessystem doesn't support it,
then the succeeding I/O will fail (cf, Direct I/O).
- nodio
- Disable Direct I/O (including Linux AIO), and always make open(2) with
O_DIRECT fail. This is default value (cf, Direct I/O).
- verbose | v
- Print some information. Currently, it is only busy file (or inode) at
deleting a branch.
- noverbose | quiet | q | silent
- Disable ‘verbose’ option. This is default value.
- sum
- df(1)/statfs(2) returns the total number of blocks and inodes of all
branches. When the block size of all branches are not equal, aufs choses
the smallest one and calculate the number of blocks (including bavail and
bfree). Note that there are cases that systemcalls may return ENOSPC, even
if df(1)/statfs(2) shows that aufs has some free space/inode.
- nosum
- Disable ‘sum’ option. This is default value.
- dirwh=N
- Watermark to remove a dir actually at rmdir(2) and rename(2).
If the target dir which is being removed or renamed (destination dir) has a
huge number of whiteouts, i.e. the dir is empty logically but physically,
the cost to remove/rename the single dir may be very high. It is required
to unlink all of whiteouts internally before issuing rmdir/rename to the
branch. To reduce the cost of single systemcall, aufs renames the target
dir to a whiteout-ed temporary name and invokes a pre-created kernel
thread to remove whiteout-ed children and the target dir. The rmdir/rename
systemcall returns just after kicking the thread.
When the number of whiteout-ed children is less than the value of dirwh,
aufs remove them in a single systemcall instead of passing another thread.
This value is ignored when the branch is NFS. The default value is 3.
- rdblk=N
- Specifies a size of internal VDIR block which is allocated at a time in
byte. The VDIR block will be allocated several times when necessary. If
your directory has millions of files, you may want to expand this size.
The default value is defined as 512. The size has to be lager than
NAME_MAX (usually 255) and kmalloc-able (the maximum limit depends on your
system. at least 128KB is available for every system). If you set it to
zero, then the internal estimation for the directory size becomes ON, and
aufs sets the value for the directory individually. Sometimes the
estimated value may be inappropriate since the estimation is not so
clever. Setting zero is useful when you use RDU (cf. VDIR/readdir(3) in
user-space (RDU). Otherwise it may be a pressure for kernel memory space.
Anytime you can reset the value to default by specifying rdblk=def. (cf.
Virtual or Vertical Directory Block).
- rdhash=N
- Specifies a size of internal VDIR hash table which is used to compare the
file names under the same named directory on multiple branches. The VDIR
hash table will be allocated in readdir(3)/getdents(2), rmdir(2) and
rename(2) for the existing target directory. If your directory has
millions of files, you may want to expand this size. The default value is
defined as 32. The size has to be lager than zero, and it will be
multiplied by 4 or 8 (for 32-bit and 64-bit respectively, currently). The
result must be kmalloc-able (the maximum limit depends on your system. at
least 128KB is available for every system). If you set it to zero, then
the internal estimation for the directory becomes ON, and aufs sets the
value for the directory individually. Sometimes the estimated value may be
inappropriate since the estimation is not so clever. Setting zero is
useful when you use RDU (cf. VDIR/readdir(3) in user-space (RDU).
Otherwise it may be a pressure for kernel memory space. Anytime you can
reset the value to default by specifying rdhash=def. (cf. Virtual or
Vertical Directory Block).
- plink
-
- noplink
- Specifies to use ‘pseudo link’ feature or not. The default
is ‘plink’ which means use this feature. (cf. Pseudo
Link)
- clean_plink
- Removes all pseudo-links in memory. In order to make pseudo-link
permanent, use ‘auplink’ utility just before one of these
operations, unmounting aufs, using ‘ro’ or
‘noplink’ mount option, deleting a branch from aufs, adding
a branch into aufs, or changing your writable branch as readonly. If you
installed both of /sbin/mount.aufs and /sbin/umount.aufs, and your
mount(8) and umount(8) support them, ‘auplink’ utility will
be executed automatically and flush pseudo-links. (cf. Pseudo Link)
- udba=none | reval | notify
- Specifies the level of UDBA (User's Direct Branch Access) test. (cf.
User's Direct Branch Access and Inotify Limitation).
- diropq=whiteouted | w | always | a
- Specifies whether mkdir(2) and rename(2) dir case make the created
directory ‘opaque’ or not. In other words, to create
‘.wh..wh..opq’ under the created or renamed directory, or
not to create. When you specify diropq=w or diropq=whiteouted, aufs will
not create it if the directory was not whiteouted or opaqued. If the
directory was whiteouted or opaqued, the created or renamed directory will
be opaque. When you specify diropq=a or diropq==always, aufs will always
create it regardless the directory was whiteouted/opaqued or not. The
default value is diropq=w, it means not to create when it is
unnecessary.
- warn_perm
-
- nowarn_perm
- Adding a branch, aufs will issue a warning about uid/gid/permission of the
adding branch directory, when they differ from the existing branch's. This
difference may or may not impose a security risk. If you are sure that
there is no problem and want to stop the warning, use
‘nowarn_perm’ option. The default is
‘warn_perm’ (cf. DIAGNOSTICS).
- shwh
-
- noshwh
- By default (noshwh), aufs doesn't show the whiteouts and they just hide
the same named entries in the lower branches. The whiteout itself also
never be appeared. If you enable CONFIG_AUFS_SHWH and specify
‘shwh’ option, aufs will show you the name of whiteouts with
keeping its feature to hide the lowers. Honestly speaking, I am rather
confused with this ‘visible whiteouts.’ But a user who
originally requested this feature wrote a nice how-to document about this
feature. See Tips file in the aufs CVS tree.
Module Parameters¶
- brs=1 | 0
- Specifies to use the branch path data file under sysfs or not.
If the number of your branches is large or their path is long and you meet
the limitation of mount(8) ro /etc/mtab, you need to enable CONFIG_SYSFS
and set aufs module parameter brs=1.
When this parameter is set as 1, aufs does not show ‘br:’ (or
dirs=) mount option through /proc/mounts (and /etc/mtab). So you can keep
yourself from the page limitation of mount(8) or /etc/mtab. Aufs shows
branch paths through <sysfs>/fs/aufs/si_XXX/brNNN. Actually the file
under sysfs has also a size limitation, but I don't think it is harmful.
There is one more side effect in setting 1 to this parameter. If you rename
your branch, the branch path written in /etc/mtab will be obsoleted and
the future remount will meet some error due to the unmatched parameters
(Remember that mount(8) may take the options from /etc/mtab and pass them
to the systemcall). If you set 1, /etc/mtab will not hold the branch path
and you will not meet such trouble. On the other hand, the entries for the
branch path under sysfs are generated dynamically. So it must not be
obsoleted. But I don't think users want to rename branches so often.
If CONFIG_SYSFS is disable, this parameter is always set to 0.
- sysrq=key
- Specifies MagicSysRq key for debugging aufs. You need to enable both of
CONFIG_MAGIC_SYSRQ and CONFIG_AUFS_DEBUG. Currently this is for developers
only. The default is ‘a’.
- debug= 0 | 1
- Specifies disable(0) or enable(1) debug print in aufs. This parameter can
be changed dynamically. You need to enable CONFIG_AUFS_DEBUG. Currently
this is for developers only. The default is ‘0’ (disable).
Entries under Sysfs and Debugfs¶
See linux/Documentation/ABI/*/{sys,debug}fs-aufs.
Branch Syntax¶
- dir_path[ =permission [ + attribute ] ]
-
- permission := rw | ro | rr
-
- attribute := wh | nolwh
- dir_path is a directory path. The keyword after ‘dir_path=’
is a permission flags for that branch. Comma, colon and the permission
flags string (including ‘=’)in the path are not allowed.
Any filesystem can be a branch, But some are not accepted such like sysfs,
procfs and unionfs. If you specify such filesystems as an aufs branch,
aufs will return an error saying it is unsupported.
Cramfs in linux stable release has strange inodes and it makes aufs
confused. For example,
$ mkdir -p w/d1 w/d2
$ > w/z1
$ > w/z2
$ mkcramfs w cramfs
$ sudo mount -t cramfs -o ro,loop cramfs /mnt
$ find /mnt -ls
76 1 drwxr-xr-x 1 jro 232 64 Jan 1 1970 /mnt
1 1 drwxr-xr-x 1 jro 232 0 Jan 1 1970 /mnt/d1
1 1 drwxr-xr-x 1 jro 232 0 Jan 1 1970 /mnt/d2
1 1 -rw-r--r-- 1 jro 232 0 Jan 1 1970 /mnt/z1
1 1 -rw-r--r-- 1 jro 232 0 Jan 1 1970 /mnt/z2
All these two directories and two files have the same inode with one as
their link count. Aufs cannot handle such inode correctly. Currently, aufs
involves a tiny workaround for such inodes. But some applications may not
work correctly since aufs inode number for such inode will change
silently. If you do not have any empty files, empty directories or special
files, inodes on cramfs will be all fine.
A branch should not be shared as the writable branch between multiple aufs.
A readonly branch can be shared.
The maximum number of branches is configurable at compile time (127 by
default).
When an unknown permission or attribute is given, aufs sets ro to that
branch silently.
Permission¶
- rw
- Readable and writable branch. Set as default for the first branch. If the
branch filesystem is mounted as readonly, you cannot set it
‘rw.’
- ro
- Readonly branch and it has no whiteouts on it. Set as default for all
branches except the first one. Aufs never issue both of write operation
and lookup operation for whiteout to this branch.
- rr
- Real readonly branch, special case of ‘ro’, for natively
readonly branch. Assuming the branch is natively readonly, aufs can
optimize some internal operation. For example, if you specify
‘udba=notify’ option, aufs does not set fsnotify or inotify
for the things on rr branch. Set by default for a branch whose fs-type is
either ‘iso9660’, ‘cramfs’ or
‘romfs’ (and ‘squashfs’ for linux-2.6.29 and
later).
When your branch exists on slower device and you have some capacity on your
hdd, you may want to try ulobdev tool in ULOOP sample. It can cache the
contents of the real devices on another faster device, so you will be able
to get the better access performance. The ulobdev tool is for a generic
block device, and the ulohttp is for a filesystem image on http server. If
you want to spin down your hdd to save the battery life or something, then
you may want to use ulobdev to save the access to the hdd, too. See
$AufsCVS/sample/uloop in detail.
Attribute¶
- wh
- Readonly branch and it has/might have whiteouts on it. Aufs never issue
write operation to this branch, but lookup for whiteout. Use this as
‘<branch_dir>=ro+wh’.
- nolwh
- Usually, aufs creates a whiteout as a hardlink on a writable branch. This
attributes prohibits aufs to create the hardlinked whiteout, including the
source file of all hardlinked whiteout (.wh..wh.aufs.) If you do not like
a hardlink, or your writable branch does not support link(2), then use
this attribute. But I am afraid a filesystem which does not support
link(2) natively will fail in other place such as copy-up. Use this as
‘<branch_dir>=rw+nolwh’. Also you may want to try
‘noplink’ mount option, while it is not recommended.
- unpin
- By default, aufs sets ‘pin’ to the branch dir, which means
that users cannot remove nor rename the branch top dir as if it were a
mount-point. In some cases and some users may need to rename the branch
top dir. So this attribute is implemented. If you specify
‘unpin’ as a branch attribute, it stops behaving as a
mount-point and you can rename the branch top dir. Needless to say, if you
remove the branch top dir, then aufs cannot work.
External Inode Number Bitmap, Translation Table and Generation Table (xino)¶
Aufs uses one external bitmap file and one external inode number translation
table files per an aufs and per a branch filesystem by default. Additionally
when CONFIG_AUFS_EXPORT is enabled, one external inode generation table is
added. The bitmap (and the generation table) is for recycling aufs inode
number and the others are a table for converting an inode number on a branch
to an aufs inode number. The default path is ‘first writable
branch’/.aufs.xino. If there is no writable branch, the default path
will be /tmp/.aufs.xino.
If you enable CONFIG_SYSFS, the path of xino files are not shown in /proc/mounts
(and /etc/mtab), instead it is shown in
<sysfs>/fs/aufs/si_<id>/xi_path. Otherwise, it is shown in
/proc/mounts unless it is not the default path.
Those files are always opened and read/write by aufs frequently. If your
writable branch is on flash memory device, it is recommended to put xino files
on other than flash memory by specifying ‘xino=’ mount option.
The maximum file size of the bitmap is, basically, the amount of the number of
all the files on all branches divided by 8 (the number of bits in a byte). For
example, on a 4KB page size system, if you have 32,768 (or 2,599,968) files in
aufs world, then the maximum file size of the bitmap is 4KB (or 320KB).
The maximum file size of the table will be ‘max inode number on the
branch x size of an inode number’. For example in 32bit environment,
$ df -i /branch_fs
/dev/hda14 2599968 203127 2396841 8% /branch_fs
and /branch_fs is an branch of the aufs. When the inode number is assigned
contiguously (without ‘hole’), the maximum xino file size for
/branch_fs will be 2,599,968 x 4 bytes = about 10 MB. But it might not be
allocated all of disk blocks. When the inode number is assigned
discontinuously, the maximum size of xino file will be the largest inode
number on a branch x 4 bytes. Additionally, the file size is limited to
LLONG_MAX or the s_maxbytes in filesystem's superblock (s_maxbytes may be
smaller than LLONG_MAX). So the support-able largest inode number on a branch
is less than 2305843009213693950 (LLONG_MAX/4-1). This is the current
limitation of aufs. On 64bit environment, this limitation becomes more strict
and the supported largest inode number is less than LLONG_MAX/8-1. In order to
estimate the size of the table for your readonly branch fs, try
$ echo $((4 * $(sudo find /branch_fs -xdev -printf "%i\n" |
sort -n | tail -n 1)))
For 64bit environment, replace 4 by 8 in above equation.
The xino files are always hidden, i.e. removed. So you cannot do ‘ls -l
xino_file’. If you enable CONFIG_DEBUG_FS, you can check these
information through <debugfs>/aufs/<si_id>/{xib,xi[0-9]*,xigen}.
xib is for the bitmap file, xi0 ix for the first branch, and xi1 is for the
next. xigen is for the generation table. xib and xigen are in the format of,
<blocks>x<block size> <file size>
Note that a filesystem usually has a feature called pre-allocation, which means
a number of blocks are allocated automatically, and then deallocated silently
when the filesystem thinks they are unnecessary. You do not have to be
surprised the sudden changes of the number of blocks, when your filesystem
which xino files are placed supports the pre-allocation feature.
The rests are hidden xino file information in the format of,
<file count>, <blocks>x<block size> <file size>
If the file count is larger than 1, it means some of your branches are on the
same filesystem and the xino file is shared by them. Note that the file size
may not be equal to the actual consuming blocks since xino file is a sparse
file, i.e. a hole in a file which does not consume any disk blocks.
Once you unmount aufs, the xino files for that aufs are totally gone. It means
that the inode number is not permanent across umount or shutdown.
The xino files should be created on the filesystem except NFS. If your first
writable branch is NFS, you will need to specify xino file path other than
NFS. Also if you are going to remove the branch where xino files exist or
change the branch permission to readonly, you need to use xino option before
del/mod the branch.
The bitmap file and the table can be truncated. For example, if you delete a
branch which has huge number of files, many inode numbers will be recycled and
the bitmap will be truncated to smaller size. Aufs does this automatically
when a branch is deleted. You can truncate it anytime you like if you specify
‘trunc_xib’ mount option. But when the accessed inode number was
not deleted, nothing will be truncated. The truncation is essentially
equivalent to
$ cp --sparse=always <current xino file> <new xino file> &&
rm <current xino file>
It means that you have two xino files during the copy, and you should pay
attention to the free space of the filesystem where the xino file is located.
If the free space is not large enough to hold two xino files temporary during
the copy, then the truncation fails and the xino file will go on growing. For
such case, you should move the xino file to another larger partition, and move
it back to where it was (if you want). To do this, use ‘xino=’
mount option. During this move, the xino file is truncated automatically.
If you do not want to truncate it (it may be slow) when you delete a branch,
specify ‘notrunc_xib’ after ‘del’ mount option.
For the table, see trunc_xino_path=BRANCH, itrunc_xino=INDEX, trunc_xino and
notrunc_xino option.
If you do not want to use xino, use noxino mount option. Use this option with
care, since the inode number may be changed silently and unexpectedly anytime.
For example, rmdir failure, recursive chmod/chown/etc to a large and deep
directory or anything else. And some applications will not work correctly. If
you want to change the xino default path, use xino mount option.
After you add branches, the persistence of inode number may not be guaranteed.
At remount time, cached but unused inodes are discarded. And the newly
appeared inode may have different inode number at the next access time. The
inodes in use have the persistent inode number.
When aufs assigned an inode number to a file, and if you create the same named
file on the upper branch directly, then the next time you access the file,
aufs may assign another inode number to the file even if you use xino option.
Some applications may treat the file whose inode number has been changed as
totally different file.
Pseudo Link (hardlink over branches)¶
Aufs supports ‘pseudo link’ which is a logical hard-link over
branches (cf.
ln(1) and
link(2)). In other words, a copied-up file by
link(2)
and a copied-up file which was hard-linked on a readonly branch filesystem.
When you have files named fileA and fileB which are hardlinked on a readonly
branch, if you write something into fileA, aufs copies-up fileA to a writable
branch, and
write(2) the originally requested thing to the copied-up fileA. On
the writable branch, fileA is not hardlinked. But aufs remembers it was
hardlinked, and handles fileB as if it existed on the writable branch, by
referencing fileA's inode on the writable branch as fileB's inode.
Once you unmount aufs, the plink info for that aufs kept in memory are totally
gone. It means that the pseudo-link is not permanent. If you want to make
plink permanent, try ‘auplink’ utility just before one of these
operations, unmounting your aufs, using ‘ro’ or
‘noplink’ mount option, deleting a branch from aufs, adding a
branch into aufs, or changing your writable branch to readonly.
This utility will reproduces all real hardlinks on a writable branch by linking
them, and removes pseudo-link info in memory and temporary link on the
writable branch. Since this utility access your branches directly, you cannot
hide them by ‘mount --bind /tmp /branch’ or something.
If you are willing to rebuild your aufs with the same branches later, you should
use auplink utility before you umount your aufs. If you installed both of
/sbin/mount.aufs and /sbin/umount.aufs, and your
mount(8) and
umount(8)
support them, ‘auplink’ utility will be executed automatically
and flush pseudo-links.
During this utility is running, it puts aufs into the pseudo-link maintenance
mode. In this mode, only the process which began the maintenance mode (and its
child processes) is allowed to operate in aufs. Some other processes which are
not related to the pseudo-link will be allowed to run too, but the rest have
to return an error or wait until the maintenance mode ends. If a process
already acquires an inode mutex (in VFS), it has to return an error.
Due to the fact that the pseudo-link maintenance mode is operated via procfs,
the pseudo-link feature itself (including the related mount options) depends
upon CONFIG_PROC_FS too.
# auplink /your/aufs/root flush
# umount /your/aufs/root
or
# auplink /your/aufs/root flush
# mount -o remount,mod:/your/writable/branch=ro /your/aufs/root
or
# auplink /your/aufs/root flush
# mount -o remount,noplink /your/aufs/root
or
# auplink /your/aufs/root flush
# mount -o remount,del:/your/aufs/branch /your/aufs/root
or
# auplink /your/aufs/root flush
# mount -o remount,append:/your/aufs/branch /your/aufs/root
The plinks are kept both in memory and on disk. When they consumes too much
resources on your system, you can use the ‘auplink’ utility at
anytime and throw away the unnecessary pseudo-links in safe.
Additionally, the ‘auplink’ utility is very useful for some
security reasons. For example, when you have a directory whose permission
flags are 0700, and a file who is 0644 under the 0700 directory. Usually, all
files under the 0700 directory are private and no one else can see the file.
But when the directory is 0711 and someone else knows the 0644 filename, he
can read the file.
Basically, aufs pseudo-link feature creates a temporary link under the directory
whose owner is root and the permission flags are 0700. But when the writable
branch is NFS, aufs sets 0711 to the directory. When the 0644 file is
pseudo-linked, the temporary link, of course the contents of the file is
totally equivalent, will be created under the 0711 directory. The filename
will be generated by its inode number. While it is hard to know the generated
filename, someone else may try peeping the temporary pseudo-linked file by his
software tool which may try the name from one to MAX_INT or something. In this
case, the 0644 file will be read unexpectedly. I am afraid that leaving the
temporary pseudo-links can be a security hole. It makes sense to execute
‘auplink /your/aufs/root flush’ periodically, when your writable
branch is NFS.
When your writable branch is not NFS, or all users are careful enough to set
0600 to their private files, you do not have to worry about this issue.
If you do not want this feature, use ‘noplink’ mount option.
The behaviours of plink and noplink¶
This sample shows that the ‘f_src_linked2’ with
‘noplink’ option cannot follow the link.
none on /dev/shm/u type aufs (rw,xino=/dev/shm/rw/.aufs.xino,br:/dev/shm/rw=rw:/dev/shm/ro=ro)
$ ls -li ../r?/f_src_linked* ./f_src_linked* ./copied
ls: ./copied: No such file or directory
15 -rw-r--r-- 2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked
15 -rw-r--r-- 2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked2
22 -rw-r--r-- 2 jro jro 2 Dec 22 11:03 ./f_src_linked
22 -rw-r--r-- 2 jro jro 2 Dec 22 11:03 ./f_src_linked2
$ echo abc >> f_src_linked
$ cp f_src_linked copied
$ ls -li ../r?/f_src_linked* ./f_src_linked* ./copied
15 -rw-r--r-- 2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked
15 -rw-r--r-- 2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked2
36 -rw-r--r-- 2 jro jro 6 Dec 22 11:03 ../rw/f_src_linked
53 -rw-r--r-- 1 jro jro 6 Dec 22 11:03 ./copied
22 -rw-r--r-- 2 jro jro 6 Dec 22 11:03 ./f_src_linked
22 -rw-r--r-- 2 jro jro 6 Dec 22 11:03 ./f_src_linked2
$ cmp copied f_src_linked2
$
none on /dev/shm/u type aufs (rw,xino=/dev/shm/rw/.aufs.xino,noplink,br:/dev/shm/rw=rw:/dev/shm/ro=ro)
$ ls -li ../r?/f_src_linked* ./f_src_linked* ./copied
ls: ./copied: No such file or directory
17 -rw-r--r-- 2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked
17 -rw-r--r-- 2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked2
23 -rw-r--r-- 2 jro jro 2 Dec 22 11:03 ./f_src_linked
23 -rw-r--r-- 2 jro jro 2 Dec 22 11:03 ./f_src_linked2
$ echo abc >> f_src_linked
$ cp f_src_linked copied
$ ls -li ../r?/f_src_linked* ./f_src_linked* ./copied
17 -rw-r--r-- 2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked
17 -rw-r--r-- 2 jro jro 2 Dec 22 11:03 ../ro/f_src_linked2
36 -rw-r--r-- 1 jro jro 6 Dec 22 11:03 ../rw/f_src_linked
53 -rw-r--r-- 1 jro jro 6 Dec 22 11:03 ./copied
23 -rw-r--r-- 2 jro jro 6 Dec 22 11:03 ./f_src_linked
23 -rw-r--r-- 2 jro jro 6 Dec 22 11:03 ./f_src_linked2
$ cmp copied f_src_linked2
cmp: EOF on f_src_linked2
$
If you add a branch which has fileA or fileB, aufs does not follow the pseudo
link. The file on the added branch has no relation to the same named file(s)
on the lower branch(es). If you use noxino mount option, pseudo link will not
work after the kernel shrinks the inode cache.
This feature will not work for squashfs before version 3.2 since its inode is
tricky. When the inode is hardlinked, squashfs inodes has the same inode
number and correct link count, but the inode memory object is different.
Squashfs inodes (before v3.2) are generated for each, even they are
hardlinked.
User's Direct Branch Access (UDBA)¶
UDBA means a modification to a branch filesystem manually or directly, e.g.
bypassing aufs. While aufs is designed and implemented to be safe after UDBA,
it can make yourself and your aufs confused. And some information like aufs
inode will be incorrect. For example, if you rename a file on a branch
directly, the file on aufs may or may not be accessible through both of old
and new name. Because aufs caches various information about the files on
branches. And the cache still remains after UDBA.
Aufs has a mount option named ‘udba’ which specifies the test
level at access time whether UDBA was happened or not.
- udba=none
- Aufs trusts the dentry and the inode cache on the system, and never test
about UDBA. With this option, aufs runs fastest, but it may show you
incorrect data. Additionally, if you often modify a branch directly, aufs
will not be able to trace the changes of inodes on the branch. It can be a
cause of wrong behaviour, deadlock or anything else.
It is recommended to use this option only when you are sure that nobody
access a file on a branch. It might be difficult for you to achieve real
‘no UDBA’ world when you cannot stop your users doing
‘find / -ls’ or something. If you really want to forbid all
of your users to UDBA, here is a trick for it. With this trick, users
cannot see the branches directly and aufs runs with no problem, except
‘auplink’ utility. But if you are not familiar with aufs,
this trick may make yourself confused.
# d=/tmp/.aufs.hide
# mkdir $d
# for i in $branches_you_want_to_hide
> do
> mount -n --bind $d $i
> done
When you unmount the aufs, delete/modify the branch by remount, or you want
to show the hidden branches again, unmount the bound /tmp/.aufs.hide.
# umount -n $branches_you_want_to_unbound
If you use FUSE filesystem as an aufs branch which supports hardlink, you
should not set this option, since FUSE makes inode objects for each
hardlinks (at least in linux-2.6.23). When your FUSE filesystem maintains
them at link/unlinking, it is equivalent to ‘direct branch
access’ for aufs.
- udba=reval
- Aufs tests only the existence of the file which existed. If the existed
file was removed on the branch directly, aufs discard the cache about the
file and re-lookup it. So the data will be updated. This test is at
minimum level to keep the performance and ensure the existence of a file.
This is default and aufs runs still fast.
This rule leads to some unexpected situation, but I hope it is harmless.
Those are totally depends upon cache. Here are just a few examples.
- \(bu
- If the file is cached as negative or not-existed, aufs does not test it.
And the file is still handled as negative after a user created the file on
a branch directly. If the file is not cached, aufs will lookup normally
and find the file.
- \(bu
- When the file is cached as positive or existed, and a user created the
same named file directly on the upper branch. Aufs detects the cached
inode of the file is still existing and will show you the old (cached)
file which is on the lower branch.
- \(bu
- When the file is cached as positive or existed, and a user renamed the
file by rename(2) directly. Aufs detects the inode of the file is still
existing. You may or may not see both of the old and new files. Todo: If
aufs also tests the name, we can detect this case.
If your outer modification (UDBA) is rare and you can ignore the temporary and
minor differences between virtual aufs world and real branch filesystem, then
try this mount option.
- udba=notify
- Aufs sets either ‘fsnotify’ or ‘inotify’ to
all the accessed directories on its branches and receives the event about
the dir and its children. It consumes resources, cpu and memory. And I am
afraid that the performance will be hurt, but it is most strict test
level. There are some limitations of linux inotify, see also Inotify
Limitation. So it is recommended to leave udba default option usually, and
set it to notify by remount when you need it.
When a user accesses the file which was notified UDBA before, the cached
data about the file will be discarded and aufs re-lookup it. So the data
will be updated. When an error condition occurs between UDBA and aufs
operation, aufs will return an error, including EIO. To use this option,
you need to enable CONFIG_INOTIFY and CONFIG_AUFS_HINOTIFY. In
linux-2.6.31, CONFIG_FSNOTIFY was introduced and CONFIG_INOTIFY was listed
in Documentation/feature-removal-schedule.txt. In aufs2-31 and later
(until CONFIG_INOTIFY is removed actually), you can choose either
‘fsnotify’ or ‘inotify’ in configuration.
Whichever you choose, specify ‘udba=notify’, and aufs
interprets it as an abstract name.
To rename/rmdir a directory on a branch directory may reveal the same named
directory on the lower branch. Aufs tries re-lookuping the renamed
directory and the revealed directory and assigning different inode number
to them. But the inode number including their children can be a problem.
The inode numbers will be changed silently, and aufs may produce a
warning. If you rename a directory repeatedly and reveal/hide the lower
directory, then aufs may confuse their inode numbers too. It depends upon
the system cache.
When you make a directory in aufs and mount other filesystem on it, the
directory in aufs cannot be removed expectedly because it is a mount
point. But the same named directory on the writable branch can be removed,
if someone wants. It is just an empty directory, instead of a mount point.
Aufs cannot stop such direct rmdir, but produces a warning about it.
If the pseudo-linked file is hardlinked or unlinked on the branch directly,
its inode link count in aufs may be incorrect. It is recommended to flush
the pseudo-links by auplink script.
Linux Inotify Limitation¶
Unfortunately, current inotify (linux-2.6.18) has some limitations, and aufs
must derive it.
IN_DELETE, removing file on NFS¶
When a file on a NFS branch is deleted directly, inotify may or may not fire
IN_DELETE event. It depends upon the status of dentry (DCACHE_NFSFS_RENAMED
flag). In this case, the file on aufs seems still exists. Aufs and any user
can see the file.
IN_IGNORED, deleted rename target¶
When a file/dir on a branch is unlinked by
rename(2) directly, inotify fires
IN_IGNORED which means the inode is deleted. Actually, in some cases, the
inode survives. For example, the rename target is linked or opened. In this
case, inotify watch set by aufs is removed by VFS and inotify. And aufs cannot
receive the events anymore. So aufs may show you incorrect data about the
file/dir.
Virtual or Vertical Directory Block (VDIR)¶
In order to provide the merged view of file listing, aufs builds internal
directory block on memory. For readdir, aufs performs readdir() internally for
each dir on branches, merges their entries with eliminating the whiteout-ed
ones, and sets it to the opened file (dir) object. So the file object has its
entry list until it is closed. The entry list will be updated when the file
position is zero (by
rewinddir(3)) and becomes obsoleted.
The merged result is cached in the corresponding inode object and maintained by
a customizable life-time option. Note: the mount option
‘rdcache=<sec>’ is still under considering and its
description is hidden from this manual.
Some people may call it can be a security hole or invite DoS attack since the
opened and once readdir-ed dir (file object) holds its entry list and becomes
a pressure for system memory. But I would say it is similar to files under
/proc or /sys. The virtual files in them also holds a memory page (generally)
while they are opened. When an idea to reduce memory for them is introduced,
it will be applied to aufs too.
The dynamically allocated memory block for the name of entries has a unit of 512
bytes by default. During building dir blocks, aufs creates hash list (hashed
and divided by 32 by default) and judging whether the entry is whiteouted by
its upper branch or already listed.
These values are suitable for normal environments. But you may have millions of
files or very long filenames under a single directory. For such cases, you may
need to customize these values by specifying rdblk= and rdhash= aufs mount
options.
For instance, there are 97 files under my /bin, and the total name length is 597
bytes.
$ \ls -1 /bin | wc
97 97 597
Strictly speaking, 97 end-of-line codes are included. But it is OK since aufs
VDIR also stores the name length in 1 byte. In this case, you do not need to
customize the default values. 597 bytes filenames will be stored in 2 VDIR
memory blocks (597 < 512 x 2). And 97 filenames are distributed among 32
lists, so one list will point 4 names in average. To judge the names is
whiteouted or not, the number of comparison will be 4. 2 memory allocations
and 4 comparison costs low (even if the directory is opened for a long time).
So you do not need to customize.
If your directory has millions of files, the you will need to specify rdblk= and
rdhash=.
$ ls -U /mnt/rotating-rust | wc -l
1382438
In this case, assuming the average length of filenames is 6, in order to get
better time performance I would recommend to set $((128*1024)) or $((64*1024))
for rdblk, and $((8*1024)) or $((4*1024)) for rdhash. You can change these
values of the active aufs mount by "mount -o remount".
This customization is not for reducing the memory space, but for reducing time
for the number of memory allocation and the name comparison. The larger value
is faster, in general. Of course, you will need system memory. This is a
generic "time-vs-space" problem.
Using libau.so¶
There is a dynamic shared object library called libau.so in aufs-util or
aufs2-util GIT tree. This library provides several useful functions which wrap
the standard library functions such as,
- \(bu
- readdir, readdir_r, closedir
- \(bu
- pathconf, fpathconf
To use libau.so,
- \(bu
- install by "make install_ulib" under aufs-util (or aufs2-util)
GIT tree
- \(bu
- set the environment variable "LD_PRELOAD=libau.so", or configure
/etc/ld.so.preload
- \(bu
- set the environment variable "LIBAU=all"
- \(bu
- and run your application.
If you use
pathconf(3)/
fpathconf(3) with _PC_LINK_MAX for aufs, you need to use
libau.so.
VDIR/readdir(3) in user-space (RDU)¶
If you have a directory which has millions of files, aufs VDIR consumes much
memory. You may meet "out of memory" message due to the memory
fragmentation or real starvation. In this case, RDU (readdir(3) in user-space)
may help you. Because the kernel memory space cannot be swappable and
consuming much can be pure memory pressure, while it is not true in
user-space.
If you enable CONFIG_AUFS_RDU at compiling aufs, install libau.so, and set some
environment variables, then you can use RDU. Just simply run your application.
The dynamic link library libau.so implements another readdir routine, and all
readdir(3) calls in your application will be handled by libau.so.
When you call
readdir(3), the dynamic linker calls readdir in libau.so. If it
finds the passed dir is NOT aufs, it calls the usual
readdir(3). It the dir is
aufs, then libau.so gets all filenames under the dir by aufs specific
ioctl(2)s, instead of regular
readdir(3), and merges them by itself. In other
words, libau.so moves the memory consumption in kernel-space to user-space.
While it is good to stop consuming much memory in kernel-space, sometimes the
speed performance may be damaged a little as a side effect. It is just a
little, I hope. At the same time, I won't be surprised if
readdir(3) runs
faster.
It is recommended to specify rdblk=0 when you use this library.
If your directory is not so huge and you don't meet the out of memory situation,
probably you don't need this library. The original VDIR in kernel-space is
still alive, and you can live without libau.so.
pathconf(_PC_LINK_MAX)¶
Since some implementation of
pathconf(3) (and
fpathconf(3)) for _PC_LINK_MAX
decides the target filesystem type and returns the pre-defined constant value,
when aufs is unknown to the library, it will return the default value (127).
Actually the maximum number of the link count in aufs inherits the topmost
writable branch filesystem's. But the standard
pathconf(3) will not return the
correct value.
To support such case, libau.so provides a wrapper for
pathconf(3) (and
fpathconf(3)). When the parameter is _PC_LINK_MAX, the wrapper checks whether
the given parameter refers aufs or not. If it is aufs, then it will get the
maximum link count from the topmost writable branch internally. Otherwise, it
behaves as normal
pathconf(3) transparently.
Note¶
Since this is a dynamically linked library, it is unavailable if your
application is statically linked. And
ld.so(8) ignores LD_PRELOAD when the
application is setuid/setgid-ed unless the library is not setuid/setgid-ed. It
is a generic rule of dynamically linked library. Additionally the functions in
libau.so are unavailable in these cases too.
Copy On Write, or aufs internal copyup and copydown¶
Every stackable filesystem which implements copy-on-write supports the copyup
feature. The feature is to copy a file/dir from the lower branch to the upper
internally. When you have one readonly branch and one upper writable branch,
and you append a string to a file which exists on the readonly branch, then
aufs will copy the file from the readonly branch to the writable branch with
its directory hierarchy. It means one
write(2) involves several
logical/internal
mkdir(2),
creat(2),
read(2),
write(2) and
close(2)
systemcalls before the actual expected
write(2) is performed. Sometimes it may
take a long time, particularly when the file is very large. If
CONFIG_AUFS_DEBUG is enabled, aufs produces a message saying `copying a large
file.'
You may see the message when you change the xino file path or truncate the
xino/xib files. Sometimes those files can be large and may take a long time to
handle them.
Policies to Select One among Multiple Writable Branches¶
Aufs has some policies to select one among multiple writable branches when you
are going to write/modify something. There are two kinds of policies, one is
for newly create something and the other is for internal copy-up. You can
select them by specifying mount option ‘create=CREATE_POLICY’ or
‘cpup=COPYUP_POLICY.’ These policies have no meaning when you
have only one writable branch. If there is some meaning, it must hurt the
performance.
Exceptions for Policies¶
In every cases below, even if the policy says that the branch where a new file
should be created is /rw2, the file will be created on /rw1.
- \(bu
- If there is a readonly branch with ‘wh’ attribute above the
policy-selected branch and the parent dir is marked as opaque, or the
target (creating) file is whiteouted on the ro+wh branch, then the policy
will be ignored and the target file will be created on the nearest upper
writable branch than the ro+wh branch.
/aufs = /rw1 + /ro+wh/diropq + /rw2
/aufs = /rw1 + /ro+wh/wh.tgt + /rw2
- \(bu
- If there is a writable branch above the policy-selected branch and the
parent dir is marked as opaque or the target file is whiteouted on the
branch, then the policy will be ignored and the target file will be
created on the highest one among the upper writable branches who has
diropq or whiteout. In case of whiteout, aufs removes it as usual.
/aufs = /rw1/diropq + /rw2
/aufs = /rw1/wh.tgt + /rw2
- \(bu
- link(2) and rename(2) systemcalls are exceptions in every policy. They try
selecting the branch where the source exists as possible since copyup a
large file will take long time. If it can't be, ie. the branch where the
source exists is readonly, then they will follow the copyup policy.
- \(bu
- There is an exception for rename(2) when the target exists. If the rename
target exists, aufs compares the index of the branches where the source
and the target are existing and selects the higher one. If the selected
branch is readonly, then aufs follows the copyup policy.
Policies for Creating¶
- create=tdp | top-down-parent
- Selects the highest writable branch where the parent dir exists. If the
parent dir does not exist on a writable branch, then the internal copyup
will happen. The policy for this copyup is always
‘bottom-up.’ This is the default policy.
- create=rr | round-robin
- Selects a writable branch in round robin. When you have two writable
branches and creates 10 new files, 5 files will be created for each
branch. mkdir(2) systemcall is an exception. When you create 10 new
directories, all are created on the same branch.
- create=mfs[:second] | most-free-space[:second]
- Selects a writable branch which has most free space. In order to keep the
performance, you can specify the duration (‘second’) which
makes aufs hold the index of last selected writable branch until the
specified seconds expires. The seconds is upto 3600 seconds. The first
time you create something in aufs after the specified seconds expired,
aufs checks the amount of free space of all writable branches by internal
statfs call and the held branch index will be updated. The default value
is 30 seconds.
- create=mfsrr:low[:second]
- Selects a writable branch in most-free-space mode first, and then
round-robin mode. If the selected branch has less free space than the
specified value ‘low’ in bytes, then aufs re-tries in
round-robin mode. Try an arithmetic expansion of shell which is defined by
POSIX. For example, $((10 * 1024 * 1024)) for 10M. You can also specify
the duration (‘second’) which is equivalent to the
‘mfs’ mode.
- create=pmfs[:second]
- Selects a writable branch where the parent dir exists, such as tdp mode.
When the parent dir exists on multiple writable branches, aufs selects the
one which has most free space, such as mfs mode.
Policies for Copy-Up¶
- cpup=tdp | top-down-parent
- Equivalent to the same named policy for create. This is the default
policy.
- cpup=bup | bottom-up-parent
- Selects the writable branch where the parent dir exists and the branch is
nearest upper one from the copyup-source.
- cpup=bu | bottom-up
- Selects the nearest upper writable branch from the copyup-source,
regardless the existence of the parent dir.
Exporting Aufs via NFS¶
Aufs is supporting NFS-exporting. Since aufs has no actual block device, you
need to add NFS ‘fsid’ option at exporting. Refer to the manual
of NFS about the detail of this option.
There are some limitations or requirements.
- \(bu
- The branch filesystem must support NFS-exporting.
- \(bu
- NFSv2 is not supported. When you mount the exported aufs from your NFS
client, you will need to some NFS options like v3 or nfsvers=3, especially
if it is nfsroot.
- \(bu
- If the size of the NFS file handle on your branch filesystem is large,
aufs will not be able to handle it. The maximum size of NFSv3 file handle
for a filesystem is 64 bytes. Aufs uses 24 bytes for 32bit system, plus 12
bytes for 64bit system. The rest is a room for a file handle of a branch
filesystem.
- \(bu
- The External Inode Number Bitmap, Translation Table and Generation Table
(xino) is required since NFS file handle is based upon inode number. The
mount option ‘xino’ is enabled by default. The external
inode generation table and its debugfs entry
(<debugfs>/aufs/si_*/xigen) is created when CONFIG_AUFS_EXPORT is
enabled even if you don't export aufs actually. The size of the external
inode generation table grows only, never be truncated. You might need to
pay attention to the free space of the filesystem where xino files are
placed. By default, it is the first writable branch.
- \(bu
- The branch filesystems must be accessible, which means ‘not
hidden.’ It means you need to ‘mount --move’ when you
use initramfs and switch_root(8), or chroot(8).
- \(bu
- Since aufs has several filename prefixes reserved, the maxmum filename
length is shorter than ordinary 255. Actually 242 (defined as
${AUFS_MAX_NAMELEN}). This value should be specified as
‘namlen=’ when you mount NFS. The name of the branch top
directory has another limit. When you set the module parameter
‘brs’ to 1 (default), then you can see the branch pathname
via /sys/fs/aufs/si_XXX/brNNN. Here it is printed with its branch
attributes such as ‘=rw’ or ‘=ro+wh’. Since
all the sysfs entries have the size limit of 4096 bytes, the length of the
branch path becomes shorter than 4096. Actually you can specify any branch
with much longer names, but you will meet some troubles when you remount
later because remounting runs the aufs mount helper internally and it
tries reading /sys/fs/aufs/si_XXX/brNNN.
Direct I/O¶
The Direct I/O (including Linux AIO) is a filesystem (and its backend block
device) specific feature. And there is a minor problem around the aufs
internal copyup. If you have two branches, lower RO ext2 and upper RW tmpfs.
As you know ext2 supports Direct I/O, but tmpfs doesn't. When a
‘fileA’ exists in the lower ext2, and you write something into
after opening it with O_DIRECT, then aufs behaves like this if the mount
option ‘dio’ is specified.
- \(bu
- The application issues open(O_DIRECT);
Aufs opens the file in the lower ext2 and succeeds.
- \(bu
- The application issues write("something");
Aufs copies-up the file from the lower ext2 to the upper tmpfs, and re-opens
the file in tmpfs with O_DIRECT. It fails and returns an error.
This behaviour may be a problem since application expects the error should be
returned from the first
open(2) instead of the later
write(2), when the
filesystem doesn't support Direct I/O. (But, in real world, I don't think
there is an application which doesn't check the error from
write(2). So it
won't be a big problem actually).
If the file exists in the upper tmpfs, the first
open(2) will fail expectedly.
So there is no problem in this case. But the problem may happen when the
internal copyup happens and the behaviour of the branch differs from each
other. As long as the feature depends upon the filesystem, this problem will
not be solved. So aufs sets `nodio` by default, which means all Direct I/O are
disabled, and oepn(2) with O_DIRECT always fails. If you want to use Direct
I/O AND all your writeble branches support it, then specify
‘dio’ option to make it in effect.
Possible problem of the inode number in TMPFS¶
Although it is rare to happen, TMPFS has a problem about its inode number
management. Actually TMPFS does not maintain the inode number at all. Linux
kernel has a global 32bit number for general use of inode number, and TMPFS
uses it while most of (real) filesystem maintains its inode number by itself.
The global number can wrap around regardless the inode number is still in use.
This MAY cause a problem.
For instance, when /your/tmpfs/fileA has 10 as its inode number, the same value
(10) may be assigned to a newly created file /your/tmpfs/fileB. Some
applications do not care the duplicated inode numbers, but others, including
AUFS, will be really confused by this situation.
If your writable branch FS is TMPFS and the inode number wraps around, aufs will
not work correctly. It is recommended to use one of FS on HDD, ramdisk+ext2 or
tmpfs+FSimage+loopback mount, as your writable branch FS.
Dentry and Inode Caches¶
If you want to clear caches on your system, there are several tricks for that.
If your system ram is low, try ‘find /large/dir -ls >
/dev/null’. It will read many inodes and dentries and cache them. Then
old caches will be discarded. But when you have large ram or you do not have
such large directory, it is not effective.
If you want to discard cache within a certain filesystem, try ‘mount -o
remount /your/mntpnt’. Some filesystem may return an error of EINVAL or
something, but VFS discards the unused dentry/inode caches on the specified
filesystem.
Compatible/Incompatible with Unionfs Version 1.x Series¶
Ignoring ‘delete’ option, and to keep filesystem consistency, aufs
tries writing something to only one branch in a single systemcall. It means
aufs may copyup even if the copyup-src branch is specified as writable. For
example, you have two writable branches and a large regular file on the lower
writable branch. When you issue
rename(2) to the file on aufs, aufs may copyup
it to the upper writable branch. If this behaviour is not what you want, then
you should
rename(2) it on the lower branch directly.
And there is a simple shell script ‘unionctl’ under sample
subdirectory, which is compatible with unionctl(8) in Unionfs Version 1.x
series, except --query action. This script executes
mount(8) with
‘remount’ option and uses add/del/mod aufs mount options. If you
are familiar with Unionfs Version 1.x series and want to use unionctl(8), you
can try this script instead of using mount -o remount,... directly. Aufs does
not support
ioctl(2) interface. This script is highly depending upon
mount(8)
in util-linux-2.12p package, and you need to mount /proc to use this script.
If your
mount(8) version differs, you can try modifying this script. It is
very easy. The unionctl script is just for a sample usage of aufs remount
interface.
Aufs uses the external inode number bitmap and translation table by default.
The default branch permission for the first branch is ‘rw’, and
the rest is ‘ro.’
The whiteout is for hiding files on lower branches. Also it is applied to stop
readdir going lower branches. The latter case is called ‘opaque
directory.’ Any whiteout is an empty file, it means whiteout is just an
mark. In the case of hiding lower files, the name of whiteout is
‘.wh.<filename>.’ And in the case of stopping readdir, the
name is ‘.wh..wh..opq’. All whiteouts are hardlinked, including
‘<writable branch top dir>/.wh..wh.aufs.’
The hardlink on an ordinary (disk based) filesystem does not consume inode
resource newly. But in linux tmpfs, the number of free inodes will be
decremented by
link(2). It is recommended to specify nr_inodes option to your
tmpfs if you meet ENOSPC. Use this option after checking by ‘df
-i.’
When you rmdir or rename-to the dir who has a number of whiteouts, aufs rename
the dir to the temporary whiteouted-name like
‘.wh..wh.<dir>.<4-digits hex>.’ Then remove it after
actual operation. cf. mount option ‘dirwh.’
Incompatible with an Ordinary Filesystem¶
stat(2) returns the inode info from the first existence inode among the
branches, except the directory link count. Aufs computes the directory link
count larger than the exact value usually, in order to keep UNIX filesystem
semantics, or in order to shut
find(1) mouth up. The size of a directory may
be wrong too, but it has to do no harm. The timestamp of a directory will not
be updated when a file is created or removed under it, and it was done on a
lower branch.
The test for permission bits has two cases. One is for a directory, and the
other is for a non-directory. In the case of a directory, aufs checks the
permission bits of all existing directories. It means you need the correct
privilege for the directories including the lower branches. The test for a
non-directory is more simple. It checks only the topmost inode.
statfs(2) returns the information of the first branch info except namelen when
‘nosum’ is specified (the default). The namelen is decreased by
the whiteout prefix length. Although the whiteout prefix is essentially
‘.wh.’, to support
rmdir(2) and
rename(2) (when the target
directory already existed), the namelen is decreased more since the name will
be renamed to ‘.wh..wh.<dir>.<4-digits hex>’ as
previously described. And the block size may differ from st_blksize which is
obtained by
stat(2).
The whiteout prefix (.wh.) is reserved on all branches. Users should not handle
the filename begins with this prefix. In order to future whiteout, the maximum
filename length is limited by the longest value - 4 * 2 - 1 - 4 = 242. It
means you cannot handle such long name in aufs, even if it surely exists on
the underlying branch fs. The
readdir(3)/
getdents(2) call show you such name,
but the d_type is set to DT_UNKNOWN. It may be a violation of POSIX.
Remember,
seekdir(3) and
telldir(3) are not defined in POSIX. They may not work
as you expect. Try
rewinddir(3) or re-open the dir.
If you dislike the difference between the aufs entries in /etc/mtab and
/proc/mounts, and if you are using
mount(8) in util-linux package, then try
./mount.aufs utility. Copy the script to /sbin/mount.aufs. This simple utility
tries updating /etc/mtab. If you do not care about /etc/mtab, you can ignore
this utility. Remember this utility is highly depending upon
mount(8) in
util-linux-2.12p package, and you need to mount /proc.
Since aufs uses its own inode and dentry, your system may cache huge number of
inodes and dentries. It can be as twice as all of the files in your union. It
means that unmounting or remounting readonly at shutdown time may take a long
time, since
mount(2) in VFS tries freeing all of the cache on the target
filesystem.
When you open a directory, aufs will open several directories internally. It
means you may reach the limit of the number of file descriptor. And when the
lower directory cannot be opened, aufs will close all the opened upper
directories and return an error.
The sub-mount under the branch of local filesystem is ignored. For example, if
you have mount another filesystem on /branch/another/mntpnt, the files under
‘mntpnt’ will be ignored by aufs. It is recommended to mount the
sub-mount under the mounted aufs. For example,
# sudo mount /dev/sdaXX /ro_branch
# d=another/mntpnt
# sudo mount /dev/sdbXX /ro_branch/$d
# mkdir -p /rw_branch/$d
# sudo mount -t aufs -o br:/rw_branch:/ro_branch none /aufs
# sudo mount -t aufs -o br:/rw_branch/${d}:/ro_branch/${d} none /aufs/another/$d
There are several characters which are not allowed to use in a branch directory
path and xino filename. See detail in Branch Syntax and Mount Option.
The file-lock which means
fcntl(2) with F_SETLK, F_SETLKW or F_GETLK,
flock(2)
and
lockf(3), is applied to virtual aufs file only, not to the file on a
branch. It means you can break the lock by accessing a branch directly. TODO:
check ‘security’ to hook locks, as inotify does.
The I/O to the named pipe or local socket are not handled by aufs, even if it
exists in aufs. After the reader and the writer established their connection
if the pipe/socket are copied-up, they keep using the old one instead of the
copied-up one.
The
fsync(2) and
fdatasync(2) systemcalls return 0 which means success, even if
the given file descriptor is not opened for writing. I am afraid this
behaviour may violate some standards. Checking the behaviour of
fsync(2) on
ext2, aufs decided to return success.
If you want to use disk-quota, you should set it up to your writable branch
since aufs does not have its own block device.
When your aufs is the root directory of your system, and your system tells you
some of the filesystem were not unmounted cleanly, try these procedure when
you shutdown your system.
# mount -no remount,ro /
# for i in $writable_branches
# do mount -no remount,ro $i
# done
If your xino file is on a hard drive, you also need to specify
‘noxino’ option or ‘xino=/your/tmpfs/xino’ at
remounting root directory.
To
rename(2) directory may return EXDEV even if both of src and tgt are on the
same aufs. When the rename-src dir exists on multiple branches and the lower
dir has child(ren), aufs has to copyup all his children. It can be recursive
copyup. Current aufs does not support such huge copyup operation at one time
in kernel space, instead produces a warning and returns EXDEV. Generally,
mv(1) detects this error and tries
mkdir(2) and
rename(2) or copy/unlink
recursively. So the result is harmless. If your application which issues
rename(2) for a directory does not support EXDEV, it will not work on aufs.
Also this specification is applied to the case when the src directory exists
on the lower readonly branch and it has child(ren).
If a sudden accident such like a power failure happens during aufs is
performing, and regular fsck for branch filesystems is completed after the
disaster, you need to extra fsck for aufs writable branches. It is necessary
to check whether the whiteout remains incorrectly or not, eg. the real
filename and the whiteout for it under the same parent directory. If such
whiteout remains, aufs cannot handle the file correctly. To check the
consistency from the aufs' point of view, you can use a simple shell script
called /sbin/auchk. Its purpose is a fsck tool for aufs, and it checks the
illegal whiteout, the remained pseudo-links and the remained aufs-temp files.
If they are found, the utility reports you and asks whether to delete or not.
It is recommended to execute /sbin/auchk for every writable branch filesystem
before mounting aufs if the system experienced crash.
EXAMPLES¶
The mount options are interpreted from left to right at remount-time. These
examples shows how the options are handled. (assuming /sbin/mount.aufs was
installed)
# mount -v -t aufs -o br:/day0:/base none /u
none on /u type aufs (rw,xino=/day0/.aufs.xino,br:/day0=rw:/base=ro)
# mount -v -o remount,\
prepend:/day1,\
xino=/day1/xino,\
mod:/day0=ro,\
del:/day0 \
/u
none on /u type aufs (rw,xino=/day1/xino,br:/day1=rw:/base=ro)
# mount -t aufs -o br:/rw none /u
# mount -o remount,append:/ro /u
different uid/gid/permission, /ro
# mount -o remount,del:/ro /u
# mount -o remount,nowarn_perm,append:/ro /u
#
(there is no warning)
When you use aufs as root filesystem, it is recommended to consider to exclude
some directories. For example, /tmp and /var/log are not need to stack in many
cases. They do not usually need to copyup or to whiteout. Also the swapfile on
aufs (a regular file, not a block device) is not supported. In order to
exclude the specific dir from aufs, try bind mounting.
And there is a good sample which is for network booted diskless machines. See
sample/ in detail.
DIAGNOSTICS¶
When you add a branch to your union, aufs may warn you about the privilege or
security of the branch, which is the permission bits, owner and group of the
top directory of the branch. For example, when your upper writable branch has
a world writable top directory, a malicious user can create any files on the
writable branch directly, like copyup and modify manually. I am afraid it can
be a security issue.
When you mount or remount your union without -o ro common mount option and
without writable branch, aufs will warn you that the first branch should be
writable.
When you set udba other than notify and change something on your branch
filesystem directly, later aufs may detect some mismatches to its cache. If it
is a critical mismatch, aufs returns EIO.
When an error occurs in aufs, aufs prints the kernel message with
‘errno.’ The priority of the message (log level) is ERR or
WARNING which depends upon the message itself. You can convert the
‘errno’ into the error message by
perror(3),
strerror(3) or
something. For example, the ‘errno’ in the message ‘I/O
Error, write failed (-28)’ is 28 which means ENOSPC or ‘No space
left on device.’
When CONFIG_AUFS_BR_RAMFS is enabled, you can specify ramfs as an aufs branch.
Since ramfs is simple, it does not set the maximum link count originally. In
aufs, it is very dangerous, particularly for whiteouts. Finally aufs sets the
maximum link count for ramfs. The value is 32000 which is borrowed from ext2.
After you prepend a branch which already has some entires, aufs may report an
I/O Error with "brabra should be negative" or something. For
instance, you are going to
open(2) a regular file in aufs and
write(2)
something to it. If you prepend a branch between
open(2) and
write(2), and the
added branch already has a same named entry other than a regular file, then
you get a conflict.
- \(bu
- a regular file FOO exists in aufs.
- \(bu
- open the file FOO.
- \(bu
- add a branch which has FOO but it is a directory, and change the permssion
of the old branch to RO.
- \(bu
- write to the file FOO.
- \(bu
- aufs tries copying-up FOO to the upper writable branch which was recently
added.
- \(bu
- aufs finds a directory FOO on the upper branch, and returns an error.
In this situation, aufs keeps returning an error during FOO is cached in memory
because it remembers that FOO is a regular file instead of a directory. When
the system discards the cache about FOO, then you will see the directory FOO.
In other words, you will not be able to see the directory FOO on the newly
added branch during the file FOO on the lower branch is in use. This situation
may invite more complicated issue. If you
unlink(2) the opened file FOO, then
aufs will create a whiteout on the upper writable branch. And you get another
conflict which is coexisting a whiteout and a real entry on the same branch.
In this case, aufs also keeps returning an error when you try using FOO.
COPYRIGHT¶
Copyright © 2005-2011 Junjiro R. Okajima
AUTHOR¶
Junjiro R. Okajima