While working on the cluster, you may need to perform operations such as copy, delete, sync or find on a very large number of files, or on files that are very large. We recommend first checking whether you can reduce the number of files you need to handle, e.g. by packing them: both the Lustre file system that drives BIGWORK and the controllers of the disk drives have limits on how many files they can handle at once, so overall performance may deteriorate significantly if you demand more IOPS than they can deliver. We also realize that reducing the file count is frequently a long-term project. For this reason, we provide a set of parallel tools from the MPI-based package mpiFileUtils. Standard Unix commands like find are often comparatively slow, as they are implemented as single-process applications.
As a typical example, consider copying directories containing a large number of files from your $BIGWORK to the local $TMPDIR storage on the compute nodes you allocated, for further processing by your job, and/or transferring computation results back to your $BIGWORK.
Another example is quickly freeing up space in your $BIGWORK by first copying files to your $PROJECT storage and then deleting them from $BIGWORK.
You could also use the dsync command if you use your $PROJECT storage to back up directories on your $BIGWORK.
Below we will look at some of the mpiFileUtils tools in these and other practical examples.
In order to speed up the recursive transfer of the contents of the directory $BIGWORK/source to $TMPDIR/dest on a compute node, put the lines below in your job script or enter them at the command prompt of your interactive job. We assume that you have requested 8 cores for your batch job:
module load GCC/8.3.0 OpenMPI/3.1.4 mpifileutils/0.11
mpirun -np 8 dcp $BIGWORK/source/ $TMPDIR/dest
Please note the trailing slash (/) on the source path, which means “copy the contents of the source directory”. If the $TMPDIR/dest directory does not exist before copying, it will be created. The copy is performed by dcp (distributed copy) in parallel with 8 MPI processes.
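The stage-in and stage-out pattern described earlier could be sketched in a job script roughly as follows. This is only an illustration under stated assumptions: the helper names run_cmd and parallel_copy, the DRY_RUN switch and the $TMPDIR/results directory are hypothetical and not part of mpiFileUtils or of the cluster setup. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of a stage-in / compute / stage-out job body.
# Hypothetical names: run_cmd, parallel_copy, DRY_RUN, $TMPDIR/results.

NP=8                     # should match the number of cores requested for the job
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, execute it otherwise.
run_cmd() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy the contents of directory $1 to $2 in parallel with dcp.
# Exactly one trailing slash is put on the source path so that the
# contents of the directory are copied.
parallel_copy() {
    run_cmd mpirun -np "$NP" dcp "${1%/}/" "$2"
}

if [ "$DRY_RUN" != "1" ]; then
    module load GCC/8.3.0 OpenMPI/3.1.4 mpifileutils/0.11
fi

# Stage in: BIGWORK -> node-local TMPDIR
parallel_copy "$BIGWORK/source" "$TMPDIR/dest"

# ... run the actual computation on the data in $TMPDIR/dest here ...

# Stage out: node-local results (hypothetical directory) -> BIGWORK
parallel_copy "$TMPDIR/results" "$BIGWORK/results"
```

Setting DRY_RUN=0 in the job environment would execute the commands instead of printing them.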
The directory $BIGWORK/dir and its contents can be removed quickly using the drm (distributed remove) command:
mpirun -np 8 drm $BIGWORK/dir
Note: here and below we assume that the module mpifileutils/0.11 is already loaded.
drm supports the option --match, which allows you to delete files selectively. See man drm for more information.
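Because drm deletes recursively and in parallel, an unset or mistyped variable can be costly. The sketch below wraps a selective delete in a simple guard that refuses to run unless the target lies strictly below $BIGWORK. The is_below helper and the regular expression '\.tmp$' are hypothetical examples; the check is purely textual and does not resolve symbolic links or ".." components. How --match applies the expression should be confirmed in man drm.

```shell
#!/bin/sh
# Return 0 only if path $1 lies strictly below the non-empty root $2.
# Hypothetical helper, not part of mpiFileUtils.
is_below() {
    [ -n "$2" ] || return 1
    case "$1" in
        "$2"/?*) return 0 ;;
        *)       return 1 ;;
    esac
}

target="$BIGWORK/dir"

if is_below "$target" "${BIGWORK:-}"; then
    # Delete only entries matching the (hypothetical) expression.
    mpirun -np 8 drm --match '\.tmp$' "$target"
else
    echo "refusing to run drm on '$target'" >&2
fi
```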
The next useful command is dfind, a parallel version of the Unix command find. In this example we find all files on $BIGWORK larger than 1GB and write the result to a file:
mpirun -np 8 dfind -v --output files_1GB.txt --size +1GB $BIGWORK
To learn more about other dfind options, type man dfind.
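To get a feeling for what a size filter selects, the same kind of test can be tried serially with the standard find on a small scratch directory. This is a sketch for comparison only: the file names are made up, the threshold is scaled down from 1GB to 1 MiB so that it runs instantly, and the large_files helper is hypothetical.

```shell
#!/bin/sh
# List regular files below directory $1 that are larger than 1 MiB.
# "-size +1048576c" means: more than 1048576 bytes.
large_files() {
    find "$1" -type f -size +1048576c
}

# Build a throwaway directory with one small and one 2 MiB file.
demo=$(mktemp -d)
printf 'small\n' > "$demo/small.dat"
dd if=/dev/zero of="$demo/large.dat" bs=1024 count=2048 2>/dev/null

# Only large.dat is expected to be listed.
large_files "$demo"

rm -rf "$demo"
```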
If you want to synchronize the directory $BIGWORK/source to $PROJECT/dest such that the directory $PROJECT/dest has the content, ownership, timestamps and permissions of $BIGWORK/source, execute the following:
mpirun -np 8 dsync -D $BIGWORK/source $PROJECT/dest
Note that for this example the dsync command has to be launched on the login node, where both $BIGWORK and $PROJECT are available.
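Since the synchronization only works where both file systems are mounted, a script can check this before calling dsync. The following is a sketch under assumptions: the dirs_available helper is hypothetical, and the --dryrun option (report differences without changing anything) is quoted from memory of the dsync manual, so please confirm it with man dsync for the installed version. The -D option is carried over unchanged from the dsync command shown earlier.

```shell
#!/bin/sh
# Return 0 only if every directory given as an argument exists on this host.
# Hypothetical helper, not part of mpiFileUtils.
dirs_available() {
    for d in "$@"; do
        [ -d "$d" ] || { echo "not available here: $d" >&2; return 1; }
    done
}

src="$BIGWORK/source"

if dirs_available "$src" "$PROJECT"; then
    # Assumed option: preview the differences first.
    mpirun -np 8 dsync --dryrun -D "$src" "$PROJECT/dest"
    # After reviewing the report, run the real synchronization:
    # mpirun -np 8 dsync -D "$src" "$PROJECT/dest"
fi
```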
The last mpiFileUtils tools we consider in this section are dtar and dbz2. The following creates a compressed archive mydir.tar.dbz2 of the directory mydir:
mpirun -np 8 dtar -c -f mydir.tar mydir
mpirun -np 8 dbz2 -z mydir.tar
Please note: If the directory to be archived is located on your $HOME, the archive file itself should be placed on $BIGWORK.
Please note: Transferring a large number of files from the cluster to an external server or vice versa as a single (compressed) tar archive is much more efficient than copying files individually.
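The reverse direction, unpacking such an archive, might look like the sketch below. It assumes that dbz2 -d decompresses mydir.tar.dbz2 to mydir.tar (dropping the .dbz2 suffix) and that dtar -x -f extracts the tar file; both should be confirmed with man dbz2 and man dtar for the installed version. The tar_name helper is hypothetical.

```shell
#!/bin/sh
archive="mydir.tar.dbz2"

# Name of the tar file expected after decompression: strip the .dbz2 suffix.
tar_name() {
    printf '%s\n' "${1%.dbz2}"
}

if [ -f "$archive" ]; then
    # Assumed options: -d decompresses, -x extracts.
    mpirun -np 8 dbz2 -d "$archive"
    mpirun -np 8 dtar -x -f "$(tar_name "$archive")"
else
    echo "archive not found: $archive" >&2
fi
```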
Some other useful commands are:
dstripe - restripe (Lustre) files in paths
dwalk - list, sort, summarize files
ddup - find duplicate files
A complete list of mpiFileUtils utilities and their description can be found at http://mpifileutils.readthedocs.io/.