====== Handling large datasets within the cluster ======

----

While working on the cluster, you may need to perform operations such as //copy//, //delete//, //sync// or //find// on a very large number of files or on files that are very large in size. We recommend first checking whether you can reduce the number of files you need to handle, e.g. by packing them into archives. Both the Lustre file system behind BIGWORK and the controllers of the disk drives are limited in how many files they can handle at once, so the more I/O operations per second (IOPS) your workload needs, the more the overall performance may deteriorate. We realize, however, that reducing the file count is frequently more of a long-term project. For this reason, we provide parallel tools from the MPI-based package [[https://hpc.github.io/mpifileutils/|mpiFileUtils]]. The standard Unix commands like ''cp'', ''rm'' or ''find'' are often comparatively slow, as they are implemented as single-process applications.

A typical example is copying directories containing a large number of files from your $BIGWORK to the local $TMPDIR storage on the compute nodes allocated to your job for further processing, and/or transferring computation results back to your $BIGWORK. Another example is quickly freeing up space in your $BIGWORK by first copying files to your $PROJECT storage and then deleting them from $BIGWORK. You can also use the command ''dsync'' if you back up directories from your $BIGWORK to your $PROJECT storage. Below we look at some of the mpiFileUtils tools in these and other practical examples.

To speed up the recursive transfer of the **contents** of the directory ''$BIGWORK/source'' to ''$TMPDIR/dest'' on a compute node, put the lines below in your job script or enter them at the command prompt of your interactive job. We assume that you have requested 8 cores for your batch job:

  module load GCC/8.3.0 OpenMPI/3.1.4 mpifileutils/0.11
  mpirun -np 8 dcp $BIGWORK/source/ $TMPDIR/dest

Please note the trailing slash (''/'') on the source path, which means "**copy the contents of the source directory**". If the ''$TMPDIR/dest'' directory does not exist before copying, it will be created. The command ''mpirun'' launches ''dcp'' (distributed copy) in parallel with 8 MPI processes.

The directory ''$BIGWORK/dir'' and its contents can be removed quickly using the ''drm'' (distributed remove) command:

  mpirun -np 8 drm $BIGWORK/dir

**Note**: here and below we assume that the module ''mpifileutils/0.11'' is already loaded.

The command ''drm'' supports the option ''--match'', which lets you delete files selectively. See ''man drm'' for more information.

The next useful command is ''dfind'', a parallel version of the Unix command ''find''. In this example we find all files on $BIGWORK larger than 1GB and write the list to a file:

  mpirun -np 8 dfind -v --output files_1GB.txt --size +1GB $BIGWORK

To learn more about other ''dfind'' options, type ''man dfind''.

If you want to synchronize the directory ''$BIGWORK/source'' to ''$PROJECT/dest'' such that the directory ''$PROJECT/dest'' has the content, ownership, timestamps and permissions of ''$BIGWORK/source'', execute:

  mpirun -np 8 dsync -D $BIGWORK/source $PROJECT/dest

Note that for this example the ''dsync'' command has to be launched on the login node, where both $BIGWORK and $PROJECT are available.

The last mpiFileUtils tools we consider in this section are ''dtar'' and ''dbz2''.
The following creates a compressed archive ''mydir.tar.dbz2'' of the directory ''mydir'':

  mpirun -np 8 dtar -c -f mydir.tar mydir
  mpirun -np 8 dbz2 -z mydir.tar

Please note: If the directory to be archived is located on your $HOME, the archive file itself should be placed on $BIGWORK.

Please note: Transferring a large number of files from the cluster to an external server or vice versa as a single (compressed) tar archive is much more efficient than copying the files individually.

Some other useful commands are:

  * ''dstripe'' - restripe files on Lustre in the given paths
  * ''dwalk'' - list, sort and summarize files
  * ''ddup'' - find duplicate files

A complete list of mpiFileUtils utilities and their descriptions can be found at [[http://mpifileutils.readthedocs.io/|http://mpifileutils.readthedocs.io/]].

You might also consider alternative tools like [[https://www.gnu.org/software/parallel/parallel_tutorial.html|GNU parallel]], [[https://github.com/madler/pigz|pigz]] or [[https://github.com/wtsi-ssg/pcp|pcp]]. They are all available as modules on the cluster.
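
The stage-in and stage-out pattern described at the top of this page (copy input from $BIGWORK to $TMPDIR, compute, copy results back) could be wrapped in a small helper for a job script. The following is only a minimal sketch under stated assumptions: the function names ''run'' and ''copy_contents'' and the variables ''NP'' and ''DRY_RUN'' are made up for this example and are not part of mpiFileUtils. Only the ''mpirun -np ... dcp'' call itself is taken from the examples above.

```shell
#!/bin/bash
# Sketch only: run, copy_contents, NP and DRY_RUN are hypothetical names
# chosen for this example; they are not provided by mpiFileUtils.

NP="${NP:-8}"            # number of MPI processes; should match the cores you requested
DRY_RUN="${DRY_RUN:-0}"  # set to 1 to print the commands instead of running them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy the contents of directory $1 into directory $2 with dcp.
# A single trailing slash is enforced on the source path, which
# means "copy the contents of the source directory".
copy_contents() {
    run mpirun -np "$NP" dcp "${1%/}/" "$2"
}
```

In a job script you would load the ''mpifileutils'' module first, then call ''copy_contents "$BIGWORK/source" "$TMPDIR/dest"'' before the computation and, for example, ''copy_contents "$TMPDIR/results" "$BIGWORK/results"'' afterwards (the ''results'' directories are placeholders). Setting ''DRY_RUN=1'' prints the ''mpirun'' commands without executing them, which can help you check the paths before moving a large amount of data.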