HPC @ Uni.lu

High Performance Computing in Luxembourg

Directories such as $HOME, $WORK or $SCRATCH are shared among the nodes of the cluster that you are using (including the front-end) via shared filesystems (NFS, Lustre) meaning that:

  • every file/directory pushed or created on the front-end is available on the computing nodes
  • every file/directory pushed or created on the computing nodes is available on the front-end

The two most common commands you can use for data transfers over SSH:

  • scp: for the full transfer of files and directories (works well only for single files or directories of small/trivial size)
  • rsync: a tool that synchronizes files and directories from one location to another while minimizing data transfer, since only outdated or missing elements are transferred (practically required for long, complex transfers, which are more likely to be interrupted midway).

Of the two, rsync should normally be preferred, as it is more general. Both ensure a secure transfer of the data within an encrypted tunnel.

N.B. There are many alternative ways to transfer files on HPC platforms, and you should check your options according to the problem at hand. In particular, have a look at the bb* family of tools, mentioned in the previous link, as these tools are available on the HPC platforms via modules.

Windows and OS X users may wish to transfer files from their systems to the clusters’ frontends with easy-to-use GUI applications such as:

These applications will need to be configured to connect to the frontends with the same parameters as discussed on the SSH access page.

Using scp

scp (see scp(1) ) or secure copy is probably the easiest of all the methods. The basic syntax is as follows:

scp [-P 8022] [-Cr] source_path destination_path
  • the -P option specifies the SSH port to use (in this case 8022)
  • the -C option activates the compression (actually, it passes the -C flag to ssh(1) to enable compression).
  • the -r option states to recursively copy entire directories (in this case, scp follows symbolic links encountered in the tree traversal). Please note that in this case, you must specify the source file as a directory for this to work.

The syntax for declaring a remote path is as follows on the cluster:
yourlogin@chaos-cluster:path/from/homedir

Transfer from your local machine to the remote cluster front-end

For instance, let’s assume you have a local directory ~/devel/myproject that you want to transfer to the cluster, in your remote homedir.

$> scp -P 8022 -r ~/devel/myproject yourlogin@chaos-cluster:

This recursively transfers your local directory ~/devel/myproject to the cluster front-end (in your homedir). Note that if you configured (as advised elsewhere) the SSH connection in your ~/.ssh/config file, you can use a much simpler syntax:

$> scp -r ~/devel/myproject chaos-cluster:
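
The ~/.ssh/config entry itself is described on the SSH access page. As a rough reminder, it could look like the sketch below. The host name frontend.example.org is a placeholder (not the real front-end address) and yourlogin stands for your own account. The snippet writes the entry to a scratch file rather than to your real configuration, then asks ssh to print the settings it would use for the alias (the -G option requires OpenSSH 6.8 or later).

```shell
# Write a sample entry to a scratch file (NOT to your real ~/.ssh/config).
# frontend.example.org is a placeholder: use the real front-end host name.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
Host chaos-cluster
    Hostname frontend.example.org
    User yourlogin
    Port 8022
EOF

# Ask ssh which settings it resolves for the alias; no connection is made.
resolved=$(ssh -G -F "$cfg" chaos-cluster)
printf '%s\n' "$resolved" | grep -E '^(hostname|user|port) '
```

With such an entry in place, the port and login no longer have to be repeated on every scp or rsync command line.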

Transfer from the remote cluster front-end to your local machine

Conversely, let’s assume you want to retrieve the files ~/experiments/parallel_run/* from the cluster:

$> scp -P 8022 yourlogin@chaos-cluster:experiments/parallel_run/* /path/to/local/directory

Again, if you configured the SSH connection in your ~/.ssh/config file, you can use a simpler syntax:

$> scp chaos-cluster:experiments/parallel_run/* /path/to/local/directory

See the scp(1) man page for more details.

WARNING: scp SHOULD NOT be used in the following cases:

  • When you are copying more than a few files, as scp spawns a new process for each file and can be quite slow and resource intensive when copying a large number of files.
  • When using the -r switch, scp does not know about symbolic links and will blindly follow them, even if it has already made a copy of the file. That can lead to scp copying an infinite amount of data and can easily fill up your hard disk (or, worse, a system shared disk), so be careful.

Using rsync

The clever alternative to scp is rsync, which has the advantage of transferring only the files which differ between the source and the destination. This feature is often referred to as fast incremental file transfer. Additionally, symbolic links can be preserved. The typical syntax of rsync (see rsync(1) ) for the cluster is similar to the one of scp:

rsync --rsh='ssh -p 8022' -avzu source_path destination_path
  • the --rsh option specifies the connector to use (here SSH on port 8022)
  • the -a option corresponds to the “Archive” mode. Most likely you should always keep this on as it preserves file permissions and does not follow symlinks.
  • the -v option enables the verbose mode
  • the -z option enables compression: each file is compressed as it is sent over the pipe. This can greatly decrease the transfer time, depending on what sort of files you are copying.
  • the -u option (or --update) corresponds to an updating process which skips files that are newer on the receiver. At this level, you may prefer the more dangerous option --delete, which deletes extraneous files from the destination directories.

Just like scp, the syntax for qualifying a remote path is as follows on the cluster: yourlogin@chaos-cluster:path/from/homedir
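
The effect of these options can be tried safely on two scratch directories before touching the cluster. The sketch below assumes rsync is installed locally; the directory and file names are made up for the illustration, and -n is rsync's dry-run option.

```shell
# Scratch source/destination directories (hypothetical names, local only).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/src" "$sandbox/dst"
echo "v1" > "$sandbox/src/data.txt"
ln -s data.txt "$sandbox/src/latest"        # a symbolic link in the source
echo "stale" > "$sandbox/dst/obsolete.txt"  # exists only on the destination

# -n (--dry-run) lists what would be transferred without changing anything.
rsync -avzu -n "$sandbox/src/" "$sandbox/dst/"

# Real run: -a copies the symlink as a symlink; obsolete.txt is left alone.
rsync -avzu "$sandbox/src/" "$sandbox/dst/"

# Adding --delete removes files that do not exist in the source.
rsync -avzu --delete "$sandbox/src/" "$sandbox/dst/"
```

Running the dry run first is a cheap way to check what --delete would remove before it actually does so.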

Transfer from your local machine to the remote cluster front-end

Coming back to the previous examples, let’s assume you have a local directory ~/devel/myproject you want to transfer to the cluster, in your remote homedir. In that case:

$> rsync --rsh='ssh -p 8022' -avzu ~/devel/myproject chaos-cluster:

This will synchronize your local directory ~/devel/myproject with the cluster front-end (in your homedir). Note that if you configured (as advised above) your SSH connection in your ~/.ssh/config file, you can use a simpler syntax:

$> rsync -avzu ~/devel/myproject chaos-cluster:

Transfer from the remote cluster front-end to your local machine

Conversely, let’s assume you want to synchronize (retrieve) the remote files ~/experiments/parallel_run/* on your local machine:

$> rsync --rsh='ssh -p 8022' -avzu chaos-cluster:experiments/parallel_run /path/to/local/directory

Again, if you configured the SSH connection in your ~/.ssh/config file, you can use a simpler syntax:

$> rsync -avzu chaos-cluster:experiments/parallel_run /path/to/local/directory

As always, see the man page for more details.

Synchronizing data between the clusters

We have created a simple wrapper script on each cluster to facilitate the synchronization of the files and directories in your home directory between the clusters.

  • On gaia, use the script gaia_sync_home
  • On chaos, use the script chaos_sync_home

Run the script with the --help option to better understand how it behaves.

Example:

  svarrette@access(gaia-cluster) ~>  gaia_sync_home --help
  NAME

  gaia_sync_home -- Synchronize the homedir (or any item relative to your homedir)
            between the two UL clusters site (i.e. either from gaia to chaos,
            or from chaos to gaia)

  SYNOPSIS
    gaia_sync_home [-V | -h]
    gaia_sync_home [--debug] [-v] [-n] [--delete] [--retrieve|--push] item1...

DESCRIPTION
    gaia_sync_home synchronize your files and directory (inside your homedir) between
    the UL clusters using rsync

OPTIONS
    --delete
        Causes gaia_sync_home to delete files on the target if absent in the original
        directory.  This ensure an exact replica but you may loose files
        so use this option with caution.
    --debug
        Debug mode. Causes gaia_sync_home to print debugging messages.
    -h --help
        Display a help screen and quit.
    -n --dry-run
        Simulation mode.
    -p ---push
        Push mode (default). rsync local homedir on the remote cluster.
    -r --retrieve
        Retrieve mode. rsync from the remote cluster into your local homedir.
    -v --verbose
        Verbose mode.
    -V --version
        Display the version number then quit.

EXAMPLES
    To synchronize your full homedir (from gaia to chaos) except hidden files/dirs:

        gaia_sync_home *

    To retrieve your full homedir (from chaos to gaia) execpt hidden files/dirs:

        gaia_sync_home --retrieve *

    To synchonize only the '.ssh' folder and the 'WORK/myProject' directory (from gaia to chaos):

        gaia_sync_home .ssh WORK/myProject

AUTHOR
    Sebastien Varrette <Sebastien.Varrette@uni.lu>
    Web page: http://varrette.gforge.uni.lu

REPORTING BUGS
    Please report bugs to <Sebastien.Varrette@uni.lu>

COPYRIGHT
    This is free software; see the source for copying conditions.  There is
    NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
    PURPOSE.

SEE ALSO
    Other scripts are available on my web site http://varrette.gforge.uni.lu

Using MobaXterm (Windows)

If you are on Windows and you have MobaXterm installed and configured, you probably want to use it to transfer your files to the clusters. Here are the steps to use rsync inside MobaXterm on Windows.

Enable MobaXterm SSH Agent. It will manage the SSH key for you.

  • Go to Settings > SSH tab

  • In the SSH agents section, check Use internal SSH agent “MobAgent”

  • Click on the + button on the right

  • Select your private key file. If you have several keys, you can add them by repeating the steps above.

  • Click on “Show keys currently loaded in MobAgent”. A warning window may appear asking whether you want to run MobAgent. Click on “Yes”.

  • Check that your key(s) appear in the window.

  • Close the window.

  • Click on OK. Restart MobaXterm.

Using a local bash, transfer your files

  • Open a local “bash” shell. Click on Start local terminal on the welcome page of MobaXterm.

  • Find the location of the files you want to transfer. They should be located under /drives/<name of your disk>. You will have to use the Linux command line to move from one directory to another: the cd command changes the current directory and ls lists files. For example, if your files are under C:\Users\cparisot\Downloads\ you should go to /drives/c/Users/cparisot/Downloads/ with this command:

cd /drives/c/Users/cparisot/Downloads/

Then list the files with the ls command. You should see the list of your data files.

  • Once you have found the location of your files, you can begin the transfer with rsync. In this example the source path is /drives/c/Users/cparisot/Downloads/ (watch out for the / character at the end of the path, it is important: with it, rsync transfers the content of the directory rather than the directory itself).

  • Launch the rsync command with these parameters to transfer all the content of the Downloads directory to the /isilon/projects/market_data/ directory on gaia-cluster (the syntax is very important, be careful):

rsync -avzpP -e "ssh -p 8022" /drives/c/Users/cparisot/Downloads/ access-gaia.uni.lu:/isilon/projects/market_data/
  • You should see the output of the transfer in progress. Wait for it to finish (it can take a very long time).
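
The role of the trailing / on the source path can be checked locally before launching a long transfer. The following is a small sketch on scratch directories, assuming rsync is available in your local shell; the directory and file names are invented for the illustration.

```shell
# Local illustration with scratch directories (hypothetical names).
work=$(mktemp -d)
mkdir -p "$work/Downloads" "$work/with_slash" "$work/without_slash"
echo "sample" > "$work/Downloads/quotes.csv"

# Trailing slash on the source: copy the CONTENTS of Downloads.
rsync -a "$work/Downloads/" "$work/with_slash/"
ls "$work/with_slash"        # quotes.csv

# No trailing slash: copy the Downloads directory itself.
rsync -a "$work/Downloads" "$work/without_slash/"
ls "$work/without_slash"     # Downloads
```

In the first case the files land directly in the destination directory; in the second they end up one level deeper, inside a Downloads subdirectory.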

Interrupt and resume a transfer in progress

  • If you want to interrupt the transfer to resume it later, press Ctrl-C and exit MobaXterm.

  • To resume a transfer, go to the right location and execute the rsync command again. Only the files that have not been transferred yet will be transferred.
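
This resume behaviour can be observed on a small local example. The sketch below assumes rsync is installed; two scratch directories stand in for the local and remote sides, and the file names are made up. The -i (--itemize-changes) option makes rsync print one line per item it actually updates.

```shell
# Scratch directories standing in for the local and remote sides.
stage=$(mktemp -d)
mkdir -p "$stage/local" "$stage/remote"
echo "first"  > "$stage/local/a.dat"
echo "second" > "$stage/local/b.dat"

# Pretend the first attempt was interrupted after a.dat only.
rsync -a "$stage/local/a.dat" "$stage/remote/"

# Re-running the full command sends only what is still missing;
# a.dat is already up to date and is therefore skipped.
resumed=$(rsync -ai "$stage/local/" "$stage/remote/")
printf '%s\n' "$resumed"
```

Only b.dat should show up as transferred in the second run, which is why re-issuing the same command after Ctrl-C is safe.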

Alternative approaches

You can also consider alternative approaches to synchronize data with the cluster front-end:

  • rely on a versioning system such as Subversion or (better) Git; this approach works well for source code trees.
  • mount your remote homedir with SSHFS. On Mac OS X, you should consider installing MacFusion for this purpose; on a classical Linux system, just use the command-line sshfs or mc.

Special transfers

Sometimes a lot of files need to go from point A to point B over a Wide Area Network (e.g. across the Atlantic). Since packet latency and other network factors will naturally slow down the transfers, you need to find workarounds, typically with either rsync or tar.

Here is a case study you might find useful in this context.
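
One commonly used tar-based workaround (not necessarily the one from the case study) is to stream a single archive through the connection instead of copying files one by one. The sketch below runs the pipeline locally, with a plain pipe standing in for the network. The remote form shown in the comment reuses the chaos-cluster alias from the examples above, and "incoming" is an assumed, already existing remote directory.

```shell
# Scratch directories (hypothetical); many small files are typical here.
base=$(mktemp -d)
mkdir -p "$base/results" "$base/received"
for i in 1 2 3; do echo "run $i" > "$base/results/run_$i.log"; done

# Pack on one side and unpack on the other in a single stream.
# Over SSH the second half would run remotely, for example:
#   tar -C "$base" -czf - results | ssh chaos-cluster 'tar -xzf - -C incoming'
tar -C "$base" -czf - results | tar -xzf - -C "$base/received"

ls "$base/received/results"
```

Unlike rsync, such a stream cannot be resumed after an interruption, so it mostly pays off for one-shot copies of many small files.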

Troubleshooting
