You may notice that leibniz is not always faster than hopper, a trend we expect to continue for future clusters as well. In the past five years, individual cores have not become much more efficient on an instructions-per-clock-cycle basis. Instead, faster chips were built by including more cores, though at a lower clock speed to stay within the power budget of a socket, and by adding new vector instructions.
Compared to hopper,
- The clock speed of each core is a bit lower (2.4 GHz base frequency instead of 2.8 GHz), and the slightly higher instructions-per-clock does not compensate for this in all applications,
- But there are now 14 cores per socket rather than 10 (so 28 per node rather than 20),
- And there are some new vector instructions that were not present on hopper (AVX2 with Fused Multiply-Add rather than AVX).
For programs that manage to exploit all of this, the peak performance of a node is effectively about twice that of a hopper node. But single-core jobs running code that does not use vectorization may very well run slower.
We use different software for managing modules on leibniz (Lmod instead of TCL-based modules). The new software supports the same commands as the old software, and more.
- Lmod does not allow loading multiple modules with the same name, while in the TCL-based modules it was up to the writer of each module file to determine if this was possible or not. As a consequence, while on hopper one could in principle customize the module path by loading multiple hopper modules, this is not the case on leibniz. Instead, we are providing leibniz modules for loading a specific toolchain (e.g., leibniz/2017a), one to load only software compiled against OS libraries (leibniz/centos7) and one to load all supported modules (leibniz/all).
As we are still experimenting with the setup of the module system, the safest thing to do is to always explicitly load the right version of the leibniz module.
- An interesting new command is "module spider Python", which will show all modules named Python or with Python as part of their name. Moreover, this search is case-insensitive, so it is a very good way to figure out which module name to use if you are unsure about the capitalization.
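As a sketch of the advice above, the start of a job script on leibniz could look like the fragment below (the toolchain version shown is just an example):

```shell
# Illustrative job script fragment; module names and versions are examples only.
module load leibniz/2017a   # explicitly load the right version of the leibniz module
module spider Python        # case-insensitive search for modules with Python in the name
```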
One important change is that the new version of the operating system (CentOS 7.3, based on Red Hat 7), combined with our job management software, allows much better control of the amount of memory a job uses. Hence we can better protect the cluster against jobs that use more memory than requested. This is particularly important since leibniz does not support swapping on the nodes. This choice was made deliberately: swapping to hard disk slows a node down to a crawl, while SSDs that are robust enough to be used for swapping are expensive (memory cells on cheap SSDs can only be written a few hundred times, sometimes as little as 150 times). Instead, we increased the amount of memory available to each core. The better protection of jobs against each other may also allow us to consider setting apart some nodes for jobs that cannot fill a node, and then allowing multiple users on those nodes, rather than have them used very inefficiently while other users are waiting for resources, as is now the case.
- Torque distinguishes between two kinds of memory:
- Resident memory, essentially RAM, which is requested through pmem and mem.
- Virtual memory, the total amount of memory space requested, consisting of both resident memory and swap space, which is requested through pvmem and vmem.
- mem and vmem specify memory for the job as a whole. The Torque manual discourages using them for multi-node jobs, though in the current version they work fine in most but not all cases; the requested memory pool is distributed evenly across cores.
- pmem and pvmem specify memory per core.
- It is better not to mix pmem/pvmem with mem/vmem, as this can lead to confusing situations even though it does work. Torque will generally use the least restrictive of (mem, pmem) and of (vmem, pvmem) for resident and virtual memory respectively.
- Of course pvmem should not be smaller than pmem, and vmem should not be smaller than mem. Otherwise the job will be refused.
- We will set small defaults for users who do not specify these values (to protect the system), but we have experienced that qsub hangs if a user makes a request that conflicts with the defaults (e.g., if we set a default for pvmem and a user specifies a pmem value larger than that default without specifying pvmem, qsub will hang without producing the error message one would expect).
- Our advice for now: use both pmem and pvmem and set them to the same value, as we do not support swapping anyway.
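Putting this advice into practice, a job script header could request memory as sketched below (the sizes and core counts are examples, not recommendations):

```shell
# Illustrative #PBS resource requests for a full leibniz node.
#PBS -l nodes=1:ppn=28
#PBS -l pmem=4gb      # resident memory per core
#PBS -l pvmem=4gb     # virtual memory per core: equal to pmem, as swapping is not supported
```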
- We are still experiencing problems with Open MPI (FOSS toolchain).
- With respect to Intel MPI: We are experimenting with a different way to let Intel MPI start programs (though through the same commands as before). The problem with Intel MPI on hopper was that processes on nodes other than the start node did not run under the control of the job management system. As a result, the CPU times and efficiencies computed by Torque and Moab were wrong, cleanup of failed jobs did not always fully work, and resource use in general was not properly monitored. However, we are not yet sure that the new mechanism is robust enough, so let us know if large jobs do not start, so that we can investigate what happened.
Technical note: The basic idea is that we let mpirun start processes on other nodes through the Torque job management library and not through ssh, but this is all accomplished through a number of environment variables set in the intel modules.