Revisions of slurm
Egbert Eich (eeich)
committed
(revision 233)
- testsuite: on later SUSE versions claim ownership of directory
Egbert Eich (eeich)
accepted
request 1068316
from
Egbert Eich (eeich)
(revision 232)
+ Fixed GpuFreqDef option. When set in slurm.conf, it will be used if --gpu-freq was not explicitly set by the job step.
+ topology/tree - Add new TopologyParam=SwitchAsNodeRank option to reorder nodes based on switch layout. This can be useful if the naming convention for the nodes does not naturally map to the network topology.
+ Removed the default setting for GpuFreqDef. If unset, no attempt to change the GPU frequency will be made if --gpu-freq is not set for the step.
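The GpuFreqDef entries describe a simple precedence rule: an explicit --gpu-freq on the step wins, otherwise GpuFreqDef applies, and if neither is set the GPU frequency is left alone. A minimal Python sketch of that rule, assuming the values are plain strings such as "medium" (the helper `effective_gpu_freq` is hypothetical and not part of Slurm):

```python
from typing import Optional


def effective_gpu_freq(step_gpu_freq: Optional[str],
                       gpu_freq_def: Optional[str]) -> Optional[str]:
    """Hypothetical helper modelling the precedence described above.

    Returns the GPU frequency request to apply to a step, or None when
    no attempt should be made to change the GPU frequency.
    """
    # An explicit --gpu-freq on the job step always takes precedence.
    if step_gpu_freq is not None:
        return step_gpu_freq
    # Otherwise fall back to GpuFreqDef from slurm.conf. With the built-in
    # default removed, an unset GpuFreqDef yields None: leave the GPU alone.
    return gpu_freq_def


print(effective_gpu_freq(None, "medium"))    # medium
print(effective_gpu_freq("high", "medium"))  # high
print(effective_gpu_freq(None, None))        # None
```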
Egbert Eich (eeich)
accepted
request 1067475
from
Egbert Eich (eeich)
(revision 231)
- updated to 23.02.0-0rc1
  * Highlights
    + slurmctld - Add new RPC rate limiting feature. This is enabled through SlurmctldParameters=rl_enable; it is disabled by default.
    + Make scontrol reconfigure and sending a SIGHUP to the slurmctld behave the same. If you were using SIGHUP as a 'lighter' scontrol reconfigure to rotate logs, please update your scripts to use SIGUSR2 instead.
    + Change cloud nodes to show by default. PrivateData=cloud is no longer needed.
    + sreport - Count planned (FKA reserved) time for jobs running in IGNORE_JOBS reservations. Previously this was lumped into IDLE time.
    + job_container/tmpfs - Support running with an arbitrary list of private mount points (/tmp and /dev/shm are the default, but not required).
    + job_container/tmpfs - Set more environment variables in InitScript.
    + Make all cgroup directories created by Slurm owned by root. This was already the behavior in cgroup/v2 but not in cgroup/v1, where by default the ownership of step directories was set to the user and group of the job.
    + accounting_storage/mysql - Change purge/archive to calculate record ages based on end time rather than start or submission time.
    + job_submit/lua - Add support for log_user() from slurm_job_modify().
    + Run the following scripts in slurmscriptd instead of slurmctld: ResumeProgram, ResumeFailProgram, SuspendProgram, ResvProlog, ResvEpilog, and RebootProgram (only with SlurmctldParameters=reboot_from_controller).
    + Only permit changing log levels with 'srun --slurmd-debug' by root or SlurmUser.
    + slurmctld will fatal() when reconfiguring the job_submit plugin fails.
    + Add PowerDownOnIdle partition option to power down nodes after nodes become idle.
    + Add "[jobid.stepid]" prefix from slurmstepd and "slurmscriptd" prefix from slurmscriptd to Syslog logging. Previously was only happening when
buildservice-autocommit
accepted
request 1063957
from
Egbert Eich (eeich)
(revision 230)
baserev update by copy to link target
Egbert Eich (eeich)
accepted
request 1063954
from
Egbert Eich (eeich)
(revision 229)
- testsuite: on later SUSE versions claim ownership of directory /etc/security/limits.d.
buildservice-autocommit
accepted
request 1042071
from
Egbert Eich (eeich)
(revision 228)
baserev update by copy to link target
Egbert Eich (eeich)
accepted
request 1039957
from
Egbert Eich (eeich)
(revision 227)
- Move the ext_sensors/rrd plugin to a separate package: this plugin requires librrd, which in turn requires large parts of the client-side X Window System stack. There is probably no point in cluttering up a system for a plugin that is likely used by only a few.
buildservice-autocommit
accepted
request 1031255
from
Egbert Eich (eeich)
(revision 226)
baserev update by copy to link target
Egbert Eich (eeich)
committed
(revision 225)
* Improve setup-testsuite.sh: copy ssh fingerprints from all nodes.
Egbert Eich (eeich)
committed
(revision 224)
- Test Suite fixes:
  * Update README_Testsuite.md.
  * Clean up leftover files when de-installing the test suite.
  * Adjustment to test suite package: for SLE, mark the openmpi4 devel package and slurm-hdf5 as optional.
  * Add -ffat-lto-objects to the build flags when LTO is set, to make sure the object files we ship with the test suite still work correctly.
Egbert Eich (eeich)
committed
(revision 223)
- Adjustment to test suite package: only recommend openmpi4
Egbert Eich (eeich)
accepted
request 1030610
from
Egbert Eich (eeich)
(revision 222)
- Update README_Testsuite.md.
- Make hdf5 package optional for test suite.
- Clean up leftover files when de-installing test suite.
- Set environment variable SUSE_ZNOW to 0 in %build to avoid module load failures due to unresolved symbols, as modules take advantage of lazy binding (bsc#1200030).
buildservice-autocommit
accepted
request 1030432
from
Egbert Eich (eeich)
(revision 221)
baserev update by copy to link target
Egbert Eich (eeich)
accepted
request 1010642
from
Christian Goll (mslacken)
(revision 220)
- updated to 22.05.5
  - NOTE: Slurm validates that libraries are of the same version. Unfortunately, due to an oversight, we failed to notice that the slurmstepd loads the hash_k12 library only after a job has completed. This means that if the hash_k12 library is upgraded before a job finishes, the slurmstepd will load the new library when the job finishes and will fail due to a version mismatch. This results in nodes with slurmstepd processes stuck indefinitely. These processes require manual intervention to clean up; there is no clean way to resolve these hung slurmstepd processes. The only recommended way to upgrade between minor versions of 22.05 with RPMs, or with upgrades that replace current binaries and libraries, is to drain the nodes of running jobs first.
  - Fixes a number of moderate severity issues; notable are:
    * Load hash plugin at slurmstepd launch time to prevent issues loading the plugin at step completion if the Slurm installation is upgraded.
    * Update nvml plugin to match the unique id format for MIG devices in new Nvidia drivers.
    * Fix multi-node step launch failure when nodes in the controller aren't in natural order. This can happen with inconsistent node naming (such as node15 and node052) or with dynamic nodes, which can register in any order.
    * job_container/tmpfs - Clean up containers even when the .ns file isn't mounted anymore.
    * Wait up to PrologEpilogTimeout before shutting down slurmd to allow prolog and epilog scripts to complete or time out. Previously, slurmd waited 120 seconds before timing out and killing prolog and epilog scripts.
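The step-launch fix above mentions node names such as node15 and node052 that are not in "natural" order. As an illustration only (this is not Slurm's code, and `natural_key` is a hypothetical name), a Python sort key that compares digit runs numerically shows how natural order differs from plain lexical order:

```python
import re
from typing import List, Union


def natural_key(name: str) -> List[Union[int, str]]:
    """Sort key that compares runs of digits numerically ("natural" order)."""
    # re.split with a capturing group keeps the digit runs:
    # "node052" -> ["node", "052", ""] -> ["node", 52, ""]
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r"(\d+)", name)]


names = ["node052", "node15", "node2"]
print(sorted(names))                   # ['node052', 'node15', 'node2']
print(sorted(names, key=natural_key))  # ['node2', 'node15', 'node052']
```

Lexically, "node052" sorts before "node15" because '0' < '1'; numerically, 15 < 52, so the two orderings disagree.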
buildservice-autocommit
accepted
request 1006180
from
Egbert Eich (eeich)
(revision 219)
baserev update by copy to link target
Egbert Eich (eeich)
accepted
request 1005746
from
Egbert Eich (eeich)
(revision 218)
- Do not deduplicate files of the test suite Slurm configuration. This directory is supposed to be mounted over /etc/slurm, therefore it must not contain softlinks to the files in this directory.
- Improve .a and .o file collection for the test suite: find these files even if several appear on a single line.
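The last item concerns collecting every .a and .o file when several are named on one line. The package's actual collection tooling is not shown in this log; as a hypothetical Python illustration (the regex and the name `collect_objects` are assumptions), `re.findall` returns all matches on a line rather than only the first:

```python
import re
from typing import List

# Assumed format: object/archive names are whitespace-separated tokens
# ending in ".o" or ".a". The negative lookahead stops "foo.o.d" from matching.
_OBJ_RE = re.compile(r"\S+\.[ao](?!\S)")


def collect_objects(line: str) -> List[str]:
    """Return every .a/.o file named on the line, not only the first one."""
    return _OBJ_RE.findall(line)


print(collect_objects("ar rcs libtest.a foo.o bar.o"))  # ['libtest.a', 'foo.o', 'bar.o']
```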
buildservice-autocommit
accepted
request 1005247
from
Egbert Eich (eeich)
(revision 217)
baserev update by copy to link target
Egbert Eich (eeich)
accepted
request 1005246
from
Egbert Eich (eeich)
(revision 216)
- Fix build for older product version.
buildservice-autocommit
accepted
request 992362
from
Egbert Eich (eeich)
(revision 215)
baserev update by copy to link target
Egbert Eich (eeich)
accepted
request 992353
from
Egbert Eich (eeich)
(revision 214)
- Fix a potential security vulnerability in the test package (bsc#1201674, CVE-2022-31251).
- Patch NOFILE limit in the slurmd.service copy for the test suite.