File _patchinfo of Package patchinfo.30856

<patchinfo incident="30856">
<issue tracker="bnc" id="1215437">[RN, Slurm] Release notes for an update to 23.02.5</issue>
<packager>eeich</packager>
<rating>moderate</rating>
<category>recommended</category>
<summary>Recommended update for slurm_23_02</summary>
<description>This update for slurm_23_02 fixes the following issues:

- Updated to version 23.02.5 with the following changes:

* Bug Fixes:

+ Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
job's environment when `--ntasks-per-node` was requested.
The method by which it is set, however, is different and should be more
accurate in more situations.
+ Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
new behavior of the pmix plugin in 23.02.0. Note that neither of these
plugins makes use of the `MpiParams=ports=` option, and previously
were only limited by the system's ephemeral port range.
+ Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
a node features plugin is configured.
+ Fix and prevent recurring reservations from overlapping.
+ `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
+ With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
`--mem-per-cpu`.
+ Fix a regression from 22.05.7 in which some jobs were allocated too few
nodes, thus overcommitting cpus to some tasks.
+ Fix a job being stuck in the completing state if the job ends while the
primary controller is down or unresponsive and the backup controller has
not yet taken over.
+ Fix `slurmctld` segfault when a node registers with a configured
`CpuSpecList` while `slurmctld` configuration has the node without
`CpuSpecList`.
+ Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
not registering by `ResumeTimeout`.
+ `slurmstepd` - Avoid skipping cleanup of the spooldir for containers
without a `config.json`.
+ Fix scontrol segfault when 'completing' command requested repeatedly in
interactive mode.
+ Properly handle a race condition between `bind()` and `listen()` calls
in the network stack when running with SrunPortRange set.
+ Federation - Fix revoked jobs being returned regardless of the
`-a`/`--all` option for privileged users.
+ Federation - Fix canceling pending federated jobs from non-origin
clusters which could leave federated jobs orphaned from the origin
cluster.
+ Fix sinfo segfault when printing multiple clusters with `--noheader`
option.
+ Federation - fix clusters not syncing if clusters are added to a
federation before they have registered with the dbd.
+ `node_features/helpers` - Fix node selection for jobs requesting
changeable features with the `|` operator, which could prevent jobs from
running on some valid nodes.
+ `node_features/helpers` - Fix inconsistent handling of `&amp;` and `|`,
where an AND'd feature was sometimes AND'd to all sets of features
instead of just the current set. E.g. `foo|bar&amp;baz` was interpreted
as `{foo,baz}` or `{bar,baz}` instead of how it is documented:
`{foo} or {bar,baz}`.
+ Fix job accounting so that when a job is requeued its allocated node
count is cleared. After the requeue, sacct will correctly show that
the job has 0 `AllocNodes` while it is pending or if it is canceled
before restarting.
+ `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet
received an allocation or if the job was canceled before getting one.
+ Fix Intel oneAPI autodetect: detect the `/dev/dri/renderD[0-9]+` GPUs,
and do not detect `/dev/dri/card[0-9]+`.
+ Fix node selection for jobs that request `--gpus` and a number of
tasks fewer than GPUs, which resulted in incorrectly rejecting these
jobs.
+ Remove `MYSQL_OPT_RECONNECT` completely.
+ Fix cloud nodes in `POWERING_UP` state disappearing (getting set
to `FUTURE`) when an `scontrol reconfigure` happens.
+ `openapi/dbv0.0.39` - Avoid assert / segfault on missing coordinators
list.
+ `slurmrestd` - Correct memory leak while parsing OpenAPI specification
templates with server overrides.
+ Fix overwriting user node reason with system message.
+ Prevent deadlock when `rpc_queue` is enabled.
+ `slurmrestd` - Correct OpenAPI specification generation bug where
fields with overlapping parent paths would not get generated.
+ Fix memory leak as a result of a partition info query.
+ Fix memory leak as a result of a job info query.
+ For step allocations, fix `--gres=none` sometimes not ignoring gres
from the job.
+ Fix `--exclusive` jobs incorrectly gang-scheduling where they shouldn't.
+ Fix allocations with `CR_SOCKET`, gres not assigned to a specific
socket, and block core distribution potentially allocating more sockets
than required.
+ Revert a change in 23.02.3 where Slurm would kill a script's process
group as soon as the script ended instead of waiting as long as any
process in that process group held the stdout/stderr file descriptors
open. That change broke some scripts that relied on the previous
behavior. Setting time limits for scripts (such as
`PrologEpilogTimeout`) is strongly encouraged to avoid Slurm waiting
indefinitely for scripts to finish.
+ Fix `slurmdbd -R` not returning an error under certain conditions.
+ `slurmdbd` - Avoid potential NULL pointer dereference in the mysql
plugin.
+ Fix regression in 23.02.3 which broke X11 forwarding for hosts when
MUNGE sends a localhost address in the encode host field. This is caused
when the node hostname is mapped to 127.0.0.1 (or similar) in
`/etc/hosts`.
+ `openapi/[db]v0.0.39` - fix memory leak on parsing error.
+ `data_parser/v0.0.39` - fix updating qos for associations.
+ `openapi/dbv0.0.39` - fix updating values for associations with null
users.
+ Fix minor memory leak with `--tres-per-task` and licenses.
+ Fix cyclic socket cpu distribution for tasks in a step where
`--cpus-per-task` &lt; usable threads per core.
+ `slurmrestd` - For `GET /slurm/v0.0.39/node[s]`, change format of
node's energy field `current_watts` to a dictionary to account for
unset value instead of dumping 4294967294.
+ `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of QOS's
field "priority" to a dictionary to account for unset value instead of
dumping 4294967294.
+ `slurmrestd` - For `GET /slurm/v0.0.39/job[s]`, the 'return code'
field in `v0.0.39_job_exit_code` will be set to -127 instead of
being left unset where the job does not have a relevant return code.

* Other Changes:

+ Remove `--uid` / `--gid` options from salloc and srun commands. These options
did not work correctly since the CVE-2022-29500 fix in combination with
some changes made in 23.02.0.
+ Add the `JobId` to `debug()` messages indicating when
`cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being automatically
adjusted.
+ Change the log message warning for rate limited users from verbose to
info.
+ `slurmstepd` - Clean up per-task generated environment for containers in
spooldir.
+ Format batch, extern, interactive, and pending step ids into strings that
are human readable.
+ `slurmrestd` - Reduce memory usage when printing out job CPU frequency.
+ `data_parser/v0.0.39` - Add `required/memory_per_cpu` and
`required/memory_per_node` to `sacct --json` and `sacct --yaml` and
`GET /slurmdb/v0.0.39/jobs` from slurmrestd.
+ `gpu/oneapi` - Store cores correctly so CPU affinity is tracked.
+ Allow `slurmdbd -R` to work if the root assoc id is not 1.
+ Limit periodic node registrations to 50 instead of the full `TreeWidth`.
Since unresolvable `cloud/dynamic` nodes must disable fanout by setting
`TreeWidth` to a large number, this would cause all nodes to register at
once.
</description>
</patchinfo>
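
The `node_features/helpers` entry above says that `foo|bar&baz` is documented to mean `{foo} or {bar,baz}`, which is to say that `&` binds more tightly than `|`. The following is a minimal Python sketch of that documented reading only. It is not Slurm's implementation, the function names `parse_feature_sets` and `node_matches` are hypothetical, and it deliberately ignores parentheses, counts, and any other constraint syntax.

```python
def parse_feature_sets(expr: str) -> list[set[str]]:
    """Split a feature expression into alternative sets of required features.

    Illustrative assumption: only the '|' and '&' operators appear, and '&'
    binds more tightly than '|', as the release note describes.
    """
    alternatives = []
    # '|' separates alternatives, so split on it first (lowest precedence).
    for alternative in expr.split("|"):
        # Within one alternative, '&' joins features that are all required.
        features = {name.strip() for name in alternative.split("&") if name.strip()}
        if features:
            alternatives.append(features)
    return alternatives


def node_matches(node_features: set[str], expr: str) -> bool:
    """Return True if the node satisfies at least one alternative set."""
    return any(required <= node_features for required in parse_feature_sets(expr))


if __name__ == "__main__":
    print(parse_feature_sets("foo|bar&baz"))
    print(node_matches({"foo"}, "foo|bar&baz"))
    print(node_matches({"bar"}, "foo|bar&baz"))
```

Under this reading a node with only `foo` qualifies, while a node with only `bar` does not, because `baz` is AND'd to `bar` alone rather than to every alternative.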
