File _patchinfo of Package patchinfo.30855
<patchinfo incident="30855">
  <issue tracker="bnc" id="1215437">[RN, Slurm] Release notes for an update to 23.02.5</issue>
  <packager>eeich</packager>
  <rating>moderate</rating>
  <category>recommended</category>
  <summary>Recommended update for slurm_23_02</summary>
  <description>This update for slurm_23_02 fixes the following issues:
- Updated to version 23.02.5 with the following changes:
  * Bug Fixes:
    + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
      job's environment when `--ntasks-per-node` was requested.
      The method by which it is set, however, is different and should be more
      accurate in more situations.
    + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
      new behavior of the pmix plugin in 23.02.0. Note that neither of these
      plugins makes use of the `MpiParams=ports=` option; previously, both
      were limited only by the system's ephemeral port range.
    + Fix regression in 23.02.2 that caused `slurmctld -R` to crash on startup
      if a node features plugin is configured.
    + Fix and prevent recurring reservations from overlapping.
    + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
    + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
      `--mem-per-cpu`.
    + Fix a regression from 22.05.7 in which some jobs were allocated too few
      nodes, thus overcommitting cpus to some tasks.
    + Fix a job being stuck in the completing state if the job ends while the
      primary controller is down or unresponsive and the backup controller has
      not yet taken over.
    + Fix `slurmctld` segfault when a node registers with a configured
      `CpuSpecList` while `slurmctld` configuration has the node without
      `CpuSpecList`.
    + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
      not registering by `ResumeTimeout`.
    + `slurmstepd` - Avoid skipping cleanup of the spooldir for containers
      without a `config.json`.
    + Fix `scontrol` segfault when the 'completing' command is requested
      repeatedly in interactive mode.
    + Properly handle a race condition between `bind()` and `listen()` calls
      in the network stack when running with `SrunPortRange` set.
    + Federation - Fix revoked jobs being returned regardless of the
      `-a`/`--all` option for privileged users.
    + Federation - Fix canceling pending federated jobs from non-origin
      clusters which could leave federated jobs orphaned from the origin
      cluster.
    + Fix `sinfo` segfault when printing multiple clusters with the
      `--noheader` option.
    + Federation - fix clusters not syncing if clusters are added to a
      federation before they have registered with the dbd.
    + `node_features/helpers` - Fix node selection for jobs requesting
      changeable features with the `|` operator, which could prevent jobs
      from running on some valid nodes.
    + `node_features/helpers` - Fix inconsistent handling of `&amp;` and `|`,
      where an AND'd feature was sometimes AND'd to all sets of features
      instead of just the current set. E.g. `foo|bar&amp;baz` was interpreted
      as `{foo,baz}` or `{bar,baz}` instead of how it is documented:
      `{foo} or {bar,baz}`.
    + Fix job accounting so that when a job is requeued its allocated node
      count is cleared. After the requeue, sacct will correctly show that
      the job has 0 `AllocNodes` while it is pending or if it is canceled
      before restarting.
    + `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet
      received an allocation or if the job was canceled before getting one.
    + Fix Intel oneAPI autodetect: detect the `/dev/dri/renderD[0-9]+` GPUs,
      and do not detect `/dev/dri/card[0-9]+`.
    + Fix node selection for jobs that request `--gpus` and a number of
      tasks fewer than GPUs, which resulted in incorrectly rejecting these
      jobs.
    + Remove `MYSQL_OPT_RECONNECT` completely.
    + Fix cloud nodes in `POWERING_UP` state disappearing (getting set
      to `FUTURE`) when an `scontrol reconfigure` happens.
    + `openapi/dbv0.0.39` - Avoid assert / segfault on missing coordinators
      list.
    + `slurmrestd` - Correct memory leak while parsing OpenAPI specification
      templates with server overrides.
    + Fix overwriting user node reason with system message.
    + Prevent deadlock when `rpc_queue` is enabled.
    + `slurmrestd` - Correct OpenAPI specification generation bug where
      fields with overlapping parent paths would not get generated.
    + Fix memory leak as a result of a partition info query.
    + Fix memory leak as a result of a job info query.
    + For step allocations, fix `--gres=none` sometimes not ignoring gres
      from the job.
    + Fix `--exclusive` jobs incorrectly gang-scheduling where they shouldn't.
    + Fix allocations with `CR_SOCKET`, gres not assigned to a specific
      socket, and block core distribution potentially allocating more sockets
      than required.
    + Revert a change in 23.02.3 where Slurm would kill a script's process
      group as soon as the script ended instead of waiting as long as any
      process in that process group held the stdout/stderr file descriptors
      open. That change broke some scripts that relied on the previous
      behavior. Setting time limits for scripts (such as
      `PrologEpilogTimeout`) is strongly encouraged to avoid Slurm waiting
      indefinitely for scripts to finish.
    + Fix `slurmdbd -R` not returning an error under certain conditions.
    + `slurmdbd` - Avoid potential NULL pointer dereference in the mysql
      plugin.
    + Fix regression in 23.02.3 which broke X11 forwarding for hosts when
      MUNGE sends a localhost address in the encode host field. This happens
      when the node hostname is mapped to 127.0.0.1 (or similar) in
      `/etc/hosts`.
    + `openapi/[db]v0.0.39` - fix memory leak on parsing error.
    + `data_parser/v0.0.39` - fix updating qos for associations.
    + `openapi/dbv0.0.39` - fix updating values for associations with null
      users.
    + Fix minor memory leak with `--tres-per-task` and licenses.
    + Fix cyclic socket cpu distribution for tasks in a step where
      `--cpus-per-task` &lt; usable threads per core.
    + `slurmrestd` - For `GET /slurm/v0.0.39/node[s]`, change format of
      node's energy field `current_watts` to a dictionary to account for
      unset value instead of dumping 4294967294.
    + `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of QOS's
      field "priority" to a dictionary to account for unset value instead of
      dumping 4294967294.
    + `slurmrestd` - For `GET /slurm/v0.0.39/job[s]`, the 'return code'
      field in `v0.0.39_job_exit_code` will be set to -127 instead of
      being left unset where the job does not have a relevant return code.
  * Other Changes:
    + Remove `--uid` / `--gid` options from `salloc` and `srun` commands. These
      options did not work correctly since the CVE-2022-29500 fix in
      combination with some changes made in 23.02.0.
    + Add the `JobId` to `debug()` messages indicating when
      `cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being automatically
      adjusted.
    + Change the log message warning for rate limited users from verbose to
      info.
    + `slurmstepd` - Clean up per task generated environment for containers in
      spooldir.
    + Format batch, extern, interactive, and pending step ids into strings that
      are human readable.
    + `slurmrestd` - Reduce memory usage when printing out job CPU frequency.
    + `data_parser/v0.0.39` - Add `required/memory_per_cpu` and
      `required/memory_per_node` to `sacct --json` and `sacct --yaml` and
      `GET /slurmdb/v0.0.39/jobs` from slurmrestd.
    + `gpu/oneapi` - Store cores correctly so CPU affinity is tracked.
    + Allow `slurmdbd -R` to work if the root assoc id is not 1.
    + Limit periodic node registrations to 50 instead of the full `TreeWidth`.
      Since unresolvable `cloud/dynamic` nodes must disable fanout by setting
      `TreeWidth` to a large number, this would cause all nodes to register at
      once.
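
The two `slurmrestd` format changes listed under Bug Fixes (`current_watts`
and QOS "priority") replace a raw sentinel number with a dictionary. The
Python sketch below only illustrates that idea; it is not Slurm code. The
value 4294967294 is taken from the notes above. The dictionary keys (`set`,
`infinite`, `number`), the second sentinel 4294967295 for "infinite", and the
helper name `to_optional_number` are assumptions made for illustration and
should be checked against the v0.0.39 OpenAPI schema.

```python
# Illustrative sketch only, not Slurm source code.
NO_VAL = 4294967294    # sentinel from the release notes: value is unset
INFINITE = 4294967295  # assumed sentinel meaning "no limit"


def to_optional_number(raw):
    """Wrap a raw 32-bit value in a dictionary that can express 'unset'.

    The key names are assumptions, not confirmed by the release notes.
    """
    if raw == NO_VAL:
        return {"set": False, "infinite": False, "number": 0}
    if raw == INFINITE:
        return {"set": False, "infinite": True, "number": 0}
    return {"set": True, "infinite": False, "number": raw}


print(to_optional_number(4294967294))  # unset value
print(to_optional_number(250))         # ordinary value
```

Under these assumptions, a client would check the `set` flag before using
`number`, rather than comparing the value against 4294967294 itself.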
  </description>
</patchinfo>