AMBER GPU Compilation Notes
===========================

These notes document the local AMBER GPU patches used for BATTER runs.  Apply
the AMBER26 ``netfrc`` patch first because it affects GTI/TI runs on both
NVIDIA CUDA and AMD HIP GPUs.  The AMD section below covers only the GTI/HIP
runtime stability patches.

AMBER26 ``netfrc`` Patch
------------------------

This issue affects AMBER26 GPU GTI/TI runs on both NVIDIA CUDA and AMD HIP
builds; it is not AMD-specific.

``netfrc`` is the PME net-force correction switch in the ``&ewald`` namelist:

* ``netfrc = 1`` removes the average nonbonded/PME force offset;
* ``netfrc = 0`` leaves the forces exactly as computed.
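
As an illustration of what removing an average force offset means, the
following C++ sketch subtracts the mean per-atom force so that the total force
sums to (numerically) zero.  It is a host-side illustration only, not AMBER's
implementation; ``Vec3``, ``net_force`` and ``remove_net_force`` are
hypothetical names introduced here.

.. code-block:: c++

   #include <array>
   #include <vector>

   // Hypothetical 3-component force type used only for this sketch.
   using Vec3 = std::array<double, 3>;

   // Sum of all per-atom forces.
   Vec3 net_force(const std::vector<Vec3>& forces) {
     Vec3 sum{0.0, 0.0, 0.0};
     for (const Vec3& f : forces)
       for (int k = 0; k < 3; ++k) sum[k] += f[k];
     return sum;
   }

   // Subtract the average force from every atom so the total becomes zero
   // up to floating-point rounding.  Force differences between atoms are
   // unchanged because the same offset is removed from each one.
   void remove_net_force(std::vector<Vec3>& forces) {
     if (forces.empty()) return;
     const Vec3 sum = net_force(forces);
     const double n = static_cast<double>(forces.size());
     for (Vec3& f : forces)
       for (int k = 0; k < 3; ++k) f[k] -= sum[k] / n;
   }

In this picture, ``netfrc = 0`` corresponds to skipping ``remove_net_force``
and using the forces as computed.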

AMBER24 defaulted to ``netfrc = 1`` for MD runs whenever ``imin = 0``.  AMBER26
changed the default so that restrained runs with ``ntr = 1`` use
``netfrc = 0``:

.. code-block:: fortran

   if (imin .eq. 0 .and. ntr .eq. 0) then
     netfrc = 1
   else
     netfrc = 0
   end if

For GPU GTI softcore/TI runs, the ``netfrc = 0`` default can destabilize the
force finalization path.  In the failing BATTER case, the run eventually
reported an illegal memory access while copying the 42-term energy buffer, but
the trigger was the AMBER26 default selecting ``netfrc = 0`` for ``ntr = 1``.

The local AMBER26 patch is in:

``pmemd26_src/src/pmemd/src/mdin_ewald_dat.F90``

It keeps the AMBER26 default for restrained runs on non-GTI builds and for
non-TI restrained runs, but restores the AMBER24-style default
(``netfrc = 1``) for TI runs (``icfe`` nonzero) on CUDA/HIP GTI builds:

.. code-block:: fortran

   if (netfrc .lt. 0) then
   #if defined(CUDA) && defined(GTI)
     ! GPU GTI force finalization is sensitive to disabling the PME net-force
     ! correction.  Keep the normal MD default for TI even when ntr is set.
     if (imin .eq. 0 .and. (ntr .eq. 0 .or. icfe .ne. 0)) then
   #else
     if (imin .eq. 0 .and. ntr .eq. 0) then
   #endif
       netfrc = 1
     else
       netfrc = 0
     end if
   end if
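
The default selection above can be summarized as a small decision function.
The C++ sketch below mirrors the patched Fortran logic for illustration only.
``Build`` and ``resolve_netfrc`` are hypothetical names, and treating a
negative ``netfrc`` as "not set by the user" is an inference from the
``netfrc .lt. 0`` guard.

.. code-block:: c++

   // Hypothetical stand-in for the "defined(CUDA) && defined(GTI)" build check.
   enum class Build { NonGti, GpuGti };

   // Return the effective netfrc value.  A negative input is assumed to mean
   // that the user did not set netfrc in the &ewald namelist.
   int resolve_netfrc(int netfrc, int imin, int ntr, int icfe, Build build) {
     if (netfrc >= 0) return netfrc;  // an explicit user setting is kept

     const bool md = (imin == 0);            // MD run, not minimization
     const bool unrestrained = (ntr == 0);   // no positional restraints
     const bool ti = (icfe != 0);            // TI run

     // GPU GTI builds keep netfrc = 1 for TI even when ntr is set;
     // all other builds follow the AMBER26 default.
     const bool enable = (build == Build::GpuGti)
                             ? (md && (unrestrained || ti))
                             : (md && unrestrained);
     return enable ? 1 : 0;
   }

For a restrained TI run (``imin = 0``, ``ntr = 1``, ``icfe = 1``), this sketch
returns 1 for ``Build::GpuGti`` and 0 for ``Build::NonGti``.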

The patch also narrows the warning for ``netfrc = 1`` with restraints: GPU GTI
TI runs no longer warn about this intentional default, while minimization and
non-TI restrained runs still do.

With an unpatched AMBER26 build, add this namelist to the affected BATTER
input files as a workaround:

.. code-block:: fortran

   &ewald
     netfrc = 1,
   /

AMD GPU GTI/HIP Runtime Patches
-------------------------------

These patches are for ROCm/HIP stability in GTI/TI runs.  The observed symptoms
were HSA memory aperture violations, illegal memory accesses, or follow-on
``hipGetDevice`` failures during TI/softcore simulations.

``gti_cuda.cu``
   In ``ik_BuildTINBList``, keep the launch shape for the GTI neighbor-list
   phases at:

   .. code-block:: c++

      threadsPerBlock = 128;
      blocksToUse = gpu->blocks;

   This matches the older, working pmemd24 HIP behavior.  On Frontier/ROCm, the
   larger architecture-dependent launch shapes could cause failures that
   surfaced later, in kernels such as ``kCalculateTIKineticEnergy_kernel``,
   even though the original fault came from GTI neighbor-list construction.

``gti_general_kernels.cu``
   In ``vec_sync``, keep the ``combinedMode`` cases split explicitly.  The
   ROCm-sensitive case is ``combinedMode == 2``: copy ``a0`` into temporaries
   before writing ``a1``.

   .. code-block:: c++

      T vx = pVector[a0];
      T vy = pVector[a0 + cSim.stride];
      T vz = pVector[a0 + cSim.stride2];
      pVector[a1] = vx;
      pVector[a1 + cSim.stride] = vy;
      pVector[a1 + cSim.stride2] = vz;

   This preserves the intended ``V0 -> V1`` sync while avoiding the old shared
   branch that faulted in ``kgSyncVector_kernel`` under ROCm.
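
   The read-then-write ordering can be exercised on the host with a small
   sketch.  It emulates the component-strided layout used in the snippet above
   (x at ``[i]``, y at ``[i + stride]``, z at ``[i + stride2]``) with a plain
   ``std::vector``.  ``sync_atom`` is a hypothetical helper, not the AMBER
   kernel, and the strides are passed as arguments instead of being read from
   ``cSim``.

   .. code-block:: c++

      #include <cstddef>
      #include <vector>

      // Copy the x/y/z components of atom a0 onto atom a1 in a
      // component-strided array.
      template <typename T>
      void sync_atom(std::vector<T>& v, std::size_t a0, std::size_t a1,
                     std::size_t stride, std::size_t stride2) {
        // Read all three source components into temporaries first ...
        const T vx = v[a0];
        const T vy = v[a0 + stride];
        const T vz = v[a0 + stride2];
        // ... then write the destination components.
        v[a1] = vx;
        v[a1 + stride] = vy;
        v[a1 + stride2] = vz;
      }

   For example, with 4 atoms per component (``stride = 4``) and the assumption
   ``stride2 = 8``, ``sync_atom(v, 1, 3, 4, 8)`` copies elements 1, 5 and 9
   onto elements 3, 7 and 11.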

Do not apply these AMD/HIP runtime patches blindly to NVIDIA builds, where the
launch shapes are tuned per architecture.  The ``netfrc`` patch above is the
cross-vendor AMBER26 fix; the GTI/HIP launch and ``vec_sync`` changes are ROCm
stability fixes.