p.enthalabs

amd-strix-halo-vllm-toolboxes/rdma_cluster/setup_guide.md at main · kyuz0/amd-strix-halo-vllm-toolboxes

AMD Strix Halo RDMA Cluster Setup Guide

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#amd-strix-halo-rdma-cluster-setup-guide) This guide details how to configure a two-node **AMD Strix Halo** cluster linked via **Intel E810 (RoCE v2)** for distributed vLLM inference using Tensor Parallelism.

Table of Contents

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#table-of-contents) 1. TL;DR (Quick Start) 2. Concepts & Architecture 3. Hardware Prerequisites 4. Host Configuration (Fedora)

- 4.1 Install Packages

- 4.2 Check Native Firmware

- 4.3 Network Configuration

- 4.4 BIOS & Kernel Configuration

- 4.5 Firewall Rules

5. Toolbox Installation & Network Verification

- 5.1 Prerequisites: Passwordless SSH

- 5.2 Installation

- 5.3 Verify RDMA Connection

6. Running the Cluster

- 6.1 Setup & Verify

- 6.2 Launching vLLM

7. Troubleshooting 8. References & Acknowledgements

- * *

1. TL;DR (Quick Start)

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#1-tldr-quick-start) **On Both Nodes:**

1. **Preparation**:

- **Install/Update Fedora 43** and the E810 NICs (Check firmware: `ethtool -i <iface>`).

- **BIOS/Kernel**: Set iGPU to 512MB and apply kernel params (`iommu=pt`, `pci=realloc`, etc.).

- **SSH**: Configure **passwordless SSH** between nodes.

2. **Networking**: Assign static IPs (`192.168.100.1`&`.2`), set MTU 9000, and trust the interface in firewall. 3. **Install Toolbox**: Run `./refresh_toolbox.sh` (this automatically installs the container with RDMA support and the custom `librccl.so` patch). 4. **Run Cluster**:

- Run `start-vllm-cluster`.

- Select **"2. Start Ray Cluster"** (Follow prompts using the TUI).

- Select **"4. Launch VLLM Serve"** and choose your model. (Export `HF_TOKEN` first for gated models!)

**Key Note**: The `refresh_toolbox.sh` script detects your Infiniband/RDMA devices and automatically configures the container to expose them.

- * *

2. Concepts & Architecture

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#2-concepts--architecture) ![Image 1: concepts](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/concepts.png)

To fully utilize the Strix Halo cluster, it is helpful to understand the technologies involved:

- **vLLM**: A high-performance inference engine. To run models larger than a single GPU (or APU) can handle, it splits the model using **Tensor Parallelism (TP)**.

- **Ray**: A distributed computing framework. vLLM uses Ray to **orchestrate** the cluster, manage the "worker" processes on each node, and ensure they start up correctly. Ray handles the _control plane_ (issuing commands).

- **RCCL (ROCm Collective Communication Library)**: The AMD equivalent of NVIDIA's NCCL. This library handles the **data plane**—specifically, the extremely fast synchronization of tensor data between GPUs. When TP=2, the two nodes must exchange partial results after _every single layer_ of the neural network. This happens thousands of times per second.

- **RoCE v2 (RDMA over Converged Ethernet)**: The protocol that allows RCCL to write data directly from one Node's memory to the other Node's memory, bypassing the CPU and OS kernel.

- **Without RDMA**: Latency is ~70-100µs (TCP/IP overhead).

- **With RDMA**: Latency is ~5µs.

- **Why it matters**: For interactive token generation, high latency kills performance. RoCE makes the two nodes feel like a single machine.

- * *

3. Hardware Prerequisites

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#3-hardware-prerequisites) ![Image 2: cluster](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/cluster.png)

- **Nodes**: 2x Framework Desktop Mainboards with AMD Ryzen AI MAX+ "Strix Halo", 128GB of Unified Memory.

- **Network Cards**: Intel Ethernet Controller E810-CQDA1 (or similar 100GbE QSFP28).

- **Connection**: Direct Attach Copper (DAC) cable (e.g., QSFPTEK 100G QSFP28 DAC). No switch required for 2 nodes.

- **PCIe Note**: The Framework motherboard PCIe slot is physically **x4**, so a riser is required to plug in a 16x card (e.g., CY PCI-E Express 4x to 16x Extender). **Test Setup Note:** One of the boards in this setup has a modified PCIe slot (cut by Framework using an ultrasonic knife) to accept x16 cards directly. **This is not recommended for users.** Risers are the cheaper, safer, and easier solution. Performance is identical (~50Gbps bandwidth, ~5µs latency).

- * *

4. Host Configuration (Fedora)

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#4-host-configuration-fedora) Perform these steps on the **Host OS** (Fedora 43) of **both nodes**.

**Tested Host Configuration:**

| Node | Kernel | OS | IP (RDMA Interface) | | --- | --- | --- | --- | | **Node 1** | `6.18.5-200.fc43.x86_64` | Fedora Linux 43 | `192.168.100.1/30` | | **Node 2** | `6.18.6-200.fc43.x86_64` | Fedora Linux 43 | `192.168.100.2/30` |

> **Note:** These specific kernel versions were verified to work. Fedora 43 is recommended.

4.1 Install Packages

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#41-install-packages) Install the core RDMA userspace tools. You do **not** need proprietary Intel drivers; the in-kernel drivers work perfectly.

- **Ethernet Driver:**`ice`

- **RDMA Driver:**`irdma` (Unified driver for RoCE v2 & iWARP)

sudo dnf install rdma-core libibverbs-utils perftest

- `rdma-core`: The userspace components for the RDMA subsystem (libraries, daemons, and configuration tools).

- `libibverbs-utils`: Utilities for querying RDMA devices (e.g., `ibv_devinfo`).

- `perftest`: A suite of benchmarks (e.g., `ib_write_bw`, `ib_send_lat`) to verify RDMA bandwidth and latency.

4.2 Check Native Firmware

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#42-check-native-firmware) Use `ethtool` to check the current firmware version of your Intel E810 card.

ethtool -i enp194s0np0

**Recommended Firmware:** Ensure your firmware is at least as new as the version shown below (Firmware `4.91...`). If your firmware is older, please update it using the Intel® Ethernet NVM Update Tool for E810 Series.

**Example Output:**

``` driver: ice version: 6.18.5-200.fc43.x86_64 firmware-version: 4.91 0x800214b5 1.3909.0 expansion-rom-version: bus-info: 0000:c2:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: yes supports-register-dump: yes supports-priv-flags: yes ```

4.3 Network Configuration

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#43-network-configuration) This guide assumes a subnet of `192.168.100.0/30`.

**Identify your interface**: Run `ip link` to find your 100GbE card (e.g., `enp194s0np0`).

**Node 1 (Head - 192.168.100.1):**

Bring link up

sudo ip link set enp194s0np0 up

Assign IP

sudo ip addr add 192.168.100.1/30 dev enp194s0np0

Set MTU (Jumbo Frames)

sudo nmcli connection modify "rdma0" ethernet.mtu 9000 sudo nmcli connection up "rdma0"

**Node 2 (Worker - 192.168.100.2):**

Bring link up

sudo ip link set enp194s0np0 up

Assign IP

sudo ip addr add 192.168.100.2/30 dev enp194s0np0

Set MTU

sudo nmcli connection modify "rdma0" ethernet.mtu 9000 sudo nmcli connection up "rdma0"

**Verify Routing:** Ensure the route exists on both:

sudo ip route add 192.168.100.0/30 dev enp194s0np0

**Verify Link:**

rdma link

Output should show: state ACTIVE physical_state LINK_UP used_usec X ...

4.4 BIOS & Kernel Configuration

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#44-bios--kernel-configuration) **1. BIOS Settings:** Set the **iGPU Memory Allocation** to the **minimum possible (512MB)**. We will use the GTT (Graphics Translation Table) to dynamically allocate system memory as "Unified Memory" for the GPU.

**2. Kernel Parameters:** Update GRUB to enable unified memory, optimize RDMA performance, and fix PCI resource allocation.

Edit `/etc/default/grub` and append to `GRUB_CMDLINE_LINUX`:

``` iommu=pt pci=realloc pcie_aspm=off amdgpu.gttsize=126976 ttm.pages_limit=32505856 ```

**Explanation of Parameters:**

- `iommu=pt`: Sets IOMMU to "Pass-Through" mode. This is critical for performance, reducing overhead for both the RDMA NIC and the iGPU unified memory access.

- `pci=realloc`: Reallocates PCI BARs. Often needed on consumer platforms to properly map large address spaces for devices like the E810 or Strix Halo.

- `pcie_aspm=off`: Disables PCIe Active State Power Management. Prevents latency spikes and link negotiation issues on the 100GbE connection.

- `amdgpu.gttsize=126976`: Caps the GPU GTT size to ~124GiB (126976MB). This defines how much system RAM the GPU can address as its own "VRAM".

- `ttm.pages_limit=32505856`: Limits the Translation Table Manager to ~124GiB (in 4KB pages), matching the GTT size.

**3. Apply Changes:**

sudo grub2-mkconfig -o /boot/grub2/grub.cfg sudo reboot

4.5 Firewall Rules

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#45-firewall-rules) Applications like Ray and NCCL use random high ports. It is easiest to trust the internal RDMA interface completely.

Assign the interface to the trusted zone permanently

sudo firewall-cmd --permanent --zone=trusted --add-interface=enp194s0np0

Reload firewall

sudo firewall-cmd --reload

- * *

5. Toolbox Installation & Network Verification

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#5-toolbox-installation--network-verification)

5.1 Prerequisites: Passwordless SSH

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#51-prerequisites-passwordless-ssh) The cluster management and verification scripts rely on SSH to execute commands on remote nodes. You must configure **passwordless SSH** between both nodes (root or sudo-enabled user).

- **Guide:**How to Set Up SSH Keys on Linux (DigitalOcean)

- **Quick Check:** Run `ssh <other-node-ip> date` from each node. It should print the date without asking for a password.

5.2 Installation

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#52-installation)

The toolbox container provided in this repo includes a **critical patch**: a custom-built `librccl.so` that enables `gfx1151` (Strix Halo) support for RDMA (https://github.com/kyuz0/rocm-systems/tree/gfx1151-rccl), which is currently missing in upstream ROCm packages. This library is automatically compiled using the `build-rccl` GitHub Action in this repository, which generates the artifact that is then bundled into the Docker container.

To install the toolbox on **both nodes**, run:

./refresh_toolbox.sh

**What this does:**

1. Pulls the latest `kyuz0/vllm-therock-gfx1151` image. 2. Detects if `/dev/infiniband` exists on your host. 3. Creates the toolbox with flags to expose:

- **iGPU Access**: `/dev/dri`, `/dev/kfd` (Required for ROCm)

- **RDMA Access**: `/dev/infiniband`, `--group-add rdma`

- **Memory Pinning**: `--ulimit memlock=-1` (Required for DMA)

5.3 Verify RDMA Connection

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#53-verify-rdma-connection) Before proceeding to run the cluster, verify that RDMA is active and providing low latency (~5µs vs ~70µs for Ethernet).

Run the provided verification script from the **Head Node**:

Inside toolbox

/opt/compare_eth_vs_rdma.sh

**Expected Results:**

``` Path Latency Bandwidth ------------------------------------------------ Ethernet (1G LAN) 0.074 ms 0.94 Gbps Ethernet (RoCE NIC) 0.068 ms 55.70 Gbps RDMA (RoCE) 5.23 us 50.64 Gbps ```

_Note the massive latency drop (milliseconds to microseconds) for RDMA._

- * *

6. Running the Cluster

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#6-running-the-cluster) A TUI utility, `start-vllm-cluster`, is provided to manage the Ray cluster and vLLM.

6.1 Setup & Verify

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#61-setup--verify) 1. **Enter the toolbox**: toolbox enter vllm 2. **Run the Cluster Manager**: start-vllm-cluster 3. **Configure IPs** (Option 1):

- Ensure Head is `192.168.100.1` and Worker is `192.168.100.2`.

4. **Start Ray Cluster** (Option 2):

- **On Node 1**: Select **"Head"** when prompted.

- **On Node 2**: Select **"Worker"** when prompted.

- The script effectively runs: # Head

export NCCL_SOCKET_IFNAME=<rdma_iface> ray start --head --node-ip-address=192.168.100.1 ...

Worker

ray start --address=192.168.100.1:6379 ...

5. **Check Status** (Option 3):

- Ensure you see **2 nodes** and adequate GPU resources (e.g., `2.0 GPU`).

6.2 Launching vLLM

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#62-launching-vllm) Once the cluster is active (checked via Option 3):

1. Select **"4. Launch VLLM Serve"** in the TUI. 2. Choose a model (e.g., `Meta-Llama-3.1-8B-Instruct`). 3. **Configuration Menu**:

- **Tensor Parallelism**: Set to `2` (one GPU per node).

- **Context Length**: Auto or custom (e.g., `131072`).

- **Erase vLLM Cache**: Select `YES` if you are restarting after a crash.

- **Force Eager Mode**: Select `YES`.

- _Why?_ CUDA Graphs can be unstable on distributed APU clusters and cause deadlocks. Eager mode is safer, but you might be able to squeeze 1-3% more performance if you take a chance and disable it.

4. **Launch**: Select "LAUNCH SERVER".

**Important Gotchas:**

- **First Run Download**: When running a model for the first time, each node in the cluster must download the weights independently. This may take some time depending on your internet connection.

- **Gated Models (e.g., Gemma)**:

- Models like `google/gemma-2-27b-it` are "gated" and require you to request access on Hugging Face.

- You must export your Hugging Face token before running the cluster script: export HF_TOKEN=your_token_here

start-vllm-cluster

- If you don't provide a token or haven't accepted the license on Hugging Face, the download will fail.

- * *

7. Troubleshooting

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#7-troubleshooting)

vLLM Deadlocks / Hangs

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#vllm-deadlocks--hangs)

- **Cause**: CUDA Graph capture can freeze on distributed APU nodes.

- **Fix**: Enable **"Force Eager Mode"** in the start menu.

Firmware

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#firmware) If you see link issues, ensure your Intel E810 firmware is up to date using the Intel standard tools.

- * *

8. References & Acknowledgements

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#8-references--acknowledgements)

- **Reddit - Strix Halo Batching with Tensor Parallel**: Thread by Hungry_Elk_3276

- Special thanks to user **Hungry_Elk_3276** for their initial experiments with vLLM RDMA, which highlighted the missing `gfx1151` support in upstream RCCL.

- * *

9. Alternative: Thunderbolt Networking

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#9-alternative-thunderbolt-networking) If you do not have dedicated 100GbE RDMA network cards, you can directly connect the two nodes using a high-quality **Thunderbolt 4 / USB4 cable**. This will create a `thunderbolt0` network interface.

While it lacks the ultra-low microprocessor-level latency of RDMA, it provides significantly more bandwidth than standard 1GbE/5GbE Ethernet and is easier to configure.

> **Note**: `thunderbolt-net` relies on standard OS kernel TCP/IP stacks.

9.1 Thunderbolt Configuration

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#91-thunderbolt-configuration) **1. Establish Connection:** Connect the nodes directly using a certified Thunderbolt 4 or USB4 cable. Verify the link is active:

ip link show thunderbolt0

**2. Network Configuration (Head - Node 1):** Configure a persistent connection using `nmcli` with a static IP and Jumbo Frames (reduces CPU overhead). _Note: Jumbo Frames may be unsupported on some Thunderbolt host controllers._

sudo nmcli connection add type ethernet ifname thunderbolt0 con-name thunderbolt0 ipv4.method manual ipv4.addresses 192.168.2.1/24 mtu 9000 sudo nmcli connection up thunderbolt0

**3. Network Configuration (Worker - Node 2):**

sudo nmcli connection add type ethernet ifname thunderbolt0 con-name thunderbolt0 ipv4.method manual ipv4.addresses 192.168.2.2/24 mtu 9000 sudo nmcli connection up thunderbolt0

**4. Firewall Rules:** To ensure Ray and NCCL can communicate freely over this link:

Assign the interface to the trusted zone permanently

sudo firewall-cmd --permanent --zone=trusted --add-interface=thunderbolt0 sudo firewall-cmd --reload

9.2 Running vLLM over Thunderbolt

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#92-running-vllm-over-thunderbolt) Our cluster scripts dynamically detect the network interface based on the provided IPs. There is no need to manually export environment variables!

1. Open the Toolbox: `toolbox enter vllm` 2. Launch the cluster manager: `start-vllm-cluster` 3. Select **Option 1 (Configure IPs)**. 4. Set the **Head IP** explicitly to `192.168.2.1` and the **Worker IP** to `192.168.2.2`. 5. Start the cluster normally (Option 2). The script will automatically discover and utilize `thunderbolt0` as the backend network for Ray orchestration and GPU synchronization.

9.3 Validating the Link

[](https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md#93-validating-the-link) I have added Thunderbolt support to the `compare_eth_vs_rdma.sh` script. Run it from inside the toolbox to see the latency and bandwidth of your Thunderbolt link compared to your other network interfaces.

You can use the `-t` flag to ONLY benchmark the Thunderbolt connection (or `-e`, `-r`, `-i` for the others):

/opt/compare_eth_vs_rdma.sh -t