Remote Direct Memory Access – RDMA and RoCE

February 14, 2019 / February 21st, 2019 / NEWS

Remote Direct Memory Access (RDMA)

Remote Direct Memory Access (RDMA) provides direct memory access from the memory of one host (storage or compute) to the memory of another host without involving the remote Operating System and CPU, boosting network and host performance with lower latency, lower CPU load and higher bandwidth. In contrast, TCP/IP communications typically require copy operations, which add latency and consume significant CPU and memory resources.

Remote Direct Memory Access (RDMA) is a technology that allows computers in a network to exchange data in main memory without involving the processor, cache or operating system of either computer. Like locally based Direct Memory Access (DMA), RDMA improves throughput and performance because it frees up resources. RDMA also facilitates a faster data transfer rate and low-latency networking. It can be implemented for networking and storage applications.

How RDMA works

RDMA enables more direct data movement in and out of a server by implementing a transport protocol in the network interface card (NIC) hardware. The technology supports a feature called zero-copy networking that makes it possible to read data directly from the main memory of one computer and write that data directly to the main memory of another computer.
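The difference between copying a payload and handing over the buffer in place can be sketched locally. This is only an analogy, not RDMA: the function names are hypothetical, and the two copies stand in for the hand-offs a conventional network stack might make.

```python
from typing import Tuple


def send_with_copies(app_buffer: bytearray) -> Tuple[bytearray, int]:
    """Mimic a conventional stack: the payload is copied at each hand-off."""
    socket_buffer = bytearray(app_buffer)   # copy 1: application -> socket buffer
    nic_buffer = bytearray(socket_buffer)   # copy 2: socket buffer -> adapter buffer
    return nic_buffer, 2


def send_zero_copy(app_buffer: bytearray) -> Tuple[memoryview, int]:
    """Mimic zero-copy: the adapter is given a view of the application buffer."""
    return memoryview(app_buffer), 0


if __name__ == "__main__":
    payload = bytearray(b"hello")
    copied, copy_count = send_with_copies(payload)
    view, view_copies = send_zero_copy(payload)

    # Changing the application buffer shows which side shares memory with it.
    payload[0] = ord("J")
    print(bytes(copied), copy_count)   # the copy still holds the old data
    print(bytes(view), view_copies)    # the view sees the change in place
```

In real RDMA the adapter, not the CPU, moves the bytes, so the saving is in CPU cycles and memory bandwidth as well as in the number of copies.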

If both the sending and receiving devices support RDMA, a transfer between them completes much faster than it would over a comparable non-RDMA network.

Figure: RDMA vs. standard network connection
A standard network connection is shown on the left and an RDMA connection on the right. The initiator and the target must use the same type of RDMA technology, for example RDMA over Converged Ethernet or InfiniBand.

RDMA has proven useful in fast, massively parallel high-performance computing (HPC) clusters and in data center networks. It is particularly useful for big data analytics, for supercomputing workloads, and for machine learning that requires the lowest possible latencies and the highest transfer rates. RDMA is also used for connections between nodes in compute clusters and for latency-sensitive database workloads.

Network protocols that support RDMA

RDMA over Converged Ethernet (RoCE). RoCE is a network protocol that enables RDMA over an Ethernet network by defining how RDMA operates in such an environment.

Internet Wide Area RDMA Protocol (iWARP). iWARP leverages the Transmission Control Protocol (TCP) or Stream Control Transmission Protocol (SCTP) to transmit data. It was developed by the Internet Engineering Task Force to enable applications on a server to read or write directly to applications executing on another server without support from the operating system on either server.

InfiniBand. RDMA is the standard protocol for high-speed InfiniBand network connections. This RDMA network protocol is often used for intersystem communication and was first popular in high-performance computing environments. Because of its ability to speedily connect large computer clusters, InfiniBand has found its way into additional use cases such as big data environments, databases, highly virtualized settings and resource-demanding web applications.

Products and vendors that support RDMA

  • Apache Hadoop and Apache Spark big data analysis
  • Baidu Paddle (PArallel Distributed Deep LEarning) platform
  • Broadcom and Emulex adapters
  • Caffe deep learning framework
  • Cavium FastLinQ 45000/41000 Series Ethernet NICs
  • Ceph object storage platform
  • ChainerMN Python-based deep learning open source framework
  • Chelsio Terminator 5 & 6 iWARP adapters
  • Dell EMC PowerEdge servers
  • FreeBSD operating system
  • GlusterFS internetwork filesystem
  • Intel Xeon Scalable processors and Platform Controller Hub
  • Mellanox ConnectX family of network adapters and InfiniBand switches
  • Microsoft Windows Server (2012 and higher) via SMB Direct supports RDMA-capable network adapters, Hyper-V virtual switch and the Cognitive Toolkit.
  • Nutanix’s upcoming NX-9030 NVM Express flash appliance is said to support RDMA.
  • Nvidia DGX-1 deep learning appliance
  • Oracle Solaris 11 and higher for NFS over RDMA
  • Red Hat
  • SUSE Linux Enterprise Server
  • TensorFlow open source software library for machine intelligence
  • Torch scientific computing framework
  • VMware ESXi

RDMA with flash, SSD and NVDIMMs

Because all-flash storage systems perform much faster than disk or hybrid arrays, storage latency is significantly lower. As a result, the traditional software stack starts to act as the bottleneck and accounts for a growing share of overall latency. RDMA is one of the technologies that can step in to lower that latency.

Non-volatile dual in-line memory module (NVDIMM), a type of memory that acts as storage, is quickly finding its way into data centers. NVDIMM can improve database performance by as much as 100 times, and will prove especially beneficial in virtualized clusters and as a means to accelerate virtual SANs. But to get the most out of NVDIMM, in terms of both data integrity and performance when transmitting data between servers or throughout a virtual cluster, you must use the fastest network possible. RDMA over Converged Ethernet fits the bill by allowing data to move directly between NVDIMMs with little system overhead and low latency.

RDMA over Fabrics and future directions

RDMA over Fabrics, a logical evolution of existing shared storage architectures, improves the performance of access to shared data held on solid-state and flash memory. Here, an RDMA network sends data between memory address spaces over an interface using a protocol, such as RoCE, iWARP or InfiniBand, that accelerates operations to increase the value of application, server and storage investments. Fibre Channel storage networks at Gen 6 (32 gigabits per second) and PCI Express support the RDMA over Fabrics interface.


RDMA over Converged Ethernet (RoCE) 

RDMA over Converged Ethernet (RoCE) is a standard protocol, defined by the InfiniBand Trade Association (IBTA), that enables RDMA's efficient data transfer over Ethernet networks. Its transport can be offloaded to a hardware RDMA engine in the adapter, which delivers superior performance. RoCE uses UDP encapsulation (in the RoCE v2 version of the protocol), which allows it to be routed across Layer 3 networks. RDMA is a key capability used natively by the InfiniBand interconnect technology. InfiniBand and Ethernet RoCE share a common user API but have different physical and link layers.
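To make the UDP encapsulation concrete, the sketch below packs a 12-byte InfiniBand Base Transport Header (BTH) and wraps it in a UDP header addressed to port 4791, the destination port registered for RoCE v2. It is a simplified illustration, not a wire-accurate implementation: the function names are hypothetical, the flag bits are left at zero, the trailing invariant CRC and the IP and Ethernet headers are omitted, and the field layout and opcode value reflect the IBTA format as understood here and should be checked against the specification.

```python
import struct

ROCE_V2_UDP_PORT = 4791     # UDP destination port registered for RoCE v2
RC_RDMA_WRITE_ONLY = 0x0A   # assumed opcode: single-packet RDMA WRITE, reliable connection


def pack_bth(opcode, dest_qp, psn, pkey=0xFFFF, ack_req=False):
    """Pack a simplified 12-byte Base Transport Header (flag bits left at zero)."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in 8 bits")
    if not 0 <= dest_qp < (1 << 24):
        raise ValueError("destination queue pair number must fit in 24 bits")
    if not 0 <= psn < (1 << 24):
        raise ValueError("packet sequence number must fit in 24 bits")
    psn_word = (int(ack_req) << 31) | psn   # top bit carries the ack-request flag
    return struct.pack("!BBHII", opcode, 0, pkey, dest_qp, psn_word)


def unpack_bth(header):
    """Decode the fields written by pack_bth."""
    opcode, _flags, pkey, qp_word, psn_word = struct.unpack("!BBHII", header[:12])
    return {
        "opcode": opcode,
        "pkey": pkey,
        "dest_qp": qp_word & 0xFFFFFF,
        "psn": psn_word & 0xFFFFFF,
        "ack_req": bool(psn_word >> 31),
    }


def udp_encapsulate(src_port, payload):
    """Prefix payload with a UDP header; a zero checksum means 'not computed' over IPv4."""
    length = 8 + len(payload)
    return struct.pack("!HHHH", src_port, ROCE_V2_UDP_PORT, length, 0) + payload


if __name__ == "__main__":
    bth = pack_bth(RC_RDMA_WRITE_ONLY, dest_qp=0x12, psn=1)
    datagram = udp_encapsulate(49152, bth)
    print(len(bth), len(datagram), unpack_bth(bth))
```

Because the transport header rides inside an ordinary UDP datagram, any IP router can forward the packet, which is what lets this form of RoCE cross Layer 3 boundaries.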

RoCE Fabric Consideration

Mellanox ConnectX-4 and later generations incorporate Resilient RoCE, which provides best-of-breed performance with only a simple enablement of Explicit Congestion Notification (ECN) on the network switches. A lossless fabric, which is usually achieved by enabling Priority Flow Control (PFC), is no longer mandated. The Resilient RoCE congestion management, implemented in ConnectX NIC hardware, delivers reliability even with UDP over a lossy network.
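As a rough illustration of what enabling ECN on a switch means, the sketch below sets the Congestion Experienced codepoint in the two low-order bits of the IP TOS/traffic-class byte when a queue is too deep. The codepoint values follow RFC 3168; the function name and the single fixed threshold are assumptions made for illustration, and real switches typically mark probabilistically between two thresholds. It does not model the Resilient RoCE logic in the NIC.

```python
ECN_MASK = 0b11   # ECN occupies the two low-order bits of the TOS byte
NOT_ECT = 0b00    # sender does not support ECN
ECT_1 = 0b01      # ECN-capable transport, codepoint 1
ECT_0 = 0b10      # ECN-capable transport, codepoint 0
CE = 0b11         # congestion experienced


def mark_congestion(tos: int, queue_depth: int, threshold: int) -> int:
    """Return the TOS byte a switch would forward for this packet.

    Packets from ECN-capable senders are marked CE instead of being dropped
    when the egress queue is deeper than the threshold. DSCP bits are kept.
    """
    ecn = tos & ECN_MASK
    if queue_depth > threshold and ecn != NOT_ECT:
        return tos | CE
    return tos


if __name__ == "__main__":
    dscp = 26
    tos = (dscp << 2) | ECT_0
    marked = mark_congestion(tos, queue_depth=120, threshold=100)
    print(bin(tos), bin(marked), (marked & ECN_MASK) == CE)
```

A receiver that sees the CE mark can tell the sender to slow down, so congestion is handled by rate control at the endpoints instead of by pausing the link.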

Mellanox Spectrum Ethernet switches provide 100GbE line rate performance and consistent low latency with zero packet loss. With their high performance, low latency, intelligent end-to-end congestion management and QoS options, Mellanox Spectrum Ethernet switches are well suited to implementing a RoCE fabric at scale. Additionally, Spectrum makes it easy to configure RoCE and provides end-to-end flow-level visibility.

Implementing Applications over RDMA/RoCE

Application developers have several options for implementing acceleration with RDMA/RoCE, using either RDMA infrastructure verbs/libraries or middleware libraries:

  • RDMA Verbs – The libibverbs library (available inbox in major distributions) provides the API needed to send and receive data
  • RDMA Communication Manager (RDMA-CM) – The RDMA CM library is a communication manager (CM) used to set up reliable, connected, and unreliable datagram data transfers. It works in conjunction with the RDMA verbs API that is defined by the libibverbs library.

  • Unified Communication X (UCX) – Open-source production-grade communication framework for data-centric and high-performance applications driven by industry, laboratories, and academia
  • Accelio – A high-performance asynchronous reliable messaging and RPC open-source community driven library
    NOTE: Accelio is no longer recommended for new projects. For new projects, please refer to UCX.
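The verbs programming model can be hard to picture from a list of libraries, so here is a toy, in-process model of two of its ideas: registering a memory region to obtain a remote key, and a one-sided write that lands in the target's buffer without the target running any receive code. Every class and method name below is hypothetical and none of it calls libibverbs; in a real application these steps would go through verbs calls such as ibv_reg_mr and ibv_post_send, with queue pairs, completion queues and connection setup that this sketch leaves out.

```python
import itertools
from dataclasses import dataclass


@dataclass
class MemoryRegion:
    """A registered buffer plus the key a remote peer must present to access it."""
    buffer: bytearray
    rkey: int


class ToyHost:
    """Hypothetical stand-in for a host with an RDMA-capable adapter."""

    _keys = itertools.count(0x1000)   # shared counter so every key is unique

    def __init__(self):
        self.regions = {}

    def register_memory(self, buffer: bytearray) -> MemoryRegion:
        """Register a buffer so that remote peers holding the rkey may access it."""
        region = MemoryRegion(buffer, next(self._keys))
        self.regions[region.rkey] = region
        return region

    def rdma_write(self, remote: "ToyHost", rkey: int, offset: int, data: bytes) -> None:
        """One-sided write: place data directly into the remote registered buffer."""
        region = remote.regions.get(rkey)
        if region is None:
            raise PermissionError("no memory region registered under this rkey")
        if offset < 0 or offset + len(data) > len(region.buffer):
            raise ValueError("write falls outside the registered region")
        region.buffer[offset:offset + len(data)] = data


if __name__ == "__main__":
    target, initiator = ToyHost(), ToyHost()
    region = target.register_memory(bytearray(16))
    # The target would normally send (address, rkey) to the initiator out of band.
    initiator.rdma_write(target, region.rkey, offset=4, data=b"rdma")
    print(bytes(region.buffer))
```

The point of the model is the division of labor: the target only registers memory and shares the key, and every later transfer is driven by the initiator.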

Soft RoCE

Soft-RoCE is a software implementation of RoCE that allows RoCE to run on any Ethernet network adapter, whether or not it offers hardware acceleration. Soft-RoCE has been part of the upstream Linux kernel since version 4.8 and is included in Mellanox OFED 4.0 and above.

The Soft-RoCE distribution is available at:



mellanox :-

searchstorage.techtarget :-


