The simpler the root cause, the longer it took me to figure it out in this case. Sometimes you don’t see the things that are right in front of you.
In this case I just wanted to enable Hyper-V Replica for certain VMs between two clusters. Below is a little diagram to illustrate the environment.
When I tried to enable replication for a VM, the process started as expected, but after a few GB had been transferred, the replication stopped with the error:
Message: Hyper-V could not replicate changes for virtual machine ‘TestVM01’: An operation was attempted on a nonexistent network connection. (0x800704CD). (Virtual machine ID 10F62945-073E-479C-AC12-8E9E23A25BFC)
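As a side note, the hex value in the message is a standard HRESULT, and splitting it into its fields shows where the error comes from. The snippet below is a minimal sketch of that decoding; the function name decode_hresult is my own and not part of any Hyper-V tooling. It only applies the documented HRESULT bit layout (severity bit, facility, code).

```python
def decode_hresult(hresult: int) -> dict:
    """Split a 32-bit HRESULT into its severity, facility and code fields."""
    hresult &= 0xFFFFFFFF  # normalize in case a negative signed value is passed in
    return {
        "failure": bool(hresult >> 31),        # bit 31: 1 means failure
        "facility": (hresult >> 16) & 0x1FFF,  # bits 16-28: 7 is FACILITY_WIN32
        "code": hresult & 0xFFFF,              # bits 0-15: the underlying error code
    }


if __name__ == "__main__":
    fields = decode_hresult(0x800704CD)
    print(fields)  # {'failure': True, 'facility': 7, 'code': 1229}
```

Facility 7 means the HRESULT wraps a plain Win32 error, and Win32 error 1229 is ERROR_CONNECTION_INVALID, whose text is exactly the "nonexistent network connection" message shown above. So the error itself only says that a connection went away, not which one or why.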
I first checked for plain network issues by capturing some network traces and watching for RST events. But the connection seemed to be forcefully closed on the application side, not at the TCP level. After some investigation I found out that the target host, which was selected by the Hyper-V Replica Broker, had issues with its RDMA storage NICs. I had implemented SMB Multichannel constraints to limit SMB traffic to the RDMA-capable networks. Because those NICs were not working, the storage traffic from the Hyper-V host to the Scale-Out File Server went over the management NIC instead. After some data had been transferred, the constraint kicked in and terminated the SMB connection. This seems odd to me, because I would expect the constraint to be enforced from the very beginning.
However, after removing the constraint, Replica worked as expected. Then I fixed the RDMA issue and re-enabled the SMB constraints et voilà!
Sometimes solutions are closer than they appear 🙂