Wednesday, October 23, 2013

Checking vSphere Replication connectivity

Can you ping me now? Good!


One of the biggest issues I have encountered with vSphere Replication is connectivity. Since the vSphere Replication suite is so big and talks to so many things, it can be hard to make sure that everything is actually talking to everything else. Luckily for you, I have a quick checklist so you can verify it all! But before we get to the list, we have to understand a few quick things so that some of this makes sense.

Theory time! 

Okay, so vSphere Replication is a big suite. There are a lot of moving parts, so let's explore what talks to what and why it is set up that way.

The vSphere Replication Appliance

The vSphere Replication Appliance (VRA) is actually 2 servers in one. It consists of a vSphere Replication Management Server (VRMS) and a vSphere Replication Server (VRS). In the 5.0 release of vSphere Replication, these were actually 2 separate servers, but in the 5.1 release, VMware squished them down into the single VRA. For you math people out there: VRA = VRMS + VRS. If you understand this, you will understand why I'm going to explain what the VRA does by explaining what the VRMS and VRS do below.

The vSphere Replication Management Server

The vSphere Replication Management Server (VRMS) is vSphere Replication's gateway to everything management. The VRMS has 3 primary jobs. First, it manages all of the vSphere Replication Servers. Second, it reports what the VRSs are doing to vCenter. Third, it talks to the VRMS at the other site so that each side has a consistent view of the replication states of all of the VMs. You have to have exactly one VRMS per site, no more, no less.

The vSphere Replication Server

The vSphere Replication Server (VRS) is the worker of the vSphere Replication suite. The VRS is in charge of taking the replicated data coming from the remote (source) host and getting it onto a local host. The VRS reports everything it does to the VRMS but doesn't talk to any other management components (SRM, vCenter, the other VRMS, etc.). The VM's data is replicated from the source host to the destination VRA, and then the VRA chooses a slave host to write the data to the destination datastore. This basically means that a VM that is being replicated will show up on the destination VRS, not the local one. One last note: when you deploy a VRA, you get 1 VRS. You can choose to deploy additional VRSs at a single site to balance out the load, but it's not common.

That's a lot of stuff, what talks to what???

The best way to explain this is with a graphic:
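If the graphic doesn't load for you, here is my rough text sketch of the same connections (each port is covered one by one below, and the hostnames are generic placeholders):

Production site:
  Prod vCenter  --8043-->          Prod VRA
  Prod VRA      --80-->            DR vCenter
  Prod VRA      --902-->           Prod host(s)
  Prod host(s)  --31031/44046-->   DR VRA

DR site:
  DR vCenter    --8043-->          DR VRA
  DR VRA        --80-->            Prod vCenter
  DR VRA        --902-->           DR host(s)
  DR host(s)    --31031/44046-->   Prod VRA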



So what does this all mean? It means you've got a lot of ports to check! So here is how to do it. First, log into your VRA using an SSH session (it's enabled by default). You are going to use telnet to open a session on port 80 to the opposite site's vCenter; this probes the remote vCenter and makes sure the VRA can communicate with it over port 80. From the command line it will look something like this:

#telnet myvCenterFQDN.myDomain.com 80
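If the port is open, telnet should connect and sit at a blank session, something like this (the IP address here is just an example); if it is blocked, it will either hang at "Trying..." or come back with a "Connection refused":

Trying 192.168.1.50...
Connected to myvCenterFQDN.myDomain.com.
Escape character is '^]'.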

Of course, the vCenter server needs a way to talk to the local VRMS server as well. This is accomplished with port 8043. To test this, use telnet on the vCenter server against the vSphere Replication Appliance. It looks like this:

#telnet myLocalVRA.myDomain.com 8043

Next, hopping back onto the VRA, let's check port 902 to the local hosts. vSphere Replication uses Network File Copy (NFC) to copy data from the VRA to a local host, and then the host copies that data to the datastore for the VRA. This is all done over port 902, so that port has to be open for replication to work. The command will look something like this:

#telnet myLocalHost.myDomain.com 902

If you get the connection, you are looking good! So what's next? Well, we have 2 sites so we had better test all of this on the DR site as well. Run the exact same steps as above on the remote VRA.
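Concretely, from the DR VRA that means the same telnet checks pointed back at production (the hostnames are just placeholders, swap in your own):

#telnet myProdvCenter.myDomain.com 80
#telnet myDRHost.myDomain.com 902

And from the DR vCenter, the 8043 check against its local VRA:

#telnet myDRVRA.myDomain.com 8043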

So we are done, right? Wrong! We are just getting started! Now we have to test the all-important host-to-VRA connection. To do this, establish an SSH session to one of the hosts on the local site. We are going to use NetCat to probe the connections to the remote VRA with this command:

#nc -z myRemoteVRA.myDomain.com <port#>

Where <port#> is replaced with 31031 and then 44046. Why two ports, you are obviously asking? Well, vSphere Replication uses port 31031 for any initial replications and then port 44046 for any subsequent syncs of the VM. Why? I have no idea, but I'm sure there is a reason. Most of the time, checking that these ports are open from any one host is good enough to cover all of the hosts, but if you are having issues, you should check this from every host. So what's next? You guessed it, second site's the charm! Check the same connections from the DR hosts to the production VRA. If all of this is open and working, you've got yourself a good replication environment!
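By the way, if you'd rather hit both ports in one shot instead of running nc twice, a quick loop like this should do it from the host shell (swap in your own VRA name; I'm assuming the host's nc behaves like it does in the single check above):

#for p in 31031 44046; do nc -z myRemoteVRA.myDomain.com $p && echo "port $p open" || echo "port $p BLOCKED"; done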

So, you've read through this, it makes sense, but it's a lot to take in and easy to get lost in. Not to fear, connection lists are here! Here's the short version. Copy it to a notepad and check it off as you go. (And save the planet, only print it if you absolutely have to!)

vSphere Connectivity List:

Production VRA:
-Port 80 to the DR vCenter [ ]
-Port 902 to the Prod host(s) [ ]

Production host(s):
-Port 31031 to the DR VRA [ ]
-Port 44046 to the DR VRA [ ]

Production vCenter:
-Port 8043 to the Prod VRA [ ]

DR VRA:
-Port 80 to the Prod vCenter [ ]
-Port 902 to the DR host(s) [ ]

DR host(s):
-Port 31031 to the Prod VRA [ ]
-Port 44046 to the Prod VRA [ ]

DR vCenter:
-Port 8043 to the DR VRA [ ]
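And if you'd rather script it than tick boxes, here is a minimal sketch. It assumes nc is available on whatever box you run it from (use telnet where it isn't), the hostnames are placeholders, and each group of checks still has to be run from the machine named in the list above:

#!/bin/sh
# Minimal port-check sketch -- run each group from the box named in the checklist.
check() {
    # -z: just probe the port, don't send data; -w 5: give up after 5 seconds
    if nc -z -w 5 "$1" "$2"; then
        echo "OK      $1:$2"
    else
        echo "BLOCKED $1:$2"
    fi
}

# From the Production VRA:
check myDRvCenter.myDomain.com 80
check myProdHost.myDomain.com 902

# From the Production host(s):
check myDRVRA.myDomain.com 31031
check myDRVRA.myDomain.com 44046

# From the Production vCenter:
check myProdVRA.myDomain.com 8043

# Repeat the mirror image of these from the DR side.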

And there ya have it folks. If you have any questions, please put them in the comments and let me know if this helps!

When VMware tools times out after 300 seconds you.......

Set it higher! 

I see this issue ALL THE TIME. The error message looks like this:

Error - Timed out waiting for VMware Tools after 300 seconds

But why am I getting this, you might ask. Well, 300 seconds is 5 minutes. The average boot time for a Windows VM is right around 3-4 minutes. That leaves as little as 60 seconds for VMware Tools to come up and start talking to vCenter. Depending on how many other services are starting at the same time, that might not happen!

So what do we do to fix it?

We set it higher! Assuming that it really is just not starting up in time, we can raise the wait-for-VMware-Tools time globally. To do this, follow these steps:

1) Open Site Recovery Manager in vCenter
2) In the "Sites" section, right click your local site and select "Advanced Settings..."
3) in the "recovery" section, scroll to the bottom and you will see recovery.powerOnTimeout
4) Set recovery.powerOnTimeout to whatever time value you want (I usually set it to 900 seconds)
5) Repeat these steps on the DR site

Now your SRM server will wait up to 900 seconds for VMware Tools to come up. If you set it to 900 seconds and it STILL isn't working, log into the server as soon as it boots up and see if VMware Tools is starting. If it is, you have a different issue; if it's not, time how long it takes to start and set this setting accordingly. Last thing here: setting the timeout to 900 seconds does NOT mean that it will always wait 900 seconds. It simply means that is the longest it will wait. So for instance, let's say that VMware Tools comes up in 302 seconds; you will only wait 302 seconds, not the full 900. In this manner, you may have a higher success rate and still only wait about 5 minutes. Hope this helps!
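If you'd rather watch from the ESXi host than race to log into the guest, something along these lines should show the Tools state while the VM boots (this is just my usual approach; the VM name is a placeholder and <vmid> comes from the first command's output):

#vim-cmd vmsvc/getallvms | grep myVMName
#vim-cmd vmsvc/get.guest <vmid> | grep -i tools

Run that second command a few times during boot and note when the Tools status flips to running; that is roughly the number your timeout needs to cover.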