Thursday, March 6, 2014

Check your SRM ports with ease!

I have been dragging my feet on writing this post because, sadly, it will probably be my last on this blog. I have moved on to a different team and no longer support SRM. I will be looking for another TSE to take over this blog to keep you all up-to-date and SRM informed.

All of that being said, here is my swan song! As most of you know, there are a LOT of different ports to check when dealing with SRM. Not only do you have to check all of the ports, you have to check them on a lot of different servers. This can get confusing and cumbersome, and if you miss just one, you can end up chasing your tail forever. Luckily for you, I have created a batch script you can use to check every port with ease!

##############Copy below this line###############

@echo off
echo Welcome to Casey's SRM communication test script
REM Wait for a key press before listing the servers
pause

REM #Auto input#
set PRODVC= ###############REPLACE ME#########
set PRODSRM= ##############REPLACE ME#########
set DRVC= ##################REPLACE ME#########
set DRSRM= ################REPLACE ME#########

echo Here are the servers:
echo ProdVC is %PRODVC%
echo ProdSRM is %PRODSRM%
echo DRVC is %DRVC%
echo DRSRM is %DRSRM%

echo *****NS LOOKUPS***** 
echo ***ProdVC
nslookup %PRODVC%
echo ***ProdSRM
nslookup %PRODSRM%
echo ***DR VC
nslookup %DRVC%
echo ***DRSRM
nslookup %DRSRM%
REM Give the user time to review the nslookup results
pause


REM ###Telnet Tests Section###
echo *****Telnet Tests*****

echo Is this the......
echo 1) Prod VC
echo 2) Prod SRM
echo 3) DR VC
echo 4) DR SRM

set /p USERIN='make your selection'

echo You selected %USERIN%

REM ### prod vc ###
IF %USERIN%==1 echo telnet %PRODSRM% 8095
IF %USERIN%==1 start cmd /k telnet %PRODSRM% 8095
IF %USERIN%==1 pause

IF %USERIN%==1 echo telnet %PRODSRM% 9085
IF %USERIN%==1 start cmd /k telnet %PRODSRM% 9085
IF %USERIN%==1 pause

IF %USERIN%==1 echo telnet %PRODSRM% 9086
IF %USERIN%==1 start cmd /k telnet %PRODSRM% 9086
IF %USERIN%==1 pause

REM ### prod SRM ###
IF %USERIN%==2 echo telnet %PRODVC% 80
IF %USERIN%==2 start cmd /k telnet %PRODVC% 80
IF %USERIN%==2 pause

IF %USERIN%==2 echo telnet %DRVC% 80
IF %USERIN%==2 start cmd /k telnet %DRVC% 80
IF %USERIN%==2 pause

REM ### DR vc ###
IF %USERIN%==3 echo telnet %DRSRM% 8095
IF %USERIN%==3 start cmd /k telnet %DRSRM% 8095
IF %USERIN%==3 pause

IF %USERIN%==3 echo telnet %DRSRM% 9085
IF %USERIN%==3 start cmd /k telnet %DRSRM% 9085
IF %USERIN%==3 pause

IF %USERIN%==3 echo telnet %DRSRM% 9086
IF %USERIN%==3 start cmd /k telnet %DRSRM% 9086
IF %USERIN%==3 pause

REM ### DR SRM ###
IF %USERIN%==4 echo telnet %PRODVC% 80
IF %USERIN%==4 start cmd /k telnet %PRODVC% 80
IF %USERIN%==4 pause

IF %USERIN%==4 echo telnet %DRVC% 80
IF %USERIN%==4 start cmd /k telnet %DRVC% 80
IF %USERIN%==4 pause

echo You're Done!

##############Copy Above this line###############

So how does it work?

1) Copy everything between the lines and paste it into Notepad.
2) Under the "REM #Auto input#" section, replace the "###############REPLACE ME#########" placeholders with the fully qualified domain name (FQDN) of each server. Make sure that there are no spaces between the "=" and the FQDN.

3) Save the file as SRMTest.bat
4) Ensure on all 4 servers (Production vCenter server, Production SRM server, DR vCenter Server, DR SRM server) that the telnet client is enabled (we use this to check the ports)
5) Copy the SRMTest.bat file to all 4 servers
6) Right-click the script and run it as administrator
7) You should see the script start. Press any key to continue.
8) The script lists out what you entered for the FQDNs in the script and then runs nslookup for each server. SRM is very reliant on these entries so check to make sure that all 4 of the nslookups worked and that the IP addresses are correct. Once you have finished this, press any key to continue.
9) You are then prompted with the following:

*****Telnet Tests*****
Is this the......
1) Prod VC
2) Prod SRM
3) DR VC
4) DR SRM
'make your selection'

If you are running the script on the Production vCenter Server, enter "1"; on the Production SRM server, enter "2"; and so on. Note that there is no error checking in the script, so don't enter anything except the number (the entry should look like this: 'make your selection'1). If you enter anything but one of the numbers listed, the script will skip right to the end without checking anything and you will see the "You're Done!" prompt.

10) As soon as you make your selection and press enter, a new cmd prompt will show up. This is a telnet test to the first port. If it shows up blank, the telnet test succeeded. If you see:

"Connecting To <serverName>...Could not open connection to the host, on port <port#>: Connect failed"

The telnet failed and that server is unable to communicate over that port to the destination server.

11) Click the main cmd prompt and press any key. The script will open another telnet cmd prompt window to test the next port. You will get 3 telnet windows when testing from a vCenter server and 2 from the SRM servers.
12) After the last port is checked, you will see the "You're Done!" prompt.
13) Run this same script on all 4 servers. If you run the script on all 4 servers, you are guaranteed to check all the ports necessary for SRM to communicate properly. Be aware that this is ONLY for SRM, not for vSphere Replication. To test vSphere Replication, see my other post, "Checking vSphere Replication connectivity".
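If the telnet client is not available on a server, the same connect-or-fail test can be sketched in a few lines of Python. This is only a rough sketch and not part of the batch script: the names port_open, run_checks and CHECKS are made up for illustration, the role keys are hypothetical, and the port numbers are simply copied from the script above.

```python
import socket

# Which (target, port) pairs to test, keyed by the server the check runs on.
# The ports mirror the batch script above; the role names are hypothetical.
CHECKS = {
    "prod_vc": [("PRODSRM", 8095), ("PRODSRM", 9085), ("PRODSRM", 9086)],
    "prod_srm": [("PRODVC", 80), ("DRVC", 80)],
    "dr_vc": [("DRSRM", 8095), ("DRSRM", 9085), ("DRSRM", 9086)],
    "dr_srm": [("PRODVC", 80), ("DRVC", 80)],
}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


def run_checks(role, servers, probe=port_open):
    """Probe every port for the given role; return (host, port, ok) tuples."""
    results = []
    for target, port in CHECKS[role]:
        host = servers[target]
        results.append((host, port, probe(host, port)))
    return results
```

To use it, pass a dictionary that maps PRODVC, PRODSRM, DRVC and DRSRM to your own FQDNs (the same values you would put in the batch script) and call run_checks with the role of the server you are sitting on, for example run_checks("prod_vc", servers). Like the batch script, this only tells you whether a TCP connection can be opened, nothing more.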

For a list of ports that this script is checking, see the VMware KB on the ports required for SRM to function (and check out the diagram at the very bottom of the KB under Attachments, it is very useful):

Port numbers that must be open for Site Recovery Manager, vSphere Replication, and vCenter Server (1009562)

I hope this script helps everyone to make sure their SRM is connecting well!

This blog is in no way sponsored, supported or endorsed by VMware. Any configuration or environmental changes are to be made at your own risk. Casey, VMware, and any other company and/or persons mentioned in this blog take no responsibility for anything.  

Wednesday, October 23, 2013

Checking vSphere Replication connectivity

Can you ping me now? Good!

One of the biggest issues I have encountered with vSphere Replication is connectivity. Since the vSphere Replication suite is so big and talks to so many components, it can be hard to make sure that everything can reach everything else. Luckily for you, I have a quick checklist so that you can check it! But before we get to the list, we have to understand a few quick things so that some of this makes sense.

Theory time! 

Okay so vSphere Replication is a big suite. There are a lot of moving parts, so let's explore what talks to what and why.

The vSphere Replication Appliance

The vSphere Replication Appliance (VRA) is actually 2 servers in one. It consists of a vSphere Replication Management Server (VRMS) and a vSphere Replication Server (VRS). In the 5.0 release of vSphere Replication, these were actually 2 separate servers but in the 5.1 release, VMware squished them down to the VRA. For you math people out there VRA = VRMS + VRS. If you understand this, you will understand why I'm going to explain what the VRA does by explaining what the VRMS and VRS do below.

The vSphere Replication Management Server

The vSphere Replication Management Server (VRMS) is vSphere Replication's gateway to everything management. The VRMS has 3 primary jobs. First, it manages all of the vSphere Replication Servers. Second, it reports what the VRSs are doing to vCenter. Third, it reports to the VRMS at the other site so that both sites have a consistent view of the replication states of all of the VMs. You have to have exactly one VRMS per site, no more, no less.

The vSphere Replication Server

The vSphere Replication Server (VRS) is the worker of the vSphere Replication suite. The VRS is in charge of getting the information from the remote host and getting it to the local host. The VRS reports everything it does to the VRMS but doesn't talk to any other management component (SRM, vCenter, the other VRMS, etc.). The VM's data is replicated from the source host to the destination VRA, and then the VRA chooses a host at the destination site to write the data to the destination datastore. This basically means that a VM that is being replicated will show up on the destination VRS, not the local one. One last note: when you deploy a VRA, you get 1 VRS. You can choose to deploy additional VRSs at a single site to balance out the load, but it's not common.

That's a lot of stuff, what talks to what???

The best way to explain this is with a graphic:

So what does this all mean? It means you've got a lot of ports to check! So here is how to do it. First, log into your VRA using an SSH session (it's enabled by default). You are going to use telnet to open a session on port 80 to the opposite site's vCenter. From the command line it will look something like this:

#telnet <remote vCenter FQDN> 80

This will probe the remote vCenter for a connection and make sure that it can communicate over port 80.

Of course, the vCenter server needs a way to talk to the local VRMS as well. This is accomplished with port 8043. To test this, use telnet on the vCenter server against the vSphere Replication Appliance. It looks like this:

#telnet <local VRA FQDN> 8043

Next, back on the VRA, let's check port 902 to the local hosts. vSphere Replication uses Network File Copy (NFC) to copy data from the VRA to a local host, and then the host copies that data to the datastore for the VRA. This is all done over port 902, so that has to be open to work. The command will look something like this:

#telnet <local host FQDN> 902

If you get the connection, you are looking good! So what's next? Well, we have 2 sites so we had better test all of this on the DR site as well. Run the exact same steps as above on the remote VRA.

So we are done, right? Wrong! We are just getting started! Now we have to test the all-important host-to-VRA connection. To do this, we want to establish an SSH session to one of the hosts on the local site. We are going to use NetCat to probe the connections to the remote VRA. To do this, we use this command:

#nc -z <remote VRA> <port#>

Where <remote VRA> is the address of the remote VRA and <port#> is replaced with 31031 and then 44046. Why two ports, you are obviously asking? Well, vSphere Replication uses port 31031 to do any initial replications and then port 44046 to do any subsequent syncs of the VM. Why, I have no idea, but I'm sure there is a reason. Most of the time, checking that these ports are open on any one host is good enough to check all of the hosts, but if you are having issues, you should check this on all of the hosts. So what's next? You guessed it, second site's a charm! Check the same connections from the DR hosts to the production VRA. If all of this is open and working, you've got yourself a good replication environment!

So, you've read through this, it makes sense but that's a lot to read through and I'm going to get lost. Not to fear, connection lists are here! Here's the short version. Copy it to Notepad and check it off as you go. (And save the planet, only print it if you absolutely have to!)

vSphere Replication Connectivity List:

Production VRA:
-Port 80 to the DR vCenter [ ]
-Port 902 to the prod host(s) [ ]

Production host(s):
-Port 31031 to the DR VRA [ ]
-Port 44046 to the DR VRA [ ]

Production vCenter:
-Port 8043 to the Prod VRA [ ]

DR VRA:
-Port 80 to the Prod vCenter [ ]
-Port 902 to the DR host(s) [ ]

DR host(s):
-Port 31031 to the Prod VRA [ ]
-Port 44046 to the Prod VRA [ ]

DR vCenter:
-Port 8043 to the DR VRA [ ]
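The list above is really the same five checks done twice, once from each site. A small Python sketch makes that symmetry explicit. The function names vr_checks, full_checklist and render are hypothetical and only exist for this example; the ports and directions are taken straight from the list.

```python
def vr_checks(site, peer):
    """Return (source, port, destination) checks for one site of a VR pair."""
    return [
        (f"{site} VRA", 80, f"{peer} vCenter"),
        (f"{site} VRA", 902, f"{site} host(s)"),
        (f"{site} host(s)", 31031, f"{peer} VRA"),
        (f"{site} host(s)", 44046, f"{peer} VRA"),
        (f"{site} vCenter", 8043, f"{site} VRA"),
    ]


def full_checklist(site_a="Prod", site_b="DR"):
    """Both directions: every check is mirrored at the other site."""
    return vr_checks(site_a, site_b) + vr_checks(site_b, site_a)


def render(checklist):
    """Format the checks as tick-box lines, one per check."""
    return "\n".join(
        f"[ ] {src}: port {port} to the {dst}" for src, port, dst in checklist
    )
```

Calling print(render(full_checklist())) gives one tick-box line per check. If a site is missing from your own notes, comparing against this output is a quick way to spot it.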

And there ya have it folks. If you have any questions, please put them in the comments and let me know if this helps!

When VMware tools times out after 300 seconds you.......

Set it higher! 

I see this issue ALL THE TIME. The error message looks like this:

Error - Timed out waiting for VMware Tools after 300 seconds

But why am I getting this, you might ask. Well 300 seconds is 5 minutes. The average boot time for a Windows VM is right around 3-4 minutes. This leaves just 60 seconds for VMware tools to come up and start talking to vCenter. Depending on how many other services are starting at the same time, this might not happen!

So what do we do to fix it?

We set it higher! Assuming that it really is just not starting up in time, we can set the wait for VMware tools time globally. To do this, follow these steps:

1) Open Site Recovery Manager in vCenter
2) In the "Sites" section, right click your local site and select "Advanced Settings..."
3) In the "recovery" section, scroll to the bottom and you will see recovery.powerOnTimeout
4) Set recovery.powerOnTimeout to whatever time value you want (I usually set it to 900 seconds)
5) Repeat these steps on the DR site

Now your SRM server will wait up to 900 seconds for VMware Tools to come up. If you set it for 900 seconds and it STILL isn't working, log into the server as soon as it boots up and see if VMware Tools is starting. If it is, you have a different issue. If it's not, time how long it takes to start and set this setting accordingly. Last thing here: setting the timeout to 900 seconds does NOT mean that it will always wait 900 seconds. It simply means that is the longest it will wait. So for instance, let's say that VMware Tools comes up in 302 seconds: you will only wait 302 seconds, not the full 900. In this manner, you may have a higher success rate and still only wait about 5 minutes. Hope this helps!
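The "longest it will wait" behavior is just a poll loop with a ceiling. Here is a minimal Python sketch of that idea. It is an illustration of the behavior described above, not SRM's actual code: wait_for_tools, tools_running and the 5 second poll interval are all made up for the example.

```python
import time


def wait_for_tools(tools_running, timeout=900, poll=5,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until tools_running() is true; return the seconds waited.

    Raises TimeoutError once `timeout` seconds have passed without success.
    The clock and sleep arguments exist only so the loop can be simulated.
    """
    start = clock()
    while True:
        if tools_running():
            # Return as soon as the tools respond, not at the timeout.
            return clock() - start
        if clock() - start >= timeout:
            raise TimeoutError(
                f"Timed out waiting for VMware Tools after {timeout} seconds"
            )
        sleep(poll)
```

The point of the sketch is the early return: raising the ceiling from 300 to 900 only changes what happens to the slow boots. A VM whose tools answer after 302 seconds gets through in 302 seconds instead of failing at 300.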

Tuesday, August 20, 2013

Re-sync when everything is outa sync

Since vSphere Replication hit last year, I have had to walk countless people through a vSphere Replication re-deploy. The re-deploy is for another post but what I want to cover here is the BEST way to re-create your replications with the least amount of hassle.

The setup:

I have my VM, cLevingerAD, replicating successfully from our production site to the DR site. I have shown it here in SRM but it could be in the 5.1 web client as well.

The problem:

I need to stop replication and re-start it for some reason. This could be for a million reasons. Some of the common ones: you need to make a change to the VM, you need to re-deploy vSphere Replication, you failed over and now need to reverse replication, or you need to stop replication for some business reason but want to enable it later. Whatever the reason, you have a need, so let's give you a solution.

So, how do I go about this? Well, you could hit the "Remove Replication" button and then just re-replicate everything AGAIN but this isn't the best way to go about this. Instead, we can preserve the remote VMDKs and use them as initial seeds. This means we don't have to replicate any of the already-replicated information. vSphere Replication will go through the disks at the Production and DR sites and compare them. It will figure out what is different and then only replicate the changes made while replication was off.

The procedure:

So how do we do this magic? Easy. First, we want to pause replication. This ensures that no operations will go through while we make changes to the back-end storage.

Next, we need to change the name of the VM's folder at the DR site. I usually add "(hold)" to the end of the folder name. When we remove replication later, the vSphere Replication Appliance looks for the folder name from when replication was initially created. Since that folder is no longer there (cLevingerAD ≠ cLevingerAD(hold)), vSphere Replication will leave the renamed folder alone.

After changing the folder name, we can safely remove replication.

Now we can make all the changes we want to the VM. Once we are done messing around with it, we can re-enable replication using the old disks as initial full seeds. The first thing we want to do is change the folder name back to the original name. This isn't absolutely necessary but a good idea so that all of the folder names are the same. 

After re-re-naming the folder, we can enable replication for the VM. Click the VM, then click the vSphere Replication tab and click "Configure Replication". This will bring up the vSphere Replication configuration page. 

Here is where we work our magic. When it asks for the destination, we are going to specify the old datastore. 

Hit OK and if you did everything right, a message should pop up saying that an initial seed was found and do you want to use it. Duh, of course we want to use it.

Finish the configuration (note that on the summary page next to "initial seed found" we see "yes").

So you finish this and you expect to see a regular sync going through. WRONG. You will see "initial full sync" just like you would if you replicated from scratch. 

So what was the point of all of that?! This is normal. The initial full sync is mapping the 2 VMDKs and only replicating the changes made. It will take a little longer than a regular sync due to the mapping process, but not nearly as long as replicating ALL the data again.
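To picture why a seeded "initial full sync" is so much cheaper than a real one, here is a toy Python sketch of comparing two disks block by block and keeping only the offsets that differ. This is a generic illustration, not how vSphere Replication is implemented: the post does not say what checksum or block size VR uses, so SHA-256 and 4096 bytes are assumptions, and changed_blocks is a made-up name.

```python
import hashlib


def changed_blocks(source, seed, block_size=4096):
    """Return the byte offsets of blocks that differ between two disk images.

    `source` and `seed` are bytes objects standing in for the production
    VMDK and the preserved DR copy. Only the checksums are compared, so in
    a real setup just the checksums would need to cross the wire.
    """
    changed = []
    longest = max(len(source), len(seed))
    for offset in range(0, longest, block_size):
        src_sum = hashlib.sha256(source[offset:offset + block_size]).digest()
        seed_sum = hashlib.sha256(seed[offset:offset + block_size]).digest()
        if src_sum != seed_sum:
            changed.append(offset)
    return changed
```

With a seed that already matches most of the source, the returned list is short, and only those blocks would have to be sent. Reading and hashing both disks still takes time, which fits with the sync running longer than a regular one.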

All in all, this process can be used for a lot of different reasons, re-deploying being one of them that I will cover in another post, but hopefully this sheds some light on how to avoid re-replication and make your vSphere Replication experience a little better (and faster!).

Thanks for reading and don't forget to follow me on Twitter! @SRM_Guru

Wednesday, August 14, 2013

De-mystifying the multi-site SRM installation

There are many complexities to a single-site VMware vCenter Site Recovery Manager (SRM) installation so when most people hear "multi-site SRM installation" their eyes roll back and they foam at the mouth. This shouldn't be the case. Unfortunately, there isn't much SOLID documentation on what is needed for a multi-site install and exactly how to do it (there is a link at the bottom of this page to the VMware-supplied documentation). I aim to fix that. This post is going to cover not only the theory behind the multi-site install but also a step-by-step walk through of the entire installation.

The Theory

What do I need for a multi-site configuration? Can I fail over from all sites to all sites? How do I connect to all of my sites? These are all questions that, unfortunately, the documentation out there doesn't cover very well. This is, by no means, an exhaustive list of all the questions you might have but I am aiming to hit the big ones. 

What do I need for a multi-site SRM configuration?

Well the first thing you need is a better term. "Multi-site" implies that vCenter is going to communicate with more than 2 sites at a time. This is wrong. The current limitation is that vCenter can communicate with 1 and only 1 pair of SRM servers at a time. That being said, vCenter CAN be paired to multiple pairs of SRM servers, hence "Multi-site". 

The illustration below is a typical 3 site SRM configuration:

In this configuration, we have 7 servers: Production VC, Production SRM Alpha, Production SRM Beta, DR VC Alpha, DR SRM Alpha, DR VC Beta and DR SRM Beta. This can be any mix of physical and virtual servers you like. The illustration shows the best practice, which is to deploy each service on its own server, be it virtual or physical. Some people like to consolidate by putting the vCenter and SRM services on the same server. This WILL work; the only limitation is that you can't have the 2 Production SRMs on the same box.

As you can see, the production vCenter is connected to both SRM pairs through 1 line. This is an important observation because, as I said before, the vCenter can only be connected to 1 pair of SRM servers at a time. 

Can I fail over from all sites to all sites?

Purple (yes, no, sort of). Since vCenter can only connect to 1 pair of SRM servers at a time, you can't share a connection. In the example above, this means that you cannot fail over a VM directly from DR Alpha to DR Beta. You can fail a VM over from DR Alpha to Prod, from DR Beta to Prod, from Prod to DR Alpha and from Prod to DR Beta. This means that, in a way, you could do a failover from DR Alpha to DR Beta: you would fail over from DR Alpha to Prod and then from Prod to DR Beta. One might ask, "Is there a better way to do this?" Don't worry, we are getting there.

So how do I fail over directly from all sites to all sites?

A MULTI multi site configuration (don't worry, it's not as bad as it sounds).

The illustration below is how you would accomplish this:

In this configuration, all sites have a direct link to each other. This means that you can directly fail over from any site to any site. This would be a great model if you have multiple production sites and you want them all to be able to protect each other. While this isn't a typical configuration, the potential here is great. You can greatly increase the flexibility by only adding 2 more SRM servers and doing one more install. In my eyes, this is the best multi-site SRM configuration.

WARNING** The information above has NOT been tested in my labs so I cannot guarantee it will work. When I get the chance, I can test it or if somebody has this already let me know but use this method at your own risk. 
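If it helps to see the two layouts side by side, the sites can be treated as a graph where each SRM pair is a link. The Python sketch below finds the shortest chain of failovers between two sites. It only models the reasoning in this post: failover_path, THREE_SITE and FULL_MESH are names invented for the example, and (like the warning above says) the full-mesh layout is untested.

```python
from collections import deque


def failover_path(src, dst, pairs):
    """Shortest chain of sites to fail over from src to dst, or None."""
    links = {}
    for a, b in pairs:
        # A pairing works in both directions.
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(links.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# The typical 3 site layout: Prod is paired with each DR site.
THREE_SITE = [("Prod", "DR Alpha"), ("Prod", "DR Beta")]
# The "MULTI multi site" layout adds a pair between the two DR sites.
FULL_MESH = THREE_SITE + [("DR Alpha", "DR Beta")]
```

With THREE_SITE, going from DR Alpha to DR Beta takes two hops through Prod. With FULL_MESH, it is a single direct failover. That one extra pair of SRM servers is the whole difference.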

I'm sick of theory, let's get to the nitty gritty install

Alright you asked for it. Below is a step-by-step walk-through for the install of the SRM multi-site configuration. Below each picture is an explanation for exactly what is going on in the step as well as what to note for the next install (remember, you are going to do this 4 times). One thing to keep in mind is that this is one out of 4 installs. Each one will NOT be identical. Hopefully if you are walking through the first one, the next ones will make more and more sense (and if you have questions, tweet them to me @SRM_Guru). Also, for security purposes, I have blurred out any IP addresses, FQDNs or anything else that may have confidential information in it. I describe any fields that are not self explanatory in the description of the image.

Step 1.
To run the multi-site install, you need to run the installer from the command line. Use the command:

#VMware-srm-5.1.0-941848.exe /V"CUSTOM_SETUP=1"

Step 2.

Step 3.

Step 4.

Step 5.
vSphere Replication is not required, but you might as well install it and try it out unless you really don't want it.

Step 6.
vCenter Server Address should be the Fully Qualified Domain Name (FQDN) rather than IP if at all possible to avoid issues in the future.

Step 7.

This security warning is normal as long as you are using self-signed (not custom) certificates.

Step 8.

Step 9.
This can be anything and doesn't make a difference when you are pairing sites. Make it something that makes sense but don't fret over what you make it.

Step 10.
Local site name should be the name of the site. Most people use the name of the vCenter here. Local host name should be FQDN instead of IP.

Step 11.
Make sure you use the Custom SRM Plugin Identifier here. This is the "Multi site" option.

Step 12.
This SRM ID is what is shared between sites. You can see these in the images above under theory. Make sure you write these down because each pair of SRM servers MUST share the same SRM ID.

Step 13.
User name and password should be the credentials for the SRM DB.

Step 14.
And you're done!

Step 15 is to rinse, lather, repeat. You will need to run the installer on all 4 SRMs the same way. The production SRM and the DR SRM in each pair should share the SRM ID. In the graphic below, you can see that the line between the 2 SRM servers for Alpha and the line between the 2 SRM servers for Beta each share the same SRM ID (this is also what is used up in step 12). This is really the only difference in the multi-site install versus the regular install.

Hope this helps somebody. I know I wished this was out there my first time installing it. And don't forget to follow me on Twitter @SRM_Guru. Thanks all!

VMware documentation:

Tuesday, July 16, 2013

Replication is already enabled..............But it's not

Don't lie to me vSphere Replication!

Have you ever gone to enable vSphere Replication (VR) in the new 5.1 web GUI and it comes back with the error "Replication is already enabled" when it just clearly is not? YES?! Well then maybe I can help you! VR is based off of a service called Host Based Replication (HBR). This is a service that actually runs on the host and not on the vSphere Replication Appliance or vCenter. Because of this, if you lose your VRA or VC configuration, your host may still think it's replicating a VM but VRA and/or VC think otherwise. When you go to enable replication, VC queries the host and the host comes back and says (you guessed it) "Replication is already enabled".

So how do we make an honest box out of VR?

Here is a quick way to diagnose and resolve this issue.

1) In the summary tab for the VM, find what host the VM is currently on
2) Get an SSH session to the host
3) Run the command ~#vim-cmd vmsvc/getallvms
4) In front of the VM in question, there will be a number, copy this number down
5) Run the command ~#vim-cmd hbrsvc/vmreplica.getState <#>  where "<#>" is replaced with the number in the previous step
6) If the replica state says "VM not enabled for replication" there is a different issue and you will need to dig further
7) If the VM shows that the disks are replicating you can stop the replication with the command ~#vim-cmd hbrsvc/vmreplica.disable <#> where "<#>" is replaced with the same number as in step 5
8) After that, you should be able to enable replication again through the GUI
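If you have a lot of VMs on the host, picking the right number out of the getallvms listing by eye gets old fast. Here is a rough Python sketch that does steps 4 through 7 on paper: it pulls the ID for a VM name out of text you have copied from the SSH session, then prints the two hbrsvc commands from the steps above. It assumes the listing has the ID as the first column and the VM name as the second, which you should verify on your own host; find_vmid and hbr_commands are made-up helper names.

```python
def find_vmid(getallvms_output, vm_name):
    """Return the ID in the first column of the row for vm_name, or None.

    Assumes each VM row starts with a numeric ID followed by the VM name.
    A name that is a prefix of another name followed by a space (for
    example "web" and "web server") could match the wrong row.
    """
    for line in getallvms_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # header, blank line or wrapped annotation text
        rest = parts[1]
        if rest == vm_name or rest.startswith(vm_name + " "):
            return int(parts[0])
    return None


def hbr_commands(vmid):
    """Build the state check and disable commands from the steps above."""
    return [
        f"vim-cmd hbrsvc/vmreplica.getState {vmid}",
        f"vim-cmd hbrsvc/vmreplica.disable {vmid}",
    ]
```

This only builds the command strings; it does not run anything on the host. Run the getState command first and only run the disable command if the state shows the disks are replicating, exactly as in steps 6 and 7.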

Hopefully this helps get your replication back up and running!

Want the SRM findings of a TSE in the trenches? Follow me on Twitter! @SRM_Guru

Monday, July 15, 2013

Another quick fix...... if you look in the logs.

Help! My SRM Service is dead!

What's wrong?!

So your SRM service is dying? What do you do? Call VMware? Call your system admin? Call the hardware vendor? NO! You open the logs! The SRM logs give great insight into why the SRM service is crashing. Usually they even tell you what is wrong. So Casey, where are these magical logs you speak of? The SRM logs for SRM 5.x are at

C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs\vmware-dr.txt

If the service is crashing, the logical place to look for the error is at the bottom. Scroll all the way to the bottom and read the last page or so of the logs. In this case, the customer saw this error:

Initializing service content: std::exception 'class Vmacore::Exception' "Registration with the local VC server is not valid"
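If the log is huge and scrolling is painful, the "read the last page or so" step can be sketched in Python. This is just a convenience sketch under a couple of assumptions: Python is installed on the SRM server, the log path is the SRM 5.x location given above, and the names tail and exception_lines are made up for this example.

```python
from collections import deque

# Log location for SRM 5.x, as given in the post above.
SRM_LOG = r"C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs\vmware-dr.txt"


def tail(path, count=50):
    """Return the last `count` lines of a text file, without newlines."""
    with open(path, "r", errors="replace") as handle:
        # deque with maxlen keeps only the most recent lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def exception_lines(lines):
    """Keep only lines that mention an exception, like the error above."""
    return [line for line in lines if "exception" in line.lower()]
```

Something like exception_lines(tail(SRM_LOG, 200)) narrows the end of the log down to the lines worth reading first. It is no substitute for reading the surrounding lines, since the real cause is often a few lines above the exception.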

As it turns out, the customer had encountered an issue over the weekend and had to re-install vCenter. This wiped out the SRM extension.

So that's nice, how did you fix it?! 

Simple. Do a modify install. This will register SRM to the new vCenter. We did the modify install, kept all the old settings (used the previous database, kept the same certificate, etc.) and when we were done, the service stayed up and everything was fixed! Just goes to show you, a little log review can go a looooooong way!

Want the SRM findings of a TSE in the trenches? Follow me on Twitter! @SRM_Guru
