Wednesday, October 23, 2013

Checking vSphere Replication connectivity

Can you ping me now? Good!


One of the biggest issues I have encountered with vSphere Replication is connectivity. The vSphere Replication suite is big and talks to a lot of things, so it can be hard to make sure that everything can actually reach everything else. Luckily for you, I have a quick checklist so you can verify it all yourself! But before we get to the list, we have to understand a few quick things so that some of this makes sense.

Theory time! 

Okay, so vSphere Replication is a big suite. There are a lot of moving parts, so let's explore what talks to what and why.

The vSphere Replication Appliance

The vSphere Replication Appliance (VRA) is actually 2 servers in one. It consists of a vSphere Replication Management Server (VRMS) and a vSphere Replication Server (VRS). In the 5.0 release of vSphere Replication, these were actually 2 separate servers but in the 5.1 release, VMware squished them down to the VRA. For you math people out there VRA = VRMS + VRS. If you understand this, you will understand why I'm going to explain what the VRA does by explaining what the VRMS and VRS do below.

The vSphere Replication Management Server

The vSphere Replication Management Server (VRMS) is vSphere Replication's gateway to everything management. The VRMS has 3 primary jobs. First, it manages all of the vSphere Replication Servers. Second, it reports what the VRSs are doing to vCenter. Third, it talks to the VRMS at the other site so that both sites have a consistent view of the replication state of all the VMs. You have to have exactly one VRMS per site, no more, no less.

The vSphere Replication Server

The vSphere Replication Server (VRS) is the worker of the vSphere Replication suite. The VRS is in charge of receiving replicated data from the remote source host and getting it onto local storage. The VRS reports everything it does to the VRMS but doesn't talk to any other management components (SRM, vCenter, the other VRMS, etc.). The VM's data is replicated from the source host to the destination VRA, and the VRA then chooses a host at the destination to write the data to the destination datastore. This basically means that a VM that is being replicated will show up on the destination VRS, not the local one. One last note: when you deploy a VRA, you get 1 VRS. You can choose to deploy additional VRSs at a single site to balance out the load, but it's not common.

That's a lot of stuff, what talks to what???

The best way to explain this is with a graphic:



So what does this all mean? It means you've got a lot of ports to check! So here is how to do it. First, log into your VRA using an SSH session (it's enabled by default). You are going to use telnet to open a session on port 80 to the opposite site's vCenter. From the command line it will look something like this:

#telnet myvCenterFQDN.myDomain.com 80

This will probe the remote vCenter and make sure that the VRA can communicate with it over port 80. Of course, the vCenter server needs a way to talk to the local VRMS as well. This is accomplished over port 8043. To test this, use telnet on the vCenter server against the vSphere Replication Appliance. It looks like this:

#telnet myLocalVRA.myDomain.com 8043

Next, let's hop back onto the VRA and check port 902 to the local hosts. vSphere Replication uses Network File Copy (NFC) to copy data from the VRA to a local host, and the host then copies that data to the destination datastore for the VRA. This is all done over port 902, so that port has to be open for replication to work. The command will look something like this:

#telnet myLocalHost.myDomain.com 902

If you get the connection, you are looking good! So what's next? Well, we have 2 sites so we had better test all of this on the DR site as well. Run the exact same steps as above on the remote VRA.

So we are done, right? Wrong! We are just getting started! Now we have to test the all-important host-to-VRA connection. To do this, we want to establish an SSH session to one of the hosts on the local site. We are going to use netcat (nc) to probe the connections to the remote VRA. To do this, we use this command:

#nc -z myRemoteVRA.myDomain.com <port#>

Where <port#> is replaced with 31031 and then 44046. Why two ports, you are obviously asking? Well, vSphere Replication uses port 31031 for the initial replication and port 44046 for any subsequent syncs of the VM. Why? I have no idea, but I'm sure there is a reason. Most of the time, checking that these ports are open from any one host is good enough, but if you're having issues, you should check this on all of the hosts (a sketch for that follows below). So what's next? You guessed it: second site's a charm! Check the same connections from the DR hosts to the production VRA. If all of this is open and working, you've got yourself a good replication environment!
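
If you do end up checking every host, here is a minimal sketch of a loop you can paste into each host's SSH session so you don't have to run nc by hand twice per host (the VRA hostname is a placeholder for your environment):

for port in 31031 44046; do
  # nc returns success only if the TCP connection actually opens
  nc -z myRemoteVRA.myDomain.com $port && echo "port $port open" || echo "port $port BLOCKED"
done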

So, you've read through this and it makes sense, but that's a lot to read through and it's easy to get lost. Not to fear, connection lists are here! Here's the short version. Copy it to a notepad and check it off as you go. (And save the planet, only print it if you absolutely have to)!

vSphere Connectivity List:

Production VRA:
- Port 80 to the DR vCenter [ ]
- Port 902 to the Prod host(s) [ ]

Production host(s):
- Port 31031 to the DR VRA [ ]
- Port 44046 to the DR VRA [ ]

Production vCenter:
- Port 8043 to the Prod VRA [ ]

DR VRA:
- Port 80 to the Prod vCenter [ ]
- Port 902 to the DR host(s) [ ]

DR host(s):
- Port 31031 to the Prod VRA [ ]
- Port 44046 to the Prod VRA [ ]

DR vCenter:
- Port 8043 to the DR VRA [ ]
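
And if you'd rather script the VRA-side checks than tick boxes by hand, here is a minimal sketch to run on each appliance. It assumes bash and its /dev/tcp trick are available on the appliance, every hostname is a placeholder for your environment, and a probe to a silently-dropped port may hang until the TCP timeout expires:

check() {
  if (echo > /dev/tcp/$1/$2) 2>/dev/null; then
    echo "OK      $1:$2"
  else
    echo "BLOCKED $1:$2"
  fi
}
check myRemoteVC.myDomain.com 80     # VRA to the opposite site's vCenter
check myLocalHost1.myDomain.com 902  # VRA to each local host (NFC)

Run it on the production VRA, then again on the DR VRA with the DR-side names, and you've covered the first and fourth blocks of the checklist.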

And there ya have it folks. If you have any questions, please put them in the comments and let me know if this helps!

When VMware tools times out after 300 seconds you.......

Set it higher! 

I see this issue ALL THE TIME. The error message looks like this:

Error - Timed out waiting for VMware Tools after 300 seconds

But why am I getting this, you might ask? Well, 300 seconds is 5 minutes. The average boot time for a Windows VM is right around 3-4 minutes. That leaves just 60 seconds for VMware Tools to come up and start talking to vCenter. Depending on how many other services are starting at the same time, that might not happen!

So what do we do to fix it?

We set it higher! Assuming that it really is just not starting up in time, we can set the wait for VMware tools time globally. To do this, follow these steps:

1) Open Site Recovery Manager in vCenter
2) In the "Sites" section, right click your local site and select "Advanced Settings..."
3) In the "Recovery" section, scroll to the bottom and you will see recovery.powerOnTimeout
4) Set recovery.powerOnTimeout to whatever time value you want (I usually set it to 900 seconds)
5) Repeat these steps on the DR site

Now your SRM server will wait 900 seconds for VMware Tools to come up. If you set it to 900 seconds and it STILL isn't working, log into the server as soon as it boots up and see if VMware Tools is starting. If it is, you have a different issue; if it's not, time how long it takes to start and set this setting accordingly. Last thing here: setting the timeout to 900 seconds does NOT mean that SRM will always wait 900 seconds. It simply means that is the longest it will wait. So, for instance, if VMware Tools comes up in 302 seconds, you will only wait 302 seconds, not the full 900. In this manner, you may have a higher success rate and still only wait about 5 minutes. Hope this helps!
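
If you do end up timing it, here is a minimal sketch you could run in an SSH session on the VM's host while the VM boots. It uses the same vim-cmd tooling I lean on elsewhere on this blog; the VM ID (12 here) comes from "vim-cmd vmsvc/getallvms", and the exact field name in the guest info output is an assumption on my part, so eyeball your own get.guest output first:

VMID=12                 # replace with your VM's ID from getallvms
START=$(date +%s)
# poll once a second until the guest info reports Tools running
until vim-cmd vmsvc/get.guest $VMID | grep -q guestToolsRunning; do
  sleep 1
done
echo "Tools up after $(( $(date +%s) - START )) seconds"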


Tuesday, August 20, 2013

Re-sync when everything is outta sync


Since vSphere Replication hit last year, I have had to walk countless people through a vSphere Replication re-deploy. The re-deploy is for another post but what I want to cover here is the BEST way to re-create your replications with the least amount of hassle.

The setup:


I have my VM, cLevingerAD, replicating successfully from our production site to the DR site. I have shown it here in SRM, but it could be in the 5.1 web client as well.

The problem:


I need to stop replication and re-start it for some reason. This could be for a million reasons. Some of the common ones: you need to make a change to the VM, you need to re-deploy vSphere Replication, you failed over and now you need to reverse replication, or you need to stop replication for some business reason but want to enable it later. Whatever the reason, you have a need; let's give you a solution.

So, how do I go about this? Well, you could hit the "Remove Replication" button and then just re-replicate everything AGAIN but this isn't the best way to go about this. Instead, we can preserve the remote VMDKs and use them as initial seeds. This means we don't have to replicate any of the already-replicated information. vSphere Replication will go through the disks at the Production and DR sites and compare them. It will figure out what is different and then only replicate the changes made while replication was off.

The procedure:


So how do we do this magic? Easy! First, we want to pause replication. This ensures that no operations will go through while we make changes to the back-end storage.

Next, we need to change the name of the VM's folder at the DR site. I usually add "(hold)" to the end of the folder name. When we remove replication later, the vSphere Replication Appliance looks for the name of the folder from when replication was initially created. Since it's no longer there (cLevingerAD ≠ cLevingerAD(hold)), vSphere Replication will leave this renamed folder alone.
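
If you prefer the ESXi shell to the datastore browser, the rename is just a folder move on the datastore. A minimal sketch, where "DR_datastore" is a placeholder for your datastore's name:

# from an SSH session on a host at the DR site
cd /vmfs/volumes/DR_datastore
mv cLevingerAD "cLevingerAD(hold)"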





After changing the folder name, we can safely remove replication.


Now we can make all the changes we want to the VM. Once we are done messing around with it, we can re-enable replication using the old disks as initial full seeds. The first thing we want to do is change the folder name back to the original. This isn't absolutely necessary, but it's a good idea so that all of the folder names match.



After re-re-naming the folder, we can enable replication for the VM. Click the VM, then click the vSphere Replication tab and click "Configure Replication". This will bring up the vSphere Replication configuration page. 




Here is where we work our magic. When it asks for the destination, we are going to specify the old datastore. 




Hit OK, and if you did everything right, a message should pop up saying that an initial seed was found and asking if you want to use it. Duh, of course we want to use it.




 Finish the configuration (note that on the summary page next to "initial seed found" we see "yes").






So you finish this and you expect to see a regular sync going through. WRONG. You will see "initial full sync" just like you would if you replicated from scratch. 




So what was the point of all of that?! This is normal. The initial full sync is comparing the 2 VMDKs and only replicating the differences. It will take a little longer than a regular sync due to the comparison process, but not nearly as long as replicating ALL the data again.

All in all, this process can be used for a lot of different reasons, re-deploying being one of them that I will cover in another post, but hopefully this sheds some light on how to avoid re-replication and makes your vSphere Replication experience a little better (and faster!).

Thanks for reading and don't forget to follow me on Twitter! @SRM_Guru

**********************************************Disclaimer**********************************************
This blog is in no way sponsored, supported or endorsed by VMware. Any configuration or environmental changes are to be made at your own risk. Casey, VMware, and any other company and/or persons mentioned in this blog take no responsibility for anything.  

Wednesday, August 14, 2013

De-mystifying the multi-site SRM installation


There are many complexities to a single-site VMware vCenter Site Recovery Manager (SRM) installation so when most people hear "multi-site SRM installation" their eyes roll back and they foam at the mouth. This shouldn't be the case. Unfortunately, there isn't much SOLID documentation on what is needed for a multi-site install and exactly how to do it (there is a link at the bottom of this page to the VMware-supplied documentation). I aim to fix that. This post is going to cover not only the theory behind the multi-site install but also a step-by-step walk through of the entire installation.

The Theory

What do I need for a multi-site configuration? Can I fail over from all sites to all sites? How do I connect to all of my sites? These are all questions that, unfortunately, the documentation out there doesn't cover very well. This is, by no means, an exhaustive list of all the questions you might have but I am aiming to hit the big ones. 

What do I need for a multi-site SRM configuration?

Well the first thing you need is a better term. "Multi-site" implies that vCenter is going to communicate with more than 2 sites at a time. This is wrong. The current limitation is that vCenter can communicate with 1 and only 1 pair of SRM servers at a time. That being said, vCenter CAN be paired to multiple pairs of SRM servers, hence "Multi-site". 

The illustration below is a typical 3 site SRM configuration:




In this configuration, we have 7 servers: Production VC, Production SRM Alpha, Production SRM Beta, DR VC Alpha, DR SRM Alpha, DR VC Beta and DR SRM Beta. This can be any mix of physical and virtual servers you like; the only limitation is that you can't have the 2 Production SRMs on the same box. (This diagram shows best practice, which is to deploy every service on its own server, be it virtual or physical. Some people like to consolidate by putting the vCenter and SRM services on the same server. This WILL work; the only limitation, as stated above, is that you can't have the 2 Production SRMs on the same box.)

As you can see, the production vCenter is connected to both SRM pairs through 1 line. This is an important observation because, as I said before, the vCenter can only be connected to 1 pair of SRM servers at a time. 

Can I fail over from all sites to all sites?

Purple (yes, no, sort of). Since vCenter can only connect to 1 pair of SRM servers, you can't share a connection. In the example above, this means that you cannot fail over a VM directly from DR Alpha to DR Beta. You can fail a VM from DR Alpha to Prod, from DR Beta to Prod, from Prod to DR Alpha and from Prod to DR Beta. This means that, in a roundabout way, you could do a failover from DR Alpha to DR Beta: fail over from DR Alpha to Prod and then from Prod to DR Beta. One might ask, "Is there a better way to do this?" Don't worry, we are getting there.

So how do I fail over directly from all sites to all sites?

A MULTI multi site configuration (don't worry, it's not as bad as it sounds).

The illustration below is how you would accomplish this:



In this configuration, all sites have a direct link to each other. This means that you can directly fail over from any site to any site. This would be a great model if you have multiple production sites and you want them all to be able to protect each other. While this isn't a typical configuration, the potential here is great. You can greatly increase the flexibility by adding only 2 more SRM servers and doing one more install. In my eyes, this is the best multi-site SRM configuration.

WARNING** The information above has NOT been tested in my labs so I cannot guarantee it will work. When I get the chance I will test it, or if somebody already has this running, let me know; until then, use this method at your own risk.

I'm sick of theory, let's get to the nitty gritty install


Alright, you asked for it. Below is a step-by-step walk-through of the SRM multi-site installation. Below each picture is an explanation of exactly what is going on in the step, as well as what to note for the next install (remember, you are going to do this 4 times). One thing to keep in mind is that this is one out of 4 installs; each one will NOT be identical. Hopefully, if you walk through the first one, the next ones will make more and more sense (and if you have questions, tweet them to me @SRM_Guru). Also, for security purposes, I have blurred out any IP addresses, FQDNs, or anything else that may contain confidential information. I describe any fields that are not self-explanatory in the description of the image.



Step 1.
To run the multi-site install, you need to run the installer from the command line. Use the command:

#VMware-srm-5.1.0-941848.exe /V"Custom_SETUP=1"

Step 2.


Step 3.


Step 4.


Step 5.
vSphere Replication is not required, but you might as well install it and try it out unless you really don't want it.


Step 6.
vCenter Server Address should be the Fully Qualified Domain Name (FQDN) rather than IP if at all possible to avoid issues in the future.

Step 7.

This security warning is normal as long as you are using self-signed (not custom) certificates.

Step 8.


Step 9.
This can be anything and doesn't make a difference when you are pairing sites. Make it something that makes sense, but don't fret over what you make it.


Step 10.
Local site name should be the name of the site. Most people use the name of the vCenter here. Local host name should be FQDN instead of IP.


Step 11.
Make sure you use the Custom SRM Plugin Identifier here. This is the "Multi site" option.


Step 12.
This SRM ID is what is shared between sites. You can see these in the images above under theory. Make sure you write these down because each pair of SRM servers MUST share the same SRM ID.


Step 13.
User name and password should be the credentials for the SRM DB.


Step 14.
And you're done!

Step 15 is to rinse, lather, repeat. You will need to run the installer on all 4 SRMs the same way. The production pair and the DR pair should share the SRM ID. In the graphic below, you can see that the line between the 2 SRM servers for Alpha and the line between the 2 SRM servers for Beta each share the same SRM ID (this is also what is used up in step 12). This is really the only difference between the multi-site install and the regular install.



Hope this helps somebody; I know I wished this was out there my first time installing it. Don't forget to follow me on Twitter @SRM_Guru. Thanks all!

VMware documentation:
**********************************************Disclaimer**********************************************
This blog is in no way sponsored, supported or endorsed by VMware. Any configuration or environmental changes are to be made at your own risk. Casey, VMware, and any other company and/or persons mentioned in this blog take no responsibility for anything.  

Tuesday, July 16, 2013

Replication is already enabled..............But it's not

Don't lie to me vSphere Replication!


Have you ever gone to enable vSphere Replication (VR) in the new 5.1 web GUI and had it come back with the error "Replication is already enabled" when it clearly is not? YES?! Well then maybe I can help you! VR is based on a service called Host Based Replication (HBR). This service actually runs on the host, not on the vSphere Replication Appliance or vCenter. Because of this, if you lose your VRA or VC configuration, your host may still think it's replicating a VM while the VRA and/or VC think otherwise. When you go to enable replication, VC queries the host and the host comes back and says (you guessed it) "Replication is already enabled".


So how do we make an honest box out of VR?


Here is a quick way to diagnose and resolve this issue.

1) In the summary tab for the VM, find what host the VM is currently on
2) Get an SSH session to the host
3) Run the command ~#vim-cmd vmsvc/getallvms
4) In front of the VM in question, there will be a number, copy this number down
5) Run the command ~#vim-cmd hbrsvc/vmreplica.getState <#>  where "<#>" is replaced with the number in the previous step
6) If the replica state says "VM not enabled for replication" there is a different issue and you will need to dig further
7) If the VM shows that the disks are replicating you can stop the replication with the command ~#vim-cmd hbrsvc/vmreplica.disable <#> where "<#>" is replaced with the same number as in step 5
8) After that, you should be able to enable replication again through the GUI
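
If you'd rather script steps 3 through 7, here is a minimal sketch of the same commands (the VM name is from my lab, so substitute your own, sanity-check the ID it finds before disabling anything, and note the name match assumes the VM name has no spaces):

VMNAME="cLevingerAD"
# pull the Vmid column for the row whose Name column matches
VMID=$(vim-cmd vmsvc/getallvms | awk -v n="$VMNAME" '$2 == n {print $1}')
vim-cmd hbrsvc/vmreplica.getState $VMID
vim-cmd hbrsvc/vmreplica.disable $VMID   # only if getState showed the disks replicating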

Hopefully this helps get your replication back up and running!

Want the SRM findings of a TSE in the trenches? Follow me on Twitter! @SRM_Guru

**********************************************Disclaimer**********************************************
This blog is in no way sponsored, supported or endorsed by VMware. Any configuration or environmental changes are to be made at your own risk. Casey, VMware, and any other company and/or persons mentioned in this blog take no responsibility for anything.   

Monday, July 15, 2013

Another quick fix...... if you look in the logs.

Help! My SRM Service is dead!

What's wrong?!


So your SRM service is dying? What do you do? Call VMware? Call your system admin? Call the hardware vendor? NO! You open the logs! The SRM logs give great insight into why the SRM service is crashing; usually they even tell you what is wrong. So Casey, where are these magical logs you speak of? The SRM logs for SRM 5.x are at:

C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs\vmware-dr.txt
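
If the file is too big to scroll through comfortably, you can also hunt for exceptions from a command prompt. A quick sketch (the search string is just one that tends to appear around crashes, not an official marker):

cd "C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs"
findstr /C:"std::exception" vmware-dr.txt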

If the service is crashing, the logical place to look for the error is at the bottom. Scroll all the way to the bottom and read the last page or so of the logs. In this case, the customer saw this error:

Initializing service content: std::exception 'class Vmacore::Exception' "Registration with the local VC server is not valid"

As it turns out, the customer had encountered an issue over the weekend and had to re-install vCenter. This wiped out the SRM extension.

So that's nice, how did you fix it?! 


Simple. Do a modify install. This will register SRM to the new vCenter. We did the modify install, kept all the old settings (used the previous database, kept the same certificate, etc.) and when we were done, the service stayed up and everything was fixed! Just goes to show you, a little log review can go a looooooong way!


Want the SRM findings of a TSE in the trenches? Follow me on Twitter! @SRM_Guru


**********************************************Disclaimer**********************************************
This blog is in no way sponsored, supported or endorsed by VMware. Any configuration or environmental changes are to be made at your own risk. Casey, VMware, and any other company and/or persons mentioned in this blog take no responsibility for anything.  

Friday, June 14, 2013

The ultra-frustrating "Site pairing or break operation failed." error

SO glad I figured this one out!!!


When deploying the vSphere Replication appliance (VRA) for standalone vSphere replication, you deploy the servers, they connect to their local vCenters but when you try and pair the sites, you get this error:

"Site pairing or break operation failed."

AND THAT'S IT! No clues, no hints, no NOTHING.

As it turns out, this is a DNS issue. Here is how to fix it.

1) Log into all 4 servers (Prod VC, DR VC, Prod VRA, DR VRA) and make sure you can do forward and reverse nslookups for all servers (see the sketch after this list)
2) Log into the VAMI for the two VRAs
3) Unregister the VRA from vCenter
4) Make sure that the local site name field and the vCenter field are FQDN, not short name, not IP, FQDN!
5) Save and restart for both VRAs
6) Log out and back into both of the vCenter Server web clients
7) Pair your sites
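
For step 1, the lookups are just nslookup in both directions on every box. A sketch, where all of the names and the address are placeholders for your four servers:

nslookup prodvc.mydomain.com
nslookup drvc.mydomain.com
nslookup prodvra.mydomain.com
nslookup drvra.mydomain.com
nslookup 10.0.0.10

Run the last (reverse) lookup once for each address the forward lookups returned; every one should come back with the matching FQDN.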

Hope this helps somebody, I nearly broke my skull while beating my head against the wall on this one!

**********************************************Disclaimer**********************************************
This blog is in no way sponsored, supported or endorsed by VMware. Any configuration or environmental changes are to be made at your own risk. Casey, VMware, and any other company and/or persons mentioned in this blog take no responsibility for anything.  

Monday, May 20, 2013

Datastore Groups Explained


"Why can't I fail over a single VM using array-based replication?"

I get this question a lot. The truth is, you can, as long as you understand how datastore groups work. The smallest "thing" that SRM can fail over using array-based replication is a datastore group. Now in VMware terms, a LUN is almost always a 1-to-1 mapping with a datastore or an RDM so for the purposes of this article, we will ignore the rare cases as they fall outside of the scope of what I am trying to explain. I give to you, the test environment.





In the test environment, we have 3 VMs: Alpha, Beta and Charlie, and 3 datastores: datastore1, datastore2 and datastore3. We start the test with Alpha's disk on datastore1, Beta's disk on datastore2 and Charlie's 2 disks on datastore3.

So remember how I said the smallest thing you can fail over is a datastore group? Well, what is a datastore group, you might ask? A datastore group is the smallest set of LUNs that satisfies the dependencies of all the VMs on those LUNs. I know, that's confusing; don't worry, I will explain.



So let's look at just VM Alpha and datastore1.



As you can see, the only VMDK file on datastore1 is VM Alpha's only disk. Because VM Alpha doesn't depend on any other datastores and because datastore1 doesn't hold any other VMs, they make up a datastore group. In this way, you could fail over a single VM.

So what about VM Charlie with 2 VMDKs?



That's a good question. Well let's think this one through. VM Charlie has no dependencies on any other datastores other than datastore3 and datastore3 has no other VMs in it besides VM Charlie so they make up a datastore group as well! As you can see, datastore groups don't care how many disks a VM has, just what datastores they are on. Consequently, datastore groups also don't care how many VMDKs are on a datastore but rather what VMs they belong to. 

So if that is the case, then why do datastores end up lumping themselves together?


Here is where things get a little more tricky. Let's say that you move one of VM Charlie's VMDKs to datastore2. 


So now, we can't fail over just datastore3, because to fail over datastore3 we MUST fail over all VMs on that datastore. To do that, we have to fail over VM Charlie, but to fail over VM Charlie we MUST fail over all of the datastores that it depends on. Because of this rule, we must also fail over datastore2. In this way, the datastore group that is created contains datastores 2 and 3. Note that in this example, VM Beta from the above picture is unregistered. So what happens if we register it?

If you guessed that it also has to be failed over, you're right!


In this example, we see that the datastore group now includes VM Beta. VM Beta will have to be failed over because its VMDK file lives on datastore2, which also holds VM Charlie, which has another disk dependency on datastore3. In this example, you can't fail over just VM Beta or just VM Charlie because of the dependencies of other VMs on the datastores that the VMDK files live on. 

Ok, last example. What if I have 2 VMs that live on 2 different datastores and share nothing, but another VM has disks on both of those datastores? Well, that would look like this:


So you can see, in this example, we gave VM Beta another VMDK that lives on datastore1. VMs Alpha and Charlie share no cross dependencies but are grouped into the same datastore group. Why, you might ask? Because VM Beta links them together.

In this example, to fail over VM Alpha you must fail over datastore1. To fail over datastore1 you must fail over VM Beta. To fail over VM Beta you must fail over datastore2. To fail over datastore2, you must fail over VM Charlie and to fail over VM Charlie, you must fail over datastore 3. Whew!

In the end, it all has to do with the dependencies.


As you can see, careful planning of your SRM storage can really help to keep from having the "all or nothing" failover plan. When deploying an SRM environment using array-based replication, try and create datastores specifically for VMs that you want replicated and only put VMs on the same datastore that you know you want to fail over together. Another thing to think about is RDMs. If your VM is using RDMs, those will also need to be failed over any time the VM is failed over and that will also add to the number of LUNs in your datastore groups. 

Obviously, there are literally an infinite number of combinations and scenarios to go over but if you understand the definition of a datastore group and understand why certain datastores get lumped together with other datastores, you can figure out the dependencies for yourself. I hope this helps to shed some light on what seems to be one of my most common questions! Don't forget to follow me on Twitter @SRM_Guru and if you have questions, please put them in the comments below!

**********************************************Disclaimer**********************************************
This blog is in no way sponsored, supported or endorsed by VMware. Any configuration or environmental changes are to be made at your own risk. Casey, VMware, and any other company and/or persons mentioned in this blog take no responsibility for anything.  

Wednesday, May 15, 2013

The most common SRM error EVER!

What's the most common SRM issue I see?


This one!


Connection Error: Lost connection to SRM server srmServer.domain.com:8095
The server 'srmServer.domain.com' could not interpret the client's request. (The remote server returned an error: (503) Server Unavailable).

So what does this all mean? You can't talk to your SRM server! This could be for a number of different reasons but there are 2 things that YOU can check BEFORE calling support.

1) Can your vCenter server talk to your SRM server? Check this through your good ol' friend ping

2) Is the SRM service started on the SRM server? Check this through services.msc
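
From a command prompt, those two checks boil down to something like this sketch. Run the first two lines from the vCenter server and the last one on the SRM server itself; if I remember right, the SRM service's internal name is vmware-dr, but verify it in services.msc before trusting my memory:

ping srmServer.domain.com
telnet srmServer.domain.com 8095
sc query vmware-dr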

These are the quick steps to check. You wouldn't believe how many times I have gotten on the phone with admins who think the whole world is crashing down, only to find that the SRM service isn't started.



**********************************************Disclaimer**********************************************
This blog is in no way sponsored, supported or endorsed by VMware. Any configuration or environmental changes are to be made at your own risk. Casey, VMware, and any other company and/or persons mentioned in this blog take no responsibility for anything.

Monday, May 13, 2013

Logging in gives the error "The remote name could not be resolved:RemoteVC.domain.com"

But I swear I can!!


This was a fun one. I had a case today where we were getting this when trying to log into the remote vCenter server through SRM:

The remote name could not be resolved:RemoteVC.domain.com

The odd thing about this was, we could resolve it (or so we thought). We checked all kinds of connectivity.

ProdVC > ProdSRM = good
ProdVC > DRSRM = good
ProdVC > DRVC = good
DRVC > DRSRM = good
DRVC > ProdSRM = good
DRVC > ProdVC= good

So we checked these, but what exactly did we check? We checked the ports via telnet (see VMware KB 1009562 for a list of ports), we checked nslookup both forward and reverse from all servers to all servers, and we checked pings. So what was going wrong?!

Since all of the connectivity looked good, we tried connecting to the vCenter directly on the vCenter (opened the vSphere client on the vCenter server, pointed it to localhost). Once we did that, BAM, we were connected! In the end, it turned out the DNS server IP address on the workstation we were connecting from was wrong. Changed this, and everything connected.
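
For reference, here is the sort of quick check on the workstation that would have caught this right away (a sketch): ipconfig /all shows which DNS servers the active adapter is actually using, nslookup tells you which server answered the query, and ipconfig /displaydns shows what is being answered out of the local resolver cache.

ipconfig /all
nslookup RemoteVC.domain.com
ipconfig /displaydns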

It did raise one very interesting question (that I still can't figure out myself): why were we able to connect to the FQDN of the remote vCenter server from both the web client and the thick client? Here is my guess. I think the FQDN of the remote vCenter server was saved in the workstation's local DNS cache. This meant we could still resolve the name even though we weren't hitting the DNS server. Once we were logged in, the DNS resolution was going through vCenter, and vCenter forced the lookup to go out to the DNS server instead of looking in the workstation's cache. If anybody has a better explanation, put it in the comments!



**********************************************Disclaimer**********************************************
This blog is in no way sponsored, supported or endorsed by VMware. Any configuration or environmental changes are to be made at your own risk. Casey, VMware, and any other company and/or persons mentioned in this blog take no responsibility for anything.

Array pair issues with NetApp NFS mounts

Where's my array pairs NetApp?!


I have seen this issue a few times now and it always seems to be with the NetApp arrays. You get everything configured, your array pairs are enabled, and you see replicated devices, but you go to make a protection group and....... no array pairs. I have seen this on both NFS and FC arrays, and in each case it presented and was fixed the same way. Here are the ways I have seen it fixed.

The "Include" List


This is the most obvious one and is actually in the array documentation. Especially if you are using NFS mounts, you will need to go to the array managers, click the array, click the "Edit Array Manager" link in the summary tab, click Next until you see the options page, and add the datastore names in. For NFS, you need to add the part of the path AFTER the mount, so for example, if you mounted the datastore with:

/NetAppArray/NFSDatastore

where /NetAppArray is the FQDN of the share and /NFSDatastore is the mount point. You will add the datastore to the include list with just "NFSDatastore". Remember that this is CaSe SeNsItIvE so be careful typing it all in!

IPs vs FQDN


This one is ANNOYING! Let's say you have your Prod site and a DR site. On the Prod site, you have the following datastore mounted and replicated:

/ProdNetAppArray/Prod_NFSDatastore

where /ProdNetAppArray is the FQDN of the share and /Prod_NFSDatastore is the mount point. This datastore is replicated to the following datastore at the DR site:

/10.10.20.100/DR_NFSDatastore

where /10.10.20.100 is the IP address of the share and /DR_NFSDatastore is the mount point. I have seen this cause issues even if the datastores are in the include list correctly. The fix here was to mount them either both as IP or both as FQDN. I have also seen a case where they were both mounted as IP and changing them both to FQDN fixed the issue, and vice versa.

vCenter is set via IP, not FQDN


This was the last fix, one that I have heard of but haven't seen personally. The issue occurred because, during the install of SRM, vCenter was entered via IP address instead of FQDN. I'm not sure why this would cause the problem, but the fix was to do a modify install and, when defining the vCenter server, use the FQDN instead of the IP address. This MAY also work the opposite way (i.e., the VC was set as FQDN and changing it to IP fixes it). As I stated before, I haven't seen this one personally, but another TSE here said this was the fix.

Well that wraps this one up! Hope this helps somebody and saves them the days of troubleshooting I did on it! Want the SRM findings of a TSE in the trenches? Follow me on Twitter! @SRM_Guru




**********************************************Disclaimer**********************************************
This blog is in no way sponsored, supported or endorsed by VMware. Any configuration or environmental changes are to be made at your own risk. Casey, VMware, and any other company and/or persons mentioned in this blog take no responsibility for anything.

Tuesday, April 30, 2013

SRM Service, He's dead Jim.


Ever had your SRM server randomly throw an error like this?

The server 'server.domain' could not interpret the client's request (the remote server returned an error: (503) Server Unavailable.)

Well I might just have a fix for ya!

This was a pretty simple one. Cracked open the logs (C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs\vmware-dr-###) and saw this gem:

DBManager error: Could not initialize Vdb connection: ODBC error: (08001) - [Microsoft][SQL Native Client]Named Pipes Provider: Could not open a connection to SQL Server [2].

In this case, the issue was that the SQL server was dead due to an expired password, but this message in the logs could point to SQL being down for any reason or being unreachable (if, say, you had a remote SQL server instead of a local one). Always remember to crack open those logs!
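
If you suspect a remote SQL server is unreachable rather than down, a quick sketch from the SRM server itself (this assumes a default SQL instance on the standard TCP port 1433; named instances often use dynamic ports, and the hostname is a placeholder):

ping mySqlServer.domain.com
telnet mySqlServer.domain.com 1433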






**********************************************Disclaimer**********************************************
This blog is in no way sponsored, supported or endorsed by VMware. Any configuration or environmental changes are to be made at your own risk. Casey, VMware, and any other company and/or persons mentioned in this blog take no responsibility for anything.  

NetApp SRA issue

A NetApp SRA issue:


Last week I encountered an SRM issue that I thought would be an easy fix. Turned out it wasn't.


The Issue:



The admin was unable to create a protection group pair. In the devices tab, we saw replication one way but not the other. We saw the following error:


Device '/vol/NFS_datastore_name' cannot be matched to a remote peer device.


The Environment:


The important thing here is that the datastores were NFS. The array was a NetApp DS5020. The other important thing in this case was that the storage that wasn't replicating was on a new shelf (same head though).


The Fix:


This turned out to be an issue with the NetApp SRA. The datastores were unable to replicate due to a permissions issue. Originally, the array was set up to allow all hosts to have root access. To fix the issue, we had to give the hosts explicit root access individually. Once that was finished, replication started right up!



**********************************************Disclaimer**********************************************
This blog is in no way sponsored, supported or endorsed by VMware. Any configuration or environmental changes are to be made at your own risk. Casey, VMware, and any other company and/or persons mentioned in this blog take no responsibility for anything.