Monday, May 20, 2013

Datastore Groups Explained



"Why can't I fail over a single VM using array-based replication?"

I get this question a lot. The truth is, you can, as long as you understand how datastore groups work. The smallest unit that SRM can fail over using array-based replication is a datastore group. In VMware terms, a LUN almost always maps 1-to-1 to a datastore or an RDM, so for the purposes of this article we will ignore the rare exceptions; they fall outside the scope of what I am trying to explain. I give to you: the test environment.





In the test environment, we have 3 VMs: Alpha, Beta and Charlie, and 3 datastores: datastore1, datastore2 and datastore3. We start the test with Alpha's disk on datastore1, Beta's disk on datastore2 and Charlie's 2 disks on datastore3.

So remember how I said the smallest thing you can fail over is a datastore group? Well, what is a datastore group, you might ask? A datastore group is the smallest set of LUNs that satisfies the dependencies of all the VMs on those LUNs. I know that's confusing; don't worry, I will explain.
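Another way to see that definition is as a graph problem: VMs and datastores form a bipartite graph, and a datastore group is a connected component of datastores. Here is a minimal Python sketch of that idea. The names and structure are mine, this is a model of the definition, not SRM's internal code.

```python
from collections import defaultdict

def datastore_groups(vm_disks):
    """Compute datastore groups from a map of VM name -> set of datastores.

    Two datastores land in the same group whenever some VM has disks on
    both of them; a group is a connected component of that relationship.
    (A model of the definition in this post, not SRM's own algorithm.)
    """
    parent = {}  # union-find forest over datastore names

    def find(d):
        parent.setdefault(d, d)
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving
            d = parent[d]
        return d

    def union(a, b):
        parent[find(a)] = find(b)

    for stores in vm_disks.values():
        stores = list(stores)
        if not stores:
            continue
        find(stores[0])  # register lone datastores too
        for other in stores[1:]:
            union(stores[0], other)

    groups = defaultdict(set)
    for d in parent:
        groups[find(d)].add(d)
    return sorted(sorted(g) for g in groups.values())

# The starting test environment: three independent groups.
env = {
    "Alpha":   {"datastore1"},
    "Beta":    {"datastore2"},
    "Charlie": {"datastore3"},
}
print(datastore_groups(env))  # [['datastore1'], ['datastore2'], ['datastore3']]
```

With each VM on its own datastore, every datastore is its own group, which is exactly why Alpha and Charlie can each be failed over alone in the examples below.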



So let's look at just VM Alpha and datastore1.



As you can see, the only VMDK file on datastore1 is VM Alpha's only disk. Because VM Alpha doesn't depend on any other datastores and because datastore1 doesn't hold any other VMs, they make up a datastore group. In this way, you could fail over a single VM.

So what about VM Charlie with 2 VMDKs?



That's a good question. Well let's think this one through. VM Charlie has no dependencies on any other datastores other than datastore3 and datastore3 has no other VMs in it besides VM Charlie so they make up a datastore group as well! As you can see, datastore groups don't care how many disks a VM has, just what datastores they are on. Consequently, datastore groups also don't care how many VMDKs are on a datastore but rather what VMs they belong to. 

So if that is the case, then why do datastores end up lumped together?


Here is where things get a little more tricky. Let's say that you move one of VM Charlie's VMDKs to datastore2. 


So now we can't fail over just datastore3, because to fail over datastore3 we MUST fail over all VMs on that datastore. To do that we have to fail over VM Charlie, but to fail over VM Charlie we MUST fail over all of the datastores that it depends on. Because of this rule, we must also fail over datastore2. The datastore group that is created now contains datastore2 and datastore3. Note that in this example, VM Beta from the picture above is unregistered. So what happens if we register it?

If you guessed that it also has to be failed over, you're right!


In this example, we see that the datastore group now includes VM Beta. VM Beta will have to be failed over because its VMDK file lives on datastore2, which also holds VM Charlie, which has another disk dependency on datastore3. In this example, you can't fail over just VM Beta or just VM Charlie because of the dependencies of other VMs on the datastores that the VMDK files live on. 

Ok, last example. What if I have 2 VMs that live on 2 different datastores and share nothing, but a third VM has disks on both of those datastores? Well, that would look like this:


So you can see, in this example we gave VM Beta another VMDK that lives on datastore1. VMs Alpha and Charlie share no cross dependencies but are grouped into the same datastore group. Why, you might ask? Because VM Beta links them together.

In this example, to fail over VM Alpha you must fail over datastore1. To fail over datastore1 you must fail over VM Beta. To fail over VM Beta you must fail over datastore2. To fail over datastore2 you must fail over VM Charlie, and to fail over VM Charlie you must fail over datastore3. Whew!
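That chain of "must fail over" rules can be walked mechanically. This hypothetical Python sketch (names are mine) alternates the two rules until nothing new gets pulled in:

```python
def failover_set(vm_disks, start_vm):
    """Expand everything that must fail over along with start_vm.

    Alternates the two rules from the walkthrough: failing over a VM pulls
    in every datastore it has a disk on, and failing over a datastore pulls
    in every VM with a disk on it. (A model of the rules, not SRM code.)
    """
    vms, stores = {start_vm}, set()
    changed = True
    while changed:
        changed = False
        # Rule 1: a failed-over VM drags in every datastore it touches.
        for vm in list(vms):
            for ds in vm_disks[vm]:
                if ds not in stores:
                    stores.add(ds)
                    changed = True
        # Rule 2: a failed-over datastore drags in every VM on it.
        for vm, disks in vm_disks.items():
            if vm not in vms and disks & stores:
                vms.add(vm)
                changed = True
    return vms, stores

# The last example: VM Beta links Alpha and Charlie together.
env = {
    "Alpha":   {"datastore1"},
    "Beta":    {"datastore1", "datastore2"},
    "Charlie": {"datastore2", "datastore3"},
}
vms, stores = failover_set(env, "Alpha")
print(sorted(vms))     # ['Alpha', 'Beta', 'Charlie']
print(sorted(stores))  # ['datastore1', 'datastore2', 'datastore3']
```

Starting from just VM Alpha, the expansion still ends up with all three VMs and all three datastores, which is the whole point of the example.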

In the end, it all comes down to the dependencies.


As you can see, careful planning of your SRM storage can really help you avoid an "all or nothing" failover plan. When deploying an SRM environment using array-based replication, try to create datastores specifically for VMs that you want replicated, and only put VMs on the same datastore that you know you want to fail over together. Another thing to think about is RDMs. If your VM uses RDMs, those will also need to be failed over any time the VM is, and that will add to the number of LUNs in your datastore groups.

Obviously, there are countless combinations and scenarios to go over, but if you understand the definition of a datastore group and understand why certain datastores get lumped together with other datastores, you can figure out the dependencies for yourself. I hope this helps to shed some light on what seems to be one of my most common questions! Don't forget to follow me on Twitter @SRM_Guru and if you have questions, please put them in the comments below!

**********************************************Disclaimer**********************************************
This blog is in no way sponsored, supported or endorsed by VMware. Any configuration or environmental changes are to be made at your own risk. Casey, VMware, and any other company and/or persons mentioned in this blog take no responsibility for anything.  


Wednesday, May 15, 2013

The most common SRM error EVER!

What's the most common SRM issue I see?


This one!


Connection Error: Lost connection to SRM server srmServer.domain.com:8095
The server 'srmServer.domain.com' could not interpret the client's request. (The remote server returned an error: (503) Server Unavailable).

So what does this all mean? You can't talk to your SRM server! This could be for a number of different reasons but there are 2 things that YOU can check BEFORE calling support.

1) Can your vCenter server talk to your SRM server? Check this through your good ol' friend ping

2) Is the SRM service started on the SRM server? Check this through services.msc

These are the quick things to check. You wouldn't believe how many times I have gotten on the phone with admins who think the whole world is crashing down, only to find that the SRM service isn't started.
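Both checks can be scripted. Here is a rough Python sketch, assuming the SRM SOAP port 8095 from the error above and the usual Windows service name "vmware-dr" for the SRM server; verify both against your own install, since the names here are assumptions:

```python
import socket
import subprocess

def port_open(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def srm_service_running(service="vmware-dr"):
    """Windows-only: ask the Service Control Manager whether a service runs.

    "vmware-dr" is commonly the SRM server's service name; check services.msc
    on your SRM server if yours differs.
    """
    result = subprocess.run(["sc", "query", service],
                            capture_output=True, text=True)
    return "RUNNING" in result.stdout

# Example: check the SRM port from the vCenter server's point of view.
# print(port_open("srmServer.domain.com", 8095))
```

If `port_open` comes back False but the service is running, look at firewalls between vCenter and SRM before anything else.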




Monday, May 13, 2013

Logging in gives the error "The remote name could not be resolved:RemoteVC.domain.com"

But I swear I can!!


This was a fun one. I had a case today where we were getting this error when trying to log into the remote vCenter server through SRM:

The remote name could not be resolved:RemoteVC.domain.com

The odd thing was, we could resolve it (or so we thought). We checked all kinds of connectivity.

ProdVC > ProdSRM = good
ProdVC > DRSRM = good
ProdVC > DRVC = good
DRVC > DRSRM = good
DRVC > ProdSRM = good
DRVC > ProdVC = good

So we checked these, but what exactly did we check? We checked the ports (see VMware KB 1009562 for a list of ports) via telnet; we ran nslookup, and forward and reverse lookups from every machine to every other machine worked; and we checked pings. So what was going wrong?!
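The forward and reverse lookups are worth running from the workstation itself, since that is where the problem turned out to live. A small Python sketch (the function name is mine):

```python
import socket

def dns_roundtrip(fqdn):
    """Forward-resolve fqdn, then try to reverse-resolve the IP.

    Returns (ip, reverse_name); reverse_name is None when the PTR lookup
    fails. Running this on the workstation, not just the servers, would
    have caught the bad resolver setting described below.
    """
    ip = socket.gethostbyname(fqdn)
    try:
        name = socket.gethostbyaddr(ip)[0]
    except OSError:
        name = None
    return ip, name

# Example (hypothetical host name):
# print(dns_roundtrip("RemoteVC.domain.com"))
```

Any exception out of the forward lookup means the workstation's configured DNS server, not the vCenter's, is the thing to inspect first.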

Since all of the connectivity looked good, we tried connecting to the vCenter directly on the vCenter (opened the vSphere client on the vCenter server, pointed it to localhost). Once we did that, BAM, we were connected! In the end, it turned out the DNS server IP address on the workstation we were connecting from was wrong. Changed this, and everything connected.

It did raise one very interesting question (that I still can't figure out myself): why were we able to point a web client at the FQDN of the remote vCenter server and connect, and also connect to the remote vCenter server via FQDN with the thick client? Here is my guess. I think the FQDN of the remote vCenter Server was cached in the workstation's local DNS resolver. That meant we could still resolve the name even though we weren't hitting the DNS server. Once we were logged in, the DNS resolution was going through vCenter, and vCenter forced the lookup out to the (wrong) DNS server instead of using the workstation's cached entries. If anybody has a better explanation, put it in the comments!




Array pair issues with NetApp NFS mounts

Where are my array pairs, NetApp?!


I have seen this issue a few times now, and it always seems to be with NetApp arrays. You get everything configured, your array pairs are enabled and you see replicated devices, but you go to make a protection group and... no array pairs. I have seen this on both NFS and FC arrays, but the symptoms and the fixes were the same. Here are the ways I have seen it fixed.

The "Include" List


This is the most obvious fix and is actually in the array documentation. Especially if you are using NFS mounts, you will need to go to the array managers, click the array, click the "Edit Array Manager" link in the summary tab, click Next until you see the options page, and add the datastore names in. For NFS, you add only the part that comes after the mount, so for example, if you mounted the datastore with:

/NetAppArray/NFSDatastore

where /NetAppArray is the FQDN of the share and /NFSDatastore is the mount point, you would add the datastore to the include list as just "NFSDatastore". Remember that this is CaSe SeNsItIvE, so be careful typing it all in!
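If you are scripting the include list, the rule is just "everything after the last slash, case preserved". A tiny hypothetical helper (name and existence are mine, not part of any SRM tooling):

```python
def include_list_name(nfs_path):
    """Return the case-sensitive name SRM wants in the include list.

    For an NFS mount such as /NetAppArray/NFSDatastore, the include list
    takes only the part after the last slash, exactly as it was mounted.
    """
    return nfs_path.rstrip("/").rsplit("/", 1)[-1]

print(include_list_name("/NetAppArray/NFSDatastore"))  # NFSDatastore
```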

IPs vs FQDN


This one is ANNOYING! Let's say you have your Prod site and a DR site. On the Prod site, you have the following datastore mounted and replicated:

/ProdNetAppArray/Prod_NFSDatastore

where /ProdNetAppArray is the FQDN of the share and /Prod_NFSDatastore is the mount point. This datastore is replicated to the following datastore at the DR site:

/10.10.20.100/DR_NFSDatastore

where /10.10.20.100 is the IP address of the share and /DR_NFSDatastore is the mount point. I have seen this cause issues even when the datastores are in the include list correctly. The fix here was to mount them both by IP or both by FQDN. I have also seen cases where both were mounted by IP and changing both to FQDN fixed the issue, and vice versa.

vCenter is set via IP, not FQDN


This was the last fix that I have heard of but haven't seen personally. The issue occurred when, during the install of SRM, vCenter was entered by IP address instead of by FQDN. I am not sure why this would cause the problem, but the fix was to do a modify install and, when defining the vCenter Server, use the FQDN instead of the IP address. This MAY also work the opposite way, i.e. you set the VC by FQDN and changing it to IP fixes it. As I stated before, I haven't seen this one personally, but another TSE here said this was the fix.

Well that wraps this one up! Hope this helps somebody and saves them the days of troubleshooting I did on it! Want the SRM findings of a TSE in the trenches? Follow me on Twitter! @SRM_Guru



