I told my friends about problems I encountered on Microsoft Azure. One of my friends, Riza, then asked me to share my experience of hosting web applications on Azure during the Singapore Azure Community meetup two weeks ago.
Problems with On-Premise Servers
Our web applications were hosted on-premise for about 9 years. Recently, we realised that our systems were running slower and slower, and clients kept receiving timeout exceptions. At the same time, we also ran out of storage space. We had to drive all the way to the data centre, which is about 15km from our office, just to connect a 1TB external hard disk to our server.
Hence, in one of our company meetings in June, we finally decided to migrate our web applications and databases to the cloud. None of the developers, besides me, knew about cloud hosting. Hence, we all agreed to use Microsoft Azure, the only cloud computing platform that I was familiar with.
Self Learning Microsoft Azure on MVA
When I first heard last year that the top management of our company intended to migrate our web applications to the cloud, I started learning Azure on Microsoft Virtual Academy (MVA) in my own time and at my own pace.
MVA is an online learning platform offering free IT training to the public, including some useful introductory courses on Microsoft Azure, listed below.
- Establish Microsoft Azure IaaS Technical Fundamentals
- Windows Azure for IT Pros Jump Start
- Microsoft Azure IaaS Deep Dive Jump Start
- SQL Server in Windows Azure Virtual Machines Jump Start
As you may have noticed, the courses above are mostly about IaaS. This is because IaaS was the most straightforward path for us, since we were migrating existing systems and databases from on-premise to the cloud. If we had chosen PaaS, we would have needed to rework our entire code base.
If you are more into reading books, you can also check out some free eBooks about Microsoft Azure available on MVA. Personally, I didn’t read any of the books because I found the MVA training videos far more engaging.
I learnt after work and during weekends. I started learning Azure around March, and the day we migrated from on-premise to Azure was in July. So I basically had a crash course in Azure in just four months.
Looking back, I would not recommend this learning approach. If you are going to learn Azure, it’s important to build up the key concepts by reading books and talking to people who are more experienced with Microsoft Azure and networking. Otherwise, you might run into problems that are hard to fix at a later stage.
Migration at Midnight
Before doing a migration, we had to do some preparation work.
Firstly, we called our clients one by one, because we also hosted clients’ websites on our server and needed to inform them to update the A records in their DNS. Only later did we find out that they should have been using CNAME records instead, so that a change of IP address on our side would not affect them.
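To illustrate the difference, here is a sketch of the two record types in DNS zone-file notation (the domain names and the IP address are hypothetical placeholders):

```
; A record: the client's domain is pinned to our server's IP address,
; so every IP change on our side forces every client to update DNS.
www.client-example.com.   IN  A      203.0.113.10

; CNAME record: the client's domain is an alias for our hostname,
; so we can change our IP freely behind that one name.
www.client-example.com.   IN  CNAME  ourapp.example-host.net.
```

With CNAMEs, only the single A record for `ourapp.example-host.net.` would have needed updating on migration day.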
Secondly, we prepared a file called app_offline.htm to put in the root folder of each web application hosted on our on-premise server. When ASP.NET detects this file, it takes the application offline and serves the file’s contents for every request, so users saw a maintenance page no matter which page they visited.
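A minimal app_offline.htm looks something like this (wording is illustrative, not our exact page):

```html
<!-- app_offline.htm: while this file sits in the site root, ASP.NET
     serves it for every request to the application. Delete the file
     to bring the application back online. -->
<html>
  <body>
    <h1>Under Maintenance</h1>
    <p>We are carrying out scheduled maintenance. Please check back shortly.</p>
  </body>
</html>
```

One thing to remember is that the file must be removed again after deployment, or the site stays offline.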
Finally, we backed up all the databases running on our on-premise servers. Because our databases were large, backing up a single database took about 20-30 minutes. Of course, this could only be done right before the migration.
We chose to do the migration at midnight because we had many online transactions going on during the day. In our company, only my senior and I were in charge of the migration. The following schedule lists the main activities during our midnight migration.
- 2am – 3am: Uploading app_offline.htm and backing up databases
- 3am – 4am: Restoring databases on Azure
- 4am – 5am: Uploading web applications to Azure and updating DNS in Route 53
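The backup and restore steps in the schedule above can be sketched in T-SQL (database names, paths, and logical file names here are hypothetical; the logical names for `MOVE` must match what `sp_helpfile` reports for the actual database):

```sql
-- On the on-premise server: take a full backup to a local file.
BACKUP DATABASE [OurAppDb]
TO DISK = N'D:\Backups\OurAppDb.bak'
WITH COMPRESSION, STATS = 10;

-- On the Azure VM, after copying the .bak file across:
-- restore it, relocating the data and log files to the VM's data disk.
RESTORE DATABASE [OurAppDb]
FROM DISK = N'D:\Backups\OurAppDb.bak'
WITH MOVE N'OurAppDb'     TO N'F:\Data\OurAppDb.mdf',
     MOVE N'OurAppDb_log' TO N'F:\Data\OurAppDb_log.ldf',
     STATS = 10;
```

The `STATS = 10` option prints progress every 10%, which was reassuring at 3am when a restore ran for half an hour.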
Complaints Received on First Day after Migration
We needed to finish the migration by 5am because that is when our clients start logging in to our web applications. So everything was done in a rush, and we received a number of calls from our clients after 6am that day.
Some clients complained that our system had become very slow. It turned out this was because we had not put our web application and databases in the same virtual network (VNet). Without that, every call from our web application to the databases went over the Internet instead of the internal connection, which made it both slow and expensive (Azure charged us for outbound data transfer).
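Concretely, the fix amounts to a connection-string change once both VMs sit in the same VNet. This is only a sketch with hypothetical names and addresses, not our actual configuration:

```xml
<!-- web.config sketch: point the app at the database server's private
     address inside the shared VNet instead of its public endpoint. -->
<connectionStrings>
  <!-- Before: public cloud-service endpoint, routed over the Internet
       and billed as outbound data transfer.
  <add name="AppDb"
       connectionString="Server=ourdb.cloudapp.net,57500;Database=OurAppDb;User Id=app;Password=...;" />
  -->
  <!-- After: private IP inside the VNet, internal traffic only. -->
  <add name="AppDb"
       connectionString="Server=10.0.1.4,1433;Database=OurAppDb;User Id=app;Password=...;"
       providerName="System.Data.SqlClient" />
</connectionStrings>
```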
We also received calls from clients complaining that their websites were gone. That was caused by their DNS records not being updated quickly enough.
Another interesting problem was that part of our system was blocked by a client’s network, because their firewall only allowed traffic from certain whitelisted IP addresses. So, we had to give them the new IP address of our Azure server before everything worked on their side again.
Downtime: The Impact and Microsoft’s Responses
The web applications have been running on Azure since July 2014, about 8 months now. In that time we encountered roughly 10 instances of downtime. Some were caused by our own misconfiguration; others were due to Azure platform errors, as reported by the Microsoft Azure team.
Our first downtime happened on 4 August 2014, from 12pm to 1:30pm. Traffic to our websites is normally high at noon, so the downtime cost us a huge amount of online sales. The Microsoft Azure team later reported the cause: all our deployments were in an affected cluster in the Southeast Asia data centre.
Traffic Manager Came to Rescue
That was when we started planning to host a backup of all our web applications in another Azure data centre, with Traffic Manager configured for failover load balancing. The idea was that when our primary deployment went down, Traffic Manager would direct traffic to the backup deployment, which would still be running fine.
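For readers setting this up today, the shape of the configuration looks roughly like this in the current Azure CLI (which did not exist in this form back in 2014; resource and DNS names are hypothetical):

```shell
# Failover routing: all traffic goes to the priority-1 endpoint while it is
# healthy; Traffic Manager fails over to priority 2 when the probe fails.
az network traffic-manager profile create \
    --resource-group our-rg --name our-tm-profile \
    --routing-method Priority --unique-dns-name ourapp-tm

az network traffic-manager endpoint create \
    --resource-group our-rg --profile-name our-tm-profile \
    --name primary --type externalEndpoints \
    --target ourapp-sea.cloudapp.net --priority 1

az network traffic-manager endpoint create \
    --resource-group our-rg --profile-name our-tm-profile \
    --name backup --type externalEndpoints \
    --target ourapp-backup.cloudapp.net --priority 2
```

Clients then resolve `ourapp-tm.trafficmanager.net` (via a CNAME) rather than either deployment directly.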
In the reply the Microsoft Azure team sent us, they also mentioned that the uptime SLA for virtual machines requires 2 or more instances, so they highly recommended configuring an availability set for our deployment. Before that, we had always assumed one running instance was sufficient. In practice, however, planned maintenance on Azure was quite frequent, and sometimes took a long time to complete.
Database Mirroring: DB Will Always be Available
So, in addition to Traffic Manager, we also applied database mirroring to our setup. We then had three database servers instead of just one: one as the principal, one as the witness, and one as the mirror. The steps we followed to set this up are described in another post of mine.
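On the application side, mirroring is consumed through the connection string: the SQL client driver retries against the named failover partner when the principal is unreachable. A sketch with hypothetical server names (note that clients never connect to the witness, which only arbitrates automatic failover):

```xml
<!-- web.config sketch: connection string for a mirrored database.
     "Failover Partner" names the mirror server the client should try
     when the principal cannot be reached. -->
<add name="AppDb"
     connectionString="Server=principal-db;Failover Partner=mirror-db;Database=OurAppDb;User Id=app;Password=...;"
     providerName="System.Data.SqlClient" />
```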
With all of this set up, we thought downtime would not happen again. However, we soon realised that the database mirroring was not working as expected.
When the principal went down, automatic failover did occur. However, none of our web applications could connect to the mirror, which had become the new principal. Also, when the original principal came back online, it remained a mirror until I performed a manual failover. After a few experiments with Microsoft engineers, we concluded that this was probably because our web applications were not in the same virtual network as the database instances.
Availability Set: At Least One VM is Running
Up to this point, I haven’t talked about configuring two virtual machines in an availability set. The point of an availability set is that, within the same data centre, when one of the virtual machines goes down, the other is still up and running. However, because our web applications were all built on an old version of the .NET Framework, even the Azure Redis Cache Service couldn’t help us share state between instances.
Our web applications rely heavily on session state. Hence, with Redis ruled out as an external session state provider, we had no choice but to use SQL Server as the external provider. Otherwise, we would have been limited to running the web applications on only one instance.
Soon, we found out that we couldn’t even use SQL Server mode for session state, because some of the values stored in our session were not serialisable. At that point we had no option but to rely on Traffic Manager alone.
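For reference, switching ASP.NET session state to SQL Server mode is itself only a small web.config change, which made the serialisation roadblock all the more frustrating. A sketch with hypothetical server details (the session database is typically provisioned beforehand with the `aspnet_regsql.exe` tool):

```xml
<!-- web.config sketch: out-of-process session state stored in SQL Server.
     Every object placed into Session must be serialisable in this mode,
     which is exactly where our applications fell over. -->
<system.web>
  <sessionState mode="SQLServer"
                sqlConnectionString="Server=10.0.1.4;User Id=session;Password=...;"
                cookieless="false"
                timeout="20" />
</system.web>
```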
In October 2014, a few days after we encountered our third downtime, Microsoft Azure announced a new distribution mode in the Azure Load Balancer, called Source IP Affinity. We were delighted when we heard that, because it meant sticky sessions would be possible on Azure. Soon after, we successfully configured a second instance in the same availability set.
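As I recall from the announcement, the new distribution mode was applied through the classic Azure PowerShell cmdlets; treat the exact parameters below as a sketch from memory rather than a verified script, with all names hypothetical:

```powershell
# Source IP affinity hashes the client's source IP so that requests from
# the same client keep landing on the same VM behind the load balancer.
Set-AzureLoadBalancedEndpoint -ServiceName "ourapp-sea" `
    -LBSetName "web-lb-set" -Protocol tcp -LocalPort 80 `
    -ProbeProtocolTCP -ProbePort 80 `
    -LoadBalancerDistribution "sourceIP"
```

With affinity in place, session-heavy applications can run on two instances even without an external session state provider.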
Even after all this was done, there were still occasional downtime or restarts for one of the virtual machines. However, thanks to the load balancer and Traffic Manager, our websites stayed up and running. The Microsoft Azure team investigated the random virtual machine restarts and identified that some of them were due to platform bugs.
There is still more work to be done to achieve high availability for our web applications on Azure. If you are interested in finding out more about high availability and disaster recovery on Azure, please read this article from Microsoft Azure.
Migrating Back to On-Premise?
When we were still on-premise, we had only one web server and one database server. However, when we moved to Azure, we ended up setting up seven servers. Explaining the increased cost to our managers was therefore a challenge.
Sometimes, our developers are also asked by managers whether moving back to on-premise would be a better option. I have no answer to that. However, if we migrated back to on-premise and downtime happened, who would be in charge of fixing the problems rapidly?
Hence, what we can do now as developers is learn as much as we can about improving the performance and stability of our web applications on Azure. In addition, we will seek help from the Microsoft Azure team, where necessary, to introduce new cloud solutions to our web applications.