When I write a blog post (and I know I should write more often), it's usually to share a tip or to give my opinion on something.
This post is more of a question: a call to share knowledge about a piece of technology that not many people use yet.
About four months ago my largest customer moved their 200GB database to SQL Azure, after a lot of testing and careful consideration.
I'm not going to post about the business value of the Azure platform and the benefits of running on it. I'm going to share some issues we ran into, and I'm hoping others with similar experiences can help solve them.
Performance on P4 has been very good. This gives us 500 DTUs, and we typically stay below 50% usage.
We could probably switch to P2, were it not that we have moments where DTU usage goes up to 80%; to avoid those peaks we first need some time to optimise a few processes.
Cost & ‘Auto’ Scaling
To save on Azure cost (P4 is pretty expensive), we use a PowerShell script to scale down to P2 after business hours and to P1 during the night and at weekends.
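The actual script isn't shown here, but the core of such a scale-down is a single cmdlet call. A minimal sketch, assuming the Az.Sql PowerShell module and placeholder resource names (the resource group, server and database names below are not from this environment):

```powershell
# Sketch: scale an Azure SQL database down after business hours.
# Assumes the Az.Sql module; all resource names are placeholders.
Connect-AzAccount -Identity   # e.g. when run from an Azure Automation account

Set-AzSqlDatabase `
    -ResourceGroupName 'MyResourceGroup' `
    -ServerName 'my-sql-server' `
    -DatabaseName 'MyNavDb' `
    -RequestedServiceObjectiveName 'P2'   # 'P1' for nights/weekends, 'P4' for business hours
```

Scheduled runs (evening, night, morning) can be wired up with Azure Automation schedules or any other job scheduler; note that a tier change is an online operation but can briefly drop connections when it completes.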
This brings the Azure SQL cost to about 800 euro/month, which is 9,600 euro/year or 48,000 euro over five years.
Yes, that is a lot of money, but so is setting up an on-premises SQL cluster, with maintenance, backups and the occasional hardware issue.
We expect to be able to scale down to P2, which will cut the cost down a bit further.
So far, so good. We've been running for four months and on average everyone is happy, except for a few moments. That's when the shit hits the fan: the phone starts ringing and users get angry at us.
The system is slow! Fix it!
Funnily enough, when this happens we don't see slow queries on the database. DTU usage does not go up, and even the processes on the Job Queue don't take longer than average. Only the end users complain.
We have four service tiers on two virtual machines. The users and the job queue run on separate machines; the other two service tiers we need for interfacing with different credential types, to support some older add-ons we use that don't support AAD.
When we ask a user, during a performance issue, to switch to a service tier on a different machine, performance is fine. No problems.
Then we restart the affected service tier, everybody restarts NAV and works happily ever after.
Unfortunately we don’t have a lot of telemetry when this happens. The only consistent number is what SQL Azure calls “Data I/O Percentage”.
Normally, about 99.5% of the time, this number does not top 0.5%.
However, when users call us with performance issues, we see it go up to 25%.
Most often this happens when we scale down the database, but sometimes, not often, it happens in the middle of the day.
In some cases the memory usage of the users’ service tier is quite high, around 90% or 95%, but we’ve also seen it happen at only 75% or 80%.
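For anyone who wants to correlate these incidents with resource usage: SQL Azure keeps roughly an hour of 15-second resource snapshots in the `sys.dm_db_resource_stats` DMV, which is where the “Data I/O Percentage” figure comes from. A sketch of pulling it from PowerShell, assuming the SqlServer module's `Invoke-Sqlcmd` and placeholder connection details:

```powershell
# Sketch: dump the last hour of 15-second resource snapshots from SQL Azure.
# Assumes the SqlServer module; server, database and credentials are placeholders.
Invoke-Sqlcmd `
    -ServerInstance 'my-sql-server.database.windows.net' `
    -Database 'MyNavDb' `
    -Username 'monitor' -Password 'xxxx' `
    -Query @"
SELECT end_time,
       avg_cpu_percent,
       avg_data_io_percent,    -- the 'Data I/O Percentage' that spikes to 25%
       avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
"@
```

Logging this to a table on a schedule gives a history longer than the DMV's one-hour window, which makes it easier to line the spikes up with user complaints and with the scale-down moments.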
We discussed this issue with our SQL Azure experts, with Microsoft in Redmond and Lyngby but nobody seems to have a real answer to what causes this to happen.
If anyone has an idea, you would most definitely make my day.
The situation is under control. We have procedures in place that allow the users to switch service tiers when this happens, but it’s just annoying and bad for the “reputation” of NAV in this company.
We run NAV2018 build 33015 with 80 concurrent Windows client users and a few hundred interfaces, portals and jobs that pound the system pretty hard.