SQL Azure, Dynamics NAV & Data I/O Issues

When I write a blog, and I should write more I know, it’s most of the time to share a tip or to write my opinion on something.

This blog is more of a question, or a call to share knowledge of a piece of technology that is not used by many.

About 4 months ago my largest customer moved to SQL Azure with their 200GB database after a lot of testing and careful considerations.

I’m not going to post about the business value of the Azure platform and the benefits of running on this. I’m going to share some issues we run into and I’m looking for others with experiences to solve this.

We have a very good performance experience running on P4. This gives us 500 DTU’s and we typically stay below 50% usage.

We could probably switch to P2 were it not that we have moments where DTU goes up to 80% and to avoid these peaks we need some time to optimise processes

Cost & ‘Auto’ Scaling

To save on Azure Cost (P4 is pretty expensive) we use a PowerShell script to scale down to P2 after business hours and to P1 during the night hours and in weekends.

This makes Azure SQL cost 800 euro/month = 9.600 euro/year and 48.000 euro/five years.

Yes, this is a lot of money, but so is setting up a SQL Cluster On-Premises with maintenance, backups and an occaisional hardware issue.

We expect to be able to scale down to P2 which will cut the cost down a bit more.

The Problem

So far, so good. We’ve been running for four months and on average everyone is happy except for a few moments. This is when shit hits the fan, the phone starts ringing and users start to be angry at us.

The system is slow! Fix it!

Funny enough when this happens we don’t see issues with slow queries on the database. DTU does not go up and even the processes on the Job Queue don’t take longer than average. Only the end-users complain.

We have four service tiers on two virtual machines. The users and the job queue are on different machines. The other two service tiers we need for interfacing with different credential types to support some older add-ons we are using that don’t support AAD.

When we ask a user during performance issues to switch to another service tier on a different machine the performance is ok. No problems.

Then we restart the service tier and everybody restarts NAV and works happily ever after.

The Telemetry

Unfortunately we don’t have a lot of telemetry when this happens. The only conistent number is what SQL Azure calls “Data I/O Percentage”.

Normally, like 99.5% of the time this number does not top 0.5%

However, when users call us with performance issues we see this going up to 25%.

Most often this happens when we scale down the database, but sometimes, not often, it happens in the middle of the day.

In some cases the memory of the Users’ service tier is quite high, like 90% or 95% but we’ve also seen it happen with only 75% or 80%.

Any Ideas?

We discussed this issue with our SQL Azure experts, with Microsoft in Redmond and Lyngby but nobody seems to have a real answer to what causes this to happen.

If anyone has an idea you would most definately make my day.

The situation is under control. We have procedures in place that allow the users to switch service tiers when this happens but it’s just anoying and bad for the “reputation” of NAV in this company.

We run NAV2018 build 33015 with 80 concurrent windows client users and a few hundred interfaces, portals and jobs that pound the system pretty hard.

10 thoughts on “SQL Azure, Dynamics NAV & Data I/O Issues

  1. Pingback: SQL Azure, Dynamics NAV & Data I/O Issues - 365 Community

  2. kriki

    It looks to me like a few queries read (past sense) a lot of data, kicking other, useful, data out of memory. So sql needs to read that data again, going from a few milliseconds (or less) responses to some 10’s of milliseconds.
    Result: slower responses and more reads.

    Liked by 1 person

    Reply
  3. Marek Mengel

    Lot of things will happen (in Azure), when Change Log is enabled and Change Log Entry table is used and it’ size is more than XGB. So my suggestion is to remove older than 3 months Change Log Entries to another database and use it as external database. Second thing is to enable elastic database job for optimizing table indexes.

    Liked by 1 person

    Reply
  4. Andrei Popescu

    Mark the best will be to enable SQL Insights for the database with the SQL Monitor solution on Log Analytics. My suggestion will be also to switch from DTU to vCore based business critical tiers.
    We’ve been running Azure SQL for our customers for years now and we have similar issues usually around big uncommitted transactions, e.g. posting journals with a few thousand of lines.

    Liked by 1 person

    Reply
  5. shankar25_12_82@yahoo.com

    Hi Mark,

    I was facing similar issue, but our intention was to use the Azure SQL to migrate the data from on-prem to BC cloud using the intelligent cloud services.

    Initially we set the DTU’s 50 for Azure SQL (database size is approx. 12GB) and thought it would work without any issues, apparently the process took almost 12 hours to complete the migration also noticed some of the tables data could not be migrated due to temp db maxed out.

    So we increased the DTU’s 100 and re-tried the process again and it took only 1 hour to complete the full data migration.

    I believe the DTU come with set temp db limit and if you go-over the limit for any reason then it just grind on it own and took ages to recover back. I believe if elastic pool option is selected apparently the DTU can move up and down depends on the work load. I never dig into details to understand why it is grinding on it own and how to recover it quickly.

    I’m not an expert in Azure SQL and I find it is a mystery black box. This might give you an idea where to look things went wrong & remove all the possibility of the root cause of the issue.

    Thanks

    Shankar

    Liked by 1 person

    Reply
  6. Mark Connolly

    Hi Mark, we get a lot of this, and yeh its a tough one!

    In recent weeks we’ve moved one or two clients off Premium tiers and onto Standard tiers to increase the number of DTUs but that will only help of your workload is more CPU than IO Intensive. As i’m sure you are aware the Premium Tiers offer much higher IOPS but less CPU for the $$. We identified a couple of workloads that benefited from moving from a P2 to an S6 and the feedback has been positive. Obviously you take a hit on some of the Business Continuity advantages but you get almost double the DTUs and £300pcm cheaper to run.

    We use a custom SQL script to return some key measures from the SQL server and monitor that through our own Monitoring Software – that gives us some good insights into what is consuming the DTU (CPU or IOPS) and gives us a more informed choice of where to put our resources.

    Having said all that it doesn’t look like its a SQL issue if moving service tiers helps. Are all service tiers in the same region? Use the Heartbeat time (ms) perfmon counter to see if you have any latency between service tier and SQL.

    Mark Connolly
    Technology Management
    mark.connolly@tecman.co.uk
    @Mav_81

    Liked by 1 person

    Reply
    1. Adriaan Van Bauwel

      Mark Connely,

      It is not just the amount of IOPS that changes when comparing the S-series to P-series in Azure DB service. Response time for said IO operations are guaranteed at 10ms read/write in the S-series, while the P-series are set to 2ms read/write response times.

      10ms read is more than good, but with a 15k SAS disk in Raid10 getting the write response times to 2ms is easy. Even the old HP Lefthand / Dell Equallogic technology on Raid-5 with 1GB network sync easily broke that 10ms mark and landed at 6 to 7ms.

      For Write-intense environments I would always advise to go for the P-series services, but for CPU-heavy or read-heavy environments an S-series system will often provide better metrics indeed.

      Adriaan Van Bauwel
      SQL Perform

      Liked by 1 person

      Reply
  7. Pingback: Introduction | Business Central Performance Tuning – Application Insights | Mark Brummel Blog | Business Central, Azure and more…

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.