Monday, February 1, 2010

Some interesting factors while reviewing performance issues on SQL Queries

There are many situations you will write queries just to get the output right without thinking about performance. But as a developer I have seen many people taking care of minimal requirement for performance like avoiding cursors, temp tables etc. Some situation your query might look alright but there will be some hidden things that affect the performance which can be solved only by detailed analysis.

We had one of the sp which was using a table that had almost 2 billion records. The sp was using this table for 6 times on left outer join. We started reviewing if there is any way to fine tune this to reduce the IO cost. One of the best ways I always found was analyzing the execution plans.

One of the DBA gave me an idea of using the same query in other way without using the table many times. I liked the idea but I felt that was good only when we used same table. It had a significant improvement when I had to take data from only one table but when linked with other tables, it had a serious problem.

I will try to explain that scenario with a sample table and show you what is the advantage and disadvantage of using the queries and how you can measure the performance.

Here are my tables
  • Application
  • ApplicationContent
  • ApplicationAttachment

Script:

CREATE TABLE [Application](
      [ApplicationID] [int] NOT NULL,
      [ApplicationStatus] [varchar](1) NULL,
      [ApplicationDate] [date] NULL,
      [ApplicationType] [varchar](10) NULL
)

CREATE TABLE [ApplicationAttachment](
      [ApplicationID] [int] NOT NULL,
      [AttachPath] [varchar](50) NOT NULL,
      [AttachType] [varchar](50) NOT NULL,
      [AttachDate] [date] NOT NULL
)

CREATE TABLE [dbo].[ApplicationContent](
      [ApplicationID] [int] NOT NULL,
      [FieldID] [int] NOT NULL,
      [FieldValue] [varchar](100) NOT NULL
)



I have put few data in these tables.

 



Now look at the table ApplicationContent which has all the data of application on each row. But in some situations you might need them as columns. The most common approach to write this as below


select App.ApplicationID, App.ApplicationStatus, App.ApplicationType,
FName.FieldValue as FirstName, LName.FieldValue as LastName, Gen.FieldValue Gender, DOB.FieldValue as DOB from [Application] App
LEFT OUTER JOIN ApplicationContent FName on FName.ApplicationID = App.ApplicationID and FName.FieldID=1
LEFT OUTER JOIN ApplicationContent LName on LName.ApplicationID = App.ApplicationID and LName.FieldID=2
LEFT OUTER JOIN ApplicationContent Gen on Gen.ApplicationID = App.ApplicationID and Gen.FieldID=3
LEFT OUTER JOIN ApplicationContent DOB on DOB.ApplicationID = App.ApplicationID and DOB.FieldID=4


This would give the output as below

 




Do you see any problem with this query? Let’s imagine the table ApplicationContent which has data for millions of applications. For each application we can have multiple fields and it may be in hundreds of fields for each application. So definitely the size of this table will be very huge. Even though we have the indexes in place, the above query still hits the same table 4 times.

We can rewrite the same query in below form

Select App.ApplicationID, App.ApplicationStatus, App.ApplicationType,
MAX(CASE AContent.FieldID WHEN 1 THEN AContent.FieldValue ELSE NULL END) AS FirstName,
MAX(CASE AContent.FieldID WHEN 2 THEN AContent.FieldValue ELSE NULL END) AS LastName,
MAX(CASE AContent.FieldID WHEN 3 THEN AContent.FieldValue ELSE NULL END) AS Gender,
MAX(CASE AContent.FieldID WHEN 4 THEN AContent.FieldValue ELSE NULL END) AS DOB
From [Application] App
LEFT OUTER JOIN ApplicationContent AContent on AContent.ApplicationID = App.ApplicationID
GROUP BY App.ApplicationID, App.ApplicationStatus, App.ApplicationType


This query hit ApplicationContent table only once so the performance is much better.
The result of second query




This is exactly the same result set as above query.
You will be sure that the second query is much better if you look at the execution plan.
Execution plan for first query:

 



Lets look at the cost of second query

 



Ok the first query resulted overall cost of 0.022 and second one with 0.0071 so the second query performance is almost more than 65% better. The result set of both the queries are same. The main aspects that you need to consider here is that the second query completely eliminated the multiple IO reads of the same table.

So can we blindly say this is the best way to write the queries? I will give you another example with the same query set which needs data from other tables and I will let you decide to answer the question

Ok. Now lets say I need to get the attachment along with this data and get the applications ordered by Application date. So you can update the query as below


Select App.ApplicationID, App.ApplicationStatus, App.ApplicationType,
MAX(CASE AContent.FieldID WHEN 1 THEN AContent.FieldValue ELSE NULL END) AS FirstName,
MAX(CASE AContent.FieldID WHEN 2 THEN AContent.FieldValue ELSE NULL END) AS LastName,
MAX(CASE AContent.FieldID WHEN 3 THEN AContent.FieldValue ELSE NULL END) AS Gender,
MAX(CASE AContent.FieldID WHEN 4 THEN AContent.FieldValue ELSE NULL END) AS DOB,
AAttach.AttachPath, AAttach.AttachType
From [Application] App
LEFT OUTER JOIN ApplicationContent AContent on AContent.ApplicationID = App.ApplicationID
LEFT OUTER JOIN ApplicationAttachment AAttach on AAttach.ApplicationID = App.ApplicationID
GROUP BY App.ApplicationID, App.ApplicationStatus, App.ApplicationType,App.ApplicationDate, AAttach.AttachPath, AAttach.AttachType
Order by App.ApplicationDate



So if you think your query is still perfect then you are wrong. Lets looks at the execution plan now



Here you can see significant increase in the cost and also you can find two “Sort” step added which is contributing 33% each

 



This step added because of the Group by clause.

Now lets consider our first query and add these columns


select App.ApplicationID, App.ApplicationStatus, App.ApplicationType,
FName.FieldValue as FirstName, LName.FieldValue as LastName, Gen.FieldValue Gender, DOB.FieldValue as DOB,
AAttach.AttachPath, AAttach.AttachType
from [Application] App
LEFT OUTER JOIN ApplicationContent FName on FName.ApplicationID = App.ApplicationID and FName.FieldID=1
LEFT OUTER JOIN ApplicationContent LName on LName.ApplicationID = App.ApplicationID and LName.FieldID=2
LEFT OUTER JOIN ApplicationContent Gen on Gen.ApplicationID = App.ApplicationID and Gen.FieldID=3
LEFT OUTER JOIN ApplicationContent DOB on DOB.ApplicationID = App.ApplicationID and DOB.FieldID=4
LEFT OUTER JOIN ApplicationAttachment AAttach on AAttach.ApplicationID = App.ApplicationID
Order by App.ApplicationDate

Here is the execution plan for this

 




You can see only one table scan is added and this has not changed much. The cost of the query is almost same as the previous query which did not have attachment table scan and the order by.

Now you try joining another table on the same queries and see. The query which we were feeling most efficient will go down drastically compared to the first one discussed. The second query will help us only when we have to get the different records from the same table and it will not help you while adding any additional tables. So now you can clearly understand when we are using Group by and use the different tables, it keeps adding extra burden on sql server. So I would suggest you to analyze the query thoroughly before concluding on performance. The query and the result may look perfect but we might miss some important factors to consider.

Overall I just wanted to say that the performance of the query depends on the situations and need thorough analysis. There are different ways to fine tune the queries.

No comments:

Post a Comment