Monday, March 19, 2012

Aggregation Queries with > 80.000.000 rows

Hi,
We have a logging database with a table "Logs" that hold 80.000.000 log rows
(about 1 year of application logs). We need to analyze this data in for
variable time periods, so we cannot simply split the table. The table
contains a column "LogID" (Identity INT and the only member of the clustered
PK) and a column "Time" (datetime there exists an index for this column).
The query
SELECT *
FROM Logs WITH (nolock)
WHERE LogID > 110385284
takes 2 seconds and returns about 3000 rows.
The query
SELECT MAX(Time), MIN(Time)
FROM Logs WITH (nolock)
WHERE LogID > 110385284
takes more than a minute (I've cancelled the query after a minute).
How can this be? The query analyzer shows me a really simple query plan for
the first query using the PK. For the second query, the query plan shows me
no usage of the PK, but usage of the index for the column "Time" (two time,
one for min and one for max).
Even more strange is this query:
SELECT MAX(Logs1.Time), MIN(Logs1.Time)
FROM Logs Logs1 WITH (nolock)
INNER JOIN Logs Logs2 WITH (nolock) ON Logs1.LogID = Logs2.LogID
WHERE Logs1.LogID > 110385284
the query finishes in less than a second, and is using the PK.
Also specifying the PK as an index hint solves the problem, but I don't like
to specify query hints, because this may prevent an updated query processor
to use a better query plan.
Another problem is that we also have less expirienced programmers that need
to deal with such tables, so at the moment I need to review each and every
"slow" query and insert query hints. Will there be a fix that "optimizes" the
query engine of sql server?
CU,
SvenMatzen
Yes, SQL Server needs to do some extra work when you specify an aggregate
functions.
Look, you may want to use INDEX hint to dictate the query optimizer to use
PK,but generally speaking it's not a good idea because the Optimizer is
smart enough to create much more efficient execution plan.
Also,you may create a clustered index on Time column and see how it does the
work.
"Matzen" <Matzen@.discussions.microsoft.com> wrote in message
news:ADAE2C5C-C301-4F72-966C-C6AB4EAB9FF7@.microsoft.com...
> Hi,
> We have a logging database with a table "Logs" that hold 80.000.000 log
rows
> (about 1 year of application logs). We need to analyze this data in for
> variable time periods, so we cannot simply split the table. The table
> contains a column "LogID" (Identity INT and the only member of the
clustered
> PK) and a column "Time" (datetime there exists an index for this column).
> The query
> SELECT *
> FROM Logs WITH (nolock)
> WHERE LogID > 110385284
> takes 2 seconds and returns about 3000 rows.
> The query
> SELECT MAX(Time), MIN(Time)
> FROM Logs WITH (nolock)
> WHERE LogID > 110385284
> takes more than a minute (I've cancelled the query after a minute).
> How can this be? The query analyzer shows me a really simple query plan
for
> the first query using the PK. For the second query, the query plan shows
me
> no usage of the PK, but usage of the index for the column "Time" (two
time,
> one for min and one for max).
> Even more strange is this query:
> SELECT MAX(Logs1.Time), MIN(Logs1.Time)
> FROM Logs Logs1 WITH (nolock)
> INNER JOIN Logs Logs2 WITH (nolock) ON Logs1.LogID = Logs2.LogID
> WHERE Logs1.LogID > 110385284
> the query finishes in less than a second, and is using the PK.
> Also specifying the PK as an index hint solves the problem, but I don't
like
> to specify query hints, because this may prevent an updated query
processor
> to use a better query plan.
> Another problem is that we also have less expirienced programmers that
need
> to deal with such tables, so at the moment I need to review each and every
> "slow" query and insert query hints. Will there be a fix that "optimizes"
the
> query engine of sql server?
> CU,
> Sven|||Hi Uri,
As you can see in my post, the query including the hint performs at least 60
times better than without it (without > 1 minute, with hint less than a
second).
With the clustered index ... well 80 million rows are currently clustered by
ID, resorting them to sort by another column may take a while ... may be next
year.
My "problem" is that I cannot belive that a query optimizer like the one of
sql server does not recognize that the "where" statement reduces the amount
of data to be processed from 80 million to 3000 (table statistics are up to
date), because this analysis is not really compex:
1) the "where" does contain an identity field, that is equal to the
clustered PK
2) the condition in the where stament eleminates > 90% of the data to be
processed
This must be a bug, so I assume this to be removed in the next service pack.
CU,
Sven
"Uri Dimant" wrote:
> Matzen
> Yes, SQL Server needs to do some extra work when you specify an aggregate
> functions.
> Look, you may want to use INDEX hint to dictate the query optimizer to use
> PK,but generally speaking it's not a good idea because the Optimizer is
> smart enough to create much more efficient execution plan.
> Also,you may create a clustered index on Time column and see how it does the
> work.
>
>
>
>
> "Matzen" <Matzen@.discussions.microsoft.com> wrote in message
> news:ADAE2C5C-C301-4F72-966C-C6AB4EAB9FF7@.microsoft.com...
> > Hi,
> >
> > We have a logging database with a table "Logs" that hold 80.000.000 log
> rows
> > (about 1 year of application logs). We need to analyze this data in for
> > variable time periods, so we cannot simply split the table. The table
> > contains a column "LogID" (Identity INT and the only member of the
> clustered
> > PK) and a column "Time" (datetime there exists an index for this column).
> >
> > The query
> > SELECT *
> > FROM Logs WITH (nolock)
> > WHERE LogID > 110385284
> > takes 2 seconds and returns about 3000 rows.
> >
> > The query
> > SELECT MAX(Time), MIN(Time)
> > FROM Logs WITH (nolock)
> > WHERE LogID > 110385284
> > takes more than a minute (I've cancelled the query after a minute).
> >
> > How can this be? The query analyzer shows me a really simple query plan
> for
> > the first query using the PK. For the second query, the query plan shows
> me
> > no usage of the PK, but usage of the index for the column "Time" (two
> time,
> > one for min and one for max).
> > Even more strange is this query:
> > SELECT MAX(Logs1.Time), MIN(Logs1.Time)
> > FROM Logs Logs1 WITH (nolock)
> > INNER JOIN Logs Logs2 WITH (nolock) ON Logs1.LogID = Logs2.LogID
> > WHERE Logs1.LogID > 110385284
> > the query finishes in less than a second, and is using the PK.
> >
> > Also specifying the PK as an index hint solves the problem, but I don't
> like
> > to specify query hints, because this may prevent an updated query
> processor
> > to use a better query plan.
> >
> > Another problem is that we also have less expirienced programmers that
> need
> > to deal with such tables, so at the moment I need to review each and every
> > "slow" query and insert query hints. Will there be a fix that "optimizes"
> the
> > query engine of sql server?
> >
> > CU,
> > Sven
>
>|||Matzen wrote:
> Hi Uri,
> As you can see in my post, the query including the hint performs at
> least 60 times better than without it (without > 1 minute, with hint
> less than a second).
> With the clustered index ... well 80 million rows are currently
> clustered by ID, resorting them to sort by another column may take a
> while ... may be next year.
> My "problem" is that I cannot belive that a query optimizer like the
> one of sql server does not recognize that the "where" statement
> reduces the amount of data to be processed from 80 million to 3000
> (table statistics are up to date), because this analysis is not
> really compex: 1) the "where" does contain an identity field, that is
> equal to the clustered PK
> 2) the condition in the where stament eleminates > 90% of the data to
> be processed
> This must be a bug, so I assume this to be removed in the next
> service pack.
> CU,
> Sven
>
The problem you are seeing is not really a bug, but a known issue with
the query optimizer. Many times SQL Server decides that based on the
number of rows likely to be returned from a query that using an index
(index seek + bookmark lookup for a non-clustered index or just a
clustered index seek for a clustered index) is actually more work than
scanning the table.
It's my understanding that SQL Server fails over to a table scan /
clustered index scan operation too soon in some cases. Your case just
may be one of those. I would keep the index hint and just keep an eye on
the query.
David Gugick
Imceda Software
www.imceda.com

No comments:

Post a Comment