Thursday, March 26, 2009

Change in Behavior of RAND and NEWID in SQL Server 2005

I’m writing this blog entry as a result of a customer query regarding a change
in behavior related to invocations of the RAND and NEWID functions in
SQL Server 2005.

Suppose you need to write an expression in a query invoking the RAND or
NEWID function (say for randomization purposes) and you need the
function to be invoked only once. For example, suppose you need to make
a random choice out of three options (call them ‘option one’, ‘option two’
and ‘option three’), and you write the following code:


select
case rnd
when 1 then 'option one'
when 2 then 'option two'
when 3 then 'option three'
else 'oops'
end
from (select cast(rand()*3 as int)%3 + 1 as rnd) as d;
Remember that the RAND function returns a float value in the range 0
through 1, inclusive (0 and 1 are possible result values). Casting the
expression rand()*3 as an integer will truncate the fraction part of the value.

Even though the probability to get exactly 3 back after casting is very low, it
is still a possibility. This is why I used %3 (modulo 3)—to ensure the
expression will return an integer in the range 0 through 2. By adding 1, the
expression is guaranteed to return an integer in the range 1 through 3.

Internally, SQL Server rearranges a simple form of a CASE expression
such as the above to the searched form, namely, it expands the WHEN
clauses to incorporate the full predicates. As an example, the above CASE
expression is internally evaluated as follows:

select
case
when rnd = 1 then 'option one'
when rnd = 2 then 'option two'
when rnd = 3 then 'option three'
else 'oops'
end
from (select cast(rand()*3 as int)%3 + 1 as rnd) as d;
SQL Server 2000 evaluates each reference to the alias rnd assigned in the
table expression d separately; therefore, it actually invokes the RAND
function three times. You realize that this means that in SQL Server 2000 it
is possible that none of the WHEN clauses of the CASE expression will
evaluate to TRUE, and you might end up getting ‘oops’ back. Try running this
code several times in SQL Server 2000 and you will be able to verify this.

SQL Server 2005 changes the behavior of outer references to aliases
assigned in table expressions, where the aliased expression invokes the
RAND or NEWID function. SQL Server 2005 will invoke the function only
once, therefore it is guaranteed that one of the WHEN clauses in the above
query will evaluate to TRUE, and you will never get ‘oops’ back.

You can test this with a similar example that invokes the NEWID function.
To return a random value in the range 1 through n, instead of using the
expression:

cast(rand()*n as int)%n + 1
You can use the expression:

abs(checksum(newid()))%n + 1
CHECKSUM(NEWID()) returns a random integer. Applying ABS on top
ensures you get a nonnegative integer. Applying %n (modulo n) ensures that
the value is >= 0 and < n. By adding 1 you ensure that the value is >= 1 and
<= n. In short, this is just another way to get a random integer value in the
range 1 through n. So the above query can be rewritten as follows:

select
case rnd
when 1 then 'option one'
when 2 then 'option two'
when 3 then 'option three'
else 'oops'
end
from (select abs(checksum(newid()))%3 + 1 as rnd) as d;
Try running this query several times in both SQL Server 2000 and in SQL
Server 2005. In SQL Server 2000 you will occasionally get ‘oops’ back,
while in SQL Server 2005 you will never get ‘oops’ back. This is due to the
same change in behavior I described earlier.

A simple way to test the difference in behavior between the versions is by
running the following code:

select rnd, rnd
from (select rand() as rnd) as d;
Run it in SQL Server 2000 and you will get two different invocations of
RAND, hence most probably two different values back. Run it in SQL
Server 2005, and you’re guaranteed to get the same value back twice since
RAND will be invoked only once.

This change is described in SQL Server 2005’s Books Online under the
section “Behavior Changes to Database Engine Features in SQL Server
2005,” but it’s very easy to overlook it.

Note that the change in behavior has nothing to do with multiple invocations
of RAND or NEWID in the same query, as opposed to being invoked once
in a table expression and then referenced multiple times in the outer query.
For example, the following code can return ‘oops’ in both SQL Server 2000
and in SQL Server 2005:

select
case abs(checksum(newid()))%3 + 1
when 1 then 'option one'
when 2 then 'option two'
when 3 then 'option three'
else 'oops'
end;
In SQL Server 2005 you can now use a table expression as demonstrated
earlier as a workaround. In SQL Server 2000 (and also in 2005), you can
use a variable as a workaround:

declare @rnd as int;
set @rnd = abs(checksum(newid()))%3 + 1;

select
case @rnd
when 1 then 'option one'
when 2 then 'option two'
when 3 then 'option three'
else 'oops'
end;
You can experience similar problems in less obvious scenarios; for example,
consider the predicate:
where abs(checksum(newid()))%3 + 1 between col1 and col2
Also here, the predicate is expanded internally and NEWID is invoked multiple times:

where col1 >= abs(checksum(newid()))%3 + 1
and col2 <= abs(checksum(newid()))%3 + 1
So you end up getting two independent invocations of the NEWID function.
If you want to rely on a single invocation, as demonstrated earlier in SQL
Server 2005 you can use a table expression, and in both versions you can
store the result of the expression in a variable and then refer to the variable.

As the last example of problematic use of RAND and NEWID, I recently
saw code written by a programmer that was supposed to populate a
temporary table with a set of unique random integers (say, 50 random
integers in the range 1 through 100). The code looked similar to the
following:

set nocount on;
create table #random_values(rnd int not null);
create index idx1 on #random_values(rnd);

declare @i as int;
set @i = 1;
while @i <= 50
begin
insert into #random_values
select abs(checksum(newid()))%100 + 1
where abs(checksum(newid()))%100 + 1
not in (select rnd from #random_values);
if @@rowcount = 1 set @i = @i + 1;
end

select rnd from #random_values order by rnd;

drop table #random_values;
You realize that the two invocations of the NEWID function (in the
SELECT and WHERE clauses) are independent of each other. Therefore,
this code may very well populate the temporary table with duplicate values
in all versions of SQL Server. For example, here’s a subset of the output I
got after running this code:

rnd
-----------
2
3
4
4
...
I’m not saying that this particular solution is the optimal way to get a set of
unique random values rather just explaining the logical problems with this
solution and the workarounds.

In SQL Server 2005 you can now use a table expression as a workaround,
revising the SELECT query to the following:

...
insert into #random_values
select rnd
from (select abs(checksum(newid()))%100 + 1 as rnd) as d
where rnd not in (select rnd from #random_values);
...
In both SQL Server 2000 and 2005 you can use a variable as a workaround:

...
set @rnd = abs(checksum(newid()))%100 + 1;
insert into #random_values
select @rnd
where @rnd not in (select rnd from #random_values);
...
To summarize, care should be taken when using functions such as RAND
and NEWID in queries. If you need to rely on a single invocation of the
function, SQL Server 2005 will give you the desired behavior as long as you
encapsulate the invocation of the function in a table expression and then in
the outer query refer to the alias of the expression as many times as you like.
In SQL Server 2000 (or any other version), you can use a variable as a
workaround.

Thanks
Srini Reddy

Change in Behavior of RAND and NEWID in SQL Server 2005

I’m writing this blog entry as a result of a customer query regarding a change
in behavior related to invocations of the RAND and NEWID functions in
SQL Server 2005.

Suppose you need to write an expression in a query invoking the RAND or
NEWID function (say for randomization purposes) and you need the
function to be invoked only once. For example, suppose you need to make
a random choice out of three options (call them ‘option one’, ‘option two’
and ‘option three’), and you write the following code:


select
case rnd
when 1 then 'option one'
when 2 then 'option two'
when 3 then 'option three'
else 'oops'
end
from (select cast(rand()*3 as int)%3 + 1 as rnd) as d;
Remember that the RAND function returns a float value in the range 0
through 1, inclusive (0 and 1 are possible result values). Casting the
expression rand()*3 as an integer will truncate the fraction part of the value.

Even though the probability to get exactly 3 back after casting is very low, it
is still a possibility. This is why I used %3 (modulo 3)—to ensure the
expression will return an integer in the range 0 through 2. By adding 1, the
expression is guaranteed to return an integer in the range 1 through 3.

Internally, SQL Server rearranges a simple form of a CASE expression
such as the above to the searched form, namely, it expands the WHEN
clauses to incorporate the full predicates. As an example, the above CASE
expression is internally evaluated as follows:

select
case
when rnd = 1 then 'option one'
when rnd = 2 then 'option two'
when rnd = 3 then 'option three'
else 'oops'
end
from (select cast(rand()*3 as int)%3 + 1 as rnd) as d;
SQL Server 2000 evaluates each reference to the alias rnd assigned in the
table expression d separately; therefore, it actually invokes the RAND
function three times. You realize that this means that in SQL Server 2000 it
is possible that none of the WHEN clauses of the CASE expression will
evaluate to TRUE, and you might end up getting ‘oops’ back. Try running this
code several times in SQL Server 2000 and you will be able to verify this.

SQL Server 2005 changes the behavior of outer references to aliases
assigned in table expressions, where the aliased expression invokes the
RAND or NEWID function. SQL Server 2005 will invoke the function only
once, therefore it is guaranteed that one of the WHEN clauses in the above
query will evaluate to TRUE, and you will never get ‘oops’ back.

You can test this with a similar example that invokes the NEWID function.
To return a random value in the range 1 through n, instead of using the
expression:

cast(rand()*n as int)%n + 1
You can use the expression:

abs(checksum(newid()))%n + 1
CHECKSUM(NEWID()) returns a random integer. Applying ABS on top
ensures you get a nonnegative integer. Applying %n (modulo n) ensures that
the value is >= 0 and < n. By adding 1 you ensure that the value is >= 1 and
<= n. In short, this is just another way to get a random integer value in the
range 1 through n. So the above query can be rewritten as follows:

select
case rnd
when 1 then 'option one'
when 2 then 'option two'
when 3 then 'option three'
else 'oops'
end
from (select abs(checksum(newid()))%3 + 1 as rnd) as d;
Try running this query several times in both SQL Server 2000 and in SQL
Server 2005. In SQL Server 2000 you will occasionally get ‘oops’ back,
while in SQL Server 2005 you will never get ‘oops’ back. This is due to the
same change in behavior I described earlier.

A simple way to test the difference in behavior between the versions is by
running the following code:

select rnd, rnd
from (select rand() as rnd) as d;
Run it in SQL Server 2000 and you will get two different invocations of
RAND, hence most probably two different values back. Run it in SQL
Server 2005, and you’re guaranteed to get the same value back twice since
RAND will be invoked only once.

This change is described in SQL Server 2005’s Books Online under the
section “Behavior Changes to Database Engine Features in SQL Server
2005,” but it’s very easy to overlook it.

Note that the change in behavior has nothing to do with multiple invocations
of RAND or NEWID in the same query, as opposed to being invoked once
in a table expression and then referenced multiple times in the outer query.
For example, the following code can return ‘oops’ in both SQL Server 2000
and in SQL Server 2005:

select
case abs(checksum(newid()))%3 + 1
when 1 then 'option one'
when 2 then 'option two'
when 3 then 'option three'
else 'oops'
end;
In SQL Server 2005 you can now use a table expression as demonstrated
earlier as a workaround. In SQL Server 2000 (and also in 2005), you can
use a variable as a workaround:

declare @rnd as int;
set @rnd = abs(checksum(newid()))%3 + 1;

select
case @rnd
when 1 then 'option one'
when 2 then 'option two'
when 3 then 'option three'
else 'oops'
end;
You can experience similar problems in less obvious scenarios; for example,
consider the predicate:
where abs(checksum(newid()))%3 + 1 between col1 and col2
Also here, the predicate is expanded internally and NEWID is invoked multiple times:

where col1 >= abs(checksum(newid()))%3 + 1
and col2 <= abs(checksum(newid()))%3 + 1
So you end up getting two independent invocations of the NEWID function.
If you want to rely on a single invocation, as demonstrated earlier in SQL
Server 2005 you can use a table expression, and in both versions you can
store the result of the expression in a variable and then refer to the variable.

As the last example of problematic use of RAND and NEWID, I recently
saw code written by a programmer that was supposed to populate a
temporary table with a set of unique random integers (say, 50 random
integers in the range 1 through 100). The code looked similar to the
following:

set nocount on;
create table #random_values(rnd int not null);
create index idx1 on #random_values(rnd);

declare @i as int;
set @i = 1;
while @i <= 50
begin
insert into #random_values
select abs(checksum(newid()))%100 + 1
where abs(checksum(newid()))%100 + 1
not in (select rnd from #random_values);
if @@rowcount = 1 set @i = @i + 1;
end

select rnd from #random_values order by rnd;

drop table #random_values;
You realize that the two invocations of the NEWID function (in the
SELECT and WHERE clauses) are independent of each other. Therefore,
this code may very well populate the temporary table with duplicate values
in all versions of SQL Server. For example, here’s a subset of the output I
got after running this code:

rnd
-----------
2
3
4
4
...
I’m not saying that this particular solution is the optimal way to get a set of
unique random values rather just explaining the logical problems with this
solution and the workarounds.

In SQL Server 2005 you can now use a table expression as a workaround,
revising the SELECT query to the following:

...
insert into #random_values
select rnd
from (select abs(checksum(newid()))%100 + 1 as rnd) as d
where rnd not in (select rnd from #random_values);
...
In both SQL Server 2000 and 2005 you can use a variable as a workaround:

...
set @rnd = abs(checksum(newid()))%100 + 1;
insert into #random_values
select @rnd
where @rnd not in (select rnd from #random_values);
...
To summarize, care should be taken when using functions such as RAND
and NEWID in queries. If you need to rely on a single invocation of the
function, SQL Server 2005 will give you the desired behavior as long as you
encapsulate the invocation of the function in a table expression and then in
the outer query refer to the alias of the expression as many times as you like.
In SQL Server 2000 (or any other version), you can use a variable as a
workaround.

Thanks
Srini Reddy