ORDER BY RAND() seems to be less than random

Go To StackoverFlow.com

5

I have a fairly simple SQL (MySQL):

SELECT foo FROM bar ORDER BY rank, RAND()

I notice that when I refresh the results, the randomness is suspiciously weak.

In the sample data at the moment there are six results with equal rank (integer zero). There are lots of tests for randomness but here is a simple one to do by hand: when run twice, the first result should be the same in both runs about one sixth of the time. This is certainly not happening, the leading result is the same at least a third of the time.

I want a uniform distribution over the permutations. I'm not an expert statistician but I'm pretty sure ORDER BY RAND() should achieve this. What am I missing?

With MySQL, SELECT rand(), rand() shows two different numbers, so I don't buy the "once per query" explanation

2012-04-04 18:10
by spraff
Which platform? Also: please supply a code snippet, demonstrating this behaviour - wildplasser 2012-04-04 18:15


3

RAND() is only executed once per query. You can verify this by looking at the result set.

If you're trying to get a randomized order, you should be using either NEWID() or CHECKSUM(NEWID()).

WITH T AS ( -- example using RAND()
  SELECT 'Me' Name UNION SELECT 'You' UNION SELECT 'Another'
)
SELECT Name, RAND()
FROM T;

WITH T AS ( -- example using just NEWID()
  SELECT 'Me' Name UNION SELECT 'You' UNION SELECT 'Another'
)
SELECT Name, NEWID()
FROM T;

WITH T AS ( -- example getting the CHECKSUM() of NEWID()
  SELECT 'Me' Name UNION SELECT 'You' UNION SELECT 'Another'
)
SELECT Name, CHECKSUM(NEWID())
FROM T;
2012-04-04 18:14
by Yuck
You're assuming SQL Server? I think they must be on MySQL or they would be getting the same ordering every time - Martin Smith 2012-04-04 18:17
@MartinSmith I added the link, not Yuck, so it's my assumption - JNK 2012-04-04 18:18
@MartinSmith To be fair, it's my assumption as well. I am, after all, referring to built-in functions unique to MSSQL - Yuck 2012-04-04 18:19
Do not use NEWID as a source of randomness. NEWID is for generating unique values. Unique is not the same as random! There's no guarantee that it's random (it doesn't have to be, it could just be sequentially allocated and still be unique). There's no guarantee that CHECKSUM(NEWID) is random either. No, no, no a thousand times no - jason 2012-04-04 18:21
@Jason That's why it's wrapped with CHECKSUM() which produces a pseudo-random integer. It may not be cryptographically strong, but it certainly results in a unique ordering each time the query is run - Yuck 2012-04-04 18:22
No. CHECKSUM is for balancing hash tables. There is no guarantee it's random or pseudo-random. No, no, no, a million times no - jason 2012-04-04 18:23
@Jason Yes yes yes. This is an accepted practice when working specifically with MS SQL Server. http://stackoverflow.com/questions/4979799/order-by-newid-how-does-it-work See Martin Smith's answer on the subject, and countless other articles advocating the same - Yuck 2012-04-04 18:27
@Yuck: I don't care. Accepted doesn't mean right. They are wrong. It's an incredibly common misunderstanding. Unique is not random. Period. Checksum is not random. Period - jason 2012-04-04 18:29
@Jason - So what would you use? (In 2008 CRYPT_GEN_RANDOM is available but assume previous versions - Martin Smith 2012-04-04 18:29
@Jason If you want guaranteed randomness you should use a custom function, but this solution is close enough for most scenario - JNK 2012-04-04 18:30


1

The RAND() can not be refresh for each row. A possible solution might be:

SELECT foo FROM bar ORDER BY rank, CHECKSUM(NEWID())
2012-04-04 18:15
by Arion
I'm pretty sure you can't do two ORDER BY clauses.. - JNK 2012-04-04 18:15
Oops fixed it:P.. thx for the point : - Arion 2012-04-04 18:16
No! NEWID for unique values, unique is not the same as random. CHECKSUM is for balancing hash tables, not for generating random values. There is no guarantee that any of this is random. No, no, no - jason 2012-04-04 18:24
Ads