Hashset equivalent in SQL Server - 【StackMirror】|sql|sql-server|hash|cluster-computing|hashset

I want to create a large table (about 45 billion rows) that is always accessed by a unique key.

Outside of the DB, the best structure to hold this is a Dictionary or a HashSet, but of course due to the size of data, it's not possible to do this outside of the database.

Does SQL Server provide a structure that's optimized for key-value access? I understand that a clustered key is very fast, but still it's an index and therefore there will be some additional disk reads associated with traversing index pages. What I would like to get from SQL Server is a "native" structure that stores data as key-value pairs and then makes it possible to access values based on keys.

In other words, my question is how to store in SQL Server 45 billion rows and efficiently access them WITHOUT having an index, clustered or non-clustered, because reading the index non-leaf pages may result in substantial IO, and since each value can be accessed by a unique key, it should be possible to have a structure where the hash of a key resolves into a physical location of the value. To get 1 value, we would need to do 1 read (unless there are hash collisions).

(an equivalent in Oracle is Hash Cluster)

Thanks for your help.

2012-04-04 18:02
by user1044169

No such thing in SQL server. Your only option is an index. If you're going to be requesting all columns for a given key, you should use a clustered index. If you're only going to be requesting a subset, you should use a non-clustered index including only the columns you want like this:

  create index IX_MyBigTable on MyBigTable(keyColumn) include (col1, col2, col3youneed);

This will be pretty efficient.

2012-04-04 18:06
by John Gibb

Traversing a b-tree probably isn't that much less efficient than generating a hash value, and the reason Clustered indexes are so important in SQL Server is that the data rows are stored at the leaf level. So the read that hits the b-tree leaf for your index key also reads the data row for that ke - Rick 2012-04-04 18:17

This answer is correct. The intermediate index levels will be small and fully cached. Basically, any get by PK into such a table will require at most one IO. In contrast to using an on-disk hash-table you will even benefit from key locality - usr 2012-04-04 20:34

Random suggestion - if you're really, truly, 100% only doing key-value lookups, and never ANY type of relational queries, maybe SQL isn't your answer? Check out Redis - it's incomprehensibly fast, transactional, consistent, persistent to disk, easy to set up - sounds like it may be a better fit. http://redis.i - John Gibb 2012-04-04 20:46

Thanks for all your feedback. I am going to do more testing with a clustered PK first - user1044169 2012-04-04 21:48

According to my benchmarks, the best approach is to create a hash column for the key. Details.

2013-01-29 11:09
by user2019014