SOSD is a collection of datasets for benchmarking the lookup performance of learned indexes.
SOSD currently includes eight different datasets. Each dataset consists of 200 million 64-bit unsigned integers (keys) with very few duplicates (if any):
- amzn represents book sale popularity data.
- face is an upsampled version of a Facebook user ID dataset.
- logn and norm are lognormal (0, 2) and normal distributions, respectively.
- osmc is uniformly sampled OpenStreetMap locations represented as Google S2 CellIds.
- uden is dense integers.
- uspr is uniformly distributed sparse integers.
- wiki is Wikipedia article edit timestamps.
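
To make "lookup" and "duplicates" concrete, here is a minimal sketch of the kind of operation such a benchmark measures. It assumes a lookup means a lower-bound search over sorted keys, which is an assumption of this example rather than something stated above. The helper names `lower_bound` and `count_duplicates` and the toy key list are hypothetical; the real datasets hold 200 million keys each.

```python
from bisect import bisect_left

MAX_U64 = 2**64 - 1  # largest 64-bit unsigned integer


def lower_bound(keys, query):
    """Return the index of the first key >= query in a sorted list.

    Returns len(keys) when every key is smaller than the query.
    """
    return bisect_left(keys, query)


def count_duplicates(keys):
    """Count keys in a sorted list that are equal to their predecessor."""
    return sum(1 for prev, cur in zip(keys, keys[1:]) if prev == cur)


# Toy stand-in for a dense-integer dataset (hypothetical, tiny on purpose).
keys = list(range(1000, 2000))
assert all(0 <= k <= MAX_U64 for k in keys)

print(lower_bound(keys, 1500))   # position of the first key >= 1500
print(count_duplicates(keys))    # dense distinct integers have no repeats
```

Binary search like this is the usual baseline that a learned index tries to beat: the index predicts a position from the key, then corrects the prediction with a short local search.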
In addition, there are 32-bit versions of all datasets (except osmc and wiki) with similar CDFs. We use different parameters, (0, 1), for logn in the 32-bit case to reduce the number of duplicates.
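
The effect of the lognormal parameters on duplicates can be illustrated with a small sketch. It is not the SOSD generator: the scaling rule (mapping the largest sample to the top of the 32-bit range), the sample size, the seed, and the function names are all assumptions made for this example.

```python
import random


def lognormal_keys(n, sigma, bits, seed=42):
    """Draw n lognormal(0, sigma) samples and map them to sorted integer keys.

    Assumed scaling: the largest sample maps to the top of the `bits`-bit range.
    """
    rng = random.Random(seed)
    samples = [rng.lognormvariate(0.0, sigma) for _ in range(n)]
    max_key = 2**bits - 1
    scale = max_key / max(samples)
    # Clamp to guard against floating-point rounding above max_key.
    return sorted(min(max_key, int(x * scale)) for x in samples)


def count_duplicates(keys):
    """Number of keys that repeat a value already present."""
    return len(keys) - len(set(keys))


n = 100_000
keys_wide = lognormal_keys(n, 2.0, 32)    # parameters (0, 2)
keys_narrow = lognormal_keys(n, 1.0, 32)  # parameters (0, 1)
dups_wide = count_duplicates(keys_wide)
dups_narrow = count_duplicates(keys_narrow)
print("duplicates with (0, 2):", dups_wide)
print("duplicates with (0, 1):", dups_narrow)
```

Under this scaling, the heavier tail of (0, 2) forces most samples into a small part of the 32-bit range, so more of them collide after rounding to integers; (0, 1) spreads the keys more evenly.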