Validating UTF-8 In Less Than One Instruction Per Byte

6 Oct 2020 · John Keiser, Daniel Lemire

The majority of text is stored in UTF-8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available SIMD instructions...
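
To make "validated on ingestion" concrete, here is a minimal scalar sketch of what a UTF-8 validator has to check. It is not the lookup algorithm from the paper, which uses SIMD instructions; it is a plain byte-by-byte reference written for illustration. The function name `is_valid_utf8` is hypothetical, and the byte ranges follow the well-formed byte sequence table in the Unicode Standard.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Byte-by-byte UTF-8 validation (illustrative scalar baseline, not SIMD)."""
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:
            # ASCII: a single byte, always valid.
            i += 1
            continue
        # Pick the number of continuation bytes and the allowed range of the
        # first continuation byte. The narrowed ranges reject overlong
        # encodings, UTF-16 surrogates, and code points above U+10FFFF.
        if 0xC2 <= b0 <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b0 == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF  # excludes overlong 3-byte forms
        elif 0xE1 <= b0 <= 0xEC or 0xEE <= b0 <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xED:
            need, lo, hi = 2, 0x80, 0x9F  # excludes surrogates U+D800..U+DFFF
        elif b0 == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF  # excludes overlong 4-byte forms
        elif 0xF1 <= b0 <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F  # excludes values above U+10FFFF
        else:
            # 0x80..0xC1 and 0xF5..0xFF can never start a valid sequence.
            return False
        if i + need >= n:
            return False  # sequence is truncated at the end of the input
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

A loop like this branches on every non-ASCII byte, which is the kind of per-byte work a SIMD approach aims to avoid.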
