fastvalidate-utf-8

Header-only library to validate UTF-8 strings at high speeds (using SIMD instructions).

  • Owner: lemire/fastvalidate-utf-8
  • Platforms: Linux, Mac, Windows
  • License: Apache License 2.0

Most strings online are Unicode text in the UTF-8 encoding. Validating strings
quickly before accepting them is important.

This is a header-only C library to validate UTF-8 strings at high speeds using SIMD instructions.
Specifically, it expects an x64 processor (capable of SSE instructions). It does not
currently work on ARM processors.

A modified version of this code improved the performance of Scylla.

Quick usage:

make
./unit
./benchmark

Code usage:

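  #include "simdutf8check.h"

  const char * mystring = ...;  /* pointer to the bytes to check */
  /* thestringlength is the length of the input in bytes */
  bool is_it_valid = validate_utf8_fast(mystring, thestringlength);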

It should be able to validate strings using less than 1 cycle per input byte.
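
To make concrete what a validator has to reject, here is a minimal scalar sketch. It is not the library's SIMD implementation, and the name validate_utf8_reference is hypothetical. It applies the standard UTF-8 well-formedness rules (correct continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF) one byte at a time, so it should be far slower than validate_utf8_fast, but it can serve as a reference when testing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference validator; NOT part of simdutf8check.h. */
bool validate_utf8_reference(const char *src, size_t len) {
    assert(src != NULL || len == 0);
    const uint8_t *p = (const uint8_t *)src;
    size_t i = 0;
    while (i < len) {
        uint8_t b = p[i];
        size_t n;     /* number of continuation bytes that must follow */
        uint32_t cp;  /* code point being decoded */
        uint32_t min; /* smallest code point allowed for this length */
        if (b < 0x80) { i++; continue; }  /* ASCII byte */
        else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
        else return false;  /* stray continuation byte or invalid lead byte */

        if (len - i <= n) return false;  /* sequence is truncated */
        for (size_t k = 1; k <= n; k++) {
            uint8_t c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;  /* not a continuation byte */
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }
        if (cp < min) return false;                      /* overlong encoding */
        if (cp > 0x10FFFF) return false;                 /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* UTF-16 surrogate */
        i += n + 1;
    }
    return true;
}
```

One possible use is to run it side by side with validate_utf8_fast on the same inputs and compare the two results.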

If you expect your strings to be plain ASCII, you can check whether that is the case in less than 0.1 cycles per input byte using the validate_ascii_fast function found in the simdasciicheck.h header. For UTF-8, there are even faster functions such as validate_utf8_fast_avx.
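
The ASCII check is cheap because a string is ASCII exactly when no byte has its high bit set, so the bytes can be OR-ed together and tested once at the end. The sketch below illustrates that idea portably, eight bytes at a time, without SIMD intrinsics. It is not the code in simdasciicheck.h, and the name is_ascii_wordwise is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper; NOT part of simdasciicheck.h. */
bool is_ascii_wordwise(const char *src, size_t len) {
    assert(src != NULL || len == 0);
    uint64_t acc = 0;  /* OR of all full 8-byte words */
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, src + i, sizeof w);  /* safe unaligned load */
        acc |= w;
    }
    uint8_t tail = 0;  /* OR of the remaining 0 to 7 bytes */
    for (; i < len; i++) {
        tail |= (uint8_t)src[i];
    }
    /* Any set high bit means a non-ASCII byte was seen. */
    return (acc & UINT64_C(0x8080808080808080)) == 0 && (tail & 0x80) == 0;
}
```

A SIMD version follows the same pattern with wider registers, which is presumably why the vectorized functions in the benchmark below come out ahead.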

Command-line tool

Adam Retter maintains a useful command-line tool related to this library.

Experimental results

On a Skylake processor, using GCC, we get:

$ ./benchmark
string size = 65536
We are feeding ascii so it is always going to be ok.
It favors schemes that skip ASCII characters.
validate_utf8(data, N)                                          :  1.256 cycles per operation (best)     1.316 cycles per operation (avg)
validate_utf8_fast(data, N)                                     :  0.704 cycles per operation (best)     0.706 cycles per operation (avg)
validate_utf8_fast_avx(data, N)                                 :  0.450 cycles per operation (best)     0.452 cycles per operation (avg)
validate_utf8_fast_avx_asciipath(data, N)                       :  0.088 cycles per operation (best)     0.091 cycles per operation (avg)
validate_ascii_fast(data, N)                                    :  0.082 cycles per operation (best)     0.084 cycles per operation (avg)
validate_ascii_fast_avx(data, N)                                :  0.050 cycles per operation (best)     0.074 cycles per operation (avg)
validate_ascii_nosimd(data, N)                                  :  0.104 cycles per operation (best)     0.106 cycles per operation (avg)
validate_ascii_nointrin(data, N)                                :  0.068 cycles per operation (best)     0.088 cycles per operation (avg)
validate_utf8_fast(data, N)                                      :  0.701 cycles per operation (best)     0.703 cycles per operation (avg)  (linux counter)
validate_ascii_fast(data, N)                                     :  0.083 cycles per operation (best)     0.085 cycles per operation (avg)  (linux counter)


string size (approx) = 65536
Producing random-looking UTF-8
validate_utf8(data, actualN)                                    :  10.967 cycles per operation (best)     11.005 cycles per operation (avg)
validate_utf8_fast(data, actualN)                               :  0.702 cycles per operation (best)     0.705 cycles per operation (avg)
validate_utf8_fast_avx(data, actualN)                           :  0.448 cycles per operation (best)     0.485 cycles per operation (avg)
validate_utf8_fast_avx_asciipath(data, actualN)                 :  0.480 cycles per operation (best)     0.594 cycles per operation (avg)

Thus, after rounding, it takes 0.7 cycles per input byte to validate UTF-8 strings with validate_utf8_fast, and about 0.45 cycles per input byte with validate_utf8_fast_avx.
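
A cycles-per-byte figure can be turned into a throughput estimate by dividing the clock frequency by it. The helper below is a small sketch of that arithmetic. The name throughput_gb_per_s is hypothetical, and the 3.5 GHz clock in the usage note is an assumed value for illustration, not one reported by the benchmark.

```c
#include <assert.h>

/* Hypothetical helper: estimated throughput in GB/s (10^9 bytes per second).
 * bytes per second = (cycles per second) / (cycles per byte) */
double throughput_gb_per_s(double cycles_per_byte, double clock_ghz) {
    assert(cycles_per_byte > 0.0 && clock_ghz > 0.0);
    return clock_ghz / cycles_per_byte;
}
```

For example, throughput_gb_per_s(0.7, 3.5) evaluates to roughly 5 GB/s under the assumed clock. Real throughput also depends on memory bandwidth and frequency scaling, so treat the result as an upper-bound estimate.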

In Go

There is an assembly wrapper in Go by Stuart Carnie.

ARM Neon and SSE4

See "Fast UTF-8 validation with range algorithm (NEON+SSE4)".

License

This library is distributed under the terms of any of the following
licenses, at your option:

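Key metrics

Overview
Name and owner: lemire/fastvalidate-utf-8
Primary language: C
Languages: Makefile (language count: 2)
Platforms: Linux, Mac, Windows
License: Apache License 2.0
Owner activity
Created: 2018-05-15 23:51:20
Pushed: 2024-03-13 14:40:25
Last commit: 2024-03-13 10:40:25
Releases: 0
User engagement
Stars: 303
Watchers: 22
Forks: 26
Commits: 93
Issues: 13
Open issues: 0
Pull requests: 14
Open pull requests: 0
Closed pull requests: 0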