fastvalidate-utf-8

Header-only library to validate UTF-8 strings at high speeds (using SIMD instructions).

  • Owner: lemire/fastvalidate-utf-8
  • Platforms: Linux, Mac, Windows
  • License: Apache License 2.0


Most strings online are Unicode text in the UTF-8 encoding. It is important to
validate such strings quickly before accepting them.

This is a header-only C library to validate UTF-8 strings at high speeds using SIMD instructions.
Specifically, it expects an x64 processor (capable of SSE instructions). It does not
currently work on ARM processors.

A modified version of this code improved the performance of Scylla.

Quick usage:

make
./unit
./benchmark

Code usage:

  #include "simdutf8check.h"

  char * mystring = ...
  bool is_it_valid = validate_utf8_fast(mystring, thestringlength);

It should be able to validate strings using less than 1 cycle per input byte.
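
To make concrete what a validator has to accept and reject, here is a minimal scalar sketch with the same (pointer, length) calling shape as the call above. It is not the library's SIMD code, and the name validate_utf8_scalar_sketch is hypothetical. Something like it could serve as a slow fallback on processors where the SIMD header does not work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of simdutf8check.h: a plain scalar
 * UTF-8 check that decodes one sequence at a time. */
static bool validate_utf8_scalar_sketch(const char *src, size_t len) {
    assert(src != NULL || len == 0);
    const uint8_t *p = (const uint8_t *)src;
    size_t i = 0;
    while (i < len) {
        uint8_t b = p[i];
        size_t need;   /* number of continuation bytes expected */
        uint32_t cp;   /* code point being assembled */
        uint32_t min;  /* smallest code point allowed for this length */
        if (b < 0x80) {
            i++;       /* ASCII byte, always valid */
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return false;  /* stray continuation byte or invalid lead byte */
        }
        if (len - i <= need) return false;  /* sequence cut off by end of input */
        for (size_t k = 1; k <= need; k++) {
            uint8_t c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;  /* not a continuation byte */
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }
        if (cp < min) return false;                      /* overlong encoding */
        if (cp > 0x10FFFF) return false;                 /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* UTF-16 surrogate */
        i += need + 1;
    }
    return true;
}
```

For example, validate_utf8_scalar_sketch("caf\xC3\xA9", 5) returns true, while the overlong pair "\xC0\x80" is rejected.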

If you expect your strings to be plain ASCII, you can spend less than 0.1 cycles per input byte to check whether that is the case using the validate_ascii_fast function found in the simdasciicheck.h header. There are even faster functions like validate_utf8_fast_avx.
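
The ASCII check can be so cheap because a byte is ASCII exactly when its top bit is zero, so the bytes can be OR-ed together and tested once at the end. The following is a portable sketch of that idea using 64-bit words. It is not the code in simdasciicheck.h, and the name is_ascii_sketch is hypothetical; a SIMD version could apply the same accumulation to 16 or 32 bytes at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns true when every byte is below 0x80.
 * All bytes are OR-ed into an accumulator, so there is no branch
 * per byte; the top bits are inspected once after the loops. */
static bool is_ascii_sketch(const char *src, size_t len) {
    assert(src != NULL || len == 0);
    uint64_t acc = 0;
    size_t i = 0;
    /* Main loop: eight bytes per iteration. */
    for (; i + 8 <= len; i += 8) {
        uint64_t word;
        memcpy(&word, src + i, sizeof word);  /* safe unaligned load */
        acc |= word;
    }
    /* Tail: the remaining 0 to 7 bytes, one at a time. */
    uint8_t tail = 0;
    for (; i < len; i++) {
        tail |= (uint8_t)src[i];
    }
    return (acc & UINT64_C(0x8080808080808080)) == 0 && (tail & 0x80) == 0;
}
```

Since every ASCII string is also valid UTF-8, a check like this can serve as a fast first pass before full validation when most input is expected to be ASCII.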

Command-line tool

Adam Retter maintains a useful command-line tool related to this library.

Experimental results

On a Skylake processor, using GCC, we get:

$ ./benchmark
string size = 65536
We are feeding ascii so it is always going to be ok.
It favors schemes that skip ASCII characters.
validate_utf8(data, N)                                          :  1.256 cycles per operation (best)     1.316 cycles per operation (avg)
validate_utf8_fast(data, N)                                     :  0.704 cycles per operation (best)     0.706 cycles per operation (avg)
validate_utf8_fast_avx(data, N)                                 :  0.450 cycles per operation (best)     0.452 cycles per operation (avg)
validate_utf8_fast_avx_asciipath(data, N)                       :  0.088 cycles per operation (best)     0.091 cycles per operation (avg)
validate_ascii_fast(data, N)                                    :  0.082 cycles per operation (best)     0.084 cycles per operation (avg)
validate_ascii_fast_avx(data, N)                                :  0.050 cycles per operation (best)     0.074 cycles per operation (avg)
validate_ascii_nosimd(data, N)                                  :  0.104 cycles per operation (best)     0.106 cycles per operation (avg)
validate_ascii_nointrin(data, N)                                :  0.068 cycles per operation (best)     0.088 cycles per operation (avg)
validate_utf8_fast(data, N)                                      :  0.701 cycles per operation (best)     0.703 cycles per operation (avg)  (linux counter)
validate_ascii_fast(data, N)                                     :  0.083 cycles per operation (best)     0.085 cycles per operation (avg)  (linux counter)


string size (approx) = 65536
Producing random-looking UTF-8
validate_utf8(data, actualN)                                    :  10.967 cycles per operation (best)     11.005 cycles per operation (avg)
validate_utf8_fast(data, actualN)                               :  0.702 cycles per operation (best)     0.705 cycles per operation (avg)
validate_utf8_fast_avx(data, actualN)                           :  0.448 cycles per operation (best)     0.485 cycles per operation (avg)
validate_utf8_fast_avx_asciipath(data, actualN)                 :  0.480 cycles per operation (best)     0.594 cycles per operation (avg)

Thus, after rounding, validate_utf8_fast takes about 0.7 cycles per input byte to validate UTF-8 strings, on ASCII input and on random-looking UTF-8 input alike.
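
To relate cycles per byte to throughput, divide the clock frequency by the cost per byte. The sketch below does that arithmetic. The function name is hypothetical, and the 3 GHz clock used in the note after it is an assumed figure, not something reported by the benchmark.

```c
#include <assert.h>

/* Hypothetical helper: converts a cost in cycles per byte into
 * throughput in GB/s (1 GB = 1e9 bytes) for a clock given in GHz.
 * GHz is 1e9 cycles per second, so the 1e9 factors cancel. */
static double throughput_gb_per_s(double cycles_per_byte, double clock_ghz) {
    assert(cycles_per_byte > 0.0 && clock_ghz > 0.0);
    return clock_ghz / cycles_per_byte;
}
```

At an assumed 3 GHz, throughput_gb_per_s(0.7, 3.0) gives roughly 4.3 GB/s.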

In Go

There is an assembly wrapper in Go by Stuart Carnie.

ARM Neon and SSE4

For ARM NEON and SSE4, see the separate project "Fast UTF-8 validation with range algorithm (NEON+SSE4)".

License

This library is distributed under the terms of any of the following
licenses, at your option:

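Key metrics

Overview
  • Name and owner: lemire/fastvalidate-utf-8
  • Primary language: C
  • Languages: Makefile (language count: 2)
  • Platforms: Linux, Mac, Windows
  • License: Apache License 2.0
Owner activity
  • Created: 2018-05-15 23:51:20
  • Last push: 2024-03-13 14:40:25
  • Last commit: 2024-03-13 10:40:25
  • Releases: 0
User engagement
  • Stars: 303
  • Watchers: 22
  • Forks: 26
  • Commits: 93
  • Issues: 13
  • Open issues: 0
  • Pull requests: 14
  • Open pull requests: 0
  • Closed pull requests: 0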