? Portable UTF-8
Description
It is written in PHP (PHP 7+) and can work without "mbstring", "iconv" or any other extra encoding php-extension on your server.
The benefit of Portable UTF-8 is that it is easy to use, easy to bundle. This library will also
auto-detect your server environment and will use the installed php-extensions if they are available,
so you will have the best possible performance.
As a fallback we will use Symfony Polyfills, if needed. (https://github.com/symfony/polyfill)
The project based on ...
- Hamid Sarfraz's work - portable-utf8
- Nicolas Grekas's work - tchwork/utf8
- Behat's work - Behat/Transliterator
- Sebastián Grignoli's work - neitanod/forceutf8
- Ivan Enderlin's work - hoaproject/Ustring
- and many cherry-picks from "GitHub"-gists and "Stack Overflow"-snippets ...
Demo
Here you can test some basic functions from this library and you can compare some results with the native php function results.
Index
- Alternative
- Install
- Why Portable UTF-8?
- Requirements and Recommendations
- Warning
- Usage
- Class methods
- access
- add_bom_to_string
- binary_to_str
- bom
- chr
- chr_map
- chr_size_list
- chr_to_decimal
- chr_to_hex
- chunk_split
- clean
- cleanup
- codepoints
- count_chars
- encode
- file_get_contents
- file_has_bom
- filter
- filter_input
- filter_input_array
- filter_var
- filter_var_array
- fits_inside
- fix_simple_utf8
- fix_utf8
- getCharDirection
- hex_to_int
- html_encode
- html_entity_decode
- htmlentities
- htmlspecialchars
- int_to_hex
- is_ascii
- is_base64
- is_binary
- is_binary_file
- is_bom
- is_json
- is_html
- is_utf16
- is_utf32
- is_utf8
- json_decode
- json_encode
- lcfirst
- max
- max_chr_width
- min
- normalize_encoding
- normalize_msword
- normalize_whitespace
- ord
- parse_str
- range
- remove_bom
- remove_duplicates
- remove_invisible_characters
- replace_diamond_question_mark
- trim
- rtrim
- ltrim
- single_chr_html_encode
- split
- str_detect_encoding
- str_ireplace
- str_limit_after_word
- str_pad
- str_repeat
- str_shuffle
- str_sort
- str_split
- str_to_binary
- str_word_count
- strcmp
- strnatcmp
- strcasecmp
- strnatcasecmp
- strncasecmp
- strncmp
- string
- string_has_bom
- strip_tags
- strlen
- strwidth
- strpbrk
- strpos
- stripos
- strrpos
- strripos
- strrchr
- strrichr
- strrev
- strspn
- strstr
- stristr
- strtocasefold
- strtolower
- strtoupper
- strtr
- substr
- substr_compare
- substr_count
- substr_replace
- swapCase
- to_ascii
- to_utf8
- to_iso8859
- ucfirst
- ucwords
- urldecode
- utf8_decode
- utf8_encode
- words_limit
- wordwrap
- Unit Test
- License and Copyright
Alternative
If you like a more Object Oriented Way to edit strings, then you can take a look at voku/Stringy, it's a fork of "danielstjules/Stringy" but it used the "Portable UTF-8"-Class and some extra methods.
// Standard library
strtoupper('fòôbàř'); // 'FòôBàř'
strlen('fòôbàř'); // 10
// mbstring
// WARNING: if you don't use a polyfill like "Portable UTF-8", you need to install the php-extension "mbstring" on your server
mb_strtoupper('fòôbàř'); // 'FÒÔBÀŘ'
mb_strlen('fòôbàř'); // '6'
// Portable UTF-8
use voku\helper\UTF8;
UTF8::strtoupper('fòôbàř'); // 'FÒÔBÀŘ'
UTF8::strlen('fòôbàř'); // '6'
// voku/Stringy
use Stringy\Stringy as S;
$stringy = S::create('fòôbàř');
$stringy->toUpperCase(); // 'FÒÔBÀŘ'
$stringy->length(); // '6'
Install "Portable UTF-8" via "composer require"
composer require voku/portable-utf8
If your project do not need some of the Symfony polyfills please use the replace
section of your composer.json
.
This removes any overhead from these polyfills as they are no longer part of your project. e.g.:
{
"replace": {
"symfony/polyfill-php72": "1.99",
"symfony/polyfill-iconv": "1.99",
"symfony/polyfill-intl-grapheme": "1.99",
"symfony/polyfill-intl-normalizer": "1.99",
"symfony/polyfill-mbstring": "1.99"
}
}
Why Portable UTF-8?
PHP 5 and earlier versions have no native Unicode support. To bridge the gap, there exist several extensions like "mbstring", "iconv" and "intl".
The problem with "mbstring" and others is that most of the time you cannot ensure presence of a specific one on a server. If you rely on one of these, your application is no more portable. This problem gets even severe for open source applications that have to run on different servers with different configurations. Considering these, I decided to write a library:
Requirements and Recommendations
- No extensions are required to run this library. Portable UTF-8 only needs PCRE library that is available by default since PHP 4.2.0 and cannot be disabled since PHP 5.3.0. "\u" modifier support in PCRE for UTF-8 handling is not a must.
- PHP 5.3 is the minimum requirement, and all later versions are fine with Portable UTF-8.
- PHP 7.0 is the minimum requirement since version 4.0 of Portable UTF-8, otherwise composer will install an older version
- To speed up string handling, it is recommended that you have "mbstring" or "iconv" available on your server, as well as the latest version of PCRE library
- Although Portable UTF-8 is easy to use; moving from native API to Portable UTF-8 may not be straight-forward for everyone. It is highly recommended that you do not update your scripts to include Portable UTF-8 or replace or change anything before you first know the reason and consequences. Most of the time, some native function may be all what you need.
- There is also a shim for "mbstring", "iconv" and "intl", so you can use it also on shared webspace.
Info
Since version 5.4.26 this library will NOT force "UTF-8" by "bootstrap.php" anymore.
If you need to enable this behavior you can define "PORTABLE_UTF8__ENABLE_AUTO_FILTER", before requiring the autoloader.
define('PORTABLE_UTF8__ENABLE_AUTO_FILTER', 1);
Before version 5.4.26 this behavior was enabled by default and you could disable it via "PORTABLE_UTF8__DISABLE_AUTO_FILTER",
but the code had potential security vulnerabilities via injecting code while redirecting via header('Location ...
.
This is the reason I decided to add this BC in a bug fix release, so that everybody using the current version will receive the security-fix.
Usage
Example 1: UTF8::cleanup()
echo UTF8::cleanup('�Düsseldorf�');
// will output:
// Düsseldorf
Example 2: UTF8::strlen()
$string = 'string <strong>with utf-8 chars åèä</strong> - doo-bee doo-bee dooh';
echo strlen($string) . "\n<br />";
echo UTF8::strlen($string) . "\n<br />";
// will output:
// 70
// 67
$string_test1 = strip_tags($string);
$string_test2 = UTF8::strip_tags($string);
echo strlen($string_test1) . "\n<br />";
echo UTF8::strlen($string_test2) . "\n<br />";
// will output:
// 53
// 50
Example 3: UTF8::fix_utf8()
echo UTF8::fix_utf8('Düsseldorf');
echo UTF8::fix_utf8('ä');
// will output:
// Düsseldorf
// ä
Portable UTF-8, API
The API from the "UTF8"-Class is written as small static methods that will match the default PHP-API.
Class methods
access(string $str, int $pos)
Return the character at the specified position: $str[1] like functionality.
UTF8::access('fòô', 1); // 'ò'
add_bom_to_string(string $str)
Prepends UTF-8 BOM character to the string and returns the whole string.
If BOM already existed there, the Input string is returned.
UTF8::add_bom_to_string('fòô'); // "\xEF\xBB\xBF" . 'fòô'
binary_to_str(mixed $bin)
Convert binary into a string.
opposite: UTF8::str_to_binary()
UTF8::binary_to_str('11110000100111111001100010000011'); // '?'
bom()
Returns the UTF-8 Byte Order Mark Character.
UTF8::bom(); // "\xEF\xBB\xBF"
chr(int $code_point) : string
Generates a UTF-8 encoded character from the given code point.
opposite: UTF8::ord()
UTF8::chr(0x2603); // '☃'
chr_map(string, array $callback, string $str) : array
Applies callback to all characters of a string.
UTF8::chr_map(['voku\helper\UTF8', 'strtolower'], 'Κόσμε'); // ['κ','ό', 'σ', 'μ', 'ε']
chr_size_list(string $str) : array
Generates a UTF-8 encoded character from the given code point.
1 byte => U+0000 - U+007F
2 byte => U+0080 - U+07FF
3 byte => U+0800 - U+FFFF
4 byte => U+10000 - U+10FFFF
UTF8::chr_size_list('中文空白-test'); // [3, 3, 3, 3, 1, 1, 1, 1, 1]
chr_to_decimal(string $chr) : int
Get a decimal code representation of a specific character.
opposite: UTF8::decimal_to_chr()
alias: UTF8::chr_to_int()
UTF8::chr_to_decimal('§'); // 0xa7
chr_to_hex(string $chr, string $pfix = 'U+')
Get hexadecimal code point (U+xxxx) of a UTF-8 encoded character.
UTF8::chr_to_hex('§'); // U+00a7
chunk_split(string $body, int $chunklen = 76, string $end = "\r\n") : string
Splits a string into smaller chunks and multiple lines, using the specified line ending character.
UTF8::chunk_split('ABC-ÖÄÜ-中文空白-κόσμε', 3); // "ABC\r\n-ÖÄ\r\nÜ-中\r\n文空白\r\n-κό\r\nσμε"
clean(string $str, bool $remove_bom = false, bool $normalize_whitespace = false, bool $normalize_msword = false, bool $keep_non_breaking_space = false, bool $replace_diamond_question_mark = false, bool $remove_invisible_characters = true) : string
Accepts a string and removes all non-UTF-8 characters from it + extras if needed.
UTF8::clean("\xEF\xBB\xBF„Abcdef\xc2\xa0\x20…” — ? - Düsseldorf", true, true); // '„Abcdef …” — ? - Düsseldorf'
cleanup(string $str) : string
Clean-up a string and show only printable UTF-8 chars at the end + fix UTF-8 encoding.
UTF8::cleanup("\xEF\xBB\xBF„Abcdef\xc2\xa0\x20…” — ? - Düsseldorf", true, true); // '„Abcdef …” — ? - Düsseldorf'
codepoints(mixed $arg, bool $u_style = false) : array
Accepts a string and returns an array of Unicode code points.
opposite: UTF8::string()
UTF8::codepoints('κöñ'); // array(954, 246, 241)
// ... OR ...
UTF8::codepoints('κöñ', true); // array('U+03ba', 'U+00f6', 'U+00f1')
count_chars(string $str, bool $clean_utf8 = false) : array
Returns count of characters used in a string.
UTF8::count_chars('κaκbκc'); // array('κ' => 3, 'a' => 1, 'b' => 1, 'c' => 1)
decimal_to_chr(mixed $int) : string
Converts an int value into a UTF-8 character.
opposite: UTF8::chr_to_decimal()
alias: UTF8::int_to_chr()
UTF8::decimal_to_chr(931); // 'Σ'
emoji_decode(string $str, bool $use_reversible_string_mappings = false) : string
Decodes a string which was encoded by "UTF8::emoji_encode()".
UTF8::emoji_decode('foo CHARACTER_OGRE', false); // 'foo ?'
//
UTF8::emoji_encode('foo _-_PORTABLE_UTF8_-_308095726_-_627590803_-_8FTU_ELBATROP_-_', true); // 'foo ?'
emoji_encode(string $str, bool $use_reversible_string_mappings = false) : string
Encode a string with emoji chars into a non-emoji string.
UTF8::emoji_encode('foo ?', false); // 'foo CHARACTER_OGRE'
//
UTF8::emoji_encode('foo ?', true); // 'foo _-_PORTABLE_UTF8_-_308095726_-_627590803_-_8FTU_ELBATROP_-_'
encode(string $to_encoding, string $str, bool $auto_detect_the_from_encoding = true, string $from_encoding = ''): string
Encode a string with a new charset-encoding.
INFO: This function will also try to fix broken / double encoding,
so you can call this function also on a UTF-8 string and you don't mess up the string.
UTF8::encode('ISO-8859-1', '-ABC-中文空白-'); // '-ABC-????-'
//
UTF8::encode('UTF-8', '-ABC-中文空白-'); // '-ABC-中文空白-'
//
UTF8::encode('HTML', '-ABC-中文空白-'); // '-ABC-中文空白-'
//
UTF8::encode('BASE64', '-ABC-中文空白-'); // 'LUFCQy3kuK3mlofnqbrnmb0t'
file_get_contents(string $filename, int, null $flags = null, resource, null $context = null, int, null $offset = null, int, null $maxlen = null, int $timeout = 10, bool $convert_to_utf8 = true) : string
Reads entire file into a string.
WARNING: Do not use UTF-8 Option ($convert_to_utf8) for binary files (e.g.: images) !!!
UTF8::file_get_contents('utf16le.txt'); // ...
file_has_bom(string $file_path) : bool
Checks if a file starts with BOM (Byte Order Mark) character.
UTF8::file_has_bom('utf8_with_bom.txt'); // true
filter(mixed $var, int $normalization_form = 4, string $leading_combining = '◌') : mixed
Normalizes to UTF-8 NFC, converting from WINDOWS-1252 when needed.
UTF8::filter(array("\xE9", 'à', 'a')); // array('é', 'à', 'a')
filter_input(int $type, string $var, int $filter = FILTER_DEFAULT, null, array $option = null) : string
"filter_input()"-wrapper with normalizes to UTF-8 NFC, converting from WINDOWS-1252 when needed.
// _GET['foo'] = 'bar';
UTF8::filter_input(INPUT_GET, 'foo', FILTER_SANITIZE_STRING)); // 'bar'
filter_input_array(int $type, mixed $definition = null, bool $add_empty = true) : mixed
"filter_input_array()"-wrapper with normalizes to UTF-8 NFC, converting from WINDOWS-1252 when needed.
// _GET['foo'] = 'bar';
UTF8::filter_input_array(INPUT_GET, array('foo' => 'FILTER_SANITIZE_STRING')); // array('bar')
filter_var(string $var, int $filter = FILTER_DEFAULT, array $option = null) : string
"filter_var()"-wrapper with normalizes to UTF-8 NFC, converting from WINDOWS-1252 when needed.
UTF8::filter_var('-ABC-中文空白-', FILTER_VALIDATE_URL); // false
filter_var_array(array $data, mixed $definition = null, bool $add_empty = true) : mixed
"filter_var_array()"-wrapper with normalizes to UTF-8 NFC, converting from WINDOWS-1252 when needed.
$filters = [
'name' => ['filter' => FILTER_CALLBACK, 'options' => ['voku\helper\UTF8', 'ucwords']],
'age' => ['filter' => FILTER_VALIDATE_INT, 'options' => ['min_range' => 1, 'max_range' => 120]],
'email' => FILTER_VALIDATE_EMAIL,
];
$data = [
'name' => 'κόσμε',
'age' => '18',
'email' => 'foo@bar.de'
];
UTF8::filter_var_array($data, $filters, true); // ['name' => 'Κόσμε', 'age' => 18, 'email' => 'foo@bar.de']
fits_inside(string $str, int $box_size) : bool
Check if the number of Unicode characters isn't greater than the specified integer.
UTF8::fits_inside('κόσμε', 6); // false
fix_simple_utf8(string $str) : string
Try to fix simple broken UTF-8 strings.
INFO: Take a look at "UTF8::fix_utf8()" if you need a more advanced fix for broken UTF-8 strings.
UTF8::fix_simple_utf8('Düsseldorf'); // 'Düsseldorf'
fix_utf8(string, string[] $str) : mixed
Fix a double (or multiple) encoded UTF8 string.
UTF8::fix_utf8('Fédération'); // 'Fédération'
getCharDirection(string $char) : string ('RTL' or 'LTR')
Get character of a specific character.
UTF8::getCharDirection('ا'); // 'RTL'
hex_to_chr(string $hexdec) : string, false
Converts a hexadecimal value into a UTF-8 character.
opposite: UTF8::chr_to_hex()
UTF8::hex_to_chr('U+00a7'); // '§'
hex_to_int(string $hexdec) : int, false
Converts hexadecimal U+xxxx code point representation to integer.
opposite: UTF8::int_to_hex()
UTF8::hex_to_int('U+00f1'); // 241
html_encode(string $str, bool $keep_ascii_chars = false, string $encoding = 'UTF-8') : string
Converts a UTF-8 string to a series of HTML numbered entities.
opposite: UTF8::html_decode()
UTF8::html_encode('中文空白'); // '中文空白'
html_entity_decode(string $str, int $flags = null, string $encoding = 'UTF-8') : string
UTF-8 version of html_entity_decode()
The reason we are not using html_entity_decode() by itself is because
while it is not technically correct to leave out the semicolon
at the end of an entity most browsers will still interpret the entity
correctly. html_entity_decode() does not convert entities without
semicolons, so we are left with our own little solution here. Bummer.
Convert all HTML entities to their applicable characters
opposite: UTF8::html_encode()
alias: UTF8::html_decode()
UTF8::html_entity_decode('中文空白'); // '中文空白'
htmlentities(string $str, int $flags = ENT_COMPAT, string $encoding = 'UTF-8', bool $double_encode = true) : string
Convert all applicable characters to HTML entities: UTF-8 version of htmlentities()
UTF8::htmlentities('<白-öäü>'); // '<白-öäü>'
htmlspecialchars(string $str, int $flags = ENT_COMPAT, string $encoding = 'UTF-8', bool $double_encode = true) : string
Convert only special characters to HTML entities: UTF-8 version of htmlspecialchars()
INFO: Take a look at "UTF8::htmlentities()"
UTF8::htmlspecialchars('<白-öäü>'); // '<白-öäü>'
int_to_hex(int $int, string $pfix = 'U+') : str
Converts Integer to hexadecimal U+xxxx code point representation.
opposite: UTF8::hex_to_int()
UTF8::int_to_hex(241); // 'U+00f1'
is_ascii(string $str) : bool
Checks if a string is 7 bit ASCII.
alias: UTF8::isAscii()
UTF8::is_ascii('白'); // false
is_base64(string $str) : bool
Returns true if the string is base64 encoded, false otherwise.
alias: UTF8::isBase64()
UTF8::is_base64('4KSu4KWL4KSo4KS/4KSa'); // true
is_binary(mixed $input, bool $strict = false) : bool
Check if the input is binary... (is look like a hack).
alias: UTF8::isBinary()
UTF8::is_binary(01); // true
is_binary_file(string $file) : bool
Check if the file is binary.
UTF8::is_binary('./utf32.txt'); // true
is_bom(string $str) : bool
Checks if the given string is equal to any "Byte Order Mark".
WARNING: Use "UTF8::string_has_bom()" if you will check BOM in a string.
alias: UTF8::isBom()
UTF8::is_bom("\xef\xbb\xbf"); // true
is_json(string $str) : bool
Try to check if "$str" is a JSON-string.
alias: UTF8::isJson()
UTF8::is_json('{"array":[1,"¥","ä"]}'); // true
is_html(string $str) : bool
Check if the string contains any HTML tags .
alias: UTF8::isHtml()
UTF8::is_html('<b>lall</b>'); // true
is_utf16(string $str) : int, false
Check if the string is UTF-16: This function will return false if it's not UTF-16, 1 for UTF-16LE, 2 for UTF-16BE.
alias: UTF8::isUtf16()
UTF8::is_utf16(file_get_contents('utf-16-le.txt')); // 1
UTF8::is_utf16(file_get_contents('utf-16-be.txt')); // 2
UTF8::is_utf16(file_get_contents('utf-8.txt')); // false
is_utf32(string $str) : int, false
Check if the string is UTF-32: This function will return false if it's not UTF-32, 1 for UTF-32LE, 2 for UTF-32BE.
alias: UTF8::isUtf16()
UTF8::is_utf32(file_get_contents('utf-32-le.txt')); // 1
UTF8::is_utf32(file_get_contents('utf-32-be.txt')); // 2
UTF8::is_utf32(file_get_contents('utf-8.txt')); // false
is_utf8(string $str, bool $strict = false) : bool
Checks whether the passed string contains only byte sequences that are valid UTF-8 characters.
alias: UTF8::isUtf8()
UTF8::is_utf8('Iñtërnâtiônàlizætiøn'); // true
UTF8::is_utf8("Iñtërnâtiônàlizætiøn\xA0\xA1"); // false
json_decode(string $json, bool $assoc = false, int $depth = 512, int $options = 0) : mixed
Decodes a JSON string.
UTF8::json_decode('[1,"\u00a5","\u00e4"]'); // array(1, '¥', 'ä')
json_encode(mixed $value, int $options = 0, int $depth = 512) : string
Returns the JSON representation of a value.
UTF8::json_enocde(array(1, '¥', 'ä')); // '[1,"\u00a5","\u00e4"]'
lcfirst(string $str) : string
Makes string's first char lowercase.
UTF8::lcfirst('ÑTËRNÂTIÔNÀLIZÆTIØN'); // ñTËRNÂTIÔNÀLIZÆTIØN
max(mixed $arg) : string
Returns the UTF-8 character with the maximum code point in the given data.
UTF8::max('abc-äöü-中文空白'); // 'ø'
max_chr_width(string $str) : int
Calculates and returns the maximum number of bytes taken by any
UTF-8 encoded character in the given string.
UTF8::max_chr_width('Intërnâtiônàlizætiøn'); // 2
min(mixed $arg) : string
Returns the UTF-8 character with the minimum code point in the given data.
UTF8::min('abc-äöü-中文空白'); // '-'
normalize_encoding(string $encoding) : string
Normalize the encoding-"name" input.
UTF8::normalize_encoding('UTF8'); // 'UTF-8'
normalize_msword(string $str) : string
Normalize some MS Word special characters.
UTF8::normalize_msword('„Abcdef…”'); // '"Abcdef..."'
normalize_whitespace(string $str, bool $keep_non_breaking_space = false, bool $keep_bidi_unicode_controls = false) : string
Normalize the whitespace.
UTF8::normalize_whitespace("abc-\xc2\xa0-öäü-\xe2\x80\xaf-\xE2\x80\xAC", true); // "abc-\xc2\xa0-öäü- -"
ord(string $chr) : int
Calculates Unicode code point of the given UTF-8 encoded character.
opposite: UTF8::chr()
UTF8::ord('☃'); // 0x2603
parse_str(string $str, &$result, bool $clean_utf8 = false) : bool
Parses the string into an array (into the the second parameter).
WARNING: Unlike "parse_str()", this method does not (re-)place variables in the current scope,
if the second parameter is not set!
UTF8::parse_str('Iñtërnâtiônéàlizætiøn=測試&arr[]=foo+測試&arr[]=ການທົດສອບ', $array);
echo $array['Iñtërnâtiônéàlizætiøn']; // '測試'
range(mixed $var1, mixed $var2) : array
Create an array containing a range of UTF-8 characters.
UTF8::range('κ', 'ζ'); // array('κ', 'ι', 'θ', 'η', 'ζ',)
remove_bom(string $str) : string
Remove the BOM from UTF-8 / UTF-16 / UTF-32 strings.
UTF8::remove_bom("\xEF\xBB\xBFΜπορώ να"); // 'Μπορώ να'
remove_duplicates(string $str, string, array $what = ' ') : string
Removes duplicate occurrences of a string in another string.
UTF8::remove_duplicates('öäü-κόσμεκόσμε-äöü', 'κόσμε'); // 'öäü-κόσμε-äöü'
remove_invisible_characters(string $str, bool $url_encoded = true, string $replacement = '') : string
Remove invisible characters from a string.
UTF8::remove_invisible_characters("κόσ\0με"); // 'κόσμε'
replace_diamond_question_mark(string $str, string $replacement_char = '', bool $process_invalid_utf8 = true) : string
Replace the diamond question mark (�) and invalid-UTF8 chars with the replacement.
UTF8::replace_diamond_question_mark('中文空白�', ''); // '中文空白'
trim(string $str = '', string $chars = INF) : string
Strip whitespace or other characters from the beginning and end of a UTF-8 string.
UTF8::rtrim(' -ABC-中文空白- '); // '-ABC-中文空白-'
rtrim(string $str = '', string $chars = INF) : string
Strip whitespace or other characters from the end of a UTF-8 string.
UTF8::rtrim('-ABC-中文空白- '); // '-ABC-中文空白-'
ltrim(string $str, string $chars = INF) : string
Strip whitespace or other characters from the beginning of a UTF-8 string.
UTF8::ltrim(' 中文空白 '); // '中文空白 '
single_chr_html_encode(string $char, bool $keep_ascii_chars = false) : string
Converts a UTF-8 character to HTML Numbered Entity like "{".
UTF8::single_chr_html_encode('κ'); // 'κ'
split(string $str, int $length = 1, bool $clean_utf8 = false) : array
Convert a string to an array of Unicode characters.
UTF8::split('中文空白'); // array('中', '文', '空', '白')
str_detect_encoding(string $str) : string
Optimized "\mb_detect_encoding()"-function -> with support for UTF-16 and UTF-32.
UTF8::str_detect_encoding('中文空白'); // 'UTF-8'
UTF8::str_detect_encoding('Abc'); // 'ASCII'
str_ends_with(string $haystack, string $needle) : bool
Check if the string ends with the given substring.
UTF8::str_ends_with('BeginMiddleΚόσμε', 'Κόσμε'); // true
UTF8::str_ends_with('BeginMiddleΚόσμε', 'κόσμε'); // false
str_iends_with(string $haystack, string $needle) : bool
Check if the string ends with the given substring, case-insensitive.
UTF8::str_iends_with('BeginMiddleΚόσμε', 'Κόσμε'); // true
UTF8::str_iends_with('BeginMiddleΚόσμε', 'κόσμε'); // true
str_ireplace(mixed $search, mixed $replace, mixed $subject, int &$count = null) : mixed
Case-insensitive and UTF-8 safe version of str_replace.
UTF8::str_ireplace('lIzÆ', 'lise', array('Iñtërnâtiônàlizætiøn')); // array('Iñtërnâtiônàlisetiøn')
str_limit_after_word(string $str, int $length = 100, stirng $str_add_on = '...') : string
Limit the number of characters in a string, but also after the next word.
UTF8::str_limit_after_word('fòô bàř fòô', 8, ''); // 'fòô bàř'
str_pad(string $str, int $pad_length, string $pad_string = ' ', int $pad_type = STR_PAD_RIGHT) : string
Pad a UTF-8 string to a given length with another string.
UTF8::str_pad('中文空白', 10, '_', STR_PAD_BOTH); // '___中文空白___'
str_repeat(string $str, int $multiplier) : string
Repeat a string.
UTF8::str_repeat("°~\xf0\x90\x28\xbc", 2); // '°~ð(¼°~ð(¼'
str_shuffle(string $str) : string
Shuffles all the characters in the string.
UTF8::str_shuffle('fòô bàř fòô'); // 'àòôřb ffòô '
str_sort(string $str, bool $unique = false, bool $desc = false) : string
Sort all characters according to code points.
UTF8::str_sort(' -ABC-中文空白- '); // ' ---ABC中文白空'
str_split(string $str, int $len = 1) : array
Split a string into an array.
UTF8::split('déjà', 2); // array('dé', 'jà')
str_starts_with(string $haystack, string $needle) : bool
Check if the string starts with the given substring.
UTF8::str_starts_with('ΚόσμεMiddleEnd', 'Κόσμε'); // true
UTF8::str_starts_with('ΚόσμεMiddleEnd', 'κόσμε'); // false
str_istarts_with(string $haystack, string $needle) : bool
Check if the string starts with the given substring, case-insensitive.
UTF8::str_istarts_with('ΚόσμεMiddleEnd', 'Κόσμε'); // true
UTF8::str_iistarts_with('ΚόσμεMiddleEnd', 'κόσμε'); // true
str_to_binary(string $str) : string
Get a binary representation of a specific string.
opposite: UTF8::binary_to_str()
UTF8::str_to_binary('?'); // '11110000100111111001100010000011'
str_word_count(string $str, int $format = 0, string $charlist = '') : string
Get the number of words in a specific string.
// format: 0 -> return only word count (int)
//
UTF8::str_word_count('中文空白 öäü abc#c'); // 4
UTF8::str_word_count('中文空白 öäü abc#c', 0, '#'); // 3
// format: 1 -> return words (array)
//
UTF8::str_word_count('中文空白 öäü abc#c', 1); // array('中文空白', 'öäü', 'abc', 'c')
UTF8::str_word_count('中文空白 öäü abc#c', 1, '#'); // array('中文空白', 'öäü', 'abc#c')
// format: 2 -> return words with offset (array)
//
UTF8::str_word_count('中文空白 öäü ab#c', 2); // array(0 => '中文空白', 5 => 'öäü', 9 => 'abc', 13 => 'c')
UTF8::str_word_count('中文空白 öäü ab#c', 2, '#'); // array(0 => '中文空白', 5 => 'öäü', 9 => 'abc#c')
strcmp(string $str1, string $str2) : int
Case-insensitive string comparison: < 0 if str1 is less than str2;
> 0 if str1 is greater than str2,
0 if they are equal.
UTF8::strcmp("iñtërnâtiôn\nàlizætiøn", "iñtërnâtiôn\nàlizætiøn"); // 0
strnatcmp(string $str1, string $str2) : int
Case sensitive string comparisons using a "natural order" algorithm: < 0 if str1 is less than str2;
> 0 if str1 is greater than str2,
0 if they are equal.
INFO: natural order version of UTF8::strcmp()
UTF8::strnatcmp('2Hello world 中文空白!', '10Hello WORLD 中文空白!'); // -1
UTF8::strcmp('2Hello world 中文空白!', '10Hello WORLD 中文空白!'); // 1
UTF8::strnatcmp('10Hello world 中文空白!', '2Hello WORLD 中文空白!'); // 1
UTF8::strcmp('10Hello world 中文空白!', '2Hello WORLD 中文空白!')); // -1
strcasecmp(string $str1, string $str2) : int
Case-insensitive string comparison: < 0 if str1 is less than str2;
> 0 if str1 is greater than str2,
0 if they are equal.
INFO: Case-insensitive version of UTF8::strcmp()
UTF8::strcasecmp("iñtërnâtiôn\nàlizætiøn", "Iñtërnâtiôn\nàlizætiøn"); // 0
strnatcasecmp(string $str1, string $str2) : int
Case-insensitive string comparisons using a "natural order" algorithm: < 0 if str1 is less than str2;
> 0 if str1 is greater than str2,
0 if they are equal.
INFO: natural order version of UTF8::strcasecmp()
UTF8::strnatcasecmp('2', '10Hello WORLD 中文空白!'); // -1
UTF8::strcasecmp('2Hello world 中文空白!', '10Hello WORLD 中文空白!'); // 1
UTF8::strnatcasecmp('10Hello world 中文空白!', '2Hello WORLD 中文空白!'); // 1
UTF8::strcasecmp('10Hello world 中文空白!', '2Hello WORLD 中文空白!'); // -1
strncasecmp(string $str1, string $str2, int $len) : int
Case-insensitive string comparison of the first n characters.:
< 0 if str1 is less than str2;
> 0 if str1 is greater than str2,
0 if they are equal.
INFO: Case-insensitive version of UTF8::strncmp()
UTF8::strcasecmp("iñtërnâtiôn\nàlizætiøn321", "iñtërnâtiôn\nàlizætiøn123", 5); // 0
strncmp(string $str1, string $str2, int $len) : int
Case-sensitive string comparison of the first n characters.:
< 0 if str1 is less than str2;
> 0 if str1 is greater than str2,
0 if they are equal.
UTF8::strncmp("Iñtërnâtiôn\nàlizætiøn321", "Iñtërnâtiôn\nàlizætiøn123", 5); // 0
string(string $str1, string $str2) : int
Create a UTF-8 string from code points.
opposite: UTF8::codepoints()
UTF8::string(array(246, 228, 252)); // 'öäü'
string_has_bom(string $str) : bool
Checks if string starts with "BOM" (Byte Order Mark Character) character.
alias: UTF8::hasBom()
UTF8::string_has_bom("\xef\xbb\xbf foobar"); // true
strip_tags(string $str, sting, null $allowable_tags = null, bool $clean_utf8 = false) : string
Strip HTML and PHP tags from a string + clean invalid UTF-8.
UTF8::strip_tags("<span>κόσμε\xa0\xa1</span>"); // 'κόσμε'
strip_whitespace(string $str)
Strip all whitespace characters. This includes tabs and newline characters,
as well as multibyte whitespace such as the thin space and ideographic space.
UTF8::strip_whitespace(' Ο συγγραφέας '); // 'Οσυγγραφέας'
strlen(string $str, string $encoding = 'UTF-8', bool $clean_utf8 = false) : int
Get the string length, not the byte-length!
UTF8::strlen("Iñtërnâtiôn\xE9àlizætiøn")); // 20
strwidth(string $str, string $encoding = 'UTF-8', bool $clean_utf8 = false) : int
Return the width of a string.
UTF8::strwidth("Iñtërnâtiôn\xE9àlizætiøn")); // 21
strpbrk(string $haystack, string $char_list) : string
Search a string for any of a set of characters.
UTF8::strpbrk('-中文空白-', '白'); // '白-'
strpos(string $haystack, string $needle, int $offset = 0, string $encoding = 'UTF-8', bool $clean_utf8 = false) : int, false
Find the position of the first occurrence of a substring in a string.
UTF8::strpos('ABC-ÖÄÜ-中文空白-中文空白', '中'); // 8
stripos(string $str, string $needle, int $offset = null, string $encoding = 'UTF-8', bool $clean_utf8 = false) : int, false
Find the position of the first occurrence of a substring in a string, case-insensitive.
UTF8::strpos('ABC-ÖÄÜ-中文空白-中文空白', '中'); // 8
strrpos(string $haystack, string $needle, int $offset = 0, string $encoding = 'UTF-8', bool $clean_utf8 = false) : string, false
Find the position of the last occurrence of a substring in a string.
UTF8::strrpos('ABC-ÖÄÜ-中文空白-中文空白', '中'); // 13
strripos(string $haystack, string $needle, int $offset = 0, string $encoding = 'UTF-8', bool $clean_utf8 = false) : string, false
Find the position of the last occurrence of a substring in a string, case-insensitive.
UTF8::strripos('ABC-ÖÄÜ-中文空白-中文空白', '中'); // 13
strrchr(string $haystack, string $needle, bool $part = false, string $encoding = 'UTF-8', bool $clean_utf8 = false) : string, false
Find the last occurrence of a character in a string within another.
UTF8::strrchr('κόσμεκόσμε-äöü', 'κόσμε'); // 'κόσμε-äöü'
strrichr(string $haystack, string $needle, bool $part = false, string $encoding = 'UTF-8', bool $clean_utf8 = false) : string, false
Find the last occurrence of a character in a string within another, case-insensitive.
UTF8::strrichr('Aκόσμεκόσμε-äöü', 'aκόσμε'); // 'Aκόσμεκόσμε-äöü'
strrev(string $str) : string
Reverses characters order in the string.
UTF8::strrev('κ-öäü'); // 'üäö-κ'
strspn(string $str, string $mask, int $offset = 0, int $length = 2147483647) : string
Finds the length of the initial segment of a string consisting entirely of characters contained within a given mask.
UTF8::strspn('iñtërnâtiônàlizætiøn', 'itñ'); // '3'
strstr(string $str, string $needle, bool $before_needle = false, string $encoding = 'UTF-8', bool $clean_utf8 = false) : string
Returns part of haystack string from the first occurrence of needle to the end of haystack.
alias: UTF8::strchr()
$str = 'iñtërnâtiônàlizætiøn';
$search = 'nât';
UTF8::strstr($str, $search)); // 'nâtiônàlizætiøn'
UTF8::strstr($str, $search, true)); // 'iñtër'
stristr(string $str, string $needle, bool $before_needle = false, string $encoding = 'UTF-8', bool $clean_utf8 = false) : string
Returns all of haystack starting from and including the first occurrence of needle to the end.
alias: UTF8::strichr()
$str = 'iñtërnâtiônàlizætiøn';
$search = 'NÂT';
UTF8::stristr($str, $search)); // 'nâtiônàlizætiøn'
UTF8::stristr($str, $search, true)); // 'iñtër'
strtocasefold(string $str, bool $full = true) : string
Unicode transformation for case-less matching.
UTF8::strtocasefold('ǰ◌̱'); // 'ǰ◌̱'
strtolower(string $str, string $encoding = 'UTF-8', bool $clean_utf8 = false) : string
Make a string lowercase.
UTF8::strtolower('DÉJÀ Σσς Iıİi'); // 'déjà σσς iıii'
strtoupper(string $str, string $encoding = 'UTF-8', bool $clean_utf8 = false) : string
Make a string uppercase.
UTF8::strtoupper('Déjà Σσς Iıİi'); // 'DÉJÀ ΣΣΣ IIİI'
strtr(string $str, string, array $from, string, array $to = INF) : string
Translate characters or replace sub-strings.
$arr = array(
'Hello' => '○●◎',
'中文空白' => 'earth',
);
UTF8::strtr('Hello 中文空白', $arr); // '○●◎ earth'
str_to_words(string $str, string $charlist = '') : array
Convert a string (phrase, sentence, ...) into an array of words.
UTF8::str_to_words('中文空白 oöäü#s', '#') // array('', '中文空白', ' ', 'oöäü#s', '')
substr(string $str, int $start = 0, int $length = null, string $encoding = 'UTF-8', bool $clean_utf8 = false) : string
Get part of a string.
UTF8::substr('中文空白', 1, 2); // '文空'
substr_compare(string $main_str, string $str, int $offset, int $length = 2147483647, bool $case_insensitivity = false) : int
Binary-safe comparison of two strings from an offset, up to a length of characters.
UTF8::substr_compare("○●◎\r", '●◎', 0, 2); // -1
UTF8::substr_compare("○●◎\r", '◎●', 1, 2); // 1
UTF8::substr_compare("○●◎\r", '●◎', 1, 2); // 0
substr_count(string $haystack, string $needle, int $offset = 0, int $length = null, string $encoding = 'UTF-8', bool $clean_utf8 = false) : int, false
Count the number of substring occurrences.
UTF8::substr_count('中文空白', '文空', 1, 2); // 1
substr_left(string $haystack, string $needle) : string
Removes a prefix ($needle) from the beginning of the string ($haystack).
UTF8::substr_left('ΚόσμεMiddleEnd', 'Κόσμε'); // 'MiddleEnd'
UTF8::substr_left('ΚόσμεMiddleEnd', 'κόσμε'); // 'ΚόσμεMiddleEnd'
substr_ileft(string $haystack, string $needle) : string
Removes a prefix ($needle) from the beginning of the string ($haystack), case-insensitive.
UTF8::substr_ileft('ΚόσμεMiddleEnd', 'Κόσμε'); // 'MiddleEnd'
UTF8::substr_ileft('ΚόσμεMiddleEnd', 'κόσμε'); // 'MiddleEnd'
substr_right(string $haystack, string $needle) : string
Removes a suffix ($needle) from the end of the string ($haystack).
UTF8::substr_right('BeginMiddleΚόσμε', 'Κόσμε'); // 'BeginMiddle'
UTF8::substr_right('BeginMiddleΚόσμε', 'κόσμε'); // 'BeginMiddleΚόσμε'
substr_iright(string $haystack, string $needle) : string
Removes a suffix ($needle) from the end of the string ($haystack), case-insensitive.
UTF8::substr_iright('BeginMiddleΚόσμε', 'Κόσμε'); // 'BeginMiddle'
UTF8::substr_iright('BeginMiddleΚόσμε', 'κόσμε'); // 'BeginMiddle'
substr_replace(string, string[] $str, string, string[] $replacement, int, int[] $start, int, int[] $length = null) : string, array
Replace text within a portion of a string.
UTF8::substr_replace(array('Iñtërnâtiônàlizætiøn', 'foo'), 'æ', 1); // array('Iæñtërnâtiônàlizætiøn', 'fæoo')
swapCase(string $str, string string $encoding = 'UTF-8', bool $clean_utf8 = false) : string
Returns a case swapped version of the string.
UTF8::swapCase('déJÀ σσς iıII'); // 'DÉjà ΣΣΣ IIii'
to_ascii(string $str, string $unknown = '?', bool $strict) : string
Convert a string into ASCII.
alias: UTF8::toAscii()
alias: UTF8::str_transliterate()
UTF8::to_ascii('déjà σσς iıii'); // 'deja sss iiii'
to_utf8(string, string[] $str, bool $decode_html_entity_to_utf8 = false) : string, string[]
This function leaves UTF-8 characters alone, while converting almost all non-UTF8 to UTF8.
- It decode UTF-8 codepoints and Unicode escape sequences.
- It assumes that the encoding of the original string is either WINDOWS-1252 or ISO-8859-1.
- WARNING: It does not remove invalid UTF-8 characters, so you maybe need to use "UTF8::clean()" for this case.
alias: UTF8::toUtf8()
UTF8::to_utf8("\u0063\u0061\u0074"); // 'cat'
to_iso8859(string, string[] $str) : string, string[]
Convert a string into "ISO-8859"-encoding (Latin-1).
alias: UTF8::toIso8859()
alias: UTF8::to_latin1()
alias: UTF8::toLatin1()
UTF8::to_utf8(UTF8::to_latin1(' -ABC-中文空白- ')); // ' -ABC-????- '
ucfirst(string $str, string $encoding = 'UTF-8', bool $clean_utf8 = false) : string
Makes string's first char uppercase.
alias: UTF8::ucword()
UTF8::ucfirst('ñtërnâtiônàlizætiøn'); // 'Ñtërnâtiônàlizætiøn'
ucwords(string $str, array $exceptions = array(), string $charlist = '', string $encoding = 'UTF-8', bool $clean_utf8 = false) : string
Uppercase for all words in the string.
UTF8::ucwords('iñt ërn âTi ônà liz æti øn'); // 'Iñt Ërn ÂTi Ônà Liz Æti Øn'
rawurldecode(string $str) : string
Multi decode HTML entity + fix urlencoded-win1252-chars.
UTF8::urldecode('tes%20öäü%20\u00edtest+test'); // 'tes öäü ítest+test'
urldecode(string $str) : string
Multi decode HTML entity + fix urlencoded-win1252-chars.
UTF8::urldecode('tes%20öäü%20\u00edtest+test'); // 'tes öäü ítest test'
utf8_decode(string $str) : string
Decodes a UTF-8 string to ISO-8859-1.
UTF8::encode('UTF-8', UTF8::utf8_decode('-ABC-中文空白-')); // '-ABC-????-'
utf8_encode(string $str) : string
Encodes an ISO-8859-1 string to UTF-8.
UTF8::utf8_decode(UTF8::utf8_encode('-ABC-中文空白-')); // '-ABC-中文空白-'
words_limit(string $str, int $words = 100, string $str_add_on = '...') : string
Limit the number of words in a string.
UTF8::words_limit('fòô bàř fòô', 2, ''); // 'fòô bàř'
wordwrap(string $str, int $width = 75, string $break = "\n", bool $cut = false) : string
Wraps a string to a given number of characters
UTF8::wordwrap('Iñtërnâtiônàlizætiøn', 2, '<br>', true)); // 'Iñ<br>të<br>rn<br>ât<br>iô<br>nà<br>li<br>zæ<br>ti<br>øn'
Unit Test
- Composer is a prerequisite for running the tests.
composer install
- The tests can be executed by running this command from the root directory:
./vendor/bin/phpunit
Support
For support and donations please visit GitHub, Issues, PayPal, Patreon.
For status updates and release announcements please visit Releases, Twitter, Patreon.
For professional support please contact me.
Thanks
- Thanks to GitHub (Microsoft) for hosting the code and a good infrastructure including Issues-Management, etc.
- Thanks to IntelliJ as they make the best IDEs for PHP and they gave me an open source license for PhpStorm!
- Thanks to Travis CI for being the most awesome, easiest continuous integration tool out there!
- Thanks to StyleCI for the simple but powerful code style check.
- Thanks to PHPStan && Psalm for really great Static analysis tools and for discovering bugs in the code!
License and Copyright
"Portable UTF8" is free software; you can redistribute it and/or modify it under
the terms of the (at your option):
Unicode handling requires tedious work to be implemented and maintained on the
long run. As such, contributions such as unit tests, bug reports, comments or
patches licensed under both licenses are really welcomed.