Here is a way to have some flexibility in what should be discarded and what should be replaced. This is how I currently do it.
$string = 'À some string with junk Ĩ Ä ';
$replace = [
'<' => '', '>' => '', "'" => '', '&' => '',
'"' => '', 'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'Ae',
'Å' => 'A', 'Ā' => 'A', 'Ą' => 'A', 'Ă' => 'A', 'Æ' => 'Ae',
'Ç' => 'C', 'Ć' => 'C', 'Č' => 'C', 'Ĉ' => 'C', 'Ċ' => 'C', 'Ď' => 'D', 'Đ' => 'D',
'Ð' => 'D', 'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E', 'Ē' => 'E',
'Ę' => 'E', 'Ě' => 'E', 'Ĕ' => 'E', 'Ė' => 'E', 'Ĝ' => 'G', 'Ğ' => 'G',
'Ġ' => 'G', 'Ģ' => 'G', 'Ĥ' => 'H', 'Ħ' => 'H', 'Ì' => 'I', 'Í' => 'I',
'Î' => 'I', 'Ï' => 'I', 'Ī' => 'I', 'Ĩ' => 'I', 'Ĭ' => 'I', 'Į' => 'I',
'İ' => 'I', 'IJ' => 'IJ', 'Ĵ' => 'J', 'Ķ' => 'K', 'Ł' => 'L', 'Ľ' => 'L',
'Ĺ' => 'L', 'Ļ' => 'L', 'Ŀ' => 'L', 'Ñ' => 'N', 'Ń' => 'N', 'Ň' => 'N',
'Ņ' => 'N', 'Ŋ' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O',
'Ö' => 'Oe', 'Ö' => 'Oe', 'Ø' => 'O', 'Ō' => 'O', 'Ő' => 'O', 'Ŏ' => 'O',
'Œ' => 'OE', 'Ŕ' => 'R', 'Ř' => 'R', 'Ŗ' => 'R', 'Ś' => 'S', 'Š' => 'S',
'Ş' => 'S', 'Ŝ' => 'S', 'Ș' => 'S', 'Ť' => 'T', 'Ţ' => 'T', 'Ŧ' => 'T',
'Ț' => 'T', 'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'Ue', 'Ū' => 'U',
'Ü' => 'Ue', 'Ů' => 'U', 'Ű' => 'U', 'Ŭ' => 'U', 'Ũ' => 'U', 'Ų' => 'U',
'Ŵ' => 'W', 'Ý' => 'Y', 'Ŷ' => 'Y', 'Ÿ' => 'Y', 'Ź' => 'Z', 'Ž' => 'Z',
'Ż' => 'Z', 'Þ' => 'T', 'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a',
'ä' => 'ae', 'ä' => 'ae', 'å' => 'a', 'ā' => 'a', 'ą' => 'a', 'ă' => 'a',
'æ' => 'ae', 'ç' => 'c', 'ć' => 'c', 'č' => 'c', 'ĉ' => 'c', 'ċ' => 'c',
'ď' => 'd', 'đ' => 'd', 'ð' => 'd', 'è' => 'e', 'é' => 'e', 'ê' => 'e',
'ë' => 'e', 'ē' => 'e', 'ę' => 'e', 'ě' => 'e', 'ĕ' => 'e', 'ė' => 'e',
'ƒ' => 'f', 'ĝ' => 'g', 'ğ' => 'g', 'ġ' => 'g', 'ģ' => 'g', 'ĥ' => 'h',
'ħ' => 'h', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'ī' => 'i',
'ĩ' => 'i', 'ĭ' => 'i', 'į' => 'i', 'ı' => 'i', 'ij' => 'ij', 'ĵ' => 'j',
'ķ' => 'k', 'ĸ' => 'k', 'ł' => 'l', 'ľ' => 'l', 'ĺ' => 'l', 'ļ' => 'l',
'ŀ' => 'l', 'ñ' => 'n', 'ń' => 'n', 'ň' => 'n', 'ņ' => 'n', 'ʼn' => 'n',
'ŋ' => 'n', 'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'oe',
'ö' => 'oe', 'ø' => 'o', 'ō' => 'o', 'ő' => 'o', 'ŏ' => 'o', 'œ' => 'oe',
'ŕ' => 'r', 'ř' => 'r', 'ŗ' => 'r', 'š' => 's', 'ù' => 'u', 'ú' => 'u',
'û' => 'u', 'ü' => 'ue', 'ū' => 'u', 'ü' => 'ue', 'ů' => 'u', 'ű' => 'u',
'ŭ' => 'u', 'ũ' => 'u', 'ų' => 'u', 'ŵ' => 'w', 'ý' => 'y', 'ÿ' => 'y',
'ŷ' => 'y', 'ž' => 'z', 'ż' => 'z', 'ź' => 'z', 'þ' => 't', 'ß' => 'ss',
'ſ' => 'ss', 'ый' => 'iy', 'А' => 'A', 'Б' => 'B', 'В' => 'V', 'Г' => 'G',
'Д' => 'D', 'Е' => 'E', 'Ё' => 'YO', 'Ж' => 'ZH', 'З' => 'Z', 'И' => 'I',
'Й' => 'Y', 'К' => 'K', 'Л' => 'L', 'М' => 'M', 'Н' => 'N', 'О' => 'O',
'П' => 'P', 'Р' => 'R', 'С' => 'S', 'Т' => 'T', 'У' => 'U', 'Ф' => 'F',
'Х' => 'H', 'Ц' => 'C', 'Ч' => 'CH', 'Ш' => 'SH', 'Щ' => 'SCH', 'Ъ' => '',
'Ы' => 'Y', 'Ь' => '', 'Э' => 'E', 'Ю' => 'YU', 'Я' => 'YA', 'а' => 'a',
'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd', 'е' => 'e', 'ё' => 'yo',
'ж' => 'zh', 'з' => 'z', 'и' => 'i', 'й' => 'y', 'к' => 'k', 'л' => 'l',
'м' => 'm', 'н' => 'n', 'о' => 'o', 'п' => 'p', 'р' => 'r', 'с' => 's',
'т' => 't', 'у' => 'u', 'ф' => 'f', 'х' => 'h', 'ц' => 'c', 'ч' => 'ch',
'ш' => 'sh', 'щ' => 'sch', 'ъ' => '', 'ы' => 'y', 'ь' => '', 'э' => 'e',
'ю' => 'yu', 'я' => 'ya'
];
echo str_replace(array_keys($replace), $replace, $string);
(PHP 4 >= 4.2.0, PHP 5, PHP 7, PHP 8)
mb_ereg_replace — Replace regular expression with multibyte support
Description
mb_ereg_replace(string $pattern, string $replacement, string $string, ?string $options = null): string|false|null
Scans string for matches to pattern, then replaces the matched text with replacement.
Parameters
pattern
The regular expression pattern. Multibyte characters may be used in pattern.
replacement
The replacement text.
string
The string being checked.
options
The search option. See mb_regex_set_options() for explanation.
Return Values
The resultant string on success, or false on error. If string is not valid for the current encoding, null is returned.
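A minimal sketch of a call that handles all three return types. The sample text and the error handling are illustrative assumptions, and the script file itself is assumed to be saved as UTF-8.

```php
<?php
// Assumption: the regex encoding is UTF-8 (set explicitly here for clarity).
mb_regex_encoding('UTF-8');

$input = 'naïve café';

// Replace every "é" or "ï" with an underscore; the pattern has no delimiters.
$result = mb_ereg_replace('[éï]', '_', $input);

if ($result === false || $result === null) {
    // false: the replacement failed; null: $input is not valid in the current encoding.
    throw new RuntimeException('mb_ereg_replace() failed');
}

echo $result, PHP_EOL;
```

Checking for both false and null before using the result avoids silently passing a non-string value further down the line.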
Changelog
Version | Description
8.0.0 | options is nullable now.
7.1.0 | The function checks whether string is valid for the current encoding.
7.1.0 | The e modifier has been deprecated.
Notes
Note:
The internal encoding or the character encoding specified by mb_regex_encoding() will be used as the character encoding for this function.
Warning
Never use the e modifier when working on untrusted input. No automatic escaping will happen (as known from preg_replace()). Not taking care of this will most likely create remote code execution vulnerabilities in your application.
See Also
- mb_regex_encoding() - Set/Get character encoding for multibyte regex
- mb_eregi_replace() - Replace regular expression with multibyte support ignoring case
Pluche ¶
11 years ago
Unlike preg_replace, mb_ereg_replace doesn't use delimiters around the pattern.
Example with preg_replace:
Example with mb_ereg_replace:
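As a hedged illustration of the difference in delimiters (the sample strings and variable names are made up for this sketch):

```php
<?php
mb_regex_encoding('UTF-8');

$subject = 'color colour';

// preg_replace: the pattern is wrapped in delimiters, modifiers follow the closing one.
$viaPreg = preg_replace('/colou?r/u', 'paint', $subject);

// mb_ereg_replace: the bare pattern; options go in the optional fourth argument.
$viaMb = mb_ereg_replace('colou?r', 'paint', $subject);

// Slashes in an mb_ereg_replace pattern are matched literally, not treated as delimiters.
$literalSlashes = mb_ereg_replace('/a/', 'x', 'a /a/');

echo $viaPreg, PHP_EOL, $viaMb, PHP_EOL, $literalSlashes, PHP_EOL;
```

Copying a preg_replace pattern over without removing its delimiters therefore usually results in no match at all.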
daemoneye at gmail dot com ¶
13 years ago
I got a pretty nasty error while trying to parse table rows (all contents were set to UTF-8) from the database for a dictionary project. The idea was to get all the rows from the first table (that is, a table with a Bulgarian phrase in the first field and its translation in English, French and German in the next fields). I needed to index all the Bulgarian words found in the table to make an intelligent search. And that is where my headache started.
First of all, even with mb_strtolower() a lot of Cyrillic characters were corrupted (e.g. 'т', 'ъ', 'у', 'ф', 'б', 'г', 'з', 'ж', etc.). After an hour of different attempts I arrived at this solution:
I am posting this solution just in case someone has encountered a similar problem. Hope it helps you in case you need something like that.
To make it work properly I had to set all the internal encoding settings to UTF-8. Otherwise the default Latin-1 left half my database with missing characters.
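The original code of this note is not preserved here, so the following is only a sketch of the idea it describes: set the encodings to UTF-8 first, then lowercase and split the phrase. The sample phrase and variable names are assumptions.

```php
<?php
// Set both the internal and the regex encoding before any mb_* call.
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// Hypothetical phrase standing in for a database field.
$phrase = 'Добър ден, Свят';

// mb_strtolower() falls back to the internal encoding when none is passed.
$lower = mb_strtolower($phrase);

// Split on runs of whitespace and commas to get the individual words.
$words = mb_split('[\s,]+', $lower);

// A simple index of distinct words.
$index = array_values(array_unique($words));

print_r($index);
```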
Anonymous ¶
6 years ago
Pluche's comment should REALLY be added to the documentation, preferably under the "$pattern" param description. It is crucial to using this function.
trng ¶
11 years ago
You can use \\n to reference a capture group in the replacement.
You can NOT use the $n notation (unlike with the preg_replace function).
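A short sketch of both notations, using made-up input, to show what each one does in the replacement string:

```php
<?php
mb_regex_encoding('UTF-8');

// '\\2 \\1' in a single-quoted PHP string is the text \2 \1,
// which mb_ereg_replace() expands to the second and first capture groups.
$swapped = mb_ereg_replace('(\w+) (\w+)', '\\2 \\1', 'hello world');

// The $n notation is not interpreted; it is inserted as literal text.
$literal = mb_ereg_replace('(\w+) (\w+)', '$2 $1', 'hello world');

echo $swapped, PHP_EOL; // world hello
echo $literal, PHP_EOL; // $2 $1
```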
keizo at gomo dot jp ¶
14 years ago
You can use \\n to reference a capture group in the replacement.
Alexey Khrulev ¶
5 years ago
If the encoding of the PHP script differs from the encoding of the string to be processed by mb_ereg_replace(), you can't just write the pattern in the script. Both $pattern and $replacement must be converted to the same encoding as the string to be processed. In this example the script is in UTF-8 and the file to be processed is in UTF-16LE:
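The original example is not preserved here. A minimal sketch of the conversion described above might look as follows; the in-memory $data stands in for the file contents, and all names are illustrative assumptions.

```php
<?php
// Stand-in for the contents of a UTF-16LE file (assumption for this sketch).
$data = mb_convert_encoding('one cat, two cats', 'UTF-16LE', 'UTF-8');

// Remember the current regex encoding so it can be restored afterwards.
$previous = mb_regex_encoding();
mb_regex_encoding('UTF-16LE');

// The script is UTF-8, so convert pattern and replacement to the data's encoding.
$pattern     = mb_convert_encoding('cat', 'UTF-16LE', 'UTF-8');
$replacement = mb_convert_encoding('dog', 'UTF-16LE', 'UTF-8');

$result = mb_ereg_replace($pattern, $replacement, $data);

mb_regex_encoding($previous);

// Convert back to UTF-8 only for display.
$asUtf8 = mb_convert_encoding($result, 'UTF-8', 'UTF-16LE');
echo $asUtf8, PHP_EOL;
```

Restoring the previous regex encoding keeps later mb_ereg_* calls in the same script unaffected.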
Anonymous ¶
15 years ago
The 'i' option does not work correctly with multibyte characters. The function does not locate or replace a multibyte string when its case differs from the case of the multibyte needle.
Anonymous ¶
3 months ago
Notations to reference captures in the replacement string:
// Result: « Hehe, ca marche ! »
To rewrite a phrase as a URI (with createFromRules):
// Result : « hehe-ca-marche »
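The code of this note is not preserved here. As a rough sketch of the same goal, the following uses a small strtr() map in place of createFromRules (a deliberate substitution, so no intl extension is needed) and mb_ereg_replace() for the separators. The function name, the map and the input phrase are assumptions.

```php
<?php
mb_regex_encoding('UTF-8');

// Hypothetical helper: turns a phrase into a URI-friendly slug.
function example_slugify(string $text): string
{
    // Tiny illustrative transliteration map; a real one would be much larger.
    $map  = ['é' => 'e', 'ç' => 'c'];
    $text = mb_strtolower(strtr($text, $map), 'UTF-8');

    // Collapse every run of characters other than a-z and 0-9 into one hyphen.
    $text = (string) mb_ereg_replace('[^a-z0-9]+', '-', $text);

    // Drop hyphens left over at either end.
    return trim($text, '-');
}

echo example_slugify('Héhé, ça marche !'), PHP_EOL;
```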
vondrej[at]gmail[dot]com ¶
16 years ago
Are you looking for htmlentities() for multibyte strings? This might help you: it just replaces <, >, &, " and '.
gmx dot net at ulrich dot mierendorff ¶
14 years ago
If you want to replace characters like "ä" or "ø" you can use mb_ereg_replace, but it is very slow. str_replace is much faster and also works with characters like "ä" or "ø"!
I think this has something to do with the fact that str_replace works at the byte level and does not care about characters.
I hope that can help.
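A small sketch comparing the two approaches on UTF-8 input. The sample text is made up and no timing is measured here; it only shows that both produce the same result for literal replacements.

```php
<?php
mb_regex_encoding('UTF-8');

$text = 'Bär Køb';

// Byte-level replacement: works on UTF-8 because one character's byte
// sequence cannot start in the middle of another character's sequence.
$viaStr = str_replace(['ä', 'ø'], ['ae', 'o'], $text);

// The same replacements through the multibyte regex engine.
$viaMb = mb_ereg_replace('ø', 'o', mb_ereg_replace('ä', 'ae', $text));

echo $viaStr, PHP_EOL, $viaMb, PHP_EOL;
```

This holds for UTF-8; for other multibyte encodings a byte-level match is not guaranteed to fall on a character boundary, so the shortcut should be treated as an assumption about the data.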
faxe at neostrada dot pl ¶
17 years ago
A simple mb_str_ireplace() implementation - a faster (?) replacement for non-regexp multi-byte string replacement:
[thiago - EDITOR NOTE: This function has improvements from d-okumura [aat] fi{dot}kyd[dot]co.jp]
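The implementation itself is not preserved here. The following is not the author's code, only a sketch of how such a function could be written with mb_stripos() and mb_substr(). The name example_mb_str_ireplace is hypothetical, and the sketch assumes a case-insensitive match has the same character length as the search string.

```php
<?php
// Hypothetical case-insensitive, non-regex multibyte replace.
function example_mb_str_ireplace(
    string $search,
    string $replace,
    string $subject,
    string $encoding = 'UTF-8'
): string {
    if ($search === '') {
        return $subject;
    }

    // Lengths are computed once, outside the loop.
    $searchLen  = mb_strlen($search, $encoding);
    $subjectLen = mb_strlen($subject, $encoding);
    $result = '';
    $offset = 0;

    while ($offset < $subjectLen
        && ($pos = mb_stripos($subject, $search, $offset, $encoding)) !== false) {
        // Copy the text before the match, then the replacement.
        $result .= mb_substr($subject, $offset, $pos - $offset, $encoding) . $replace;
        $offset = $pos + $searchLen;
    }

    // Append whatever follows the last match.
    return $result . mb_substr($subject, $offset, null, $encoding);
}

echo example_mb_str_ireplace('äpfel', 'Birnen', 'Äpfel und äpfel, ÄPFEL'), PHP_EOL;
```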
marco at thenetworksolution dot it ¶
8 years ago
To selectively uppercase parts of a string via mb_eregi_replace:
$str = mb_eregi_replace('\b([0-9]{1,4}[a-z]{1,2})\b', "strtoupper('\\1')", $str, 'e');
Full example: how to fix a manually typed address, uppercasing the first letter of each word and keeping Roman numerals and the letters A, B, C after the house number uppercase:
function ucAddress($str) {
    // first lowercase all and use the default ucwords
    $str = ucwords(strtolower($str));
    // let's fix the default ucwords...
    // uppercase letters after house number (was lowercased by the strtolower above)
    $str = mb_eregi_replace('\b([0-9]{1,4}[a-z]{1,2})\b', "strtoupper('\\1')", $str, 'e');
    // the same for roman numerals
    $str = mb_eregi_replace('\bM{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\b', "strtoupper('\\0')", $str, 'e');
    return $str;
}
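Since the changelog above lists the e modifier as deprecated, here is a hedged sketch of the house-number step rewritten with mb_ereg_replace_callback() instead. The function name example_uc_house_number and the sample address are assumptions, and the Roman numeral step is left out for brevity.

```php
<?php
mb_regex_encoding('UTF-8');

// Hypothetical variant of the house-number fix without the e modifier.
function example_uc_house_number(string $str): string
{
    // First lowercase everything and apply the default ucwords().
    $str = ucwords(strtolower($str));

    // Uppercase the letters that follow a house number, e.g. "12b" becomes "12B".
    $result = mb_ereg_replace_callback(
        '\b([0-9]{1,4}[a-z]{1,2})\b',
        static function (array $matches): string {
            // $matches[1] holds the first capture group.
            return strtoupper($matches[1]);
        },
        $str,
        'i' // case-insensitive, mirroring mb_eregi_replace()
    );

    // Fall back to the ucwords() result if the replacement fails.
    return is_string($result) ? $result : $str;
}

echo example_uc_house_number('via ROMA 12b'), PHP_EOL;
```

The callback receives the matches as an array, so no code is ever built from the input string.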
squeegee ¶
15 years ago
well, if you just calculated the length of the find and replace strings once instead of on every loop, it would likely speed it up a lot.
mpnicholas [@t] gmail [dot] com ¶
16 years ago
Regarding the mb_str_ireplace() function: I benchmarked it against mb_eregi_replace() for single-character substitution, and it was significantly slower. Despite avoiding the ereg call, I think the while loop ends up slowing you down too much for this to be practical.
ms2705335 at gmail dot com ¶
5 years ago
As trng mentioned before you can use \\n for replacement but NOT \\\\n as mentioned in preg_replace docs. So string definition will be like:
$str = '\\1';