PHP: remove all non-UTF-8 characters

If you apply utf8_encode() to a string that is already UTF-8, it will return garbled UTF-8 output.
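To make the garbling concrete, here is a minimal sketch of what happens at the byte level. It assumes the mbstring extension is available and uses mb_convert_encoding() from ISO-8859-1 as a stand-in for utf8_encode(), which is deprecated as of PHP 8.2. The guard at the end is only a heuristic, since some Latin-1 byte sequences also happen to be valid UTF-8.

```php
<?php
// "é" encoded as UTF-8 is the two bytes C3 A9.
$utf8 = "\xC3\xA9";

// Treating those bytes as ISO-8859-1 and converting to UTF-8 again
// encodes each byte separately: C3 -> C3 83, A9 -> C2 A9.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";          // c3a9
echo bin2hex($doubleEncoded), "\n"; // c383c2a9

// Heuristic guard: only convert when the input is not already valid UTF-8.
$safe = mb_check_encoding($utf8, 'UTF-8')
    ? $utf8
    : mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
```

Displayed as text, the double-encoded bytes show up as "Ã©" instead of "é".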

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data that was all messed up, mixing those encodings in the same string.


use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($mixed_string);

$latin1_string = Encoding::toLatin1($mixed_string);

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled as a result of having been encoded into UTF-8 multiple times.


use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);


echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
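For intuition, repeated encoding can be undone by reversing the conversion until the result stops being valid UTF-8. The sketch below is not the ForceUTF8 implementation. The function name undoRepeatedEncoding() and the $maxPasses parameter are made up for illustration, it assumes the mbstring extension, and it only handles text that was mis-read as ISO-8859-1 (not Windows-1252).

```php
<?php
// Hypothetical helper: peel off layers of "UTF-8 re-encoded as UTF-8".
function undoRepeatedEncoding(string $s, int $maxPasses = 5): string
{
    for ($pass = 0; $pass < $maxPasses; $pass++) {
        // Stop on invalid UTF-8, or when a code point above U+00FF is present:
        // such text cannot be the product of a Latin-1 -> UTF-8 conversion.
        if (!mb_check_encoding($s, 'UTF-8')
            || preg_match('/[^\x{00}-\x{FF}]/u', $s) === 1) {
            break;
        }
        // Reverse one layer: every code point becomes a single byte again.
        $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
        // Keep the result only if it changed and is itself valid UTF-8.
        if ($decoded === $s || !mb_check_encoding($decoded, 'UTF-8')) {
            break;
        }
        $s = $decoded;
    }
    return $s;
}

// "é" encoded twice (C3 83 C2 A9) comes back as the plain two-byte "é".
echo undoRepeatedEncoding("F\xC3\x83\xC2\xA9d\xC3\x83\xC2\xA9ration"), "\n";
```

A string that really is meant to contain a sequence such as "Ã©" would be altered by this heuristic, so it is only suitable for data known to be affected.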



Jason Aller


answered Aug 19, 2010 at 11:44


Using a regex approach:

$regex = … >= 0x20) && ($current … >= 0xE000) && ($current … >= 0x10000) && ($current …

$arr = [
    … => 'Hello from Denmark with æøå',
    'Non-printable chars'   => "\x7FHello with invalid chars\r \x00"
];

foreach ($arr as $k => $v) {
    echo "$k:\n---------\n";
    $len = strlen($v);
    echo "$v\n(".$len.")\n";
    $strip = utf8_decode(utf8_filter(utf8_encode($v)));
    $strip_len = strlen($strip);
    echo $strip."\n(".$strip_len.")\n\n";
    echo "Chars removed: ".($len - $strip_len)."\n\n\n";
}
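The loop above calls utf8_filter(), whose definition is not shown here. As a rough stand-in, and not the answer's own code, a filter that drops non-printable control characters could look like the sketch below. The name stripControlChars() is hypothetical, and the mbstring extension is assumed.

```php
<?php
// Hypothetical stand-in for a filter that removes non-printable characters.
function stripControlChars(string $value): string
{
    // Replace malformed byte sequences first so the /u regex below
    // never receives invalid UTF-8 (preg_replace would return null).
    $valid = mb_convert_encoding($value, 'UTF-8', 'UTF-8');

    // Remove every control character (Unicode category Cc) except "\n".
    $clean = preg_replace('/(?!\n)\p{Cc}/u', '', $valid);

    return $clean ?? '';
}

echo stripControlChars("\x7FHello with invalid chars\r \x00"), "\n";
```

For the second test string this removes the three characters \x7F, \r and \x00 and leaves everything else, including the trailing space, untouched.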


answered Sep 10, 2019 at 13:16




$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));

answered Sep 9, 2009 at 23:53

Alix Axel


So the rules are that the first octet of a multi-octet UTF-8 character has the high bit set as a marker, followed by 1 to 3 more set bits that indicate how many additional octets follow; each of the additional octets must then have its high two bits set to 10.

The pseudo-python would be:

newstring = ''
cont = 0
for each ch in string:
  if cont:
    if (ch >> 6) != 2: # high 2 bits are not 10
      # do whatever, e.g. skip it, or skip whole point, or?
    else:
      # acceptable continuation of multi-octet char
      newstring += ch
    cont -= 1
  else:
    if (ch >> 7): # high bit set?
      c = (ch …
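The rules above can be sketched in PHP as a byte scanner. This is a minimal illustration and not the answerer's code: the function name stripMalformedUtf8() is hypothetical, and it checks only the lead/continuation bit structure, so overlong encodings and surrogate code points are not rejected.

```php
<?php
// Hypothetical helper: keep only structurally well-formed UTF-8 sequences.
function stripMalformedUtf8(string $bytes): string
{
    $out = '';
    $len = strlen($bytes);
    $i = 0;
    while ($i < $len) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {                    // 0xxxxxxx: single-octet ASCII
            $need = 0;
        } elseif (($b >> 5) === 0b110) {    // 110xxxxx: 1 continuation octet
            $need = 1;
        } elseif (($b >> 4) === 0b1110) {   // 1110xxxx: 2 continuation octets
            $need = 2;
        } elseif (($b >> 3) === 0b11110) {  // 11110xxx: 3 continuation octets
            $need = 3;
        } else {                            // stray continuation or invalid lead
            $i++;
            continue;
        }
        // All expected continuation octets must exist and start with bits 10.
        $ok = ($i + $need < $len);
        for ($k = 1; $ok && $k <= $need; $k++) {
            $ok = (ord($bytes[$i + $k]) >> 6) === 0b10;
        }
        if ($ok) {
            $out .= substr($bytes, $i, $need + 1);
            $i += $need + 1;
        } else {
            $i++; // drop the lead octet and rescan from the next byte
        }
    }
    return $out;
}

echo stripMalformedUtf8("caf\xC3\xA9 \xA9 ok"), "\n";
```

Dropping a bad lead octet and rescanning from the next byte is one possible policy; skipping the whole malformed sequence or substituting U+FFFD are equally reasonable choices.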
