PHP: remove all non-UTF-8 characters

If you apply utf8_encode() to a string that is already UTF-8, it returns garbled UTF-8 output.
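To make the problem concrete, here is a small sketch of what double encoding does to the bytes. It assumes the mbstring extension is enabled and uses mb_convert_encoding() as a stand-in for utf8_encode(), which is deprecated as of PHP 8.2. The sample string is only an illustration.

```php
<?php
// Sketch: encoding an already-UTF-8 string a second time.
// Assumption: mbstring is available; mb_convert_encoding() from
// ISO-8859-1 to UTF-8 plays the role of utf8_encode().

$original = "F\xC3\xA9d\xC3\xA9ration";   // "Fédération" as UTF-8 bytes

// Misread the UTF-8 bytes as Latin-1 and encode them "again".
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo bin2hex("\xC3\xA9"), "\n";           // c3a9: "é" is two bytes in UTF-8
echo strlen($original), " bytes before, ", strlen($double), " bytes after\n";

// Undoing exactly one extra step restores the original bytes.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $original);
```

Each two-byte "é" becomes four bytes after the second pass, which is what shows up on screen as "Ã©".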

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin-1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can contain a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data all messed up, mixing those encodings in the same string.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($mixed_string);

$latin1_string = Encoding::toLatin1($mixed_string);
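For comparison, a much simpler guard can be sketched with mbstring alone. This is not how ForceUTF8 works and it does not handle strings that mix encodings; to_utf8_if_needed() and its Windows-1252 fallback are assumptions made for illustration.

```php
<?php
// Hypothetical helper, not part of ForceUTF8: convert only when needed.
// Assumption: mbstring is available, and any string that is not valid
// UTF-8 is entirely in the fallback encoding.
function to_utf8_if_needed(string $s, string $fallback = 'Windows-1252'): string
{
    // Valid UTF-8 is returned untouched, which avoids double encoding.
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    // Otherwise treat the whole string as the fallback encoding.
    return mb_convert_encoding($s, 'UTF-8', $fallback);
}

echo bin2hex(to_utf8_if_needed("caf\xE9")), "\n";      // Latin-1 / Windows-1252 input
echo bin2hex(to_utf8_if_needed("caf\xC3\xA9")), "\n";  // input that is already UTF-8
```

Both calls end up with the same UTF-8 bytes. A string that contains both kinds of bytes at once would fail the validity check as a whole and be converted wholesale, which is the case the library is meant to cover.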

I've included another function, Encoding::fixUTF8(), which fixes any UTF-8 string that looks garbled as a result of having been encoded into UTF-8 multiple times.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
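The idea of peeling off repeated encodings can be sketched as follows. This is not the ForceUTF8 implementation; undo_repeated_utf8() is a hypothetical name, the sketch assumes mbstring, and it assumes each extra pass misread the bytes as Latin-1 (ISO 8859-1).

```php
<?php
// Hypothetical sketch, not the ForceUTF8 code.
// Assumption: every extra encoding pass treated the bytes as ISO-8859-1.
function undo_repeated_utf8(string $s, int $maxRounds = 5): string
{
    for ($i = 0; $i < $maxRounds; $i++) {
        // Only strings made of code points U+0000..U+00FF can be the
        // result of re-encoding Latin-1 bytes. Invalid UTF-8 makes
        // preg_match() return false, which also stops the loop.
        if (!preg_match('/^[\x{00}-\x{FF}]*$/u', $s)) {
            break;
        }
        $candidate = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
        // Stop when nothing changed (pure ASCII) or when the result is
        // no longer valid UTF-8 (the string was encoded only once).
        if ($candidate === $s || !mb_check_encoding($candidate, 'UTF-8')) {
            break;
        }
        $s = $candidate;
    }
    return $s;
}

$once   = "F\xC3\xA9d\xC3\xA9ration";                         // valid UTF-8
$twice  = mb_convert_encoding($once, 'UTF-8', 'ISO-8859-1');  // encoded again
$thrice = mb_convert_encoding($twice, 'UTF-8', 'ISO-8859-1'); // and again

var_dump(undo_repeated_utf8($thrice) === $once);
```

The approach is a heuristic: text that legitimately contains a sequence such as "Ã©" is indistinguishable from double-encoded text and would be "fixed" as well.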

Download:

https://github.com/neitanod/forceutf8

Jason Aller

answered Aug 19, 2010 at 11:44

Using a regex approach:

$regex = ...

... ($current >= 0x20) && ($current <= ...) ... ($current >= 0xE000) && ($current <= ...) ... ($current >= 0x10000) && ($current <= ...) ...

$arr = [
    'Hello from Denmark with æøå',
    'Non-printable chars'   => "\x7FHello with invalid chars\r \x00"
];

foreach($arr as $k => $v){
    echo "$k:\n---------\n";

    // Byte length before filtering
    $len = strlen($v);
    echo "$v\n(".$len.")\n";

    // Note: utf8_encode() and utf8_decode() are deprecated as of PHP 8.2
    $strip = utf8_decode(utf8_filter(utf8_encode($v)));
    $strip_len = strlen($strip);
    echo $strip."\n(".$strip_len.")\n\n";

    echo "Chars removed: ".($len - $strip_len)."\n\n\n";
}

//www.tehplayground.com/q5sJ3FOddhv1atpR
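The loop above relies on a utf8_filter() helper whose definition is not shown here. As a rough stand-in, a filter along these lines would drop invalid byte sequences and most control characters. The function name and the exact set of characters removed are assumptions, and the sketch requires mbstring.

```php
<?php
// Hypothetical stand-in for a filter such as utf8_filter(); the name and
// the rules below are assumptions, not the original answer's code.
function strip_invalid_and_control(string $s): string
{
    // Re-encoding UTF-8 as UTF-8 makes mbstring handle invalid bytes.
    // With the substitute character set to "none" they are dropped.
    $previous = mb_substitute_character();
    mb_substitute_character('none');
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);   // restore the global setting

    // Remove C0 control characters and DEL, keeping tab, LF and CR.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $clean) ?? '';
}

echo bin2hex(strip_invalid_and_control("\x7FHello\r \x00")), "\n";
echo bin2hex(strip_invalid_and_control("a\xFFb")), "\n";
```

Because the filter works on UTF-8 directly, the surrounding utf8_encode() and utf8_decode() calls are not needed when the input is meant to be UTF-8 already.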

answered Sep 10, 2019 at 13:16

clarkk

$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
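Note that this one-liner replaces accented letters with their base letters rather than removing invalid bytes. Wrapped in a function it might look like the sketch below; remove_accents() is a hypothetical name, and the final html_entity_decode() step is an addition so that characters such as "<" and "&" do not stay behind as entities.

```php
<?php
// Sketch wrapping the one-liner; remove_accents() is a hypothetical name.
// Assumption: the input is valid UTF-8 (htmlentities() returns an empty
// string for invalid input unless extra flags are passed).
function remove_accents(string $s): string
{
    // "é" becomes "&eacute;", "æ" becomes "&aelig;", and so on.
    $entities = htmlentities($s, ENT_COMPAT, 'UTF-8');

    // Keep the base letter(s) and drop the accent part of each entity.
    $plain = preg_replace(
        '~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i',
        '$1',
        $entities
    );

    // Turn the remaining entities (&amp;, &lt;, ...) back into characters.
    return html_entity_decode($plain, ENT_COMPAT, 'UTF-8');
}

echo remove_accents("F\xC3\xA9d\xC3\xA9ration Camerounaise de Football"), "\n";
// Federation Camerounaise de Football
echo remove_accents("Hello from Denmark with \xC3\xA6\xC3\xB8\xC3\xA5"), "\n";
// Hello from Denmark with aeoa
```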

answered Sep 9, 2009 at 23:53

Alix Axel

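So the rules are that the first octet of a multi-octet UTF-8 character has the high bit set as a marker, followed by one to three more set bits (a lead octet starting with 110, 1110, or 11110 in modern UTF-8) that indicate how many additional octets follow; each of the additional octets must then have its high two bits set to 10.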

The pseudo-python would be:

newstring = ''
cont = 0
for each ch in string:
  if cont:
    if (ch >> 6) != 2: # high 2 bits must be 10
      # do whatever, e.g. skip it, or skip whole point, or?
    else:
      # acceptable continuation of multi-octet char
      newstring += ch
    cont -= 1
  else:
    if (ch >> 7): # high bit set?
      c = (ch ...
