PHP: remove all non-UTF-8 characters

Question

If you apply utf8_encode() to a string that is already UTF-8, it will return garbled UTF-8 output.
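To illustrate the problem, here is a minimal sketch. It uses mb_convert_encoding() with an ISO-8859-1 source, which performs the same Latin-1 to UTF-8 conversion as utf8_encode() (deprecated as of PHP 8.2). The helper name to_utf8_once() is made up for this example, and it assumes the input is either valid UTF-8 or Latin-1.

```php
<?php
// "é" already encoded as UTF-8 (two bytes: C3 A9).
$utf8 = "\xC3\xA9";

// Treating those bytes as Latin-1 and converting again garbles them:
// each byte becomes its own two-byte character ("Ã©").
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($garbled), "\n"; // c383c2a9

// Hypothetical guard: convert only when the input is not valid UTF-8 yet.
function to_utf8_once(string $s): string
{
    return mb_check_encoding($s, 'UTF-8')
        ? $s
        : mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(to_utf8_once($utf8)), "\n";  // c3a9 (left alone)
echo bin2hex(to_utf8_once("\xE9")), "\n"; // c3a9 (Latin-1 "é" converted)
```

A guard like this cannot help with strings that mix encodings, which is the case the function described next is meant for.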

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data all messed up, mixing those encodings in the same string.

Usage:

require_once('Encoding.php'); 
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($mixed_string);

$latin1_string = Encoding::toLatin1($mixed_string);

I've included another function, Encoding::fixUTF8(), which fixes any UTF-8 string that looks garbled as a result of having been encoded into UTF-8 multiple times.

Usage:

require_once('Encoding.php'); 
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
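For intuition only, the following sketch shows one way repeated encoding could be undone. It is not the library's implementation: the function name undo_repeated_utf8() and the round limit are assumptions, and it only handles text that was re-encoded via ISO-8859-1 (not Windows-1252). It peels off one layer at a time for as long as the result is still valid UTF-8.

```php
<?php
// Hypothetical helper, not part of ForceUTF8.
function undo_repeated_utf8(string $s, int $maxRounds = 5): string
{
    for ($i = 0; $i < $maxRounds; $i++) {
        // Only continue if every code point fits in a single Latin-1 byte
        // (preg_match returns false for invalid UTF-8, which also stops the loop).
        if (preg_match('/^[\x{00}-\x{FF}]*$/u', $s) !== 1) {
            break;
        }
        $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
        // Stop when nothing changed or when the previous layer was the real text.
        if ($decoded === $s || !mb_check_encoding($decoded, 'UTF-8')) {
            break;
        }
        $s = $decoded;
    }
    return $s;
}

$once   = "F\xC3\xA9d\xC3\xA9ration";                          // correct UTF-8
$twice  = mb_convert_encoding($once, 'UTF-8', 'ISO-8859-1');   // encoded twice
$thrice = mb_convert_encoding($twice, 'UTF-8', 'ISO-8859-1');  // encoded three times

var_dump(undo_repeated_utf8($thrice) === $once); // bool(true)
```

A heuristic like this can over-correct text that legitimately contains sequences such as "Ã©", so it is only safe on data known to be damaged.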

Download:

https://github.com/neitanod/forceutf8

Jason Aller

answered Aug 19, 2010 at 11:44

Using a regex approach:

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3 
    ){1,100}                        # ...one or more times
  )
| .                                 # anything else
/x
END;
$text = preg_replace($regex, '$1', $text); // preg_replace() returns the result; assign it

It searches for UTF-8 sequences, and captures those into group 1. It also matches single bytes that could not be identified as part of a UTF-8 sequence, but does not capture those. Replacement is whatever was captured into group 1. This effectively removes all invalid bytes.
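As a small worked example, the sketch below applies a one-line version of the same pattern to a string with three stray bytes. The sample input is made up. Note that the pattern checks structure only, so overlong forms such as \xC0\x80 still pass.

```php
<?php
// Group 1 captures runs of well-formed sequences; "." consumes one stray byte.
$regex = '/((?:[\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3}){1,100})|./';

// Valid "café au lait" with a stray \xFF, a stray \x80 and a truncated \xC3.
$dirty = "caf\xC3\xA9 \xFFau\x80 lait\xC3";

$clean = preg_replace($regex, '$1', $dirty);
echo $clean, "\n"; // café au lait
```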

It is possible to repair the string by re-encoding the invalid bytes as UTF-8 characters, but if the errors are random, this could leave some strange symbols.

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3 
    ){1,100}                      # ...one or more times
  )
| ( [\x80-\xBF] )                 # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] )                 # invalid byte in range 11000000 - 11111111
/x
END;
function utf8replacer($captures) {
  if ($captures[1] != "") {
    // Valid byte sequence. Return unmodified.
    return $captures[1];
  }
  elseif ($captures[2] != "") {
    // Invalid byte of the form 10xxxxxx.
    // Encode as 11000010 10xxxxxx.
    return "\xC2".$captures[2];
  }
  else {
    // Invalid byte of the form 11xxxxxx.
    // Encode as 11000011 10xxxxxx.
    return "\xC3".chr(ord($captures[3])-64);
  }
}
$text = preg_replace_callback($regex, "utf8replacer", $text); // assign the returned string
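The two repair branches amount to reading each stray byte as an ISO-8859-1 character. The sketch below checks that claim for every byte from 0x80 to 0xFF; reencode_byte() is a hypothetical helper that mirrors the arithmetic in the callback.

```php
<?php
// Hypothetical helper mirroring the two repair branches.
function reencode_byte(string $byte): string
{
    $n = ord($byte);
    return $n < 0xC0
        ? "\xC2" . $byte          // 10xxxxxx -> 11000010 10xxxxxx
        : "\xC3" . chr($n - 64);  // 11xxxxxx -> 11000011 10xxxxxx
}

// Compare against a Latin-1 to UTF-8 conversion for all high bytes.
$mismatches = 0;
for ($n = 0x80; $n <= 0xFF; $n++) {
    $byte = chr($n);
    if (reencode_byte($byte) !== mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1')) {
        $mismatches++;
    }
}
echo $mismatches, "\n"; // 0
```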

EDIT:

!empty(x) will match non-empty values ("0" is considered empty).
x != "" will match non-empty values, including "0".
x !== "" will match anything except "".

x != "" seems to be the best one to use in this case.
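A quick sketch of how the three tests behave for the values a capture group can hold (the sample values are chosen for illustration):

```php
<?php
$results = [];
foreach (['', '0', 'a'] as $x) {
    // [ !empty(x), x != "", x !== "" ]
    $row = [!empty($x), $x != '', $x !== ''];
    $results[$x] = $row;
    echo json_encode($x), ' => ', json_encode($row), "\n";
}
// "" => [false,false,false]
// "0" => [false,true,true]
// "a" => [true,true,true]
```

Only !empty() rejects a captured "0", which is why it is the wrong test for the callback.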

I have also sped up the match a little. Instead of matching each character separately, it matches sequences of valid UTF-8 characters.

answered Sep 9, 2009 at 19:49

Markus Jarderot

Note that this function removes all non-ASCII characters; it's useful, but it doesn't really answer the question.

This is my function that always works, regardless of encoding:

function remove_bs($Str) {  
  $StrArr = str_split($Str); $NewStr = '';
  foreach ($StrArr as $Char) {    
    $CharNo = ord($Char);
    // keep £ (byte 0xA3 in ISO-8859-1; in UTF-8 input this leaves a lone continuation byte)
    if ($CharNo == 163) { $NewStr .= $Char; continue; }
    if ($CharNo > 31 && $CharNo < 127) {
      $NewStr .= $Char;    
    }
  }  
  return $NewStr;
}

How it works:

echo remove_bs('Hello õhowå åare youÆ?'); // Hello how are you?

John

answered Nov 20, 2013 at 17:50

David D

Try this:

$string = iconv("UTF-8","UTF-8//IGNORE",$string);

According to the iconv manual, the function will take the first parameter as the input charset, second parameter as the output charset, and the third as the actual input string.

If you set both the input and output charset to UTF-8 and append the //IGNORE flag to the output charset, the function will drop (strip) all characters in the input string that can't be represented in the output charset, which in effect filters the input string.
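iconv() is documented to return false on failure, and how //IGNORE behaves depends on the underlying iconv library, so a defensive sketch might check the result before using it. The wrapper name iconv_strip_invalid() and the mbstring fallback are assumptions added for this example.

```php
<?php
// Hypothetical wrapper; the name is made up for this example.
function iconv_strip_invalid(string $s): string
{
    // @ suppresses the notice iconv() raises if it rejects the input.
    $out = @iconv('UTF-8', 'UTF-8//IGNORE', $s);
    if ($out === false) {
        // Fallback: mbstring replaces each invalid sequence (by default with "?").
        $out = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    }
    return $out;
}

$clean = iconv_strip_invalid("abc\xFFdef");
var_dump(mb_check_encoding($clean, 'UTF-8')); // bool(true)
```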

answered Dec 17, 2014 at 15:24

technoarya

UConverter can be used since PHP 5.5. It is the better choice if you use the intl extension and don't use mbstring.

function replace_invalid_byte_sequence($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence2($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}

htmlspecialchars() can be used to replace invalid byte sequences since PHP 5.4 (ENT_SUBSTITUTE substitutes U+FFFD for each invalid sequence). It handles large inputs better and more accurately than preg_match(); many incorrect regular-expression implementations are in circulation.

function replace_invalid_byte_sequence3($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
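To make the effect of the flag concrete, this sketch runs the same round trip on a string with one invalid byte, once with ENT_SUBSTITUTE and once with ENT_IGNORE. The sample input is made up; the PHP manual discourages ENT_IGNORE because dropping bytes may have security implications.

```php
<?php
$dirty = "a\xFFb";

// ENT_SUBSTITUTE: each invalid sequence becomes U+FFFD (bytes EF BF BD).
$substituted = htmlspecialchars_decode(htmlspecialchars($dirty, ENT_SUBSTITUTE, 'UTF-8'));
echo bin2hex($substituted), "\n"; // 61efbfbd62

// ENT_IGNORE: invalid sequences are dropped instead.
$stripped = htmlspecialchars_decode(htmlspecialchars($dirty, ENT_IGNORE, 'UTF-8'));
echo bin2hex($stripped), "\n"; // 6162
```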

answered Jun 3, 2013 at 4:04

masakielastic

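You can use a simple regex:

$text = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text);

It strips the ASCII control characters and every byte from 0x80 to 0xFF, so it removes all non-ASCII characters from the string, including valid UTF-8 multibyte characters.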
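A quick check of what this pattern actually removes, using a made-up sample string that is entirely valid UTF-8:

```php
<?php
$text = "caf\xC3\xA9\tau lait\n"; // valid UTF-8 containing "é", a tab and a newline

$stripped = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text);
echo $stripped, "\n"; // cafau lait
```

Both bytes of the valid "é" are removed along with the tab and the newline, so the pattern is only suitable when plain ASCII output is what you want.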

answered Mar 17, 2021 at 16:09

HAT INC

I have made a function that deletes invalid UTF-8 characters from a string. I'm using it to clean the descriptions of 27000 products before generating the XML export file.

public function stripInvalidXml($value) {
    $ret = "";
    if (empty($value)) {
        return $ret;
    }
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++) {
        // $value[$i] replaces $value{$i}, which is no longer valid in PHP 8.
        // ord() returns a single byte value (0-255), so the ranges above 0xFF never apply:
        // in practice this drops control bytes other than tab, LF and CR.
        $current = ord($value[$i]);
        if (($current == 0x9) || ($current == 0xA) || ($current == 0xD) || (($current >= 0x20) && ($current <= 0xD7FF)) || (($current >= 0xE000) && ($current <= 0xFFFD)) || (($current >= 0x10000) && ($current <= 0x10FFFF))) {
            $ret .= chr($current);
        }
        // any other byte is dropped
    }
    return $ret;
}
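Because the loop above works on bytes, the higher ranges in its condition can never match. A sketch of the same filter applied per code point is shown below; it is an alternative, not the author's method. The function name strip_invalid_xml_chars() is made up, and the ranges are those of the XML 1.0 Char production.

```php
<?php
// Hypothetical helper: filter by code point rather than by byte.
function strip_invalid_xml_chars(string $value): string
{
    // Make the input valid UTF-8 first; preg_replace() with /u returns null otherwise.
    $value = mb_convert_encoding($value, 'UTF-8', 'UTF-8');

    // Remove every code point outside the XML 1.0 Char ranges.
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $value
    );
}

// NUL and vertical tab are removed; the tab and the "é" survive.
echo bin2hex(strip_invalid_xml_chars("a\x00b\x0Bc\t\xC3\xA9")), "\n"; // 61626309c3a9
```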

benRollag

answered Jul 16, 2014 at 23:46

mumin

Welcome to 2019 and the /u modifier in regex, which will handle UTF-8 multibyte chars for you.

If you only use mb_convert_encoding($value, 'UTF-8', 'UTF-8'), you will still end up with non-printable chars in your string.

This method will:

Remove all invalid UTF-8 multibyte chars with mb_convert_encoding
Remove all non-printable chars like \r, \x00 (NULL-byte) and other control chars with preg_replace

method:

function utf8_filter(string $value): string{
    return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
}

[:print:] matches all printable chars; together with \n, the negated class keeps printable chars and newlines and strips everything else.

The printable ASCII chars range from 32 to 126, but newline \n is one of the control chars, which range from 0 to 31, so we have to add newline to the regex: /[^[:print:]\n]/u

You can try sending strings through the regex with chars outside the printable range, like \x7F (DEL), \x1B (Esc) etc., and see how they are stripped:

function utf8_filter(string $value): string{
    return preg_replace('/[^[:print:]\n]/u', '', mb_convert_encoding($value, 'UTF-8', 'UTF-8'));
}

$arr = [
    'Danish chars'          => 'Hello from Denmark with æøå',
    'Non-printable chars'   => "\x7FHello with invalid chars\r \x00"
];

foreach($arr as $k => $v){
    echo "$k:\n---------\n";
    
    $len = strlen($v);
    echo "$v\n(".$len.")\n";
    
    $strip = utf8_decode(utf8_filter(utf8_encode($v)));
    $strip_len = strlen($strip);
    echo $strip."\n(".$strip_len.")\n\n";
    
    echo "Chars removed: ".($len - $strip_len)."\n\n\n";
}

https://www.tehplayground.com/q5sJ3FOddhv1atpR

answered Sep 10, 2019 at 13:16

clarkk

$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));

answered Sep 9, 2009 at 23:53

Alix Axel

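The rules are that the first octet of a multi-octet UTF-8 character has the high bit set as a marker, followed by one to three more 1 bits indicating how many additional octets follow; each additional octet must have its high two bits set to 10.

In Python, a version that simply drops anything malformed would be:

def strip_malformed(data):
    # data is a bytes object; only well-formed sequences are kept
    newstring = bytearray()
    pending = bytearray()  # octets of the character being assembled
    cont = 0               # continuation octets still expected
    for ch in data:
        if cont:
            if (ch >> 6) == 2:  # high 2 bits are 10
                # acceptable continuation of a multi-octet char
                pending.append(ch)
                cont -= 1
                if cont == 0:
                    newstring += pending
                continue
            # malformed: drop the partial char, then re-examine ch below
            cont = 0
        if (ch >> 7) == 0:  # high bit clear: plain ASCII
            newstring.append(ch)
            continue
        c = ch << 1  # strip the high-bit marker
        n = 0
        while c & 0x80:  # each further 1 bit announces another octet
            c <<= 1
            n += 1
        if 1 <= n <= 3:
            pending = bytearray([ch])
            cont = n
        # else: stray continuation octet (n == 0) or a lead octet
        # announcing more than 4 octets in total; skip it
    # if cont is still non-zero here, the last char was never
    # terminated and is dropped
    return bytes(newstring)

The same logic should be translatable to PHP. However, it's not clear what kind of stripping should be done once you hit a malformed character; the version above skips it, and it checks structure only (it does not reject overlong encodings or surrogates).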
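A possible PHP translation of that scanning logic is sketched below. The function name strip_malformed_utf8() and the choice to silently drop malformed bytes are assumptions; like the description above, it checks structure only and does not reject overlong encodings or surrogates.

```php
<?php
// Hypothetical helper: keep only structurally well-formed UTF-8 sequences.
function strip_malformed_utf8(string $in): string
{
    $out = '';
    $pending = '';  // bytes of the character being assembled
    $cont = 0;      // continuation bytes still expected
    $len = strlen($in);

    for ($i = 0; $i < $len; $i++) {
        $byte = ord($in[$i]);

        if ($cont > 0) {
            if (($byte >> 6) === 2) {      // 10xxxxxx: valid continuation
                $pending .= $in[$i];
                if (--$cont === 0) {
                    $out .= $pending;      // character complete
                }
                continue;
            }
            $cont = 0;                     // drop the partial character, re-examine this byte
        }

        if ($byte < 0x80) {                // 0xxxxxxx: ASCII
            $out .= $in[$i];
        } elseif (($byte & 0xE0) === 0xC0) {
            $pending = $in[$i]; $cont = 1; // 110xxxxx: one more byte
        } elseif (($byte & 0xF0) === 0xE0) {
            $pending = $in[$i]; $cont = 2; // 1110xxxx: two more bytes
        } elseif (($byte & 0xF8) === 0xF0) {
            $pending = $in[$i]; $cont = 3; // 11110xxx: three more bytes
        }
        // anything else (stray continuation, 0xF8-0xFF) is skipped
    }
    return $out; // an unterminated trailing character is dropped
}

echo strip_malformed_utf8("a\xC3b\xE2\x82\xACc\xA9"), "\n"; // ab€c
```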

answered Sep 9, 2009 at 18:49

Will

From a recent patch to Drupal's Feeds JSON parser module:

//remove everything except valid letters (from any language)
$raw = preg_replace('/(?:\\\\u[\pL\p{Zs}])+/', '', $raw);

In case you're concerned: yes, it retains spaces as valid characters.

It did what I needed. It removes the now-widespread emoji characters that don't fit into MySQL's 'utf8' character set and that gave me errors like "SQLSTATE[HY000]: General error: 1366 Incorrect string value".

For details see https://www.drupal.org/node/1824506#comment-6881382

answered Jun 25, 2015 at 3:41

substr() can break your multi-byte characters!

In my case, I was using substr($string, 0, 255) to ensure a user-supplied value would fit in the database. On occasion it would split a multi-byte character in half and cause database errors with "Incorrect string value".

You could use mb_substr($string,0,255), and it might be ok for MySQL 5, but MySQL 4 counts bytes instead of characters, so it would still be too long depending on the number of multi-byte characters.

To prevent these issues I implemented the following steps:

I increased the size of the field (in this case it was a log of changes, so rejecting longer input was not an option).
I still applied mb_substr() in case the value was still too long.
I used the accepted answer above by @Markus Jarderot, so that if a really long entry has a multi-byte character right at the length limit, the dangling half of that character is stripped from the end.
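As an alternative not mentioned in the steps above, mbstring's mb_strcut() truncates by byte count but backs up to a character boundary. The sketch below compares it with substr() on a made-up 256-byte input.

```php
<?php
// 254 ASCII bytes followed by a two-byte "é": 256 bytes in total.
$input = str_repeat('a', 254) . "\xC3\xA9";

$bySubstr = substr($input, 0, 255);             // cuts the "é" in half
$byStrcut = mb_strcut($input, 0, 255, 'UTF-8'); // stops before the "é"

var_dump(mb_check_encoding($bySubstr, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($byStrcut, 'UTF-8')); // bool(true)
echo strlen($byStrcut), "\n";                    // 254
```

This fits the case where the column limit is measured in bytes, since the result never exceeds the requested byte length.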

answered Jan 14, 2021 at 22:18

Frank Forte

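To remove all Unicode characters outside of the Unicode Basic Multilingual Plane (code points need the \x{...} form and the /u modifier; \xFFFF would be read as \xFF followed by "FF"):

$str = preg_replace("/[^\\x{0000}-\\x{FFFF}]/u", "", $str);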
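As a sketch, the same filter can also be written by matching the supplementary planes (U+10000 to U+10FFFF) directly, which is where emoji live. The sample string is made up, and the input is assumed to be valid UTF-8, since preg_replace() with /u returns null otherwise.

```php
<?php
// An emoji (U+1F600, outside the BMP) and a euro sign (U+20AC, inside the BMP).
$str = "ok \u{1F600} done \u{20AC}";

$bmpOnly = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str);
echo $bmpOnly, "\n"; // ok  done €
```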

Sam Hanley

answered Feb 8, 2013 at 16:55

I tried many of the solutions presented on this topic, but none of them worked for me in my specific case. But I found a good solution at this link: https://www.ryadel.com/en/php-skip-invalid-characters-utf-8-xml-file-string/

Basically, this is the function that solved it for me:

function sanitizeXML($string)
{
    if (!empty($string)) 
    {
        // remove EOT+NOREP+EOX|EOT+ sequence (FatturaPA)
        $string = preg_replace('/(\x{0004}(?:\x{201A}|\x{FFFD})(?:\x{0003}|\x{0004}).)/u', '', $string);
 
        $regex = '/(
            [\xC0-\xC1] # Invalid UTF-8 Bytes
            | [\xF5-\xFF] # Invalid UTF-8 Bytes
            | \xE0[\x80-\x9F] # Overlong encoding of prior code point
            | \xF0[\x80-\x8F] # Overlong encoding of prior code point
            | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
            | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
            | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
            | (?<=[\x0-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
            # remaining alternatives lost in this copy; see the linked article
        )/x';
        $string = preg_replace($regex, '', $string);

        // Keep only the bytes that pass the XML character-range test.
        $result = "";
        $length = strlen($string);
        for ($i = 0; $i < $length; $i++)
        {
            $current = ord($string[$i]);
            if (($current == 0x9) ||
                ($current == 0xA) ||
                ($current == 0xD) ||
                (($current >= 0x20) && ($current <= 0xD7FF)) ||
                (($current >= 0xE000) && ($current <= 0xFFFD)) ||
                (($current >= 0x10000) && ($current <= 0x10FFFF)))
            {
                $result .= chr($current);
            }
            // otherwise the invalid character is stripped;
            // use $result .= " "; here to replace it with a space instead
        }
        $string = $result;
    }
    return $string;
}

Hope it will help some of you.

answered May 14, 2021 at 19:51

Slightly different from the question, but what I do is use HtmlEncode(string):

Pseudocode:

var encoded = HtmlEncode(string);
encoded = Regex.Replace(encoded, "&#\d+?;", "");
var result = HtmlDecode(encoded);

input and output

"Headlight\x007E Bracket, { Cafe Racer<> Style,Â Stainless Steel 中文呢？"
"Headlight~ Bracket, { Cafe Racer<> Style, Stainless Steel 中文呢？"

I know it's not perfect, but does the job for me.

answered Dec 12, 2013 at 2:26

misaxi

Maybe not the most precise solution, but it gets the job done with a single line of code:

echo str_replace("?","",(utf8_decode($str)));

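utf8_decode will convert invalid sequences, and any character that ISO-8859-1 cannot represent, to a question mark;
str_replace will strip out the question marks (including any genuine ones). Note that the result is ISO-8859-1 encoded, not UTF-8.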

סטנלי גרונן

answered Dec 26, 2019 at 17:18

static $preg = <<<'END'
%(
[\x09\x0A\x0D\x20-\x7E]
| [\xC2-\xDF][\x80-\xBF]
| \xE0[\xA0-\xBF][\x80-\xBF]
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
| \xED[\x80-\x9F][\x80-\xBF]
| \xF0[\x90-\xBF][\x80-\xBF]{2}
| [\xF1-\xF3][\x80-\xBF]{3}
| \xF4[\x80-\x8F][\x80-\xBF]{2}
)%xs
END;
if (preg_match_all($preg, $string, $match)) {
    $string = implode('', $match[0]);
} else {
    $string = '';
}

It works on our service.

answered Jan 15, 2020 at 11:44

llluo

The following sanitizing works for me:

$string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');
$string = iconv("UTF-8", "UTF-8//IGNORE", $string);

answered May 26, 2021 at 11:39

Adam Pery

How about iconv:

http://php.net/manual/en/function.iconv.php

I haven't used it inside PHP itself, but it has always performed well for me on the command line. You can get it to substitute invalid characters.

answered Sep 9, 2009 at 19:53

Ben
