PHP convert string to utf8mb4

mb_convert_encoding is the answer. Another option is the iconv function. Both assume $var is not already in UTF-8, so you must first find out what character set $var is encoded in. If your variable is hardcoded into the PHP script itself, then either:

it's already utf-8 encoded

OR

your php script starts with



which worked perfectly until I upgraded, so I had to use


Hope it helps someone else out

francois at bonzon point com

13 years ago

aaron, to discard unsupported characters instead of printing a ?, you might as well simply set the configuration directive:

mbstring.substitute_character = "none"

in your php.ini. Be sure to include the quotes around none. The same setting can also be applied at run-time.
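A minimal sketch of the run-time variant, assuming the mbstring extension is loaded. It uses mb_substitute_character(), which is not necessarily the call the original note showed, and the sample string is made up for illustration:

```php
<?php
// Remember the current setting so it can be restored afterwards.
$previous = mb_substitute_character();

// Drop characters that cannot be represented in the target encoding
// instead of substituting "?" for them.
mb_substitute_character('none');

// The euro sign (U+20AC) has no ISO-8859-1 equivalent, so it is discarded.
$latin1 = mb_convert_encoding("a\u{20AC}b", 'ISO-8859-1', 'UTF-8');

// Put the earlier setting back so other code is not affected.
mb_substitute_character($previous);
```

Restoring the previous value keeps the change local to this one conversion.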

aaron at aarongough dot com

13 years ago

My solution below was slightly incorrect, so here is the correct version (I posted at the end of a long day, never a good idea!)

Again, this is a quick and dirty solution to stop mb_convert_encoding from filling your string with question marks whenever it encounters an illegal character for the target encoding.



Hope this helps someone! (Admins should feel free to delete my previous, incorrect, post for clarity)
-A

vasiliauskas dot agnius at gmail dot com

4 years ago

When you need to convert from HTML-ENTITIES but your UTF-8 string is partially broken (not all characters are valid UTF-8), passing the string to mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES'); corrupts the characters even more. In this case you need to replace the HTML entities gradually to preserve the correctly encoded characters. I wrote a closure for this job:

Stephan van der Feest

17 years ago

To add to the Flash conversion comment below, here's how I convert back from what I've stored in a database after converting from Flash HTML text field output, in order to load it back into a Flash HTML text field:

function htmltoflash($htmlstr)
{
  return str_replace("&lt;br /&gt;","\n",
    str_replace("<","&lt;",
      str_replace(">","&gt;",
        mb_convert_encoding(html_entity_decode($htmlstr),
        "UTF-8","ISO-8859-1"))));
}

Daniel Trebbien

13 years ago

Note that `mb_convert_encoding($val, 'HTML-ENTITIES')` does not escape '\'', '"', '<', '>', or '&'.

Rainer Perske

1 month ago

Text-encoding HTML-ENTITIES will be deprecated as of PHP 8.2.

To convert all non-ASCII characters into entities (to produce pure 7-bit HTML output), I was using:



I can get the identical result with:



The output contains well-known named entities for some often used characters and numeric entities for the rest.
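One way to get such 7-bit output without the deprecated HTML-ENTITIES encoding can be sketched as follows. This is an illustration, not necessarily the code from the note above: the helper name toAsciiHtml is hypothetical, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper; assumes valid UTF-8 input and the mbstring extension.
function toAsciiHtml(string $s): string
{
    // Named entities where HTML 4.01 defines one; also escapes & < > " '.
    $s = htmlentities($s, ENT_QUOTES, 'UTF-8');

    // Numeric entities for every remaining non-ASCII code point.
    // Map format: [first code point, last code point, offset, mask].
    return mb_encode_numericentity($s, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

$named   = toAsciiHtml("\u{E9}");    // e with acute accent, has a named entity
$numeric = toAsciiHtml("\u{2460}");  // circled digit one, has no named entity
```

Here $named becomes "&eacute;" and $numeric becomes "&#9312;", so the result contains only ASCII bytes.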

bmxmale at qwerty dot re

7 months ago

/**
* Convert Windows-1250 to UTF-8
* Based on //www.php.net/manual/en/function.mb-convert-encoding.php#112547
*/
class TextConverter
{
    private const ENCODING_TO = 'UTF-8';
    private const ENCODING_FROM = 'ISO-8859-2';

    private array $mapChrChr = [
        0x8A => 0xA9,
        0x8C => 0xA6,
        0x8D => 0xAB,
        0x8E => 0xAE,
        0x8F => 0xAC,
        0x9C => 0xB6,
        0x9D => 0xBB,
        0xA1 => 0xB7,
        0xA5 => 0xA1,
        0xBC => 0xA5,
        0x9F => 0xBC,
        0xB9 => 0xB1,
        0x9A => 0xB9,
        0xBE => 0xB5,
        0x9E => 0xBE
    ];

    private array $mapChrString = [
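        // ASCII entities: they survive the charset conversion untouched
        // and are resolved by html_entity_decode() in execute().
        0x80 => '&euro;',
        0x82 => '&sbquo;',
        0x84 => '&bdquo;',
        0x85 => '&hellip;',
        0x86 => '&dagger;',
        0x87 => '&Dagger;',
        0x89 => '&permil;',
        0x8B => '&lsaquo;',
        0x91 => '&lsquo;',
        0x92 => '&rsquo;',
        0x93 => '&ldquo;',
        0x94 => '&rdquo;',
        0x95 => '&bull;',
        0x96 => '&ndash;',
        0x97 => '&mdash;',
        0x99 => '&trade;',
        0x9B => '&rsaquo;',
        0xA6 => '&brvbar;',
        0xA9 => '&copy;',
        0xAB => '&laquo;',
        0xAE => '&reg;',
        0xB1 => '&plusmn;',
        0xB5 => '&micro;',
        0xB6 => '&para;',
        0xB7 => '&middot;',
        0xBB => '&raquo;'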
    ];

    /**
     * @param $text
     * @return string
     */
    public function execute($text): string
    {
        $map = $this->prepareMap();

        return html_entity_decode(
            mb_convert_encoding(strtr($text, $map), self::ENCODING_TO, self::ENCODING_FROM),
            ENT_QUOTES,
            self::ENCODING_TO
        );
    }

    /**
     * @return array
     */
    private function prepareMap(): array
    {
        $maps[] = $this->arrayMapAssoc(function ($k, $v) {
            return [chr($k), chr($v)];
        }, $this->mapChrChr);

        $maps[] = $this->arrayMapAssoc(function ($k, $v) {
            return [chr($k), $v];
        }, $this->mapChrString);

        return array_merge([], ...$maps);
    }

    /**
     * @param callable $function
     * @param array $array
     * @return array
     */
    private function arrayMapAssoc(callable $function, array $array): array
    {
        return array_column(
            array_map(
                $function,
                array_keys($array),
                $array
            ),
            1,
            0
        );
    }
}

urko at wegetit dot eu

10 years ago

If you are trying to generate a CSV (with extended chars) to be opened in Excel for Mac, the only thing that worked for me was:


I also tried this:



But the first one didn't show extended chars correctly, and the second one didn't separate fields correctly.

me at gsnedders dot com

13 years ago

It appears that when dealing with an unknown "from encoding" the function will both throw an E_WARNING and proceed to convert the string from ISO-8859-1 to the "to encoding".

mac.com@nemo

16 years ago

For those wanting to convert from $set to MacRoman, use iconv():



('macintosh' is the IANA name for the MacRoman character set.)

Tom Class

16 years ago

Why did you use the PHP HTML encode functions? mbstring has its own encoding which is (as far as I tested it) much more useful:

HTML-ENTITIES

Example:

$text = mb_convert_encoding($text, 'HTML-ENTITIES', "UTF-8");

nicole

6 years ago

// convert UTF8 to DOS = CP850
//
// $utf8_text=UTF8-Formatted text;
// $dos=CP850-Formatted text;

// have fun

$dos = mb_convert_encoding($utf8_text, "CP850", mb_detect_encoding($utf8_text, "UTF-8, CP850, ISO-8859-15", true));

katzlbtjunk at hotmail dot com

14 years ago

Clean a string for use as a filename by simply replacing all unwanted characters with an underscore (ASCII converts to 7bit). It removes slightly more chars than necessary. Hope it's useful.

$fileName = 'Test:!"$%&/[]=ÖÄÜöäüwin, if input in win-encoding already, function recode[] returns unchanged string]

nospam at nihonbunka dot com

14 years ago

rodrigo at bb2 dot co dot jp wrote that iconv works better than mb_convert_encoding. I find that when converting from UTF-8 to Shift_JIS
$conv_str = mb_convert_encoding($str,$toCS,$fromCS);
works, while
$conv_str = iconv($fromCS,$toCS.'//IGNORE',$str);
removes tildes from $str.

aofg

15 years ago

When converting Japanese strings to ISO-2022-JP or JIS on PHP >= 5.2.1, you can use "ISO-2022-JP-MS" instead.
Kishu-Izon (platform-dependent) characters are converted correctly with this encoding, just as with eucJP-win or SJIS-win.

David Hull

15 years ago

As an alternative to Johannes's suggestion for converting strings from other character sets to a 7bit representation while not just deleting latin diacritics, you might try this:



The only disadvantage is that it does not convert "ä" to "ae", but it handles punctuation and other special characters better.
--
David

jamespilcher1 - hotmail

18 years ago

be careful when converting from iso-8859-1 to utf-8.

even if you explicitly specify the character encoding of a page as iso-8859-1 (via headers and strict xml defs), windows 2000 will ignore that and interpret it as whatever character set it has natively installed.

for example, i wrote char #128 into a page, with char encoding iso-8859-1, and it displayed in internet explorer (& mozilla) as a euro symbol.

it should have displayed a box, denoting that char #128 is undefined in iso-8859-1. The problem was it was displaying in "Windows: western europe" (my native character set).

this led to confusion when i tried to convert this euro to UTF-8 via mb_convert_encoding()

IE displays UTF-8 correctly, and because PHP correctly converted #128 into a box in UTF-8, IE would show a box.

so all i saw was mb_convert_encoding() converting a euro symbol into a box. It took me a long time to figure out what was going on.

gullevek at gullevek dot org

12 years ago

If you want to convert japanese to ISO-2022-JP it is highly recommended to use ISO-2022-JP-MS as the target encoding instead. This includes the extended character set and avoids ? in the text. For example the often used "1 in a circle" ① will be correctly converted then.
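The recommendation above can be tried with a short round trip. This is only a sketch, assuming the mbstring extension with Japanese encodings available; the variable names are illustrative:

```php
<?php
// U+2460 CIRCLED DIGIT ONE, written as an escape so the file encoding does not matter.
$text = "\u{2460}";

// Encode with the extended variant recommended above.
$jis = mb_convert_encoding($text, 'ISO-2022-JP-MS', 'UTF-8');

// Decode again to check that the character survived the conversion.
$back = mb_convert_encoding($jis, 'UTF-8', 'ISO-2022-JP-MS');
```

If $back equals $text, the character was representable in the target encoding and nothing was replaced by "?".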

StigC

14 years ago

For the php-noobs (like me) - working with flash and php.

Here's a simple snippet of code that worked great for me, getting php to show special Danish characters, from a Flash email form:

rodrigo at bb2 dot co dot jp

14 years ago

For those who can't use mb_convert_encoding() to convert from one charset to another because of an older version of PHP, try iconv().

I had this problem converting to a Japanese charset:

$txt=mb_convert_encoding($txt,'SJIS',$this->encode);

And I could fix it by using this:

$txt = iconv('UTF-8', 'SJIS', $txt);

Maybe it's helpful for someone else! ;)

phpdoc at jeudi dot de

16 years ago

I'd like to share some code to convert latin diacritics to their
traditional 7bit representation, like, for example,

- à,ç,é,î,... to a,c,e,i,...
- ß to ss
- ä,Ä,... to ae,Ae,...
- ë,... to e,...

(mb_convert "7bit" would simply delete any offending characters).

I might have missed on your country's typographic
conventions--correct me then.
<?php
/**
* @args string $text line of encoded text
*       string $from_enc (encoding type of $text, e.g. UTF-8, ISO-8859-1)
*
* @returns 7bit representation
*/
function to7bit($text,$from_enc) {
    $text = mb_convert_encoding($text,'HTML-ENTITIES',$from_enc);
    $text = preg_replace(
        array('/&szlig;/','/&(..)lig;/',
             '/&([aouAOU])uml;/','/&(.)[^;]*;/'),
        array('ss',"$1","$1".'e',"$1"),
        $text);
    return $text;
}
?>

Enjoy :-)
Johannes

==
[EDIT BY danbrown AT php DOT net: Author provided the following update on 27-FEB-2012.]
==

An addendum to my "to7bit" function referenced below in the notes.
The function is supposed to solve the problem that some languages require a different 7bit rendering of special (umlauted) characters for sorting or other applications. For example, the German ß ligature is usually written "ss" in 7bit context. Dutch ÿ is typically rendered "ij" (not "y").

The original function works well with word (alphabet) character entities and I've seen it used in many places. But non-word entities cause funny results:
E.g., "&copy;" is rendered as "c", "&shy;" as "s" and "&rsquo;" as "r".
The following version fixes this by converting non-alphanumeric characters (also chains thereof) to '_'.

<?php
/**
* @args string $text line of encoded text
*       string $from_enc (encoding type of $text, e.g. UTF-8, ISO-8859-1)
*
* @returns 7bit representation
*/
function to7bit($text,$from_enc) {
    // Note: without the /u modifier, \W works on single bytes, so bytes of
    // non-ASCII characters may be replaced as well.
    $text = preg_replace('/\W+/','_',$text);
    $text = mb_convert_encoding($text,'HTML-ENTITIES',$from_enc);
    $text = preg_replace(
        array('/&szlig;/','/&(..)lig;/',
             '/&([aouAOU])uml;/','/&yuml;/','/&(.)[^;]*;/'),
        array('ss',"$1","$1".'e','ij',"$1"),
        $text);
    return $text;
}
?>

Enjoy again,
Johannes

qdb at kukmara dot ru

10 years ago

mb_substr and probably several other functions work faster in UCS-2 than in UTF-8, while UTF-16 is slower than UTF-8. Here is a test: UCS-2 is roughly 50 times faster than UTF-8, and UTF-16 is roughly 6 times slower than UTF-8:

output:
өх. 12416. 1.71738100052
өх. 12416. 0.0211279392242
өх. 12416. 11.2330229282

DanielAbbey at Hotmail dot co dot uk

8 years ago

When using the Windows Notepad text editor, it is important to note that when you select 'Save As' there is an Encoding selection dropdown. The default encoding is set to ANSI, with the other two options being Unicode and UTF-8. Since most text on the web is in UTF-8 format it could prove vital to save the .txt file with this encoding, since this function does not work on ANSI-encoded text.

Stephan van der Feest

17 years ago

Here's a tip for anyone using Flash and PHP for storing HTML output submitted from a Flash text field in a database or whatever.

Flash submits its HTML special characters in UTF-8, so you can use the following function to convert those into HTML entity characters:

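function utf8html($utf8str)
{
  // Tell htmlentities() that its input is ISO-8859-1; newer PHP versions assume UTF-8 by default.
  return htmlentities(mb_convert_encoding($utf8str,"ISO-8859-1","UTF-8"), ENT_COMPAT, "ISO-8859-1");
}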

Edward

14 years ago

If mb_convert_encoding doesn't work for you, and iconv gives you a headache, you might be interested in this free class I found. It can convert almost any charset to almost any other charset. I think it's wonderful and I wish I had found it earlier. It would have saved me tons of headache.

I use it as a fail-safe, in case mb_convert_encoding is not installed. Download it from //mikolajj.republika.pl/

This is not my own library, so technically it's not spamming, right? ;)

Hope this helps.

mightye at gmail dot com

14 years ago

To petruzanauticoyahoo?com!ar

If you don't specify a source encoding, then it assumes the internal (default) encoding. ñ is a multi-byte character whose bytes in your configuration default (often iso-8859-1) would actually mean Ã±. mb_convert_encoding() is upgrading those characters to their multi-byte equivalents within UTF-8.

Try this instead:

Of course this call does no work for the most part (it can, however, be used to strip characters which are not valid for UTF-8).
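The effect described above can be reproduced with a short sketch. It assumes the mbstring extension; the strings are written as byte escapes so the encoding of the script file does not matter, and the variable names are illustrative:

```php
<?php
// "ñ" as UTF-8 bytes.
$utf8 = "\xC3\xB1";

// Wrong source encoding: each byte is treated as one ISO-8859-1 character
// and re-encoded, giving the four bytes that display as "Ã±".
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Correct source encoding: the string passes through unchanged.
$same = mb_convert_encoding($utf8, 'UTF-8', 'UTF-8');

// Invalid bytes are substituted or dropped, depending on
// mb_substitute_character(), so the result is valid UTF-8.
$cleaned = mb_convert_encoding("a\xFFb", 'UTF-8', 'UTF-8');
```

Always passing the source encoding explicitly avoids depending on the configured internal encoding.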

jackycms at outlook dot com

8 years ago

// mb_convert_encoding($input,'UTF-8','windows-874');  error : Illegal character encoding specified
// so to convert Thai to UTF-8 it is better to use iconv instead

How to set UTF

PHP UTF-8 encoding: modifications to your php.ini. The first thing you need to do is modify your php.ini file to use UTF-8 as the default character set: default_charset = "utf-8" (Note: you can subsequently use phpinfo() to verify that this has been set properly.)

What is UTF

The utf8_encode() function is a built-in PHP function that encodes an ISO-8859-1 string to UTF-8; it is deprecated as of PHP 8.2. Unicode has been developed to describe all possible characters of all languages and includes a lot of symbols, with one unique number for each symbol/character.

What does mb_convert_encoding do?

Converts string from from_encoding, or the current internal encoding, to to_encoding. If string is an array, all its string values will be converted recursively.
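A minimal usage sketch of that description, assuming PHP 7.2 or later (for array input) with the mbstring extension. The sample data is made up for illustration:

```php
<?php
// "é" encoded as a single ISO-8859-1 byte.
$latin1 = "\xE9";

// Argument order: string, target encoding, source encoding.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// With an array, every string value is converted.
$row = mb_convert_encoding(
    ['name' => "Andr\xE9", 'city' => "Orl\xE9ans"],
    'UTF-8',
    'ISO-8859-1'
);
```

For ISO-8859-1 input this produces the same bytes as utf8_encode() would.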

How do I change utf8mb4 to UTF

To solve the problem, open the exported SQL file, search and replace utf8mb4 with utf8, then search and replace utf8mb4_unicode_520_ci with utf8_general_ci. Save the file and import it into your database. After that, change the wp-config.php charset option to utf8, and the magic starts.
