How to handle special characters in Python

Python will check the first or second line for an emacs/vim-like encoding specification.

More precisely, the first or second line must match the regular expression "coding[:=]\s*([-\w.]+)". The first group of this expression is then interpreted as encoding name. If the encoding is unknown to Python, an error is raised during compilation.

Source: PEP 263

(A BOM would also make Python interpret the source as UTF-8.)
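As a rough illustration of the check described above, the regular expression quoted from PEP 263 can be applied to a candidate line. This is a minimal Python 3 sketch; the helper name `declared_encoding` is made up for the example, and the real tokenizer applies further rules (for instance, the declaration must sit in a comment on line one or two).

```python
import re

# Pattern quoted in the text above (from PEP 263).
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def declared_encoding(line):
    """Return the encoding name declared on a source line, or None."""
    match = CODING_RE.search(line)
    return match.group(1) if match else None

print(declared_encoding("# -*- encoding: utf-8 -*-"))          # utf-8
print(declared_encoding("# vim: set fileencoding=latin-1 :"))  # latin-1
print(declared_encoding("import os"))                          # None
```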

I would recommend you use this over .decode('utf8'):

# -*- encoding: utf-8 -*-
special_char_string = u"äöüáèô"

In any case, special_char_string will then contain a unicode object, no longer a str (this applies to Python 2, where the two types are distinct). As you can see, both spellings are semantically equivalent:

>>> u"äöüáèô" == "äöüáèô".decode('utf8')
True

And the reverse:

>>> u"äöüáèô".encode('utf8')
'\xc3\xa4\xc3\xb6\xc3\xbc\xc3\xa1\xc3\xa8\xc3\xb4'
>>> "äöüáèô"
'\xc3\xa4\xc3\xb6\xc3\xbc\xc3\xa1\xc3\xa8\xc3\xb4'

There is a technical difference, however: if you use u"something", you tell the parser directly that this is a unicode literal, so it should be a bit faster than decoding a str at runtime.
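The snippets above are Python 2. For comparison, here is a hedged sketch of the same round trip in Python 3, where every string literal is already unicode text and `.encode()` produces a separate `bytes` object. The variable names are only for illustration.

```python
text = "äöüáèô"                     # str (unicode text) in Python 3
encoded = text.encode("utf-8")      # bytes: two bytes per character here
decoded = encoded.decode("utf-8")   # back to str

print(encoded)
print(decoded == text)              # True
```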


Escape Characters

To insert characters that are illegal in a string, use an escape character.

An escape character is a backslash \ followed by the character you want to insert.

An example of an illegal character is a double quote inside a string that is surrounded by double quotes:

Example

You will get an error if you use double quotes inside a string that is surrounded by double quotes:

txt = "We are the so-called "Vikings" from the north."


To fix this problem, use the escape character \":

Example

The escape character allows you to use double quotes when you normally would not be allowed:

txt = "We are the so-called \"Vikings\" from the north."
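A short sketch of the fix, plus an alternative that the text above does not mention but that is standard Python: delimiting the string with single quotes, so the inner double quotes need no escaping. The variable names are only for illustration.

```python
escaped = "We are the so-called \"Vikings\" from the north."
single_quoted = 'We are the so-called "Vikings" from the north.'

print(escaped)
print(escaped == single_quoted)  # True: both spell the same string
```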


Other escape characters used in Python:

Code    Result
\'      Single Quote
\\      Backslash
\n      New Line
\r      Carriage Return
\t      Tab
\b      Backspace
\f      Form Feed
\ooo    Octal value
\xhh    Hex value
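To make the table concrete, the following small sketch shows that each escape sequence stands for exactly one character, and that octal and hex escapes name a character by its code. The variable names are hypothetical.

```python
# Each escape sequence below is a single character.
lengths = (len("\n"), len("\t"), len("\\"), len("\'"))
print(lengths)        # (1, 1, 1, 1)

# Octal (\ooo) and hex (\xhh) escapes spell the same text.
octal_hello = "\110\145\154\154\157"
hex_hello = "\x48\x65\x6c\x6c\x6f"
print(octal_hello)    # Hello
print(hex_hello)      # Hello
```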



String operations on a string array containing strings with accented and/or special characters alongside regular ASCII strings can be quite an annoyance

These days I am involved with web/mobile automation. The other day I had the challenge of parsing all the strings on a page for a generic automation library I am writing.

Since I was supposed to write a generic library to parse all the strings on a page, I did not have the luxury of using ids for a specific control/component on the page. So I used the reliable XPath //*[@name] to parse the strings in an Android application page. This extracts all the text attributes on the page, which was a good enough solution for me. (I would like to know of a better solution, especially one using CSS selectors, if you have one!)

As the solution was so easy, I found it difficult to believe that the code handled all the edge cases. To clear my doubts, I went about testing it on different applications with different inputs, until I hit a roadblock where the page was returning a mixture of accented strings, strings containing special characters, and regular ASCII strings. Here is what the array looked like:

strs = ["hell°", "hello", "tromsø", "boy", "stävänger", "ölut", "world"]

If you have hit a similar challenge, read on for the solution.

Strings with accented or special characters are unicode strings, while the regular ones are ASCII. So to handle unicode strings as regular ASCII strings, one has to convert the unicode strings to ASCII. (For a history on unicode read a detailed article)

To convert unicode to ASCII, one first has to encode the unicode strings to UTF-8. Note that this step alone does not produce pure ASCII: each non-ASCII character becomes a multi-byte sequence.

Here is how you do it in Python:

text = text.encode('utf-8')

Simple, isn't it!
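To see what that call produces, here is a small Python 3 sketch (the post itself appears to use Python 2, so treat this as an illustration rather than the author's exact session). Encoding returns a `bytes` object in which each non-ASCII character becomes a multi-byte sequence, displayed as `\x..` escapes when printed.

```python
strs = ["hell°", "hello", "tromsø", "boy", "stävänger", "ölut", "world"]

# Encode every string; pure-ASCII strings come out byte-for-byte unchanged.
encoded = [s.encode("utf-8") for s in strs]

for raw in encoded:
    print(raw)
```

For example, "tromsø" prints as b'troms\xc3\xb8', while "hello" prints as b'hello'.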

But wait, you need to strip out the extra escape characters (the encoded non-ASCII bytes) before doing string operations. Here is how you can strip those out:

import re

def extract_word(text):
    # Keep only word characters and whitespace; everything else is dropped.
    # On a Python 2 byte string, \w matches only ASCII letters, digits and "_".
    print("Input Text::{}".format(text))
    regex = r"(\w|\s)*"
    matches = re.finditer(regex, text, re.DOTALL)
    newstr = ''
    for match in matches:
        newstr = newstr + match.group()
    print("Output Text::{}".format(newstr))
    return newstr
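On Python 3 the same idea needs one adjustment, because `\w` matches accented letters by default there. A minimal sketch, assuming the goal is to keep only ASCII word characters and whitespace; `keep_ascii_words` is a hypothetical name, and `re.ASCII` is the standard flag that restricts `\w` and `\s` to ASCII.

```python
import re

def keep_ascii_words(text):
    """Drop everything except ASCII word characters and whitespace."""
    return "".join(re.findall(r"[\w\s]+", text, flags=re.ASCII))

strs = ["hell°", "hello", "tromsø", "boy", "stävänger", "ölut", "world"]
cleaned = [keep_ascii_words(s) for s in strs]
print(cleaned)
# ['hell', 'hello', 'troms', 'boy', 'stvnger', 'lut', 'world']
```

Note that the accented letters are removed outright rather than replaced by their base letters.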

With the returned string, you are now good to go and can do other string operations on the array.
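If losing the accented letters entirely is not acceptable, a different technique from the one in this post may help: decompose each character with the standard `unicodedata` module, then discard whatever is still non-ASCII. This is only a sketch in Python 3, and `to_ascii` is a hypothetical helper name.

```python
import unicodedata

def to_ascii(text):
    """Strip accents where possible, then drop anything still non-ASCII."""
    # NFKD splits "ä" into "a" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining marks are not ASCII, so "ignore" discards them.
    return decomposed.encode("ascii", "ignore").decode("ascii")

strs = ["hell°", "hello", "tromsø", "boy", "stävänger", "ölut", "world"]
print([to_ascii(s) for s in strs])
```

Characters that have no decomposition, such as "ø", are still dropped, so "tromsø" becomes "troms" while "stävänger" becomes "stavanger".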


How do you use special characters in Python?

In Python strings, the backslash "\" is a special character, also called the "escape" character. It is used in representing certain whitespace characters: "\t" is a tab, "\n" is a newline, and "\r" is a carriage return. Conversely, prefixing a special character with "\" turns it into an ordinary character.
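A brief sketch of both directions described above: the backslash producing whitespace characters, and a doubled backslash producing an ordinary backslash. The variable names are only for illustration.

```python
# "\t" and "\n" are one character each: a tab and a newline.
line = "name\tvalue\nnext"
print(line)

# Escaping the backslash itself yields a literal backslash.
literal_backslash = "C:\\temp"
print(literal_backslash)        # C:\temp
print(len(literal_backslash))   # 7
```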

How do you avoid special characters in Python?

Using 're..
"[^A-Za-z0-9]" → matches every character except letters and digits.
All of the matched characters are replaced with an empty string.
As a result, all characters except letters and digits are removed.
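The steps above can be sketched as follows. The function name in the truncated first step is not recoverable from the text, so this example assumes it refers to the standard `re.sub`; `strip_special` is a hypothetical helper name.

```python
import re

def strip_special(text):
    """Remove every character that is not an ASCII letter or digit."""
    # Assumption: re.sub is the call the truncated step refers to.
    return re.sub(r"[^A-Za-z0-9]", "", text)

print(strip_special("Hello, World! #2024"))  # HelloWorld2024
```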

Can we use special characters in Python?

In Python 3, an identifier cannot contain special symbols such as !, @, #, $, or %.
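One way to check this without triggering a syntax error is the built-in `str.isidentifier()` method. A small sketch; the candidate names are invented for the example.

```python
candidates = ["total_count", "price$", "user@name", "_private", "2fast"]

# Map each candidate to whether Python accepts it as an identifier.
results = {name: name.isidentifier() for name in candidates}

for name, ok in results.items():
    print(name, ok)
```

Here "total_count" and "_private" are accepted, while the names containing $ or @ and the one starting with a digit are rejected.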

How do I stop characters from escaping Python?

To ignore escape sequences in a string, make it a "raw string" by placing "r" before the opening quote. A raw string prints exactly as it was written.
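A minimal sketch of the difference, using a made-up Windows-style path in which `\n` and `\t` would otherwise be read as a newline and a tab:

```python
normal = "C:\new\table"   # \n and \t are interpreted as escapes
raw = r"C:\new\table"     # backslashes are kept literally

print(normal)
print(raw)                       # C:\new\table
print(len(normal), len(raw))     # 10 12
```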