Python convert ansi to utf-8


I have around 600,000 files encoded in ANSI and I want to convert them to UTF-8. I can do that individually in Notepad++, but I can't do that for 600,000 files. Can I do this in R or Python?

I have found this link but the Python script is not running: notepad++ converting ansi encoded file to utf-8

asked Jul 17, 2015 at 8:05


Why don't you read the file and write it as UTF-8? You can do that in Python.

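# codecs.open decodes on read and encodes on write
import codecs

# read the input file, decoding it from its original ANSI code page
# (cp1252 is assumed here; use the code page your files were actually saved in)
with codecs.open(path, 'r', encoding='cp1252') as file:
    lines = file.read()

# write the output file as UTF-8 (this overwrites the file at the same path)
with codecs.open(path, 'w', encoding='utf8') as file:
    file.write(lines)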
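
For the 600,000 files in the question, the same read-then-write idea can be applied to a whole directory tree. The following is only a sketch under stated assumptions: the function name `convert_tree`, the `*.txt` pattern, and the `cp1252` source encoding are illustrative choices, not something given in the question. It works on raw bytes so that line endings are left unchanged, and it skips files that cannot be decoded.

```python
from pathlib import Path


def convert_tree(root, source_encoding="cp1252", pattern="*.txt"):
    """Re-encode every file under root matching pattern to UTF-8, in place.

    Hypothetical helper: the source encoding and pattern are assumptions.
    Returns (converted_count, failed_paths).
    """
    converted = 0
    failed = []
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        raw = path.read_bytes()  # bytes in, bytes out: line endings are untouched
        try:
            text = raw.decode(source_encoding)
        except UnicodeDecodeError:
            failed.append(path)  # leave the file as it is and report it
            continue
        path.write_bytes(text.encode("utf-8"))
        converted += 1
    return converted, failed
```

Because the conversion is done in place, it is worth trying it on a copy of the data first. Running it a second time over files that are already UTF-8 would decode them with the wrong code page and garble any non-ASCII characters.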

answered Jul 17, 2015 at 8:13

3Ducker

I appreciate that this is an old question but having just resolved a similar problem recently I thought I would share my solution.

I had a file being prepared by one program that I needed to import into an sqlite3 database, but the text file was always 'ANSI' and sqlite3 requires UTF-8.

On Windows, Python exposes the system ANSI code page as the 'mbcs' codec, so the code I used, adapted from something else I found, is:

import codecs

blockSize = 1048576  # size passed to each read call (about 1 MiB)
with codecs.open("your ANSI source file.txt", "r", encoding="mbcs") as sourceFile:
    with codecs.open("Your UTF-8 output file.txt", "w", encoding="UTF-8") as targetFile:
        while True:
            # copy the file chunk by chunk so large files do not fill memory
            contents = sourceFile.read(blockSize)
            if not contents:
                break
            targetFile.write(contents)
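
Note that the 'mbcs' codec only exists on Windows. As a hedged sketch, a script that may also run elsewhere could check for the codec and fall back to a named code page; the helper name `ansi_codec_name` and the `cp1252` fallback are assumptions for illustration.

```python
import codecs


def ansi_codec_name(fallback="cp1252"):
    """Return 'mbcs' where Python provides it (Windows), else a fallback.

    Hypothetical helper: cp1252 is only a guess at the files' real code page.
    """
    try:
        codecs.lookup("mbcs")  # raises LookupError when the codec is unavailable
        return "mbcs"
    except LookupError:
        return fallback


print(ansi_codec_name())
```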

The link below contains some information on the encoding types that I found during my research:

//docs.python.org/2.4/lib/standard-encodings.html

answered Dec 19, 2018 at 17:27

How to Convert a String to UTF-8 in Python?

In this article, we will learn how to convert a string to UTF-8 in Python, using built-in functions and some custom code. Let's first take a quick look at what a string is in Python.

Python String

A string is a built-in type in Python, just like integer, float, boolean, etc. Data surrounded by single or double quotes is a string. A string is also known as a sequence of characters.

string1 = "apple"
string2 = "Preeti125"
string3 = "12345"
string4 = "pre@12"

What is UTF-8 in Python?

UTF stands for “Unicode Transformation Format”, and the ‘8’ means the encoding works in 8-bit units. It is one of the most widely used encodings and is compact for ASCII-heavy text. In Python 3, a string (str) is a sequence of Unicode code points rather than bytes in any particular encoding; UTF-8 is the default encoding for source files and for str.encode() and bytes.decode(). Encoding with UTF-8 turns a Unicode string into bytes. An application usually sees strings rather than bytes because a framework or library has already decoded the incoming bytes, and if it assumed the wrong encoding, the resulting text is garbled.

A user might encounter a situation where the server receives UTF-8 data, but the value retrieved from the query string comes back as a plain ASCII string. To convert a plain string to UTF-8 bytes in Python 3, use the encode() method.
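
As a minimal illustration of why the encoding used for decoding matters, the sketch below encodes a string to UTF-8 bytes and then decodes those bytes twice: once correctly, and once with cp1252, which is chosen here only as an example of a mismatched encoding.

```python
text = "pythön!"

# str -> bytes using UTF-8
utf8_bytes = text.encode("utf-8")

# bytes -> str with the matching encoding restores the original text
round_trip = utf8_bytes.decode("utf-8")

# decoding the same bytes with a different encoding yields garbled text
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes)   # b'pyth\xc3\xb6n!'
print(round_trip)   # pythön!
print(misread)      # pythÃ¶n!
```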

Use encode() to convert a String to UTF-8

The encode() method returns the encoded version of the string as a bytes object. If a character cannot be represented in the target encoding, a UnicodeEncodeError is raised under the default 'strict' error handling.

Syntax

string.encode(encoding = 'UTF-8', errors = 'strict')

Parameters

encoding - the name of the target encoding, such as 'UTF-8' or 'ascii'.

errors - response when encoding fails.

There are six standard error responses:

  • strict - default response, which raises a UnicodeEncodeError exception on failure
  • ignore - drops the unencodable characters from the result
  • replace - replaces each unencodable character with a question mark (?)
  • xmlcharrefreplace - inserts an XML character reference in place of each unencodable character
  • backslashreplace - inserts a backslash escape sequence (\xNN, \uNNNN or \UNNNNNNNN) in place of each unencodable character
  • namereplace - inserts a \N{...} escape sequence in place of each unencodable character

Both parameters are optional: called with no arguments, encode() uses UTF-8 with strict error handling.

Example

# unicode string
string = 'pythön!'
# default encoding to utf-8
string_utf = string.encode()
print('The encoded version is:', string_utf)


The encoded version is: b'pyth\xc3\xb6n!'
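
The error responses listed under Parameters only come into play when the target encoding cannot represent a character. UTF-8 can represent every character, so the sketch below encodes to ASCII instead, purely to trigger each handler; the variable names are illustrative.

```python
text = "pythön!"
handlers = ["ignore", "replace", "xmlcharrefreplace",
            "backslashreplace", "namereplace"]

# encode to ASCII so that 'ö' is unencodable and each handler is exercised
results = {name: text.encode("ascii", errors=name) for name in handlers}
for name, encoded in results.items():
    print(name, encoded)

# the default 'strict' handler raises instead of substituting
try:
    text.encode("ascii")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = type(exc).__name__
print("strict raised:", strict_error)
```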

Conclusion

In this article, we learned to convert a plain string to UTF-8 using the encode() method. You can also try different encodings and error parameters.


Is UTF

ANSI and UTF-8 are both character encodings. “ANSI” is the common name for the Windows code pages (such as the single-byte Windows-1252) used to encode the Latin alphabet, whereas UTF-8 is a variable-length Unicode encoding (from 1 to 4 bytes per character) that can encode every Unicode character.
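
The difference in byte length can be checked directly. This is a small sketch that assumes cp1252 as the ANSI code page; the helper name `encoded_sizes` is made up for the example.

```python
def encoded_sizes(chars, ansi_codec="cp1252"):
    """Map each character to (ANSI byte count or None, UTF-8 byte count)."""
    sizes = {}
    for ch in chars:
        try:
            ansi_len = len(ch.encode(ansi_codec))
        except UnicodeEncodeError:
            ansi_len = None  # the code page has no slot for this character
        sizes[ch] = (ansi_len, len(ch.encode("utf-8")))
    return sizes


print(encoded_sizes(["A", "é", "€", "😀"]))
```

Characters the code page supports take one byte each, while UTF-8 uses between one and four bytes and can also represent characters, such as the emoji, that cp1252 cannot.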

Is UTF

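Neither ANSI nor UTF-8 is a superset of the other. Both extend ASCII, so the 128 ASCII characters are encoded identically in each, while every other character is encoded differently.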

Which is better ANSI or UTF

UTF-8 is the better choice in almost every case. There is no reason to choose ANSI over UTF-8 when creating new applications, since UTF-8 can represent every Unicode character and is supported on all modern systems. The only reason to use ANSI is when you are forced to run an old application that has no replacement.
