I'm trying to read a text file into Python, but it seems to use some very strange encoding. I try the usual:
file = open('data.txt', 'r')
lines = file.readlines()
for line in lines[0:1]:
    print line,
    print line.split()
Output:
0.0200197 1.97691e-005
['0\x00.\x000\x002\x000\x000\x001\x009\x007\x00', '\x001\x00.\x009\x007\x006\x009\x001\x00e\x00-\x000\x000\x005\x00']
Printing the line works fine, but after I try to split the line so that I can convert it into a float, it looks crazy. Of course, when I try to convert those strings to floats, this produces an error. Any idea about how I can convert these back into numbers?
I put the sample datafile here if you would like to try to load it: //dl.dropboxusercontent.com/u/3816350/Posts/data.txt
I would like to simply use numpy.loadtxt or numpy.genfromtxt, but they also do not want to deal with this crazy file.
asked Oct 11, 2013 at 23:45
DanHickstein
I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is.
In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16 encoding looks like the ASCII encoding with an extra '\x00' after each character.
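You can see this concretely with a quick sketch (using byte strings, so it behaves the same on Python 2 and 3):

```python
# Encode a plain ASCII string as UTF-16-LE and inspect the raw bytes.
text = u'0.02'
raw = text.encode('utf-16-le')
print(repr(raw))    # a null byte after each ASCII character: b'0\x00.\x000\x002\x00'

# Decoding reverses it.
print(raw.decode('utf-16-le') == text)
```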
To fix this, just decode the data:
print line.decode('utf-16-le').split()
Or do the same thing at the file level with the io or codecs module:
import io

file = io.open('data.txt', 'r', encoding='utf-16-le')
* This is a bit of an oversimplification: Each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably didn't care about these details.
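If you do care, the two cases are easy to check with the standard codecs (the emoji below is just an arbitrary non-BMP example):

```python
# A BMP character takes two bytes in UTF-16-LE...
print(len(u'A'.encode('utf-16-le')))            # 2
# ...while a non-BMP character (U+1F600) becomes a surrogate pair: four bytes.
print(len(u'\U0001F600'.encode('utf-16-le')))   # 4
```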
answered Oct 11, 2013 at 23:50
abarnert
Looks like UTF-16 to me.
>>> test_utf16 = '0\x00.\x000\x002\x000\x000\x001\x009\x007\x00'
>>> test_utf16.decode('utf-16')
u'0.0200197'
You can work directly off the Unicode strings:
>>> float(test_utf16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: null byte in argument for float()
>>> float(test_utf16.decode('utf-16'))
0.020019700000000001
Or encode them to something different, if you prefer:
>>> float(test_utf16.decode('utf-16').encode('ascii'))
0.020019700000000001
Note that you need to do this as early as possible in your processing. As your comment noted, split will behave incorrectly on the UTF-16 encoded form. The UTF-16 representation of the space character ' ' is ' \x00', so split removes the whitespace but leaves the null byte.
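Here is a minimal sketch of that failure mode, operating on raw bytes the way an un-decoded read does:

```python
# The UTF-16-LE encoding of '1 2': every character, including the space,
# is followed by a null byte.
raw = u'1 2'.encode('utf-16-le')       # b'1\x00 \x002\x00'

# split() removes only the space byte; the nulls around it remain,
# and the second field even starts with a leftover null.
print(raw.split())                     # [b'1\x00', b'\x002\x00']

# Decoding first gives clean fields.
print(raw.decode('utf-16-le').split())
```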
The 2.6 and later io library can handle this for you, as can the older codecs library. io handles linefeeds better, so it's preferable if available.
answered Oct 11, 2013 at 23:48
Peter DeGlopper
This is really just @abarnert's suggestion, but I wanted to post it as an answer since this is the simplest solution and the one that I ended up using:
import io
import numpy as np

file = io.open(filename, 'r', encoding='utf-16-le')
data = np.loadtxt(file, skiprows=8)
This demonstrates how you can create a file object using io.open with whatever crazy encoding your file happens to have, and then pass that file object to np.loadtxt (or np.genfromtxt) for quick-and-easy loading.
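For anyone who wants to try the round trip without the Dropbox file, here's a self-contained sketch: it writes a small UTF-16-LE file first, then reads it back through io.open. The float conversion is done by hand so the sketch doesn't depend on numpy; with numpy installed you could hand the same decoded file object to np.loadtxt instead.

```python
import io
import os
import tempfile

# Write a tiny UTF-16-LE data file, standing in for data.txt.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with io.open(path, 'w', encoding='utf-16-le') as f:
    f.write(u'0.0200197 1.97691e-005\n')

# io.open decodes each line for us, so split() and float() behave normally.
with io.open(path, 'r', encoding='utf-16-le') as f:
    rows = [[float(x) for x in line.split()] for line in f]
print(rows)    # [[0.0200197, 1.97691e-05]]
```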
answered Sep 3, 2015 at 15:41
DanHickstein
This piece of code will do what's necessary:
file_handle = open(file_name, 'rb')
file_first_line = file_handle.readline()
file_handle.close()
print file_first_line
if '\x00' in file_first_line:
    file_first_line = file_first_line.replace('\x00', '')
print file_first_line
When you try to use file_first_line.split() before replacing, the output contains '\x00'. I just tried replacing '\x00' with an empty string, and it worked.
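A minimal sketch of the same trick on byte strings (which is what an 'rb' file handle gives you); note it only works because the data is pure ASCII under the UTF-16 encoding:

```python
# The raw UTF-16-LE bytes of an ASCII number, as read in binary mode.
raw = b'0\x00.\x000\x002\x000\x000\x001\x009\x007\x00'

# Stripping every null byte leaves plain ASCII.
cleaned = raw.replace(b'\x00', b'')
print(cleaned)    # b'0.0200197'

# Decode before calling float() so this also works on Python 3,
# where float() does not accept bytes.
print(float(cleaned.decode('ascii')))
```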
answered Jan 31, 2017 at 12:09