Guide: reading a UTF-16 file in Python
I'm trying to read a text file into Python, but it seems to use some very strange encoding. I tried the usual:
Output:
Printing the line works fine, but once I split the line so that I can convert the fields to floats, the result looks crazy. Of course, trying to convert those strings to floats raises an error. Any idea how I can convert these back into numbers?

I put the sample data file here if you would like to try loading it: https://dl.dropboxusercontent.com/u/3816350/Posts/data.txt

I would like to simply use numpy.loadtxt or numpy.genfromtxt, but they cannot handle this file either.

asked Oct 11, 2013 at 23:45
DanHickstein

I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is. In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16 encoding looks like the ASCII encoding with an extra '\x00' after each character. To fix this, just decode the data:
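The original snippet did not survive here, so the following is only a minimal sketch of the idea in Python 3. The sample bytes are made up for illustration and are not taken from the asker's data.txt.

```python
# Hypothetical sample: one line of two numbers, encoded as UTF-16-LE.
raw = "1.5\t2.25\r\n".encode("utf-16-le")

# Every ASCII character is followed by a NUL byte in this encoding.
print(raw)

# Decoding with the matching codec recovers ordinary text.
text = raw.decode("utf-16-le")

# After decoding, splitting and float conversion behave as expected.
values = [float(field) for field in text.split()]
print(values)
```

The key point is that the split happens after decoding, never on the raw bytes.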
Or do the same thing at the file level with the io or codecs module:
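A sketch of the file-level approach with io.open, assuming Python 3 and a file without a byte order mark. The helper name read_floats and the temporary sample file are hypothetical and only there to make the example runnable.

```python
import io
import os
import tempfile


def read_floats(path, encoding="utf-16-le"):
    """Return one list of floats per non-empty line of a text file."""
    rows = []
    # io.open decodes the bytes while reading, so each line is already text.
    with io.open(path, encoding=encoding) as handle:
        for line in handle:
            fields = line.split()
            if fields:
                rows.append([float(field) for field in fields])
    return rows


# Build a small stand-in file (made-up data, not the asker's file).
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write("1.0\t2.0\r\n3.5\t4.5\r\n".encode("utf-16-le"))
    sample_path = tmp.name

rows = read_floats(sample_path)
os.remove(sample_path)
print(rows)
```

If the file does start with a byte order mark, passing encoding="utf-16" instead lets the codec consume the mark and pick the byte order itself.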
* This is a bit of an oversimplification: Each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably don't care about these details.

answered Oct 11, 2013 at 23:50
abarnert

Looks like UTF-16 to me.
You can work directly off the Unicode strings:
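As a rough illustration in Python 3 (the answer's own snippet is missing), the sketch below assumes the data starts with a little-endian byte order mark, which the generic "utf-16" codec consumes. The sample content is invented.

```python
import codecs

# Hypothetical file content: BOM followed by UTF-16-LE encoded text.
raw = codecs.BOM_UTF16_LE + "1.5\t2.25\r\n3.0\t4.0\r\n".encode("utf-16-le")

# The "utf-16" codec reads the BOM to pick the byte order and then drops it.
text = raw.decode("utf-16")

# From here on, ordinary string methods work on the decoded text.
rows = [[float(field) for field in line.split()] for line in text.splitlines()]
print(rows)
```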
Or encode them to something different, if you prefer:
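For example, re-encoding to UTF-8 gives bytes without the interleaved NULs. This is a sketch with made-up sample data, assuming Python 3.

```python
# Hypothetical sample line, encoded as UTF-16-LE.
raw = "1.5\t2.25\r\n".encode("utf-16-le")

# Decode first, then encode to the target encoding.
utf8_bytes = raw.decode("utf-16-le").encode("utf-8")

# The UTF-8 form of ASCII text has no NUL bytes, so splitting is clean.
fields = utf8_bytes.split()
print(fields)
```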
Note that you need to do this as early as possible in your processing. As your comment noted, The 2.6 and later
answered Oct 11, 2013 at 23:48
Peter DeGlopper

This is really just @abarnert's suggestion, but I wanted to post it as an answer since this is the simplest solution and the one that I ended up using:
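The snippet itself was lost from this copy. A minimal sketch of the approach, assuming Python 3, NumPy installed, and a whitespace-delimited UTF-16-LE file without a byte order mark, might look like this (the temporary sample file is a stand-in for data.txt):

```python
import io
import os
import tempfile

import numpy as np

# Build a small stand-in file (made-up data, not the asker's file).
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write("1.0\t2.0\r\n3.5\t4.5\r\n".encode("utf-16-le"))
    sample_path = tmp.name

# Open the file with the right encoding and hand the file object to NumPy.
with io.open(sample_path, encoding="utf-16-le") as handle:
    data = np.loadtxt(handle)

os.remove(sample_path)
print(data)
```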
This demonstrates how you can create a file object with io.open, using whatever crazy encoding your file happens to have, and then pass that file object to np.loadtxt (or np.genfromtxt) for quick-and-easy loading.

answered Sep 3, 2015 at 15:41
DanHickstein

This piece of code will do what is needed:
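The answer's code is missing here, so this is only a guess at its shape in Python 3. The name file_first_line comes from the answer. The sample bytes and the latin-1 misread are assumptions made to reproduce the symptom.

```python
# Hypothetical sample: UTF-16-LE bytes misread as a single-byte encoding.
raw = "1.5\t2.25\r\n".encode("utf-16-le")
file_first_line = raw.decode("latin-1")

# Splitting now leaves NUL characters inside the fields.
before = file_first_line.split()

# Stripping the NULs restores plain ASCII text that float() accepts.
cleaned = file_first_line.replace("\x00", "")
values = [float(field) for field in cleaned.split()]
print(values)
```

This workaround only holds while every character in the file is ASCII. Decoding with the proper codec is the more general fix.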
If you call 'file_first_line.split()' before replacing, the output contains '\x00' characters. I just tried replacing '\x00' with an empty string and it worked.

answered Jan 31, 2017 at 12:09