I want to read some quite huge files (to be precise: the Google Ngram 1-word dataset) and count how many times each character occurs. I wrote this script:
    import fileinput
    files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0,9)]
    charcounts = {}
    lastfile = ''
    for line in fileinput.input(files):
        line = line.strip()
        data = line.split('\t')
        for character in list(data[0]):
            if (not character in charcounts):
                charcounts[character] = 0
            charcounts[character] += int(data[1])
        if (fileinput.filename() is not lastfile):
            print(fileinput.filename())
            lastfile = fileinput.filename()
        if(fileinput.filelineno() % 100000 == 0):
            print(fileinput.filelineno())
    print(charcounts)
This works fine until it reaches approximately line 700,000 of the first file, at which point I get this error:
    ../../datasets/googlebooks-eng-all-1gram-20090715-0.csv
    300000
    400000
    500000
    600000
    700000
    Traceback (most recent call last):
      File "charactercounter.py", line 5, in <module>
        for line in fileinput.input(files):
      File "C:\Python31\lib\fileinput.py", line 254, in __next__
        line = self.readline()
      File "C:\Python31\lib\fileinput.py", line 349, in readline
        self._buffer = self._file.readlines(self._bufsize)
      File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7771: character maps to <undefined>
To solve this I searched the web a bit, and came up with this code:
    import fileinput
    files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0,9)]
    charcounts = {}
    lastfile = ''
    for line in fileinput.input(files,False,'',0,'r',fileinput.hook_encoded('utf-8')):
        line = line.strip()
        data = line.split('\t')
        for character in list(data[0]):
            if (not character in charcounts):
                charcounts[character] = 0
            charcounts[character] += int(data[1])
        if (fileinput.filename() is not lastfile):
            print(fileinput.filename())
            lastfile = fileinput.filename()
        if(fileinput.filelineno() % 100000 == 0):
            print(fileinput.filelineno())
    print(charcounts)
But the hook I now use tries to read the entire 990 MB file into memory at once, which more or less crashes my PC. Does anyone know how to rewrite this code so that it actually works?
P.S.: The code hasn't run all the way through yet, so I don't know whether it does what it is supposed to do, but before I can find out I need to fix this bug.
Oh, and I use Python 3.2
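One possible way around both problems, offered as a sketch rather than a fix tested against the real dataset: open each file yourself with an explicit encoding and loop over the file object, which hands back one decoded line at a time. The helper name count_characters, its count_column parameter, and the tiny sample file are made up for illustration. The sketch assumes the files are UTF-8 and tab-separated, and it keeps the script's choice of data[1] as the count column, which is worth checking against the actual file layout.

```python
import collections
import os
import tempfile


def count_characters(paths, encoding='utf-8', count_column=1):
    """Sum per-character counts over tab-separated files, one line at a time."""
    charcounts = collections.Counter()
    for path in paths:
        # Looping over the file object yields a single decoded line per step,
        # so the whole file never has to be held in memory.
        with open(path, encoding=encoding) as infile:
            for line in infile:
                data = line.rstrip('\n').split('\t')
                count = int(data[count_column])
                for character in data[0]:
                    charcounts[character] += count
    return charcounts


# Tiny stand-in for one dataset file (hypothetical contents).
with tempfile.TemporaryDirectory() as tmpdir:
    sample = os.path.join(tmpdir, 'sample.csv')
    with open(sample, 'w', encoding='utf-8') as out:
        out.write('ab\t2\nb\u00e9\t3\n')
    result = count_characters([sample])

print(dict(result))
```

For the real files, the list built in the question could be passed in directly, for example count_characters(files).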
On this page: open(), file.read(), file.readlines(), file.write(), file.writelines().
Opening and Closing a "File Object"
As seen in Tutorials #12 and #13, file IO (input/output) operations are done through a file data object. It typically proceeds as follows:
- Create a file object using the open() function. Along with the file name, specify:
  - 'r' for reading in an existing file (default; can be dropped),
  - 'w' for creating a new file for writing,
  - 'a' for appending new content to an existing file.
- Do something with the file object (reading, writing).
- Close the file object by calling the .close() method on the file object.
    myfile = open('alice.txt', 'r')    # Reading. 'r' can be omitted
    # ... read from myfile ...
    myfile.close()                     # Closing file

Below, myfile is opened for writing. In the second instance, the 'a' switch makes sure that the new content is tacked on at the end of the existing text file. Had you used 'w' instead, the original file would have been overwritten.
    myfile = open('results.txt', 'w')  # The file is newly created in the current working directory
    # ... write to myfile ...
    myfile.close()                     # Closing file. VERY IMPORTANT!

    myfile = open('results.txt', 'a')  # 'a': appending instead of overwriting.
    # ... add text to the file ...
    myfile.close()                     # Closing file. DON'T FORGET!

There is one more piece of crucial information: encoding. Some files have to be read in a particular encoding, and sometimes you need to write out a file in a specific encoding. For such cases, the open() statement should include an encoding specification, with the encoding='xxx' switch:
    myfile = open('alice.txt', encoding='utf-8')         # Reading a UTF-8 file; 'r' is omitted
    myfile = open('results.txt', 'w', encoding='utf-8')  # File will be written in UTF-8

Mostly, you will need 'utf-8' (8-bit Unicode), 'utf-16' (16-bit Unicode), or 'utf-32' (32-bit), but it may be something different, especially if you are dealing with a foreign language text. Here is a full list of encodings.
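To see why the encoding switch matters, here is a small sketch that writes a string containing a non-ASCII character as UTF-8 and then reads it back twice, once with the matching encoding and once with a different one. The file name cafe.txt and the sample text are made up for illustration; the file is created in a temporary directory so nothing is left behind.

```python
import os
import tempfile

text = 'caf\u00e9\n'   # 'café' followed by a line break

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'cafe.txt')

    # Write the text as UTF-8; the accented letter is stored as two bytes.
    myfile = open(path, 'w', encoding='utf-8')
    myfile.write(text)
    myfile.close()

    # Reading with the same encoding gives back the original string.
    myfile = open(path, encoding='utf-8')
    same = myfile.read()
    myfile.close()

    # Reading with a different encoding raises no error in this case,
    # but each of the two bytes is decoded as a separate, wrong character.
    myfile = open(path, encoding='latin-1')
    garbled = myfile.read()
    myfile.close()

print(same == text)
print(garbled == text)
print(len(text), len(garbled))
```

A mismatch does not always fail loudly, so it is worth knowing the encoding of a file before reading it.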
Reading from a File
OK, we know how to open and close a file object. But what are the actual commands for reading? There are multiple methods.
First off, .read() reads in the entire text content of the file as a single string. Below, the file is read into a variable named marytxt, which ends up being a string-type object. Download mary-short.txt and try it out yourself.
    >>> f = open('mary-short.txt')
    >>> marytxt = f.read()           # Using .read()
    >>> f.close()
    >>> marytxt
    'Mary had a little lamb,\nHis fleece was white as snow,\nAnd everywhere that Mary went,\nThe lamb was sure to go.\n'
    >>> type(marytxt)                # marytxt is string type
    <class 'str'>
    >>> len(marytxt)                 # marytxt has 110 characters
    110
    >>> print(marytxt[0])
    M
Next, .readlines() reads in the file content as a list of strings, one string per line:

    >>> f = open('mary-short.txt')
    >>> marylines = f.readlines()    # Using .readlines()
    >>> f.close()
    >>> marylines
    ['Mary had a little lamb,\n', 'His fleece was white as snow,\n', 'And everywhere that Mary went,\n', 'The lamb was sure to go.\n']
    >>> type(marylines)              # marylines is list type
    <class 'list'>
    >>> len(marylines)               # marylines has 4 lines
    4
    >>> print(marylines[0])
    Mary had a little lamb,

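Lastly, for a big file you can loop over the file object itself with for ... in, which processes one line at a time:

    f = open('bible-kjv.txt')        # This is a big file
    for line in f:                   # Using 'for ... in' on file object
        if 'smite' in line:
            print(line, end='')      # end='' keeps print from adding a line break
    f.close()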
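The three reading styles can be compared side by side. The following is a minimal sketch that first creates a two-line file in a temporary directory (the file name sample.txt is made up) and then reads it with .read(), with .readlines(), and with a for loop over the file object.

```python
import os
import tempfile

lines = ['Mary had a little lamb,\n', 'His fleece was white as snow,\n']

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.txt')
    f = open(path, 'w')
    f.writelines(lines)
    f.close()

    f = open(path)
    whole = f.read()           # one string holding the entire content
    f.close()

    f = open(path)
    as_list = f.readlines()    # a list with one string per line
    f.close()

    f = open(path)
    streamed = []
    for line in f:             # one line per loop iteration
        streamed.append(line)
    f.close()

print(type(whole).__name__, type(as_list).__name__)
print(whole == ''.join(as_list))
print(as_list == streamed)
```

The first two styles hold the whole file in memory at once, while the loop only needs the current line, which is why it suits large files.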
Writing to a File
Writing methods also come in a pair: .write() and .writelines(). Like the corresponding reading methods, .write() handles a single string, while .writelines() handles a list of strings.
Below, .write() writes a single string each time to the designated output file:
    >>> fout = open('hello.txt', 'w')
    >>> fout.write('Hello, world!\n')                        # .write(str)
    >>> fout.write('My name is Homer.\n')
    >>> fout.write("What a beautiful day we're having.\n")
    >>> fout.close()
And .writelines() writes out a list of strings in one go:

    >>> tobuy = ['milk\n', 'butter\n', 'coffee beans\n', 'arugula\n']
    >>> fout = open('grocerylist.txt', 'w')
    >>> fout.writelines(tobuy)                               # .writelines(list)
    >>> fout.close()
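Note that every string in the grocery list above already ends in '\n'. Despite its name, .writelines() does not add line breaks of its own. The sketch below, which uses a made-up list and a temporary file, shows the difference.

```python
import os
import tempfile

items = ['milk', 'butter', 'coffee beans']   # no trailing '\n'

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'grocerylist.txt')

    # The strings are written back to back, exactly as they are.
    fout = open(path, 'w')
    fout.writelines(items)
    fout.close()
    f = open(path)
    run_together = f.read()
    f.close()

    # Adding '\n' to each item puts one item on each line.
    fout = open(path, 'w')
    fout.writelines(item + '\n' for item in items)
    fout.close()
    f = open(path)
    one_per_line = f.read()
    f.close()

print(repr(run_together))
print(repr(one_per_line))
```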
Common Pitfalls
"No such file or directory" error
    >>> f = open('mary-short.txt')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IOError: [Errno 2] No such file or directory: 'mary-short.txt'
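A frequent cause of this error (an assumption here, since only the traceback is shown) is that the file is not in the directory Python is currently running from. The sketch below prints that directory, checks whether the file is there, and catches the error instead of crashing. The file name is made up and chosen so that it should not exist. FileNotFoundError is available from Python 3.3 on; older versions raise IOError as shown above.

```python
import os

filename = 'no-such-file-for-this-demo.txt'   # hypothetical name

# A relative file name is looked up in the current working directory.
print('Working directory:', os.getcwd())
print('File exists here?', os.path.exists(filename))

try:
    f = open(filename)
except FileNotFoundError as err:
    opened = False
    message = 'Could not open {}: {}'.format(filename, err.strerror)
else:
    f.close()
    opened = True
    message = 'Opened {} without problems'.format(filename)

print(message)
```

If the file lives elsewhere, either move it, change the working directory, or pass the full path to open().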
Issues with encoding
    >>> f = open('mary-short.txt')   # need encoding='utf-8'
    >>> marytxt = f.read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
        marytxt = f.read()
      File "C:\Program Files (x86)\Python35-32\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 36593: character maps to <undefined>
Entire file content can be read in only ONCE per opening
    >>> f = open('mary-short.txt')
    >>> marytxt = f.read()           # Reads in entire file content
    >>> marylines = f.readlines()    # Nothing left to read, returns nothing
    >>> f.close()
    >>> len(marytxt)
    110
    >>> len(marylines)               # marylines is empty!
    0
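If you do need to read the same content twice, one option besides closing and reopening the file is to move the reading position back to the start with .seek(0). A minimal sketch, using a made-up two-line file in a temporary directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.txt')
    f = open(path, 'w')
    f.write('line one\nline two\n')
    f.close()

    f = open(path)
    first = f.read()         # reads everything; position is now at the end
    second = f.read()        # nothing left, so this is an empty string
    f.seek(0)                # move the position back to the beginning
    lines = f.readlines()    # reading works again
    f.close()

print(repr(second))
print(lines)
```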
Only the string type can be written
    >>> pi = 3.141592
    >>> fout = open('math.txt', 'w')
    >>> fout.write("Pi's value is ")
    >>> fout.write(pi)               # trying to write float, doesn't work
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: expected a character buffer object
    >>> fout.write(str(pi))          # turn number into string using str()
    >>>
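Besides str(), there are a couple of other common ways to get a number into a text file. The sketch below shows three of them: string concatenation after str(), the .format() string method, and print() with its file= argument, which converts its arguments to strings for you. The file is written to a temporary directory and read back to show the result.

```python
import os
import tempfile

pi = 3.141592

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'math.txt')

    fout = open(path, 'w')
    fout.write("Pi's value is " + str(pi) + '\n')    # convert with str()
    fout.write('Rounded: {:.2f}\n'.format(pi))       # format to 2 decimals
    print('Count:', 3, file=fout)                    # print() converts for you
    fout.close()

    f = open(path)
    content = f.read()
    f.close()

print(content, end='')
```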
Your output file is empty
This happens to everyone: you write something out, open up the file to view it, only to find it empty. At other times, the file content may be incomplete. Curious, isn't it? Well, the cause is simple: YOU FORGOT .close(). Writing out happens in buffers; flushing out the last writing buffer does not happen until you close your file object. ALWAYS REMEMBER TO CLOSE YOUR FILE OBJECT.
(Windows) Line breaks do not show up
If you open up your text file in the Notepad app in Windows and see everything in one line, don't be alarmed. Open the same text file in Wordpad or, even better, Notepad++, and you will see that the line breaks are there after all. See this FAQ for details.
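Returning to the forgotten .close() pitfall: one way to make it hard to forget is Python's with statement, which is not covered above and is shown here only as an optional alternative. The file is closed automatically when the indented block ends, even if an error occurs inside it. A minimal sketch with a temporary file:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'hello.txt')

    # No explicit .close() needed: leaving the block closes the file
    # and flushes the write buffer.
    with open(path, 'w') as fout:
        fout.write('Hello, world!\n')
    closed_after = fout.closed

    with open(path) as f:
        content = f.read()

print(closed_after)
print(repr(content))
```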