Parsing Large CSV Blobs on Google App Engine

When parsing a blob on Google App Engine using the Python CSV library, the simplest approach is to pass the BlobReader straight into the CSV reader. However, unlike opening a normal file, there is no option to handle universal newlines, so blobs whose lines end in \r or \r\n will confuse the csv module. The string’s splitlines method handles all the different kinds of newline characters, but applying it without loading the entire file into memory can be tricky. Google recommends blobs be read 1MB at a time, so ideally you would load 1MB into a buffer, split the lines and then feed the CSV reader one line at a time. That’s what this class does:
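As a quick illustration of why splitlines helps here: it recognises Unix (\n), old Mac (\r) and Windows (\r\n) line endings alike, which is exactly what universal newline mode would have given us.

```python
# splitlines copes with all three newline conventions in one string
mixed = "first\nsecond\rthird\r\nfourth"
print(mixed.splitlines())  # → ['first', 'second', 'third', 'fourth']
```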

class BlobIterator:
    """Because the python csv module doesn't like strange newline chars and
    the google blob reader cannot be told to open in universal mode, then
    we need to read blocks of the blob and 'fix' the newlines as we go"""

    def __init__(self, blob_reader):
        self.blob_reader = blob_reader
        self.last_line = ""
        self.line_num = 0
        self.lines = []
        self.buffer = None

    def __iter__(self):
        return self

    def next(self):
        if not self.buffer or len(self.lines) == self.line_num + 1:
            self.buffer = self.blob_reader.read(1048576)  # 1MB buffer
            self.lines = self.buffer.splitlines()
            self.line_num = 0

            # Handle special case where our block just happens to end on a new
            # line: there is no partial line to carry over, so add an empty one
            if self.buffer[-1:] == "\n" or self.buffer[-1:] == "\r":
                self.lines.append("")
        if not self.buffer:
            raise StopIteration

        if self.line_num == 0 and len(self.last_line) > 0:
            result = self.last_line + self.lines[self.line_num] + "\n"
        else:
            result = self.lines[self.line_num] + "\n"

        self.last_line = self.lines[self.line_num + 1]
        self.line_num += 1

        return result

Having defined this class, you can use it like this:

    blob_reader = blobstore.BlobReader(blob_key)
    blob_iterator = BlobIterator(blob_reader)
    reader = csv.reader(blob_iterator)

The BlobIterator implements the iterator protocol, but behind the scenes it loads 1MB of the blob into memory at a time, splits it into lines and keeps track of the trailing partial line so it can be joined with the first partial line of the next 1MB block.
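The same buffered-splitlines technique can be exercised locally without App Engine. The sketch below is a Python 3 adaptation (so __next__ rather than next), with io.StringIO standing in for the BlobReader and a deliberately tiny block size so that lines straddle buffer boundaries; the class name and block_size parameter are my own, and the sketch assumes the input ends with a newline.

```python
import csv
import io

class BufferedLineIterator:
    """Python 3 sketch of the BlobIterator idea: read fixed-size blocks
    from any file-like reader, splitlines() each block, and stitch the
    trailing partial line onto the start of the next block."""

    def __init__(self, reader, block_size=1024 * 1024):
        self.reader = reader
        self.block_size = block_size
        self.last_line = ""
        self.line_num = 0
        self.lines = []
        self.buffer = None

    def __iter__(self):
        return self

    def __next__(self):  # Python 3 spelling of next()
        if not self.buffer or len(self.lines) == self.line_num + 1:
            self.buffer = self.reader.read(self.block_size)
            self.lines = self.buffer.splitlines()
            self.line_num = 0
            # Block ended exactly on a newline: no partial line to carry over
            if self.buffer[-1:] in ("\n", "\r"):
                self.lines.append("")

        if not self.buffer:
            raise StopIteration

        if self.line_num == 0 and len(self.last_line) > 0:
            # First line of a new block completes the previous partial line
            result = self.last_line + self.lines[self.line_num] + "\n"
        else:
            result = self.lines[self.line_num] + "\n"

        self.last_line = self.lines[self.line_num + 1]
        self.line_num += 1
        return result

# Mixed newline styles, with blocks small enough to split lines mid-way
data = "a,b\r\n1,2\r3,4\n"
rows = list(csv.reader(BufferedLineIterator(io.StringIO(data), block_size=5)))
print(rows)  # → [['a', 'b'], ['1', '2'], ['3', '4']]
```

Note that the 5-byte blocks cut "1,2" and "3,4" across buffer boundaries, yet the CSV rows come out intact because the partial lines are stitched back together.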

UPDATE: 2014-08-14: Many thanks to Javier Carrascal for his help in spotting an issue with the first version of the BlobIterator. The code above has been updated with a fix. His post explains the process he went through.