Python: Load Dict Fast from File
================================

Python ``wordsegment`` uses two text files to store unigram and bigram
count data. The files currently store one record per line, with fields
separated by tabs.

.. code:: python

    with open('../wordsegment_data/unigrams.txt', 'r') as reader:
        print repr(reader.readline())

.. parsed-literal::

    'the\\t23135851162\\n'

When the ``wordsegment`` module is imported, these files are read from
disk and used to construct a Python ``dict`` mapping words to counts.
That function works like so:

.. code:: python

    # %%timeit
    with open('../wordsegment_data/unigrams.txt') as reader:
        lines = (line.split('\t') for line in reader)
        dict((word, float(number)) for word, number in lines)

.. parsed-literal::

    1 loops, best of 3: 286 ms per loop

Since we're talking about performance, here are some details about my
platform.

.. code:: python

    import subprocess

    print subprocess.check_output([
        '/usr/sbin/sysctl', '-n', 'machdep.cpu.brand_string'
    ])

    import sys

    print sys.version

.. parsed-literal::

    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz

    2.7.10 (default, May 25 2015, 13:06:17)
    [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)]

Loading the files in about a second is plenty fast for me, but I
wondered whether there was a faster way. Here are a few things I tried.

Simply reading all the lines from the file takes 27ms:

.. code:: python

    # %%timeit
    with open('../wordsegment_data/unigrams.txt') as reader:
        lines = [line for line in reader]

.. parsed-literal::

    10 loops, best of 3: 26.6 ms per loop

Another way to accomplish the same:

.. code:: python

    # %%timeit
    with open('../wordsegment_data/unigrams.txt') as reader:
        lines = reader.read().split('\n')

.. parsed-literal::

    10 loops, best of 3: 20.7 ms per loop

That's about 30% faster, but reading is only a small part of the 286ms
total. What takes the majority of the time?

.. code:: python

    # %%timeit
    with open('../wordsegment_data/unigrams.txt') as reader:
        lines = (line.split('\t') for line in reader)
        for word, number in lines:
            pass

.. parsed-literal::

    10 loops, best of 3: 115 ms per loop

So splitting each line takes nearly 90ms. That's a bit surprising to me.
What else takes so long?

.. code:: python

    # %%timeit
    with open('../wordsegment_data/unigrams.txt') as reader:
        lines = (line.split('\t') for line in reader)
        for word, number in lines:
            float(number)

.. parsed-literal::

    10 loops, best of 3: 167 ms per loop

Wow, about 50ms just to convert strings to floats. Maybe later we can
optimize that. Finally, the last chunk must be constructing the
``dict``.

.. code:: python

    # %%timeit
    with open('../wordsegment_data/unigrams.txt') as reader:
        lines = (line.split('\t') for line in reader)
        result = dict()
        for word, number in lines:
            result[word] = float(number)

.. parsed-literal::

    1 loops, best of 3: 254 ms per loop

By calling ``__setitem__`` repeatedly we avoid building the intermediate
tuples that the ``dict`` constructor consumes. Let's compare with the
constructor directly.

.. code:: python

    # %%timeit
    with open('../wordsegment_data/unigrams.txt') as reader:
        lines = (line.split('\t') for line in reader)
        dict((word, float(number)) for word, number in lines)

.. parsed-literal::

    1 loops, best of 3: 303 ms per loop

This isn't Python 2.6 compatible, but what about a ``dict``
comprehension?

.. code:: python

    # %%timeit
    with open('../wordsegment_data/unigrams.txt') as reader:
        lines = (line.split('\t') for line in reader)
        {word: float(number) for word, number in lines}

.. parsed-literal::

    1 loops, best of 3: 275 ms per loop

It's a bit disappointing that the constructor is slower than calling
``__setitem__`` repeatedly. But maybe that just reflects how much
optimization has gone into making ``__setitem__`` really fast.

Here's a breakdown of how long the various steps take:

+--------------------------------+---------+
| Operation                      | Time    |
+================================+=========+
| Read file and parse lines      | 26ms    |
+--------------------------------+---------+
| Split lines by tab character   | 90ms    |
+--------------------------------+---------+
| Convert strings to floats      | 50ms    |
+--------------------------------+---------+
| Creating ``dict(...)``         | 135ms   |
+--------------------------------+---------+

Unfortunately, constructing the ``dict`` is hard to optimize, so let's
look at the other steps. If we stored the counts on disk in a binary
format then we could avoid parsing them. If we did so, we might likewise
store the words in a separate file. Let's convert our unigrams file into
two.

.. code:: python

    with open('../wordsegment_data/unigrams.txt') as reader:
        pairs = [line.split('\t') for line in reader]

    words = [pair[0] for pair in pairs]
    counts = [float(pair[1]) for pair in pairs]

    with open('words.txt', 'wb') as writer:
        writer.write('\n'.join(words))

    from array import array

    values = array('d')
    values.fromlist(counts)

    with open('counts.bin', 'wb') as writer:
        values.tofile(writer)

Now we have two files: ``words.txt`` and ``counts.bin``. The first
stores words separated by newline characters in ASCII. The second stores
double-precision floating-point numbers in binary. Together we can use
these to construct our ``dict``.

.. code:: python

    from itertools import izip as zip

.. code:: python

    # %%timeit
    with open('words.txt', 'rb') as lines, open('counts.bin', 'rb') as counts:
        words = lines.read().split('\n')
        values = array('d')
        values.fromfile(counts, 333333)
        dict(zip(words, values))

.. parsed-literal::

    10 loops, best of 3: 106 ms per loop

(The 333333 passed to ``fromfile`` is the number of unigrams in the
file.)

Wow. We started at 286ms and worked down to 106ms. That's 62% faster.
The key to the speedup is separating the ``dict`` keys and values and
using a fast method to parse each. Reading the words now uses
``str.split``, which is actually faster than Python's built-in
buffered-file readline mechanism. The ``array`` module parses the counts
directly from a binary-formatted file. Finally, the ``dict`` constructor
is called with the keys and values zipped together by ``izip``. I tried
the ``__setitem__`` trick here as well, but the results were within
error of one another and I prefer this style.

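Hard-coding the record count is a little fragile, though. Here's roughly
how the load could be wrapped into a helper that derives the count from
the size of the binary file instead; this is just a sketch, and the name
``load_unigrams`` is mine, not something in ``wordsegment``.

.. code:: python

    import os
    from array import array
    from itertools import izip

    def load_unigrams(words_path='words.txt', counts_path='counts.bin'):
        # Read the newline-separated words in one shot.
        with open(words_path, 'rb') as reader:
            words = reader.read().split('\n')

        # Derive the record count from the file size (8 bytes per double)
        # instead of hard-coding 333333.
        values = array('d')
        count = os.path.getsize(counts_path) // values.itemsize

        with open(counts_path, 'rb') as reader:
            values.fromfile(reader, count)

        return dict(izip(words, values))

The work is identical to the cell above, just bundled into a function.
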
At the end of the day, I'm not that impressed. 62% faster is nice, but I
expected to improve things by 10x, not 2x. Even with this speedup,
you'll notice a delay on module import. And now the format of the files
is funky: they don't play nice with grep, etc. I'm going to leave things
as-is for now. I'd be happy to hear what others have tried.

Note that in this case I don't care how long it takes to write the
files. That would be another interesting thing to benchmark.

I also tried formatting the ``dict`` as a Python module which would be
parsed on import. That was actually a little slower than the initial
code. My guess is the Python interpreter is doing roughly the same
amount of work.

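For reference, here's a sketch of that last experiment; the generated
module name ``unigrams_data.py`` is just an example.

.. code:: python

    # Write a module whose only job is to define the dict as a literal.
    with open('../wordsegment_data/unigrams.txt') as reader:
        pairs = [line.split('\t') for line in reader]

    with open('unigrams_data.py', 'wb') as writer:
        writer.write('unigrams = {\n')
        for word, number in pairs:
            writer.write('    %r: %r,\n' % (word, float(number)))
        writer.write('}\n')

Loading is then just ``from unigrams_data import unigrams``, but the
interpreter still has to tokenize and evaluate the whole literal, which
amounts to roughly the same parsing work as before.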