A look at Cython

18 May 2010 15:47
I have some code which reads historical stock prices from files a bit like this:
```
Date,Open,High,Low,Close,Volume,Adj Close
2009-12-15,505.00,515.50,504.00,514.00,5009300,514.00
2009-12-14,503.50,505.50,500.50,503.50,2413100,503.50
2009-12-11,495.20,503.50,493.70,501.00,3022800,501.00
```
My initial attempt was quite slow; apparently Python's strptime is now a pure-Python implementation and hence quite slow, but much more consistent than using the underlying system library.
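For reference, the slow baseline looked roughly like this (a sketch of the approach rather than the exact original code; the handling of the numeric columns here is my assumption):

```python
from datetime import datetime

def parse_line_strptime(line):
    """Parse one CSV line of price data; strptime dominates the cost."""
    fields = line.split(',')
    d = datetime.strptime(fields[0], "%Y-%m-%d").date()
    return (d,) + tuple(float(x) for x in fields[1:])

row = parse_line_strptime("2009-12-15,505.00,515.50,504.00,514.00,5009300,514.00")
```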
Anyway, since the format is fixed, I thought I could save a little time and ended up with something like this:
```python
from collections import namedtuple
from datetime import date

Day = namedtuple('Day', 'date open high low close volume adjclose')

def parseLine(l):
    d = l.split(',')[1:]
    return Day(date(int(l[0:4]), int(l[5:7]), int(l[8:10])),
               *[float(x) for x in d])
```
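A quick standalone check of the sliced-date approach on the sample line above (restating the function so it runs on its own):

```python
from collections import namedtuple
from datetime import date

Day = namedtuple('Day', 'date open high low close volume adjclose')

def parseLine(l):
    # Slice the fixed-format date directly instead of calling strptime.
    d = l.split(',')[1:]
    return Day(date(int(l[0:4]), int(l[5:7]), int(l[8:10])),
               *[float(x) for x in d])

day = parseLine("2009-12-15,505.00,515.50,504.00,514.00,5009300,514.00")
```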
Yes, more beautiful code has been written, but this works; if this horrifies you, I recommend that you stop reading now. This code was many times faster than using Python's strptime, but still not fast. The above function was taking about 2.2 seconds to read nearly 200,000 lines of data. That's not bad, but not really fast either. Optimising wasn't necessary, but as a learning exercise I thought I'd give it a go.
My first attempt was to use PyPy. With JIT, PyPy beats CPython in the majority of benchmarks. Regular expressions are a known slow point, and I don't have any of those, so I expected good things. Sadly it was about 5 times slower. (Plus I had to find an implementation of namedtuple and import the with statement from __future__, since PyPy currently implements Python 2.5.)
Then I remembered that I wanted to have a play with Cython at some point. Cython is really amazing because it enables you to mix Python and C functions with ease.
The first thing I did was to move the above code into a .pyx file and import it using pyximport which automatically converts the code to C and compiles it on import. This trivial change gave me a small speed-up, around 20%.
This wasn't enough to be worth the effort, so the next step was to avoid some of the Python string processing. C's atoi function parses until the first non-numeric character, so I could point it at the right point in the Python string and have it work. Sounds too good to be true? I thought so, but here it is:
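The claim about atoi is easy to verify from plain Python via ctypes before writing any Cython (an illustration of the C function's behaviour, not part of the article's code; loading the C library this way assumes a Unix-like system):

```python
import ctypes
import ctypes.util

# Load the system C library.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.atoi.restype = ctypes.c_int
libc.atoi.argtypes = [ctypes.c_char_p]

line = b"2009-12-15,505.00"
year = libc.atoi(line)        # parses "2009", stops at the '-'
month = libc.atoi(line[5:])   # parses "12"
day = libc.atoi(line[8:])     # parses "15"
```

In Cython, `l+5` is genuine pointer arithmetic into the original buffer; with ctypes the slice copies the bytes, but the parsing behaviour is the same.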
```cython
cdef extern from "stdlib.h":
    int atoi(char*)

def parseLine(char* l):
    d = l.split(',')[1:]
    return Day(date(atoi(l), atoi(l+5), atoi(l+8)),
               *[float(x) for x in d])
```
[Aside: I'm very impressed that Pygments already supports Cython syntax highlighting]
This saved a bit of time, but some profiling showed that this function was still the slowest in my code. Could I also do the rest of the string parsing in C? Well, C doesn't have a nice split function. But it turns out that it does have a function strtod which converts a string to a double and updates a pointer to point just past the last character parsed.
Yes, sadly this gets quite ugly now, but anyway here's the code:
```cython
from collections import namedtuple
from datetime import date

Day = namedtuple('Day', 'date open high low close volume adjclose')

cdef extern from "stdlib.h":
    int atoi(char*)
    double strtod(char*, char**)

def parseLine(char* l):
    cdef char** j
    j = &l
    return Day(date(atoi(l), atoi(l+5), atoi(l+8)),
               strtod(l+11, j),
               strtod(j[0]+1, j),
               strtod(j[0]+1, j),
               strtod(j[0]+1, j),
               strtod(j[0]+1, j),
               strtod(j[0]+1, j))
```
As you can see, we declare j to be a pointer to a pointer into a character array, and initialise it to point to l; we could initialise it to anything, but l is a handy unused pointer which strtod can trample over before it gets garbage collected when the function exits. To take a reference we use & as in C, but to dereference we have to use [0], since * already has a meaning in Python.
This is clearly a bit ugly and has made the code more fragile, but it actually gives a three-times speed increase over the pure-Python implementation.
Cython is usually used for increasing the speed of numerical calculations (where improvements of the order of 1000 times have been reported), and there are probably not that many cases where it is suitable for string processing, which after all is something Python is particularly good at. But the ease with which Python and C can be mixed is really impressive, and I will certainly be happier programming heavy numerical procedures in Python knowing that I can likely get a good speed up with some minor changes if I need to.
After eliminating the string processing, the next slowest thing was instantiating named tuples. This made me a little sad as I think named tuples are one of the nicest features Python has grown recently. However, doing something similar in Cython isn't too hard. Here's what I ended up with instead of the namedtuple definition:
```cython
cdef class DayClass:
    cdef readonly object date
    cdef readonly float open
    cdef readonly float high
    cdef readonly float low
    cdef readonly float close
    cdef readonly float volume
    cdef readonly float adjclose

cpdef DayClass Day(date, open, high, low, close, volume, adjclose):
    cdef DayClass instance = DayClass.__new__(DayClass)
    instance.date = date
    instance.open = open
    instance.high = high
    instance.low = low
    instance.close = close
    instance.volume = volume
    instance.adjclose = adjclose
    return instance
```
Here I first have a class definition; to make attributes available from Python they need to be declared either public or readonly. Below that is a factory function which instantiates the class and sets the appropriate attributes. This still isn't incredibly fast, but the change more than doubled the speed at which I could load data, so it did make a significant difference.
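A rough pure-Python analogue of this pattern, for comparison, is a class with __slots__ (my illustration, not from the article; it gives fixed attributes without a per-instance __dict__, though it won't match the speed of the compiled cdef class):

```python
class Day(object):
    """Fixed-attribute record; __slots__ avoids a per-instance __dict__."""
    __slots__ = ('date', 'open', 'high', 'low', 'close', 'volume', 'adjclose')

    def __init__(self, date, open, high, low, close, volume, adjclose):
        self.date = date
        self.open = open
        self.high = high
        self.low = low
        self.close = close
        self.volume = volume
        self.adjclose = adjclose

d = Day(None, 505.0, 515.5, 504.0, 514.0, 5009300.0, 514.0)
```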