A look at Cython

18 May 2010 15:47

I have some code which reads historical stock prices from files a bit like this:

Date,Open,High,Low,Close,Volume,Adj Close
2009-12-15,505.00,515.50,504.00,514.00,5009300,514.00
2009-12-14,503.50,505.50,500.50,503.50,2413100,503.50
2009-12-11,495.20,503.50,493.70,501.00,3022800,501.00

My initial attempt was quite slow; apparently Python's strptime is now a pure-Python implementation and hence quite slow, but much more consistent than using the underlying system library.
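
That first version was along these lines (a sketch for illustration rather than the exact original code):

from datetime import datetime

# Illustrative sketch of the slow, strptime-based first attempt
def parseLine(l):
    d = l.split(',')
    day = datetime.strptime(d[0], '%Y-%m-%d').date()
    return (day,) + tuple(float(x) for x in d[1:])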

Anyway, since the format is fixed, I thought I could save a little time and ended up with something like this:

from collections import namedtuple
from datetime import date

Day = namedtuple('Day', 'date open high low close volume adjclose')

def parseLine(l):
    d = l.split(',')[1:]
    return Day(date(int(l[0:4]), int(l[5:7]), int(l[8:10])),
               *[float(x) for x in d])
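
On a line from the sample above this produces something like:

>>> parseLine('2009-12-15,505.00,515.50,504.00,514.00,5009300,514.00')
Day(date=datetime.date(2009, 12, 15), open=505.0, high=515.5, low=504.0, close=514.0, volume=5009300.0, adjclose=514.0)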

Yes, more beautiful code has been written, but this works; if this horrifies you, I recommend that you stop reading now. This code was many times faster than using Python's strptime, but still not fast. The above function was taking about 2.2 seconds to read nearly 200,000 lines of data. That's not bad, but not really fast either. Optimising wasn't necessary, but as a learning exercise I thought I'd give it a go.

My first attempt was to use PyPy. With JIT, PyPy beats CPython in the majority of benchmarks. Regular expressions are a known slow point, and I don't have any of those, so I expected good things. Sadly it was about 5 times slower. (Plus I had to find an implementation of namedtuple and import the with statement, since PyPy currently implements Python 2.5.)

Then I remembered that I wanted to have a play with Cython at some point. Cython is really amazing because it enables you to mix Python and C functions with ease.

The first thing I did was to move the above code into a .pyx file and import it using pyximport, which automatically converts the code to C and compiles it on import. This trivial change gave me a small speed-up, around 20%.
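
pyximport handles the whole build step; assuming the code lives in a file called parse.pyx (the name here is just for illustration), using it looks like this:

import pyximport
pyximport.install()

import parse   # parse.pyx is translated to C and compiled on first import
day = parse.parseLine('2009-12-15,505.00,515.50,504.00,514.00,5009300,514.00')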

This wasn't enough to be worth the effort, so the next step was to avoid some of the Python string processing. C's atoi function parses until the first non-numeric character, so I could point it at the right place in the Python string and have it work. Sounds too good to be true? I thought so, but here it is:

cdef extern from "stdlib.h":
    int atoi(char*)

def parseLine(char* l):
    d = l.split(',')[1:]
    # atoi stops at the first non-digit, so l, l+5 and l+8 give the year, month and day
    return Day(date(atoi(l),atoi(l+5),atoi(l+8)), *[float(x) for x in d])

[Aside: I'm very impressed that Pygments already supports Cython syntax highlighting]

This saved a bit of time, but some profiling showed that this function was still the slowest in my code. Could I also do the rest of the string parsing in C? Well, C doesn't have a nice split function. But it turns out that it does have a function strtod which converts a string to a double and updates a pointer to point just past the last character it parsed.
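
To illustrate the idea in isolation (a sketch of strtod's behaviour, not anything from the real code):

cdef extern from "stdlib.h":
    double strtod(char*, char**)

def demo():
    cdef char* s = "12.5,7.0"
    cdef char* end
    first = strtod(s, &end)         # first == 12.5; end now points at the ','
    second = strtod(end + 1, &end)  # second == 7.0
    return first, second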

Yes, sadly this gets quite ugly now, but anyway here's the code:

from collections import namedtuple
from datetime import date

Day = namedtuple('Day', 'date open high low close volume adjclose')

cdef extern from "stdlib.h":
    int atoi(char*)
    double strtod(char*, char**)

def parseLine(char* l):
    cdef char** j
    j = &l
    # The date occupies the first 10 characters, so l+11 is the start of the Open field.
    # Each strtod call updates j[0] to point just past the number it parsed (at the
    # following comma), so j[0]+1 is the start of the next field.
    return Day(date(atoi(l),atoi(l+5),atoi(l+8)),
               strtod(l+11,j),
               strtod(j[0]+1,j),
               strtod(j[0]+1,j),
               strtod(j[0]+1,j),
               strtod(j[0]+1,j),
               strtod(j[0]+1,j))

As you can see, we declare j to be a pointer to a pointer into a character array, and initialise it to point to l; we could initialise it to anything, but l is a handy unused pointer which strtod can trample over before it goes out of scope when the function exits. To get a reference we use & as in C, but to dereference we have to use [0], since * already has a meaning in Python.

This is clearly a bit ugly and has made the code more fragile, but it actually gives a threefold speed increase over the pure-Python implementation.

Cython is usually used for increasing the speed of numerical calculations (where improvements of the order of 1000 times have been reported), and there are probably not that many cases where it is suitable for string processing, which after all is something Python is particularly good at. But the ease with which Python and C can be mixed is really impressive, and I will certainly be happier programming heavy numerical procedures in Python knowing that I can likely get a good speed up with some minor changes if I need to.
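
For comparison, the kind of numerical code where those huge wins come from looks more like this (a trivial sketch; everything in the loop is statically typed, so it compiles down to plain C arithmetic):

def sum_of_squares(int n):
    cdef double total = 0.0
    cdef int i
    for i in range(n):
        total += i * i
    return total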

After eliminating the string processing, the next slowest thing was instantiating named tuples. This made me a little sad as I think named tuples are one of the nicest features Python has grown recently. However, doing something similar in Cython isn't too hard. Here's what I ended up with instead of the namedtuple definition:

cdef class DayClass:
    # readonly attributes are visible (but not writable) from Python code
    cdef readonly object date
    cdef readonly float open
    cdef readonly float high
    cdef readonly float low
    cdef readonly float close
    cdef readonly float volume
    cdef readonly float adjclose

cpdef DayClass Day(date, open, high, low, close, volume, adjclose):
    # Factory function replacing the namedtuple constructor
    cdef DayClass instance = DayClass.__new__(DayClass)
    instance.date = date
    instance.open = open
    instance.high = high
    instance.low = low
    instance.close = close
    instance.volume = volume
    instance.adjclose = adjclose
    return instance

Here I first have a class definition; to make attributes available from Python they need to be declared either public or readonly. Below that is a factory function which instantiates the class and sets the appropriate attributes. This still isn't incredibly fast, but the change more than doubled the speed at which I could load data, so it did make a significant difference.
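
From Python the factory and the resulting objects are used just as before (a hypothetical usage sketch, with values from the sample data above):

from datetime import date

day = Day(date(2009, 12, 15), 505.0, 515.5, 504.0, 514.0, 5009300.0, 514.0)
assert day.close == 514.0   # readonly attributes can be read from Python...
# day.close = 0.0           # ...but assigning to one raises an AttributeError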

Comments

Bob wrote on 9 June 2010:

You might consider typing the arguments in your Day function, i.e.

cpdef DayClass Day(date, float open, float high, float low, float close, float volume, float adjclose):

which could help a lot. At that point constructing the date object would probably be the bottleneck.

Graham wrote on 15 June 2010:

Hi Bob, I tried making the change you suggested and got all sorts of weird results. I eventually figured out that it was because the evaluation order of arguments to a C function is up to the compiler. Once I fixed that, I ended up with a speed-up of about 20%. Thanks for the tip!

This was what I ended up with:

def parseLine(char* l):
    cdef char** j
    cdef double v[6]
    cdef int i
    j = &l
    d = date(atoi(l),atoi(l+5),atoi(l+8))
    j[0] = l+10
    # Parsing the fields in an explicit loop forces the strtod calls to run in order,
    # instead of relying on the C compiler's argument evaluation order.
    for i in range(6):
        v[i] = strtod(j[0]+1,j)
    return Day(d, v[0], v[1], v[2], v[3], v[4], v[5])

Stefan Behnel wrote on 8 February 2011:

"the evaluation order of arguments to a C function is up to the compiler"

That's actually a bug in current Cython. I've been working on a fix, but it turns out that this is non-trivial because there is code that depends on the current behaviour. We will try to keep the impact low when we fix the evaluation order for Cython 0.15.

"there are probably not that many cases where it is suitable for string processing"

I would object to that; Cython is quite suitable for string processing:

http://docs.cython.org/src/tutorial/strings.html

And I am always interested in improving its support (it's Open-Source, BTW, you can make it better, too!). But I agree that most of Python's advanced string processing features (regexes, anyone?) would require you to create Python strings from your data, which will always slow down your code.
