Dadd, ErrorEmail and CacheControl Releases

I’ve written a couple of new bits of code that seemed like they could be helpful to others.

Dadd

Dadd (pronounced Daddy) is a tool to help administer daemons.

Most deployment systems are based on the idea of long-running processes. You want to release a new version of some service, so you build a package, upload it somewhere and tell your package manager to grab it. Then you tell your process manager to restart it to get the new code.

Dadd works differently. Dadd lets you define a short spec that includes the process you want to run. A dadd worker will then use that spec to download any necessary files, create a temporary directory to run in and start the process. When the process ends, assuming everything went well, it will clean up the temp directory. If there was an error, it will upload the logs to the master and send an email.

Where this sort of system comes in handy is when you have scripts that take a while to run and that shouldn’t be killed when new code is released. For example, at work I manage a ton of ETL processes to get our data into a data warehouse we’ve written. These ETL processes are triggered by Celery tasks, but they typically will ssh into a specific host, create a virtualenv, install some dependencies, and copy files before running a daemon and disconnecting. Dadd makes this kind of processing more automatic in that it can run these processes on any host in our cluster. Also, because the dadd worker builds the environment, it means we can run a custom script without having to go through the process of a release. This is extremely helpful for running backfills or custom updates to migrate old data.

I have some ideas for Dadd, such as incorporating a more involved build system and possibly using lxc containers to run the code. Another inspiration for Dadd is setting up nodes in a cluster. Oftentimes it would be really easy to just install a couple of Python packages, but most solutions are either too manual or require a specific image in order to use things like chef, puppet, etc. With Dadd, you could pretty easily write a script to install and run it on a node and then let it do the rest regarding setting up an environment and running some code.

But, for the moment, if you have code you run by copying some files, Dadd works really well.

ErrorEmail

ErrorEmail was written specifically for Dadd. When you have a script to run and you want a nice traceback email when things fail, give ErrorEmail a try. It doesn’t do any sort of rate limiting and the server config is extremely basic, but sometimes you don’t want to install a bunch of packages just to send an email on an error.

When you can’t install django or some other framework for an application, you can still get nice error emails with ErrorEmail.

CacheControl

The CacheControl 0.10.6 release includes support for calling close on the cache implementation. This is helpful when you are using a cache via some client (e.g. Redis) and that client needs to safely close the connection.
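
Here is a rough sketch of what a cache with close support might look like, using Redis for illustration. CacheControl ships its own Redis cache implementation; this version is purely illustrative and only assumes the BaseCache interface from cachecontrol.cache.

import redis
import requests

from cachecontrol import CacheControl
from cachecontrol.cache import BaseCache


class RedisCache(BaseCache):

    def __init__(self, conn):
        self.conn = conn

    def get(self, key):
        return self.conn.get(key)

    def set(self, key, value):
        self.conn.set(key, value)

    def delete(self, key):
        self.conn.delete(key)

    def close(self):
        # Called by CacheControl, giving the redis client a chance
        # to release its connections safely.
        self.conn.connection_pool.disconnect()


sess = CacheControl(requests.Session(), cache=RedisCache(redis.Redis()))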

Ugly Attributes

At some point in my programming career I recognized that Object Oriented Programming is not all it’s cracked up to be. It can be a powerful tool, especially in a statically typed language, but in the grand scheme of managing complexity, it often falls short of the design ideals that we were taught in school. One area where this becomes evident is object attributes.

Attributes are just variables that are “attached” to an object. This simplicity, unfortunately, means attributes require a good deal more complexity to manage in a system, because languages do not provide any tools to enforce the perceived boundaries that an attribute appears to provide.

Let’s look at a simple example.

class Person(object):

    def __init__(self, age):
        self.age = age

We have a simple Person object. We want to be able to access the person’s age by way of an attribute. The first change we’ll want to make is to make this attribute a property.

from datetime import datetime

class Person(object):
    def __init__(self, year, month, day):
        self.year = year
        self.month = month
        self.day = day

    @property
    def age(self):
        age = datetime.now() - datetime(self.year, self.month, self.day)
        return age.days / 365

So far, this feels pretty simple. But let’s get a little more realistic and presume that this Person is not a naive object but one that talks to a RESTful service in order to get its values.

A Quick Side Note

Most of the time you’d see a database and an ORM for this sort of code. If you are using Django or SQLAlchemy (and I’m sure other ORMs are the same) you’d see something like:

user = User.query.get(id)

You might have a nifty function on your model that calculates the age. That is, until you realize you stored your data in a non-timezone-aware date field, and now that your company has started supporting Europe, some folks are complaining that they are turning 30 a day earlier than they expected...

The point is that ORMs do an interesting thing that is your only logical choice if you want to ensure your attribute access is consistent with the database: ORMs MUST create new instances for each query and provide a SYNC method or function to ensure they are updated. Sure, they might have an eager-commit mode or something, but Stack Overflow will most likely provide plenty of examples where this falls down.

I’d like to keep this reality in mind moving forward as it presents a fact of life when working with objects that is important to understand as your program gets more complex.

Back to Our Person

So, we want to make this Person object use a RESTful service as our database. Let’s change how we load the data.

class Person(ServiceModel):
    # We inherit from some ServiceModel that has the machinery to
    # grab our data from our service and store the raw document
    # on self.doc.

    @classmethod
    def by_id(cls, id):
        doc = conn.get('people', id=id).pop()
        return cls(doc)

    @property
    def age(self):
        age = datetime.now() - datetime(self.year, self.month, self.day)
        return age.days / 365

    # This would probably be implemented in the ServiceModel, but
    # I'll add it here for clarity.
    def __getattr__(self, name):
        if name in self.doc:
            return self.doc[name]
        raise AttributeError('%s is not in the resource.' % name)

Now, assuming we get a document that has a year, month and day, our age property would still work.

So far, this all feels pretty reasonable. But what happens when things change? Fortunately, in the age use case, people rarely change their birth date. But, unfortunately, we do have pesky time zones that we didn’t want to think about when we had 100 users and everyone lived on the west coast. The “minimum viable product” typically doesn’t afford thinking that far ahead, so these are issues you’ll need to deal with after you have a lot of code.

Also, the whole point of all this work has been to support an attribute on an object. We haven’t sped anything up. These are not new features. We haven’t even done anything clever with metaclasses or generators! The reality is that you’ve refactored your code four or five times to support a single call in a template.

{{ person.age }}

Let’s take a step back for a bit.

Taking a Step Back

Do not feel guilty for going down this rabbit hole. I’ve taken the trip hundreds of times! But maybe it is time to reconsider how we think about object oriented design.

When we think back to when we were just learning OO, there was a zoo. In this zoo we had the mythical Animal class. We’d have new animals show up at the zoo. We’d get a Lion, Tiger and Bear, and they would all need to eat. This modeling feels so right it can’t be wrong! And in many ways it isn’t.

If we take a step back, there might be a better way.

Let’s first acknowledge that our Animal does need to eat. But let’s really think about what that means for our zoo. The Animals will eat, but so will the Visitors. I’m sure the Employees would like to have some food now and then as well. The reason we want to know about all this sustenance is that we need to Order food and track its cost. If we reconsider this in the code, what if, and this is a big what if, we didn’t make eat a method on some class? What if we passed our object to an eat function?

eat(Person)

While that looks cannibalistic at first, we can reconsider our original age method as well.

age(Person)

And how about our Animals?

age(Lion)

Looking back at our issues with time zones, because our zoo has grown and people come from all over the world, we can even update our code without much trouble.

age(adjust_for_timezones(Person))

Assuming we’re using imports, here is a more realistic refactoring.

from myapp.time import age

age(Lion)

Rather than rewriting all our age calls for timezone awareness, we can change our myapp/time.py.

def age(obj):
    age = utc.now() - adjust_for_timezones(obj.birthday())
    return age.days / 365

In this idealized world, we haven’t thrown out objects completely. We’ve simply adjusted how we use them. Our age depends on a birthday method. This might be a Mixin class we use with our Models. We also could still have our classic Animal base class. Age might even be relative where you’d want to know how old an Animal is in “person years”. We might create a time.animal.age function that has slightly different requirements.

In any case, by reconsidering our object oriented design, we can remove quite a bit of code related to ugly attributes.

The Real World Conclusions

While it might seem obvious now how to implement a system using these ideas, it requires a different set of skills. Naming things is one of the two hard things in computer science. In dynamic languages, we also don’t have obvious design patterns for grouping functions in a way that makes their expectations clear. Our age function above would likely need some check to ensure that the object has a birthday method; you wouldn’t want every age call to be wrapped in a try/except.
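
For example, a minimal guard, reusing the hypothetical utc and adjust_for_timezones helpers from the sketch above, might look like this:

def age(obj):
    if not hasattr(obj, 'birthday'):
        raise TypeError('age() expects an object with a birthday() method')
    delta = utc.now() - adjust_for_timezones(obj.birthday())
    return delta.days / 365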

You also wouldn’t want to be too limiting on type, especially in a dynamic language like Python (or Ruby, JavaScript, etc.). Even though there has been some rumbling about type hints in Python that seems reasonable, right now you have to make some decisions on how you want to communicate that some function foo expects an object of type Bar or with a method baz. These are trivial problems at a technical level, but socially, they are difficult to enforce without formal language support.

There are also some technical issues to consider. In Python, function calls can be expensive. Each function call requires its own stack frame, so many small nested functions, while designed well, can become slow. There are tools to help with this, but again, it is difficult to make this style obvious over time.

There is never a panacea, but it seems that there is still room for OO design to grow and change. Functional programming, while elegant, is pretty tough to grok, especially when you have dynamic language code sitting in your editor, allowing you to mutate everything under the sun. Still, there are some powerful themes in Functional Programming that can make your Object Oriented code more helpful in managing complexity.

Finally

Programming is really about layering complexity. It is taking concepts and modeling them in a language that computers can take and, eventually, consider in terms of voltage. As we model our systems we need to consider the data vs. the functionality, which means avoiding ugly attributes (and methods) in favor of orthogonal functionality that respects the design inherent in the objects.

It is not easy by any stretch, but I believe by adopting the techniques mentioned above, we can move past the kludgy parts of OO (and functional programming) into better designed and more maintainable software.

Functional Programming in Python

While Python doesn’t natively support some essential traits of an actual functional programming language, it is possible to use a functional style (rather than object oriented) to write programs. What makes it hard is that some of the constraints functional programming requires must be enforced manually.

First off, let’s talk about what Python does well that makes functional programming possible.

Python has first class functions that allow passing a function around the same way that you’d pass around a normal variable. First class functions make it possible to do things like currying and complex list processing. Fortunately, the standard library provides the excellent functools library. For example:

>>> from functools import partial
>>> def add(x, y): return x + y
>>> add_five = partial(add, 5)
>>> map(add_five, [10, 20, 30])
[15, 25, 35]

The next critical functional tool that Python provides is iteration. More specifically, Python generators provide a tool to process data lazily. Generators allow you to create functions that produce data on demand rather than forcing the creation of an entire set. Again, the standard library provides some helpful tools via the itertools library.

>>> from itertools import count, imap, islice
>>> nums = islice(imap(add_five, count(10, 10)), 0, 3)
>>> nums
<itertools.islice object at 0xb7cf6dc4>
>>> nums.next()
15
>>> nums.next()
25
>>> nums.next()
35

In this example each of the functions only calculates and returns a value when it is required.

Python also has other functional concepts built in, such as list comprehensions and decorators, that when used with first class functions and generators make programming in a functional style feasible.
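
For example, a list comprehension paired with a lazy generator expression (the names here are arbitrary):

squares = [x * x for x in range(10)]        # eager list comprehension
evens = (x for x in squares if x % 2 == 0)  # lazy generator expression
list(evens)  # [0, 4, 16, 36, 64]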

Where Python does not make functional programming easy is in dealing with immutable data. In Python, critical core datatypes such as lists and dicts are mutable. In functional languages, all variables are immutable. The result is that you often create a value from some initial immutable variable by applying functions to it.

(defn add-markup [price]
  (+ price (* .25 price)))

(defn add-tax [total]
  (+ total (* .087 total)))

(defn get-total [initial-price]
  (add-tax (add-markup initial-price)))

In each of the steps above, the argument is passed in by value and can’t be changed. When you need the total described by get-total, rather than storing it in a variable, you’d oftentimes just call the get-total function again. Typically a functional language will optimize these calls. In Python we can mimic this by memoizing the result.

import functools
import operator

def memoize(f):
    cache = {}
    @functools.wraps(f)
    def wrapper(*args, **kw):
        key = (args, tuple(sorted(kw.iteritems())))
        if key not in cache:
            cache[key] = f(*args, **kw)
        return cache[key]
    return wrapper

@memoize
def factorial(num):
    return reduce(operator.mul, range(1, num + 1))

Now, calls to the function will re-use previous results without having to execute the body of the function.
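
For instance, the second call below returns immediately without re-running the reduce:

>>> factorial(10)
3628800
>>> factorial(10)  # served from the cache
3628800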

Another pattern seen in functional languages such as LISP is to re-use a core data type, such as a list, as a richer object. For example, association lists act like dictionaries in Python, but they are essentially still just lists with functions to access them as a dictionary, letting you look up arbitrary keys. In other functional languages such as Haskell or Clojure, you create actual types, similar to a struct, to communicate more complex type information.

Obviously in Python we have actual objects. The problem is that objects are mutable. In order to make sure we’re using immutable types we can use Python’s immutable data type, the tuple. What’s more, we can replicate richer types by using a named tuple.

from collections import namedtuple

User = namedtuple('User', ['name', 'email', 'password'])

def update_password(user, new_password):
    return User(user.name, user.email, new_password)
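
A quick usage sketch (the values are made up):

>>> user = User('alice', 'alice@example.com', 'hunter2')
>>> update_password(user, 's3cret')
User(name='alice', email='alice@example.com', password='s3cret')
>>> user.password  # the original value is untouched
'hunter2'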

I’ve found that using named tuples often helps close the mental gap of going from object oriented to a functional style.

While Python is most definitely not a functional language, it has many tools that make using a functional paradigm possible. Functional programming can be a powerful model to consider as there are a whole class of bugs that disappear in an immutable world. Functional programming is also a great change of pace from the typical object oriented patterns you might be used to. Even if you don’t refactor all your code to a functional style, there is much to learn, and fortunately, Python makes it easy to get started in a language you are familiar with.

Parallel Processing

It can be really hard to work with data programmatically. There is some moment when working with a large dataset where you realize you need to process the data in parallel. As a programmer, this sounds like it could be a fun problem, and in many cases it is fun to get all your cores working hard crunching data.

The problem is that parallel processing is never purely a matter of distributing your work across CPUs. The hard part ends up being getting the data organized before sending it to your workers and doing something with the results. Tools like Hadoop boast processing terabytes of data, but it’s a little misleading because there is most likely a ton of code on either end of that processing.

The input and output (I/O) code can also have a big impact on the processing itself. The input often needs to consider what the atomic unit is as well as what the “chunk” of data needs to be. For example, if you have 10 million tiny messages to process, you probably want to group them into chunks of, say, 5000 messages before sending them to your worker nodes, and the workers then need to know they are getting a chunk of messages vs. a single message. Similarly, for some applications the message:chunk ratio needs to be tweaked.
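
As a rough sketch, the chunking itself might look something like this (the helper and its default size are hypothetical):

def chunked(messages, size=5000):
    # Yield lists of at most `size` messages at a time.
    chunk = []
    for message in messages:
        chunk.append(message)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # don't drop the final partial chunk
        yield chunk

The workers then have to be written to accept a list of messages rather than a single message.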

In Hadoop this sort of detail can be dealt with via HDFS, but Hadoop is not trivial to set up, not to mention if you have a bunch of data that doesn’t live in HDFS. The same goes for the output: when you are done, where does it go?

The point is that “data” always tends toward specificity. You can’t abstract away data. Data always ends up being physical at its core. Even if the processing happens in parallel, the I/O will always be a challenging constraint.

View Source

I watched a bit of this fireside chat with Steve Jobs. It was pretty interesting to hear Steve discuss the availability of the network and how it changes the way we can work. Specifically, he mentioned that because of NFS (presumably in the BSD family of unices), he could share his home directory on every computer he works on without ever having to think about back ups or syncing his work.

What occurred to me was how much of the software we use is taken for granted. Back in the day, an educational license for Unix was around $1800! I can only imagine the difficulty of becoming a software developer back then, when all the interesting tools like databases or servers were prohibitively expensive!

It reminds me of when I first started learning about HTML and web development. I could always view the source to see what was happening. It became an essential part of how I saw the web and programming in general. The value of software was not only in its function, but in its transparency. The ability to read the source and support myself as a user allowed me the opportunity to understand why the software was so valuable.

When I think about how difficult it must have been to become a hacker back in the early days of personal computing, it is no wonder that free software and open source became so prevalent. These early pioneers had to learn the systems without reading the source! Learning meant reading through incomplete, poorly written manuals. When the manual was wrong or out of date, I can only imagine the hair pulling that must have occurred. The natural solution to this problem was to make the source available.

The process of programming is still very new and very detailed, while still being extremely generic. We are fortunate as programmers that the computing landscape was not able to enclose software development within proprietary walls like so many other technical fields. I’m very thankful I can view the source!

Property Pattern

I’ve found myself doing this quite a bit lately and thought it might be helpful to others.

Oftentimes when I’m writing some code I want to access something as an attribute, even though it comes from some service or database. For example, say we want to download a bunch of files from some service and store them on our file system for processing.

Here is what we’d like the processing code to look like:

def process_files(self):
    for fn in self.downloaded_files:
        self.filter_and_store(fn)

We don’t really care what the filter_and_store method does. What we do care about is the downloaded_files attribute.

Let’s step back and see what the calling code might look like:

processor = MyServiceProcessor(conn)
processor.process_files()

Again, this is pretty simple, but now we have a problem: when do we actually download the files and store them on the filesystem? One option would be to do something like this in our process_files method.

def process_files(self):
    self.downloaded_files = self.download_files()
    for fn in self.downloaded_files:
        self.filter_and_store(fn)

While it may not seem like a big deal, we just created a side effect: the downloaded_files attribute is set within the process_files method. There is a good chance downloaded_files is something you’d want to reuse, and this creates an odd coupling between the process_files method and the downloaded_files attribute.

Another option would be to do something like this in the constructor:

def __init__(self, conn):
    self.downloaded_files = self.download_files()

Obviously, this is a bad idea. Anytime you instantiate the object it will seemingly try to reach out across some network and download a bunch of files. We can do better!

Here are some goals:

  1. keep the API simple by using a simple attribute, downloaded_files
  2. don’t download anything until it is required
  3. only download the files once per-object
  4. allow injecting downloaded values for tests

The way I’ve been solving this recently has been to use the following property pattern:

import tempfile


class MyServiceProcessor(object):

    def __init__(self, conn):
        self.conn = conn
        self._downloaded_files = None

    @property
    def downloaded_files(self):
        if not self._downloaded_files:
            self._downloaded_files = []
            tmpdir = tempfile.mkdtemp()
            for obj in self.conn.resources():
                self._downloaded_files.append(obj.download(tmpdir))
        return self._downloaded_files

    def process_files(self):
        result = []
        for fn in self.downloaded_files:
            result.append(self.filter_and_store(fn))
        return result

Say we wanted to test our process_files method. It becomes much easier.

def setup(self):
    self.test_files = os.listdir(os.path.join(HERE, 'service_files'))
    self.conn = Mock()
    self.processor = MyServiceProcessor(self.conn)

def test_process_files(self):
    # Just set the property variable to inject the values.
    self.processor._downloaded_files = self.test_files

    assert len(self.processor.process_files()) == len(self.test_files)

As you can see, it was really easy to inject our stub files. We know that we don’t perform any downloads until we have to. We also know that the downloads are only performed once.

Here is another variation I’ve used that doesn’t require setting up _downloaded_files in the constructor.

@property
def downloaded_files(self):
    if not hasattr(self, '_downloaded_files'):
        ...
    return self._downloaded_files

Generally, I prefer the explicit _downloaded_files attribute in the constructor as it allows more granularity when setting a default value. You can set it as an empty list for example, which helps to communicate that the property will need to return a list.

Similarly, you can set the value to None and ensure that when the attribute is accessed, the value becomes an empty list. This small differentiation helps make the API easier to use: an empty list is still iterable while still being “falsey”.

This technique is nothing technically interesting. What I hope someone takes from this is how you can use it to write clearer code and encapsulate your implementation, while exposing a clear API between your objects. Even if you don’t publish a library, keeping your internal object APIs simple and communicative helps make your code easier to reason about.

One caveat is that this method can add a lot of small property methods to your classes. There is nothing wrong with this, but it might give a reader of your code the impression the classes are complex. One method to combat this is to use mixins.

class MyWorkerMixinProperties(object):

    def __init__(self, conn):
        self.conn = conn
        self._categories = None
        self._foo_resources = None
        self._names = None

    @property
    def categories(self):
        if not self._categories:
            self._categories = self.conn.categories()
        return self._categories

    @property
    def foo_resources(self):
        if not self._foo_resources:
            self._foo_resources = self.conn.resources(name='foo')
        return self._foo_resources

    @property
    def names(self):
        if not self._names:
            self._names = [r.meta()['name'] for r in self.foo_resources]
        return self._names



class MyWorker(MyWorkerMixinProperties):

    def __init__(self, conn):
        MyWorkerMixinProperties.__init__(self, conn)

    def run(self):
        for resource in self.foo_resources:
            if resource.category in self.categories:
                self.put('/api/foos', {
                    'real_name': self.names[resource.name_id],
                    'values': self.process_values(resource.values),
                })

This is a somewhat contrived example, but the point is that we’ve taken all our service based data and made it accessible via normal attributes. Each service request is encapsulated in a property, while our primary worker class has a reasonably straightforward implementation of some algorithm.

The big win here is clarity. You can write an algorithm by describing what it should do, and you can then test it easily by injecting the values you know should produce the expected results. Furthermore, you’ve decoupled the algorithm from the I/O code, which is typically where you’ll see a good deal of repetition in the case of RESTful services, or optimization when talking to databases.

Again, this isn’t rocket science. It is a really simple technique that can help make your code much clearer. I’ve found it really useful and I hope you do too!

Iterative Code Cycle

TDD prescribes a simple process for working on code.

  1. Write a failing test
  2. Write some code to get the test to pass
  3. Refactor
  4. Repeat

If we consider this cycle more generically, we see a typical cycle every modern software developer must use when writing code.

  1. Write some code
  2. Run the code
  3. Fix any problems
  4. Repeat

In this generic cycle you might use a REPL, a standalone script, a debugger, etc. to quickly iterate on the code.

Personally, I’ve found that I do use a test for this iteration because it is integrated into my editor. The benefit of using my test suite is that I often have a repeatable test when I’m done that proves (to some level of confidence) the code works as I expect it to. It may not be entirely correct, but at least it codifies that I think it should work. When it does break, I can take a more TDD-like approach and fix the test, which makes it fail, and then fix the actual bug.

The essence, then, of any developer’s work is to make this cycle as quick as possible, no matter what tool you use to run and re-run your code. The process should be fluid and help get you in the flow when programming. If you do use tests for this process, it can also be a helpful design tool. For example, if you are writing a client library for some service, you can write the idealized API you’d like to have without letting the implementation drive the design.

TDD has been on my mind lately as I’ve written a lot of code recently and have questioned whether or not my testing patterns have truly been helpful. Testing has been helpful in fixing bugs and provides a quick coding cycle. I’d argue the code has been improved, but at the same time, I do wonder if by making things testable I’ve introduced more abstractions than necessary. I’ve had to look back on some code that used these patterns, and getting up to speed was somewhat difficult. Then again, anytime you read code you need to put in effort to understand what is happening. Oftentimes I’ll assume that if code doesn’t immediately convey exactly what is happening, it is terrible code. The reality is that code is complex and takes effort to understand, and it should be judged on how reasonable it is to fix once it is understood. In this way, I believe my test-based coding cycle has proven itself to be valuable.

Obviously, the next person to look at the code will disagree, but hopefully once they understand what is going on, it won’t be too bad.

TDD

I watched DHH’s keynote at Railsconf 2014. A large part of his talk discusses how TDD has become misassociated with metrics and making code “testable” rather than stepping back and focusing on clarity, as an author would when writing.

If you’ve ever tried to do true TDD, you might have a similar feeling that you’re doing it wrong. I know I have. Yet, I’ve also seen the benefit of iterating on code via writing tests. The faster the code / test cycle, the easier it is to experiment and write the code. Similarly, I’ve noticed more bugs show up in code that is not as well covered by tests. It might not be clear how DHH’s perspective then fits in with the benefits of testing and facets of TDD.

What I’ve found is that readability and clarity in code often come by way of being testable. Tests and making code testable can go a long way in finding the clarity that DHH describes. It can become clear very quickly, simply by writing a test, that your class API is actually really difficult to use. You can easily spot odd dependencies in a class by the number of mocks you are required to deal with in your tests. Sometimes I find it easier to write a quick test rather than spin up a REPL to run and rerun code.

The point is that TDD can be a helpful tool for writing clear code. As DHH points out, it is not a singular path to a well thought out design. Unfortunately, just as people take TDD too literally, people will feel that any sort of granular testing is a waste of time. The irony here is that DHH says very clearly that we, as software writers, need to practice. Writing tests and re-writing tests are a great way to become a better writer. Even if the ideals presented in TDD are a bit too extreme, the mechanism of a fast test suite and the goal of 100% coverage are still valuable in that they force you to think about and practice writing code.

The process of thinking about code is what is truly critical in almost all software development exercises. Writing tests first is just another way to slow you down and force you to think about your problem before hacking out some code. Some developers can avoid tests, most likely because they are really good about thinking about code before writing it. These people can likely iterate on ideas and concepts in their head before turning to the editor for the actual implementation. The rest of us can use the opportunity of writing tests, taking notes, and even drawing a diagram as tools to force us to think about our system before hacking some ugly code together.

Concurrency Transitions

Glyph, the creator of Twisted, wrote an interesting article discussing the intrinsic flaws of using threads. The essential idea is that unless you know explicitly when you are switching contexts, it is extremely difficult to effectively reason about concurrency in code.

I agree that this is one way to handle concurrency, and Glyph provides a clear perspective into the underlying constraints of concurrent programming. The biggest constraint is that you need a way to guarantee a set of statements happens atomically. He suggests an event driven paradigm as the best way to do this. In a typical async system, the work is built up using small procedures that run atomically, yielding back control to the main loop as they finish. The reason the async model works so well is that you eliminate all CPU-based concurrency and allow work to happen while waiting for I/O.

There are other valid ways to achieve a similar effect. The key in all these methods, async included, is to know when you transition from atomic, sequential operations to potentially concurrent, and often parallel, operations.

A great example of this mindset is found in functional programming, and specifically, in monads. A monad is essentially a guarantee that some set of operations will happen atomically. In a functional language, functions are considered “pure” meaning they don’t introduce any “side effects”, or more specifically, they do not change any state. Monads allow functional languages a way to interact with the outside world by providing a logical interface that the underlying system can use to do any necessary work to make the operation safe. Clojure, for example, uses a Software Transactional Memory system to safely apply changes to state. Another approach might be to use locking and mutexes. No matter the methodology, the goal is to provide a safe way to change state by allowing the developer an explicit way to identify portions of code that change external state.

Here is a classic example in Python of where mutable state can cause problems.
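
A minimal version, with an arbitrary function name:

def append_item(item, items=[]):
    # The default list is created once, at function definition time,
    # and shared by every call that doesn't pass its own list.
    items.append(item)
    return items

append_item(1)  # [1]
append_item(2)  # [1, 2] -- the same list, mutated across calls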

In Python, and the vast majority of languages, it is assumed that a function can act on a variable from a larger scope. This is possible thanks to mutable data structures. In the example above, calling the function multiple times doesn’t re-initialize the argument to an empty list. It is a mutable data structure that exists as state. When the function is called, that state changes, and that change of state is considered a “side effect” in functional programming. This sort of issue is even more difficult in threaded programming because your state can cross threads in addition to lexical boundaries.

If we generalize the purpose of monads and Clojure’s reference types, we can establish that concurrent systems need to be able to manage the transitions between pure functionality (no state manipulation) and operations that affect state.

One methodology that I have found effective in managing this transition is to use queues. More generally, this might be called message passing, but I don’t believe message passing by itself guarantees the system understands when state changes. In the case of a queue, you have an obvious entrance and exit point where the transition between purity and side effects takes place.

The way to implement this sort of system is to consider each consumer of a queue as a different process. By considering consumers / producers as processes, we ensure there is a clear boundary between them that protects shared memory, and more generally shared state. The queue then acts as a bridge across the “physical” border. The queue also provides control over the transition between pure functionality and side effects.
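
Here is a minimal sketch of that idea using the standard library’s multiprocessing module (the transform function and the None sentinel are arbitrary choices):

from multiprocessing import Process, Queue


def transform(message):
    # A pure function: it touches no shared state.
    return message * 2


def worker(inbox, outbox):
    # Popping a message off the queue is the explicit transition point;
    # inside this loop we are back to plain, sequential code.
    for message in iter(inbox.get, None):  # None is the stop sentinel
        outbox.put(transform(message))


if __name__ == '__main__':
    inbox, outbox = Queue(), Queue()
    proc = Process(target=worker, args=(inbox, outbox))
    proc.start()
    for i in range(5):
        inbox.put(i)  # pushing state onto the queue is the other transition
    inbox.put(None)
    results = [outbox.get() for _ in range(5)]
    proc.join()
    print(results)  # [0, 2, 4, 6, 8]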

To relate this back to Glyph’s async perspective, when state is pushed onto the queue it is similar to yielding to the reactor in an async system. When state is popped off the queue into a process, it can be acted upon without worry of causing side effects that could affect other operations.

Glyph brought up the scenario where a function might yield multiple times in order to pass back control to the managing reactor. This becomes less necessary in the queue and process system I describe because there is no chance of a context switch interrupting an operation or slowing down the reactor. In a typical async framework, the job of the reactor is to order each bit of work, but the work still happens in series. Therefore, if one operation takes a long time, it stops all other work from happening, assuming that work is not doing I/O. The queue and process system doesn’t have this same limitation, as it is able to yield control to the queue at the correct logical point in the algorithm. Also, in terms of Python, the GIL is mitigated by using processes. The result is that you can program in a sequential manner for your algorithms while still tackling problems concurrently.

Like anything, this queue and process model is not a panacea. If your data is large, you often need to pass around references to the data and where it can be retrieved. If that resource is not something that handles concurrent connections, the file system for example, you still may run into concurrency issues accessing it. It also can be difficult to reason about failures in a queue based system. How full is too full? You can limit the queue size, but that might cause blocking issues that may be unreasonable.

There is no silver bullet, but if you understand the significance of transitions between pure functionality and side effects, you have a good chance of producing a reasonable system no matter what concurrency model you use.

A Sample CherryPy App Stub

In many full stack frameworks, there is a facility to create a new application via some command. In Django, for example, you use django-admin.py startproject foo. The startproject command will create some directories and files to help you get started.

CherryPy tries very hard to avoid making decisions for you. Instead, CherryPy allows you to set up and configure the layout of your code however you wish. Unfortunately, if you are unfamiliar with CherryPy, it can feel a bit daunting to set up a new application.

Here is how I would set up a CherryPy application that is meant to serve a basic site with static resources and some handlers.

The File System

Here is what the file system looks like.

├── myproj
│   ├── __init__.py
│   ├── config.py *
│   ├── controllers.py
│   ├── models.py
│   ├── server.py
│   ├── static
│   ├── lib *
│   └── views
│       └── base.tmpl
├── setup.py
└── tests

First off, it is a Python package with a setup.py. If you’ve never created a Python package before, here is a good tutorial.

Next up is the project directory. This is where all your code lives. Inside this directory we have a few files and directories.

  • config.py : Practically every application is going to need some configuration and a way to load it. I put that code in config.py and typically import it when necessary. You can leave this out until you need it.
  • controllers.py : MVC is a pretty good design pattern to follow. The controllers.py is where you put your objects that will be mounted on the cherrypy.tree.
  • models.py : Applications typically need to talk to a database or some other service for storing persistent data. I highly recommend SQLAlchemy for this. You can configure the models referred to in the SQLAlchemy docs here, in the models.py file.
  • server.py : CherryPy comes with a production-ready web server that works really well behind a load balancing proxy such as Nginx. This web server should be used for development as well. I’ll provide a simple example of what might go in your server.py file.
  • static : This is where your css, images, etc. will go.
  • lib : CherryPy does a good job allowing you to write plain Python. Once the controllers start becoming more complex, I try to move some of that functionality to well organized classes / functions in the lib directory.
  • views : Here is where you keep your template files. Jinja2 is a popular choice if you don’t already have a preference.

Lastly, I added a tests directory for adding unit and functional tests. If you’ve never done any testing in Python, I highly recommend looking at pytest to get started.

Hooking Things Together

Now that we have a bunch of files and directories, we can start to write our app. We’ll start with the Hello World example on the CherryPy homepage.

In our controllers.py we’ll add our HelloWorld class:

# controllers.py
import cherrypy


class HelloWorld(object):
    def index(self):
        return 'Hello World!'
    index.exposed = True

Our server.py is where we will hook up our controller with the web server. The server.py is also how we’ll run our code in development and potentially in production.

import os

import cherrypy

# if you have a config, import it here
# from myproj import config

from myproj.controllers import HelloWorld

HERE = os.path.dirname(os.path.abspath(__file__))


def get_app_config():
    return {
        '/static': {
            'tools.staticdir.on': True,
            'tools.staticdir.dir': os.path.join(HERE, 'static'),
        }
    }


def get_app(config=None):
    config = config or get_app_config()
    cherrypy.tree.mount(HelloWorld(), '/', config=config)
    return cherrypy.tree


def start():
    get_app()
    cherrypy.engine.signals.subscribe()
    cherrypy.engine.start()
    cherrypy.engine.block()

if __name__ == '__main__':
    start()

Obviously, this looks more complicated than the example on the CherryPy homepage. I’ll walk you through it to explain why it is a little more complex.

First off, if you have a config.py that sets up any configuration object or the like, we import that first. Feel free to leave that out until you have a specific need.

Next up we import our controller from our controllers.py file.

After our imports we set up a variable HERE that will be used to configure any paths. The static directory is the obvious example.

At this point we start defining a few functions. The get_app_config function returns a configuration for the application. In the config, we set up the staticdir tool to point at our static folder. The default configuration exposes these files via /static.

This default configuration is defined in a function to make it easier to test. As your application grows, you will end up needing to merge different configuration details together depending on what is passed into the application. Starting off with your config coming from a function makes changing it for tests much easier.
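
For example, a test might build on the default config like this (the get_test_config helper and the test_static directory are hypothetical):

def get_test_config():
    config = get_app_config()
    config['/static']['tools.staticdir.dir'] = os.path.join(HERE, 'test_static')
    return config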

In the same way we’ve constructed our config behind a function, we also have our application available behind a function. When you call get_app, it has the side effect of mounting the HelloWorld controller on the cherrypy.tree, making it available when the server starts. The get_app function also returns the cherrypy.tree. The reason for this is, once again, to allow easier testing with tools such as WebTest. WebTest allows you to take a WSGI application and make requests against it, asserting against the response. It does this without requiring you to start up a server. I’ll provide an example in a moment.

Finally we have our start function. It calls get_app to mount our application and then calls the necessary functions to start the server. The quickstart function used in the homepage tutorial does the same startup dance under the hood, along with the mounting and config. The quickstart approach can become less helpful as your application grows because it assumes you are mounting a single object at the root. If you prefer to use quickstart you certainly can; just be aware that it can be easy to clobber your configuration when mixing it with cherrypy.tree.mount.

One thing I haven’t addressed here is the database connection. That is outside the scope of this post, but for a good example of how to configure SQLAlchemy and CherryPy, take a look at the example application, Twiseless. Specifically you can see how to setup the models and connections. I’ve chosen to provide a file system organization that is a little closer to other frameworks like Django, but please take liberally from Twiseless to fill in the gaps I’ve left here.

Testing

In full stack frameworks like Django, testing is part of the full package. While many venture outside the confines of the defaults (using pytest vs. Django’s unittest-based test runner), it is generally easy to test things like requests to the web framework.

CherryPy does not take any steps to make this easier, but fortunately, this default app configuration lends itself to relatively easy testing.

Let’s say we want to test our HelloWorld controller. First off, we should set up an environment to develop with. For this we’ll use virtualenv. I like to use a directory called venv. In the project directory:

$ virtualenv venv

Virtualenv comes bundled with pip. Pip has a helpful feature where you can define requirements in a single text file. Assuming you’ve already filled in your setup.py with information about your package, we’ll create a dev_requirements.txt to make it easy to get our environment set up.

# dev_requirements.txt

-e .  # install our package

# test requirements
pytest
webtest

Then we can install these into our virtualenv by doing the following in the shell:

$ source venv/bin/activate
(venv) $ pip install -r dev_requirements.txt

Once the requirements are all installed, we can add our test.

We’ll create a file in tests called test_controller_hello_world.py. Here is what it will look like:

import pytest
import webtest

from myproj.server import get_app


@pytest.fixture(scope='module')
def http():
    return webtest.TestApp(get_app())


class TestHelloWorld(object):

    def test_hello_world_request(self, http):
        resp = http.get('/')
        assert resp.status_int == 200
        assert 'Hello World!' in resp

In the example, we are using a pytest fixture to inject WebTest into our test. WebTest allows you to perform requests against a WSGI application without having to start up a server. The http.get call in our test is then the same as if we had started up the server and made the request in a web browser. The resulting response can be used to make assertions.

We can run the tests via the py.test command:

(venv) $ py.test tests/

It should be noted that we could also test the response by simply instantiating our HelloWorld class and asserting that the result of the index method is correct. For example:

from myproj.controllers import HelloWorld


def test_hello_world_index():
    controller = HelloWorld()
    assert controller.index() == 'Hello World!'

The problem with directly using controller objects is that as you use more of CherryPy’s features, you end up using more of cherrypy.request and other cherrypy objects. This progression is perfectly natural, but it makes it difficult to test the handler methods without also patching much of the cherrypy framework using a library like mock. Mock is a great library and I recommend it, but when testing controllers, using WebTest to handle assertions on responses is preferable.

Similarly, I’ve found pytest fixtures to be a powerful way to introduce external services into tests, but you are free to use any other method you’d like to utilize WebTest in your tests.

Conclusions

CherryPy is truly an unopinionated framework. The purpose of CherryPy is to create a simple gateway between HTTP and plain Python code. The result is that there are often many questions of how to do common tasks, as there are few constraints. Hopefully the above folder layout, alongside the excellent Twiseless example, provides a good jumping-off point for getting the job done.

Also, if you don’t like the layout mentioned above, you are free to change it however you like! That is the beauty of CherryPy. It allows you to organize and structure your application the way you want it structured. Feel free to be creative and customize your app to your own needs without fear of working against the framework.