Ionrock Dot Org

by Eric Larson

My Weblog

Starting Fresh

At work I work on a project that has been a project for quite a while now. As such, we’ve tried to add a lot of new features, but the reality is almost all the work is maintenance. Fortunately though, I was informed yesterday that our next iteration was on the horizon and the current version was going to have a strict feature freeze in the new year! This is really exciting for me because while I’ve done some new code at work, the vast majority of my responsibility is maintaining our current system.
A theme of this maintenance was actually that oftentimes it was better to rewrite a feature rather than edit the code. The code isn’t bad, but it is from a different time. It started with a different Python version, testing wasn’t necessarily a top priority (I’m still personally working on this aspect), and the company was in startup mode where new features were often more important than making sure you went back and cleaned up the code. Along similar lines, we now are a lot larger and some of the designs have been stressed by the load. Making changes to add features usually ends up meaning schema changes (in one way or another, even though we are using MongoDB), which are rarely cheap. Now is definitely a good time to consider a reset to see where we can improve our design and prepare for the future.

There is a difference between starting fresh and restarting. To use a computer term, restarting shuts down the computer and brings it back up again. Anything left in memory is gone. Refreshing is more a process of taking what you have now and using it as a reference as you recreate it. Consider it like a disk defrag or reindex in the database. Your goal is not to change the function, but rather to clean up the debris left behind through years of use.

What is also interesting is that we will finally have an opportunity to truly change how we store and accept data. This second iteration of our application went from using a home-made storage engine based on bsddb to using MongoDB. While MongoDB is pretty cool, it has its warts. Honestly, I’m not sure it is the right path either. The big plus with MongoDB is that you can query it. When there are fires to put out, finding information is critical and queries in MongoDB can be lifesavers. Outside of that though, MongoDB feels somewhat dangerous. We added indexes and found that we are more write heavy than we had thought. That or we are write heavy enough to plague MongoDB. You also need a lot of machines to really fulfill MongoDB’s replication expectations. This might be a good thing, but outside of losing power and natural disasters, there are people called users that make things called “typos” that can end up “killing” the wrong process. MongoDB doesn’t protect against this “threat” and that makes things a little scary at times.

All this is not to say we won’t use MongoDB though. What I know I’d like to implement is a queue that also acts like a bus. My takeaway from our old design is that we have different use cases for our data. We need some layer of abstraction over our data, but at the same time, we need to do some more setup before trusting a single DB system to just work supporting all our use cases. The idea then is to have somewhere to queue data coming in, while at the same time notifying the systems that need it. As new data comes in some processes want it right away, while others need to wait until more of the data in its set has arrived. We work on finding what people think, so there are multiple levels of this data. There is the overarching topic described by the survey, individual questions and how each person’s opinion fits within the sample. The bus then should support keeping the data while each of these use cases is supported.
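To make the queue-plus-bus idea a bit more concrete, here is a minimal in-memory sketch. Everything in it (the `DataBus` name, the `publish` method, the `last` flag marking the end of a set) is my own invention for illustration; it is not how our system works, just one way the two kinds of listeners could be served from the same stored data.

```python
from collections import defaultdict


class DataBus:
    """Hypothetical bus: keeps every incoming item and notifies listeners."""

    def __init__(self):
        self.items = defaultdict(list)  # set_id -> items received so far
        self.on_item = []               # callbacks that want each item right away
        self.on_complete = []           # callbacks that want the whole set at once

    def publish(self, set_id, item, last=False):
        # Store first, so the data is kept even if a listener fails.
        self.items[set_id].append(item)
        for callback in self.on_item:
            callback(set_id, item)
        if last:
            for callback in self.on_complete:
                callback(set_id, list(self.items[set_id]))


bus = DataBus()
seen_answers = []
finished_surveys = []
bus.on_item.append(lambda set_id, item: seen_answers.append(item))
bus.on_complete.append(
    lambda set_id, items: finished_surveys.append((set_id, items))
)

bus.publish("survey-1", {"q1": "yes"})
bus.publish("survey-1", {"q2": "no"}, last=True)
```

The per-answer listener sees both answers as they arrive, while the per-survey listener is only called once, when the set is marked complete.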

There are a lot of details yet to understand, but the process should be helpful and fun. As a developer you are always trying to improve your skills. The goal should be to write functioning code that is easy to maintain, but that is a difficult thing to practice if you a) never maintain code or b) never have to write code that gets maintained by others. This gives me personally a chance to write some new code that others will need to maintain at some point in the future. I’m also excited to hopefully consider using a newer version of Python, writing tests from scratch and generally getting to try out some new-ish tools to make life easier. It should be a great new year!


Posted Wed Dec 8 17:10:19 2010 by Eric Larson

Recognizing Patterns and Macros

If you’ve ever had to write a language for a user you’ve probably had a vision of how you could make things easier for the user to write in a domain specific language (DSL). The thing about many DSLs is that they inherit from the parent language. I’m specifically looking at you Ruby, even though that is nowhere close to the real use case I’m describing. No, my use case is much closer to the many declarative XML based languages. These are all languages that, aside from the parser, create syntax and structure from scratch.

The question then is how do you recognize when your initial structure has outgrown its humble roots? More importantly, how do you meet your users’ requirements without increasing complexity? I should also mention that complexity is the only real measure that we should be addressing. This is my opinion, but it is based on the idea that no matter the syntax, complexity is what you are battling against. Complexity is also not simply a measure of how many different tokens or keywords there are, but rather the number of specific details that must be kept in the forefront of your mind in order to get the correct meaning.

Let’s take a look at a piece of code:


eligible_rangliste1_counter=int[]
eligible_rangliste2_counter=int[]
eligible_rangliste2a_counter=int[]
eligible_rangliste3_counter=int[]
eligible_rangliste4_counter=int[]
eligible_rangliste5_counter=int[]
eligible_rangliste5a_counter=int[]
eligible_rangliste6_counter=int[]

This looks pretty nasty. The first optimization would be to allow something like a dict. I’m going to focus on Python references b/c not only is that what I use every day, but it is also the parent language. If we improve things by introducing a dict, then that makes sense for the initial variable definitions. But then the question is how you use those variables. The above code block is actually within a set of code defined between some curly braces (think wiki syntax as opposed to a C function). Outside of the curly braces the syntax changes.
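For reference, this is roughly what the dict version looks like in plain Python. I’m assuming here that `int[]` means “an empty list of ints”; the names `RANGLISTE_IDS` and `eligible_counters` are made up for the example.

```python
# Hypothetical Python equivalent of the eight declarations above,
# assuming `int[]` declares an empty list of ints.
RANGLISTE_IDS = ["1", "2", "2a", "3", "4", "5", "5a", "6"]

# One dict keyed by the part of the name that actually varies.
eligible_counters = {rid: [] for rid in RANGLISTE_IDS}

# Using a counter is then a lookup instead of a separate variable name.
eligible_counters["2a"].append(3)
```

The only detail the author has to keep in mind is the key, not eight nearly identical variable names.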

Using one of the variables in the traditional scope, the core of the special language, we prefix its usage with a $. Again, it is very similar to a template language in this regard. The problem is things like square brackets already have meaning within the parser in the normal scope. This makes it somewhat difficult to simply add features like a dict that would generally improve the use of complicated or repetitive patterns.

This is the challenge in having a DSL. On the one hand it makes things much simpler. You can write a simple language that doesn’t have to look like HTML or other more visually noisy languages that have a subtle parsing requirement that doesn’t really help authors. On the other hand, the parser must be written in a way to support later features that might conflict with the current syntax that is in the field. Backwards compatibility is a must in these situations because unless you’ve written your parser and objects in such a way as to allow lossless serialization, fixing old scripts ends up being a bug ridden exercise in regex.

Beyond the practical challenges of syntax, there are still questions as to what is truly easier for users. Take an idea such as modules, again as in Python. How do you allow including them in the code? Do they get included where they effectively become written inline or can we import the code, adding things via a virtual context? How does the editor play a role in the whole operation? In our case, the language is not something people interact with on the command line but rather via a simple web interface. Therefore, things like imports/includes involve not only the mechanical functionality, but the UI for writing, validating and storing them within their own scope. When you consider the environment you have the consideration of whether or not the include actually becomes like a macro when the code gets saved. Likewise, macros are another tack to take in order to make things easier for scripters to reuse code.

In some ways the answer is really all of the above, but that still begs the question of whether moving the complexity outside of the basic language and script has simplified things or in fact just moved the complexity. What you want to do is remove complexity by allowing the user to think at a higher level. This means abstractions that create a contract with the more detailed lower level aspects, letting the user work without the need to consider whether or not some lower level piece gets done. Adding things like imports/includes and macros may all do that, but they are dependent on how they are used. Some fancy user might end up writing scripts like this:


include opening
include b2b_12
include b2b_13
include b2b_14
include gen_opt_6
include footer
include postproc

At this point you’ve successfully created something extremely opaque. The complexity is not gone, but simply moved.
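To illustrate the “include becomes a macro at save time” option, here is a small Python sketch that expands `include NAME` lines inline. The function name, the `snippets` dict and the snippet contents are all hypothetical; our real parser does not look like this.

```python
def expand_includes(script, snippets, _seen=()):
    """Replace each `include NAME` line with the text of the named snippet."""
    expanded = []
    for line in script.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "include":
            name = parts[1]
            if name in _seen:
                raise ValueError("circular include: %s" % name)
            # Recurse so that snippets may themselves include other snippets.
            expanded.append(
                expand_includes(snippets[name], snippets, _seen + (name,))
            )
        else:
            expanded.append(line)
    return "\n".join(expanded)


# Placeholder snippet bodies, purely for illustration.
snippets = {
    "opening": "text of opening\ninclude footer",
    "footer": "text of footer",
}
saved = expand_includes("include opening\nrest of script", snippets)
```

Expanding at save time means the stored script shows everything that will run, at the cost of losing the link back to the shared snippet.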

Just like in programming you notice patterns and develop tactics for abstraction, when writing a DSL you have a similar task. The difference is that instead of writing it for your own use cases in a known language where users are expected to understand a larger environment (build system, vcs, editor, etc.), you are defining a language for a potentially non-technical user. These users don’t read blogs on writing code. They don’t go to conferences for your language. The users don’t look forward to the latest version that includes closures. Instead they rely on you to guide their options in a way that lets them get their work done quickly. It is your responsibility to take their use cases, find the patterns and figure out a way of adding abstractions that actually help reduce the complexity. It is anything but easy.


Posted Tue Dec 14 03:39:51 2010 by Eric Larson

Micro Frameworks

I came across Bottle today and thought it was kind of silly. Not in the sense that the actual framework design or functionality is silly, but rather that there are so many attempts to make stripped down frameworks. There is really nothing wrong with making these frameworks. I’m sure the authors learn a lot and they scratch an itch. Every time one comes up though, I wonder about something similar built on CherryPy and I’m reminded that CP is really the original microframework and works even better than ever.

Even though CP has become my framework of choice, others may not realize how similar it really is to the other micro frameworks out there, with the main difference being it has been tested in the real world for years. Let’s take a really simple example of templates and see how we can make it easy to use Mako with CherryPy.

First off, let’s write a little controller that will be our application. I’m going to use the CP method dispatcher.


import cherrypy

class SayHello(object):
    exposed = True # the handler is exposed or else a 404 is raised. very pythonic!

    def GET(self, user, id):
        some_obj = db.find(user, id)
        return {
            'model': some_obj
        }

    def POST(self, user, id, new_foo, *args, **kw):
        updated_foo = SomeModel(user, id, new_foo)
        updated_foo.save()
        raise cherrypy.HTTPRedirect(cherrypy.request.path_info)

I’ve kind of stacked the deck a little bit here with my ‘GET’ method. It is returning a dict because we are going to use that to pass info into a render function that renders the template. There are many ways you could do this, but since I like to reuse the template look up, I’ll make a subclass that includes a render function.


import os
import cherrypy
import json

from mako.template import Template
from mako.lookup import TemplateLookup

__here__ = os.path.dirname(os.path.abspath(__file__))

class RenderTemplate(object):
    def __init__(self):
        self.directories = [
            os.path.normpath(os.path.join(__here__, 'view/'))
        ]
        self.theme = TemplateLookup(
            directories=self.directories,
            output_encoding='utf-8'
        )

        self.constants = {
            'req': cherrypy.request,
        }

    def __call__(self, template, **params):
        tmpl = self.theme.get_template(template)
        kw = self.constants.copy()
        kw.update(params)
        return tmpl.render(**kw)

_render = RenderTemplate()

class PageMixin(object):
    def render(self, tmpl, params=None):
        # Copy so the caller's dict is not mutated.
        params = dict(params or {})
        # Expose the handler's public attributes to the template.
        params.update(dict([
            (name, getattr(self, name))
             for name in dir(self)
             if not name.startswith('_')
        ]))
        return _render(tmpl, **params)

    def json(self, obj):
        cherrypy.response.headers['Content-Type'] = 'application/json'
        return json.dumps(obj)

There is a bunch of extra code here but what I’m doing is setting up a simple wrapper around the Mako template and template look up. I could have used pkg_resources as well here. You’ll also notice that the handler will automatically get the cherrypy.request as a constant called ‘req’ for use in the template. Below our renderer is a PageMixin. I do this b/c it is easy to add simple functions to make certain aspects faster, for example, quickly returning JSON.

Here is how our controller class’ GET method would change.


    def GET(self, user, id):
        some_obj = db.find(user, id)
        return self.render('foo.mako', {
            'model': some_obj
        })

Pretty simple really. I could try to get more clever by automatically passing in our locals() or do some other tricks to make things a little more magic, but that is really not the point. The point here is that I’m just using Python. I don’t have to use CherryPy Tools to make major changes to the way everything works. Including a library is just an import away. If I wanted to write my render function as a decorator that is possible since it would just be a matter of writing the wrapper. If we wanted to do some sort of a cascaded look up on template files, no problem. It is all just Python.
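As a rough sketch of the decorator idea, here is what the wrapper could look like. To keep it runnable without CherryPy or Mako installed I’m using `string.Template` from the standard library as a stand-in for the Mako lookup; `render_with`, `TEMPLATES` and the template text are all made up for the example.

```python
import functools
from string import Template

# Stand-in for a template lookup; with Mako this would be a TemplateLookup.
TEMPLATES = {"foo.tmpl": Template("Hello, $user! Model: $model")}


def render_with(template_name):
    """Render the dict returned by a handler with the named template."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            result = handler(*args, **kwargs)
            if not isinstance(result, dict):
                return result  # let non-dict responses pass through untouched
            return TEMPLATES[template_name].substitute(result)
        return wrapper
    return decorator


class SayHello(object):
    exposed = True

    @render_with("foo.tmpl")
    def GET(self, user, id):
        # A real handler would look the model up; this one fakes it.
        return {"user": user, "model": "record %s" % id}


page = SayHello().GET("eric", 7)
```

The handler goes back to returning a plain dict, and the template choice sits in one line above it. Nothing here is specific to a framework, which is the whole point.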

To wrap things up, the other day I started looking into writing a Tool for CherryPy. After messing with things a bit, I came to the conclusion I wasn’t really a huge fan of the Tool API. After thinking of ways I could improve it and getting some good ideas from Bob, something struck me. The Tool API has been around for a long time and yet it never has been a really important part of my writing apps with CherryPy. The reason is really simple. I can write Python with CherryPy. Python has decorators, itertools, functools, context managers and a whole host of facilities for doing things like wrapping function calls. It doesn’t mean I can’t write a tool, but I don’t have to. The framework isn’t asking me to either. When I used WSGI, I would write my whole application as bits of middleware and compose the pieces. It felt reusable and very powerful, but it also ended up being a pain in the neck. Frameworks have a tendency to be opinionated and while CherryPy is seemingly rather unbiased, I’d argue the real opinion it reflects is “quit messing with frameworks and get things done”. I like that.


Posted Tue Dec 21 16:39:59 2010 by Eric Larson

The Query Queue

A really basic data structure is a queue. You put things in the queue at one end and grab things off the queue at the other end of the line. In terms of making highly scalable web applications, queues allow you to set up work to be done by some other process in order to get the response to the user faster.

I had an idea based on some slightly different use cases. The idea is that instead of simply popping off the last item in the queue, you instead query the queue to get the last item. Querying it allows you to have different types of workers utilizing the queue without stepping on each other’s toes. This can solve an issue of granularity. If you are saving some set of data that gets collected in steps, there is a good chance that each step has meaning. In a survey for example, there is value in the set of answers to all the questions, but there is also value in the single questions as well. This is especially true if the entire survey wasn’t completed.

There might be other situations where it would be beneficial. When you register for some service, they might need to verify an email address or do other operations that cross communication lines (sending an SMS message). If you queue the progress in a query queue, each component could query for the unsent emails or SMS messages while the actual registration process waits for finalized and confirmed registrations.
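Here is a minimal in-memory sketch of what I mean, using the registration example. The `QueryQueue` class, its `take` method and the record fields are all hypothetical, and a real version would need persistence and locking that this ignores.

```python
class QueryQueue:
    """A FIFO queue whose consumers take the oldest item matching a query."""

    def __init__(self):
        self._items = []

    def put(self, item):
        self._items.append(item)

    def take(self, predicate):
        # Scan from the oldest entry so matching items keep FIFO order.
        for index, item in enumerate(self._items):
            if predicate(item):
                return self._items.pop(index)
        return None  # nothing in the queue matches this worker's query


queue = QueryQueue()
queue.put({"user": "ann", "email_sent": False, "confirmed": False})
queue.put({"user": "bob", "email_sent": True, "confirmed": True})

# The email worker only asks for registrations that still need an email.
needs_email = queue.take(lambda r: not r["email_sent"])
# The registration process only asks for confirmed registrations.
confirmed = queue.take(lambda r: r["confirmed"])

# After sending the email, the worker puts the updated record back so the
# next stage can find it once the user confirms.
needs_email["email_sent"] = True
queue.put(needs_email)
still_waiting = queue.take(lambda r: r["confirmed"])
```

Each worker only ever sees the records its query matches, so the two never compete for the same item.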

I obviously don’t have all the details worked out. For example, what happens when the queue gets full? What happens if not all queries are fulfilled? Using our registration example, if the person never verifies their account, it just sits in the queue. It should probably be expunged from the queue at some point. How should that happen?

This might be a horrible idea or it might be something someone else has implemented or found a different solution to. Already I have considered that you could just create more simple queues for each operation. That would probably get around a good portion of the problems, but you inherently lose the more natural continuation type pattern, which is the benefit of this kind of system.

If you know of any similar systems or people who have tried this kind of design please let me know. Part of me feels it would be worth trying out, but at the same time I have a nagging feeling someone really smart sees a much different and better pattern available that makes the whole idea moot. That wouldn’t bother me in the slightest because the problem is getting solved.

Update:

I just read about the end of Nsyght. This is exactly the sort of problem that I think a query queue could support. The river of data coming really quickly and multiple services reading off it as fast as possible getting the information they need. Some look for images, others links, others focus on indexing text, while others focus on the relationships between the atomic units. Again, my idea for a query queue could be totally off, but it is an idea.


Posted Thu Dec 23 05:16:48 2010 by Eric Larson

Appreciating Mercurial

There is a lot of buzz around git. Since I’ve never spent much time with it, I can’t really say whether it is warranted or not. I ended up using mercurial and never had a reason to change.

One thing that consistently happens when using a DVCS is that you reconsider how you work with version control. There are some larger concepts that are largely static such as tagging releases and branching for features or bug fixes, but past that the world is wide open. This is a blessing and a curse. The options are never ending, so like vim or emacs, you can always tinker with your version control. The downside is that it can be really difficult to find a canonical method of use.

Some people might wonder why you’d want a “canonical” way to use your version control system. After all, a DVCS lets you program on planes so your work flow doesn’t affect anyone else, right? In theory this is true, but in practice I’d argue that isn’t the case. The reality is a DVCS is a complicated beast and in most dev environments you really don’t need the extra complications. A successful colleague of mine expressed his appreciation for Perforce, a known target for version control bigotry. His point was not necessarily that Perforce was such a perfect design but rather that its constraints were reasonable and helped get things done efficiently. It had a very clear canonical way of working with it that made everything from getting new employees up to speed to pushing out new releases simple. Unfortunately, the git vs. hg debates usually come from radically different environments where this idea of a canonical use doesn’t easily apply and the result is that there seems to be a greater discrepancy between the two than there really needs to be.

I read this article regarding how git gets branching more correct than mercurial. Looking at the context, the author’s work flow requires accepting and reviewing patches before applying them to the project code base. His perspective is that losing the context of where a patch came from in terms of the branch doesn’t really matter compared to the ability to disassociate patches from branches. I might have misunderstood his point but it doesn’t really matter. Pushing development forward via the submission of patches is a very specific use case that doesn’t happen in many situations.

Most open source programmers have day jobs and I’ve yet to see the situation where fixing bugs in an organization goes through some maintainer that reviews the patches and applies them to the main branch. It is more common that developers work on specific bugs and features within the context of some time period. At the end of the time period, there is a release event that tags the current stable state of the repo and the cycle continues. One option would be to create a release manager position that is responsible for integrating patches to make sure they work and don’t cause problems, but the smarter way to deal with this is via automation and continuous integration.

Hopefully it is clear that the biggest difference between this traditional organization based model and open source experiences is that in an organization you are responsible for the code. In an open source situation you can submit patches all day long and there is no obligation for anyone to pay attention. The open source developer has to politic one way or another in order to be heard whereas in an organization, your obligation is to communicate and produce code. This distinction is critical because in addition to using a tool, an organization can specify the “right” way to use it such that it reduces issues associated with random features colliding. This is important because by specifying the correct way to use the tool you open the door for other assumptions to be made.

A really good example of this would be in a release process. If you as a group decided to always add a “closed #{bug}” format in commit messages, writing a script that compiles the release notes and posts them to a wiki would be pretty trivial. In a similar fashion, you could add flags to your commit messages that hooks in the VCS use to do things like post back to a ticket/bug page. This is something a developer at our company recently started working on. It would be impossible to do things like this in an open source model.
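To show how trivial the release notes part is once the convention exists, here is a short sketch. It assumes the commit messages are already available as a list of strings (however you pull them out of the VCS) and that bugs are mentioned as “closed #123”; the function name and sample messages are made up.

```python
import re

# Assumed convention: commit messages mention fixed bugs as "closed #123".
CLOSED_BUG = re.compile(r"closed #(\d+)", re.IGNORECASE)


def release_notes(commit_messages):
    """Return one release-note line per bug closed in the given commits."""
    notes = []
    for message in commit_messages:
        lines = message.strip().splitlines()
        summary = lines[0] if lines else ""
        for bug in CLOSED_BUG.findall(message):
            notes.append("* #%s: %s" % (bug, summary))
    return notes


# Invented sample messages, purely for illustration.
messages = [
    "Fix login redirect loop (closed #42)",
    "Refactor template lookup",
    "Handle empty surveys, closed #57\n\nAlso closed #58.",
]
notes = release_notes(messages)
```

Commits that don’t follow the convention are simply skipped, which is exactly why the convention has to be something the team agrees on.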

I’m not trying to argue that one system is better or worse than the other. My goal is to simply make it clear that you can’t simply read blogs about git or hg and assume that you’re finding a consensus on what is the best tool. It is not the tool, but how you use it that really matters most. Personally, I’d stick with mercurial because, as many git fans have mentioned, the UI is easier to use. My perspective is that you can work around the vast majority of subtle issues by simply specifying the best way to use the tool.

As a side note, in this branching model post, one thing that might help in mercurial to avoid many branches in default is to only push to default or the production branch. If I have two features I’m working on, each with its own named branch, and I finish one, I can choose to merge that branch into default and push only the default branch changes. That way you can experiment and create branches as needed without polluting the canonical repo. Does mercurial do anything to help this work flow? Nope, it is just something you have to tell the team to do. Some smart folks say that constraints can be good and this is simply an example of that concept.


Posted Thu Dec 30 17:25:44 2010 by Eric Larson

Created using Python, jQuery and Emacs