cc.engine

Converting cc.engine from ZPT to Jinja2 and i18n logical keys to English keys

cwebber, September 2nd, 2011

Some CC-specific background

Right now I’m in the middle of retooling our translation infrastructure. cc.engine and related tools have a long, complex history (dating back, as I understand, to TCL scripts running on AOL server software). The short of it is, CC’s tools have evolved a lot over the years, and sometimes we’re left with systems and tools that require a lot of organization-specific knowledge for historical reasons.

This has been the case with CC’s translation tools. Most of the world these days uses English-key translations; CC used logical-key translations. This means that if you marked up a bit of text for translation, instead of the key being the actual text being translated (such as “About The Licenses”), the key would be an identifier code which mapped to said English string, like “util.View_Legal_Code”. What’s the problem with this? Actually, there are a number of benefits that I’ll miss and that I won’t get into here, but the real problem is that the rest of the translation world mostly doesn’t work this way. We use Transifex (and previously used Pootle) as the tool our translators use to manage our translations. Since these tools don’t expect logical keys, we had to write tools to convert from logical keys to English keys on upload and from English keys back to logical keys on the way back, plus a whole bunch of other crazy custom tooling.
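To make that round trip concrete, here is a minimal sketch of what such conversion tooling has to do. The catalogs are plain dictionaries, and the function names and sample strings are hypothetical stand-ins, not CC’s actual scripts or data.

```python
# Hypothetical sample data: logical key -> English source string.
ENGLISH = {
    "util.View_Legal_Code": "View Legal Code",
    "deed.retired": "This license has been retired.",
}

# Logical key -> translated string, as a logical-key catalog stores it.
SPANISH_LOGICAL = {
    "util.View_Legal_Code": "Ver el código legal",
}


def to_english_keys(logical_catalog, english):
    """Re-key a logical-key catalog by English source text (upload direction)."""
    return {
        english[key]: text
        for key, text in logical_catalog.items()
        if key in english
    }


def to_logical_keys(english_catalog, english):
    """Re-key an English-key catalog by logical key (the way back)."""
    reverse = {text: key for key, text in english.items()}
    return {
        reverse[source]: text
        for source, text in english_catalog.items()
        if source in reverse
    }


uploaded = to_english_keys(SPANISH_LOGICAL, ENGLISH)
downloaded = to_logical_keys(uploaded, ENGLISH)
```

With English keys everywhere, neither function is needed: the translation tool and the application read the same catalog.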

Another time suck has been that we’d love to be able to just dynamically extract all translations from our Python code and templates, but this also turns out to be impossible with our current setup. A strange edge case in ZPT, involving dynamic attributes in ZPT-translated HTML, means that we have to edit certain translations after they’re extracted, so we can’t rely on an auto-extracted set of translations.

So we’d like to move to a future with no or very few custom translation tools (which means we need English keys) and auto-extraction of translations (which, because of that edge case, means no ZPT). Since we need to move to a new templating engine, I decided that we should go with my personal favorite templating engine, Jinja2.

ZPT vs Jinja2

Aside from the issue I’ve described above, briefly I’d like to describe the differences between ZPT and Jinja2, as they’re actually my two favorite templating languages.

ZPT (Zope Page Templates) is an XML-based templating system where your tags and elements actually become part of the templating logic and structure. For example, here’s how we loop over a list of license versions on our “helpful” 404 pages, which are shown when you type in the wrong license URL (like http://creativecommons.org/by/2.33333/):

  <h4>Were you looking for:</h4>

  <ul class="archives" id="suggested_licenses">
    <li tal:repeat="license license_versions">
      <a tal:attributes="href license/uri">
        <b tal:content="python: license.title(target_lang)"></b>
      </a>
    </li>
  </ul>

As you can see, the for loop, the attributes, and the content are actually elements of the (X)HTML tree. The neat thing about this is that you can be mostly sure that you won’t end up with tag soup. It’s also pretty neat conceptually.

Now, let’s look at the same segment of code in Jinja2:

  <h4>Were you looking for:</h4>

  <ul class="archives" id="suggested_licenses">
    {% for license in license_versions %}
      <li>
        <a href="{{ license.uri }}">
          <b>{{ license.title(target_lang) }}</b>
        </a>
      </li>
    {% endfor %}
  </ul>

If you’ve used Django’s templating system before, this should look very familiar, because that’s the primary source of inspiration for Jinja2. There are a few things I like about Jinja2 that Django’s templating system doesn’t have, though, and the biggest and clearest of these is the ability to pass arguments into functions, as we’re doing here with license.title(target_lang). It massively beats making a template tag every time you want to pass an argument into a function.

The conversion process

Not too much to say about converting from ZPT to Jinja2. It’s really just a lot of manual work, combing through everything and moving it around.

More interesting might be our translation conversion process. Simply throwing out old translations and re-extracting new ones is not an option: it’s a lot of effort for translators to go through and translate things, and asking them to do it all over again is simply too much and just not going to happen. Pass 1 was to simply get the templates moved over rather than try to convert both the templates and the logical->English key system all at once (this move away from logical keys has been tried and fizzled before, probably because there are simply too many moving parts across our codebase, so we wanted to take this incrementally, and this seemed like the best place to start). We’re simply doing stuff like this:

  <h3>{{ cctrans(locale, "deed.retired")|safe }}</h3>

Where cctrans is a simple logical key translation function. Next steps:

  • Create a script that converts all our .po files to eliminate the logical keys and move them to English-only.
  • Write a script to auto-interpolate {{ cctrans() }} calls in templates to {% trans %}{% endtrans %} Jinja2 tags.
  • Do all the many manual changes to all our python codebases.
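
The second step above could be sketched roughly as follows. This is only an illustration under assumptions: the ENGLISH mapping and its sample string are hypothetical, the regular expression only handles the simple call form shown earlier, and anything with variables or unusual markup would still need converting by hand.

```python
import re

# Hypothetical logical key -> English source string mapping.
ENGLISH = {"deed.retired": "This license has been retired."}

# Matches simple calls such as {{ cctrans(locale, "deed.retired")|safe }}.
CCTRANS_CALL = re.compile(
    r"""\{\{\s*cctrans\(\s*locale\s*,\s*["']([^"']+)["']\s*\)\s*(?:\|\s*safe\s*)?\}\}"""
)


def rewrite_template(source, english):
    """Replace cctrans() calls with {% trans %} blocks holding the English text."""

    def replace(match):
        key = match.group(1)
        if key not in english:
            # Leave unknown keys untouched so they can be fixed by hand.
            return match.group(0)
        return "{% trans %}" + english[key] + "{% endtrans %}"

    return CCTRANS_CALL.sub(replace, source)


before = '<h3>{{ cctrans(locale, "deed.retired")|safe }}</h3>'
after = rewrite_template(before, ENGLISH)
```

Leaving unmatched keys in place is a deliberate choice here: a leftover cctrans() call is easy to grep for, while a silently wrong string is not.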

At that point, we should be able to wrap this all up.


More helpful 404 pages

cwebber, January 28th, 2011

This is one of those little features that tend to go into the license engine that runs on the creativecommons.org website: helpful and small, but not too noticeable unless pointed out. I usually do a pretty bad job of making note of these when they go out, but this time, I’m doing better!

Even most people who don’t know anything about HTTP know that a 404 status code on the web somehow means that the thing you were looking for isn’t actually there. How frustrating! But if it’s not there, maybe we have enough information to help you find what you actually wanted.

That’s the idea behind the work that went into Issue 255: “Smart” 404 pages. Maybe we didn’t find a license (or public domain tool) under the URL you put in, but we might be able to help you find a license that does exist. For example, licenses listed under /licenses/ on creativecommons.org are parsed out like /licenses/{code}/{version}/ or /licenses/{code}/{version}/{jurisdiction}/. Knowing that, we can give a list of the licenses someone might have meant when they:

The pages mostly look like a normal creativecommons.org 404 page, but with just a bit more contextually helpful information (the “were you looking for” section). And, of course, they still return a 404 status code!
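
As a rough sketch of the idea (not the actual cc.engine code), the path can be split into its parts and the license code used to look up versions that do exist. The KNOWN_VERSIONS table and the suggest_licenses name are assumptions made for this example; the real engine gets its license data elsewhere.

```python
import re

# Hypothetical catalogue: license code -> versions that exist.
KNOWN_VERSIONS = {
    "by": ["1.0", "2.0", "2.5", "3.0"],
    "by-sa": ["1.0", "2.0", "2.5", "3.0"],
}

# /licenses/{code}/{version}/ or /licenses/{code}/{version}/{jurisdiction}/
LICENSE_PATH = re.compile(
    r"^/licenses/(?P<code>[^/]+)/(?P<version>[^/]+)/(?:(?P<jurisdiction>[^/]+)/)?$"
)


def suggest_licenses(path):
    """Return paths of existing licenses that share the requested code."""
    match = LICENSE_PATH.match(path)
    if match is None:
        return []
    code = match.group("code")
    return [
        "/licenses/%s/%s/" % (code, version)
        for version in KNOWN_VERSIONS.get(code, [])
    ]
```

The view would render these suggestions into the “were you looking for” section and still send the response with a 404 status.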


cc.engine and web non-frameworks

cwebber, January 13th, 2010

The sanity overhaul has included a number of reworkings, one of them being a rewrite of cc.engine, which in its previous form was a Zope 3 application. Zope is a full-featured framework and we already knew we weren’t using many of its features (most notably the ZODB); we suspected that something simpler would serve us better, but weren’t certain what. Nathan suggested one of two directions: either we go with Django (although it wasn’t clear this was “simpler”, it did seem to be where a large portion of the Python knowledge and effort in the web world is pooling), or we go with repoze.bfg, a minimalist WSGI framework that pulls in some Zope components. After some discussion we both agreed that repoze.bfg seemed like the better choice, for a couple of reasons. For one, Django seemed like it would be providing quite a bit more than necessary: in cc.engine we don’t have a traditional database (we do have an RDF store that we query, but no SQL), we don’t have a need for a user model, and so on. The application is simple: show some pages and apply some specialized logic. Second, repoze.bfg built upon and reworked Zope infrastructure and paradigms, and so in that sense it looked like an easier transition. So we went forward with that.

As I went on developing it, I started to feel more and more that, while repoze.bfg certainly had some good ideas, I was having to create a lot of workarounds to support what I needed. For one thing, the URL routing is unordered and based on a ZCML config file. It got to the point where, for resolving the license views, I had to route to a view method that then called other view methods. We also needed the kind of functionality that Django provides with its “APPEND_SLASH=True” feature. I discussed this with the repoze.bfg people, and they were accommodating to the idea and actually applied it to their codebase for the next release. There were some other components they provided that were well developed but were not what we really needed (and were, besides, technically decoupled from repoze.bfg the core framework). As an example, the chameleon zpt engine is very good, but it was easier to just pull Zope’s template functionality into our system than to make the minor conversions necessary to go with chameleon’s zpt.

Repoze was also affecting the Zope queryutility functionality in a way that made internationalization difficult. Once again, this was done for reasons that make sense and are good within a certain context, but did not seem to mesh well with our existing needs. I was looking for a solution and reading over the repoze.bfg documentation when I came across these lines:

repoze.bfg provides only the very basics: URL to code mapping, templating, security, and resources. There is not much more to the framework than these pieces: you are expected to provide the rest.

But if we weren’t using the templating, we weren’t using the security model, and we weren’t using the resources, the URL mapping was making things difficult, and those were the things that repoze.bfg was providing on top of what was otherwise just WSGI + WebOb, how hard would it be to just strip things down to just the WSGI + WebOb layer? It turns out, not too difficult, and with an end result of significantly cleaner code.

I went through Ian Bicking’s excellent tutorial Another Do-It-Yourself Framework and applied those ideas to what we already had in cc.engine. Within a night I had the entire framework replaced with a single module, cc/engine/app.py, which contained these few lines:

import sys
import urllib

from webob import Request, exc

from cc.engine import routing

def load_controller(string):
    # Resolve a "package.module:function" string to the function object.
    module_name, func_name = string.split(':', 1)
    __import__(module_name)
    module = sys.modules[module_name]
    func = getattr(module, func_name)
    return func

def ccengine_app(environ, start_response):
    """
    Really basic wsgi app using routes and WebOb.
    """
    request = Request(environ)
    path_info = request.path_info
    route_match = routing.mapping.match(path_info)
    if route_match is None:
        # Nothing matched: for GET requests, retry with a trailing slash
        # and redirect there if that matches, keeping the query string.
        if not path_info.endswith('/') \
                and request.method == 'GET' \
                and routing.mapping.match(path_info + '/'):
            new_path_info = path_info + '/'
            if request.GET:
                # urllib.urlencode is the Python 2 location of this
                # function (urllib.parse.urlencode in Python 3).
                new_path_info = '%s?%s' % (
                    new_path_info, urllib.urlencode(request.GET))
            redirect = exc.HTTPTemporaryRedirect(location=new_path_info)
            return request.get_response(redirect)(environ, start_response)
        return exc.HTTPNotFound()(environ, start_response)
    # A route matched: load its controller and hand it the request.
    controller = load_controller(route_match['controller'])
    request.start_response = start_response
    request.matchdict = route_match
    return controller(request)(environ, start_response)

def ccengine_app_factory(global_config, **kw):
    # Entry point that returns the WSGI application.
    return ccengine_app

The main function of importance in this module is ccengine_app. This is a really simple WSGI application: it takes routes as defined in cc.engine.routing (which uses the very enjoyable Routes package) and sees if the current URL (or, the path_info portion of it) matches one of them. If it finds a result, it loads that controller and passes a WebOb-wrapped request into it, with any special URL matching data tacked onto the matchdict attribute. And actually, the only reason that this function is even this long is the “if route_match is None” block in the middle: that whole part provides APPEND_SLASH=True type functionality, as one would find in Django. (I.e., if you’re visiting the URL “/licenses”, and that doesn’t resolve to anything, but the URL “/licenses/” does, redirect to /licenses/.) The portions before and after are just getting the controller for a URL and passing the request into it. That’s all! (The current app.py is a few lines longer than this, using a callable class rather than a function in place of ccengine_app for the sake of configurability and attaching a few more things onto the request object, but it is not much longer or more complicated. The functionality otherwise is pretty much the same.)
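
For readers who want to try the trailing-slash fallback without installing WebOb or Routes, here is a dependency-free Python 3 sketch of the same dispatch idea. It is an illustration only, not the cc.engine module: the regex route table, the licenses_index controller, and the call helper are all made up for this example.

```python
import re


def licenses_index(environ, start_response):
    # Hypothetical controller standing in for a real view.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"licenses index"]


# Hypothetical route table: compiled pattern -> controller callable.
ROUTES = [(re.compile(r"^/licenses/$"), licenses_index)]


def match_route(path_info):
    """Return the first controller whose pattern matches, or None."""
    for pattern, controller in ROUTES:
        if pattern.match(path_info):
            return controller
    return None


def app(environ, start_response):
    path_info = environ.get("PATH_INFO", "")
    controller = match_route(path_info)
    if controller is not None:
        return controller(environ, start_response)

    # APPEND_SLASH-style fallback: only for GET, and only when the
    # slash-terminated path would actually match a route.
    if (not path_info.endswith("/")
            and environ.get("REQUEST_METHOD") == "GET"
            and match_route(path_info + "/") is not None):
        location = path_info + "/"
        query = environ.get("QUERY_STRING", "")
        if query:
            location = "%s?%s" % (location, query)
        start_response("307 Temporary Redirect", [("Location", location)])
        return [b""]

    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not Found"]


def call(path, query=""):
    """Invoke the app with a minimal environ; return (status, headers, body)."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path, "QUERY_STRING": query}
    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

For example, call("/licenses", "lang=es") yields a redirect to /licenses/?lang=es, while call("/nope") falls through to the 404 branch.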

Most interesting is that I swapped in the new app.py, changed over the routing, fired up the server and... it pretty much just worked. I swapped out a framework for a module of about 50 lines and everything was just as nice and functional as it was before. In fact, with the improved routing provided by Routes, I was able to cut out the fake routing view, and thus the amount of code was actually *less* than what it was before I stripped out the framework. Structurally there was no real loss either; the application still looks similar to what you’d see in a pylons/django/whatever application.

I’m still a fan of frameworks, and I think we are very fortunate to *have* Zope, Pylons, Django, Repoze.bfg, et cetera. But in the case of cc.engine I do believe that the position we are at is the right one for us; our needs are both minimal and special case, and the components out there for Python are plentiful, rich, and easily tied together. So it seems the best framework for cc.engine turned out to be no framework at all, and in the end I am quite happy with it.

ADDENDUM: Chris McDonough’s comments below are worth reading.  It’s quite possible that the issues I experienced were my own error, and not repoze.bfg’s.  I also hope that in no way did I give the impression that we moved away from repoze.bfg because it was a bad framework, because repoze.bfg is a great framework, especially if you are using a lot of zope components and concepts.  It’s also worth mentioning that the type of setup that we ended up at, as I described, probably wouldn’t have happened unless I had adapted my concepts directly from repoze.bfg, which does a great job of showing just how usable Zope components are without using the entirety of Zope itself.  Few ideas are born without prior influence; repoze.bfg was built on ideas of Zope (as many Python web frameworks are in some capacity), and so too was the non-framework setup I described here based on the ideas of repoze.bfg.  It is best for us to be courteous to giants as we step on their shoulders, but it is also easier to forget or unintentionally fail to extend that courtesy as I may have done here.  Thankfully I’ve talked to Chris offline and he didn’t seem to have taken this as an offense, so for that I am glad.


Caching deeds for peak performance

Nathan Yergler, January 6th, 2010

As Chris mentioned, he’s been working on improving the license chooser, among other things simplifying it and making it a better behaved WSGI citizen. That code also handles generating the license deeds. For performance reasons we like to serve those from static files; I put together some details about wsgi_cache, a piece of WSGI middleware I wrote this week to help with this, on my personal blog:

The idea behind wsgi_cache is that you create a disk cache for results, caching only the body of the response. We only cache the body for a simple reason—we want something else, something faster, like Apache or other web server, to serve the request when it’s a cache hit. We’ll use mod_rewrite to send the request to our WSGI application when the requested file doesn’t exist; otherwise it hits the on disk version. And cache “invalidation” becomes as simple as rm (and as fine grained as single resources).

You can read the full entry here, find wsgi_cache documentation on PyPI, and get the source code from our git repository.
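
To illustrate the body-only caching idea, here is a minimal sketch of such a middleware using only the standard library. It is not wsgi_cache’s actual code or API: the BodyDiskCache class, its file layout, and the deed_app stand-in are assumptions for this example, and the web server rewrite rule that serves the cached files is not shown.

```python
import os
import tempfile


class BodyDiskCache(object):
    """WSGI middleware that writes successful response bodies to disk."""

    def __init__(self, app, cache_dir):
        self.app = app
        self.cache_dir = cache_dir

    def cache_path(self, environ):
        # Map /licenses/by/3.0/ to <cache_dir>/licenses/by/3.0/index.html
        parts = environ.get("PATH_INFO", "/").lstrip("/").split("/")
        if ".." in parts:
            return None  # never write outside the cache directory
        if parts[-1] == "":
            parts[-1] = "index.html"
        return os.path.join(self.cache_dir, *parts)

    def __call__(self, environ, start_response):
        seen = {}

        def capture(status, headers, exc_info=None):
            # Remember the status, then pass everything through unchanged.
            seen["status"] = status
            return start_response(status, headers, exc_info)

        body = b"".join(self.app(environ, capture))
        path = self.cache_path(environ)
        if path is not None and seen.get("status", "").startswith("200"):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as handle:
                handle.write(body)  # only the body is cached
        return [body]


def deed_app(environ, start_response):
    # Stand-in for the application that renders a deed.
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<p>deed</p>"]


cache_dir = tempfile.mkdtemp()
cached_app = BodyDiskCache(deed_app, cache_dir)
body = b"".join(cached_app(
    {"PATH_INFO": "/licenses/by/3.0/"},
    lambda status, headers, exc_info=None: None,
))
```

Because only files on disk are involved, removing a cached file is all it takes for the next request to reach the application again.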


Understanding the State of Sanity (via whiteboards and ascii art)

cwebber, December 18th, 2009

Since I started working at Creative Commons a number of months ago, I’ve been primarily focused on something we refer to as the “sanity overhaul”.  In this case, sanity refers to an effort to simplify the long and complicated code history surrounding Creative Commons’ licenses, both in terms of the internal tooling for modifying, deploying, and querying licenses and the public-facing web interfaces for viewing and downloading them.  Efforts toward the sanity overhaul started before I began working here, carried out by both Nathan Yergler and Frank Tobia, but for a long time they were in a kind of limbo as technical effort had to be dedicated to other important tasks.  The good news is that my efforts have been permitted to be (almost) entirely dedicated toward the sanity overhaul since I started, and we are reaching a point where all of those pieces are falling into place and we are very close to launch.

To give an idea of the complexity of things as they were and how much that complexity has been reduced, it is useful to look at some diagrams.  When Nathan Kinkade first started working at Creative Commons (well before I did), Nathan Yergler took some time to draw on the whiteboard what the present infrastructure looked like:

as well as what he envisioned the “glorious future” (sanity) would look like:

When I started, the present infrastructure had shifted a little bit further still, but the vision of the “glorious future” (sanity) had mostly stayed the same.

This week (our “tech all-hands week”) I gave a presentation on the “State of Sanity”.  Preparing for that presentation I decided to make a new diagram.  Since I was already typing up notes for the presentation in Emacs, I thought I might try and make the most minimalist and clear ASCII art UML-like diagram that I could (my love of ASCII art is well known to anyone who hangs out regularly in #cc on Freenode).  I figured that I would later convert said diagram to a traditional image using Inkscape or Dia, but I was so pleased with the end result that I just ended up using the ASCII version:

*******************
* CORE COMPONENTS *
*******************

      .--.
     ( o_o)
     /'---\
     |USER| --.
     '----'   |
              |
              V
         ___   .---.
       .'   ','     '.
     -'               '.
    (     INTARWEBS     )
     '_.     ____    ._'
        '-_-'    '--'
              |
              |
              V
      +---------------+  Web interface user
      |   cc.engine   |  interacts with
      +---------------+
              |
              |
              V
      +---------------+  Abstraction layer for
      |  cc.license   |  license querying and
      +---------------+  pythonic license API
              |
              |
              V
      +---------------+  Actual rdf datastore and
      |  license.rdf  |  license RDF operation tools
      +---------------+  

****************
* OTHER PIECES *
****************

  +--------------+
  |  cc.i18npkg  |
  | .----------. |
  | | i18n.git | |
  +--------------+

********************************************
* COMPONENTS DEPRECATED BY SANITY OVERHAUL *
********************************************

  +------------+  +-----------+  +---------+  +-------------+
  |    old     |  | old zope  |  | licenze |  | license_xsl |
  | cc.license |  | cc.engine |  +---------+  +-------------+
  +------------+  +-----------+

This isn’t completely descriptive on its own, and I will be annotating it as I include it as part of the Sphinx developer docs we are bundling with the new cc.engine.  But I think that even without annotation, it is clear how much cleaner the new infrastructure is than the old “present infrastructure” whiteboard drawing, which means that we are making good progress!
