At Creative Commons sometimes we have python packages that we use a lot but which aren’t generally useful enough to put on PyPI (such as cc.license and cc.licenserdf). Luckily, as long as you and your organization don’t have a problem with having your packages available publicly, it’s fairly easy to put up a public egg basket (that is, a simple repository to put your python eggs/packages). And doing so has several advantages:
- You can avoid cluttering up PyPI with packages that have a very marginal or internal audience. cc.licenserdf matters a lot for our site, but is probably worthless to anyone searching for “RDF” tools on PyPI.
- Putting a package up on PyPI suggests a certain amount of responsibility to keep it updated and keep its API stable. If it’s for internal use, maybe that responsibility is unwanted or unneeded.
- Using proper python packaging means you can take full advantage of python’s ecosystem of installation tools, including useful dependency resolution.
- Making packaged releases encourages a certain amount of responsibility toward your internal dependencies, reducing the “cowboy coding” factor (fun, but often not good in an environment that requires stability).
So, how to do it? The first step is to make a directory that’s statically served by Apache. Ours is at http://code.creativecommons.org/basket/. The directory has the sticky bit set so that everyone on the server can write to it without clobbering each other, kind of like /tmp/ on most GNU/Linux installs. Everyone puts their tarballs up here. As for your own packages, or your organization’s, I think it’s fine to put all your eggs in just one basket like this.
But what should those packages look like?
Generally at the toplevel of your package you have something that looks like:
from setuptools import setup, find_packages
import sys, os

setup(
    name='licenseclarity',
    version='0.1',
    packages=['licenseclarity'],
    description="Tools to clarify any licensing issue you have ever had",
    author='John Doe(ig)',
    author_email='email@example.com',
    license='MIT',
    # ... etc ...
    # Put your dependencies here...
    install_requires=[
        'setuptools',
        'dependency1',
        'dependency2',
    ],
    # And a link to your basket here
    dependency_links=[
        'http://code.creativecommons.org/basket/',
    ],
)
Obviously, replace values with ones that make sense for your package. The main things we’re talking about here are install_requires and dependency_links. Replace install_requires with whatever list of dependencies your package requires, and put your own egg basket in the dependency_links list. After that, set whatever other attributes of the setup function make sense for your package (entry_points maybe? zip_safe=False is a nice one).
Lastly, maybe now that you have an egg basket, you’d like to know how to make some eggs to put in that basket. Pretty easy once you have setup.py! All you really need to do is this:
python setup.py sdist
Assuming things run well, do an `ls` into your dist/ directory. Hey, what do you know! There’s a tarball there. cp or scp or (heaven forbid) ftp it to your egg basket, and there you go. You just made a release! Maybe the next time you do it you’ll want to increment the version number.
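If you would rather script the copy step, here is a minimal sketch. It is not part of our tooling: the function name publish_sdist is made up for illustration, and it assumes the basket is a directory you can reach on the local filesystem (for a remote basket you would still use scp).

```python
import shutil
from pathlib import Path


def publish_sdist(dist_dir, basket_dir):
    """Copy the most recently built tarball from dist_dir into basket_dir.

    Hypothetical helper: assumes the basket is a directory reachable on
    the local filesystem (for example, a mounted share).
    """
    tarballs = sorted(Path(dist_dir).glob("*.tar.gz"),
                      key=lambda p: p.stat().st_mtime)
    if not tarballs:
        raise FileNotFoundError("no sdist tarballs found in %s" % dist_dir)
    newest = tarballs[-1]
    target = Path(basket_dir) / newest.name
    if target.exists():
        # Refuse to overwrite an existing release; bump the version instead.
        raise FileExistsError("%s is already in the basket" % newest.name)
    shutil.copy2(newest, target)
    return target
```

Refusing to overwrite an existing tarball is a design choice here: it nudges you toward incrementing the version number rather than silently replacing a release.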
Appendix: Why you should always list your dependencies in your package
Lastly I’d like to make a comment on only using buildout’s config file or pip’s requirements file to declare requirements: it’s completely crazy, don’t do it! Well okay, it’s not completely crazy, but you really should fill out the install_requires section in setup.py regardless of whether you are also using those tools. There are a couple of reasons for this:
- Recursive dependencies: maybe someone will use your module as a library, and then they’ll have to declare your dependencies all over again if you don’t put them in install_requires. Let python’s package management tools handle the dependency graph logic, that’s where it belongs.
- What happens if you build your tool in buildout and someone wants to use virtualenv/pip or vice versa? Both of these tools will check the install_requires section anyway, so be nice and fill it out!
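To make the recursive-dependency point concrete, here is a toy model of what an installer can do once every package declares its own install_requires. It is only a sketch: the package names in INDEX are made up, and real tools also handle versions and conflicts, which this ignores.

```python
def resolve(package, index, seen=None):
    """Return the set of all packages needed to install `package`.

    `index` maps a package name to the list it declares in
    install_requires. This is a toy depth-first walk, not the
    algorithm real installers use (no versions, no conflicts).
    """
    if seen is None:
        seen = set()
    for dep in index.get(package, []):
        if dep not in seen:
            seen.add(dep)
            resolve(dep, index, seen)
    return seen


# Hypothetical index: each package lists only its direct dependencies.
INDEX = {
    "myapp": ["licenseclarity"],
    "licenseclarity": ["dependency1", "dependency2"],
    "dependency1": [],
    "dependency2": ["dependency1"],
}

print(sorted(resolve("myapp", INDEX)))
# prints ['dependency1', 'dependency2', 'licenseclarity']
```

Note that "myapp" only had to name licenseclarity; the rest came along because licenseclarity declared its own dependencies.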
Of course there are some awesome things that buildout and the pip requirements file can do too; for example, they can install dependencies from version control, which is useful in particular cases, especially for in-development packages. By all means, use these tools to do that (and lots of other cool things, because both buildout and virtualenv/pip are pretty awesome). Just be a good packager and also fill out install_requires.
Now you have your own egg basket (which is completely vegan), you know how to make packages to put in it, and everyone is happy. Hooray! Today we learned things (maybe).
Thanks to Asheesh Laroia for his detailed feedback on this post.
Virtualenv and zc.buildout are both great ways to develop python packages and deploy collections of packages without needing to touch the system library. They are both fairly similar, but also fairly different.
The primary difference between them is that zc.buildout focuses on having a single package, and all relevant dependencies are installed automatically within that package’s directory via the buildout script (Nathan Yergler points out that you don’t have to use things this way, but that seems to me to be the way things happen in practice… anyway, I’m not a buildout expert). The buildout script is very automagical and does all the configuration and installation of dependencies for you. Since this is a build system, you can also configure it to do a number of other neat things, such as compile all your gettext catalogs, or scp the latest cheesefile.txt from themoon.example.org… whatever you need to do to build a package.
Virtualenv is mostly the same creature, but it’s like you reached your hand inside and pulled it inside out. Instead of a bunch of packages installed within a subdirectory of one package, there is a more generic directory layout that allows you to set up a number of packages within it. Installing a package and keeping it up to date is much more manual in general, but also a bit more flexible in the sense that you can switch paths around within the environment fairly easily and simultaneously developing multiple interwoven packages is not difficult.
I came to CC with a lot of experience with virtualenv and no experience with zc.buildout. Initially I could discern no differences of use case between them, but now I have a pretty good sense of when you’d want to use one over the other. An example use case, which has actually come up pretty often for me: say you have two packages, one of which is a dependency of the other. In this case, we’ll use cc.license and cc.engine, where cc.engine has cc.license as a dependency.
Now say I’m adding a feature to cc.engine, but this feature also requires that I add something in cc.license. At this point it is easier for me to switch to using virtualenv; I can set up both development packages in the same virtualenv and use them together. This is great because it means that I should have little to no difficulty switching back and forth between them. If I make a change in cc.license it is immediately available to me in cc.engine. This also saves me from either setting up a tedious-to-switch configuration that checks out cc.license inside cc.engine, or making a bunch of unnecessary releases just to make sure things work. In my experience, it’s easier to work on multiple packages at once in virtualenv.
Now let’s assume that we got things in working order: cc.license has the new feature, cc.engine is able to use it properly, tests are passing, et cetera. This is the point where I think returning to zc.buildout is a good idea. One of the things I like about zc.buildout is that it provides a certain type of integrity checking with the buildout command. If you forget to mark a dependency, or remove it from setup.py by accident, buildout will simply unlink it from your path the next time you run it. In this case, I think zc.buildout is especially useful because I might forget to make a cc.license release or some such thing. There are other reasons for using zc.buildout (as the name implies, buildout is a full build system, so there are a lot of neat things you can do with it), but for a forgetful person such as myself this is by far the most important (and the most relevant to this example).
So I’ve described use cases for both cc.engine and cc.license. How do we get them to work nicely together? Let’s assume we just want to check out these packages once. Let’s also assume that our virtualenv directory is ~/env/ccommons (because I’m clearly basing this off my own current setup, heh).
First, we’ll create our virtualenv environment, if we haven’t already:
$ virtualenv ~/env/ccommons
Next, we’ll check out cc.engine and cc.license into ~/devel/:
$ cd ~/devel/
$ git clone git://code.creativecommons.org/cc.license.git
$ git clone git://code.creativecommons.org/cc.engine.git
Next, we’ll buildout the packages:
$ cd ~/devel/cc.license
$ wget http://svn.zope.org/*checkout*/zc.buildout/trunk/bootstrap/bootstrap.py
$ python bootstrap.py
$ ./bin/buildout
$ cd ~/devel/cc.engine
$ python bootstrap.py # the cc engine already has bootstrap.py checked in
$ ./bin/buildout
Buildout can take a while, so be prepared to go grab some cookies and coffee and/or tea. But once it’s done, getting these packages set up in virtualenv is super simple.
First activate the virtualenv environment:
$ source ~/env/ccommons/bin/activate
$ cd ~/devel/cc.license
$ python setup.py develop
$ cd ~/devel/cc.engine
$ python setup.py develop
That’s it! Now we can verify that these packages are set up in virtualenv. Open python and verify that you get the following output (adjusted to your own home directory and etc):
>>> import cc.engine
>>> cc.engine.__file__
'/home/cwebber/devel/cc.engine/cc/engine/__init__.pyc'
>>> import cc.license
>>> cc.license.__file__
'/home/cwebber/devel/cc.license-git/cc/license/__init__.pyc'
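If you would rather not eyeball the paths, the same check can be scripted. This is a small sketch using only the standard library; the helper names module_location and loaded_from are made up for illustration.

```python
import importlib
import os


def module_location(name):
    """Import `name` and return the directory its code was loaded from.

    Assumes the module has a __file__ (regular packages do; pure
    namespace packages do not).
    """
    module = importlib.import_module(name)
    return os.path.dirname(os.path.abspath(module.__file__))


def loaded_from(name, checkout_dir):
    """True if module `name` was imported from somewhere under checkout_dir."""
    location = module_location(name)
    checkout = os.path.abspath(os.path.expanduser(checkout_dir))
    return os.path.commonpath([location, checkout]) == checkout
```

With the virtualenv active, something like loaded_from("cc.engine", "~/devel/cc.engine") should come back True if the develop install worked.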
To leave virtualenv, you can simply type “deactivate”.
That’s it! Now you have a fully functional zc.buildout AND virtualenv setup, where switching back and forth is super simple.
The sanity overhaul has included a number of reworkings, one of them being a rewrite of cc.engine, which in its previous form was a Zope 3 application. Zope is a full featured framework and we already knew we weren’t using many of its features (most notably the ZODB); we suspected that something simpler would serve us better, but weren’t certain what. Nathan suggested one of two directions: either we go with Django (although it wasn’t clear this was “simpler”, it did seem to be where a large portion of the python knowledge and effort in the web world is pooling), or we go with repoze.bfg, a minimalist WSGI framework that pulls in some Zope components. After some discussion we both agreed: repoze.bfg seemed like the better choice for a couple of reasons: for one, Django seemed like it would be providing quite a bit more than necessary… in cc.engine we don’t have a traditional database (we do have an RDF store that we query, but no SQL), we don’t have a need for a user model, etc… the application is simple: show some pages and apply some specialized logic. Second, repoze.bfg built upon and reworked Zope infrastructure and paradigms, and so in that sense it looked like an easier transition. So we went forward with that.
As I went on developing it, I started to feel more and more like, while repoze.bfg certainly had some good ideas, I was having to create a lot of workarounds to support what I needed. For one thing, the URL routing is unordered and based off a ZCML config file. It got to the point where, for resolving the license views, I had to route to a view method that then called other view methods. We also needed the kind of functionality that Django provides with its “APPEND_SLASH=True” feature. I discussed this with the repoze.bfg people, and they were accommodating to the idea and actually applied it to their codebase for the next release. There were some other components they provided that were well developed but were not what we really needed (and were, besides, technically decoupled from repoze.bfg the core framework). As an example, the chameleon zpt engine is very good, but it was easier to just pull Zope’s template functionality into our system than make the minor conversions necessary to go with chameleon’s zpt.
Repoze was also affecting the Zope queryutility functionality in a way that made internationalization difficult. Once again, this was done for reasons that make sense and are good within a certain context, but did not seem to mesh well with our existing needs. I was looking for a solution and reading over the repoze.bfg documentation when I came across these lines:
repoze.bfg provides only the very basics: URL to code mapping, templating, security, and resources. There is not much more to the framework than these pieces: you are expected to provide the rest.
But if we weren’t using the templating, we weren’t using the security model, and we weren’t using the resources, the URL mapping was making things difficult, and those were the things that repoze.bfg was providing on top of what was otherwise just WSGI + WebOb, how hard would it be to just strip things down to just the WSGI + WebOb layer? It turns out, not too difficult, and with an end result of significantly cleaner code.
I went through Ian Bicking’s excellent tutorial Another Do-It-Yourself Framework and applied those ideas to what we already had in cc.engine. Within a night I had the entire framework replaced with a single module, cc/engine/app.py, which contained these few lines:
import sys
import urllib

from webob import Request, exc

from cc.engine import routing


def load_controller(string):
    module_name, func_name = string.split(':', 1)
    __import__(module_name)
    module = sys.modules[module_name]
    func = getattr(module, func_name)
    return func


def ccengine_app(environ, start_response):
    """
    Really basic wsgi app using routes and WebOb.
    """
    request = Request(environ)
    path_info = request.path_info
    route_match = routing.mapping.match(path_info)

    if route_match is None:
        if not path_info.endswith('/') \
                and request.method == 'GET' \
                and routing.mapping.match(path_info + '/'):
            new_path_info = path_info + '/'
            if request.GET:
                new_path_info = '%s?%s' % (
                    new_path_info, urllib.urlencode(request.GET))
            redirect = exc.HTTPTemporaryRedirect(location=new_path_info)
            return request.get_response(redirect)(environ, start_response)
        return exc.HTTPNotFound()(environ, start_response)

    controller = load_controller(route_match['controller'])
    request.start_response = start_response
    request.matchdict = route_match
    return controller(request)(environ, start_response)


def ccengine_app_factory(global_config, **kw):
    return ccengine_app
The main function of importance in this module is ccengine_app. This is a really simple WSGI application: it takes routes as defined in cc.engine.routing (which uses the very enjoyable Routes package) and sees if the current URL (or rather, the path_info portion of it) matches one of them. If it finds a result, it loads that controller and passes a WebOb-wrapped request into it, with any special URL matching data tacked onto the matchdict attribute. And actually, the only reason that this function is even this long is the “if route_match is None” block in the middle: that whole part provides APPEND_SLASH=True type functionality, as one would find in Django. (I.e., if you’re visiting the URL “/licenses”, and that doesn’t resolve to anything, but the URL “/licenses/” does, redirect to /licenses/.) The portions before and after just get the controller for a URL and pass the request into it. That’s all! (The current app.py is a few lines longer than this, using a callable class rather than a function in place of ccengine_app for the sake of configurability and attaching a few more things onto the request object, but it is not much longer or more complicated. The functionality is otherwise pretty much the same.)
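To isolate the append-slash decision from WSGI and WebOb, here is a sketch of the same logic as a plain function. It is an illustration rather than code from cc.engine: resolve_path, ROUTES and known are made-up names, match stands in for routing.mapping.match, and it is written for Python 3 (hence urllib.parse).

```python
from urllib.parse import urlencode


def resolve_path(path_info, method, query, match):
    """Decide how to handle a request path.

    `match` is any callable returning a truthy value when a route exists
    for the given path (a stand-in for routing.mapping.match).
    Returns one of:
      ("dispatch", path)      a route matched directly
      ("redirect", location)  only the slash-terminated path matched
      ("not_found", None)     nothing matched
    """
    if match(path_info):
        return ("dispatch", path_info)
    if (not path_info.endswith("/")
            and method == "GET"
            and match(path_info + "/")):
        location = path_info + "/"
        if query:
            # Carry the query string over to the redirect target.
            location = "%s?%s" % (location, urlencode(query))
        return ("redirect", location)
    return ("not_found", None)


# Hypothetical route table for the demonstration.
ROUTES = {"/licenses/", "/choose/"}


def known(path):
    return path in ROUTES
```

Under these assumptions, resolve_path("/licenses", "GET", {}, known) yields a redirect to "/licenses/", while the same path with POST falls through to not_found, matching the GET-only check in the module above.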
Most interesting is that I swapped in this code, changed over the routing, fired up the server and... it pretty much just worked. I swapped out a framework for a module of about 50 lines and everything functioned just as nicely as before. In fact, with the improved routing provided by Routes, I was able to cut out the fake routing view, and thus the amount of code was actually *less* than what it was before I stripped out the framework. Structurally there was no real loss either; the application still looks similar to what you’d see in a pylons/django/whatever application.
I’m still a fan of frameworks, and I think we are very fortunate to *have* Zope, Pylons, Django, Repoze.bfg, et cetera. But in the case of cc.engine I do believe that the position we are at is the right one for us; our needs are both minimal and special case, and the components out there for python are numerous, rich, and easily tied together. So it seems the best framework for cc.engine turned out to be no framework at all, and in the end I am quite happy with it.
ADDENDUM: Chris McDonough’s comments below are worth reading. It’s quite possible that the issues I experienced were my own error, and not repoze.bfg’s. I also hope that in no way did I give the impression that we moved away from repoze.bfg because it was a bad framework, because repoze.bfg is a great framework, especially if you are using a lot of zope components and concepts. It’s also worth mentioning that the type of setup that we ended up at, as I described, probably wouldn’t have happened unless I had adapted my concepts directly from repoze.bfg, which does a great job of showing just how usable Zope components are without using the entirety of Zope itself. Few ideas are born without prior influence; repoze.bfg was built on ideas of Zope (as many Python web frameworks are in some capacity), and so too was the non-framework setup I described here based on the ideas of repoze.bfg. It is best for us to be courteous to giants as we step on their shoulders, but it is also easier to forget or unintentionally fail to extend that courtesy as I may have done here. Thankfully I’ve talked to Chris offline and he didn’t seem to have taken this as an offense, so for that I am glad.
I’m greatly pleased to announce liblicense 0.8.1. Steren and Greg found a number of major issues (Greg found a consistent crasher on amd64, and Steren found a consistent crasher in the Python bindings). These issues, among some others, are fixed by the wondrous liblicense 0.8.1. I mentioned to Nathan Y. that liblicense is officially “no longer ghetto.”
The best way to enjoy liblicense is from our Ubuntu and Debian package repository, at http://mirrors.creativecommons.org/packages/. More information on what liblicense does is available on our wiki page about liblicense. You can also get it in fresh Fedora 11 packages. And the source tarball is available for download from sourceforge.net.
P.S. MERRY CHRISTMAS!
The full ChangeLog snippet goes like this:
liblicense 0.8.1 (2008-12-24):
* Cleanups in the test suite: test_predicate_rw’s path joiner finally works
* Tarball now includes data_empty.png
* Dynamic tests and static tests treat $HOME the same way
* Fix a major issue with requesting localized informational strings, namely that the first match would be returned rather than all matches (e.g., only the first license of a number of matching licenses). This fixes the Python bindings, which use localized strings.
* Add a cooked PDF example that actually works with exempi; explain why that is not a general solution (not all PDFs have XMP packets, and the XMP packet cannot be resized by libexempi)
* Add a test for writing license information to the XMP in a PNG
* Fix a typo in exempi.c
* Add basic support for storing LL_CREATOR in exempi.c
* In the case that the system locale is unset (therefore, is of value “C”), assume English
* Fix a bug with the TagLib module: some lists were not NULL-terminated
* Use calloc() instead of malloc()+memset() in read_license.c; this improves efficiency and closes a crasher on amd64
* Improve chooser_test.c so that it is not strict as to the *order* the results come back so long as they are the right licenses.
* To help diagnose possible xdg_mime errors, if we detect the hopeless application/octet-stream MIME type, fprintf a warning to stderr.
* Test that searching for unknown file types returns a NULL result rather than a segfault.
Some major updates: we have the scripts running. Thanks to Asheesh for the redirection idea; it works, but I couldn’t get it to give me a progress bar, since everything was being redirected to the file. I tried using two different functions, but they needed a shared variable, so that failed. Still, it was nice, since I ended up with “real” python files with a main().
The journey was interesting: we went from trying >> inside python, to including # -*- coding: UTF-8 -*- and # coding: UTF-8 to get it to work, and after a few more bumps finally figured out __main__.
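One possible way around the progress-bar problem is to stop redirecting stdout altogether: write results to a file the script opens itself, and send progress to stderr. The sketch below is only an illustration of that idea, not the actual licChange code; the function names and the placeholder "analysis" (upper-casing each line) are invented.

```python
import sys


def process(lines, results, progress=sys.stderr):
    """Write one result per input line to `results`; report progress separately."""
    total = len(lines)
    for number, line in enumerate(lines, start=1):
        results.write(line.strip().upper() + "\n")  # placeholder "analysis"
        progress.write("\rprocessed %d/%d" % (number, total))
    progress.write("\n")
    return total


def main(argv):
    """Entry point: argv[1] is the input file, argv[2] the results file."""
    with open(argv[1], encoding="utf-8") as source:
        lines = source.readlines()
    with open(argv[2], "w", encoding="utf-8") as results:
        process(lines, results)
    return 0

# In a standalone script, finish with:
#     if __name__ == "__main__":
#         sys.exit(main(sys.argv))
```

Because the results file and the progress stream are passed in as separate arguments, no shared variable is needed between the two.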
I still need to update all the scripts, but licChange, which is at the forefront of all the latest developments, just got bumped up to version 8.2 (which reminds me of a dire need to update GIT:Loggy!).
This also gave me an idea of how to go about getting data out of S3 for “free”: S3 to EC2 is free, SCP from EC2 is free, and voila! Why would I ever want to do that? Well, for starters, the EC2 AMI runs out of space around 5 GB (note: logs for i.creativecommons.org are 4.7 GB), and secondly, the scripts seem to run faster locally. The icing on the cake: I wouldn’t have to scp the result files being generated. I could possibly automate the process of running the scripts.
That’s all for now … class at 0830 Hrs in the morning (it’s criminal, I know).
I guess I’ll just have to keep at it.
So this is where we are.
Now we have EC2, we have S3Sync ruby scripts on the EC2 AMI to pull the data from S3, and we have updated python scripts that read one line at a time and use Geo-IP (which was surprisingly easy to install once GCC was functional and the right versions of the C and Python modules were obtained). So deployment is at full throttle; one final bug fix for generating the final results and we are done.
So, now back to the python code. Now we have 4 scripts:
- License Change (Logs for i.creativecommons.org) [Version 7]
- License Chooser (Logs for creativecommons.org) [Version 5]
- CC Search (Logs for search.creativecommons.org) [Version 4]
- Deeds (Logs for creativecommons.org/licenses/*) [Version 2]
Each of these polls a directory for new logs, reads each new log in that directory line by line, and uses regular expressions to parse the information into usable statistics. Throughout the development phase, the results were written to stdout. With deployment, they now need to be written to a file, which, interestingly, is still to be resolved. (TypeError: ‘str’ object is not callable sound familiar to anyone?)
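The line-by-line approach described above can be sketched roughly as follows. This is not one of the four scripts: it assumes Apache-style access log lines, and the regular expression, function names and the choice of statistic (successful requests per path) are all assumptions for illustration.

```python
import re
from collections import Counter

# Assumed input: Apache-style access log lines. The real logs and the
# statistics the scripts extract may differ.
REQUEST = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def count_paths(lines):
    """Tally successful requests per path, reading one line at a time."""
    counts = Counter()
    for line in lines:
        found = REQUEST.search(line)
        if found and found.group("status") == "200":
            counts[found.group("path")] += 1
    return counts


def write_counts(counts, out):
    """Write 'path<TAB>count' rows, most requested first, to a file object."""
    for path, count in counts.most_common():
        out.write("%s\t%d\n" % (path, count))
```

Passing an open file object to write_counts keeps the writing step separate from the parsing. As an aside, one common cause of the TypeError quoted above is rebinding a built-in name such as open or file to a string; whether that was the culprit here is only a guess.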
I am grateful to Asheesh (whom I should have totally bugged more). I should’ve put more work into the project when vacationing back home; having less to do at school would’ve helped too (studies + 3 research projects is not a recommended workload), but if it were easy, it wouldn’t be fun! Oh well, I learnt a fair bit through the project, and with a bit more troubleshooting we’d be good to go … for now!
Google Summer of Code 2008 approaches its end, as less than forty-eight hours are left to submit the code that will then be evaluated by mentors. It is therefore fitting to pause for a moment, sum up the work that has been done on the license-oriented metadata validator and viewer, and compare it with the original proposal for the project.
A Web application capable of parsing and displaying license information embedded in both well-formed and ill-formed Web pages has been developed. It supports the following means of embedding license information: Dublin Core metadata, RDFa, RDF/XML linked externally or embedded (utilising the data URL scheme) using the link and a elements, and RDF/XML embedded in a comment or as an element (the last two being deprecated). This functionality has been proven by unit testing. The source code of a Web page can be uploaded or pasted by a user, there is also a possibility to provide a URI for the Web application to analyse it. The software has been written in Python and uses the Pylons Web Framework and the Genshi toolkit. Should you be willing to test this Lynx-friendly application, please visit its Web site.
The Web application itself uses a library called “libvalidator”, which in turn is powered by cc.license (a library developed by Creative Commons that returns information about a given license), pyRdfa (a distiller that generates the RDF triples from an (X)HTML+RDFa file), html5lib (an HTML parser/tokenizer), and RDFLib (a library for working with RDF). The choice of this set of tools was not obvious, and the library underwent several redesigns, which included removing the code that employed encutils, XML canonicalization, µTidylib, and BeautifulSoup. The idea of using librdf, librdfa, and rdfadict was abandoned. The source code of both the Web application (licensed under the GNU Affero General Public License version 3 or newer) and its core library (licensed under the GNU Lesser General Public License version 3 or newer) is available through the Git repositories of Creative Commons.
In contrast to the contents of the original proposal, the following goals have not been met: traversal of special links, syndication feeds parsing, statistics, and cloning the layout of the Creative Commons Web site. However, these were never mandatory requirements for the Web application. It is also worth noting that the software has been written from scratch, although a now-defunct metadata validator existed. Nevertheless, the development does not end with Google Summer of Code — these and several new features (such as validation of multimedia files via liblicense and support for different language versions) are planned to be added, albeit at a slower pace.
After the test period, the validator will be available under http://validator.creativecommons.org/.
Not that I am expecting much trouble coding with the Geo-IP module, but trying to get it onto the system itself has me believing that this module is out to get me! First, Mac OS X (Leopard) doesn’t come with GCC installed (shocker!) and this module needs building, so I go to get it. GCC is packaged with the developer tools, which is about a 2 GB install, and I can’t hand-pick the components … fail. So I go get myself darwin ports, and try that route. It installs, gives me the sweet *ding* install-complete sound, and when I go to terminal … fail … no such file or directory. So I give in to its terrorist demands and make room for the developer tools, thinking I’ll make up for it by actually using them. So I wait 19 minutes for it to complete installing, I check I have GCC [i686-apple-darwin9-gcc-4.0.1] … happily I go and python setup.py build … and what followed was not nice … a screen full of Warnings and Errors and No Build. =(
I am going to find another source and try again till it finally works!
In other news, I’m changing all my code into methods and adding append-to-file for results, and looking to add file-list comparison as a feature. Coming soon to a GIT repository near you!
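The file-list comparison could look something like the sketch below. It is only a guess at the idea, with a made-up function name: compare what is in the log directory now against the names already processed.

```python
import os


def new_logs(directory, processed):
    """Return log file names in `directory` that are not in `processed`."""
    current = set(os.listdir(directory))
    return sorted(current - set(processed))
```

The list of processed names would have to be saved somewhere between runs (a plain text file would do); that part is left out here.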
The source code is located in two dedicated git repositories. The first is validator, which contains the source code of the Web application based on Pylons and Genshi. The second is libvalidator, which hosts the files that constitute the core library that the project will utilise. This is the component that the development focuses on right now.
The purpose of the aforementioned library is to parse input files, scan them for relevant license information, and output the results in a machine-readable fashion. More precisely, its workflow is the following: parse the file and associated RDF information so that a complete set of RDF data is available, filter the results with regard to license information (not only related to the document itself, but also to other objects described within it), and return the results in a manner preferable for the usage by the Web application.
pyRdfa seems to be the best tool for the parsing stage so far. It handles the current recommendation for embedding license metadata (namely RDFa) as well as other non-deprecated methods: linking to external or embedded (using the “data” URL scheme) RDF files and utilising the Dublin Core. The significant gap is the invalid direct embedding of RDF/XML within the HTML/XHTML source code (as an element or in a comment), which it does not handle; this is resolved by first capturing all such instances using a regular expression and then parsing the data just like external RDF/XML files.
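The capture step might be sketched as follows. This is an illustration only, not the expression libvalidator uses: it assumes the conventional "rdf" namespace prefix, and the names RDF_BLOCK and extract_rdf_blocks are made up.

```python
import re

# Matches an <rdf:RDF ...> ... </rdf:RDF> block wherever it appears,
# including inside an HTML comment. Assumes the conventional "rdf" prefix.
RDF_BLOCK = re.compile(r"<rdf:RDF\b.*?</rdf:RDF\s*>", re.DOTALL)


def extract_rdf_blocks(html):
    """Return every embedded RDF/XML block found in the page source."""
    return RDF_BLOCK.findall(html)
```

Each returned string can then be handed to an RDF/XML parser in the same way as an externally linked file. A page that binds the RDF namespace to a different prefix would slip past this pattern.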
Once the RDF triples are extracted, one can use SPARQL to narrow the results just to the triples related to the licensed objects. Both librdf and rdflib support this language. Moreover, the RDF/XML related to the license must be parsed, so that its conditions (permissions, requirements, and restrictions) are then presented to the user.
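As a dependency-free stand-in for the SPARQL query, the narrowing step can be pictured as a filter over plain (subject, predicate, object) tuples. The set of predicates below is an assumption (the list that libvalidator actually checks is not given here), and licensed_objects is a made-up name.

```python
# Predicates commonly used to point at a license. The exact set that
# libvalidator checks is not stated here, so treat this list as an assumption.
LICENSE_PREDICATES = {
    "http://creativecommons.org/ns#license",
    "http://www.w3.org/1999/xhtml/vocab#license",
    "http://purl.org/dc/elements/1.1/rights",
    "http://purl.org/dc/terms/license",
}


def licensed_objects(triples):
    """Map each subject to the license URIs asserted about it.

    `triples` is an iterable of (subject, predicate, object) strings.
    """
    licenses = {}
    for subject, predicate, obj in triples:
        if predicate in LICENSE_PREDICATES:
            licenses.setdefault(subject, []).append(obj)
    return licenses
```

Grouping by subject reflects the point made earlier: a page may license not only itself but also other objects described within it.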
The library takes advantage of standard Python tools such as Buildout and nose. When it is completed, the project will be all about writing a Web application that will serve as an interface to libvalidator.
Creative Commons participates in Google Summer of Code™ and has accepted a proposal (see the abstract) by Hugo Dworak, based on its description of a task to rewrite its now-defunct metadata validator. Asheesh Laroia has been assigned as the mentor of the project. The work began on May 26th, 2008 as per the project timeline. It is expected to be completed in twelve weeks. More details will be provided in the dedicated CC Wiki article and the progress will be featured weekly on this blog.
The project focuses on developing an on-line tool (free software written in Python) to validate digitally embedded Creative Commons licenses within files of different types. Files will be pasted directly into a form, identified by a URL, or uploaded by a user. The application will present the results in a human-readable fashion and notify the user if the means used to express the license terms are deprecated.