We’ve discovered that some of our websites, in particular those that rely on MySQL, are very sensitive to spikes in disk I/O load. Right now, we do run some non-interactive services on the same machines as some of our websites.
Subversion and git in particular seem to cause long-duration high disk load, which causes Nathan Kinkade to get paged when e.g. wiki.creativecommons.org takes too long to load. We have found that using ionice to set background activities to “idle” priority is very useful in avoiding sending text messages to NK.
However, ionice can only be run by root, meaning regular users can’t even request the system be more gentle. So I wrote a simple tool, “ionicer,” that is a setuid-root C tool that sets its parent process’s IO priority to idle.
I then used dpkg-divert to replace /usr/bin/svnserve and /usr/bin/git with simple shell wrappers that call ionicer before calling the real binaries. So the call path goes:
- user connects with svn+ssh to code.creativecommons.org
- user logs in with an SSH key and executes “svnserve.”
- svnserve is really a shell script. /bin/bash runs a script which does two things:
- Runs ionicer, which changes the shell to I/O priority class idle, and
- Executes svnserve.real with the same arguments as were passed into the wrapper.
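The wrapper step can be sketched in Python. This is a minimal illustration, not the actual shell wrapper or the setuid C tool: it swaps in a call to the standard `ionice` command (`-c 3` selects the idle class, `-p` names the process) and assumes the caller is permitted to select that class. The helper names `ionice_command`, `real_binary` and `run_wrapped` are hypothetical.

```python
import os
import subprocess

IDLE_CLASS = "3"  # ionice's numeric code for the idle scheduling class


def ionice_command(pid):
    """Build the ionice invocation that moves `pid` to the idle I/O class."""
    return ["ionice", "-c", IDLE_CLASS, "-p", str(pid)]


def real_binary(wrapper_path):
    """Map the diverted wrapper path to the renamed real binary."""
    return wrapper_path + ".real"


def run_wrapped(wrapper_path, args):
    """Lower our own I/O priority, then replace this process with the real tool."""
    # Assumption: ionice is on PATH and the caller may select the idle class.
    subprocess.call(ionice_command(os.getpid()))
    target = real_binary(wrapper_path)
    # exec keeps the same PID, so the idle class carries over to the real binary.
    os.execv(target, [target] + list(args))
```

Calling `run_wrapped("/usr/bin/svnserve", sys.argv[1:])` from a wrapper script would then pass the original arguments straight through to svnserve.real.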
We use Pootle for handling http://translate.creativecommons.org/, the site where our international affiliates and other CC community members can help us out by translating CC content into other languages. Currently we only request translations this way for core CC infrastructure like the license choose.
For a few months, I had been working on a replacement for Pootle that would better fit our needs. Mozilla, as it happens, has similar issues, and when I recently investigated, I found that they were working on improvements to Pootle. So enough working alone; we’ll work with them! Their project to improve Pootle is called “Verbatim,” and I encourage all interested in web-based translation software to read more at that link.
Nathan Yergler and I sent an email to our international affiliates email list, and since I have a lot to do before tomorrow, I’ll let it speak for itself:
We just pushed a new feature set to CC Network; most of these are OpenID related, and I’ll try to write more in depth about the interesting ones in the near future:
- You can now elect to “always trust” an OpenID site
- Commoner (the code behind the CC Network) now supports the Simple Registration extension to OpenID
- We have preliminary (experimental) support for Verisign’s OpenID Seatbelt extension
- We’ve added a link to “add another” after you register a work
- You can now select a full size or “thin” network badge (similar to the license badges)
- We’ve added a link in the head of profile and work list pages that points to the RDF/XML file; we still consider RDFa the primary vehicle but this may help others ingest the data
Django is an amazing web framework; we built a lot of features in a very short period of time and Django [mostly] stayed out of our way. Last night as I was working on today’s feature upgrade for creativecommons.net I decided to tackle what lots of people see as its major weakness: schema migration. Rather, the lack of an integrated migration story.
I had seen some of the tools floating around and decided to watch the panel from Djangocon to get a better overview. For the record the represented tools are dmigrations, south and django-evolution. At some point while watching the video I think I was convinced each was the right solution; they all have features/use cases to recommend them.
Selecting the correct tool is an exercise in change management: it seems almost certain that Django will eventually adopt or create a “blessed” migration tool. And at that point, we need a way to move forward. Because of this I wound up choosing dmigrations. Sure, it doesn’t do some of the fancy stuff that south and django-evolution do (dependency tracking, model “fingerprinting”) but it does let us dump out the entire migration path as raw SQL, and that’s something I can easily work with when it comes time to recreate our database on the “real” platform.
Last week we launched the Creative Commons Network as part of our annual campaign. The Network is a platform for exploring digital copyright registries and has features we think will appeal to our community. Creators can register their works and add a badge to their web pages to show more information about their identity on the license deeds (for more information see this post on the CC blog).
Like everything we do at Creative Commons, the code behind the CC Network is free.
commoner is a Django-based Python application that runs the code behind creativecommons.net. While we don’t expect someone would want to run a site exactly like the CC Network, we think there are plenty of opportunities for different registries that serve different communities to flourish. The enhanced deeds aren’t dependent on anything in the CC infrastructure stack — we’re just consuming metadata published on the pages.
- Increase transparency regarding what we are doing and plan to do. If you find a bug or suggest an idea, we’d like to make sure it’s tracked in a publicly accessible location where everyone can follow along.
- Tangentially, we’d like to make it easier for people to contribute to the work we’re doing. We [semi-]frequently hear people say they’d like to help, but don’t know where to start. We’ve had Developer Challenges forever but they’re not easy to find and poorly maintained. I’m personally hoping that keeping the ideas in the same system we [developers] use every day will keep them in the forefront of our minds.
With respect to the first, we’re initially tracking bugs here for three projects: the license engine, Herder (a translation tool we’ll be rolling out real soon now), and CC Learn’s Universal Education Search project. Feel free to create bugs, wishes, features for any CC project; we’ll create the Project identifiers as we go.
With respect to challenges, I’ve created a community keyword we’ll assign to projects that it’s unlikely we’ll tackle, but which might be appropriate for someone in the community who wants to contribute. Luis’ idea from earlier this week is the first. I hope we have a giant pile of ideas (and a corresponding giant pile of completed ideas) by next year’s Summer of Code.
So after setting up EC2, S3, grabbing the files from S3, SCP-ing the python scripts and running them, one would expect to see some results. Upon the polite request of Asheesh here is a sampler.
The first script (dealing with urls that change their license, named licChange.py) results in an output which lists the URLs (that change their license [type, version or jurisdiction]), the license info and the date(s) of change:
http://blog.aikawa.com.ar/ [['by-nc-sa', '2.5', 'ar'], ['by-nc-nd', '2.5', 'ar']] ['21/Sep/2007:11:38:56 +0000', '22/Sep/2007:05:40:22 +0000']
The line above shows that the license for the URL ‘http://blog.aikawa.com.ar/’ was changed from ‘by-nc-sa 2.5 Argentina’ to ‘by-nc-nd 2.5 Argentina’ some time between 11:38:56 GMT on the 21st of September 2007 and 05:40:22 GMT on the 22nd of September 2007. The format may seem a bit awkward, but you can expect a facelift for the results file. I was previously planning to re-read the file to generate statistics, but we can have a separate file for storing data and another one for the stats.
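For anyone who wants to re-read a results line, here is a rough sketch of how it could be parsed back into Python values. It assumes the file uses plain ASCII quotes (typographic quotes would need replacing first) and that each line has exactly the shape shown above; `parse_result_line` is a hypothetical helper, not part of licChange.py.

```python
import ast
from datetime import datetime

# Apache-style timestamp, e.g. 21/Sep/2007:11:38:56 +0000
STAMP_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_result_line(line):
    """Split one results line into (url, licenses, timestamps)."""
    url, rest = line.strip().split(" ", 1)
    # The license list ends with "]]" and the date list starts with "[".
    licenses_text, dates_text = rest.split("]] [", 1)
    licenses = ast.literal_eval(licenses_text + "]]")
    dates = ast.literal_eval("[" + dates_text)
    stamps = [datetime.strptime(date, STAMP_FORMAT) for date in dates]
    return url, licenses, stamps


sample = (
    "http://blog.aikawa.com.ar/ "
    "[['by-nc-sa', '2.5', 'ar'], ['by-nc-nd', '2.5', 'ar']] "
    "['21/Sep/2007:11:38:56 +0000', '22/Sep/2007:05:40:22 +0000']"
)
url, licenses, stamps = parse_result_line(sample)
```

Using `ast.literal_eval` rather than `eval` means only literal lists and strings are accepted, which seems a safer choice for data that ultimately came from web logs.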
Similarly, the following lines out of the results file for licChange.py from 2007-09 show license changes for ‘http://0.0.0.0:3000/’ and ‘http://127.0.0.1/actibands/castellano/licencias.htm’ (there are many other internal URLs like these):
http://0.0.0.0:3000/ [['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-nd', '3.0', 'nl'], ['by-nc-nd', '3.0', 'nl']] ['17/Sep/2007:08:10:28 +0000', '17/Sep/2007:17:50:28 +0000', '18/Sep/2007:16:25:47 +0000', '19/Sep/2007:13:03:23 +0000', '19/Sep/2007:13:11:16 +0000', '20/Sep/2007:22:16:09 +0000', '20/Sep/2007:22:16:39 +0000']
http://127.0.0.1/actibands/castellano/licencias.htm [['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es'], ['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es'], ['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es']] ['27/Sep/2007:20:50:44 +0000', '27/Sep/2007:20:50:44 +0000', '27/Sep/2007:20:51:00 +0000', '27/Sep/2007:20:51:00 +0000', '27/Sep/2007:20:51:23 +0000', '27/Sep/2007:20:51:23 +0000']
The most recent licenses for http://0.0.0.0:3000/ are ported for the Netherlands (nl) and the ones for http://127.0.0.1/actibands/castellano/licencias.htm are ported for Spain (es). Note that presently every occurrence of a URL that changes its license is output; this will be changed in the next nightly build, which will include a better-formatted results file with stats on the total number of URLs changing licenses, and even stats distinguishing license changes from version changes.
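One possible way to trim the repeated occurrences and tell the kinds of change apart is sketched below. It is only an illustration of the idea, not the code planned for the nightly build; `collapse_repeats`, `changed_fields` and the field names are hypothetical, and each license is assumed to be a [type, version, jurisdiction] list as in the output above.

```python
from itertools import groupby

FIELDS = ("type", "version", "jurisdiction")


def collapse_repeats(licenses):
    """Keep one entry per run of identical consecutive licenses."""
    return [license_info for license_info, _ in groupby(licenses)]


def changed_fields(old, new):
    """Name the fields that differ between two license entries."""
    return [name for name, before, after in zip(FIELDS, old, new) if before != after]


observed = [
    ["by-nc-sa", "3.0", ""], ["by-nc-sa", "3.0", ""], ["by-nc-sa", "3.0", ""],
    ["by-nc-sa", "3.0", ""], ["by-nc-sa", "3.0", ""],
    ["by-nc-nd", "3.0", "nl"], ["by-nc-nd", "3.0", "nl"],
]
distinct = collapse_repeats(observed)
kinds = changed_fields(distinct[0], distinct[1])
```

For the http://0.0.0.0:3000/ line this leaves two entries, and the difference between them is in the license type and the jurisdiction, while the version stays at 3.0.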
Akin to this (licChange.py) there are 3 more scripts, licChooser.py, licSearch.py and deedLogs.py.
licChooser.py grabs metadata usage information and generates stats in absolute numbers and as a percentage of all entries, e.g.: “16 out of 100 items are tagged as Audio [16%] of total entries and 29% of items with Metadata”
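The arithmetic behind such a line is simple enough to sketch. The function below is hypothetical, and the figure of 55 items with metadata is an assumption chosen only because it reproduces the 29% in the example; the real count is not given here.

```python
def usage_summary(tagged, total, with_metadata, label):
    """Describe how often a metadata label is used, against two baselines."""
    share_of_total = round(100 * tagged / total)
    share_of_metadata = round(100 * tagged / with_metadata)
    return (
        f"{tagged} out of {total} items are tagged as {label} "
        f"[{share_of_total}%] of total entries and "
        f"{share_of_metadata}% of items with Metadata"
    )


# 55 is an assumed count of items carrying any metadata at all.
summary = usage_summary(tagged=16, total=100, with_metadata=55, label="Audio")
```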
licSearch.py grabs information from the logs for search.creativecommons.org like the query, the engine and the search options (commercial use and derivatives).
deedLogs.py looks at the logs for the deed pages, employs MaxMind GeoIP to do a location lookup and grabs the deed page being looked at.
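The "which deed page" half of that can be sketched with a regular expression over the request path. This is an assumption about how one might do it, not deedLogs.py itself: the pattern below is hypothetical, it only covers paths of the form /licenses/<code>/<version>/<jurisdiction>/, and the GeoIP lookup is left out.

```python
import re

# Hypothetical pattern for deed paths such as /licenses/by-nc-sa/2.5/ar/
DEED_PATH = re.compile(
    r"^/licenses/(?P<code>[a-z+-]+)/(?P<version>\d+\.\d+)/"
    r"(?:(?P<jurisdiction>[a-z]{2,})/)?"
)


def parse_deed_path(path):
    """Return [code, version, jurisdiction] for a deed path, or None."""
    match = DEED_PATH.match(path)
    if match is None:
        return None
    # Unported deeds have no jurisdiction segment; use '' as licChange.py does.
    return [match.group("code"), match.group("version"), match.group("jurisdiction") or ""]
```

Returning the same [type, version, jurisdiction] shape as the licChange.py output keeps the results of the different scripts comparable.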
So this is what we have so far.
Some major updates, and we have the scripts running. Thanks to Asheesh for the redirection idea; it works, but I couldn’t get it to give me a progress bar since everything was being redirected to the file. I tried using two different functions, but they needed a shared variable, so that failed. Still, it was nice, since I ended up with “real” Python files with a main().
The journey was interesting: we went from trying >> inside Python, to including # -*- coding: UTF-8 -*- and # coding: UTF-8 to get it to work, and after a few more bumps finally figured out __main__.
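One common way around the lost progress bar, sketched here as a suggestion rather than what the scripts actually do, is to write results to stdout (which gets redirected to the file) and progress to stderr (which stays on the terminal). The `summarize` function and its placeholder "processing" are hypothetical.

```python
import sys


def summarize(lines, results=None, progress=None):
    """Write one result per input line, reporting progress on a separate stream."""
    results = sys.stdout if results is None else results
    progress = sys.stderr if progress is None else progress
    total = len(lines)
    for number, line in enumerate(lines, start=1):
        # Placeholder for the real parsing: just normalise the line.
        results.write(line.strip().lower() + "\n")
        # "\r" redraws the same terminal line, giving a crude progress counter.
        progress.write(f"\rprocessed {number}/{total}")
    progress.write("\n")
```

Run as `python script.py > results.txt`, only the results land in the file, and the call to `summarize` can sit in a main() invoked from an `if __name__ == "__main__":` guard.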
I still need to update all the scripts, but licChange, which is at the forefront of all the latest developments, just got bumped up to version 8.2 (which reminds me of a dire need to update GIT:Loggy!).
This also gave me an idea of how to go about getting data out of S3 for “free” … S3 to EC2 is free … SCP from EC2 is free, and voila! Why would I ever want to do that? Well, for starters, the EC2 AMI runs out of space around 5 GB (note: logs for i.creativecommons.org are 4.7 GB), and secondly, the scripts seem to run faster locally. The icing on the cake: I wouldn’t have to scp the result files being generated. I could possibly automate the process of running the scripts.
That’s all for now … class at 0830 hrs in the morning (it’s criminal, I know).
I guess I’ll just have to keep at it.
So this is where we are.
Now we have EC2, we have S3Sync Ruby scripts on the EC2 AMI to pull the data from S3, and we have updated Python scripts that read one line at a time and use GeoIP (which was surprisingly easy to install once GCC was functional and the right versions of the C and Python modules were obtained). So deployment is at full throttle; one final bug fix for generating the final results and we are done.
So, now back to the python code. Now we have 4 scripts:
- License Change (Logs for i.creativecommons.org) [Version 7]
- License Chooser (Logs for creativecommons.org) [Version 5]
- CC Search (Logs for search.creativecommons.org) [Version 4]
- Deeds (Logs for creativecommons.org/licenses/*) [Version 2]
Each of these polls a directory for new logs, reads each new log in that directory line by line, and uses regular expressions to parse the information into usable statistics. Throughout the development phase, the results were written to stdout (the console). With deployment, they now need to be written to a file, which, interestingly, is still to be resolved. (TypeError: ‘str’ object is not callable sound familiar to anyone?)
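The poll, read and parse loop could look roughly like this. It is a sketch under assumptions: the logs are taken to be Apache-style access logs with bracketed timestamps (as the dates in the results suggest), and `new_logs`, `timestamps` and the `seen` set are hypothetical names, not the scripts' own.

```python
import os
import re

# Bracketed Apache-style timestamp, e.g. [21/Sep/2007:11:38:56 +0000]
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def new_logs(directory, seen):
    """Return paths of files not seen on earlier polls, and remember them."""
    fresh = sorted(name for name in os.listdir(directory) if name not in seen)
    seen.update(fresh)
    return [os.path.join(directory, name) for name in fresh]


def timestamps(path):
    """Yield the timestamp of each log line, reading one line at a time."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = TIMESTAMP.search(line)
            if match:
                yield match.group(1)
```

Iterating over the file handle keeps memory use flat even for multi-gigabyte logs, which matters when the whole log set is close to the size of the disk.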
I am grateful to Asheesh (whom I should have totally bugged more). I should’ve put more work into the project when vacationing back home; having less to do at school would’ve helped too (studies + 3 research projects is not a recommended workload). But if it were easy, it wouldn’t be fun! Oh well, I learnt a fair bit through the project, and with a bit more troubleshooting we’d be good to go … for now!
Google Summer of Code 2008 is approaching its end: less than forty-eight hours are left to submit the code that will then be evaluated by mentors. It is therefore fitting to pause for a moment, sum up the work that has been done on the license-oriented metadata validator and viewer, and compare it with the original proposal for the project.
A Web application capable of parsing and displaying license information embedded in both well-formed and ill-formed Web pages has been developed. It supports the following means of embedding license information: Dublin Core metadata, RDFa, RDF/XML linked externally or embedded (utilising the data URL scheme) using the link and a elements, and RDF/XML embedded in a comment or as an element (the last two being deprecated). This functionality has been proven by unit testing. The source code of a Web page can be uploaded or pasted by a user; alternatively, a URI can be provided for the Web application to analyse. The software has been written in Python and uses the Pylons Web Framework and the Genshi toolkit. Should you be willing to test this Lynx-friendly application, please visit its Web site.
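To give a flavour of one of these cases, the sketch below pulls rel="license" links out of a page that need not be well-formed. It is a deliberately simplified stand-in: it uses Python's standard html.parser module rather than the libraries the validator is built on, it looks only at the a and link elements, and the names `LicenseLinkFinder` and `find_license_links` are hypothetical.

```python
from html.parser import HTMLParser


class LicenseLinkFinder(HTMLParser):
    """Collect href values of <a> and <link> elements whose rel includes 'license'."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="license nofollow".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "license" in rel_tokens and href:
            self.licenses.append(href)


def find_license_links(html):
    """Return license URIs found in the markup, in document order."""
    finder = LicenseLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.licenses
```

Because the parser reports start tags as it meets them, unclosed elements elsewhere in the page do not stop the links from being found.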
The Web application itself uses a library called “libvalidator”, which in turn is powered by cc.license (a library developed by Creative Commons that returns information about a given license), pyRdfa (a distiller that generates the RDF triples from an (X)HTML+RDFa file), html5lib (an HTML parser/tokenizer), and RDFLib (a library for working with RDF). The choice of this set of tools was not obvious and the library underwent several redesigns, which included removing the code that employed encutils, XML canonicalization, µTidylib, and BeautifulSoup. The idea of using librdf, librdfa, and rdfadict has been abandoned. The source code of both the Web application (licensed under the GNU Affero General Public License version 3 or newer) and its core library (licensed under the GNU Lesser General Public License version 3 or newer) is available through the Git repositories of Creative Commons.
In contrast to the contents of the original proposal, the following goals have not been met: traversal of special links, syndication feeds parsing, statistics, and cloning the layout of the Creative Commons Web site. However, these were never mandatory requirements for the Web application. It is also worth noting that the software has been written from scratch, although a now-defunct metadata validator existed. Nevertheless, the development does not end with Google Summer of Code — these and several new features (such as validation of multimedia files via liblicense and support for different language versions) are planned to be added, albeit at a slower pace.
After the test period, the validator will be available under http://validator.creativecommons.org/.