python
liblicense 0.8.1: The bugfixiest release ever
asheesh, December 25th, 2008
I’m greatly pleased to announce liblicense 0.8.1. Steren and Greg found a number of major issues (Greg found a consistent crasher on amd64, and Steren found a consistent crasher in the Python bindings). These issues, among some others, are fixed by the wondrous liblicense 0.8.1. I mentioned to Nathan Y. that liblicense is officially “no longer ghetto.”
The best way to enjoy liblicense is from our Ubuntu and Debian package repository, at http://mirrors.creativecommons.org/packages/. More information on what liblicense does is available on our wiki page about liblicense. You can also get it in fresh Fedora 11 packages. And the source tarball is available for download from sourceforge.net.
P.S. MERRY CHRISTMAS!
The full ChangeLog snippet goes like this:
liblicense 0.8.1 (2008-12-24):
* Cleanups in the test suite: test_predicate_rw’s path joiner finally works
* Tarball now includes data_empty.png
* Dynamic tests and static tests treat $HOME the same way
* Fix a major issue with requesting localized informational strings, namely that the first match would be returned rather than all matches (e.g., only the first license of a number of matching licenses). This fixes the Python bindings, which use localized strings.
* Add a cooked PDF example that actually works with exempi; explain why that is not a general solution (not all PDFs have XMP packets, and the XMP packet cannot be resized by libexempi)
* Add a test for writing license information to the XMP in a PNG
* Fix a typo in exempi.c
* Add basic support for storing LL_CREATOR in exempi.c
* In the case that the system locale is unset (therefore, is of value “C”), assume English
* Fix a bug with the TagLib module: some lists were not NULL-terminated
* Use calloc() instead of malloc()+memset() in read_license.c; this improves efficiency and closes a crasher on amd64
* Improve chooser_test.c so that it is not strict as to the *order* the results come back so long as they are the right licenses.
* To help diagnose possible xdg_mime errors, if we detect the hopeless application/octet-stream MIME type, fprintf a warning to stderr.
* Test that searching for unknown file types returns a NULL result rather than a segfault.
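Three of the fixes above (returning all matches rather than the first, falling back to English when the locale is “C”, and not depending on result order in tests) can be sketched in a few lines of Python. This is only an illustration of the behaviour described in the ChangeLog: the data, the function names and the locale handling are assumptions and are not liblicense’s actual API.

```python
# Hypothetical sample data: (license URI, locale, localized name).
# This is NOT liblicense's real data model, only an illustration.
LICENSE_NAMES = [
    ("http://creativecommons.org/licenses/by/3.0/", "en", "Attribution"),
    ("http://creativecommons.org/licenses/by-sa/3.0/", "en", "Attribution-ShareAlike"),
    ("http://creativecommons.org/licenses/by/3.0/", "de", "Namensnennung"),
]


def first_match(locale):
    """The buggy behaviour: stop at the first entry for the locale."""
    for _uri, entry_locale, name in LICENSE_NAMES:
        if entry_locale == locale:
            return [name]
    return []


def all_matches(locale):
    """The fixed behaviour: collect every entry for the locale."""
    return [name for _uri, entry_locale, name in LICENSE_NAMES
            if entry_locale == locale]


def effective_locale(system_locale):
    """Treat an unset or "C" locale as English (assumed language code "en")."""
    if not system_locale or system_locale == "C":
        return "en"
    return system_locale.split("_")[0]


def same_licenses(actual, expected):
    """Compare results without being strict about their order."""
    return sorted(actual) == sorted(expected)
```

With this data, first_match("en") yields one name while all_matches("en") yields two, which is the kind of difference the fix is about.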
>>> py >> file … Also if __name__ == ‘__main__’:
Ankit Guglani, September 1st, 2008
Some major updates, and we have the scripts running. Thanks to Asheesh for the redirection idea: it works, but I couldn’t get it to give me a progress bar, since everything was being redirected to the file. I tried using two different functions, but they needed a shared variable, so that failed. Still, it was nice, since I ended up with “real” Python files with a main().
The journey was interesting: we went from trying >> inside Python, to including # -*- coding: UTF-8 -*- and # coding: UTF-8 to get it to work, and after a few more bumps finally figured out the __main__ idiom.
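One way to keep a progress indicator while the results go to a file is to write results to an explicit file object and progress to stderr, so that neither depends on shell redirection. The following is a minimal sketch under that assumption; the function names, the command-line arguments and the placeholder “parsing” step are hypothetical and are not taken from the actual scripts.

```python
import sys


def process(lines, results, progress=sys.stderr):
    """Write one result per input line to `results`; report progress separately."""
    total = len(lines)
    for count, line in enumerate(lines, start=1):
        # Placeholder for the real parsing step (hypothetical).
        results.write(line.strip().upper() + "\n")
        # "\r" returns to the start of the line so the counter updates in place.
        progress.write("\rprocessed %d/%d" % (count, total))
    progress.write("\n")
    return total


def main(argv):
    # Hypothetical usage: script.py INPUT OUTPUT
    if len(argv) != 3:
        sys.stderr.write("usage: script.py INPUT OUTPUT\n")
        return 2
    with open(argv[1]) as source:
        lines = source.readlines()
    with open(argv[2], "a") as results:
        process(lines, results)
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Because the results file is opened inside main(), the progress text never ends up in it, and the module can still be imported by other scripts without running anything.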
I still need to update all the scripts, but licChange, which is at the forefront of all the latest developments, just got bumped up to version 8.2 (which reminds me of a dire need to update GIT:Loggy!).
This also gave me an idea of how to go about getting data out of S3 for “free” … S3 to EC2 is free … SCP from EC2 is free and voilà! Why would I ever want to do that? Well, for starters, the EC2 AMI runs out of space around 5 GB (note: logs for i.creativecommons.org are 4.7 GB) and secondly, the scripts seem to run faster locally. The icing on the cake: I wouldn’t have to scp the result files being generated. I could possibly automate the process of running the scripts.
That’s all for now … class at 0830 hrs in the morning (it’s criminal, I know).
I guess, I’ll just have to keep at it.
EC2, S3Sync and back to Python.
Ankit Guglani, August 31st, 2008
So this is where we are.
Now we have EC2, we have S3Sync Ruby scripts on the EC2 AMI to pull the data from S3, and we have updated Python scripts that read one line at a time and use Geo-IP (which was surprisingly easy to install once GCC was functional and the right versions of the C and Python modules were obtained). So deployment is at full throttle: one final bug fix for generating the final results and we are done.
So, now back to the Python code. Now we have 4 scripts:
- License Change (Logs for i.creativecommons.org) [Version 7]
- License Chooser (Logs for creativecommons.org) [Version 5]
- CC Search (Logs for search.creativecommons.org) [Version 4]
- Deeds (Logs for creativecommons.org/licenses/*) [Version 2]
Each script polls a directory for new logs, reads each new log in that directory line by line, and uses regular expressions to parse the information into usable statistics. Throughout the development phase, the results were passed on to stdout (the console). With deployment, they now need to be written to a file, which, interestingly, is still to be resolved. (TypeError: 'str' object is not callable sound familiar to anyone?)
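The read-parse-write loop described above might look roughly like the sketch below. The log format is an assumption (an Apache-style access log), as are all function names; the real scripts are not shown in this post. As a side note, the TypeError quoted above typically appears when a name that is later called, such as a function, has been rebound to a string; whether that was the cause here is not stated.

```python
import re

# Assumed Apache-style access log line; the real log format is not given.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def count_paths(lines):
    """Count requests per path, skipping lines that do not match."""
    counts = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        path = match.group("path")
        counts[path] = counts.get(path, 0) + 1
    return counts


def summarize_log(log_path, results_path):
    """Parse one log line by line and append its statistics to a results file."""
    with open(log_path) as log:
        counts = count_paths(log)  # a file object yields one line at a time
    with open(results_path, "a") as results:
        for path, hits in sorted(counts.items()):
            results.write("%s\t%d\n" % (path, hits))
    return counts
```

Opening the results file in append mode ("a") means repeated runs add to the file rather than overwrite it.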
I am grateful to Asheesh (whom I should have totally bugged more). I should’ve put more work into the project when vacationing back home; having less to do at school would’ve helped too (studies + 3 research projects is not a recommended workload), but if it were easy, it wouldn’t be fun! Oh well, I learnt a fair bit through the project, and with a bit more troubleshooting we’d be good to go … for now!
License-oriented metadata validator and viewer: summertime is winding up
Hugo Dworak, August 16th, 2008
Google Summer of Code 2008 is approaching its end: less than forty-eight hours are left to submit the code that will then be evaluated by mentors. It is therefore fitting to pause for a moment, sum up the work that has been done on the license-oriented metadata validator and viewer, and compare it with the original proposal for the project.
A Web application capable of parsing and displaying license information embedded in both well-formed and ill-formed Web pages has been developed. It supports the following means of embedding license information: Dublin Core metadata; RDFa; RDF/XML linked externally or embedded (utilising the data URL scheme) using the link and a elements; and RDF/XML embedded in a comment or as an element (the last two being deprecated). This functionality has been verified by unit tests. A user can upload or paste the source code of a Web page, or provide a URI for the Web application to analyse. The software has been written in Python and uses the Pylons Web Framework and the Genshi toolkit. Should you be willing to test this Lynx-friendly application, please visit its Web site.
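To give a feel for what scanning an ill-formed page for license information involves, here is a small sketch that uses only Python’s standard html.parser module. It is not how the validator works (the validator relies on pyRdfa, html5lib and RDFLib), and the choice of attributes to inspect (rel="license" on a and link elements, and a meta element named DC.rights) is a simplifying assumption.

```python
from html.parser import HTMLParser


class LicenseLinkFinder(HTMLParser):
    """Collect license hints from a page, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if tag in ("a", "link") and "license" in rel and attributes.get("href"):
            self.licenses.append(("rel", attributes["href"]))
        name = (attributes.get("name") or "").lower()
        # Assumed Dublin Core convention: <meta name="DC.rights" content="...">
        if tag == "meta" and name == "dc.rights" and attributes.get("content"):
            self.licenses.append(("dc", attributes["content"]))


def find_licenses(markup):
    """Return (kind, value) pairs in document order."""
    finder = LicenseLinkFinder()
    finder.feed(markup)
    finder.close()
    return finder.licenses
```

The parser does not require balanced tags, so a page with an unclosed paragraph or a missing closing html tag is still scanned.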
The Web application itself uses a library called “libvalidator”, which in turn is powered by cc.license (a library developed by Creative Commons that returns information about a given license), pyRdfa (a distiller that generates the RDF triples from an (X)HTML+RDFa file), html5lib (an HTML parser/tokenizer), and RDFLib (a library for working with RDF). The choice of this set of tools was not obvious, and the library has undergone several redesigns, which included removing the code that employed encutils, XML canonicalization, µTidylib, and BeautifulSoup. The idea of using librdf, librdfa, and rdfadict has been abandoned. The source code of both the Web application (licensed under the GNU Affero General Public License version 3 or newer) and its core library (licensed under the GNU Lesser General Public License version 3 or newer) is available through the Git repositories of Creative Commons.
In contrast to the contents of the original proposal, the following goals have not been met: traversal of special links, syndication feeds parsing, statistics, and cloning the layout of the Creative Commons Web site. However, these were never mandatory requirements for the Web application. It is also worth noting that the software has been written from scratch, although a now-defunct metadata validator existed. Nevertheless, the development does not end with Google Summer of Code — these and several new features (such as validation of multimedia files via liblicense and support for different language versions) are planned to be added, albeit at a slower pace.
After the test period, the validator will be available under http://validator.creativecommons.org/.
GeoIP Hates Me … phail.
Ankit Guglani, August 6th, 2008
Not that I am expecting much trouble coding with the Geo-IP module, but trying to get it onto the system itself has me believing that this module is out to get me! First, Mac OS X (Leopard) doesn’t come with GCC installed (shocker!) and this module needs building, so I go to get it. GCC is packaged with the developer tools, which is about a 2 GB install, and I can’t hand-pick the components … fail. So I go get myself DarwinPorts and try that route. It installs, gives me the sweet *ding* install-complete sound, and when I go to the terminal … fail … no such file or directory. So I give in to its terrorist demands and make room for the developer tools, thinking I’ll make up for it by actually using them. I wait 19 minutes for the install to complete, I check that I have GCC [i686-apple-darwin9-gcc-4.0.1] … happily I go and run python setup.py build … and what followed was not nice … a screen full of warnings and errors and no build. =(
I am going to find another source and try again till it finally works!
In other news, I am changing all my code into methods and adding append-to-file for the results, and I am looking to add file-list comparison as a feature. Coming soon to a Git repository near you!
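File-list comparison could be done by keeping a plain-text manifest of logs that have already been processed and comparing it with the directory listing. The sketch below assumes exactly that; the manifest file and the function names are hypothetical and do not come from the actual scripts.

```python
import os


def load_processed(manifest_path):
    """Read the names of already-processed logs, one per line."""
    if not os.path.exists(manifest_path):
        return set()
    with open(manifest_path) as manifest:
        return {line.strip() for line in manifest if line.strip()}


def unprocessed_logs(directory, manifest_path):
    """Return the files in `directory` that the manifest does not list yet."""
    done = load_processed(manifest_path)
    return sorted(name for name in os.listdir(directory) if name not in done)


def mark_processed(manifest_path, name):
    """Append a log name to the manifest once its results are written."""
    with open(manifest_path, "a") as manifest:
        manifest.write(name + "\n")
```

Keeping the manifest outside the log directory avoids it being picked up as a log itself.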
License-oriented metadata validator and viewer: libvalidator
Hugo Dworak, July 8th, 2008
As the Google Summer of Code 2008 midterm evaluation deadline is approaching, it is a good time to report on the progress of the license-oriented metadata validator and viewer.
The source code is located in two dedicated git repositories. The first is validator, which contains the source code of the Web application based on Pylons and Genshi. The second is libvalidator, which hosts the files that constitute the core library that the project will utilise. This is the component that the development focuses on right now.
The purpose of the aforementioned library is to parse input files, scan them for relevant license information, and output the results in a machine-readable fashion. More precisely, its workflow is the following: parse the file and associated RDF information so that a complete set of RDF data is available, filter the results with regard to license information (not only related to the document itself, but also to other objects described within it), and return the results in a manner preferable for the usage by the Web application.
pyRdfa seems to be the best tool for the parsing stage so far. It handles the current recommendation for embedding license metadata (namely RDFa) as well as other non-deprecated methods: linking to an external or embedded (using the “data” URL scheme) RDF file and utilising the Dublin Core. The significant gap is the handling of the invalid direct embedding of RDF/XML within the HTML/XHTML source code (as an element or in a comment). This is resolved by first capturing all such instances using a regular expression and then parsing the data just like external RDF/XML files.
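The capture step for directly embedded RDF/XML can be illustrated with a short regular expression. The pattern below is an assumption made for this sketch and is not necessarily the expression libvalidator uses; it simply grabs every rdf:RDF block, whether it sits inside a comment or appears as an element.

```python
import re

# Assumed pattern: everything from an opening <rdf:RDF ...> to its closing tag.
# DOTALL lets "." span newlines; the non-greedy ".*?" keeps blocks separate.
RDF_BLOCK = re.compile(r"<rdf:RDF\b.*?</rdf:RDF\s*>", re.DOTALL | re.IGNORECASE)


def extract_rdf_blocks(markup):
    """Return every embedded RDF/XML block found in the page source."""
    return RDF_BLOCK.findall(markup)
```

Each captured string can then be handed to an RDF/XML parser in the same way as the contents of an externally linked file.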
Once the RDF triples are extracted, one can use SPARQL to narrow the results just to the triples related to the licensed objects. Both librdf and rdflib support this language. Moreover, the RDF/XML related to the license must be parsed, so that its conditions (permissions, requirements, and restrictions) are then presented to the user.
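The narrowing step can be pictured without an RDF library at all. The sketch below is a plain-Python stand-in for a SPARQL query such as SELECT ?work ?license WHERE { ?work <http://creativecommons.org/ns#license> ?license }; it treats triples as simple tuples of strings, which is an assumption made for illustration and not how librdf or RDFLib represent them.

```python
# Predicates commonly used to state a license; which ones the library
# actually checks is an assumption here.
LICENSE_PREDICATES = {
    "http://creativecommons.org/ns#license",
    "http://www.w3.org/1999/xhtml/vocab#license",
    "http://purl.org/dc/terms/license",
}


def licensed_objects(triples):
    """Map each subject to the set of license URIs asserted for it."""
    found = {}
    for subject, predicate, obj in triples:
        if predicate in LICENSE_PREDICATES:
            found.setdefault(subject, set()).add(obj)
    return found
```

Grouping by subject matters because a page may describe the license of other objects (an image, for example) in addition to its own.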
The library takes advantage of standard Python tools such as Buildout and nose. When it is completed, the project will be all about writing a Web application that will serve as an interface to libvalidator.
License-oriented metadata validator and viewer: the development has just started
Hugo Dworak, May 26th, 2008
Creative Commons participates in Google Summer of Code™ and has accepted a proposal (see the abstract) from Hugo Dworak, based on its description of a task to rewrite its now-defunct metadata validator. Asheesh Laroia has been assigned as the mentor of the project. The work began on May 26th, 2008 as per the project timeline. It is expected to be completed in twelve weeks. More details will be provided in the dedicated CC Wiki article, and progress will be featured weekly on this blog.
The project focuses on developing an on-line tool (free software written in Python) to validate digitally embedded Creative Commons licenses within files of different types. Files will be pasted directly into a form, identified by a URL, or uploaded by a user. The application will present the results in a human-readable fashion and notify the user if the means used to express the license terms are deprecated.
Liblicense is alpha!
Scott Shawcroft, June 29th, 2007
Over the last week a lot of progress has been made on liblicense. Yesterday Jason and I got the module_read and module_write functions working with a stub io module and an XMP sidecar module. Tuesday and Wednesday I got the library’s system license functions working. Today I did some memory leak plugging and wrote out the system default functions. Nearly every part of the library works as planned. While it’s still rough, the bulk of the library work is done.
The most common data structure I’ve been using is a null-terminated list (really an array) of strings (char*). Yesterday I wrote out some common methods to be shared throughout the library. These are in list.c. My hope is that these common functions will allow the other code to be cleaner. Next week I plan on fixing up system_licenses.c to use the list functions. At the moment it is the largest, ugliest and leakiest of all the files. That will all be fixed Monday.
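For readers who think in Python rather than C, the null-terminated list of strings can be modelled with ctypes. This is only an illustration of the data structure: the helper names are hypothetical and are not the functions in list.c.

```python
import ctypes


def make_list(strings):
    """Build a NULL-terminated array of C strings (the equivalent of char **)."""
    encoded = [s.encode("utf-8") for s in strings]
    array_type = ctypes.c_char_p * (len(encoded) + 1)
    # The trailing None becomes the NULL pointer that terminates the list.
    return array_type(*encoded, None)


def list_length(array):
    """Count entries by walking until the NULL terminator, as C code would."""
    count = 0
    while array[count] is not None:
        count += 1
    return count


def list_contains(array, needle):
    """Check whether `needle` is among the entries before the terminator."""
    target = needle.encode("utf-8")
    return any(array[i] == target for i in range(list_length(array)))
```

The length is never stored; it is rediscovered by walking to the terminator, which is why every function that builds such a list has to remember to add it.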
After the code cleanup on Monday the much more exciting task of creating modules and clients of the library begins. We’d like to support embedding in as many file formats as possible. Without this ability, the license tracking only works locally. One of the most useful libraries so far is Exempi, which can embed in a number of formats. Jason wrote an Exempi liblicense module yesterday. On my list of clients to do is a Gnome Control Panel system default, Nautilus license select, Sugar license select and Creative Commons default license chooser. Am I missing anything important? Where could licenses be integrated besides this? Perhaps Amarok or an equivalent? ccHost? Let me know what you think.
Palimpsest
Mike Linksvayer, May 4th, 2007
Terry Hancock, a frequent poster on the law-oriented cc-licenses list, is working on an interesting metadata library called Palimpsest:
[W]hich has a mnemonic association with what the program does, and does have a clever backronym for those who want one:
Python
Attribution &
Licensing
Information
Metadata
Processor, with
Systematic
Extensibility for
Sundry
Types
Terry’s goals for the project:
- Read/write support of Adobe XMP embedded metadata
- Read/write support of native “named field” data
- Read/write support of comments
- Read/write support of visible text labelling for formats that need it
- General adaptation to the 15 Dublin Core named fields for all data
- Discovery of attribution and licensing data in comments and annotations, if not available elsewhere
- License-aware processing (expansion of common abbreviations of terms, etc)
- Open-ended pluggable support for virtually any multimedia datatype
- Highly portable, so that it can be used on clients or servers on any operating system
- Dead-simple, so people will actually want to use it
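As an illustration of what “expansion of common abbreviations” in the license-aware processing goal might involve, here is a small sketch that expands Creative Commons shorthand such as “CC BY-SA”. It is not Palimpsest’s API; the function name and the decision to ignore version numbers are assumptions made for the example.

```python
import re

# Creative Commons license elements and their full names.
ELEMENTS = {
    "by": "Attribution",
    "nc": "NonCommercial",
    "nd": "NoDerivs",
    "sa": "ShareAlike",
}


def expand_license(abbreviation):
    """Expand e.g. "CC BY-NC-SA" or "by-sa"; return None if unrecognized."""
    text = abbreviation.strip().lower()
    text = re.sub(r"^cc[\s-]+", "", text)  # drop an optional leading "CC"
    parts = text.split("-")
    if any(part not in ELEMENTS for part in parts):
        return None
    return "Creative Commons " + "-".join(ELEMENTS[part] for part in parts)
```

Returning None for anything unrecognized, instead of guessing, leaves it to the caller to decide how to treat other licenses.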
I’m glad to see Terry tackling this project. It’ll be hard to get the abstractions right, but valuable if it works.
I love the project logo:

Not because it is a particularly great logo, but because it’s the first logo I’ve seen that could be mistaken for a captcha. Intentional or not, bound to be independently invented many times, and perhaps copied by me at least once.
