
git-svn with svn:externals

nathan, July 21st, 2009

We’ve been slowly but surely moving projects from Subversion to git, but there are still large pieces of code that are sort of deadlocked in Subversion. We make extensive use of svn:externals in our repository as a way to pull in dependencies and shared code (both within our repository and from other repositories, like zc.buildout’s bootstrap). This means that in order to convert something like the license engine (cc.engine) we need to also convert cc.license (which uses license_xsl), license.rdf and i18n. Of course, the API also uses license_xsl and the main site uses license.rdf. Taking the time to move all of that wholesale isn’t something we have the time or desire to do right now. It’s not just converting the repository — that’s the easy part; converting the deployments and surrounding tools is the real pain.

Last week I decided I wanted to use git to work on code currently in Subversion (my supporting tool chain really is that much better for git) and decided to check it out using git-svn. And once again I was burned by our use of svn:externals. So I wrote gsc — git subversion clone. You can read more details on my blog or find the code on gitorious.
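For the curious, the core idea behind a tool like gsc is easy to sketch in Python: ask Subversion for every svn:externals property in the tree, clone the top level with git-svn, then clone each external into place. This is a rough illustration, not gsc’s actual code; it assumes the old two-token “dir URL” externals format and glosses over nesting and revision pins.

    import subprocess

    def svn_externals(url):
        """Yield (subdir, external_url) for every svn:externals entry
        anywhere under `url` (old-style "dir URL" definitions only)."""
        output = subprocess.check_output(
            ["svn", "propget", "svn:externals", "-R", url], text=True)
        holder = None
        for line in output.splitlines():
            if " - " in line:
                # "PATH - first line of the property value"
                holder, line = line.split(" - ", 1)
            parts = line.split()
            if len(parts) == 2:
                subdir, ext_url = parts
                # `holder` is a URL; mapping it back to a path under the
                # clone is simplified here by stripping the root URL.
                prefix = holder.replace(url, "").lstrip("/") if holder else ""
                yield (prefix + "/" + subdir if prefix else subdir), ext_url

    def clone_with_externals(url, target):
        """git-svn clone `url`, then clone each external into place."""
        subprocess.check_call(["git", "svn", "clone", url, target])
        for subdir, ext_url in svn_externals(url):
            subprocess.check_call(
                ["git", "svn", "clone", ext_url, target + "/" + subdir])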

Incidentally, cleaning up that dependency graph is very much on our radar. I’m hoping that work we started last summer will land this fall, removing some duplicated code and cleaning up the dependencies of the remaining code.


New validator released!

asheesh, January 6th, 2009

This past summer, Hugo Dworak worked with us (thanks to Google Summer of Code) on a new validator. This work was greatly overdue, and we are very pleased that Google could fund Hugo to work on it. Our previous validator had not been updated to reflect our new metadata standards, so we disabled it some time ago to avoid creating further confusion. The textbook on CC metadata is the “Creative Commons Rights Expression Language”, or ccREL, which specifies the use of RDFa on the web. (If this sounds like keyword soup, rest assured that the License Engine generates HTML that you can copy and paste; that HTML is fully compliant with ccREL.) We hoped Hugo’s work would let us once again offer a validator to the Creative Commons community, so that publishers can test their web pages and make sure they encode the information they intended.

Hugo’s work was a success; he announced in August 2008 a test version of the validator. He built on top of the work of others: the new validator uses the Pylons web framework, html5lib for HTML parsing and tokenizing, and RDFlib for working with RDF. He shared his source code under the recent free software license built for network services, AGPLv3.

So I am happy to announce that the test period is complete, and we are now running the new code at http://validator.creativecommons.org/. Our thanks go out to Hugo, and we look forward to the new validator gaining some use as well as hearing your feedback. If you want to contribute to the validator’s development or check it out for any reason, take a look at the documentation on the CC wiki.


liblicense 0.8.1: The bugfixiest release ever

asheesh, December 25th, 2008

I’m greatly pleased to announce liblicense 0.8.1. Steren and Greg found a number of major issues (Greg found a consistent crasher on amd64, and Steren found a consistent crasher in the Python bindings). These issues, among some others, are fixed by the wondrous liblicense 0.8.1. I mentioned to Nathan Y. that liblicense is officially “no longer ghetto.”

The best way to enjoy liblicense is from our Ubuntu and Debian package repository at http://mirrors.creativecommons.org/packages/. More information on what liblicense does is available on our wiki page about liblicense. You can also get it in fresh Fedora 11 packages. And the source tarball is available for download from sourceforge.net.

P.S. MERRY CHRISTMAS!

The full ChangeLog snippet goes like this:

liblicense 0.8.1 (2008-12-24):
* Cleanups in the test suite: test_predicate_rw’s path joiner finally works
* Tarball now includes data_empty.png
* Dynamic tests and static tests treat $HOME the same way
* Fix a major issue with requesting localized informational strings, namely that the first match would be returned rather than all matches (e.g., only the first license of a number of matching licenses). This fixes the Python bindings, which use localized strings.
* Add a cooked PDF example that actually works with exempi; explain why that is not a general solution (not all PDFs have XMP packets, and the XMP packet cannot be resized by libexempi)
* Add a test for writing license information to the XMP in a PNG
* Fix a typo in exempi.c
* Add basic support for storing LL_CREATOR in exempi.c
* In the case that the system locale is unset (therefore, is of value “C”), assume English
* Fix a bug with the TagLib module: some lists were not NULL-terminated
* Use calloc() instead of malloc()+memset() in read_license.c; this improves efficiency and closes a crasher on amd64
* Improve chooser_test.c so that it is not strict about the *order* in which results come back, so long as they are the right licenses.
* To help diagnose possible xdg_mime errors, if we detect the hopeless application/octet-stream MIME type, fprintf a warning to stderr.
* Test that searching for unknown file types returns a NULL result rather than a segfault.


License-oriented metadata validator and viewer: summertime is winding up

hugo dworak, August 16th, 2008

Google Summer of Code 2008 approaches its end, with less than forty-eight hours left to submit the code that will then be evaluated by mentors. It is therefore fitting to pause for a moment and sum up the work that has been done on the license-oriented metadata validator and viewer, and to compare it with the original proposal for the project.

A Web application capable of parsing and displaying license information embedded in both well-formed and ill-formed Web pages has been developed. It supports the following means of embedding license information: Dublin Core metadata, RDFa, RDF/XML linked externally or embedded (utilising the data URL scheme) using the link and a elements, and RDF/XML embedded in a comment or as an element (the last two being deprecated). This functionality has been verified by unit tests. The source code of a Web page can be uploaded or pasted by a user, and a URI can also be provided for the Web application to fetch and analyse. The software has been written in Python and uses the Pylons Web Framework and the Genshi toolkit. Should you be willing to test this Lynx-friendly application, please visit its Web site.

The Web application itself uses a library called “libvalidator”, which in turn is powered by cc.license (a library developed by Creative Commons that returns information about a given license), pyRdfa (a distiller that generates RDF triples from an (X)HTML+RDFa file), html5lib (an HTML parser/tokenizer), and RDFLib (a library for working with RDF). The choice of this set of tools has not been obvious, and the library underwent several redesigns, which included removing code that employed encutils, XML canonicalization, µTidylib, and BeautifulSoup. The idea of using librdf, librdfa, and rdfadict has been abandoned. The source code of both the Web application (licensed under the GNU Affero General Public License version 3 or newer) and its core library (licensed under the GNU Lesser General Public License version 3 or newer) is available through the Git repositories of Creative Commons.

In contrast to the contents of the original proposal, the following goals have not been met: traversal of special links, syndication feed parsing, statistics, and cloning the layout of the Creative Commons Web site. However, these were never mandatory requirements for the Web application. It is also worth noting that the software has been written from scratch, although a now-defunct metadata validator existed. Nevertheless, the development does not end with Google Summer of Code — these and several new features (such as validation of multimedia files via liblicense and support for different language versions) are planned to be added, albeit at a slower pace.

After the test period, the validator will be available at http://validator.creativecommons.org/.


License-oriented metadata validator and viewer: libvalidator

hugo dworak, July 8th, 2008

As the Google Summer of Code 2008 midterm evaluation deadline approaches, it is a good time to report progress on the license-oriented metadata validator and viewer.

The source code is located in two dedicated git repositories. The first is validator, which contains the source code of the Web application based on Pylons and Genshi. The second is libvalidator, which hosts the files that constitute the core library the project will utilise. This is the component that development focuses on right now.

The purpose of the aforementioned library is to parse input files, scan them for relevant license information, and output the results in a machine-readable fashion. More precisely, its workflow is the following: parse the file and associated RDF information so that a complete set of RDF data is available, filter the results with regard to license information (not only related to the document itself, but also to other objects described within it), and return the results in a manner preferable for the usage by the Web application.

pyRdfa seems to be the best tool for the parsing stage so far. It handles the current recommendation for embedding license metadata (namely RDFa) as well as the other non-deprecated methods: linking to an external RDF/XML file or embedding one (using the “data” URL scheme), and utilising Dublin Core. The significant gap is the handling of invalid direct embedding of RDF/XML within the HTML/XHTML source code (as an element or in a comment); this is resolved by first capturing all such instances with a regular expression and then parsing the data just as external RDF/XML files are parsed.
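That capture-then-parse step is simple enough to sketch. A minimal illustration (the actual pattern used in libvalidator may differ): grab every <rdf:RDF>…</rdf:RDF> block from the page source with a regular expression and feed each one to an RDF/XML parser such as rdflib’s.

    import re
    from rdflib import Graph

    # Match <rdf:RDF ...> ... </rdf:RDF> wherever it appears, including
    # inside HTML comments; DOTALL lets the block span multiple lines.
    RDF_BLOCK = re.compile(r"<rdf:RDF.*?</rdf:RDF>", re.DOTALL | re.IGNORECASE)

    def embedded_rdf_graphs(page_source):
        """Parse each directly embedded RDF/XML block into its own Graph."""
        for match in RDF_BLOCK.finditer(page_source):
            graph = Graph()
            graph.parse(data=match.group(0), format="xml")
            yield graph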

Once the RDF triples are extracted, one can use SPARQL to narrow the results just to the triples related to the licensed objects. Both librdf and rdflib support this language. Moreover, the RDF/XML related to the license must be parsed, so that its conditions (permissions, requirements, and restrictions) are then presented to the user.
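As a rough illustration of that narrowing step with rdflib (the predicates queried here are the usual ccREL and Dublin Core ones; libvalidator’s real query is more thorough):

    from rdflib import Graph

    LICENSE_QUERY = """
    PREFIX cc: <http://creativecommons.org/ns#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX xhtml: <http://www.w3.org/1999/xhtml/vocab#>

    SELECT ?work ?license WHERE {
        { ?work cc:license ?license }
        UNION { ?work xhtml:license ?license }
        UNION { ?work dcterms:license ?license }
    }
    """

    def licensed_objects(graph):
        """Return (work, license) pairs asserted anywhere in the graph."""
        return [(str(row.work), str(row.license))
                for row in graph.query(LICENSE_QUERY)]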

The library takes advantage of standard Python tools such as Buildout and nose. Once it is complete, the remaining work will be writing a Web application that serves as an interface to libvalidator.


Flickr Image Re-Use for OpenOffice.org update

mihai, June 13th, 2008

I’m happy to announce that I have succeeded in implementing, in a basic manner, all three requirements for this project: searching photos by tag, filtering by license, and inserting a photo into a document.

Here is a screenshot taken after a search for the tag “mountains” and the Attribution License:

Results after a search on a tag and a license

Here is also a screenshot of the photo inserted into a document. As you can see, the image was inserted at a default size, but this will change later.
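For the curious, the search behind that screenshot boils down to a single Flickr REST call. The extension itself is written in Java, but a rough Python equivalent looks like this (substitute a real API key; license id 4 is Flickr’s code for the Attribution license):

    import json
    import urllib.parse
    import urllib.request

    FLICKR_REST = "https://api.flickr.com/services/rest/"

    def search_cc_photos(api_key, tags, license_id="4"):
        """Search Flickr for photos by tag, filtered to a single license."""
        params = urllib.parse.urlencode({
            "method": "flickr.photos.search",
            "api_key": api_key,       # substitute a real key
            "tags": tags,             # e.g. "mountains"
            "license": license_id,    # 4 = Attribution
            "format": "json",
            "nojsoncallback": 1,
        })
        with urllib.request.urlopen(FLICKR_REST + "?" + params) as response:
            return json.load(response)["photos"]["photo"]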

What I’ll try to do next:

  • add menus to each image with the available sizes
  • improve the search
  • insert the image into the document at the selected size
  • add the license to the document
  • do more testing

I hope that in less than two weeks I will have a good version available.

Any comments or suggestions are much appreciated.

P.S.: I came across this article. “I for one can’t wait,” says Andrew Min about this project. I’ll try not to disappoint him :)


Wait, they test web apps now?

frank, June 10th, 2008

Greetings all. I’m Frank and I’m the other tech intern at CC for 2008. I mainly focus on writing Python, though I’m comfortable with Java and I’ll deal with C when sufficiently coerced.

For my first task, I’ll be improving the test suite for the web service API (http://api.creativecommons.org/docs/). Currently it runs on CherryPy (http://cherrypy.org/) and the test suite is brittle and somewhat broken. I’ll be porting the tests over to Python Paste (http://pythonpaste.org/), getting them all to pass, and checking code coverage to see where more tests would be beneficial (testing is fun, after all). The overarching goal is to get the API running on Pylons (http://pylonshq.com/) so CC has fewer server stacks to maintain.
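To give a flavor of what the ported tests might look like, here is a minimal sketch using paste.fixture; the endpoint and assertions are illustrative, not the real suite:

    from paste.fixture import TestApp

    def test_license_details(wsgi_app):
        """`wsgi_app` stands in for the API's WSGI entry point."""
        app = TestApp(wsgi_app)
        response = app.get("/rest/1.5/details", params={
            "license-uri": "http://creativecommons.org/licenses/by/3.0/",
        })
        assert response.status == 200
        assert "by/3.0" in response.body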


GSoC 2008 : Flickr Image Re-Use for OpenOffice.org

mihai, May 29th, 2008

As the title might suggest, I have been selected for GSoC 2008. Nathan Yergler has been assigned as the mentor for this project.

Development will focus on three key functionalities:

  • ability to search photos by tags
  • filter search results by license attributes
  • insert the image into the document along with attribution information

The first two steps were already done (though of course not in their final version) in a small demo that I attached to my application for GSoC 2008. The OpenOffice components for which the extension will be implemented are Writer, Impress and Calc.

The application will be written in Java, using NetBeans with the OpenOffice NetBeans Integration plugin.

A short introduction: I’m Mihai Husleag, 24 years old, a Computer Science student at Alexandru Ioan Cuza University of Iasi, Romania. My previous experience as a programmer is mostly with the .NET framework. One more thing about me: if I’m not reachable on weekends, there is a high probability that you will find me here.

If you have any suggestions about this project (new functionality, things you don’t like, etc.), feel free to leave a comment.


License-oriented metadata validator and viewer: the development has just started

hugo dworak, May 26th, 2008

Creative Commons participates in Google Summer of Code™ and has accepted a proposal from Hugo Dworak (see the abstract), based on its description of a task to rewrite its now-defunct metadata validator. Asheesh Laroia has been assigned as the mentor of the project. The work began on May 26th, 2008, as per the project timeline, and is expected to be completed in twelve weeks. More details will be provided in the dedicated CC Wiki article, and progress will be featured weekly on this blog.

The project focuses on developing an on-line tool — free software written in Python — to validate digitally embedded Creative Commons licenses within files of different types. Files can be pasted directly into a form, identified by a URL, or uploaded by a user. The application will present the results in a human-readable fashion and notify the user if the means used to express the license terms are deprecated.


A patchy web server

asheesh, August 13th, 2007

The tech team here is examining how we can better organize our web servers and web sites. If you are a DNS sleuth, you may have noticed that most of our web sites are served from a server called “apps.creativecommons.org.” But there are some problems with this setup: for example, some of our sites consist of lots of little static files (like the images site, i.creativecommons.org), and some serve CPU-intensive web pages (like wiki.creativecommons.org). We can handle many, many more requests per second to i.creativecommons.org than to wiki.creativecommons.org.

In order to prevent Apache from overloading the computer it’s running on, we have to limit the maximum number of web pages it can serve at once. If a thousand people at once were requesting our images, that would be no problem, but we couldn’t really handle a thousand requests to our wiki at once. So we have to set the web server’s limit to what the most demanding site can handle – otherwise, when more people start editing our wiki at once than we can handle, service will degrade terribly for everyone, even the clients just getting images from i.creativecommons.org!

We knew the above from experience, but we didn’t really have data on which web sites (“vhosts”) were using up the most server resources. At first I thought I’d turn to Apache’s powerful logging system. I wanted to know how much server time was elapsing between the start of a request and when the server was ready to respond to it. Alas, mod_log_config only lets you log the total time elapsed from start of request to end of request, which means that large files or users on slow links can distort the picture. A large image file going slowly to a modem user doesn’t actually preclude us from sending another few hundred such files at the same time. On the other hand, if a wiki page is taking 5 seconds before it is ready to be sent to the user, those 5 seconds will be even longer if someone else requests another wiki page. Measuring the time from start of request to the beginning of the response seemed a good (albeit rough) way to measure actual server-side resources needed to respond.

I did notice that mod_headers could send a message to the client telling him exactly what I wanted to record! So I patched Apache to have mod_headers pass its information down to mod_log_config, and then I could log how much time elapsed from the start of the request to the beginning of the response.
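With that extra field in the log, per-vhost reporting is straightforward. Here is a sketch of the kind of aggregation the quick reporting scripts mentioned below perform (assuming a log format whose first field is the vhost and whose last field is the generation time in microseconds; the real scripts may differ):

    import sys
    from collections import defaultdict

    def per_vhost_totals(log_lines):
        """Sum request-generation time (microseconds) per vhost, assuming
        the vhost is the first field and the time is the last field."""
        totals = defaultdict(int)
        for line in log_lines:
            fields = line.split()
            if len(fields) >= 2 and fields[-1].isdigit():
                totals[fields[0]] += int(fields[-1])
        return totals

    if __name__ == "__main__":
        # Pipe an access log in on stdin; print the busiest vhosts first.
        for vhost, micros in sorted(per_vhost_totals(sys.stdin).items(),
                                    key=lambda item: -item[1]):
            print(vhost, micros / 1e6, "seconds")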

Every once in a while, but as recently as this past weekend (at Super Happy Dev House, even!), I hear people say, “Open source is great, but it’s not as if anyone actually uses the source to fix problems they encounter.” In this case, software freedom was more than theoretical; it meant the ability to make the software tell us things about itself that helps us better serve the community we exist to serve.

You all can see (and distribute under the same terms as Apache) the patch and some quick reporting scripts I slapped together. Feel free to email me with any questions!


