I’ve been considering license integration into a personal project of mine and thoughts of that have spilled over into work. And so we’ve been talking at Creative Commons recently about the current methods for licensing content managed by applications and what the future might be. The purpose of this post is to document the present state of licensing options. (A post on the future of licensing tools may come shortly afterward.)
Present, CC supported tools
To begin with, there are these three CC-hosted options:
- CC licensing web API — A mostly-RESTful interface for accessing CC licensing information. Some language-specific abstraction layers are provided. Supported and kept up to date. Lacking a JSON layer, which people seem to want. Making a request for every licensing action in your application may be a bit heavy.
- Partner interface — Oldest thing we support, part of the license engine. Typical case is that you get a popup and when the popup closes the posting webpage can access the info that’s chosen. Still gets you your chooser based interface but on your own site. Internet Archive uses it, among others.
All of these have the problem that the chooser of CC licenses is only useful if you want exactly the choices we offer (and specifically the most current version of the licenses we provide). You need to track those changes in the database anyway, which means you either are not keeping track of version used or you are and when we change you might be in for a surprise.
Going it alone
So instead there are these other routes that sites take:
- Don’t use any tools and store license choices locally — What Flickr and every other major option does: reproduce everything yourself. In the case of Flickr, the six core licenses at version 2.0. In YouTube, just one license (CC BY 3.0). That works when you have one service, when you know what you want, and what you want your users to use. It doesn’t work well when you want people to install a local copy and you don’t know what they want to use.
- Let any license you want as long as it fits site policy — and you don’t facilitate it, and it gets kind of outside the workflow of the main CMS you’re using… wiki sites are an example of this, but usually have a mechanism for adding a license to footer of media uploaded. The licenses are handled by wiki templates, anyone can make a template for any license they choose.
None of those are really useful for software you expect other people to install where you want to provide some assistance to either administrators of the software who are installing it to be used or where you want the administrator to give the user some choice or choices relevant to that particular site.
The liblicense experiment
This brings us to another solution that CC has persued:
- liblicense — Packages all licenses we provide, give an api for users to get info and metadata about them. Allows for web-request-free access to the cc licenses. It doesn’t address non-CC licenses, however, and is mostly unmaintained.
So, these are the present options that application developers have at their disposal for doing licensing of application-managed content. There’s a tradeoff with each one of them though: either you have to rely on web requests to CC for each licensing decision you make, you go it alone, or you use something unmaintained which is CC-licensing-specific anyway. Nonetheless, cc.api and the partner interface are supported if you want something from CC, and people do tend to make by with doing things offline. But none of the tools we have are so flexible, so what can software like MediaGoblin or an extension for WordPress or etc do?
There’s one more option, one that too my knowledge hasn’t really been explored, and would be extremely flexible but also well structured.
The semantic web / linked data option?
It goes like this: let either users or admins specify licenses by their URL. Assuming that page self-describes itself via some metadata (be it RDFa, providing a rel=”alternate” RDF page in your headers, or microdata), information about that license could be extracted directly from the URL and stored in the database. (This information could of course then be cached / recorded in the database.) This provides a flexible way of adding new licenses, is language-agnostic, and allows for a canonical set of information about said licenses. Libraries could be written to make the exctraction of said information easier, could even cache metadata for common licenses (and for common licenses which don’t provide any metadata at their canonical URLs…).
I’m hoping that in the near future I’ll have a post up here demonstrating how this could work with a prototypical tool and use case.
Thanks to Mike Linksvayer, for most of this post was just transforming a braindump of his into a readable blogpost.2 Comments »
Have you ever tried to implement CC licensing into a publishing platform? Would it have been helpful to know how other platforms have done it?
I’ve just added a collection of technical case studies on the CC wiki looking at how some major adopters have implemented CC. The studies look at how some major platforms have implemented CC license choosers, the license chooser partner interface, CC license marks, search by license, and license metadata.
Several of the case studies are missing some more general information about the platform, so feel free to add your own content to the pages. Also, everyone is welcome to add their own case studies to the CC wiki.
Here is a list of new technical case studies:Comments Off
As announced on the CC blog earlier this month, we’re working on a new tool to complement CC0, the Public Domain Mark (PDM). We have a set of open issues for PDM (and related improvements) that we’re working through, including developing the marking metadata that will be generated (Issue 640). I’ve put together a set of examples in the wiki that we’re looking for feedback on.
There are a couple of minor variations from past practice to note:
- We support labeling both the creator of the work (
dct:creator), as well as the person who identified it as being in the public domain and made it available. We chose
dct:publisherfor the latter, as it’s defined as “An entity responsible for making the resource available.”
- We will support the scenario where there are two license statements: PD Mark and CC0. This is generated in the case that the labeler chooses to waive rights they may have incurred in the work, as part of restoration, digitization, etc. After some discussion, we’re simply using two license assertions for this. There are arguably two subjects at work here — the actual work and the digital representation of it — but in the interest of simplicity and consistency with our past recommendations, we’re treating them as one.
- We’re planning to support non-binding usage guidelines on the deed at launch time, but not through the chooser. The example metadata is what the deeds would consume to display that link. I didn’t find a good existing predicate for this, so I propose we use
cc:usageGuidelinesand define it as a refinement of
† In the interest of completeness, I’m also planning to define
cc:morePermissions, used for CC+, as a refinement of
dc:relation. Comments Off
Day 1 and Day 2 of the DiscoverEd code sprint turned out to be very productive, and the third and final day didn’t disappoint either. All teams were able to complete or make significant contributions to useful new features.
Our team that originally developed support for arbitrary metadata and a plugin architecture to pull in external data in Day 1 and 2 switched to working on integrating branches with the main DiscoverEd source tree. They spent some time fixing bugs in new code that connects the RDFa parser with the Jena store, so that it could be merged with other branches in the code repository, including their work on the metadata plugins, the master branch, and the provenance work previously begun by Creative Commons. After merging those branches, they also took some time to update their work from Day 1 and Day 2 to store the provenance of the metadata. Towards the end of the day they began working on support for running all the DiscoverEd tests with a script. After today they plan on doing some housekeeping work, namely creating documentation about new plugin support in the CC wiki.
The “flexible query interface” team continued to debug problems with DiscoverEd. They had identified the bugs encountered in Days 1 and 2, so they worked on transitioning away from an old metadata-writing API that they suspect doesn’t work with the current version of Nutch. It appears this issue may be related to the upgrade from Nutch 0.8 to 1.1. Going forward, they plan on migrating code to the new API, and testing if it resolves the problems.
The “user generated metadata” team decided to divide and conquer smaller tasks of their project. They mocked up the form interface and began work on building it into a JSP. They were also building tests for the back end of DiscoverEd, which basically test that once you add a tag, you can get the resource ID and the tag back through a search, and that it’s been added into the Jena store such that when the Nutch index is run again the tag will be added to the Lucene index. They worked through a lot of merge conflicts, which had stalled development.
Creative Commons thanks AgShare project funders (The Gates Foundation), MSU, vuDAT, MSU Global, and to the participants in the sprint for making all of the contributions to DiscoverEd over the past three days possible.
What makes DiscoverEd exciting is that, while most search engines use algorithmic analyses of resources alone for search, DiscoverEd can incorporate facts and semantic information provided by the publishers or curators, enabling more useful search. Structured data in standardized formats such as RDFa is a powerful way for otherwise unrelated projects and resource curators to cooperate and express facts about their resources in the same way so that third-party tools (like DiscoverEd) can use that data for other purposes (like search and discovery). We look forward to deploying this innovative tool for our AgShare partners to enable search and discovery of educational resources about agriculture and hope that it’s found useful in other contexts as well.Comments Off
The team working on a plugin architecture for DiscoverEd finished their project, which was to build an extension point that enables plugins to pull arbitrary metadata from external sources. At the beginning of the day they had to finish some tests for the new support for arbitrary metadata added in Day 1, but by the end of the day were able to merge their new functionality back into the DiscoverEd repository. They were even able to build a small proof-of-concept plugin that pulled data from the delicious.com API into the Jena store, which later makes its way into the Lucene index. Tomorrow they’ll start looking at developing support for custom vocabularies and ontologies (such as AGROVOC).
The flexible query syntax team were able to commit their changes to the DiscoverEd code that makes it so you can configure arbitrary metadata queries. For example, if you had been running a DiscoverEd instance to search a feed of OpenCourseWare resources that uses Dublin Core metadata in RDFa format to express information about those resources, and wanted to enable searches on metadata which weren’t supported in DiscoverEd, you would have had to write Java code to enable that. With this change you can now just create a configuration file mapping an arbitrary tag to that metadata, which will store that metadata along side the tag in the Lucene index. This allows Nutch (which searches the Lucene index) to search for those custom tags in the config file. This code was pushed to a branch while the team spent the rest of the day chasing down a bug where only a subset of metadata gets successfully added to the Lucene index in their working environment.
The “user-generated metadata” team was able to create a test which adds a new resource into the Jena store, adds a tag to that resource, and verifies the tag was added… and it passes! Their next step is to create a test for a similar process in the Lucene index, such that a tag added to a new resource in the Jena store gets successfully added to the index and one can search for that tag through Nutch. The first test passing was a prerequisite for starting work on the second test.
Work continues into the third and final day, so look for a wrap-up report from Day 3 soon. The various repository commits can be browsed on Gitorious.1 Comment »
For those of you not familiar with DiscoverEd, it first began as a Creative Commons project to show how structured data can enable and improve search and discovery of educational resources on the web.
Search and discovery of education resources has been a hot topic within the OER (Open Educational Resources) community for some time. Given the decentralization of high-quality educational resources free to use, remix, and redistribute existing on the web, the question has been “How can learners and instructors find and discover these resources in useful ways?” CC has previously done significant work in the metadata domain, having developed a W3C-published vocabulary for publishing licensing metadata and being involved in early efforts to being semantic web concepts to scientific data. DiscoverEd not only represents a continuation of CC’s efforts to make data interoperability work, but insofar as it deals in metadata about open content, it’s an interesting synthesis of open, standardized data formats and open content.
Development for DiscoverEd has recently been supported by AgShare, an MSU-led project funded by the Gates Foundation:
The aim of the AgShare planning and pilot project is to create a scalable and sustainable collaboration of existing organizations for African publishing, localizing, and sharing of teaching and learning materials that fill critical resource gaps in African MSc agriculture curriculum and that can be modified for other downstream uses.
AgShare involves institutional collaboration for the modification and dissemination of very useful open educational content internationally. Creative Commons is supporting the project by developing, documenting, and supporting an instance of DiscoverEd educational search that will provide learners in Africa with a method to find or discover educational resources curated by participating organizations.
Which takes me to where we are today: CC and our AgShare partners at MSU put together a DiscoverEd code sprint to connect our AgShare developers with the community of programmers and downstream users who are interested in educational resource search and discovery using structured data to find common ground and areas for collaboration. Most participants have listed themselves on the sprint wiki-page.
After brief introductions, everyone spent some time instantiating the codebase, getting access to version control, sharing commonly used terms, and settling other coordination issues. That took up the first half of the day, and after lunch at a great Ethiopian restaurant, we returned to begin work on new code. Given the unfamiliarity many of the developers had with the codebase, the original plan for pair programming was abandoned to allow larger groups:
One group focused on developing code to allow users to add missing or additional metadata about resources. This would enable arbitrary tags to be injected into both the Jena triple-store and Lucene documents from the Nuch front-end. This functionality would extend the reach of Nutch’s search beyond what’s found through aggregation, to include metadata provided by users, or a community.
Another group focused on developing a plugin interface for storing and retrieving metadata from external and services and databases. This functionality could potentially supply metadata for resources without any metadata, or could supplement existing metadata with additional data from other services and databases. A splinter group of that original team took time to analyze the OpenCalais API to see if it could be integrated with the above DiscoverEd functionality to provide back useful metadata.
The last team worked on getting up to speed on the DiscoverEd architecture and creating instructions for getting it running in an Ubuntu virtual box environment on Windows, as well as test-driving new documentation for hacking DiscoverEd. This team also spent time mapping out the work needed to implement more flexible query filtering. This turned out to be an invaluable exercise as they discovered inefficiencies and unnecessary code in the DiscoverEd codebase (namely, in how DiscoverEd maps query strings for metadata to Lucene index data), and are in the process of planning for a new metadata query syntax.
Day 2 will likely be an extension of today’s work. Look for another post soon.1 Comment »
This week, development on the plugin proceeded at a faster pace. Shortly after I posted the last report, Nathan Kinkade pointed out the fix to the bug that prevented saving, a simple type error. On the next day, I implemented stylesheet support, hereby adapting three styles I originally made for my defunct microdata plugin, and an admin interface to switch between them (screenshot). Additional more or less notable changes are:
- metadata is not only saved now, but will also be retrieved to populate form fields
- multiple RDFa fixes, machine-readable data should be correct now
- the plugin has a directory structure, earlier versions were just a single file
- there is now a sample file for stylesheet development
- metadata is also saved when the media item is inserted into the post
- the plugin now uses the Creative Commons API to get the current license version
I consider this version of the plugin not finished, but functional enough for testers, who are encouraged to check out the Git repository. For the coming week, I will look into the issue surrounding alternate content and plugin directories and proceed to polish the existing features.Comments Off
Google Summer of Code 2008 approaches its end, as less than forty-eight hours are left to submit the code that will then be evaluated by mentors, therefore it is fitting to pause for a moment and sum up the work that has been done with regard to the license-oriented metadata validator and viewer and to confront it with the original proposal for the project.
A Web application capable of parsing and displaying license information embedded in both well-formed and ill-formed Web pages has been developed. It supports the following means of embedding license information: Dublin Core metadata, RDFa, RDF/XML linked externally or embedded (utilising the data URL scheme) using the link and a elements, and RDF/XML embedded in a comment or as an element (the last two being deprecated). This functionality has been proven by unit testing. The source code of a Web page can be uploaded or pasted by a user, there is also a possibility to provide a URI for the Web application to analyse it. The software has been written in Python and uses the Pylons Web Framework and the Genshi toolkit. Should you be willing to test this Lynx-friendly application, please visit its Web site.
The Web application itself uses a library called “libvalidator”, which in turn is powered by cc.license (a library developed by Creative Commons that returns information about a given license), pyRdfa (a distiller that generates the RDF triples from an (X)HTML+RDFa file), html5lib (an HTML parser/tokenizer), and RDFLib (a library for working with RDF). The choice of this set of tools has not been obvious and the library had undergone several redesigns, which included removing the code that employed encutils, XML canonicalization, µTidylib, and the BeautifulSoup. The idea of using librdf, librdfa, rdfadict has been abandoned. The source code of both the Web application (licensed under the GNU Affero General Public License version 3 or newer) and its core library (licensed under the GNU Lesser General Public License version 3 or newer) is available through the Git repositories of Creative Commons.
In contrast to the contents of the original proposal, the following goals have not been met: traversal of special links, syndication feeds parsing, statistics, and cloning the layout of the Creative Commons Web site. However, these were never mandatory requirements for the Web application. It is also worth noting that the software has been written from scratch, although a now-defunct metadata validator existed. Nevertheless, the development does not end with Google Summer of Code — these and several new features (such as validation of multimedia files via liblicense and support for different language versions) are planned to be added, albeit at a slower pace.
After the test period, the validator will be available under http://validator.creativecommons.org/.1 Comment »
The source code is located in two dedicated git repositories. The first being validator, which contains the source code of the Web application based on Pylons and Genshi. The second repository is libvalidator, which hosts the files that constitute the core library that the project will utilise. This is the component that the development focuses on right now.
The purpose of the aforementioned library is to parse input files, scan them for relevant license information, and output the results in a machine-readable fashion. More precisely, its workflow is the following: parse the file and associated RDF information so that a complete set of RDF data is available, filter the results with regard to license information (not only related to the document itself, but also to other objects described within it), and return the results in a manner preferable for the usage by the Web application.
pyRdfa seems to be the best tool for the parsing stage so far. It handles the current recommendation for embedding license metadata (namely RDFa) as well as other non-deprecated methods: linking to an external or embedded (using the “data” URL scheme) RDF files and utilising the Dublin Core. The significant lacking is handling of the invalid direct embedding of RDF/XML within the HTML/XHTML source code (as an element or in a comment) and this is resolved by first capturing all such instances using a regular expression and then parsing the data just as external RDF/XML files.
Once the RDF triples are extracted, one can use SPARQL to narrow the results just to the triples related to the licensed objects. Both librdf and rdflib support this language. Moreover, the RDF/XML related to the license must be parsed, so that its conditions (permissions, requirements, and restrictions) are then presented to the user.
The library takes advantage of standard Python tools such as Buildout and nose. When it is completed, the project will be all about writing a Web application that will serve as an interface to libvalidator.Comments Off