CC + GSoC: Integration is the Word

nathan, March 21st, 2011

Creative Commons is once again participating in the Google Summer of Code, an opportunity for students to spend the summer writing open source software. Students have the benefit of being mentored by established open source developers (we think ours are pretty good), and organizations develop their network of contributors. We’ve participated since 2006, with mixed success. Some projects go nowhere, while others, like last summer’s work on the OpenOffice.org plugin, have far exceeded our expectations.

We have a lot of ideas, and this year we’re looking for students interested in working on integrating Creative Commons into the larger ecosystem. Creative Commons licenses have revolutionized sharing on the web by reducing the friction between creators and users of works. We want to further reduce that friction by integrating CC license selection, display, and content discovery into other applications and tools. In the past we’ve had students develop extensions and add-ons for OpenOffice.org/LibreOffice, Drupal, WordPress, and Banshee. There’s some good code there, and we love the work that’s been done. We want this year’s prospective students to think about what applications they use that could integrate CC license selection or — at least as importantly — content discovery. Use our existing code as a starting point for ideas, and craft your proposal to tell us how your project will integrate CC-licensed content into users’ daily lives.

Not a student? We’ve started a hit list of applications on the wiki, and if we’re missing your favorite, add it along with some thoughts about how you’d like to use CC.

I’m looking forward to a great GSoC this year, and can’t wait to see the great ideas that come from students and the community!


Caching Mediawiki with Varnish

nkinkade, March 18th, 2011

We run a few instances of Mediawiki, most notably the CC wiki. The machine that runs the CC wiki is not a powerhouse, but it should certainly have enough capacity to handle the amount of traffic the CC wiki receives. However, for quite some time I’ve noted that the CPU usage on the machine is fairly high, and from time to time the system will bog down and become nearly (sometimes totally) unresponsive. It has been somewhat hard to pinpoint the exact cause of the intermittent issues, first because they are so intermittent, and second because there has never been a trace of evidence in the logs as to what might have happened.

Since the CC wiki is the main service that runs on that machine, I decided to start there. We run Varnish on all our servers, so I took a look at some Varnish stats using — ta-da — varnishstat. It turned out that Varnish was mostly useless on that machine, with a hit-rate ratio of maybe 1 or 2 percent, sometimes approaching 0. This makes sense, since by default Varnish doesn’t cache requests that arrive with cookies, and at the very least Google Analytics cookies will arrive with virtually every request to a CC site. Varnish shouldn’t have to care about Analytics cookies, but it definitely needs to care about any login or session-related cookies from Mediawiki.
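If you want to check the same thing on a machine of your own, the raw hit and miss counters are easy to pull out of varnishstat’s one-shot mode (a quick sketch; counter names vary slightly between Varnish versions):

varnishstat -1 | egrep 'cache_(hit|miss)'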

Doing a bit of searching I found that even Mediawiki.org has a page about using Varnish to cache. However, their configuration doesn’t take into account extraneous cookies like those from Google Analytics. I also eventually stumbled across a VCL file that Wikia.com apparently used to use, which takes into account the possibility of cookies other than Mediawiki’s being present. Between the examples I found online and some thorough testing of cookies in Mediawiki, I arrived at a configuration that I feel will allow many CC wiki requests to be cached that otherwise wouldn’t have been, without risking caching any per-user, session, or logged-in pages. Relevant snippets:

sub vcl_recv {
[...]
        if ( req.http.host ~ "wiki(-staging)?\.creativecommons\.org" ) {
                # If this is just an anonymous request with no session-related
                # cookies, then cache the page. Unsetting the cookie will allow
                # us to do this.
                if ( req.http.Cookie !~ "(session|UserID|UserName|LoggedOut)" ) {
                        remove req.http.Cookie;
                        return(lookup);
                }
        }
}

sub vcl_fetch {

        if ( req.http.host ~ "wiki(-staging)?\.creativecommons\.org" ) {
                if ( ! beresp.http.Set-Cookie ) {
                        set beresp.ttl = 120s;
                        return (deliver);
                }
        }

}
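One quick way to sanity-check a policy like this from the outside is to request a page while carrying only analytics-style cookies and look at the Age header on the response; an object served from Varnish’s cache will normally come back with a non-zero Age. A minimal sketch in Python (the page and cookie values are just examples; run it twice and look at the second response):

import urllib2

# Pretend to be a visitor carrying only Google Analytics cookies.
req = urllib2.Request('http://wiki.creativecommons.org/Main_Page')
req.add_header('Cookie', '__utma=1.2.3.4; __utmz=5.6.7.8')
resp = urllib2.urlopen(req)

# Age > 0 suggests Varnish served the page from its cache.
print resp.headers.get('Age', '0')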

Our hit-rate ratio is still not exceedingly high, hovering between perhaps 30% and 50%. However, what has really gone down significantly is the CPU usage and load average of this machine. For the two weeks prior to making these changes the overall average CPU usage was 68.09%. For the two weeks after the change it went down to 41.92%, a relative drop of nearly 40%. Load average went down as well, but not as dramatically, because it was never consistently high to begin with. However, you can see a marked decline if you look at the Cacti stats for that machine, setting the dates appropriately; the change happened on March 3, 2011.


Supporting tools for decentralized metadata

nathan, March 16th, 2011

Over the past couple of years Creative Commons has built DiscoverEd, a prototype search and discovery tool. We built DiscoverEd to explore how search for open educational resources (OER) could be improved through the use of decentralized metadata. But DiscoverEd was never an end point. DiscoverEd is one of what we hope will be many applications developed to leverage decentralized, structured data about resources on the web. (Our license deeds are another application that uses metadata published with works, in that case to provide attribution for re-users.) Recently we’ve been thinking about tools that could be developed to complement DiscoverEd to create a rich and compelling ecosystem for decentralized metadata for educational resources.

The use of decentralized metadata to drive discovery allows creators and curators to publish information about works without relying on a central authority, and allows developers to utilize that data without seeking permission from a gatekeeper. However, self-publishing requires a certain degree of technical expertise from creators and curators. Two tools can help ease this burden and aid deployment of the necessary metadata. A Validator would help publishers and curators understand how their resources are ingested and processed by DiscoverEd (and other tools). A Curation Tool would allow users to identify resources — individually, as an ad hoc group, or as part of an institutional team — and label them with quality, review, or other metadata.

The Validator tool would allow users to enter a URL to be checked, and would return details of what information DiscoverEd or other software could extract. The results would also link to examples and common problems encountered when publishing metadata: for example, how to publish information about the education level and subject matter of a resource, or about which resources were remixed to create the new one. A self-service tool would allow users to repeatedly check the state of their resources, so they can understand how changes made to their site impact the way others interact with it. Such a tool is essential to scale adoption beyond the level possible when each publisher requires hands-on assistance.
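To make this concrete, here is a rough sketch of the kind of check a Validator might run. This is hypothetical code, not DiscoverEd’s implementation; it assumes rdflib with a working RDFa parser, and the predicate URIs are illustrative guesses rather than DiscoverEd’s actual configuration:

import rdflib

# The "core" fields from DiscoverEd's default configuration; the
# predicate URIs here are illustrative guesses.
CORE_FIELDS = {
    'license': rdflib.URIRef('http://creativecommons.org/ns#license'),
    'subject': rdflib.URIRef('http://purl.org/dc/terms/subject'),
    'education level': rdflib.URIRef('http://purl.org/dc/terms/educationLevel'),
    'language': rdflib.URIRef('http://purl.org/dc/terms/language'),
}

def validate(url):
    """Report which core fields are discoverable in a page's RDFa."""
    graph = rdflib.Graph()
    graph.parse(url, format='rdfa')
    subject = rdflib.URIRef(url)
    for label, predicate in CORE_FIELDS.items():
        values = list(graph.objects(subject, predicate))
        if values:
            print '%s: %s' % (label, ', '.join(str(v) for v in values))
        else:
            print '%s: missing' % label

validate('http://example.org/my-open-textbook/')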

The Validation tool would also be integrated with DiscoverEd. DiscoverEd utilizes decentralized metadata to improve its search index, and allow users to search by particular facets, such as subject, education level, or language. When it does not have metadata for one of the “core” fields (education level, subject, license, language in the default configuration), it displays a help icon to indicate that some piece of information is missing. After initial development is complete, the help icons will be linked to the validation tool so that users and publishers alike can get immediate feedback about what’s missing and what’s there.

The Curation Tool would be a general-purpose piece of software which would allow users to identify works and annotate them with additional information. We imagine that common annotations might be that a resource passed some quality review, aligns to a particular standard, or is simply “liked”. Just as social bookmarking tools like Delicious allow users to make a list of resources, the Curation Tool would allow users to create lists, identifying why a particular resource is in the list, and possibly adding metadata not provided by the publisher. For example, a user might make a list of resources which they have reviewed for quality, and identify which Common Core standard each conforms to. The tool would allow users to collaborate on lists as well. All lists would be public, and published in a way that allows DiscoverEd to ingest the information collected. The Curation Tool would be open source software, so users could download a copy and run it for their own school or professional society, if they so desire.

We think that the development of supporting tools can help advance the adoption of decentralized, structured data for educational resources. Are there simple ideas we’ve missed? Twists on these we should take into account? Leave your comments below.


Your own python egg baskets / package repositories

cwebber, February 14th, 2011

At Creative Commons we sometimes have python packages that we use a lot but which aren’t generally useful enough to put on PyPI (such as cc.license and cc.licenserdf). Luckily, as long as you and your organization don’t have a problem with having your packages available publicly, it’s fairly easy to put up a public egg basket (that is, a simple repository to put your python eggs/packages in). And doing so has several advantages:

  • You can avoid cluttering up PyPI with packages that have a very marginal or internal audience. cc.licenserdf matters a lot for our site, but is probably worthless to anyone searching for “RDF” tools on PyPI.
  • Putting a package up on PyPI suggests a certain amount of responsibility to keep it updated and keep its API stable. If it’s for internal use, maybe that responsibility is unwanted or unneeded.
  • Using proper python packaging means you can take full advantage of python’s ecosystem of installation tools, including useful dependency resolution.
  • Making packaged releases encourages a certain amount of responsibility toward your internal dependencies, reducing the “cowboy coding” factor (fun, but often not good in an environment that requires stability).

So, how to do it? The first step is to make a directory that’s statically served by Apache. Ours is at http://code.creativecommons.org/basket/. The directory has the sticky bit set so that everyone on the server can write to it without clobbering each other’s files, kind of like /tmp/ on most GNU/Linux installs. Everyone puts their tarballs up here. As for your own (or your organization’s own) packages, I think it’s fine to put all your eggs in just one basket like this. :)
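Setting such a directory up is quick; something along these lines (the path and group name are just examples for an Apache-served directory):

mkdir /var/www/basket
chgrp dev /var/www/basket
chmod 1775 /var/www/basket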

But what should those packages look like?

Generally, at the top level of your package you have a setup.py that looks something like:

from setuptools import setup

setup(
    name='licenseclarity',
    version='0.1',
    packages=['licenseclarity'],
    description="Tools to clarify any licensing issue you have ever had",

    author='John Doe(ig)',
    author_email='johndoe-ig@example.org',
    license='MIT',

    # ... etc ...

    # Put your dependencies here...
    install_requires=[
        'setuptools',
        'dependency1',
        'dependency2',
        ],

    # And a link to your basket here
    dependency_links=[
        'http://code.creativecommons.org/basket/',
        ],
    )

Obviously, replace these values with ones that make sense for your package. The main things we’re talking about here are install_requires and dependency_links. Replace install_requires with whatever list of dependencies your package requires, and put your own egg basket in the dependency_links list. After that, set whatever other attributes of the setup() function make sense for your package (entry_points maybe? zip_safe=False is a nice one).

Lastly, maybe now that you have an egg basket, you’d like to know how to make some eggs to put in that basket. Pretty easy once you have setup.py! All you really need to do is this:

python setup.py sdist

Assuming things run well, do an `ls` into your dist/ directory. Hey, what do you know! There’s a tarball there. cp or scp or (heaven forbid) ftp it to your egg basket, and there you go. You just made a release! Maybe the next time you do it you’ll want to increment the version number.
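Once the tarball is in the basket, anyone can install it by pointing the usual tools at the basket with their find-links options. For example (assuming the hypothetical licenseclarity release from above is actually up there), either of these should work:

easy_install -f http://code.creativecommons.org/basket/ licenseclarity
pip install -f http://code.creativecommons.org/basket/ licenseclarity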

Appendix: Why you should always list your dependencies in your package

Finally, I’d like to make a comment on only using buildout’s config file or pip’s requirements file to declare requirements: it’s completely crazy, don’t do it! Well okay, it’s not completely crazy, but you really should fill out the install_requires section in setup.py regardless of whether you are also using those tools. There are a couple of reasons for this:

  • Recursive dependencies: maybe someone will use your module as a library, and then they’ll have to declare your dependencies all over again if you don’t put them in install_requires. Let python’s package management tools handle the dependency graph logic, that’s where it belongs.
  • What happens if you build your tool in buildout and someone wants to use virtualenv/pip or vice versa? Both of these tools will check the install_requires section anyway, so be nice and fill it out!

Of course, there are some awesome things that buildout and the pip requirements file can do, too; for example, they can install dependencies from a VCS, which is useful in particular cases, especially for certain in-development packages. By all means, use these tools to do that (and lots of other cool things, because both buildout and virtualenv/pip are pretty awesome). Just be a good packager and also fill out install_requires.

Now you have your own egg basket (which is completely vegan), you know how to make packages to put in it, and everyone is happy. Hooray! Today we learned things (maybe).

Thanks to Asheesh Laroia for his detailed feedback on this post.


Upgrade to Debian Squeeze and Mediawiki woes

nkinkade, February 10th, 2011

Just a few days ago Debian released Squeeze as the new stable version. I decided to test the upgrade on one or two of CC’s servers to see how it would go. The upgrade process was standard and went without problems, as one comes to expect with Debian. The trouble didn’t show itself until I noticed that one of our sites running on Mediawiki had apparently broken.

I narrowed the problem down to several extensions. Upgrading to Squeeze brought in a new version of PHP, taking it from 5.2.6 (in Lenny) to 5.3.3. PHP was emitting warnings in the Apache logs like:

Warning: Parameter 1 to somefunction() expected to be a reference, value given in /path/to/some/file.php on line ##

Looking at the PHP code in question didn’t immediately reveal the problem to me. I finally stumbled across PHP bug 50394. A specific comment on that bug revealed that the issues I was seeing were not a bug, necessarily, but the result of the way PHP 5.3.x handles a specific form of incorrect coding.

In summary, it turns out the problem is related to Mediawiki hooks and its use of the call_user_func_array() PHP built-in function. The function takes two arguments: a user function name, and an array of arguments. If the called function expects some of the arguments to be passed in by reference, then each element of the passed array must be explicitly marked as a reference. For example, this is correct:

function lol( &$var1, $var2 ) { /* do something */ }
$a = 'foo';
$b = 'bar';
$args = array( &$a, $b );
call_user_func_array( 'lol', $args );

However, you will get a PHP warning, and a subsequent failure of call_user_func_array(), if $args is defined like this (missing the & before $a):

$args = array( $a, $b );

Interestingly, the “correct” way of handling this case, where the callback function expects referenced variables, happens to be deprecated itself as a form of call-time pass-by-reference, and the call_user_func_array() documentation states this:

Referenced variables in param_arr are passed to the function by reference, regardless of whether the function expects the respective parameter to be passed by reference. This form of call-time pass by reference does not emit a deprecation notice, but it is nonetheless deprecated, and will most likely be removed in the next version of PHP.

As far as I can tell, this deprecated method is the only way to handle this, yet PHP may drop the functionality. Presumably another method will replace it before that happens, but the ambiguity at the moment leaves one wondering how to code for this properly without risking that the code will break in a future release of PHP. I suppose the only sure way is to make sure that your callback doesn’t require any referenced variables. I’d be happy for someone to point me to the right way to handle this, if for some reason my research just failed to produce the correct method.

I found this breakage in the following extensions, but presumably it exists in many more:

ReCAPTCHA
RecentActivityNotify
SpamBlacklist

The fix for the ReCAPTCHA extension was easy, since it’s published on the extension’s page. For the other extensions, I investigated the places where this problem was occurring and removed the references from the function definitions, but not before poking around a bit to make reasonably sure that the references weren’t fully necessary.

Lesson: use caution when doing any upgrade that moves you from PHP 5.2.x to 5.3.x. Google searches reveal that this issue is rife not only in Mediawiki, but also in Joomla!, and presumably in any other CMS or framework that makes use of call_user_func_array().


Follow your nose for translations

nathan, February 7th, 2011

One of our goals is to continue to make the licenses more useful as self-describing resources. The license deeds have described the licenses (using CC REL) for quite a while. Last year we started marking up the license name, so software could dereference the license and show the human-readable name to users. Last month we added support for the identifiers (short names), as well. While working with OpenAttribute, I realized that one thing we weren’t doing well was scoping our assertions. In RDFa, the default subject of an assertion is the URI of the current page. That meant that if you followed a link to a specific translation of a license (such as the French translation of CC BY 3.0 Unported), the RDFa was actually describing that translated page rather than the license itself.

It’s a subtle but important point: the canonical license URI is the “bare” URI, without any translation component; for example, http://creativecommons.org/licenses/by/3.0/, and not http://creativecommons.org/licenses/by/3.0/deed.fr. At the same time I realized that while the license translations link to one another, that relationship was not described. To improve this situation, we’ve made three changes to the license deeds (all in the RDFa, not visible to humans browsing the pages).
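You can see the scoping issue for yourself by parsing the RDFa on a translated deed and checking which subjects the triples attach to. A minimal sketch, assuming rdflib with a working RDFa parser:

import rdflib

graph = rdflib.Graph()
graph.parse('http://creativecommons.org/licenses/by/3.0/deed.fr', format='rdfa')

# Print every subject the deed makes assertions about. Before the fix
# these hung off .../by/3.0/deed.fr; with explicit about= attributes
# they attach to the canonical .../by/3.0/ URI.
for subject in sorted(set(graph.subjects())):
    print subject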

The choice of vocabulary to describe the translations wasn’t obvious; an inquiry on the semantic-web mailing list revealed no clear winner, so we wound up choosing one that seemed to best fit the semantics of the license summaries (to be clear, these assertions only apply to the summary of the license — the “deed” — and not the actual text of the license). It’s possible we’ll revise this in the future, but one of the great things about RDFa is that we don’t have to choose just one vocabulary; if we find one that works better, we can easily publish assertions using both, easing the transition for any tools using the RDFa.


February 2011 Tech Update

nathan, February 4th, 2011

Highlights from January, 2011:


HTTPS now available on creativecommons.org

nkinkade, February 4th, 2011

A week or two ago we received a request to add HTTPS support to creativecommons.org. A reasonable request in any event, but the specific impetus was a developer whose application uses CC’s Partner Interface to integrate CC licensing. Their interface is SSL-enabled, so accessing the Partner Interface over a normal, unencrypted channel was giving their users browser warnings about accessing unencrypted data. I’m happy to announce that creativecommons.org is now available over an encrypted connection. Thanks to Peter Dietz for prodding us to implement this.


Merging old and new CC REL schemas

nathan, February 4th, 2011

When the RDF schema that would become CC REL was first developed, it was published at http://web.resource.org/cc. A few years ago, as we were codifying some new features of CC REL (specifically how to specify attribution information), we had the first of several realizations about authority: Creative Commons — creativecommons.org — is the canonical source of information about CC licenses, and as such is the place we should be publishing information about how to use them. We started publishing the schema at http://creativecommons.org/ns, and using that as the namespace for the RDFa we generated.

Late last year the question of “which namespace is authoritative” arose, and I realized we’d missed an important step: no one using the web.resource.org address would be aware of the new namespace if they weren’t looking for it. As of late last week, that’s been corrected. Aaron graciously started redirecting http://web.resource.org/cc to http://creativecommons.org/ns, which was the first step. We’ve also added equivalence assertions between the two, so that an agent looking at the schema will see that the old and new terms have the same semantic meaning (for example, a License in the old schema, http://web.resource.org/cc/License, is declared to be the equivalent class of License in the new schema, http://creativecommons.org/ns#License).
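For example, here is what that class equivalence looks like expressed as triples; a small sketch using rdflib (the schema itself publishes these assertions as RDFa):

import rdflib
from rdflib.namespace import OWL

graph = rdflib.Graph()
old_license = rdflib.URIRef('http://web.resource.org/cc/License')
new_license = rdflib.URIRef('http://creativecommons.org/ns#License')

# An agent reading the schema can infer that the two classes mean the
# same thing.
graph.add((old_license, OWL.equivalentClass, new_license))
print graph.serialize(format='n3')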

These updates are now live in the CC REL schema (which is incidentally primarily described as RDFa, with an RDF-XML version extracted using an automated tool).


More helpful 404 pages

cwebber, January 28th, 2011

This is one of those small features that tend to go into the license engine running on creativecommons.org: helpful, but not too noticeable if not pointed out. I usually do a pretty bad job of making note of these when they go out, but this time, I’m doing better!

Even people who don’t know anything about HTTP know that a 404 status code on the web somehow means that the thing you were looking for isn’t actually there. How frustrating! But if it’s not there, maybe we have enough information to help you find what you actually wanted.

That’s the idea behind the work that went into Issue 255: “Smart” 404 pages. Maybe we didn’t find a license (or public domain tool) under the URL you put in, but we might be able to help you find a license that does exist. For example, licenses listed under /licenses/ on creativecommons.org are parsed out like /licenses/{code}/{version}/ or /licenses/{code}/{version}/{jurisdiction}/. Knowing that, we can suggest a list of licenses that do exist and that someone plausibly meant when their requested URL doesn’t match anything.
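Here is the flavor of that logic as a rough sketch (hypothetical code, not the license engine’s actual implementation):

import re

# A few of the license codes the engine knows about.
KNOWN_CODES = ['by', 'by-sa', 'by-nc', 'by-nc-sa', 'by-nd', 'by-nc-nd']

def suggestions(path):
    """Given a 404'd path, return paths of licenses that do exist."""
    match = re.match(r'^/licenses/([^/]+)/([^/]+)(?:/([^/]+))?/?$', path)
    if match is None:
        return []
    code, version, jurisdiction = match.groups()
    if code in KNOWN_CODES:
        if jurisdiction:
            # Known code, unrecognized jurisdiction: offer the unported
            # license at the same version.
            return ['/licenses/%s/%s/' % (code, version)]
        return []
    # Unrecognized code: offer every known code at that version.
    return ['/licenses/%s/%s/' % (c, version) for c in KNOWN_CODES]

print suggestions('/licenses/by-nc/3.0/xx/')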

The pages mostly look like a normal creativecommons.org 404 page, but with just a bit more contextually helpful information (the “were you looking for” section). And, of course, they still return a 404 status code!


