Available, yet to be analyzed: some data from which something about CC adoption patterns in some languages and countries might be gleaned

ml, December 9th, 2010

I was recently asked about CC use for works in Arabic. There are no Arabic-language jurisdiction ports available at this time, and measuring adoption of jurisdiction ports is the easiest way to characterize patterns of adoption, as Giorgos Cheliotis has done with interesting results (also see http://monitor.creativecommons.org).

Fortunately there is some other data available that could be used to characterize CC use, irrespective of porting, across some languages and countries. This data is described at http://wiki.creativecommons.org/Metrics/Data_Catalog#License_property_search_engine_API_queries and available from http://labs.creativecommons.org/metrics/sql-dumps/all.sql.gz (warning: approximately 200 megabytes).

Nobody has analyzed this part of the available data on CC adoption yet to my knowledge. Anyone is of course welcome and encouraged to!


LibreOffice and CC OpenOffice Plugin: Good to go

cwebber, December 8th, 2010

Over the summer, our GSOC student Akila Wajirasena gave a wonderful overhaul to our OpenOffice plugin with many improvements, including a slick new user interface, support for public domain tools, and many other cool things. Recently he contacted us having looked into whether or not that plugin works with LibreOffice, a fork of OpenOffice run by the Document Foundation, which has gathered significant interest from certain areas of the free and open source software world recently.

The good news? It seems that it works just fine out of the box:

[Image: CC LibreOffice plugin test]

We’ve only tested it on some limited systems though, so if you do have some problems, please report them here.

We also included some instructions previously on how to fix an issue on GNU/Linux systems where the menu would become inaccessible, and you had to install via the command line to fix it. That seems like it may be fixed in LibreOffice.

One concern, though, is that LibreOffice is trying to use less Java or abandon it altogether. Java extensions will probably still be supported in some way in the future, but it may be something to keep an eye on.

Thanks again to Akila for his work, not only on the plugin enhancements but also for looking into this with LibreOffice.


Tuning TCP on CC’s servers

nkinkade, December 8th, 2010

A couple weeks ago we launched a new rack-mount server, which is kindly hosted by the ISC in their Redwood City, California data center. The sole purpose of this new server is to host static content, mostly i.creativecommons.org, which is probably the busiest domain CC has due to the license icons and badges being served from there.

Upon moving i.creativecommons.org to this new machine I noticed terrible problems with connection timeouts when requesting images. After thrashing around trying to work out why this was happening, I used tcpdump to grab some network traffic on the server and discovered that SYN requests were arriving at the machine and dying right there, with no subsequent SYN-ACK. At that point it was clear that this was not a Varnish or Apache problem, but something at a much lower level. After testing various TCP tweaks in the running kernel I discovered that setting net.ipv4.tcp_max_syn_backlog=2048, up from the default of 256, and turning on net.ipv4.tcp_syncookies seemed to resolve the connection timeout issues.

However, the kernel message log was filled with messages like the following. In fact, one such message was written to the log every minute:

possible SYN flooding on port 80. Sending cookies.

I was confused by this because ostensibly the site was functioning just fine. My understanding was that SYN cookies were only activated when the SYN queue filled up, but as far as I could tell I had increased the depth of the queue sufficiently to avoid that problem. I even tried setting net.ipv4.tcp_max_syn_backlog arbitrarily high to see what would happen. Same result: the site operated fine, but with SYN cookie kernel messages. In my testing I also discovered that disabling net.ipv4.tcp_syncookies would immediately bring back the connection timeout problems. Additionally, netstat revealed that though the site appeared to be functioning correctly, there was still an abnormally large number of ‘failed connection attempts’ listed in the TCP stats.

I went over and over all the TCP settings and just couldn’t figure out what was happening, nor did Google shed any light on this. I then decided to do this:

$ netstat -n | grep SYN_RECV | wc -l

I ran this command many times in a row over a period of time and was surprised to see that the result was nearly always 256, give or take a few. It then occurred to me that that number looked a lot like the default value of net.ipv4.tcp_max_syn_backlog. However, as far as I knew (and know), all of those kernel parameters are supposed to be dynamic, capable of being changed on-the-fly, with sysctl or writing directly to the /proc file system. So I set all my TCP changes in /etc/sysctl.conf and rebooted the machine. Sure enough, since coming back up about a day ago I haven’t seen a single kernel message about SYN cookies. I even decided to just disable SYN cookies altogether based on a recommendation to do so in the default /etc/sysctl.conf file found on Debian systems.
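The same count can be sketched in Python, which is handy if you want to log it over time. This is only an illustration: it assumes the usual Linux netstat -n column layout (state in the sixth column), and the sample output and addresses below are made up.

```python
def count_syn_recv(netstat_output):
    """Count TCP connections in the SYN_RECV state in `netstat -n` output."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed row layout: proto, recv-q, send-q, local, remote, state
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "SYN_RECV":
            count += 1
    return count


# Made-up sample output using documentation-only IP addresses.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.0.2.10:80           198.51.100.7:51234      SYN_RECV
tcp        0      0 192.0.2.10:80           198.51.100.8:40022      ESTABLISHED
tcp        0      0 192.0.2.10:80           203.0.113.5:33211       SYN_RECV
"""

if __name__ == "__main__":
    print(count_syn_recv(SAMPLE))
```

A count that sits right at a suspiciously round number, as 256 did here, is the kind of thing this makes easy to spot.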

The machine is now humming along nicely. For reference here are the TCP parameters I changed. The values were gleaned from various sites while doing extensive research on TCP tuning. Some of the values seem improbable to me, but don’t seem to be having any perceptible negative impact, and were also recommended in TCP tuning articles on more than one site. I went ahead and implemented these settings on the rest of CC’s servers as well:

net.ipv4.tcp_fin_timeout = 3
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_no_metrics_save = 1 
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_max_syn_backlog = 8192
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216 
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.somaxconn = 1024
vm.min_free_kbytes = 65536
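
If you want to double check that settings like these are actually in place after a reboot, something along the following lines might help. It is a rough sketch, not part of our setup: the function names are hypothetical, and it assumes each sysctl key maps to a file under /proc/sys by replacing dots with slashes. Note that it only compares the kernel's current values with the config text; it can't tell you whether an already-running daemon is still using an older value.

```python
from pathlib import Path


def parse_sysctl_conf(text):
    """Parse sysctl.conf-style text into a {key: value} dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            # Collapse runs of whitespace so "4096  87380" == "4096 87380"
            settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key, root="/proc/sys"):
    """Assumed mapping: net.core.somaxconn -> /proc/sys/net/core/somaxconn."""
    return Path(root, *key.split("."))


def find_mismatches(settings, root="/proc/sys"):
    """Return {key: (wanted, live)} for values that differ from the config."""
    mismatches = {}
    for key, wanted in settings.items():
        try:
            live = " ".join(proc_path(key, root).read_text().split())
        except OSError:
            continue  # key not present on this kernel, or not readable
        if live != wanted:
            mismatches[key] = (wanted, live)
    return mismatches


# A few of the values from the list above, as sample input.
SAMPLE_CONF = """\
# TCP tuning
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_rmem = 4096 87380 16777216
net.core.somaxconn = 1024
"""

if __name__ == "__main__":
    print(find_mismatches(parse_sysctl_conf(SAMPLE_CONF)))
```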

XMP FileInfo panel for Adobe Creative Suites 4 and 5 now available!

akozak, December 6th, 2010

This is a special guest post by John Bishop of John Bishop Images.

Prior to Adobe’s Creative Suite 4, adding Creative Commons license metadata via the FileInfo… dialog (found in Photoshop, Illustrator, InDesign and more) meant coding a relatively simple text-based XML panel definition; such a panel has been available from the Creative Commons Wiki since 2007.

Starting with Creative Suite 4, Adobe migrated the XMP FileInfo panel to a Flash-based application, which made adding Creative Commons metadata much more complex: it now requires Adobe’s XMP SDK and the ability to develop applications in Flash, C++ or Java.

After significant development and testing john bishop images is pleased to announce the availability of a custom Creative Commons XMP FileInfo Panel for Creative Suite 4 and Creative Suite 5 – free of charge.

This comprehensive package lets you specify Creative Commons license metadata directly in first-class, industry-standard tools. It places Creative Commons licensing metadata on the same footing as standardized commercial metadata sets like Dublin Core (DC), IPTC and usePLUS, and tightly integrates all the metadata fields required for a Creative Commons license in one panel.

Also included is a metadata panel definition that exposes the Creative Commons license metadata in the mini metadata panels found in Bridge, Premiere Pro, etc. Finally, the package includes a set of templates that can be customized for the various license types and more; these templates can be accessed from Acrobat.

For more information and to download the Creative Commons XMP FileInfo panel visit john bishop images’ Creative Commons page.

Note: The panels are localized and an English-US language file is supplied. To contribute localization files in other languages please contact john bishop images.


WebM encoding CC’s videos

akozak, December 3rd, 2010

I was recently tasked with encoding all of CC’s videos into WebM, a new open video container format.

I started out by creating a table of all of the videos and target quality. Then, I started to track down the highest quality version of each video I could find.

At first I had high hopes of using ffmpeg (with libvpx support) wrapped in a bash script to batch process all of the videos into .webm. Although there are some decent directions for recompiling ffmpeg to enable webm, I ran into trouble (some sort of version conflict) and wasn’t able to complete the process.

So after searching around, I found a few promising leads on the WebM tools page.

Our friends at Miro have come out with Miro Video Converter, which is a useful-looking tool that promises a simple way to encode video for mobile devices and the web. It purports to support webm, but sadly I wasn’t able to test it as they don’t currently develop a version for Linux (only Windows and OSX).

Next I installed the Firefogg extension into Firefox and tested the process by converting one small .mov file into .webm. I was surprised how well it worked, although as far as I could tell it lacked a method to batch-process or queue files (or at least that functionality isn’t within scope for its purpose). It also took an unreasonably long time to transcode the video, so my search continued.

With Arista Transcoder, I found a Linux application that seemed to work. In Ubuntu, setup was as simple as extracting the archive and running python setup.py install --install-layout=deb (provided you have python installed). The Web Browser (Advanced) preset gave me the ability to queue up and transcode the videos into 360p, 480p, and 720p .webm with reasonable transcoding times.

One minor hurdle was encoding the videos into 240p WebM, a resolution not included in the Web Browser (Advanced) preset package. To accomplish that, I modified the JSON preset file to add a 240p WebM option. You can download my modified version of the JSON file in the preset here. My only modification comes at the end. You should be able to just copy my version into ~/.arista/presets and replace the original if it exists.

When we start to embed the files on creativecommons.org/videos, I plan on following the instructions in the Dive Into HTML5 Video chapter.


Type Of: Educational (an idea)

nathan, December 3rd, 2010

I spent most of yesterday in a meeting discussing ways to make search better for open educational resources. Preparing my short presentation for the day, I thought again about one of the challenges of doing this at web scale: how do you determine what’s an educational resource? In DiscoverEd we rely on curators to tell us that a resource is educational, but that requires us to start with lists of resources from curators; it’d be nice to start following links and add things if they’re educational, and move on if they aren’t. If you want to build OER search that operates at web scale, this is one of the important questions, because it influences what gets into the index, and what’s excluded [1]. Note that the question is not “what is an open educational resource”; the “open” part is handled by marking the resource with a CC license. With reasonable search filters you can start with the pool of CC-licensed educational resources, and further restrict it to Attribution or Attribution-ShareAlike licensed works if that’s what you need.

Creative Commons licenses work in a decentralized manner by including a bit of RDFa with the license badge generated by the chooser. But no similar badge exists for OER or educational resources, at least partly because it’s hard to agree on what the one definition of OER is. But what if we just tried to say, “I’m publishing this, and I think it’s educational.” Maybe we can do that. After seeing the Xpert Project tweet about a microformat/RDFa/etc to improve discoverability, I decided to try my hand at a first draft.

<span about="" typeof="ed:LearningResource" xmlns:ed="http://example.org/#">Educational</span>

This tag generates the triple:

<> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/#LearningResource> .

Literally, “this web page is a Learning Resource”. Of course this could be written as a link, image, or even made invisible to the user.
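
To make the mapping from markup to triple concrete, here is a small Python sketch of the prefix expansion involved. It is not an RDFa parser, just an illustration: the function names are hypothetical, and http://example.org/# is still the placeholder vocabulary from the markup above.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def expand_curie(curie, prefixes):
    """Expand a prefixed name such as ed:LearningResource to a full IRI."""
    prefix, _, reference = curie.partition(":")
    return prefixes[prefix] + reference


def type_triple(subject, curie, prefixes):
    """Build the N-Triples-style line implied by typeof on a subject."""
    return "<{}> <{}> <{}> .".format(subject, RDF_TYPE, expand_curie(curie, prefixes))


if __name__ == "__main__":
    # about="" means "this document", hence the empty subject.
    print(type_triple("", "ed:LearningResource", {"ed": "http://example.org/#"}))
```

Swapping in a real vocabulary would only mean changing the namespace the ed prefix points to.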

There is one big question here: what should http://example.org/# actually be? There are lots of efforts to create vocabularies out there, so we should clearly reuse a term from one of those efforts. If we reuse one that defines a hierarchy including things like Course, Lecture, etc, as refinements of Learning Resource, that may also provide interesting information for improving the search experience.

This markup won’t be visible in Google, but it will allow crawlers and software to start determining what people think is educational. And that seems like progress to me. Reasonable first step? Fundamentally flawed? Inexcusably lame? Feedback welcome.

[1] I should note that the question is really “How does someone online say that a resource is educational?” because a) you want to allow people to make the judgement about other resources online, and b) you care about who’s saying the resource is educational. Please pardon my reductionism.


Technical case studies now on the wiki

akozak, November 18th, 2010

Have you ever tried to implement CC licensing into a publishing platform? Would it have been helpful to know how other platforms have done it?

I’ve just added a collection of technical case studies on the CC wiki looking at how some major adopters have implemented CC. The studies look at how some major platforms have implemented CC license choosers, the license chooser partner interface, CC license marks, search by license, and license metadata.

Several of the case studies are missing some more general information about the platform, so feel free to add your own content to the pages. Also, everyone is welcome to add their own case studies to the CC wiki.

Here is a list of new technical case studies:


Orgmode and Roundup: Bridging public bugtrackers and local tasklists

cwebber, November 10th, 2010

So maybe you’re already familiar with the problem. You’re collaborating with other people, and especially if you’re in a free software environment (but maybe even some install at your work) you have some bugtracker, and that’s where everyone collaborates. But on the other hand, you have a life, your own todo systems, your own notes, etc. Even for the tasks that are on the bugtracker, you might keep your own local copy of that task and notes on that task. Eventually things start to get out of sync. Chaos!

Wouldn’t it be great if you could sync both worlds? Keep the notes that are relevant to being public on the public bugtracker, but keep private notes that would just clutter up the ticket/issue/bugreport private. Mesh the public task system with your private task system. Well, why not?

So this was the very problem I’d run into. I have my work bugtracker for here at CC, our install of roundup, and then I have my own TODO setup, a collection of Org-mode files.

There are a lot of things I like about org-mode. It’s in emacs (though there’s apparently a lean vim port in the works), it’s plaintext (which means I can sync across all my machines with git… which I do!), tasks are nested trees / outlines (I really tend to break down tasks in very granular fashions as I go so I don’t get lost), notes are integrated directly with tasks (I take a lot of notes), and it’s as simple as you need it or as complex as you want to get (I started out very simple, now my usage of org-mode is fairly intricate). It also does a good job of spanning across multiple files while still retaining the ability to pull everything together with its agenda, which is useful since I like to keep things semi-organized.

And of course, the relevant file here is all my Creative Commons stuff, which I keep in a file called ccommons.org. There’s a lot of private data in here, but I’ve uploaded a minimalist version of my ccommons.org file.

So! Syncing things. If you open the file in an emacs version with org-mode installed, you’ll notice 4 sections. Two of these are crucial to my setup, but we won’t be using them today: “Events” holds, say, a meeting at X time or travel on certain days; “Various Tasks” contains tasks not related to roundup. Then there are the other two: “Roundup” will collect all the tasks we need to work on, and “Supporting funcs” has a couple of org-babel blocks in Python and emacs-lisp.

Anyway, enough talk, let’s give it a spin. You’ll need a recent version of org-mode and a copy of emacs. Make sure that newer org-mode is on your load-path and then evaluate:

(require 'org)
(require 'org-install)
(require 'ob-python)
(setq org-confirm-babel-evaluate nil)
(setq org-src-fontify-natively t)
(setq org-src-tab-acts-natively t)

Next open up the relevant org-mode file. Move to the “Roundup” line, hit Tab to cycle its visibility, and move to the line that starts with “#+call:”

Now press “Ctrl+c Ctrl+c”. You’ll see it populate with issues from my issues list:

What’s happening here? So we’re executing an org-babel block at point. Org-babel is an org-mode extension that allows you to make blocks of code executable, and even chain from one language to another (it also has some stuff relevant to Donald Knuth’s “literate programming” which is cool but I’m not using here). If we look at the code blocks:

Anyway, there are three code blocks here.

  • ccommons-roundup-parse: uses python to read the CSV file generated by roundup which is relevant to my task list, converts it into a list of two-item lists (task id, task title)
  • ccommons-roundup-insert-func: the function that actually inserts items into our “* Roundup” heading. It checks the ROUNDUPID property to see if that task is already inserted or not. If not, it inserts the task with the appropriate title and ROUNDUPID.
  • ccommons-roundup-insert: the actual block we end up invoking. It binds together the data from ccommons-roundup-parse with a function call to the function defined in ccommons-roundup-insert-func.

You can evaluate it multiple times; it’ll only insert new tasks that aren’t on your list currently. Now you can take notes on your tasks, schedule them for various dates, make subtasks, etc. When you’re ready to close out a task, close it out both on the ticket and in org-mode. If you want to use a similar setup for org-mode, I think it’s easy enough to borrow these methods and just change the CSV URL to whatever URL is appropriate for your user’s tasks.
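
For anyone adapting this, the parse-then-insert-only-new logic can be sketched in plain Python roughly as follows. This is not the code from my org file: the function names are made up, and the "id" and "title" column headers are an assumption about what your roundup CSV export contains.

```python
import csv
import io


def parse_roundup_csv(csv_text):
    """Turn CSV text into a list of [task id, task title] pairs.

    Assumes the export has "id" and "title" header columns.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [[row["id"], row["title"]] for row in reader]


def new_tasks(tasks, existing_ids):
    """Keep only tasks whose id is not already present locally."""
    seen = set(existing_ids)
    fresh = []
    for task_id, title in tasks:
        if task_id not in seen:
            fresh.append([task_id, title])
            seen.add(task_id)
    return fresh


# Made-up sample export.
SAMPLE_CSV = 'id,title\n101,Fix chooser layout\n102,"Update deeds, then deploy"\n'

if __name__ == "__main__":
    tasks = parse_roundup_csv(SAMPLE_CSV)
    print(new_tasks(tasks, existing_ids=["101"]))
```

Because already-known ids are skipped, running it repeatedly is harmless, which is the same property that makes the org-babel block safe to evaluate more than once.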

Now admittedly this still isn’t even the best setup. It would be good if it told you when some tasks are marked as closed in your org-mode and open in roundup and vice versa. Org-babel still feels a bit hacky… I probably wouldn’t use it on anything other than scripts-I-want-to-embed-in-my-orgmode-files (for now at least). I even had to strip out quotes from the titles because org-babel python doesn’t escape quotations from strings correctly currently (but that’s a bug, one that will hopefully be fixed). Even so, I’ve been trying to close out a lot of roundup tasks lately, and it’s really helped me to bridge both worlds.

Edit: And in case you’re wondering why I didn’t use url.el instead of piping to python, the reason is because of CSV support… there’s none builtin to emacs as far as I know, and splitting on commas doesn’t handle all of the escaping intricacies… and org-babel makes it pretty easy to be lazy and simply use python for what python already handles well.
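
A quick illustration of the escaping problem, using a made-up line with a comma inside a quoted title:

```python
import csv

# Hypothetical CSV line: the title itself contains a comma.
line = '42,"Fix chooser, then deploy",open'

naive = line.split(",")           # splits inside the quoted title
parsed = next(csv.reader([line]))  # respects the quoting

if __name__ == "__main__":
    print(len(naive), naive)
    print(len(parsed), parsed)
```

The naive split yields four fields with stray quote characters, while the csv module returns the three fields you would expect.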


Well Covered

nathan, October 5th, 2010

When we rolled out Hudson for CC code last month, I already knew that I wanted to have test coverage reporting. There’s simply no reason not to: it provides a way to understand how complete your tests are, and when combined with branch testing, gives you an easy way to figure out what tests need to be written (or where to target your test writing efforts).
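
As a small illustration of why branch measurement adds something over plain line coverage, consider this made-up function (it is not from our codebase). The first test alone executes every line, so line coverage reports 100%, yet the path where the if is skipped is never taken. Measuring branches (for example with coverage.py's --branch option) would flag that gap, and the second test closes it.

```python
def license_label(code, version=None):
    """Hypothetical helper: build a short label such as "CC BY-SA 2.5"."""
    label = "CC " + code.upper()
    if version:
        label += " " + version
    return label


def test_with_version():
    # Executes every line of license_label, but only the "true" branch.
    assert license_label("by-sa", "2.5") == "CC BY-SA 2.5"


def test_without_version():
    # Covers the path where the if body is skipped.
    assert license_label("by") == "CC BY"


if __name__ == "__main__":
    test_with_version()
    test_without_version()
```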

Last week I updated DiscoverEd to use Cobertura for coverage. It was pretty easy to crib the example ant build file to add support for instrumenting and testing our code.

When I first tried to add coverage support to our Python code, I encountered an issue between coverage and Jinja2. Ned quickly committed a fix, and today I finished instrumenting our core Python projects for coverage reporting. This includes the license engine (cc.engine), the API (cc.api), the underlying license library (cc.license), and the structured data scraper used by the deeds (deedscraper).

A pleasant surprise after instrumenting is the current state of coverage. With the exception of cc.engine, we’re at greater than 90% coverage for our core code (it appears that there are lots of branches/conditionals we don’t test adequately in cc.engine right now). Looking forward to seeing 100’s across the board.


Find and Reuse Images: Painless Attribution

nathan, October 4th, 2010

Finding CC-licensed images and using them properly is something many people seem to struggle with: finding them can be straightforward, but many sites don’t provide copy-and-paste reuse code that complies with the license. Xpert, a project of the University of Nottingham, has launched an image search tool that helps with this. The Xpert Attribution tool searches Wikimedia Commons and Flickr and provides an easy way to get the image with the attribution information overlaid, or (even better, in my opinion) with RDFa suitable for embedding. I’ve combined the two below (downloading the image with attribution, and adding the structured-data enriched embed code below it).

Taken from http://upload.wikimedia.org/wikipedia/commons/e/eb/-_Schlumbergera_trunctata_-.jpg on 2010-10-05
Original URL – http://commons.wikimedia.org/wiki/File:-_Schlumbergera_trunctata_-.jpg created on February 2007
Nino Barbieri CC BY-SA 2.5

The inclusion of structured data with the HTML means you can click the license link above and the license deed will display the attribution information, as well as our generated attribution HTML.
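
For a sense of what such structured attribution markup can look like, here is a rough Python sketch that generates an RDFa snippet using the CC REL properties cc:attributionName and cc:attributionURL plus rel="license". It is my own illustration, not the markup the Xpert tool actually emits, and the function name and layout are assumptions.

```python
from html import escape


def attribution_html(title, author, source_url, license_url, license_name):
    """Build an HTML attribution snippet annotated with RDFa (CC REL)."""
    return (
        '<div xmlns:cc="http://creativecommons.org/ns#" '
        'xmlns:dct="http://purl.org/dc/terms/" about="{src}">'
        '<span property="dct:title">{title}</span> by '
        '<a rel="cc:attributionURL" property="cc:attributionName" '
        'href="{src}">{author}</a>, '
        '<a rel="license" href="{lic}">{lic_name}</a>'
        '</div>'
    ).format(
        # Escape everything so names and URLs cannot break the markup.
        src=escape(source_url),
        title=escape(title),
        author=escape(author),
        lic=escape(license_url),
        lic_name=escape(license_name),
    )


if __name__ == "__main__":
    print(attribution_html(
        "Schlumbergera trunctata",
        "Nino Barbieri",
        "http://commons.wikimedia.org/wiki/File:-_Schlumbergera_trunctata_-.jpg",
        "http://creativecommons.org/licenses/by-sa/2.5/",
        "CC BY-SA 2.5",
    ))
```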

