Last week Creative Commons released a book titled The Power of Open featuring dozens of case studies of successful uses of CC tools, beautifully laid out magazine-style. The book also has a couple pages (45&46) on metrics and a pretty graph of CC adoption over the years.

See the main CC blog for non-technical detail on the data behind this graph. This post serves as a technical companion -- read below for how to reproduce.

Every day (modulo bugs and outages) we request license link and licensed work counts from Yahoo! Site Explorer and Flickr respectively (and sometimes elsewhere, but those two are currently pertinent to our conservative estimation). You can find the data and software (if you want to start independently gathering data) here.

After loading the data into MySQL, we delete some rows representing links that aren't of interest or from non-Yahoo link: queries to avoid having to filter. In the future at least the former ought be moved to a separate table.


delete from simple where license_uri = 'http://creativecommons.org/licenses/GPL/2.0/';
delete from simple where license_uri = 'http://creativecommons.org/licenses/LGPL/2.1/';
delete from simple where license_uri = 'http://creativecommons.org';
delete from simple where license_uri = 'http://www.creativecommons.org';
delete from simple where license_uri = 'http://creativecommons.org/licenses/publicdomain/1.0/';
delete from simple where license_uri = 'http://creativecommons.org/licenses/by-nc-nd/2.0/deed-music';
delete from simple where license_uri = 'http://creativecommons.org/licenses/by-nc-nd/2.0/br/creativecommons.org/licenses/sampling/1.0/br/';
delete from simple where license_uri = 'http://creativecommons.org/licenses/zero/1.0/';
delete from simple where license_uri = 'http://creativecommons.org/licenses/publicdomain';
delete from simple where search_engine != 'Yahoo';

The following relatively simple query obtains average counts for each distinct license across December (approximating year-end). For the six main version 2.0 licenses, Flickr knows about more licensed works than Yahoo! Site Explorer does, so Flickr numbers are used: we know at least that many works for each of those licenses exist. greatest(yahoo.ct, coalesce(flickr.ct,0)) accomplishes this. coalesce is necessary for Flickr as we don't have data most of the time, and don't want to compare with NULL.


select * from ( select ym, sum(atleast) totalcount from (select yahoo.ym, yahoo.license_uri, greatest(yahoo.ct, coalesce(flickr.ct,0)) atleast from (select extract(year_month from timestamp) ym, license_uri,round(avg(count)) ct from simple group by license_uri,extract(year_month from timestamp)) yahoo left join (select extract(year_month from utc_time_stamp) ym, license_uri,round(avg(count)) ct from site_specific group by license_uri,extract(year_month from utc_time_stamp)) flickr on flickr.ym = yahoo.ym and flickr.license_uri = yahoo.license_uri) x group by ym ) x where ym regexp '12$';

Results of above query:

Year End Total License Count
2003 943,292
2004 4,541,586
2005 15,822,408
2006 50,794,048
2007 137,564,807
2008 214,970,426
2009 336,771,549
2010 407,679,266

The more complicated query below also obtains the number of fully free/libre/open works and the proportion of works that are such:


select free.ym, freecount, totalcount, freecount/totalcount freeproportion from (select ym, sum(atleast) freecount from (select yahoo.ym, yahoo.license_uri, greatest(yahoo.ct, coalesce(flickr.ct,0)) atleast from (select extract(year_month from timestamp) ym, license_uri,round(avg(count)) ct from simple group by license_uri,extract(year_month from timestamp)) yahoo left join (select extract(year_month from utc_time_stamp) ym, license_uri,round(avg(count)) ct from site_specific group by license_uri,extract(year_month from utc_time_stamp)) flickr on flickr.ym = yahoo.ym and flickr.license_uri = yahoo.license_uri) x where license_uri regexp 'publicdomain' or license_uri regexp 'by/' or license_uri regexp 'by-sa/' group by ym) free, (select ym, sum(atleast) totalcount from (select yahoo.ym, yahoo.license_uri, greatest(yahoo.ct, coalesce(flickr.ct,0)) atleast from (select extract(year_month from timestamp) ym, license_uri,round(avg(count)) ct from simple group by license_uri,extract(year_month from timestamp)) yahoo left join (select extract(year_month from utc_time_stamp) ym, license_uri,round(avg(count)) ct from site_specific group by license_uri,extract(year_month from utc_time_stamp)) flickr on flickr.ym = yahoo.ym and flickr.license_uri = yahoo.license_uri) x group by ym) total where free.ym = total.ym and free.ym regexp '12$';

The above query obtains the following:

Year End Free License Count Total License Count Free License %
2003 208,939 943,292 22.15%
2004 1,011,650 4,541,586 22.28%
2005 4,369,938 15,822,408 27.62%
2006 12,284,600 50,794,048 24.19%
2007 40,020,147 137,564,807 29.09%
2008 68,459,952 214,970,426 31.85%
2009 136,938,501 336,771,549 40.66%
2010 160,064,676 407,679,266 39.26%

The pretty graph in the book reflects the total number of CC licensed works and the number of fully free/libre/open CC licensed works at the end of each year; the legend and text note that the proportion of the latter has roughly doubled over the history of CC.

If we look at the average for each month, not only December (remove the regular expression matching '12' at the end of the year month datestring), the data is noisier (and it appears data collection failed for two months in mid-2007, which perhaps should be interpolated):

The results of the above queries and some additional charts may be downloaded as a spreadsheet.

As noted previously, additional data is available for analysis. There's also more that could be done with the license-link and site-specific data used above, e.g., analysis of particular license classes, version update, and jurisdiction ports. Also see the non-technical post.