As with previous looks at CC licenses across different jurisdictions and regions, this one is based on search engine reported links to jurisdiction “ported” versions of licenses. It is great to see what researchers have done with this limited mechanism.
I am eager to explore other means of characterizing CC usage across regions and fields, and hope to provide more data that will enable researchers to do this. This will be increasingly important as we attempt to forge a universal license that does not call for porting in the same way, with version 4.0 over the coming year (as well as with the increasing importance of CC in the world).
Last week Creative Commons released a book titled The Power of Open featuring dozens of case studies of successful uses of CC tools, beautifully laid out magazine-style. The book also has a couple pages (45&46) on metrics and a pretty graph of CC adoption over the years.
See the main CC blog for non-technical detail on the data behind this graph. This post serves as a technical companion — read below for how to reproduce.
Every day (modulo bugs and outages) we request license link and licensed work counts from Yahoo! Site Explorer and Flickr respectively (and sometimes elsewhere, but those two are currently pertinent to our conservative estimation). You can find the data and software (if you want to start independently gathering data) here.
After loading the data into MySQL, we delete rows that represent links that aren’t of interest, and rows that come from non-Yahoo link: queries, so that later queries don’t have to filter them out. In the future at least the former ought to be moved to a separate table.
delete from simple where license_uri = 'http://creativecommons.org/licenses/GPL/2.0/';
delete from simple where license_uri = 'http://creativecommons.org/licenses/LGPL/2.1/';
delete from simple where license_uri = 'http://creativecommons.org';
delete from simple where license_uri = 'http://www.creativecommons.org';
delete from simple where license_uri = 'http://creativecommons.org/licenses/publicdomain/1.0/';
delete from simple where license_uri = 'http://creativecommons.org/licenses/by-nc-nd/2.0/deed-music';
delete from simple where license_uri = 'http://creativecommons.org/licenses/by-nc-nd/2.0/br/creativecommons.org/licenses/sampling/1.0/br/';
delete from simple where license_uri = 'http://creativecommons.org/licenses/zero/1.0/';
delete from simple where license_uri = 'http://creativecommons.org/licenses/publicdomain';
delete from simple where search_engine != 'Yahoo';
The following relatively simple query obtains average counts for each distinct license across December (approximating year-end). For the six main version 2.0 licenses, Flickr knows about more licensed works than Yahoo! Site Explorer does, so Flickr numbers are used: we know at least that many works for each of those licenses exist.
greatest(yahoo.ct, coalesce(flickr.ct,0)) accomplishes this.
coalesce is necessary for Flickr as we don’t have data most of the time, and don’t want to compare with NULL.
select * from ( select ym, sum(atleast) totalcount from (select yahoo.ym, yahoo.license_uri, greatest(yahoo.ct, coalesce(flickr.ct,0)) atleast from (select extract(year_month from timestamp) ym, license_uri,round(avg(count)) ct from simple group by license_uri,extract(year_month from timestamp)) yahoo left join (select extract(year_month from utc_time_stamp) ym, license_uri,round(avg(count)) ct from site_specific group by license_uri,extract(year_month from utc_time_stamp)) flickr on flickr.ym = yahoo.ym and flickr.license_uri = yahoo.license_uri) x group by ym ) x where ym regexp '12$';
Results of above query:
|Year End||Total License Count|
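For readers who find the nested SQL hard to follow, here is a minimal Python sketch of the same "at least this many" logic. It is not the code used to produce the graph; the function names, the dictionary layout and the numbers are all made up for illustration.

```python
from collections import defaultdict


def at_least(yahoo_ct, flickr_ct):
    """Mirror greatest(yahoo.ct, coalesce(flickr.ct, 0)): missing Flickr data counts as 0."""
    return max(yahoo_ct, flickr_ct if flickr_ct is not None else 0)


def monthly_totals(yahoo, flickr):
    """Sum the per-license lower bounds for each month.

    yahoo and flickr map (year_month, license_uri) -> average count.
    Every Yahoo row is kept, as with the LEFT JOIN; Flickr rows that have
    no Yahoo counterpart are ignored.
    """
    totals = defaultdict(int)
    for (ym, uri), yahoo_ct in yahoo.items():
        totals[ym] += at_least(yahoo_ct, flickr.get((ym, uri)))
    return dict(totals)


# Hypothetical numbers, purely to exercise the logic.
yahoo = {
    ("201012", "http://creativecommons.org/licenses/by/2.0/"): 100,
    ("201012", "http://creativecommons.org/licenses/by-sa/3.0/"): 50,
}
flickr = {("201012", "http://creativecommons.org/licenses/by/2.0/"): 250}

print(monthly_totals(yahoo, flickr))
```

With these invented inputs the Flickr figure wins for by/2.0 and the Yahoo figure is used for by-sa/3.0, where Flickr has no data.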
select free.ym, freecount, totalcount, freecount/totalcount freeproportion from (select ym, sum(atleast) freecount from (select yahoo.ym, yahoo.license_uri, greatest(yahoo.ct, coalesce(flickr.ct,0)) atleast from (select extract(year_month from timestamp) ym, license_uri,round(avg(count)) ct from simple group by license_uri,extract(year_month from timestamp)) yahoo left join (select extract(year_month from utc_time_stamp) ym, license_uri,round(avg(count)) ct from site_specific group by license_uri,extract(year_month from utc_time_stamp)) flickr on flickr.ym = yahoo.ym and flickr.license_uri = yahoo.license_uri) x where license_uri regexp 'publicdomain' or license_uri regexp 'by/' or license_uri regexp 'by-sa/' group by ym) free, (select ym, sum(atleast) totalcount from (select yahoo.ym, yahoo.license_uri, greatest(yahoo.ct, coalesce(flickr.ct,0)) atleast from (select extract(year_month from timestamp) ym, license_uri,round(avg(count)) ct from simple group by license_uri,extract(year_month from timestamp)) yahoo left join (select extract(year_month from utc_time_stamp) ym, license_uri,round(avg(count)) ct from site_specific group by license_uri,extract(year_month from utc_time_stamp)) flickr on flickr.ym = yahoo.ym and flickr.license_uri = yahoo.license_uri) x group by ym) total where free.ym = total.ym and free.ym regexp '12$';
The above query obtains the following:
|Year End||Free License Count||Total License Count||Free License %|
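The "free" subset in the second query is picked out with three regular expressions on the license URI. A small Python sketch of that classification, with hypothetical function names and invented counts, might look like this:

```python
import re

# Same three patterns as the SQL WHERE clause. MySQL's regexp matches anywhere
# in the string, so re.search (not re.match) is the equivalent here.
FREE_PATTERN = re.compile(r"publicdomain|by/|by-sa/")


def is_free(license_uri):
    """True for public domain, BY and BY-SA URIs; False for NC and ND variants."""
    return FREE_PATTERN.search(license_uri) is not None


def free_proportion(counts):
    """counts maps license_uri -> count for a single month."""
    total = sum(counts.values())
    free = sum(ct for uri, ct in counts.items() if is_free(uri))
    return free / total if total else 0.0


# Hypothetical counts, only to show the arithmetic.
december = {
    "http://creativecommons.org/licenses/by/3.0/": 30,
    "http://creativecommons.org/licenses/by-sa/2.0/": 10,
    "http://creativecommons.org/licenses/by-nc-sa/3.0/": 60,
}
print(free_proportion(december))
```

Note that "by/" and "by-sa/" do not occur inside "by-nc/", "by-nd/" or "by-nc-sa/", which is why such short patterns are enough.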
The pretty graph in the book reflects the total number of CC licensed works and the number of fully free/libre/open CC licensed works at the end of each year; the legend and text note that the proportion of the latter has roughly doubled over the history of CC.
If we look at the average for each month, not only December (remove the regular expression that matches '12' at the end of the year-month string), the data is noisier. It also appears that data collection failed for two months in mid-2007, which perhaps should be interpolated.
The results of the above queries and some additional charts may be downloaded as a spreadsheet.
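The interpolation suggested above for the missing months in mid-2007 was not actually done for the book. If someone wanted to try it, a simple linear fill would be one option. The sketch below assumes a list of monthly values with None for missing months; the values are invented.

```python
def interpolate_gaps(series):
    """Fill None entries by linear interpolation between the nearest known values.

    Leading or trailing gaps are left as None, since there is nothing to
    interpolate between.
    """
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        delta = series[right] - series[left]
        for i in range(left + 1, right):
            filled[i] = series[left] + delta * (i - left) / span
    return filled


# Hypothetical monthly averages with a two-month hole.
monthly = [100, None, None, 130, 140]
print(interpolate_gaps(monthly))
```

Whether a straight line is a fair guess for link counts over two months is a judgment call; leaving the gap visible in the chart is just as defensible.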
As noted previously, additional data is available for analysis. There’s also more that could be done with the license-link and site-specific data used above, e.g., analysis of particular license classes, version update, and jurisdiction ports. Also see the non-technical post.
Available, yet to be analyzed: some data from which something about CC adoption patterns in some languages and countries might be gleaned
I was recently asked about CC use for works in Arabic. There are no Arabic-language jurisdiction ports available at this time, and measuring adoption of jurisdiction ports is the easiest way to characterize patterns of adoption, as has been done by Giorgos Cheliotis, with interesting results. Also see http://monitor.creativecommons.org.
Fortunately there is some other data available that could be used to characterize CC use, irrespective of porting, across some languages and countries. This data is described at http://wiki.creativecommons.org/Metrics/Data_Catalog#License_property_search_engine_API_queries and available from http://labs.creativecommons.org/metrics/sql-dumps/all.sql.gz
(warning, approximately 200 megabytes).
Nobody has analyzed this part of the available data on CC adoption yet to my knowledge. Anyone is of course welcome and encouraged to!
So after setting up EC2, S3, grabbing the files from S3, SCP-ing the python scripts and running them, one would expect to see some results. Upon the polite request of Asheesh here is a sampler.
The first script (dealing with urls that change their license, named licChange.py) results in an output which lists the URLs (that change their license [type, version or jurisdiction]), the license info and the date(s) of change:
http://blog.aikawa.com.ar/ [['by-nc-sa', '2.5', 'ar'], ['by-nc-nd', '2.5', 'ar']] ['21/Sep/2007:11:38:56 +0000', '22/Sep/2007:05:40:22 +0000']
The line above shows that the license for the URL ‘http://blog.aikawa.com.ar/’ was changed from ‘by-nc-sa 2.5 Argentina’ to ‘by-nc-nd 2.5 Argentina’ some time between 11:38:56 GMT on the 21st of September 2007 and 05:40:22 GMT on the 22nd of September 2007. The format may seem a bit awkward, but you can expect a facelift for the results file. I was previously planning to re-read the file to generate statistics, but we can have a separate file for storing data and another one for the stats.
Similarly, the following lines out of the results file for licChange.py from 2007-09 show license changes for ‘http://0.0.0.0:3000/’ and ‘http://127.0.0.1/actibands/castellano/licencias.htm’ (and there are many other internal URLs like these):
http://0.0.0.0:3000/ [['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-nd', '3.0', 'nl'], ['by-nc-nd', '3.0', 'nl']] ['17/Sep/2007:08:10:28 +0000', '17/Sep/2007:17:50:28 +0000', '18/Sep/2007:16:25:47 +0000', '19/Sep/2007:13:03:23 +0000', '19/Sep/2007:13:11:16 +0000', '20/Sep/2007:22:16:09 +0000', '20/Sep/2007:22:16:39 +0000']
http://127.0.0.1/actibands/castellano/licencias.htm [['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es'], ['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es'], ['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es']] ['27/Sep/2007:20:50:44 +0000', '27/Sep/2007:20:50:44 +0000', '27/Sep/2007:20:51:00 +0000', '27/Sep/2007:20:51:00 +0000', '27/Sep/2007:20:51:23 +0000', '27/Sep/2007:20:51:23 +0000']
The last two licenses recorded for http://0.0.0.0:3000/ are ported for the Netherlands (nl), and the ones for http://127.0.0.1/actibands/castellano/licencias.htm are ported for Spain (es). Note that at present every occurrence of a URL that changes its license is output, including consecutive entries with the same license; this will be changed in the next nightly build. That build will include a better formatted results file, with stats on the total number of URLs changing licenses and even stats distinguishing license type changes from version changes.
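Until the results format gets its facelift, a line in the format shown above can be read back with a few lines of Python. This is only a sketch, not part of licChange.py: the function names are made up, and it assumes the results file uses plain ASCII quotes and that URLs contain no spaces. The second function collapses the repeated entries mentioned above, keeping only real changes.

```python
import ast
from datetime import datetime


def parse_result_line(line):
    """Split 'URL [[licenses]] [timestamps]' into its three parts."""
    url, rest = line.strip().split(" ", 1)
    # The license list ends at the first ']]'; the timestamp list follows it.
    cut = rest.index("]]") + 2
    licenses = ast.literal_eval(rest[:cut])
    stamps = ast.literal_eval(rest[cut:].strip())
    times = [datetime.strptime(s, "%d/%b/%Y:%H:%M:%S %z") for s in stamps]
    return url, [tuple(lic) for lic in licenses], times


def actual_changes(licenses, times):
    """Keep only the entries where the license differs from the previous one."""
    changes = []
    for prev, cur, when in zip(licenses, licenses[1:], times[1:]):
        if cur != prev:
            changes.append((prev, cur, when))
    return changes


line = (
    "http://blog.aikawa.com.ar/ "
    "[['by-nc-sa', '2.5', 'ar'], ['by-nc-nd', '2.5', 'ar']] "
    "['21/Sep/2007:11:38:56 +0000', '22/Sep/2007:05:40:22 +0000']"
)
url, licenses, times = parse_result_line(line)
for prev, cur, when in actual_changes(licenses, times):
    print(url, prev, "->", cur, "seen at", when.isoformat())
```

Comparing whole (type, version, jurisdiction) tuples treats any difference as a change; comparing the fields one by one would allow the license-versus-version split mentioned above.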
Akin to this (licChange.py) there are 3 more scripts, licChooser.py, licSearch.py and deedLogs.py.
licChooser.py grabs metadata usage information and generates stats in absolute numbers and as a percentage of all entries, e.g., “16 out of 100 items are tagged as Audio [16%] of total entries and 29% of items with Metadata”.
licSearch.py grabs information from the logs for search.creativecommons.org like the query, the engine and the search options (commercial use and derivatives).
deedLogs.py looks at the logs for the deed pages, employs MaxMind GeoIP to do a location lookup and grabs the deed page being looked at.
So this is what we have so far.
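To give a feel for the deed-page step, here is a rough sketch of pulling the client IP and the deed's license, version and jurisdiction out of one log line. It is not the actual deedLogs.py: it assumes an Apache "combined" style line, the regular expressions and function name are made up, the sample line is invented, and the GeoIP lookup is left out entirely.

```python
import re

# Client IP first, then the request line in double quotes (Apache "combined" style).
LOG_LINE = re.compile(r'^(?P<ip>\S+) .*?"(?:GET|HEAD) (?P<path>\S+)')

# Deed paths such as /licenses/by-nc-sa/3.0/nl/ ; the jurisdiction part is optional.
DEED_PATH = re.compile(
    r"^/licenses/(?P<code>[a-z-]+)/(?P<version>\d+\.\d+)/"
    r"(?:(?P<jurisdiction>[a-z]{2,})/)?"
)


def parse_deed_hit(line):
    """Return a dict describing the deed that was requested, or None if not a deed hit."""
    hit = LOG_LINE.match(line)
    if hit is None:
        return None
    deed = DEED_PATH.match(hit.group("path"))
    if deed is None:
        return None
    return {
        "ip": hit.group("ip"),  # this is what a location lookup would be given
        "code": deed.group("code"),
        "version": deed.group("version"),
        "jurisdiction": deed.group("jurisdiction") or "",
    }


# Invented sample line; 192.0.2.1 is a documentation-only address.
sample = (
    '192.0.2.1 - - [27/Sep/2007:20:50:44 +0000] '
    '"GET /licenses/by-nc-sa/2.5/es/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"'
)
print(parse_deed_hit(sample))
```

The empty string for a missing jurisdiction matches the convention in the licChange.py output above.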
Some major updates, and we have the scripts running. Thanks to Asheesh for the redirection idea: it works, but I couldn’t get it to give me a progress bar, since everything was being redirected to the file. I tried using two different functions but they needed a shared variable, so that failed. Still, it was nice, since I ended up with “real” Python files with a main().
The journey was interesting: we went from trying >> inside Python, to including # -*- coding: UTF-8 -*- and # coding: UTF-8 to get it to work, and after a few more bumps finally figured out the __main__ part.
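For anyone hitting the same progress-bar problem: one way around it (not what the scripts currently do) is to send results to stdout and progress to stderr, so that redirecting stdout to a file leaves the progress display on the terminal. A minimal sketch, with a made-up process() function standing in for the real analysis:

```python
import sys


def process(lines, out=sys.stdout, progress=sys.stderr):
    """Write one result per input line to `out` and a running counter to `progress`."""
    total = len(lines)
    for n, line in enumerate(lines, 1):
        out.write(line.upper() + "\n")  # stand-in for the real per-line analysis
        progress.write("\rprocessed %d/%d" % (n, total))
    progress.write("\n")


def main():
    process(["first log line", "second log line"])


if __name__ == "__main__":
    main()
```

Run as something like `python script.py > results.txt`, the results land in the file while the counter keeps updating on screen.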
I still need to update all the scripts, but licChange, which is at the forefront of all the latest developments, just got bumped up to version 8.2 (which reminds me of a dire need to update GIT:Loggy!).
This also gave me an idea of how to go about getting data out of S3 for “free” … S3 to EC2 is free … SCP from EC2 is free and voila! Why would I ever want to do that? Well, for starters, the EC2 AMI runs out of space around 5 GB (note: logs for i.creativecommons.org are 4.7 GB) and secondly, the scripts seem to run faster locally. The icing on the cake: I wouldn’t have to scp the result files being generated. I could possibly automate the process of running the scripts.
That’s all for now … class at 0830 Hrs in the morning (it’s criminal, I know).
I guess I’ll just have to keep at it.
So this is where we are.
Now we have EC2, we have S3Sync Ruby scripts on the EC2 AMI to pull the data from S3, and we have updated Python scripts that read one line at a time and use Geo-IP (which was surprisingly easy to install once GCC was functional and the right versions of the C and Python modules were obtained). So deployment is on full throttle; one final bug fix for generating the final results and we are done.
So, now back to the python code. Now we have 4 scripts:
- License Change (Logs for i.creativecommons.org) [Version 7]
- License Chooser (Logs for creativecommons.org) [Version 5]
- CC Search (Logs for search.creativecommons.org) [Version 4]
- Deeds (Logs for creativecommons.org/licenses/*) [Version 2]
Each of these polls a directory for new logs, reads each new log in that directory line by line, and uses regular expressions to parse the information into usable statistics. Throughout the development phase, the results were passed on to stdout / console. With deployment, they now need to be written to a file, which, interestingly, is still to be resolved. (TypeError: ‘str’ object is not callable sound familiar to anyone?)
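The poll, read and parse loop described above can be sketched roughly as follows. This is an illustration rather than any of the four scripts: the function name, the regular expression, the tab-separated output and the sample log line are all assumptions.

```python
import os
import re
import tempfile

# Pulls the requested path out of a line like: ... "GET /some/path HTTP/1.1" ...
REQUEST = re.compile(r'"(?:GET|HEAD|POST) (\S+)')


def process_new_logs(log_dir, seen, results_path):
    """Parse every log file not yet in `seen`, appending one result per matching line."""
    new_files = sorted(name for name in os.listdir(log_dir) if name not in seen)
    with open(results_path, "a") as results:
        for name in new_files:
            with open(os.path.join(log_dir, name)) as log:
                for line in log:  # one line at a time, never the whole file
                    match = REQUEST.search(line)
                    if match:
                        results.write("%s\t%s\n" % (name, match.group(1)))
            seen.add(name)
    return new_files


# Small self-contained demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    log_dir = os.path.join(tmp, "logs")
    os.mkdir(log_dir)
    with open(os.path.join(log_dir, "access.log"), "w") as f:
        f.write('192.0.2.1 - - [x] "GET /licenses/by/3.0/ HTTP/1.1" 200 1\n')
    seen = set()
    results_path = os.path.join(tmp, "results.txt")
    first = process_new_logs(log_dir, seen, results_path)   # picks up access.log
    second = process_new_logs(log_dir, seen, results_path)  # nothing new this time
    with open(results_path) as f:
        written = f.read()

print(first, second, repr(written))
```

As for the TypeError, one common way to get "'str' object is not callable" is to rebind a name such as `open`, `file` or `str` to a string earlier in the script and then call it; whether that is the cause here remains to be seen.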
I am grateful to Asheesh (whom I should have totally bugged more). I should’ve put more work into the project when vacationing back home; having less to do at school would’ve helped too (studies + 3 research projects is not a recommended workload), but if it were easy, it wouldn’t be fun! Oh well, I learnt a fair bit through the project and with a bit more troubleshooting we’d be good to go … for now!
Not that I am expecting much trouble coding with the Geo-IP module, but trying to get it on to the system itself has me believing that this module is out to get me! First, Mac OS X (Leopard) doesn’t come with GCC installed (shocker!) and this module needs building, so I go to get it. GCC is packaged in with the developer tools, which is about a 2 GB install, and I can’t hand-pick the components … fail. So I go get myself DarwinPorts and try that route. It installs, gives me the sweet *ding* install complete sound, and when I go to the terminal … fail … no such file or directory. So I give in to its terrorist demands and make room for the developer pack, thinking I’ll make up for it by actually using these tools. So I wait 19 minutes for it to complete installing, I check I have GCC [i686-apple-darwin9-gcc-4.0.1] … happily I go and python setup.py build … and what followed was not nice … a screen full of Warnings and Errors and No Build. =(
I am going to find another source and try again till it finally works!
In other news, I am changing all my code into methods and adding append-to-file for the results, and I am looking to add file-list comparison as a feature. Coming soon to a GIT repository near you!
So I get to see the state of the art in reconciling Apache and Squid logs. Based on this I need to come up with a way to reformulate the referrer ID and other such data for the logs at i.creativecommons and the ones from Varnish. As speculated in my messy proposal, a .sh using egrep is employed. Still, the bulk of the work is done in Python. So this doesn’t give me an excuse to read up on the Advanced Bash Shell Scripting Guide, but instead something on Python. Fun as well.
As far as I can tell, these scripts will be run before the logs are archived and uploaded in S3 storage. This will work great for the new logs which are generated from that day onwards from when the scripts are implemented. What about the analysis requiring cumulative data or trend analysis? I’ll need to sort this one out, a lot of the analysis depends on access to all the data.
Will be working from a fellow GSoCer’s place today, hoping to make up some lost ground because of travels and intermittent internet access. Will be back in Singapore and firing on all cylinders on the 8th.
OK, so now I should maybe go back and thank nathany for the advice not to download the whole dump of logs, which looked like an innocent 319.xx GB back then … it’s only once I got some samples and started playing around that I realised that was 319.xx GB of archives, which when unzipped come, by rough calculation, to over 2 terabytes of text logs. That much space, I unfortunately don’t have.
Apart from that, the data looks interesting. More information than I had anticipated. I recall Asheesh mentioning some standard tools for working with the logs; I’ll have to follow up on that. Otherwise, now would be as good a time as any to practice some regular expressions. (-:
/me hugs paulproteuss … yes an extra s in there today, but thanks to him [and the guy who posted this :: http://www.macosxhints.com/article.php?story=2008020123070799 with a link to an archive with the S3Browser.app] I now finally have access to S3! (-:
OK, so now that I have access to 25887 objects which take up 319.191 GB (and growing) I need to sort out which ones I need to make a local copy. I sure don’t want to go around messing about with the stuff online, especially since that’s the only copy! I would take the whole dump but there are costs involved:
- Storage – not a big deal, I have 250+ GB available now
- Time – shouldn’t take *too* long, I can let my mac be sleepless a couple of nights
- Money – apparently it would cost about 50 bucks for the transfer
And well, I guess that would be just taking the easy way out. So, I’ll shovel through and familiarize myself with the data so I know which parts I really, really should have and what data is not going to be very helpful in the analysis, and I’ll make a copy of whatever makes for good analysis.
Eye-balling shows me a lot of error logs which I might not include for analysis at the moment [at least not as a part of the GSoC project … may be later]. I’ll probably make a big list of what are all the different types of logs in there and what attributes each of them has. Then I can probably start looking at how I can use the combination of different attributes stored in each of them to come up with useful metrics.
That’s all for now, need to wrap up other projects before I can get started on GSoC full throttle. So, it’s 3:15 am and I am signing off to get back to work.