So after setting up EC2 and S3, grabbing the files from S3, SCP-ing the Python scripts over and running them, one would expect to see some results. At Asheesh's polite request, here is a sampler.
The first script, licChange.py, deals with URLs that change their license (type, version or jurisdiction). Its output lists each such URL, the license info and the date(s) of change:
http://blog.aikawa.com.ar/ [['by-nc-sa', '2.5', 'ar'], ['by-nc-nd', '2.5', 'ar']] ['21/Sep/2007:11:38:56 +0000', '22/Sep/2007:05:40:22 +0000']
The line above shows that the license for the URL ‘http://blog.aikawa.com.ar/’ was changed from ‘by-nc-sa 2.5 Argentina’ to ‘by-nc-nd 2.5 Argentina’ some time between 11:38:56 GMT on the 21st of September 2007 and 05:40:22 GMT on the 22nd of September 2007. The format may seem a bit awkward, but you can expect a facelift for the results file. I was previously planning to re-read the file to generate statistics, but we can have a separate file for storing data and another one for the stats.
Similarly, the following lines from the licChange.py results file for 2007-09 show license changes for ‘http://0.0.0.0:3000/’, ‘http://127.0.0.1/actibands/castellano/licencias.htm’ and many other internal URLs:
http://0.0.0.0:3000/ [['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-nd', '3.0', 'nl'], ['by-nc-nd', '3.0', 'nl']] ['17/Sep/2007:08:10:28 +0000', '17/Sep/2007:17:50:28 +0000', '18/Sep/2007:16:25:47 +0000', '19/Sep/2007:13:03:23 +0000', '19/Sep/2007:13:11:16 +0000', '20/Sep/2007:22:16:09 +0000', '20/Sep/2007:22:16:39 +0000']
http://127.0.0.1/actibands/castellano/licencias.htm [['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es'], ['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es'], ['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es']] ['27/Sep/2007:20:50:44 +0000', '27/Sep/2007:20:50:44 +0000', '27/Sep/2007:20:51:00 +0000', '27/Sep/2007:20:51:00 +0000', '27/Sep/2007:20:51:23 +0000', '27/Sep/2007:20:51:23 +0000']
The licenses for http://0.0.0.0:3000/ are ported for the Netherlands (nl) and those for http://127.0.0.1/actibands/castellano/licencias.htm are ported for Spain (es). Note that at present every occurrence of a URL that changes its license is output; this will change in the next nightly build, which will also include a better-formatted results file with stats on the total number of URLs changing licenses, distinguishing license-type changes from version changes (a rough sketch of such a comparison is below).
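For illustration only, here is a minimal sketch of how two consecutive license entries for a URL could be compared to tell a license-type change apart from a version or jurisdiction change. The function name and the classification labels are placeholders of mine, not the actual licChange.py code:

# -*- coding: UTF-8 -*-
# Hypothetical sketch, not the actual licChange.py code: classify what
# changed between two consecutive license observations for the same URL.

def classify_change(old, new):
    """old/new are [license_type, version, jurisdiction] lists,
    e.g. ['by-nc-sa', '2.5', 'ar']."""
    changes = []
    if old[0] != new[0]:
        changes.append('license type')
    if old[1] != new[1]:
        changes.append('version')
    if old[2] != new[2]:
        changes.append('jurisdiction')
    return changes

# With the blog.aikawa.com.ar entry from the sample output above:
print(classify_change(['by-nc-sa', '2.5', 'ar'], ['by-nc-nd', '2.5', 'ar']))
# -> ['license type']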
Akin to licChange.py, there are three more scripts: licChooser.py, licSearch.py and deedLogs.py.
licChooser.py grabs metadata usage information and generates stats in absolute numbers and as percentages, e.g.: “16 out of 100 items are tagged as Audio [16% of total entries and 29% of items with metadata]” (a quick sketch of that arithmetic follows below).
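To make the arithmetic concrete, here is a minimal sketch of how those two percentages would come out of running counters. The variable names are mine, and the 55 items with metadata is an assumed figure chosen so the example reproduces the quoted numbers:

# Hypothetical counters, not licChooser.py's actual variables.
total_entries = 100    # all chooser entries seen in the log
with_metadata = 55     # entries carrying any metadata at all (assumed figure)
tagged_audio = 16      # entries tagged as Audio

pct_of_total = 100.0 * tagged_audio / total_entries
pct_of_metadata = 100.0 * tagged_audio / with_metadata

print("%d out of %d items are tagged as Audio [%.0f%% of total entries "
      "and %.0f%% of items with metadata]"
      % (tagged_audio, total_entries, pct_of_total, pct_of_metadata))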
licSearch.py grabs information from the logs for search.creativecommons.org, such as the query, the engine and the search options (commercial use and derivatives); a rough parsing sketch follows below.
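Roughly, that comes down to pulling parameters out of the request's query string. The parameter names below (q, engine, commercial, derivatives) are assumptions for illustration, not necessarily what search.creativecommons.org actually uses:

# Sketch of extracting search parameters from a logged request path.
# The parameter names are assumed, not confirmed against the real logs.
import cgi
import urlparse

def parse_search_request(request_path):
    # urlsplit -> (scheme, netloc, path, query, fragment); query is index 3
    query_string = urlparse.urlsplit(request_path)[3]
    params = cgi.parse_qs(query_string)
    def first(key):
        return params.get(key, [''])[0]
    return {'query': first('q'),
            'engine': first('engine'),
            'commercial': first('commercial'),
            'derivatives': first('derivatives')}

print(parse_search_request('/?q=jazz&engine=google&commercial=y&derivatives=n'))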
deedLogs.py looks at the logs for the deed pages, employs MaxMind GeoIP to do a location lookup and grabs the deed page being looked at. The lookup itself is about as simple as the sketch below.
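A minimal sketch of the kind of country lookup the MaxMind GeoIP Python bindings provide; the database path and the sample IP are placeholders, and this is not the deedLogs.py code itself:

# Country lookup via the MaxMind GeoIP C library's Python bindings.
# Database path and IP address are placeholders.
import GeoIP

gi = GeoIP.open('/usr/local/share/GeoIP/GeoIP.dat', GeoIP.GEOIP_STANDARD)
print(gi.country_code_by_addr('72.14.207.99'))   # e.g. 'US'
print(gi.country_name_by_addr('72.14.207.99'))   # e.g. 'United States'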
So this is what we have so far.
Some major updates: we have the scripts running. Thanks, Asheesh, for the redirection idea; it works, but I couldn't get it to show a progress bar since everything was being redirected to the file. I tried using two different functions, but they needed a shared variable, so that failed. Still, it was a nice exercise, since I ended up with “real” Python files with a main().
The journey was interesting: we went from trying >> inside Python to including # -*- coding: UTF-8 -*- and # coding: UTF-8 to get it to work, and after a few more bumps finally figured out __main__. The skeleton the scripts ended up with looks roughly like the sketch below.
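Just to show the shape (the file names and the per-line processing are placeholders, not any one script verbatim): results get written to a file, and progress goes to stderr so it still shows up even when stdout is redirected.

# -*- coding: UTF-8 -*-
# Rough skeleton of the shape the scripts ended up with: results go to a
# file, progress goes to stderr. File names and the per-line processing
# are placeholders.
import sys

def process(log_path, results_path):
    results = open(results_path, 'w')
    log = open(log_path)
    for count, line in enumerate(log):
        # ... parse the line and write anything interesting ...
        results.write(line)
        if count % 100000 == 0:
            sys.stderr.write('processed %d lines\n' % count)
    log.close()
    results.close()

if __name__ == '__main__':
    process(sys.argv[1], sys.argv[2])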
I still need to update all the scripts, but licChange, which is at the forefront of all the latest developments, just got bumped up to version 8.2 (which reminds me of a dire need to update GIT:Loggy!).
This also gave me an idea of how to go about getting data out of S3 for “free” … S3 to EC2 is free … SCP from EC2 is free, and voilà! Why would I ever want to do that? Well, for starters, the EC2 AMI runs out of space around 5 GB (note: the logs for i.creativecommons.org are 4.7 GB), and secondly, the scripts seem to run faster locally. The icing on the cake: I wouldn't have to scp the result files being generated, and I could possibly automate the process of running the scripts.
That's all for now … class at 0830 hrs in the morning (it's criminal, I know).
I guess I'll just have to keep at it.
So this is where we are.
Now we have EC2, we have the S3Sync Ruby scripts on the EC2 AMI to pull the data from S3, and we have updated Python scripts that read one line at a time and use GeoIP (which was surprisingly easy to install once GCC was functional and the right versions of the C and Python modules were obtained). So deployment is at full throttle; one final bug fix for generating the final results and we are done.
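For what it's worth, “one line at a time” just means iterating over the open file object instead of slurping the whole log into memory with readlines(); the file name below is a placeholder.

# Streaming a multi-gigabyte log: iterating the file object keeps only one
# line in memory at a time. 'access.log' is a placeholder name.
log = open('access.log')
for line in log:
    pass        # parse the line here
log.close()

# log.readlines(), by contrast, would load the whole file into a list,
# which is not an option for logs measured in gigabytes.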
So, back to the Python code. We now have 4 scripts:
- License Change (Logs for i.creativecommons.org) [Version 7]
- License Chooser (Logs for creativecommons.org) [Version 5]
- CC Search (Logs for search.creativecommons.org) [Version 4]
- Deeds (Logs for creativecommons.org/licenses/*) [Version 2]
Each of these polls a directory for new logs, reads each new log line by line and uses regular expressions to parse the information into usable statistics. Throughout the development phase the results were passed to stdout / the console; with deployment they now need to be written to a file, which, interestingly, is still to be resolved. (Does “TypeError: ‘str’ object is not callable” sound familiar to anyone?) A rough sketch of the intended poll-parse-write shape is below.
I am grateful to Asheesh (whom I should have totally bugged more). I should have put more work into the project when vacationing back home, and having less to do at school would have helped too (studies + 3 research projects is not a recommended workload), but if it were easy, it wouldn't be fun! Oh well, I learnt a fair bit through the project, and with a bit more troubleshooting we'd be good to go … for now!
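This is only a sketch under my own assumptions (directory names, the regular expression and the output format are all placeholders, not the real scripts); the inline comment points at one common way to hit that TypeError, though whether it is the culprit here is still open.

# -*- coding: UTF-8 -*-
# Placeholder sketch of the poll-parse-write shape: look for new logs in a
# directory, parse each line with a regex, append results to a file.
import os
import re

LOG_DIR = 'logs'              # placeholder
RESULTS_FILE = 'results.txt'  # placeholder
SEEN_FILE = 'processed.txt'   # remembers which logs were already read

# Rough approximation of the Apache combined log format.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"')

def poll_once():
    seen = set()
    if os.path.exists(SEEN_FILE):
        seen = set(open(SEEN_FILE).read().split())
    results = open(RESULTS_FILE, 'a')
    for name in sorted(os.listdir(LOG_DIR)):
        if name in seen:
            continue
        for line in open(os.path.join(LOG_DIR, name)):
            match = LINE_RE.match(line)
            if match:
                # Note the '%' below: writing "..." (values) without it is
                # one common way to get "TypeError: 'str' object is not
                # callable".
                results.write('%s %s\n' % (match.group('time'),
                                           match.group('referer')))
        seen.add(name)
    results.close()
    open(SEEN_FILE, 'w').write('\n'.join(sorted(seen)))

if __name__ == '__main__':
    poll_once()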
So I get to see the state of the art in reconciling Apache and Squid logs. Based on this I need to come up with a way to reformulate the referrer ID and other such data for the logs at i.creativecommons.org and the ones from Varnish. As speculated in my messy proposal, a .sh script using egrep is employed, but the bulk of the work is still done in Python. So this doesn't give me an excuse to read up on the Advanced Bash-Scripting Guide, but rather on something Python-related. Fun as well.
As far as I can tell, these scripts will be run before the logs are archived and uploaded to S3 storage. This will work great for the new logs generated from the day the scripts are implemented onwards. But what about analysis requiring cumulative data or trends? I'll need to sort this one out; a lot of the analysis depends on access to all the data.
Will be working from a fellow GSoCer's place today, hoping to make up some of the ground lost to travel and intermittent internet access. Will be back in Singapore and firing on all cylinders on the 8th.
OK, so now I should maybe go back and thank nathany for the advice not to download the whole dump of logs, which looked like an innocent 319.xx GB back then … it's only once I got some samples and started playing around that I realised that was 319.xx GB of archives, which, unzipped, come by rough calculation to over 2 terabytes of text logs. That much space I unfortunately don't have.
Apart from that, the data looks interesting. More information than I had anticipated. I recall Asheesh mentioning some standard tools for working with the logs; I'll have to follow up on that. Otherwise, now would be as good a time as any to practice some regular expressions. (-:
/me hugs paulproteuss … yes, an extra ‘s’ in there today, but thanks to him [and the guy who posted this :: http://www.macosxhints.com/article.php?story=2008020123070799 with a link to an archive with the S3Browser.app] I now finally have access to S3! (-:
OK, so now that I have access to 25887 objects which take up 319.191 GB (and growing), I need to sort out which ones I need to make a local copy of. I sure don't want to go around messing about with the stuff online, especially since that's the only copy! I would take the whole dump, but there are costs involved:
- Storage – not a big deal, I have 250+ GB available now
- Time – shouldn’t take *too* long; I can let my Mac go sleepless for a couple of nights
- Money – apparently it would cost about 50 bucks for the transfer
And, well, I guess that would just be taking the easy way out. So I'll shovel through and familiarize myself with the data so I know which parts I really, really should have and which data is not going to be very helpful in the analysis, and I'll make a copy of whatever makes for good analysis.
Eyeballing shows me a lot of error logs, which I might not include in the analysis at the moment [at least not as a part of the GSoC project ... maybe later]. I'll probably make a big list of all the different types of logs in there and the attributes each of them has. Then I can start looking at how to use the combination of different attributes stored in each of them to come up with useful metrics.
That's all for now; I need to wrap up other projects before I can get started on GSoC at full throttle. So, it's 3:15 am and I am signing off to get back to work.
Before I dive in on what CC-Logger is all about, a small intro to who I am:
My name is Ankit Guglani, and I am an undergrad at Singapore Management University [link]. I have been working on a research project called ‘CC-Monitor’ since 2006 under the supervision of Prof. Giorgos Cheliotis. This was my introduction to Creative Commons, and I have been collecting and compiling CC metrics ever since.
This summer is again blessed with the Google Summer of Code [GSoC]. Yes, just like last year, Creative Commons [CC] is one of the mentoring organizations. This time around there are 4 active CC GSoC projects; here's what my GSoC project with CC [named CC-Logger] is all about.
The project aims to uncover the hidden metrics in the CC logs. So I'll take a look at the following logs (which are tucked away in an Amazon S3 account):
1) logs for creativecommons.org/license
2) logs for i.creativecommons.org and creativecommons.org/images/licenses
3) logs of creativecommons.org/licenses
4) logs for search.creativecommons.org
Looking at these I hope to find additional information about CC usage such as switching patterns and click-through rates to the deeds, and then analyze and interpret these results. I also hope to come up with some indices / metrics that can be used as indicative predictors for trends.
I know, you're thinking it's the Summer of CODE, and this is all analysis, so where is the code? Once I am done coming up with metrics and indices, I get to write code to automate the calculation (and possibly the web publishing of the results) for all future logs.
The fun part about this project is that the logs I need to analyze right now are over 100 GB in size. I am so looking forward to having a local copy of that!
That's all for now. Thanks to Prof. Giorgos for getting me into CC, Mike for suggesting looking at the logs and of course Asheesh for the mentoring. (-: