Metrics
Loggy: some results
Ankit Guglani, September 2nd, 2008
So after setting up EC2, S3, grabbing the files from S3, SCP-ing the python scripts and running them, one would expect to see some results. Upon the polite request of Asheesh here is a sampler.
The first script (dealing with urls that change their license, named licChange.py) results in an output which lists the URLs (that change their license [type, version or jurisdiction]), the license info and the date(s) of change:
http://blog.aikawa.com.ar/ [['by-nc-sa', '2.5', 'ar'], ['by-nc-nd', '2.5', 'ar']] ['21/Sep/2007:11:38:56 +0000', '22/Sep/2007:05:40:22 +0000']
The line above shows that the license for the URL ‘http://blog.aikawa.com.ar/’ was changed from ‘by-nc-sa 2.5 Argentina’ to ‘by-nc-nd 2.5 Argentina’ some time between 11:38:56 GMT on the 21st of September 2007 to 05:40:22 GMT on 22nd of September 2007. The format may seem a bit awkward but you can expect a facelift for the results file. I was previously planning to re-read the file to generate statistics but we can have a seperate file for storing data and another one for the stats.
Similarly, the following lines out of the results file for licChange.py from 2007-09 show license changes for ‘http://0.0.0.0:3000/’ and ‘http://127.0.0.1/actibands/castellano/licencias.htm’ and *many other internal URLs:
http://0.0.0.0:3000/ [['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-sa', '3.0', ''], ['by-nc-nd', '3.0', 'nl'], ['by-nc-nd', '3.0', 'nl']] ['17/Sep/2007:08:10:28 +0000', '17/Sep/2007:17:50:28 +0000', '18/Sep/2007:16:25:47 +0000', '19/Sep/2007:13:03:23 +0000', '19/Sep/2007:13:11:16 +0000', '20/Sep/2007:22:16:09 +0000', '20/Sep/2007:22:16:39 +0000']
http://127.0.0.1/actibands/castellano/licencias.htm [['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es'], ['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es'], ['by-sa', '2.5', 'es'], ['by-nc-sa', '2.5', 'es']] ['27/Sep/2007:20:50:44 +0000', '27/Sep/2007:20:50:44 +0000', '27/Sep/2007:20:51:00 +0000', '27/Sep/2007:20:51:00 +0000', '27/Sep/2007:20:51:23 +0000', '27/Sep/2007:20:51:23 +0000']
The licenses for http://0.0.0.0:3000/ are ported for Netherlands (nl) and the one for http://127.0.0.1/actibands/castellano/licencias.htm are ported for Spain (es). Note that presently all the occurences of any URL that changes its license is outputted, this will be changed in the next nightly build. This included a better formatted result file with stats on total number of URLs changing licenses and even stats distinguishing changes between license change and version change.
Akin to this (licChange.py) there are 3 more scripts, licChooser.py, licSearch.py and deedLogs.py.
licChooser.py grabs metadata usage information and generates stats in absolute numbers and percentage of all entries, eg.: “16 out of 100 items are tagged as Audio [16%] of total entries and 29% of items with Metadata”
licSearch.py grabs information from the logs for search.creativecommons.org like the query, the engine and the search options (commercial use and derivatives).
deedLogs.py looks at the logs for the deed pages, employs MaxMind GeoIP to do a location lookup and grabs the deed page being loked at.
So this is what we have so far.
No Comments »>>> py >> file … Also if __name__ == ‘__main__’:
Ankit Guglani, September 1st, 2008
Some major updates and we have the scripts running, thanks Asheesh for the redirection idea, it works but I couldn’t get it to give me a progress bar since everything was being redirected to the file. I tried using two different functions but they needed a shared variable, so that failed, but it was nice since now I ended up with “real” python files with a main().
The journey was interesting, we went from trying >> inside python to including # -*- coding: UTF-8 -*- and # coding: UTF-8 to get it to work and after a few more bumps finally figured out the __main__
I still need to update all the scripts, but licChange which is at the forefront of all the latest developments just got bumped upto version 8.2 (which reminds me of a dire need to update GIT:Loggy!).
This also gave me an idea of how to go about getting data out of S3 for “free” … S3 to EC2 is free … SCP from EC2 is free and voila! Why would I every want to do that? Well, for starters, the EC2 AMI runs out of space around 5 GB (note: logs for i.creativecommons.org are 4.7 GB) and secondly, the scripts seem to run faster locally. The icing on the cake, I wouldn’t have to scp the result files being generated. I could possibly automate the process of running the scripts.
Thats all for now … class at 0830 Hrs in the moring (it’s criminal, I know).
I guess, I’ll just have to keep at it.
No Comments »EC2, S3Sync and back to Python.
Ankit Guglani, August 31st, 2008
So this is where we are.
Now we have EC2, we have S3Sync ruby scripts on the EC2 AMI to pull the data from S3 and we have updated python scripts that read one line at a time and use Geo-IP (which was suprisingly easy to install once GCC was functional and the right versions of the C and Python modules were attained). So deployment is on full throttle and one final bug fix for generating the final results and we are done.
So, now back to the python code. Now we have 4 scripts:
- License Change (Logs for i.creativecommons.org) [Version 7]
- License Chooser (Logs for creativecommons.org) [Version 5]
- CC Search (Logs for search.creativecommons.org) [Version 4]
- Deeds (Logs for creativecommons.org/licenses/*) [Version 2]
Each of which polls a directory for new logs, reads each new log in the stated directory, line by line and uses regular expressions to parse the information into usable statistics. Hitherto throughout the development phase, the results were passed on to stdout / console. With deployment, they now need to be written to a file, while interestingly is still to be resolved. (TypeError: ’str’ object is not callable sound familiar to anyone?)
I am greatful to Asheesh (whom I should have totally bugged more). I should’ve put in more work into the project when vactioning back home, also having less to do at school would’ve helped (studies + 3 research projects is not a recommended wotk load), but if it would be easy, it wouldn’t be fun! Oh well, I learnt a fair bit through the project and with a bit more troubleshooting we’d be good to go … for now!
No Comments »GeoIP Hates Me … phail.
Ankit Guglani, August 6th, 2008
Not that I am expecting much trouble coding using the Geo-IP module, but trying to get it on to the system itself has me believing that this module is out to get me! First, mac OS X (Leopard) doesn’t come with GCC installed (shocker!) and this module needs building, so I go to get it. GCC is in packaged in with the developers tool, which is about a 2 GB install and I can’t hand-pick the components … fail. So I go get myself darwin ports, and try that route. It installs, gives me the sweet *ding*, install complete sound and when I go to terminal and … fail … no such file or directory. So I give in to its terrorist demands and make room for the developers pack thinking I’ll make up for it by actually using these tools. So I wait 19 minutes for it to complete installing, I check I have GCC [i686-apple-darwin9-gcc-4.0.1] … happily I go and python setup.py build … and what followed was not nice … a screen full of Warnings and Errors and No Build. =(
I am going to find another source and try again till it finally works!
In other news, changing all my codes to methods and including append to file for results, looking to add file-list comparison as a feature. Coming soon to a GIT repository near you!
No Comments »Python && cumulative metrics?
Ankit Guglani, June 23rd, 2008
So I get to see the state-of-the-art and reconciling Apache and Squid logs. Based on this I need to come up with a way to reformulate the referrer ID and other such data for the logs at i.creativecommons and the ones from Varnish. As speculated in my messy proposal, a .sh using egrep is employed. Still bulk of the work is done in Python. So this doesn’t give me an excuse to read up on the Advanced Bash Shell Scripting Guide, but instead something on Python. Fun as well.
As far as I can tell, these scripts will be run before the logs are archived and uploaded in S3 storage. This will work great for the new logs which are generated from that day onwards from when the scripts are implemented. What about the analysis requiring cumulative data or trend analysis? I’ll need to sort this one out, a lot of the analysis depends on access to all the data.
Will be working from a fellow GSoCers place today, hoping to cover up on some lost ground because of travels and intermittent internet access. Will be back in Singapore and firing on all cylinders on the 8th.
No Comments »How big are the logs again?
Ankit Guglani, June 13th, 2008
OK, so now I should maybe go back and thank nathany for the advise on not to download the whole dump of logs which looked like an innocent 319.xx GB back then … it’s only once I got some samples and started playing around I realised, that was 319.xx GB of archives, which when unzipped by rough calculation come to over 2 Tera-bytes of text logs. That much amount of space, I unfortunately don’t have.
Apart from that, the data looks interesting. More information than I had anticipated. I recall Asheesh mentioning some standard tools for working with the logs, I’ll have to follow up on that. Other-wise, now would as good a time as any to practice some regular expressions. (-:
No Comments »S3 Finally!
Ankit Guglani, June 3rd, 2008
/me hugs paulproteuss … yes an extra s in there today, but thanks to him [and the guy who posted this :: http://www.macosxhints.com/article.php?story=2008020123070799 with a link to an archive with the S3Browser.app] I now finally have access to S3! (-:
OK, so now that I have access to 25887 objects which take up 319.191 GB (and growing) I need to sort out which ones I need to make a local copy. I sure don’t want to go around messing about with the stuff online, especially since that’s the only copy! I would take the whole dump but there are costs involved:
- Storage – not a big deal, I have 250+ GB available now
- Time – shouldn’t take *too* long, I can let my mac be sleepless a couple of nights
- Money – apparently it would cost about 50 bucks for the transfer
And well, I guess that would be just taking the easy way out. So, I’ll shovel through and familiarize myself with the data so I know which parts I really really should have and what data is not going to very helpful in the analysis and I’ll make a copy of whatever makes for good analysis.
Eye-balling shows me a lot of error logs which I might not include for analysis at the moment [at least not as a part of the GSoC project ... may be later]. I’ll probably make a big list of what are all the different types of logs in there and what attributes each of them has. Then I can probably start looking at how I can use the combination of different attributes stored in each of them to come up with useful metrics.
That’s all for now, need to wrap up other projects before I can get started on GSoC full throttle. So, it’s 3:15 am and I am signing off to get back to work.
No Comments »CC-Logger The S3 Chronicles
Ankit Guglani, June 2nd, 2008
Task One :: Setup a client to access the S3 bucket and retrieve the logs.
Attempt One :: Jungle Disk. A tool for the Mac with GUI implementation that mounts the S3 as a drive. Sounds cool, easy to install and setup, detects the buckets, but the mounted drive is empty. I would wait, expecting it to load the information, but my firewall told me it wasn’t exchanging any information with the S3 server and that was that.
Attempt Two :: Now I went for the S3Sync.rb and s3cmd.rb. Two ruby scripts that are acclaimed to be the solution. I setup the yml file with the keys to accessing the bucket and place it in the required location. I run the script and “Environment is not set up”. I go over the read me again and sure enough I had missed something; there were two pre-requisites, Ruby and Open-SSL library for ruby. I had ruby of course, but not the Open-SSL library. So I go and look for it and guess what, it doesn’t exist O_o. There are only eassl and jruby-openssl.
I don’t want to switch to another tool for a few reasons. One, this was recommended by the folks at CC, so I know this works with their setup. Two, each tool interacts with S3 in a different way and other tools may / may not work even if I set them up properly. Three come on, this is ruby, something I am familiar with, and it looks fairly easy to use, if only I can get hold of that dependency.
I’ll read up on some posts I came across earlier where people said they were using S3 … maybe there are some description of the setup there. If I find nothing, back to the IRC.
That’s all for now.
No Comments »CC-Logger [GSoC 2008]
Ankit Guglani, June 1st, 2008
Before I dive-in on what CC-Logger is all about, a small intro to who I am:
My name is Ankit Guglani, and I am an undergrad at Singapore Management University [link]. I has been working on a research project called ‘CC-Monitor’ Since 2006 under the supervision of prof. Giorgos Cheliotis. This was my introduction to Creative Commons, I have been working on Collecting and Compiling CC-Metrics since.
This summer is again blessed with the Google, Summer of Code [GSoC]. Yes, just like last year, Creative Commons [CC] is one of the mentoring organizations yet again. This time around there are 4 active CC-GSoC Projects; Here’s what my GSoC project with CC [named CC-Logger] is all about.
The project aims to uncover the hidden metrics in the CC Logs. So I’ll take look at the following logs (which are tucked away in an Amazon S3 account):
1) logs for creativecommons.org/license
2) logs for i.creativecommons.org and creativecommons.org/images/licenses
3) logs of creativecommons.org/licenses
4) logs for search.creativecommons.org
Looking at these I hope to find additional information about CC usage such as switching patterns and click-through rates to the deeds, and then analyze and interpret these results. I also hope to come up with some indices / metrics that can be used as indicative predictors for trends.
I know, you’re thinking it’s summer of CODE, and this is all analysis, where is the code? Once I am done with coming up with metrics and indices, I get to code to automate the calculation (and possibly the web-publishing of the results) for all the future logs.
The fun part about this project is, the logs I need to analyze right now, are over 100 GB in size. I am so looking forward to having a local copy of that!
That’s all for now, thanks to prof. Giorgos for getting me into CC, Mike for suggesting looking at the logs and of course Asheesh, for the mentoring. (-:
1 Comment »