Labs News
How big are the logs again?
Ankit Guglani, June 13th, 2008
OK, so now I should go back and thank nathany for the advice not to download the whole dump of logs, which looked like an innocent 319.xx GB back then … it’s only once I got some samples and started playing around that I realised that was 319.xx GB of archives, which, when unzipped, by rough calculation come to over 2 terabytes of text logs. That much space I unfortunately don’t have.
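The rough calculation above can be sketched as follows. This is only an illustration: it assumes the archives are gzip-compressed (the post only says "archives"), and the function name is made up for the example. It measures the compression ratio on a few downloaded samples and scales the total compressed size by that ratio.

```python
import gzip


def estimate_uncompressed_gb(samples, total_compressed_gb):
    """Extrapolate the unpacked size of a whole archive set from a few samples.

    samples: list of bytes objects, each one gzip-compressed file (assumption:
             the real archives are gzip; the post does not say).
    total_compressed_gb: size of all archives together, e.g. 319.
    """
    if not samples:
        raise ValueError("need at least one sample archive")
    compressed = sum(len(s) for s in samples)
    uncompressed = sum(len(gzip.decompress(s)) for s in samples)
    ratio = uncompressed / compressed  # how much the text grows when unpacked
    return total_compressed_gb * ratio
```

Going from 319 GB of archives to over 2 TB of text implies a ratio of roughly 6x or more, which is plausible for repetitive log text.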
Apart from that, the data looks interesting. More information than I had anticipated. I recall Asheesh mentioning some standard tools for working with the logs; I’ll have to follow up on that. Otherwise, now would be as good a time as any to practice some regular expressions. (-:
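As a starting point for that regular-expression practice, here is a minimal sketch. It assumes the logs follow the common Apache "combined" access log format, which the post does not confirm; the sample line in the usage note is invented.

```python
import re

# Assumed format: Apache "combined" access log. The referrer and user agent
# are optional so that plain "common" format lines also match.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None
```

Calling `parse_line` on a line such as `192.0.2.1 - - [13/Jun/2008:03:15:00 +0000] "GET /licenses/by/3.0/ HTTP/1.1" 200 5120 "http://example.org/" "Mozilla/5.0"` yields a dict whose `path` is `/licenses/by/3.0/` and whose `referrer` is `http://example.org/`.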
Flickr Image Re-Use for OpenOffice.org update
Mihai Husleag, June 13th, 2008
I’m happy to announce that I have implemented, in a basic manner, all 3 requirements for this project: searching photos by tags, filtering by license, and inserting a photo into a document.
Here is a screenshot taken after a search on the tag “mountains” with the Attribution License:
And here is a screenshot with the photo inserted into a document. As you can see, the image was inserted at a default size, but this will change later.
What I’ll try to do next:
- add menus to each image with the available sizes
- improve the searching
- insert the image into the document with the selected size
- add the license to the document
- more testing
I hope to make a good version available in less than 2 weeks.
Any comments or suggestions are much appreciated.
PS: I came across this article. “I for one can’t wait,” says Andrew Min about this project. I’ll try not to disappoint him :)
Wait, they test web apps now?
Frank Tobia, June 10th, 2008
Greetings all. I’m Frank and I’m the other tech intern at CC for 2008. I mainly focus on writing Python, though I’m comfortable with Java and I’ll deal with C when sufficiently coerced.
For my first task, I’ll be improving the test suite for the web service API (http://api.creativecommons.org/docs/). Currently it runs on CherryPy (http://cherrypy.org/) and the test suite is brittle and somewhat broken. I’ll be porting the tests over to Python Paste (http://pythonpaste.org/), getting them all to pass, and checking code coverage to see where more tests would be beneficial (testing is fun after all). The overarching goal is to get the API running on Pylons (http://pylonshq.com/) so CC has fewer server stacks to maintain.
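To give a feel for what in-process web app testing looks like, here is a minimal sketch using only the standard library. It is not the actual CC test suite and it does not use Paste itself; the app, the `/ping` path and the `call_app` helper are all hypothetical. The idea is the one that WSGI test fixtures build on: call the application directly with a fake request environment instead of starting a server.

```python
from wsgiref.util import setup_testing_defaults


def hypothetical_api(environ, start_response):
    """Stand-in WSGI app; the real API's endpoints are not shown in this post."""
    if environ["PATH_INFO"] == "/ping":
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"pong"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]


def call_app(app, path):
    """Invoke a WSGI app in-process and return (status, headers, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

Because no network or server process is involved, tests written this way tend to be fast and do not depend on which server stack hosts the app.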
Summer Internship : email notification for Semantic Media Wiki
Steren Giannini, June 10th, 2008
First I would like to introduce myself. I’m Steren Giannini, a French student from Ecole Centrale de Lyon, a general engineering school. I’m working as a tech intern at Creative Commons SF. It’s the first time I have crossed the Atlantic, so this stay in San Francisco is a real adventure. I’m very proud to do my part in the CC revolution.
During my school years, I created websites for companies (to make some spending money ;) ) and I recently worked on Inkscape, enhancing the Live Path Effect system.
The first goal of my project at CC is to improve the internal task and project tracking system. It uses Semantic MediaWiki. Today, tasks can be created and assigned to users. My job is to add email notification to the system.
This means:
- send email to the assigned users when a task changes
- send “reminder” emails to assigned users when due dates are approaching.
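The reminder case can be sketched as follows. This is only an illustration of the logic, written in Python rather than in the PHP of MediaWiki; the task fields (`title`, `assignee_email`, `due`), the sender address and the 3-day window are all assumptions, not part of the real tracking system.

```python
from datetime import date, timedelta
from email.message import EmailMessage


def tasks_due_soon(tasks, today, days_ahead=3):
    """Return the tasks whose due date falls within the next `days_ahead` days."""
    cutoff = today + timedelta(days=days_ahead)
    return [task for task in tasks if today <= task["due"] <= cutoff]


def build_reminder(task, sender="tasks@example.org"):
    """Build (but do not send) a reminder email for one task."""
    due = task["due"].isoformat()
    msg = EmailMessage()
    msg["From"] = sender  # hypothetical sender address
    msg["To"] = task["assignee_email"]
    msg["Subject"] = "Reminder: {} is due on {}".format(task["title"], due)
    msg.set_content("The task '{}' assigned to you is due on {}.".format(task["title"], due))
    return msg
```

Keeping the selection of tasks separate from the construction of the message makes each part easy to test without a mail server; actually sending could later be handed to something like `smtplib`.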
So yesterday I started by reading some documentation about the Semantic Web in general, the RDF specification, MediaWiki and Semantic MediaWiki. Then I installed MediaWiki on my localhost in order to have a closer look at the code.
My first goal is to get used to the existing system before coding anything.
S3 Finally!
Ankit Guglani, June 3rd, 2008
/me hugs paulproteuss … yes an extra s in there today, but thanks to him [and the guy who posted this :: http://www.macosxhints.com/article.php?story=2008020123070799 with a link to an archive with the S3Browser.app] I now finally have access to S3! (-:
OK, so now that I have access to 25887 objects which take up 319.191 GB (and growing), I need to sort out which ones I need a local copy of. I sure don’t want to go around messing about with the stuff online, especially since that’s the only copy! I would take the whole dump but there are costs involved:
- Storage – not a big deal, I have 250+ GB available now
- Time – shouldn’t take *too* long, I can let my mac be sleepless a couple of nights
- Money – apparently it would cost about 50 bucks for the transfer
And well, I guess that would be just taking the easy way out. So, I’ll shovel through and familiarize myself with the data so I know which parts I really, really should have and what data is not going to be very helpful in the analysis, and I’ll make a copy of whatever makes for good analysis.
Eye-balling shows me a lot of error logs which I might not include for analysis at the moment [at least not as a part of the GSoC project ... maybe later]. I’ll probably make a big list of all the different types of logs in there and what attributes each of them has. Then I can probably start looking at how I can use the combination of different attributes stored in each of them to come up with useful metrics.
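A first pass at that "big list" could be as simple as grouping the object listing by key prefix. The sketch below assumes the listing is already available as (key, size in bytes) pairs and that the first path component of a key identifies the log type; the real bucket layout is not described here, so both are guesses.

```python
from collections import defaultdict


def inventory_by_prefix(objects):
    """Summarise (key, size_bytes) pairs as {prefix: (object_count, total_bytes)}.

    Assumption: the part of the key before the first "/" names the log type.
    """
    summary = defaultdict(lambda: [0, 0])
    for key, size in objects:
        prefix = key.split("/", 1)[0]
        summary[prefix][0] += 1
        summary[prefix][1] += size
    return {prefix: tuple(totals) for prefix, totals in summary.items()}
```

A summary like this makes it easier to decide which groups (for example, error logs) to leave out of the local copy.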
That’s all for now, need to wrap up other projects before I can get started on GSoC full throttle. So, it’s 3:15 am and I am signing off to get back to work.
CC-Logger The S3 Chronicles
Ankit Guglani, June 2nd, 2008
Task One :: Setup a client to access the S3 bucket and retrieve the logs.
Attempt One :: Jungle Disk. A GUI tool for the Mac that mounts the S3 bucket as a drive. Sounds cool, easy to install and set up, detects the buckets, but the mounted drive is empty. I waited, expecting it to load the information, but my firewall told me it wasn’t exchanging any information with the S3 server and that was that.
Attempt Two :: Now I went for the S3Sync.rb and s3cmd.rb. Two Ruby scripts that are reputed to be the solution. I set up the yml file with the keys for accessing the bucket and placed it in the required location. I ran the script and got “Environment is not set up”. I went over the readme again and sure enough I had missed something; there were two prerequisites, Ruby and the OpenSSL library for Ruby. I had Ruby of course, but not the OpenSSL library. So I went looking for it and guess what, it doesn’t exist O_o. There are only eassl and jruby-openssl.
I don’t want to switch to another tool for a few reasons. One, this was recommended by the folks at CC, so I know this works with their setup. Two, each tool interacts with S3 in a different way and other tools may or may not work even if I set them up properly. Three, come on, this is Ruby, something I am familiar with, and it looks fairly easy to use, if only I can get hold of that dependency.
I’ll read up on some posts I came across earlier where people said they were using S3 … maybe there is some description of the setup there. If I find nothing, back to IRC.
That’s all for now.
CC-Logger [GSoC 2008]
Ankit Guglani, June 1st, 2008
Before I dive-in on what CC-Logger is all about, a small intro to who I am:
My name is Ankit Guglani, and I am an undergrad at Singapore Management University [link]. I have been working on a research project called ‘CC-Monitor’ since 2006 under the supervision of Prof. Giorgos Cheliotis. This was my introduction to Creative Commons, and I have been working on collecting and compiling CC metrics ever since.
This summer is again blessed with the Google Summer of Code [GSoC]. Yes, just like last year, Creative Commons [CC] is one of the mentoring organizations yet again. This time around there are 4 active CC-GSoC projects. Here’s what my GSoC project with CC [named CC-Logger] is all about.
The project aims to uncover the hidden metrics in the CC Logs. So I’ll take a look at the following logs (which are tucked away in an Amazon S3 account):
1) logs for creativecommons.org/license
2) logs for i.creativecommons.org and creativecommons.org/images/licenses
3) logs of creativecommons.org/licenses
4) logs for search.creativecommons.org
Looking at these I hope to find additional information about CC usage such as switching patterns and click-through rates to the deeds, and then analyze and interpret these results. I also hope to come up with some indices / metrics that can be used as indicative predictors for trends.
I know, you’re thinking it’s summer of CODE, and this is all analysis, where is the code? Once I am done coming up with metrics and indices, I get to write code to automate the calculation (and possibly the web-publishing of the results) for all the future logs.
The fun part about this project is that the logs I need to analyze right now are over 100 GB in size. I am so looking forward to having a local copy of that!
That’s all for now, thanks to prof. Giorgos for getting me into CC, Mike for suggesting looking at the logs and of course Asheesh, for the mentoring. (-:
GSoC 2008 : Flickr Image Re-Use for OpenOffice.org
Mihai Husleag, May 29th, 2008
As the title might suggest, I have been selected for GSoC 2008. Nathan Yergler has been assigned as mentor for this project.
Development will focus on 3 key functionalities:
- ability to search photos by tags
- filter search results by license attributes
- insert the image into the document along with attribution information
The first 2 steps were done (of course, this is not the final version) in a small demo that I attached to my application for GSoC 2008. The OpenOffice components for which the extension will be implemented are Writer, Impress and Calc.
The application will be written in Java, using NetBeans with its OpenOffice NetBeans Integration plugin.
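To illustrate the first two functionalities (search by tags, filter by license), here is a rough sketch of how a request to the Flickr REST API could be put together. It is written in Python for brevity, not in the Java of the actual extension, and it only builds the URL without contacting Flickr. The API key is a placeholder, and the license id 4 for the Attribution License is an assumption that should be checked against `flickr.photos.licenses.getInfo`.

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"


def build_search_url(api_key, tags, license_id=None):
    """Build a flickr.photos.search URL for the given tags and optional license."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,          # placeholder; a real key is required
        "tags": ",".join(tags),      # comma-separated list of tags
        "format": "json",
        "nojsoncallback": "1",       # ask for plain JSON instead of JSONP
    }
    if license_id is not None:
        params["license"] = str(license_id)  # assumed: 4 = Attribution License
    return FLICKR_REST + "?" + urlencode(params)
```

For example, `build_search_url("YOUR_API_KEY", ["mountains"], license_id=4)` produces a URL that asks for photos tagged "mountains" under the license with id 4.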
A short introduction: I’m Mihai Husleag, 24 years old, a Computer Science student at Alexandru Ioan Cuza University of Iasi, Romania. My previous experience as a programmer is mostly with the .NET framework. Another thing about me: if I’m not reachable on weekends, there is a high probability that you will find me here.
If you have any suggestions about this project (new functionalities, things you don’t like, etc.), feel free to leave a comment.
License-oriented metadata validator and viewer: the development has just started
Hugo Dworak, May 26th, 2008
Creative Commons participates in Google Summer of Code™ and has accepted a proposal (see the abstract) of Hugo Dworak based on its description of a task to rewrite its now-defunct metadata validator. Asheesh Laroia has been assigned as the mentor of the project. The work began on May 26th, 2008 as per the project timeline. It is expected to be completed in twelve weeks. More details will be provided in the dedicated CC Wiki article and the progress will be featured weekly on this blog.
The project focuses on developing an on-line tool (free software written in Python) to validate digitally embedded Creative Commons licenses within files of different types. Files will be pasted directly to a form, identified by a URL, or uploaded by a user. The application will present the results in a human-readable fashion and notify the user if the means used to express the license terms are deprecated.
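As an illustration of one small piece of such a tool, the sketch below looks for license links in pasted HTML. It covers only one way of expressing a license, a link carrying `rel="license"`, and is not the validator's actual design; the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class LicenseLinkFinder(HTMLParser):
    """Collect href values of <a> and <link> elements marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, or be missing entirely
        rels = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "license" in rels and href:
            self.licenses.append(href)


def find_license_links(html_text):
    """Return every license URL found in the given HTML text."""
    finder = LicenseLinkFinder()
    finder.feed(html_text)
    return finder.licenses
```

A full validator would also need to handle other embedding methods and file types, and to flag deprecated ones, which this sketch does not attempt.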
liblicense 0.7.0: Now with working Python bindings again
asheesh, May 16th, 2008
I just released liblicense 0.7.0 on SourceForge. It fixes the Python bindings. They’ve been broken since the 0.6 release, it seems. Some functionality in them probably worked between 0.6 and 0.7, but (read on for more)…