summer of code

License-oriented metadata validator and viewer: libvalidator

Hugo Dworak, July 8th, 2008

As the Google Summer of Code 2008 midterm evaluation deadline approaches, it is a good time to report on the progress of the license-oriented metadata validator and viewer.

The source code is located in two dedicated git repositories. The first is validator, which contains the source code of the Web application based on Pylons and Genshi. The second is libvalidator, which hosts the core library that the project will use. Development currently focuses on this component.

The purpose of the aforementioned library is to parse input files, scan them for relevant license information, and output the results in a machine-readable fashion. More precisely, its workflow is the following: parse the file and associated RDF information so that a complete set of RDF data is available, filter the results with regard to license information (not only related to the document itself, but also to other objects described within it), and return the results in a form suitable for use by the Web application.

pyRdfa seems to be the best tool for the parsing stage so far. It handles the current recommendation for embedding license metadata (namely RDFa) as well as other non-deprecated methods: linking to an external or embedded (using the “data” URL scheme) RDF file and using Dublin Core. Its significant gap is that it does not handle the invalid direct embedding of RDF/XML within the HTML/XHTML source code (as an element or in a comment). This is resolved by first capturing all such instances with a regular expression and then parsing the data just like external RDF/XML files.
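The capturing step can be sketched roughly as follows. This is only an illustration, not the code in libvalidator: it assumes the embedded block uses the literal `rdf:` prefix, and the names `RDF_BLOCK` and `extract_rdf_blocks` as well as the sample markup are made up for the example.

```python
import re

# Assumption: the embedded block is written with the literal "rdf:" prefix.
# The pattern matches <rdf:RDF ...> ... </rdf:RDF> wherever it appears in the
# markup, including inside an HTML comment. DOTALL lets "." span newlines and
# the non-greedy ".*?" stops at the first closing tag.
RDF_BLOCK = re.compile(r"<rdf:RDF\b.*?</rdf:RDF\s*>", re.DOTALL)


def extract_rdf_blocks(html):
    """Return every embedded RDF/XML block found in the HTML source."""
    return RDF_BLOCK.findall(html)


# Hand-written sample page with RDF/XML hidden in a comment.
sample = """<html><body>
<!-- <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
              xmlns:cc="http://creativecommons.org/ns#">
  <cc:Work rdf:about="">
    <cc:license rdf:resource="http://creativecommons.org/licenses/by/3.0/"/>
  </cc:Work>
</rdf:RDF> -->
<p>Hello</p>
</body></html>"""

blocks = extract_rdf_blocks(sample)
```

Each string in `blocks` could then be handed to the same parser that deals with external RDF/XML files.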

Once the RDF triples are extracted, one can use SPARQL to narrow the results just to the triples related to the licensed objects. Both librdf and rdflib support this language. Moreover, the RDF/XML related to the license must be parsed, so that its conditions (permissions, requirements, and restrictions) are then presented to the user.
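To give an idea of what reading the license conditions involves, here is a minimal sketch. It deliberately uses the standard library's ElementTree instead of librdf or rdflib, so it only copes with the flat serialisation shown and is not a general RDF parser. The sample document is hand-written for the example, and `license_conditions` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CC = "http://creativecommons.org/ns#"

# Hand-written sample, not a copy of a published license description.
LICENSE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cc="http://creativecommons.org/ns#">
  <cc:License rdf:about="http://creativecommons.org/licenses/by-nc/3.0/">
    <cc:permits rdf:resource="http://creativecommons.org/ns#Reproduction"/>
    <cc:permits rdf:resource="http://creativecommons.org/ns#Distribution"/>
    <cc:requires rdf:resource="http://creativecommons.org/ns#Attribution"/>
    <cc:prohibits rdf:resource="http://creativecommons.org/ns#CommercialUse"/>
  </cc:License>
</rdf:RDF>"""


def license_conditions(rdf_xml):
    """Group a license's conditions into permits / requires / prohibits."""
    root = ET.fromstring(rdf_xml)
    conditions = {"permits": [], "requires": [], "prohibits": []}
    for lic in root.iter("{%s}License" % CC):
        for kind in conditions:
            for node in lic.findall("{%s}%s" % (CC, kind)):
                uri = node.get("{%s}resource" % RDF)
                if uri:
                    # Keep only the fragment, e.g. "Attribution".
                    conditions[kind].append(uri.rsplit("#", 1)[-1])
    return conditions


conditions = license_conditions(LICENSE_RDF)
```

The three lists map directly onto the permissions, requirements, and restrictions that are presented to the user.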

The library takes advantage of standard Python tools such as Buildout and nose. Once it is completed, the remaining work will be writing the Web application that serves as an interface to libvalidator.

RDFa for Semantic MediaWiki [GSoC 2008]

David McCabe, July 1st, 2008

Hello, world!

My name is David McCabe, and this summer I am adding RDFa support to Semantic MediaWiki, as part of the Google Summer of Code 2008. I am an undergraduate in Mathematics at Portland State University. For the Google Summer of Code 2006, I wrote Liquid Threads, a MediaWiki extension that replaces talk pages with a threaded discussion system.

Semantic MediaWiki (SMW) is the software used for the CC wiki and many other wikis. SMW allows authors to mark up wiki pages so that their contents and relationships are machine-readable. SMW already publishes this machine-readable data in RDF/XML format.

You can read about RDFa on the CC Wiki. There is also a Google Tech Talk on RDFa.

Python && cumulative metrics?

Ankit Guglani, June 23rd, 2008

So I get to see the state of the art in reconciling Apache and Squid logs. Based on this I need to come up with a way to reformulate the referrer ID and other such data for the logs at i.creativecommons and the ones from Varnish. As speculated in my messy proposal, a .sh using egrep is employed. Still, the bulk of the work is done in Python. So this doesn’t give me an excuse to read up on the Advanced Bash Shell Scripting Guide, but instead something on Python. Fun as well.
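For a feel of the Python side, here is a rough sketch of pulling the referrer out of one log line. It assumes the line is in Apache's "combined" format; Squid's native format is laid out differently and would need its own pattern. The sample line, `parse_line` and `referrer_host` are made up for illustration and are not the existing scripts.

```python
import re
from urllib.parse import urlsplit

# Apache "combined" format:
# host ident user [time] "request" status size "referrer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one combined-format line, or None if it doesn't match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


def referrer_host(fields):
    """Return the host part of the referrer, or None when it is absent ("-")."""
    return urlsplit(fields["referrer"]).hostname


# Made-up sample line, not taken from the real logs.
line = ('192.0.2.1 - - [23/Jun/2008:10:15:32 +0000] '
        '"GET /l/by/3.0/88x31.png HTTP/1.1" 200 5281 '
        '"http://example.org/post.html" "Mozilla/5.0"')

fields = parse_line(line)
host = referrer_host(fields)
```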

As far as I can tell, these scripts will be run before the logs are archived and uploaded to S3 storage. This will work great for new logs generated from the day the scripts are deployed onwards. But what about analysis that requires cumulative data, or trend analysis? I’ll need to sort this one out, since a lot of the analysis depends on access to all the data.
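One possible way around this (just an idea, nothing decided) would be to keep a small per-day summary next to each archived log and build the cumulative numbers from those summaries, so the raw logs only need to be read once. The older archives would still need a one-time backfill. A minimal sketch, with made-up counts and a hypothetical `merge_daily` helper:

```python
from collections import Counter


def merge_daily(cumulative, daily):
    """Add one day's counts to the running totals and return the totals."""
    cumulative.update(daily)
    return cumulative


# Hypothetical per-day request counts per license code.
day1 = Counter({"by": 120, "by-sa": 80})
day2 = Counter({"by": 100, "by-nc": 40})

totals = Counter()
for day in (day1, day2):
    merge_daily(totals, day)
```

Keeping the daily summaries in date order would also cover trend analysis, since each one is a point on the time series.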

Will be working from a fellow GSoCer’s place today, hoping to make up some ground lost to travels and intermittent internet access. Will be back in Singapore and firing on all cylinders on the 8th.

How big are the logs again?

Ankit Guglani, June 13th, 2008

OK, so now I should maybe go back and thank nathany for the advice not to download the whole dump of logs, which looked like an innocent 319.xx GB back then … it’s only once I got some samples and started playing around that I realised that was 319.xx GB of archives, which when unzipped come, by rough calculation, to over 2 terabytes of text logs. That much space, I unfortunately don’t have.
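The rough calculation can be repeated on any sample without unpacking it to disk. The sketch below assumes the archives are gzip files (the post only says "unzipped"), and the function names are made up for the example.

```python
import gzip
import os


def expansion_ratio(path, chunk_size=1 << 20):
    """Return uncompressed bytes / compressed bytes for one gzip file.

    The file is streamed in chunks, so nothing is written to disk.
    """
    uncompressed = 0
    with gzip.open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            uncompressed += len(chunk)
    return uncompressed / os.path.getsize(path)


def estimate_unpacked_gb(archive_gb, ratio):
    """Scale the total archive size by the ratio measured on a sample."""
    return archive_gb * ratio
```

Going from 319 GB to a bit over 2 TB corresponds to a ratio somewhere around 6.5. The same streaming loop could also feed each chunk's lines to a regular expression, which would sidestep the disk space problem altogether.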

Apart from that, the data looks interesting. More information than I had anticipated. I recall Asheesh mentioning some standard tools for working with the logs; I’ll have to follow up on that. Otherwise, now would be as good a time as any to practice some regular expressions. (-:

Flickr Image Re-Use for OpenOffice.org update

Mihai Husleag, June 13th, 2008

I’m happy to announce that I have succeeded in implementing, in a basic manner, all 3 requirements for this project: searching photos by tags, filtering by license, and inserting a photo into a document.

Here is a screenshot taken after a search on the tag “mountains” and the license “Attribution License”:

Results after a search on a tag and a license

And here is a screenshot with the photo inserted into a document. As you can see, the image was inserted at a default size, but this will be changed later.

What I’ll try to do next:

  • add a menu to each image with the available sizes
  • improve the search
  • insert the image into the document at the selected size
  • add the license to the document
  • do more testing

I hope to make a good version available in less than 2 weeks.

Any comments or suggestions are much appreciated.

PS: I came across this article. “I for one can’t wait,” says Andrew Min about this project. I’ll try not to disappoint him :)

S3 Finally!

Ankit Guglani, June 3rd, 2008

/me hugs paulproteuss … yes an extra s in there today, but thanks to him [and the guy who posted this :: http://www.macosxhints.com/article.php?story=2008020123070799 with a link to an archive with the S3Browser.app] I now finally have access to S3! (-:

OK, so now that I have access to 25887 objects which take up 319.191 GB (and growing), I need to sort out which ones I need a local copy of. I sure don’t want to go around messing about with the stuff online, especially since that’s the only copy! I would take the whole dump but there are costs involved:

  • Storage – not a big deal, I have 250+ GB available now
  • Time – shouldn’t take *too* long, I can let my mac be sleepless a couple of nights
  • Money – apparently it would cost about 50 bucks for the transfer
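For what it’s worth, the numbers above can be sanity-checked with a few lines. The size and the "about 50 bucks" figure come from this post; the 10 Mbit/s link speed is purely an assumption for illustration, and both function names are made up.

```python
def implied_rate(total_usd, size_gb):
    """Per-GB transfer price implied by a total cost and a total size."""
    return total_usd / size_gb


def transfer_hours(size_gb, mbit_per_s):
    """Hours needed to download size_gb over a link of mbit_per_s."""
    megabits = size_gb * 1000 * 8  # decimal gigabytes to megabits
    return megabits / mbit_per_s / 3600


SIZE_GB = 319.191  # from the post
COST_USD = 50      # "about 50 bucks", from the post
LINK_MBIT = 10     # assumed sustained download speed, not from the post

rate = implied_rate(COST_USD, SIZE_GB)
hours = transfer_hours(SIZE_GB, LINK_MBIT)
```

So the quoted cost works out to roughly 16 cents per GB, and on the assumed link the download would take closer to three days than a couple of nights.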

And well, I guess that would be just taking the easy way out. So, I’ll shovel through and familiarize myself with the data so I know which parts I really, really should have and which data is not going to be very helpful in the analysis, and I’ll make a copy of whatever makes for good analysis.

Eyeballing shows me a lot of error logs, which I might not include for analysis at the moment [at least not as a part of the GSoC project ... maybe later]. I’ll probably make a big list of all the different types of logs in there and the attributes each of them has. Then I can probably start looking at how I can combine the attributes stored in each of them to come up with useful metrics.

That’s all for now, need to wrap up other projects before I can get started on GSoC full throttle. So, it’s 3:15 am and I am signing off to get back to work.

CC-Logger The S3 Chronicles

Ankit Guglani, June 2nd, 2008

Task One :: Setup a client to access the S3 bucket and retrieve the logs.

Attempt One :: Jungle Disk. A GUI tool for the Mac that mounts S3 as a drive. Sounds cool, easy to install and set up, detects the buckets, but the mounted drive is empty. I waited, expecting it to load the information, but my firewall told me it wasn’t exchanging any information with the S3 server, and that was that.

Attempt Two :: Now I went for S3Sync.rb and s3cmd.rb, two Ruby scripts that are touted as the solution. I set up the yml file with the keys for accessing the bucket and place it in the required location. I run the script and get “Environment is not set up”. I go over the readme again and, sure enough, I had missed something: there were two prerequisites, Ruby and the OpenSSL library for Ruby. I had Ruby of course, but not the OpenSSL library. So I go and look for it and guess what, it doesn’t exist O_o. There are only eassl and jruby-openssl.

I don’t want to switch to another tool for a few reasons. One, this was recommended by the folks at CC, so I know this works with their setup. Two, each tool interacts with S3 in a different way and other tools may or may not work even if I set them up properly. Three, come on, this is Ruby, something I am familiar with, and it looks fairly easy to use, if only I can get hold of that dependency.

I’ll read up on some posts I came across earlier where people said they were using S3 … maybe there is some description of the setup there. If I find nothing, back to the IRC.

That’s all for now.

CC-Logger [GSoC 2008]

Ankit Guglani, June 1st, 2008

Before I dive-in on what CC-Logger is all about, a small intro to who I am:

My name is Ankit Guglani, and I am an undergrad at Singapore Management University [link]. I have been working on a research project called ‘CC-Monitor’ since 2006 under the supervision of Prof. Giorgos Cheliotis. This was my introduction to Creative Commons, and I have been working on collecting and compiling CC metrics ever since.

This summer is again blessed with the Google Summer of Code [GSoC]. Yes, just like last year, Creative Commons [CC] is one of the mentoring organizations. This time around there are 4 active CC-GSoC projects; here’s what my GSoC project with CC [named CC-Logger] is all about.

The project aims to uncover the hidden metrics in the CC logs. So I’ll take a look at the following logs (which are tucked away in an Amazon S3 account):

1) logs for creativecommons.org/license
2) logs for i.creativecommons.org and creativecommons.org/images/licenses
3) logs of creativecommons.org/licenses
4) logs for search.creativecommons.org

Looking at these I hope to find additional information about CC usage such as switching patterns and click-through rates to the deeds, and then analyze and interpret these results. I also hope to come up with some indices / metrics that can be used as indicative predictors for trends.

I know, you’re thinking it’s summer of CODE, and this is all analysis, where is the code? Once I am done with coming up with metrics and indices, I get to code to automate the calculation (and possibly the web-publishing of the results) for all the future logs.

The fun part about this project is that the logs I need to analyze right now are over 100 GB in size. I am so looking forward to having a local copy of that!

That’s all for now, thanks to prof. Giorgos for getting me into CC, Mike for suggesting looking at the logs and of course Asheesh, for the mentoring. (-:

GSoC 2008 : Flickr Image Re-Use for OpenOffice.org

Mihai Husleag, May 29th, 2008

As the title might suggest, I have been selected for GSoC 2008. Nathan Yergler has been assigned as the mentor for this project.

Development will focus on 3 key features:

  • ability to search photos by tags
  • filter search results by license attributes
  • insert the image into the document along with attribution information
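To illustrate the first two points, here is a small sketch of how a tag search filtered by license can be expressed against Flickr's REST API. It is written in Python only to keep it short (the extension itself will be in Java) and it only builds the request URL without sending it. The `flickr.photos.search` method and its `tags` and `license` parameters are part of the Flickr API; license id 4 is assumed to be the "Attribution License", which is best re-checked with `flickr.photos.licenses.getInfo`. The API key and `build_search_url` are placeholders.

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"

# Assumption: id 4 is the "Attribution License" in Flickr's license list.
ATTRIBUTION_LICENSE_ID = "4"


def build_search_url(api_key, tags, license_ids):
    """Build a flickr.photos.search request for the given tags and licenses."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "tags": ",".join(tags),            # comma-separated tag list
        "license": ",".join(license_ids),  # comma-separated license ids
        "format": "json",
        "nojsoncallback": "1",
    }
    return FLICKR_REST + "?" + urlencode(params)


# "YOUR_API_KEY" is a placeholder, not a real key.
url = build_search_url("YOUR_API_KEY", ["mountains"], [ATTRIBUTION_LICENSE_ID])
```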

The first 2 steps were done (of course, this will not be the final version) in a small demo that I attached to my application for GSoC 2008. The OpenOffice components for which the extension will be implemented are Writer, Impress and Calc.

The application will be written in Java, using NetBeans with its plugin OpenOffice NetBeans Integration.

A short introduction: I’m Mihai Husleag, 24 years old, a Computer Science student at Alexandru Ioan Cuza University of Iasi, Romania. My previous experience as a programmer is mostly with the .NET framework. Another thing about me: if I’m not reachable on the weekends, there’s a high probability that you will find me here.

If you have any suggestions about this project (new features, things you don’t like, etc.), feel free to leave a comment.

License-oriented metadata validator and viewer: the development has just started

Hugo Dworak, May 26th, 2008

Creative Commons participates in Google Summer of Code™ and has accepted a proposal (see the abstract) by Hugo Dworak, based on its description of a task to rewrite its now-defunct metadata validator. Asheesh Laroia has been assigned as the mentor of the project. Work began on May 26th, 2008 as per the project timeline and is expected to be completed in twelve weeks. More details will be provided in the dedicated CC Wiki article, and progress will be featured weekly on this blog.

The project focuses on developing an on-line tool, free software written in Python, to validate digitally embedded Creative Commons licenses within files of different types. Files will be pasted directly into a form, identified by a URL, or uploaded by a user. The application will present the results in a human-readable fashion and notify the user if the means used to express the license terms are deprecated.
