Not Panicking: switching to Virtualenv for deployment

cwebber, July 29th, 2011

I’ve written about zc.buildout and virtualenv before and how to use them both simultaneously, which I find to be useful for development on my own machine. I really admire both of these tools; I especially think that buildout is really great for projects where you want developers to be able to get your package running quickly without having to understand how python packaging works. (I use buildout for this purpose for one of my own personal projects, MediaGoblin, and I think it’s served a wonderful purpose of getting new contributors up and going quickly.)

Anyway, in that previous blogpost about zc.buildout and virtualenv I erroneously suggested that virtualenv is best for multiple packages in development and zc.buildout is better for just one. I was rightly corrected that you can use the develop line of a buildout config file to specify multiple python packages in development. That’s roughly what we’ve been doing for the last year: running a meta-package with cc.engine checked out of git and the rest installed as python packages.
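For reference, such a config looks something like the following (a minimal sketch; aside from cc.engine, the package names here are illustrative, not our actual buildout.cfg):

[buildout]
# Each path under develop= is a source checkout that buildout
# installs as a development egg.
develop =
    src/cc.engine
    src/another.package
parts = python

[python]
# zc.recipe.egg generates ./bin/python with the listed eggs importable.
recipe = zc.recipe.egg
interpreter = python
eggs =
    cc.engine
    another.package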

We’ve been doing packaging and releasing to our own egg basket for a while, and for the most part that has worked out, but our system administrator Nathan Kinkade pointed out that we don’t really need packages: they add a bunch of extra build steps, nobody outside of CC is using them, and it’s a lot easier to roll back a git repository in an emergency than a python package.

That led me to reconsider the way we’re currently doing deployment, and my growing feeling that zc.buildout, while great for developing locally, really just isn’t a good option for deployment. Whenever I want to pull down new versions of packages, I run buildout. But buildout likes to do something which makes this period very, very painful: if for whatever reason it can’t manage to install all the packages, it tears down the entire environment. It removes ./bin/python; it removes all the other scripts. I’ve found this to be highly stressful, especially because you never know if some package on PyPI is going to time out, and then suddenly, as punishment, your environment no longer works, parts of creativecommons.org aren’t running, and you start to have a minor panic attack as you rush to get things up again. That’s not very great.

Anyway, I always stress out about this, which has led me to add coping mechanisms to our fabric deploy script:

Don't Panic! screenshot

This helps reduce my blood pressure somewhat, but regardless, we decided to move from buildout to virtualenv for deployment. There’s not much more to say: the switch only took a couple of hours, and nothing about it turned out to be special. It just works and generally seems a lot simpler.
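For the curious, the virtualenv side amounts to something like this (a rough sketch with made-up paths and file names; our actual fabric script drives these steps, and cc.engine is the only real package name here):

# Build an isolated environment once (2011-era virtualenv).
virtualenv --no-site-packages /var/www/env

# Install the main package from its git checkout as a development egg,
# so an emergency rollback is just a git checkout plus a restart.
cd /var/www/src/cc.engine
/var/www/env/bin/python setup.py develop

# Remaining dependencies come from regular releases via pip
# (requirements.txt is a hypothetical file name here).
/var/www/env/bin/pip install -r requirements.txt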

In short: buildout is pretty great. If you’re looking for an option to make it really, really easy for people who want to try out your project to get something working or start contributing, it’s the closest the python world has to an interface as simple as (or simpler than) `./configure && make`. But as for deployment… especially if you’d like to do code checkouts of your main packages, just go with virtualenv.

Comments Off

LRMI tech WG CFP

ml, July 18th, 2011

If you know your stuff, then you might be able to guess from the subject what this is about. Perhaps LR = Learning Resource is not obvious. More on the main CC blog…

Comments Off

CTO

ml, July 12th, 2011

Creative Commons is hiring its next Chief Technology Officer.

If you follow the links in the post linked above, you can find out a lot about the technology we’re looking for someone to be chief officer of. Why not submit a patch, bug report, or documentation edit with your resume? ;-)

Comments Off

RDFaCE: an RDFa-enhanced TinyMCE rich editor

ml, July 9th, 2011

The idea that there should be such a thing has been around for a long time (it feels like much longer than the 28 months the RDFa Plugin for WordPress tech challenge has been on the wiki). I recall multiple Summer of Code applications proposing to tackle the problem. However, it is a really hard UI problem.

I’m really happy to see the announcement of RDFaCE, which does most of the hard work.

Without reading any documentation or watching their screencast (still haven’t watched it, so no idea if it is any good!), I was able to add a cc:attributionName annotation specific to the image in their demo on my first try:

  • select the photographer’s name and insert a cc:attributionName annotation with the literal value already in the text (RDFaCE seems to already know the correct cc: namespace mapping)
  • select the content around the photo and set the subject to the photo’s URL
  • verify that the triples produced are correct (roughly the markup sketched below)
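The markup that produces such a triple looks roughly like this (a reconstruction from memory with a placeholder URL and name, not RDFaCE’s literal output):

<div xmlns:cc="http://creativecommons.org/ns#"
     about="http://example.com/photos/1234.jpg">
  <!-- @about sets the subject (the photo URL); @property attaches
       cc:attributionName to it with the selected text as a literal. -->
  Photo by <span property="cc:attributionName">Jane Photographer</span>
</div>

The extracted triple has the photo URL as subject, cc:attributionName as predicate, and the photographer’s name as the literal object.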

Granted, I more or less know what I’m doing. But so do lots of other people. Contrary to some impressions, annotating stuff on the web with name-value pairs (“stuff” is the subject in the “triple”) is hardly brain-twisting.

I look forward to seeing RDFaCE bundled in a WordPress plugin with some awareness of the WordPress media manager, and to using it on this very blog.

TinyMCE is the free software rich text editor used in lots of projects in addition to WordPress, so this is a great step forward!

1 Comment »

Libre Graphics Magazine interview at Libre Graphics Meeting

cwebber, July 5th, 2011

As discussed previously, I represented Creative Commons at Libre Graphics Meeting 2011. Also attending were the people behind Libre Graphics Magazine. If you aren’t already familiar with Libre Graphics Magazine, it’s a cool project crossing free software and free cultural works. It isn’t as much a magazine of free software design tutorials (though to some extent it is that also) as it is a design magazine offering a critical perspective on and showcasing works made with such tools.

It’s valuable that we have a magazine that can show off the strengths of libre graphics tools when put in the hands of capable artists. But the people behind this magazine can probably describe this much better themselves. On that note, Danny Piccirillo recorded an interview with the main people behind Libre Graphics Magazine (Ana Carvalho, ginger “all lowercase” coons, Ricardo Lafuente). Amongst other things, the interview touched on why even the printing of the magazine itself is useful:

Ana Carvalho: In the professional world, one of the things that is usually pointed out to people that use FLOSS [Free/Libre/Open Source Software] for design is that it’s not good for printing.

ginger coons: We proved them wrong!

Ana Carvalho: Yes. And… you can see it’s possible. And you can do it with the same quality that you can do it with other kinds of tools. So that’s a very strong point.

ginger coons: That really is a constant refrain even within our own community. People always still talk about the printing problem. So… what printing problem?

Ricardo Lafuente: There’s a lot of edges to be ironed out, but on the other hand we do get compliments from printers on how good our PDFs are constructed. And that’s thanks to the quality of FLOSS software. There’s still this kind of misconception that FLOSS software is not up to par with professional standards… that’s not true, people still don’t believe that, but that’s their problem, and this is one of our ways to try and prove them wrong and actually try and get their interest toward alternate ways of making beautiful things.

There are plenty of other gems in the interview. Assuming we’ve piqued your interest, you can watch the whole thing below:

Libre Graphics Magazine on YouTube screenshot
View on YouTube or archive.org / CC BY-SA 3.0

And of course check out Libre Graphics Magazine itself. The magazine is licensed as CC BY-SA 3.0, and PDFs are available at no cost on the site, but it really is a magazine that is designed for and shines best in print, so consider purchasing a physical copy. Thanks to ginger coons and also Ana Carvalho and Ricardo Lafuente of Manufactura Independente for taking the time to do this interview and to Danny Piccirillo for the large time investment in both filming it and editing it down.

Comments Off

Notes on CC adoption metrics from The Power of Open

ml, June 27th, 2011

Last week Creative Commons released a book titled The Power of Open featuring dozens of case studies of successful uses of CC tools, beautifully laid out magazine-style. The book also has a couple of pages (45 and 46) on metrics and a pretty graph of CC adoption over the years.

See the main CC blog for non-technical detail on the data behind this graph. This post serves as a technical companion — read below for how to reproduce.

Every day (modulo bugs and outages) we request license link and licensed work counts from Yahoo! Site Explorer and Flickr respectively (and sometimes elsewhere, but those two are currently pertinent to our conservative estimation). You can find the data and software (if you want to start independently gathering data) here.

After loading the data into MySQL, we delete some rows to avoid having to filter later: those representing links that aren’t of interest, and those from non-Yahoo link: queries. In the future at least the former ought to be moved to a separate table.


delete from simple where license_uri = 'http://creativecommons.org/licenses/GPL/2.0/';
delete from simple where license_uri = 'http://creativecommons.org/licenses/LGPL/2.1/';
delete from simple where license_uri = 'http://creativecommons.org';
delete from simple where license_uri = 'http://www.creativecommons.org';
delete from simple where license_uri = 'http://creativecommons.org/licenses/publicdomain/1.0/';
delete from simple where license_uri = 'http://creativecommons.org/licenses/by-nc-nd/2.0/deed-music';
delete from simple where license_uri = 'http://creativecommons.org/licenses/by-nc-nd/2.0/br/creativecommons.org/licenses/sampling/1.0/br/';
delete from simple where license_uri = 'http://creativecommons.org/licenses/zero/1.0/';
delete from simple where license_uri = 'http://creativecommons.org/licenses/publicdomain';
delete from simple where search_engine != 'Yahoo';

The following relatively simple query obtains average counts for each distinct license across December (approximating year-end). For the six main version 2.0 licenses, Flickr knows about more licensed works than Yahoo! Site Explorer does, so Flickr numbers are used: we know at least that many works for each of those licenses exist. greatest(yahoo.ct, coalesce(flickr.ct,0)) accomplishes this. coalesce is necessary for Flickr as we don’t have data most of the time, and don’t want to compare with NULL.


select * from (
  select ym, sum(atleast) totalcount from (
    select yahoo.ym, yahoo.license_uri,
           greatest(yahoo.ct, coalesce(flickr.ct, 0)) atleast
    from (select extract(year_month from timestamp) ym,
                 license_uri, round(avg(count)) ct
          from simple
          group by license_uri, extract(year_month from timestamp)) yahoo
    left join
         (select extract(year_month from utc_time_stamp) ym,
                 license_uri, round(avg(count)) ct
          from site_specific
          group by license_uri, extract(year_month from utc_time_stamp)) flickr
      on flickr.ym = yahoo.ym and flickr.license_uri = yahoo.license_uri
  ) x group by ym
) x where ym regexp '12$';

Results of above query:

Year End    Total License Count
2003                    943,292
2004                  4,541,586
2005                 15,822,408
2006                 50,794,048
2007                137,564,807
2008                214,970,426
2009                336,771,549
2010                407,679,266

The more complicated query below also obtains the number of fully free/libre/open works and the proportion of works that are such:


select free.ym, freecount, totalcount, freecount/totalcount freeproportion
from
  (select ym, sum(atleast) freecount from
    (select yahoo.ym, yahoo.license_uri,
            greatest(yahoo.ct, coalesce(flickr.ct, 0)) atleast
     from (select extract(year_month from timestamp) ym,
                  license_uri, round(avg(count)) ct
           from simple
           group by license_uri, extract(year_month from timestamp)) yahoo
     left join
          (select extract(year_month from utc_time_stamp) ym,
                  license_uri, round(avg(count)) ct
           from site_specific
           group by license_uri, extract(year_month from utc_time_stamp)) flickr
       on flickr.ym = yahoo.ym and flickr.license_uri = yahoo.license_uri) x
   where license_uri regexp 'publicdomain'
      or license_uri regexp 'by/'
      or license_uri regexp 'by-sa/'
   group by ym) free,
  (select ym, sum(atleast) totalcount from
    (select yahoo.ym, yahoo.license_uri,
            greatest(yahoo.ct, coalesce(flickr.ct, 0)) atleast
     from (select extract(year_month from timestamp) ym,
                  license_uri, round(avg(count)) ct
           from simple
           group by license_uri, extract(year_month from timestamp)) yahoo
     left join
          (select extract(year_month from utc_time_stamp) ym,
                  license_uri, round(avg(count)) ct
           from site_specific
           group by license_uri, extract(year_month from utc_time_stamp)) flickr
       on flickr.ym = yahoo.ym and flickr.license_uri = yahoo.license_uri) x
   group by ym) total
where free.ym = total.ym and free.ym regexp '12$';

The above query obtains the following:

Year End    Free License Count    Total License Count    Free License %
2003                   208,939                943,292            22.15%
2004                 1,011,650              4,541,586            22.28%
2005                 4,369,938             15,822,408            27.62%
2006                12,284,600             50,794,048            24.19%
2007                40,020,147            137,564,807            29.09%
2008                68,459,952            214,970,426            31.85%
2009               136,938,501            336,771,549            40.66%
2010               160,064,676            407,679,266            39.26%

The pretty graph in the book reflects the total number of CC licensed works and the number of fully free/libre/open CC licensed works at the end of each year; the legend and text note that the proportion of the latter has roughly doubled over the history of CC.

If we look at the average for each month, not only December (remove the regular expression matching '12$' at the end of the year-month datestring), the data is noisier, and it appears data collection failed for two months in mid-2007, which perhaps should be interpolated.
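Concretely, the monthly series of total counts is the total-count query above with that final filter (and its now-unneeded outer wrapper) removed:

select ym, sum(atleast) totalcount from (
  select yahoo.ym, yahoo.license_uri,
         greatest(yahoo.ct, coalesce(flickr.ct, 0)) atleast
  from (select extract(year_month from timestamp) ym,
               license_uri, round(avg(count)) ct
        from simple
        group by license_uri, extract(year_month from timestamp)) yahoo
  left join
       (select extract(year_month from utc_time_stamp) ym,
               license_uri, round(avg(count)) ct
        from site_specific
        group by license_uri, extract(year_month from utc_time_stamp)) flickr
    on flickr.ym = yahoo.ym and flickr.license_uri = yahoo.license_uri
) x group by ym;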

The results of the above queries and some additional charts may be downloaded as a spreadsheet.

As noted previously, additional data is available for analysis. There’s also more that could be done with the license-link and site-specific data used above, e.g., analysis of particular license classes, version updates, and jurisdiction ports. Also see the non-technical post.

Comments Off

FSF recommends CC0 for code snippets in documentation

cwebber, May 27th, 2011

This week Brett Smith of the Free Software Foundation announced a new publication, How to choose a license for your own work. It is good to see the FSF making such a document; hopefully it can reduce confusion and wasted time for developers starting new projects, and give guidance that helps reduce license proliferation.

There are several interesting things in this document, amongst which is the recommendation of the Apache License 2.0 for non-copyleft works (the announcement gives some insight into the thinking that led to this). But from Creative Commons’ perspective the most interesting part of the article is certainly the recommendation of CC0 for code snippets in documentation. From the document:

Some documentation includes software source code. For instance, a manual for a programming language might include examples for readers to follow. You should both include these in the manual under the FDL’s terms, and release them under another license that’s appropriate for software. Doing so helps make it easy to use the code in other projects. We recommend that you dedicate small pieces of code to the public domain using CC0, and distribute larger pieces under the same license that the associated software project uses.

This announcement comes on the heels of our other recent announcement that CC0 is now recognized as acceptable for software and is compatible with the GPL, something we worked on carefully with the Free Software Foundation to clarify. It is good to see results coming out of this collaboration and we hope to see more collaboration with the FSF and more practical uses of CC0 for software in the future.

Comments Off

Using wget to login to Mediawiki

nkinkade, April 30th, 2011

For a couple of years CC has been using Pywikipediabot to do a few small operations on a password-protected, private installation of Mediawiki: it would create a basic page, ask people to add information to that page, and then a few days later email the contents of that page to a group of people.

As of today we are no longer using Pywikipediabot to create the page, but only to mail the contents of a page. It occurred to me that Pywikipediabot was really overkill for such a small task, so I decided to write a simple shell script using wget to accomplish it. My initial thought was to use the Mediawiki API, but all the documentation I found indicated that if one merely wants the content of a page, the thing to use is the action query parameter to index.php, as in /SomeArticle?action=raw. It wasn’t even clear to me that there would be a way to accomplish what I wanted via the API without having to parse an XML response (there may be, I just didn’t readily find it).

So I decided to use wget to work with the normal user interface of Mediawiki, but I didn’t quickly find any good information on how to go about this, and what I did find was outdated and no longer worked. I’m posting this here in case it’s useful to anyone else. Here is the basic idea:

#!/bin/bash

PAGE_TITLE="Some_page_title"

RCPT_TO=group@somesite.com
MAIL_FROM="'John Q. Public' <management@somesite.com>"
MAIL_SUBJECT="Contents of ${PAGE_TITLE}"

MW_LOGIN="Some Login"
MW_PASSWD="somepassword"

# Mediawiki uses a login token, and we must have it for this to work.
WP_LOGIN_TOKEN=$(wget -q -O - --save-cookies cookies.txt --keep-session-cookies \
        "http://www.somesite.com/Special:UserLogin" \
        | grep wpLoginToken | grep -oE '[a-z0-9]{32}')

# Log in, sending credentials and token; the session cookie saved above
# must accompany the POST or the token won't validate.
wget -q --load-cookies cookies.txt --save-cookies cookies.txt --keep-session-cookies \
        --post-data "wpName=${MW_LOGIN}&wpPassword=${MW_PASSWD}&wpRemember=1&wpLoginattempt=Log%20in&wpLoginToken=${WP_LOGIN_TOKEN}" \
        "http://www.somesite.com/index.php?title=Special:UserLogin&action=submitlogin&type=login"

# Now that we're logged in, fetch the raw wikitext of the page.
wget -q -O email_body.txt --load-cookies cookies.txt \
        "http://www.somesite.com/${PAGE_TITLE}?action=raw"

cat email_body.txt | mail -s "${MAIL_SUBJECT}" -a "From: ${MAIL_FROM}" ${RCPT_TO}
3 Comments »

The Future of DiscoverEd

nathan, April 11th, 2011

The DiscoverEd project was started in 2008 to explore how structured data could be applied to improving search for open educational resources (OER). Since then we have seen the ability of a working prototype to engage people’s imaginations, and have been fortunate to have our work supported by the Hewlett Foundation, Open Society Foundation, and the Bill & Melinda Gates Foundation, through their support of AgShare. Today, in an effort to focus our resources and expertise on areas that will have maximum impact, we’re discontinuing development of the project.

DiscoverEd was initially conceived as a Google Custom Search Engine (CSE), which would utilize labels provided by curators. When we ran into issues with applying labels at the resource level, instead of to broad URL patterns, we began to look for alternate implementations. Creative Commons chose to build on Apache Nutch, an open source search engine. We previously built on Nutch when developing the prototype CC Search in 2003-2004, which was later retired when Yahoo! and later Google added CC support to their search products.

Building on Apache Nutch, we added the ability to index and search on structured data encountered in web pages. This structured data, usually in the form of RDFa, could describe the license, subject area, education level, or language of a resource. In developing DiscoverEd, we recognized that structured data could be useful more broadly than just for OER, so while these are the fields we focused on as a starting point, DiscoverEd indexes all structured data it encounters, making it very flexible for emergent and exploratory vocabularies.

DiscoverEd succeeded in demonstrating how structured data and full text indexing can work together to provide a richer, more flexible search interface. By allowing users to perform an initial search using familiar keywords and then narrow the results by additional fields, DiscoverEd supports iterative refinement. (See our paper from OpenEd 2010 for a fuller discussion of the search interface implemented, and how it addresses user needs.) The code for DiscoverEd is freely available under the Apache Software License, and can be found in its repository hosted by Gitorious. While Creative Commons is not currently developing the code, we may return to it in the future if an opportunity presents itself, or if there is a need to test additional ideas related to search and discovery.

Creative Commons is discontinuing development to focus our resources and expertise where we can have maximum impact. We do not have the resources needed to run DiscoverEd at web scale, but would love to see someone take that on. Through the development of DiscoverEd, Creative Commons has observed that there have been many attempts to describe educational resources and how they relate together in a complete, rigorous manner. These attempts have failed to gain the traction necessary for widespread adoption on the scale of Dublin Core or CC REL. There is an opportunity for the community to build consensus around a set of properties for describing resources, attempting to balance utility (enough information to be useful) with succinctness (only describing that which is necessary, to avoid unnecessary impediments to adoption).

With the generous support of the Hewlett Foundation, Creative Commons will be working over the next year to identify key factors to success. You can follow the work in the “Describing OER” category on this blog, or on the Describing OER wiki page.

Update/Clarification (13 April 2011): Search for CC licensed (“open”) content is largely solved: Google has implemented a version at web scale, and CC REL provides a clear mechanism for marking and labeling. However, search and discovery for open educational resources is not a solved problem: many projects, including DiscoverEd, have tried different approaches to the issue, but none has successfully deployed a web scale OER search engine. Creative Commons has identified the lack of a vocabulary with widespread adoption as one issue impeding progress. While we plan to focus our efforts on that particular problem, we encourage others to continue working on the larger challenge of OER discovery.

Comments Off

A post with good advice for GSOC students

cwebber, March 25th, 2011

As Nathan already mentioned, we will be once again participating in Google Summer of Code this year. GSOC can be a lot of fun, but can also be very intimidating or frustrating for both students and mentors.

Luckily for everyone, some kind folks have written up a nice post called The DOs and DON’Ts of Google Summer of Code: Student Edition (and I hear there may be more such posts on the way?). We thought it was a good read, and maybe if you’re considering participating in Google Summer of Code, reading (and following) that advice will help make it both smoother and even more fun than it already is!

Addendum: Also, I think a lot of the difficulty in GSOC projects, and in a lot of free software development, is caused by natural human shyness. Which makes sense. But in general, if you’re afraid or embarrassed to ask questions or participate, remember that your mentor is there to help you. So keep that in mind! (… and also the aforementioned do’s and don’ts post’s advice.)

1 Comment »

