code sprint

DiscoverEd Code Sprint: Day 3

akozak, June 18th, 2010

Day 1 and Day 2 of the DiscoverEd code sprint turned out to be very productive, and the third and final day didn’t disappoint either. All teams were able to complete or make significant contributions to useful new features.

The team that developed support for arbitrary metadata and a plugin architecture for pulling in external data on Days 1 and 2 switched to integrating branches with the main DiscoverEd source tree. They spent some time fixing bugs in new code that connects the RDFa parser with the Jena store, so that it could be merged with other branches in the code repository, including their work on the metadata plugins, the master branch, and the provenance work previously begun by Creative Commons. After merging those branches, they also took some time to update their work from Day 1 and Day 2 to store the provenance of the metadata. Towards the end of the day they began working on support for running all the DiscoverEd tests with a single script. After today they plan on doing some housekeeping work, namely documenting the new plugin support in the CC wiki.

The “flexible query interface” team continued to debug problems with DiscoverEd. They had identified the bugs encountered in Days 1 and 2, so they worked on transitioning away from an old metadata-writing API that they suspect doesn’t work with the current version of Nutch. It appears this issue may be related to the upgrade from Nutch 0.8 to 1.1. Going forward, they plan on migrating code to the new API, and testing if it resolves the problems.

The “user generated metadata” team decided to divide their project into smaller tasks. They mocked up the form interface and began work on building it into a JSP. They were also building tests for the back end of DiscoverEd. These tests check that once you add a tag, you can get the resource ID and the tag back through a search, and that the tag has been added to the Jena store so that the next time the Nutch index is run, the tag is added to the Lucene index. They also worked through a lot of merge conflicts, which had stalled development.

Creative Commons thanks the AgShare project funders (The Gates Foundation), MSU, vuDAT, MSU Global, and the participants in the sprint for making all of the contributions to DiscoverEd over the past three days possible.

What makes DiscoverEd exciting is that, while most search engines use algorithmic analyses of resources alone for search, DiscoverEd can incorporate facts and semantic information provided by the publishers or curators, enabling more useful search. Structured data in standardized formats such as RDFa is a powerful way for otherwise unrelated projects and resource curators to cooperate and express facts about their resources in the same way so that third-party tools (like DiscoverEd) can use that data for other purposes (like search and discovery). We look forward to deploying this innovative tool for our AgShare partners to enable search and discovery of educational resources about agriculture and hope that it’s found useful in other contexts as well.


DiscoverEd Code Sprint: Day 2

akozak, June 17th, 2010

Day 2 of the DiscoverEd code sprint was largely a continuation of work that started on Day 1.

The team working on a plugin architecture for DiscoverEd finished their project, which was to build an extension point that enables plugins to pull arbitrary metadata from external sources. At the beginning of the day they had to finish some tests for the new support for arbitrary metadata added in Day 1, but by the end of the day were able to merge their new functionality back into the DiscoverEd repository. They were even able to build a small proof-of-concept plugin that pulled data from the API into the Jena store, which later makes its way into the Lucene index. Tomorrow they’ll start looking at developing support for custom vocabularies and ontologies (such as AGROVOC).
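As a rough illustration of what an extension point for external metadata can look like, here is a minimal sketch in plain Java. The names `MetadataProvider` and `ProviderRegistry` are hypothetical and are not DiscoverEd's or Nutch's actual plugin API; the sketch only shows the general shape of asking several pluggable sources for metadata about one resource and merging the answers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical extension point; the names are illustrative only and are
// not taken from the DiscoverEd codebase.
interface MetadataProvider {
    // Returns extra metadata (predicate URI -> value) for a resource URL.
    // An empty map means this provider knows nothing about the resource.
    Map<String, String> fetch(String resourceUrl);
}

class ProviderRegistry {
    private final List<MetadataProvider> providers = new ArrayList<>();

    void register(MetadataProvider provider) {
        providers.add(provider);
    }

    // Asks every provider in registration order. When two providers supply
    // the same predicate, the value from the earlier provider is kept.
    Map<String, String> collect(String resourceUrl) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (MetadataProvider provider : providers) {
            for (Map.Entry<String, String> e : provider.fetch(resourceUrl).entrySet()) {
                merged.putIfAbsent(e.getKey(), e.getValue());
            }
        }
        return merged;
    }
}
```

In this sketch a plugin is just an implementation of `MetadataProvider`, for example one that calls a web service. The merged map is what would then be written to the metadata store.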

The flexible query syntax team was able to commit their changes to the DiscoverEd code that let you configure arbitrary metadata queries. For example, if you had been running a DiscoverEd instance to search a feed of OpenCourseWare resources that uses Dublin Core metadata in RDFa format to express information about those resources, and wanted to enable searches on metadata that wasn’t supported in DiscoverEd, you would have had to write Java code to enable that. With this change you can now just create a configuration file mapping an arbitrary tag to that metadata, which will store that metadata alongside the tag in the Lucene index. This allows Nutch (which searches the Lucene index) to search on the custom tags defined in the config file. This code was pushed to a branch while the team spent the rest of the day chasing down a bug where only a subset of metadata gets successfully added to the Lucene index in their working environment.
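To make the idea of a tag-to-metadata mapping concrete, here is a minimal sketch. The file format (a Java properties file), the class name `TagMapping`, and the tag names are assumptions made for illustration and do not reflect DiscoverEd's actual configuration format. Only the Dublin Core predicate URIs in the usage note are real.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch; this is NOT DiscoverEd's real configuration format.
// Each config line maps a short query tag to the RDF predicate whose value
// should be stored in the index under that tag.
class TagMapping {
    private final Map<String, String> tagToPredicate = new LinkedHashMap<>();

    // Parses properties-style text such as "subject = <predicate URI>".
    static TagMapping parse(String config) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(config));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        TagMapping mapping = new TagMapping();
        for (String tag : props.stringPropertyNames()) {
            mapping.tagToPredicate.put(tag, props.getProperty(tag).trim());
        }
        return mapping;
    }

    // Returns the predicate URI for a tag, or null if the tag is not configured.
    String predicateFor(String tag) {
        return tagToPredicate.get(tag);
    }

    // Given one resource's metadata (predicate URI -> value), returns the
    // fields (tag -> value) an indexer would store alongside the document.
    Map<String, String> indexFields(Map<String, String> metadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : tagToPredicate.entrySet()) {
            String value = metadata.get(e.getValue());
            if (value != null) {
                fields.put(e.getKey(), value);
            }
        }
        return fields;
    }
}
```

A config line such as `subject = http://purl.org/dc/elements/1.1/subject` would then let a query on the tag `subject` match whatever value a resource publishes for that Dublin Core predicate, with no new Java code per field.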

The “user-generated metadata” team was able to create a test which adds a new resource into the Jena store, adds a tag to that resource, and verifies the tag was added… and it passes! Their next step is to create a test for a similar process in the Lucene index, such that a tag added to a new resource in the Jena store gets successfully added to the index and one can search for that tag through Nutch. The first test passing was a prerequisite for starting work on the second test.
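The two checks described above can be sketched with simple in-memory stand-ins. The class `TagRoundTrip` below is hypothetical and uses neither Jena nor Lucene: one map plays the role of the metadata store, and a second map plays the role of the search index, which only sees new tags after an explicit re-index step.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Stand-in for the two storage layers. None of this is DiscoverEd, Jena,
// or Lucene code; it only models the store-then-reindex flow.
class TagRoundTrip {
    // "Store" side: resource URL -> tags attached to it.
    private final Map<String, Set<String>> store = new HashMap<>();
    // "Index" side: tag -> resource URLs. Only reindex() updates it.
    private final Map<String, Set<String>> index = new HashMap<>();

    void addResource(String url) {
        store.putIfAbsent(url, new LinkedHashSet<>());
    }

    void addTag(String url, String tag) {
        Set<String> tags = store.get(url);
        if (tags == null) {
            throw new IllegalArgumentException("unknown resource: " + url);
        }
        tags.add(tag);
    }

    // First check: the tag is visible in the store right after it is added.
    Set<String> tagsFor(String url) {
        return store.getOrDefault(url, Collections.emptySet());
    }

    // Rebuilds the index from the store, as a fresh indexing run would.
    void reindex() {
        index.clear();
        for (Map.Entry<String, Set<String>> e : store.entrySet()) {
            for (String tag : e.getValue()) {
                index.computeIfAbsent(tag, t -> new LinkedHashSet<>()).add(e.getKey());
            }
        }
    }

    // Second check: after reindex(), searching for the tag finds the resource.
    Set<String> search(String tag) {
        return index.getOrDefault(tag, Collections.emptySet());
    }
}
```

The ordering matters in the same way as in the real tests: a search for a freshly added tag is expected to come back empty until the index has been rebuilt, which is why the store-level test had to pass first.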

Work continues into the third and final day, so look for a wrap-up report from Day 3 soon. The various repository commits can be browsed on Gitorious.


DiscoverEd Code Sprint: Day 1

akozak, June 16th, 2010

This week some of us from Creative Commons are in lovely East Lansing, Michigan, at Michigan State University for the DiscoverEd code sprint hosted by the folks at MSU vuDAT.

For those of you not familiar with DiscoverEd, it first began as a Creative Commons project to show how structured data can enable and improve search and discovery of educational resources on the web.

Search and discovery of educational resources has been a hot topic within the OER (Open Educational Resources) community for some time. Because high-quality educational resources that are free to use, remix, and redistribute are scattered across the web, the question has been “How can learners and instructors find and discover these resources in useful ways?” CC has previously done significant work in the metadata domain, having developed a W3C-published vocabulary for publishing licensing metadata and being involved in early efforts to bring semantic web concepts to scientific data. DiscoverEd not only represents a continuation of CC’s efforts to make data interoperability work, but insofar as it deals in metadata about open content, it’s an interesting synthesis of open, standardized data formats and open content.

Development for DiscoverEd has recently been supported by AgShare, an MSU-led project funded by the Gates Foundation:

The aim of the AgShare planning and pilot project is to create a scalable and sustainable collaboration of existing organizations for African publishing, localizing, and sharing of teaching and learning materials that fill critical resource gaps in African MSc agriculture curriculum and that can be modified for other downstream uses.

AgShare involves institutional collaboration for the modification and dissemination of very useful open educational content internationally. Creative Commons is supporting the project by developing, documenting, and supporting an instance of DiscoverEd educational search that will provide learners in Africa with a method to find or discover educational resources curated by participating organizations.

Which takes me to where we are today: CC and our AgShare partners at MSU put together a DiscoverEd code sprint to connect our AgShare developers with the community of programmers and downstream users interested in educational resource search and discovery using structured data, so that they can find common ground and areas for collaboration. Most participants have listed themselves on the sprint wiki page.

After brief introductions, everyone spent some time instantiating the codebase, getting access to version control, sharing commonly used terms, and settling other coordination issues. That took up the first half of the day, and after lunch at a great Ethiopian restaurant, we returned to begin work on new code. Given the unfamiliarity many of the developers had with the codebase, the original plan for pair programming was abandoned to allow larger groups:

One group focused on developing code to allow users to add missing or additional metadata about resources. This would enable arbitrary tags to be injected into both the Jena triple-store and Lucene documents from the Nutch front-end. This functionality would extend the reach of Nutch’s search beyond what’s found through aggregation, to include metadata provided by users or a community.

Another group focused on developing a plugin interface for storing and retrieving metadata from external services and databases. This functionality could potentially supply metadata for resources without any metadata, or could supplement existing metadata with additional data from other services and databases. A splinter group of that original team took time to analyze the OpenCalais API to see if it could be integrated with the above DiscoverEd functionality to provide useful metadata.

The last team worked on getting up to speed on the DiscoverEd architecture and creating instructions for getting it running in an Ubuntu virtual box environment on Windows, as well as test-driving new documentation for hacking DiscoverEd. This team also spent time mapping out the work needed to implement more flexible query filtering. This turned out to be an invaluable exercise as they discovered inefficiencies and unnecessary code in the DiscoverEd codebase (namely, in how DiscoverEd maps query strings for metadata to Lucene index data), and are in the process of planning for a new metadata query syntax.

Day 2 will likely be an extension of today’s work. Look for another post soon.
