The Future of DiscoverEd

nathan, April 11th, 2011

The DiscoverEd project was started in 2008 to explore how structured data could be applied to improving search for open educational resources (OER). Since then we have seen the ability of a working prototype to engage people’s imaginations, and have been fortunate to have our work supported by the Hewlett Foundation, Open Society Foundation, and the Bill & Melinda Gates Foundation, through their support of AgShare. Today, in an effort to focus our resources and expertise on areas that will have maximum impact, we’re discontinuing development of the project.

DiscoverEd was initially conceived as a Google Custom Search Engine (CSE), which would utilize labels provided by curators. When we ran into issues with applying labels at the resource level, instead of to broad URL patterns, we began to look for alternative implementations. Creative Commons chose to build on Apache Nutch, an open source search engine. We had previously built on Nutch when developing the prototype CC Search in 2003-2004, which was retired after Yahoo! and then Google added CC support to their search products.

Building on Apache Nutch, we added the ability to index and search on structured data encountered in web pages. This structured data, usually in the form of RDFa, could describe the license, subject area, education level, or language of a resource. In developing DiscoverEd, we recognized that structured data could be useful more broadly than just for OER, so while these are the fields we focused on as a starting point, DiscoverEd indexes all structured data it encounters, making it very flexible for emergent and exploratory vocabularies.

DiscoverEd succeeded in demonstrating how structured data and full text indexing can work together to provide a richer, more flexible search interface. Users start with a familiar keyword search and then iteratively refine the results by additional fields. (See our paper from OpenEd 2010 for a fuller discussion of the search interface implemented, and how it addresses user needs.) The code for DiscoverEd is freely available under the Apache Software License, and can be found in its repository hosted by Gitorious. While Creative Commons is not currently developing the code, we may return to it in the future if an opportunity presents itself, or if there is a need to test additional ideas related to search and discovery.
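
As a rough illustration of the "keyword search first, then refine by field" pattern described above, here is a minimal Python sketch. It is not DiscoverEd's code (which is Java on top of Nutch and Lucene); the `Resource` class, the field names, and the sample URLs are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A crawled page plus whatever structured metadata was found for it."""
    url: str
    text: str
    metadata: dict = field(default_factory=dict)  # e.g. {"language": "en"}


def keyword_search(resources, query):
    """Return resources whose full text contains every query word (case-insensitive)."""
    words = query.lower().split()
    return [r for r in resources if all(w in r.text.lower() for w in words)]


def refine(results, field_name, value):
    """Narrow an existing result list to resources where field_name equals value."""
    return [r for r in results if r.metadata.get(field_name) == value]


# Hypothetical sample data, for illustration only.
CORPUS = [
    Resource("http://example.org/a", "Intro to soil science",
             {"language": "en", "education_level": "undergraduate"}),
    Resource("http://example.org/b", "Soil chemistry lab notes", {"language": "fr"}),
    Resource("http://example.org/c", "Medieval history lecture", {"language": "en"}),
]

hits = keyword_search(CORPUS, "soil")          # the familiar first step
english_hits = refine(hits, "language", "en")  # then refine by a metadata field
```

Each call to `refine` works on the previous result list, so refinements can be stacked in any order.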

Creative Commons is discontinuing development to focus our resources and expertise where we can have maximum impact. We do not have the resources needed to run DiscoverEd at web scale, but would love to see someone take that on. Through the development of DiscoverEd, Creative Commons has observed that there have been many attempts to describe educational resources and how they relate to one another in a complete, rigorous manner. These attempts have failed to gain the traction necessary for widespread adoption on the scale of Dublin Core or CC REL. There is an opportunity for the community to build consensus around a set of properties for describing resources, attempting to balance utility (enough information to be useful) with succinctness (only describing that which is necessary, to avoid unnecessary impediments to adoption).

With the generous support of the Hewlett Foundation, Creative Commons will be working over the next year to identify key factors to success. You can follow the work in the “Describing OER” category on this blog, or on the Describing OER wiki page.

Update/Clarification (13 April 2011): Search for CC licensed (“open”) content is largely solved: Google has implemented a version at web scale, and CC REL provides a clear mechanism for marking and labeling. However, search and discovery for open educational resources is not a solved problem: many projects, including DiscoverEd, have tried different approaches to the issue, but none has successfully deployed a web scale OER search engine. Creative Commons has identified the lack of a vocabulary with widespread adoption as one issue impeding progress. While we plan to focus our efforts on that particular problem, we encourage others to continue working on the larger challenge of OER discovery.

Supporting tools for decentralized metadata

nathan, March 16th, 2011

Over the past couple of years Creative Commons has built DiscoverEd, a prototype search and discovery tool. We built DiscoverEd to explore how search for open educational resources (OER) could be improved through the use of decentralized metadata. But DiscoverEd was never an end point. DiscoverEd is one of what we hope will be many applications developed to leverage decentralized, structured data about resources on the web. (Our license deeds are another application that uses metadata published with works, in that case to provide attribution for re-users.) Recently we’ve been thinking about tools that could be developed to complement DiscoverEd to create a rich and compelling ecosystem for decentralized metadata for educational resources.

The use of decentralized metadata to drive discovery allows creators and curators to publish information about works without relying on a central authority, and allows developers to utilize that data without seeking permission from a gatekeeper. However, self-publishing requires a certain degree of technical expertise from creators and curators. Two tools can help ease this burden and aid deployment of the necessary metadata. A Validator would help publishers and curators understand how their resources are ingested and processed by DiscoverEd (and other tools). A Curation Tool would allow users (individually, as an ad hoc group, or as part of an institutional team) to identify resources and label them with quality, review, or other metadata.

The Validator tool would allow users to enter a URL to be checked, and return details of what information DiscoverEd or other software could extract. The results would also link to examples and to common problems encountered when publishing metadata: for example, how to publish information about the education level and subject matter of a resource, or about which resources were remixed to create the new one. A self-service tool would allow users to repeatedly check the state of their resources, so they can understand how changes made to their site impact the way others interact with it. A self-service tool is essential to scale adoption beyond the level possible when each publisher requires hands-on assistance.

The Validator would also be integrated with DiscoverEd. DiscoverEd utilizes decentralized metadata to improve its search index, and to allow users to search by particular facets, such as subject, education level, or language. When it does not have metadata for one of the “core” fields (education level, subject, license, language in the default configuration), it displays a help icon to indicate that some piece of information is missing. After initial development is complete, the help icons will be linked to the Validator so that users and publishers alike can get immediate feedback about what’s missing and what’s there.
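
The "which core fields are missing?" check behind the help icon could be as simple as the sketch below. This is an assumption-laden illustration, not the Validator itself: the field names in `CORE_FIELDS` are made-up keys standing in for the four core fields listed above, and the metadata is assumed to have already been extracted into a plain dictionary.

```python
# Hypothetical keys for the four "core" fields named in the post.
CORE_FIELDS = ("education_level", "subject", "license", "language")


def missing_core_fields(metadata, core_fields=CORE_FIELDS):
    """Return the core fields that have no non-empty value in metadata."""
    return [name for name in core_fields if not metadata.get(name)]


# Example: a resource that declares a license and a language only.
report = missing_core_fields({
    "license": "http://creativecommons.org/licenses/by/3.0/",
    "language": "en",
    "subject": "",  # present but empty, so it still counts as missing
})
```

A search interface could show the help icon whenever the returned list is non-empty, and a validator page could list each missing field alongside a link to publishing examples.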

The Curation Tool would be a general purpose piece of software which would allow users to identify works, and annotate additional information about them. We imagine that common annotations might be that a work meets some quality review, aligns to a particular standard, or is simply something the user “likes”. Just as social bookmarking tools like Delicious allow users to make a list of resources, the Curation Tool would allow users to create lists, identifying why a particular resource is in the list, and possibly adding additional metadata not provided by the publisher. For example, a user might make a list of resources which they have reviewed for quality, and identify which Common Core standard each conforms to. The tool would allow users to collaborate on lists, as well. All lists would be public, and published in a way that allows DiscoverEd to ingest the information collected. The Curation Tool would be open source software, so users can download a copy and run it for their own school or professional society, if they so desire.

We think that the development of supporting tools can help advance the adoption of decentralized, structured data for educational resources. Are there simple ideas we’ve missed? Twists on these we should take into account? Leave your comments below.

Type Of: Educational (an idea)

nathan, December 3rd, 2010

I spent most of yesterday in a meeting discussing ways to make search better for open educational resources. Preparing my short presentation for the day, I thought again about one of the challenges of doing this at web scale: how do you determine what’s an educational resource? In DiscoverEd we rely on curators to tell us that a resource is educational, but that requires us to start with lists of resources from curators; it’d be nice to start following links and add things if they’re educational, move on if they aren’t. If you want to build OER search that operates at web scale, this is one of the important questions, because it influences what gets into the index, and what’s excluded[1]. Note that the question is not “what is an open educational resource”; the “open” part is handled by marking the resource with a CC license. With reasonable search filters you can start with the pool of CC licensed educational resources, and further restrict it to Attribution or Attribution-ShareAlike licensed works if that’s what you need.

Creative Commons licenses work in a decentralized manner by including a bit of RDFa with the license badge generated by the chooser. But no similar badge exists for OER or educational resources, at least partly because it’s hard to agree on what the one definition of OER is. But what if we just tried to say, “I’m publishing this, and I think it’s educational.” Maybe we can do that. After seeing the Xpert Project tweet about a microformat/RDFa/etc to improve discoverability, I decided to try my hand at a first draft.

<span about="" typeof="ed:LearningResource" xmlns:ed="">Educational</span>

This tag generates the triple:

<> <> <> .

Literally, “this web page is a Learning Resource”. Of course this could be written as a link, image, or even made invisible to the user.
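
To make the step from markup to triple concrete, here is a small Python sketch that reads `about` and `typeof` off an element and emits the corresponding `rdf:type` statement. It is a deliberately tiny stand-in for a real RDFa parser: it only looks at prefixes declared on the same element, and only handles elements that carry both attributes. The namespace `http://example.org/ed#` and the page URL are placeholders invented for the example, not a proposal for the actual vocabulary.

```python
from html.parser import HTMLParser

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class TypeofExtractor(HTMLParser):
    """Collect (subject, rdf:type, class) triples from elements with about + typeof."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "typeof" not in attrs or "about" not in attrs:
            return  # simplification: real RDFa has more cases than this
        # Prefixes declared on this element only, e.g. xmlns:ed="...".
        prefixes = {k[len("xmlns:"):]: v
                    for k, v in attrs.items() if k.startswith("xmlns:")}
        prefix, _, local = attrs["typeof"].partition(":")
        # about="" means "this document", i.e. the page's own URL.
        subject = attrs["about"] or self.base_url
        self.triples.append((subject, RDF_TYPE, prefixes[prefix] + local))


def to_ntriples(triples):
    """Serialize triples whose terms are all URIs, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


EXAMPLE_NS = "http://example.org/ed#"  # hypothetical namespace, for illustration
markup = (f'<span about="" typeof="ed:LearningResource" '
          f'xmlns:ed="{EXAMPLE_NS}">Educational</span>')

parser = TypeofExtractor("http://example.org/page")
parser.feed(markup)
line = to_ntriples(parser.triples)
```

Whatever term ends up being chosen, the shape of the statement stays the same: the page is the subject, `rdf:type` is the predicate, and the vocabulary's "learning resource" class is the object.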

There is one big question here: what should ed:LearningResource actually be? There are lots of efforts to create vocabularies out there, so we should clearly reuse a term from one of those efforts. If we reuse one that defines a hierarchy including things like Course, Lecture, etc, as refinements of Learning Resource, that may also provide interesting information for improving the search experience.

This markup won’t be visible in Google, but it will allow crawlers and software to start determining what people think is educational. And that seems like progress to me. Reasonable first step? Fundamentally flawed? Inexcusably lame? Feedback welcome.

[1] I should note that the question is really “How does someone online say that a resource is educational?” because a) you want to allow people to make the judgement about other resources online, and b) you care about who’s saying the resource is educational. Please pardon my reductionism.


Curation and Structured Data for educational resources

nathan, September 2nd, 2010

I spent most of last week in Alexandria, Virginia, talking about DiscoverEd (and hearing others talk about their work, including Duraspace and Handles) with the Learning Registry project. Learning Registry is a project of the US government that’s focusing on how to make federal learning resources more accessible to educators. We were invited to discuss DiscoverEd face to face after it became clear that some of the issues we’ve been addressing — searching metadata about OER, multiple parties making different assertions about the same resource — were going to be key for Learning Registry.

Learning Registry has been collecting ideas about the project using Idea Scale, and on Tuesday Aaron published a list of the most popular ideas to date. Reading the list, it’s clear that there are a few themes. First, people are interested in using structured data to help them search. Whether it’s a microformat to identify a resource as educational, extraction of metadata, adding information about specific properties (seat time, cost, etc), several ideas centered around making use of structured data to improve the search experience.

While that’s not too surprising, it was also interesting to note that the idea of distributed curation came up, whether people called it that or not. The microformat suggestion involves, at its most basic level, the ability to say “I think this is educational”. It’s not a far step from that to “I think this other resource (created by someone else) is educational.” People also suggested using sitemaps as the basis for listing educational resources. All of this makes me think that the idea of curating and collating resources is going to be an important part of how people find information in the future (whether they’re explicitly curating, or implicitly by posting a link to Twitter, it’s all about filtering information to the set you’re interested in).

It’s always nice to hear that you’re on the right track with a project; seeing the ideas suggested for Learning Registry reinforces my belief that we’re looking at the right areas for improving the OER search experience. (It’s also awesome that people want license information incorporated into search results — what a great idea!)

Named Graph support lands in DiscoverEd

nathan, September 1st, 2010

Earlier this week we updated the master branch on the DiscoverEd project. While this “release” contains many improvements, the biggest one by far is the use of named graphs for tracking provenance. The initial DiscoverEd prototype supported multiple curators making statements about the same resource. For example, the same photograph of the US moon landing might be useful for both a history and a physics course. Unfortunately the initial prototype didn’t retain information about who made each individual statement, which limited the ways in which you could refine your search.

With this week’s release we’re storing that information, and opening up more ways to explore the metadata stored about resources. We use Jena as our RDF store, and Named Graphs for Jena (NG4J) provide an elegant way to integrate this source information with DiscoverEd.
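
For readers who have not met named graphs before, the idea is that each statement is stored together with the name of the graph (here, the curator) it came from, turning triples into quads. The toy Python store below illustrates that idea only; it is not the NG4J or Jena API, and the curator URLs and photograph URL are invented for the example.

```python
from collections import defaultdict


class QuadStore:
    """Toy provenance-aware store: each curator's statements live in their own graph."""

    def __init__(self):
        self._graphs = defaultdict(set)  # graph name -> set of (s, p, o)

    def add(self, graph, subject, predicate, obj):
        self._graphs[graph].add((subject, predicate, obj))

    def statements_about(self, subject):
        """Yield (graph, predicate, object) for every statement about subject."""
        for graph, triples in self._graphs.items():
            for s, p, o in triples:
                if s == subject:
                    yield graph, p, o

    def sources_asserting(self, subject, predicate, obj):
        """Return the sorted names of graphs that contain this exact statement."""
        return sorted(g for g, triples in self._graphs.items()
                      if (subject, predicate, obj) in triples)


# Hypothetical identifiers, for illustration only.
PHOTO = "http://example.org/photos/moon-landing"
SUBJECT = "http://purl.org/dc/terms/subject"
HISTORY_CURATOR = "http://example.org/curators/history"
PHYSICS_CURATOR = "http://example.org/curators/physics"

store = QuadStore()
store.add(HISTORY_CURATOR, PHOTO, SUBJECT, "History")
store.add(PHYSICS_CURATOR, PHOTO, SUBJECT, "Physics")

who_says_physics = store.sources_asserting(PHOTO, SUBJECT, "Physics")
```

With the source retained, a search can be narrowed to "resources my physics curator tagged" rather than just "resources anyone tagged Physics", which is the kind of refinement the original prototype could not offer.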

Also notable in this release is database independence. Any database backend supported by NG4J can be used with DiscoverEd, and the default configuration uses Derby for ease of experimentation and development.

Because we did not store metadata previously, we’re recrawling our sources now. Our instance of DiscoverEd will be updated with this code over the next week or two.

DiscoverEd Code Sprint: Day 3

akozak, June 18th, 2010

Day 1 and Day 2 of the DiscoverEd code sprint turned out to be very productive, and the third and final day didn’t disappoint either. All teams were able to complete or make significant contributions to useful new features.

The team that developed support for arbitrary metadata and a plugin architecture for pulling in external data on Days 1 and 2 switched to integrating branches with the main DiscoverEd source tree. They spent some time fixing bugs in new code that connects the RDFa parser with the Jena store, so that it could be merged with other branches in the code repository, including their work on the metadata plugins, the master branch, and the provenance work previously begun by Creative Commons. After merging those branches, they also took some time to update their work from Day 1 and Day 2 to store the provenance of the metadata. Towards the end of the day they began working on support for running all the DiscoverEd tests with a script. After today they plan on doing some housekeeping work, namely creating documentation about new plugin support in the CC wiki.

The “flexible query interface” team continued to debug problems with DiscoverEd. They had identified the bugs encountered in Days 1 and 2, so they worked on transitioning away from an old metadata-writing API that they suspect doesn’t work with the current version of Nutch. It appears this issue may be related to the upgrade from Nutch 0.8 to 1.1. Going forward, they plan on migrating code to the new API, and testing if it resolves the problems.

The “user generated metadata” team decided to divide and conquer smaller tasks of their project. They mocked up the form interface and began work on building it into a JSP. They were also building tests for the back end of DiscoverEd, which basically test that once you add a tag, you can get the resource ID and the tag back through a search, and that it’s been added into the Jena store such that when the Nutch index is run again the tag will be added to the Lucene index. They worked through a lot of merge conflicts, which had stalled development.

Creative Commons thanks the AgShare project funders (The Gates Foundation), MSU, vuDAT, MSU Global, and the participants in the sprint for making all of the contributions to DiscoverEd over the past three days possible.

What makes DiscoverEd exciting is that, while most search engines use algorithmic analyses of resources alone for search, DiscoverEd can incorporate facts and semantic information provided by the publishers or curators, enabling more useful search. Structured data in standardized formats such as RDFa is a powerful way for otherwise unrelated projects and resource curators to cooperate and express facts about their resources in the same way so that third-party tools (like DiscoverEd) can use that data for other purposes (like search and discovery). We look forward to deploying this innovative tool for our AgShare partners to enable search and discovery of educational resources about agriculture and hope that it’s found useful in other contexts as well.

DiscoverEd Code Sprint: Day 2

akozak, June 17th, 2010

Day 2 of the DiscoverEd code sprint was largely a continuation of work that started on Day 1.

The team working on a plugin architecture for DiscoverEd finished their project, which was to build an extension point that enables plugins to pull arbitrary metadata from external sources. At the beginning of the day they had to finish some tests for the new support for arbitrary metadata added in Day 1, but by the end of the day were able to merge their new functionality back into the DiscoverEd repository. They were even able to build a small proof-of-concept plugin that pulled data from the API into the Jena store, which later makes its way into the Lucene index. Tomorrow they’ll start looking at developing support for custom vocabularies and ontologies (such as AGROVOC).
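
As a sketch of what an extension point for external metadata can look like, the Python below defines a small plugin interface and a loop that feeds whatever the plugins return into a store. The real extension point lives in DiscoverEd's Java code and may be shaped quite differently; `MetadataPlugin`, `StaticLookupPlugin`, and `enrich` are hypothetical names, and the dictionary stands in for the Jena store.

```python
from abc import ABC, abstractmethod


class MetadataPlugin(ABC):
    """Hypothetical extension point: supply extra metadata for a resource URL."""

    @abstractmethod
    def fetch(self, url):
        """Return a list of (predicate, value) pairs for url (possibly empty)."""


class StaticLookupPlugin(MetadataPlugin):
    """Proof-of-concept plugin backed by an in-memory table instead of a web API."""

    def __init__(self, table):
        self.table = table

    def fetch(self, url):
        return list(self.table.get(url, []))


def enrich(store, url, plugins):
    """Add everything the plugins know about url to store (a dict of sets)."""
    for plugin in plugins:
        for predicate, value in plugin.fetch(url):
            store.setdefault(url, set()).add((predicate, value))
    return store


# Hypothetical data, for illustration only.
table = {"http://example.org/a": [("http://purl.org/dc/terms/subject", "Agriculture")]}
store = enrich({}, "http://example.org/a", [StaticLookupPlugin(table)])
```

A plugin that calls a remote service would implement the same `fetch` method, so the indexing side never needs to know where the metadata came from.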

The flexible query syntax team committed their changes to the DiscoverEd code, which make it possible to configure arbitrary metadata queries. For example, if you had been running a DiscoverEd instance to search a feed of OpenCourseWare resources that uses Dublin Core metadata in RDFa format to express information about those resources, and wanted to enable searches on metadata that wasn’t supported in DiscoverEd, you would have had to write Java code to enable that. With this change you can now just create a configuration file mapping an arbitrary tag to that metadata, which will store that metadata alongside the tag in the Lucene index. This allows Nutch (which searches the Lucene index) to search for the custom tags defined in the config file. This code was pushed to a branch while the team spent the rest of the day chasing down a bug where only a subset of metadata gets successfully added to the Lucene index in their working environment.
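
To give a feel for the "configuration instead of Java code" idea, here is a Python sketch in which a small config maps search tags to metadata predicates, and that one mapping drives both what gets stored per document and how a `tag:value` query term is interpreted. The JSON format, the function names, and the choice of tags are assumptions for illustration; DiscoverEd's actual configuration file format is not shown in this post.

```python
import json

# Hypothetical config: search tag -> metadata predicate (Dublin Core URIs).
CONFIG_TEXT = """
{
  "creator": "http://purl.org/dc/elements/1.1/creator",
  "audience": "http://purl.org/dc/terms/audience"
}
"""


def load_tag_map(text):
    """Parse the tag-to-predicate mapping from JSON text."""
    return json.loads(text)


def index_fields(tag_map, pairs):
    """Build {tag: [values]} for one document from its (predicate, value) pairs."""
    by_predicate = {predicate: tag for tag, predicate in tag_map.items()}
    fields = {}
    for predicate, value in pairs:
        tag = by_predicate.get(predicate)
        if tag is not None:  # predicates with no configured tag are skipped
            fields.setdefault(tag, []).append(value)
    return fields


def parse_query(tag_map, query):
    """Split 'tag:value' terms for configured tags from plain keywords."""
    filters, keywords = {}, []
    for term in query.split():
        tag, sep, value = term.partition(":")
        if sep and tag in tag_map:
            filters[tag] = value
        else:
            keywords.append(term)
    return keywords, filters


tag_map = load_tag_map(CONFIG_TEXT)
keywords, filters = parse_query(tag_map, "soil creator:Smith")
```

Adding a new searchable field then means adding one line to the config rather than writing and deploying new code.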

The “user-generated metadata” team was able to create a test which adds a new resource into the Jena store, adds a tag to that resource, and verifies the tag was added… and it passes! Their next step is to create a test for a similar process in the Lucene index, such that a tag added to a new resource in the Jena store gets successfully added to the index and one can search for that tag through Nutch. The first test passing was a prerequisite for starting work on the second test.

Work continues into the third and final day, so look for a wrap-up report from Day 3 soon. The various repository commits can be browsed on Gitorious.

DiscoverEd Code Sprint: Day 1

akozak, June 16th, 2010

This week some of us from Creative Commons are in lovely East Lansing, Michigan, at Michigan State University for the DiscoverEd code sprint hosted by the folks at MSU vuDAT.

For those of you not familiar with DiscoverEd, it first began as a Creative Commons project to show how structured data can enable and improve search and discovery of educational resources on the web.

Search and discovery of educational resources has been a hot topic within the OER (Open Educational Resources) community for some time. Given the decentralization of high-quality educational resources free to use, remix, and redistribute existing on the web, the question has been “How can learners and instructors find and discover these resources in useful ways?” CC has previously done significant work in the metadata domain, having developed a W3C-published vocabulary for publishing licensing metadata and being involved in early efforts to bring semantic web concepts to scientific data. DiscoverEd not only represents a continuation of CC’s efforts to make data interoperability work, but insofar as it deals in metadata about open content, it’s an interesting synthesis of open, standardized data formats and open content.

Development for DiscoverEd has recently been supported by AgShare, an MSU-led project funded by the Gates Foundation:

The aim of the AgShare planning and pilot project is to create a scalable and sustainable collaboration of existing organizations for African publishing, localizing, and sharing of teaching and learning materials that fill critical resource gaps in African MSc agriculture curriculum and that can be modified for other downstream uses.

AgShare involves institutional collaboration for the modification and dissemination of very useful open educational content internationally. Creative Commons is supporting the project by developing, documenting, and supporting an instance of DiscoverEd educational search that will provide learners in Africa with a method to find or discover educational resources curated by participating organizations.

Which takes me to where we are today: CC and our AgShare partners at MSU put together a DiscoverEd code sprint to connect our AgShare developers with the community of programmers and downstream users who are interested in educational resource search and discovery using structured data to find common ground and areas for collaboration. Most participants have listed themselves on the sprint wiki-page.

After brief introductions, everyone spent some time instantiating the codebase, getting access to version control, sharing commonly used terms, and settling other coordination issues. That took up the first half of the day, and after lunch at a great Ethiopian restaurant, we returned to begin work on new code. Given the unfamiliarity many of the developers had with the codebase, the original plan for pair programming was abandoned to allow larger groups:

One group focused on developing code to allow users to add missing or additional metadata about resources. This would enable arbitrary tags to be injected into both the Jena triple-store and Lucene documents from the Nutch front-end. This functionality would extend the reach of Nutch’s search beyond what’s found through aggregation, to include metadata provided by users, or a community.

Another group focused on developing a plugin interface for storing and retrieving metadata from external services and databases. This functionality could potentially supply metadata for resources without any metadata, or could supplement existing metadata with additional data from other services and databases. A splinter group of that original team took time to analyze the OpenCalais API to see if it could be integrated with the above DiscoverEd functionality to provide back useful metadata.

The last team worked on getting up to speed on the DiscoverEd architecture and creating instructions for getting it running in an Ubuntu VirtualBox environment on Windows, as well as test-driving new documentation for hacking DiscoverEd. This team also spent time mapping out the work needed to implement more flexible query filtering. This turned out to be an invaluable exercise as they discovered inefficiencies and unnecessary code in the DiscoverEd codebase (namely, in how DiscoverEd maps query strings for metadata to Lucene index data), and are in the process of planning for a new metadata query syntax.

Day 2 will likely be an extension of today’s work. Look for another post soon.
