The DiscoverEd project was started in 2008 to explore how structured data could be applied to improving search for open educational resources (OER). Since then we have seen the ability of a working prototype to engage people’s imaginations, and have been fortunate to have our work supported by the Hewlett Foundation, Open Society Foundation, and the Bill & Melinda Gates Foundation, through their support of AgShare. Today, in an effort to focus our resources and expertise on areas that will have maximum impact, we’re discontinuing development of the project.
DiscoverEd was initially conceived as a Google Custom Search Engine (CSE), which would utilize labels provided by curators. When we ran into issues with applying labels at the resource level, instead of to broad URL patterns, we began to look for alternate implementations. Creative Commons chose to build on Apache Nutch, an open source search engine. We previously built on Nutch when developing the prototype CC Search in 2003-2004, which was later retired when Yahoo! and later Google added CC support to their search products.
Building on Apache Nutch, we added the ability to index and search on structured data encountered in web pages. This structured data, usually in the form of RDFa, could describe the license, subject area, education level, or language of a resource. In developing DiscoverEd, we recognized that structured data could be useful more broadly than just for OER, so while these are the fields we focused on as a starting point, DiscoverEd indexes all structured data it encounters, making it very flexible for emergent and exploratory vocabularies.
DiscoverEd succeeded in demonstrating how structured data and full text indexing can work together to provide a richer, more flexible search interface. By allowing users to perform an initial search using a familiar keyword search, and then refine by additional fields, users are able to iteratively refine their search. (See our paper from OpenEd 2010 for a fuller discussion of the search interface implemented, and how it addresses user needs.) The code for DiscoverEd is freely available under the Apache Software License, and can be found in its repository hosted by Gitorious. While Creative Commons is not currently developing the code, we may return to it in the future if an opportunity presents itself, or if there is a need to test additional ideas related to search and discovery.
Creative Commons is discontinuing development to focus our resources and expertise where we can have maximum impact. We do not have the resources needed to run DiscoverEd at web scale, but would love to see someone take that on. Through the development of DiscoverEd, Creative Commons has observed that there have been many attempts to describe educational resources and how they relate together in a complete, rigorous manner. These attempts have failed to gain the traction necessary for widespread adoption on the scale of Dublin Core, or CC REL. There is an opportunity for the community to build consensus around a set of properties for describing resources, attempting to balance utility (enough information to be useful) with succinctness (only describing that which is necessary, to avoid unnecessary impediments to adoption).
With the generous support of the Hewlett Foundation, Creative Commons will be working over the next year to identify key factors to success. You can follow the work in the “Describing OER” category on this blog, or on the Describing OER wiki page.
Update/Clarification (13 April 2011): Search for CC licensed (“open”) content is largely solved: Google has implemented a version at web scale, and CC REL provides a clear mechanism for marking and labeling. However, search and discovery for open educational resources is not a solved problem: many projects, including DiscoverEd, have tried different approaches to the issue, but none has successfully deployed a web scale OER search engine. Creative Commons has identified the lack of a vocabulary with widespread adoption as one issue impeding progress. While we plan to focus our efforts on that particular problem, we encourage others to continue working on the larger challenge of OER discovery.
I spent most of yesterday in a meeting discussing ways to make search better for open educational resources. Preparing my short presentation for the day, I thought again about one of the challenges of doing this at web scale: how do you determine what’s an educational resource? In DiscoverEd we rely on curators to tell us that a resource is educational, but that requires us to start with lists of resources from curators; it’d be nice to start following links, adding things if they’re educational and moving on if they aren’t. If you want to build OER search that operates at web scale, this is one of the important questions, because it influences what gets into the index, and what’s excluded [1]. Note that the question is not “what is an open educational resource”; the “open” part is handled by marking the resource with a CC license. With reasonable search filters you can start with the pool of CC licensed educational resources, and further restrict it to Attribution or Attribution-ShareAlike licensed works if that’s what you need.
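To make the license filter concrete, here is a minimal sketch in Python. It assumes each indexed resource is a plain dictionary with a `license` URL; the record layout, the sample URLs under example.com, and the name `restrict_by_license` are hypothetical and are not DiscoverEd's actual data model.

```python
# Creative Commons license URLs begin with these prefixes; the trailing
# slash keeps "by/" from also matching "by-nc/" or "by-sa/".
ALLOWED_PREFIXES = (
    "http://creativecommons.org/licenses/by/",
    "http://creativecommons.org/licenses/by-sa/",
)


def restrict_by_license(resources, allowed_prefixes=ALLOWED_PREFIXES):
    """Keep only resources whose license URL starts with an allowed prefix."""
    prefixes = tuple(allowed_prefixes)
    return [r for r in resources if r.get("license", "").startswith(prefixes)]


# Hypothetical pool of CC licensed educational resources.
pool = [
    {"url": "http://example.com/algebra",
     "license": "http://creativecommons.org/licenses/by/3.0/"},
    {"url": "http://example.com/biology",
     "license": "http://creativecommons.org/licenses/by-nc/3.0/"},
    {"url": "http://example.com/chemistry",
     "license": "http://creativecommons.org/licenses/by-sa/3.0/"},
]

for resource in restrict_by_license(pool):
    print(resource["url"])
```

Matching on the URL prefix rather than the full URL means every version of a license (2.5, 3.0, and so on) passes the same filter.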
Creative Commons licenses work in a decentralized manner by including a bit of RDFa with the license badge generated by the chooser. But no similar badge exists for OER or educational resources, at least partly because it’s hard to agree on what the one definition of OER is. But what if we just tried to say, “I’m publishing this, and I think it’s educational.” Maybe we can do that. After seeing the Xpert Project tweet about a microformat/RDFa/etc to improve discoverability, I decided to try my hand at a first draft.
<span about="" typeof="ed:LearningResource" xmlns:ed="http://example.org/#">Educational</span>
This tag generates the triple:
<> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/#LearningResource> .
Literally, “this web page is a Learning Resource”. Of course this could be written as a link, image, or even made invisible to the user.
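As a rough illustration of how a crawler might pick that triple up, here is a minimal sketch using only the Python standard library. It is not a full RDFa processor: it only reads `about`, `typeof`, and `xmlns:` prefixes declared on the same element, which happens to be enough for the tag above. The names `TypeofExtractor` and `extract_types`, and the page URL, are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class TypeofExtractor(HTMLParser):
    """Collects (subject, rdf:type, class) triples from typeof attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if not attrs.get("typeof"):
            return
        # Simplification: only prefixes declared on this same element.
        prefixes = {
            name[len("xmlns:"):]: value
            for name, value in attrs.items()
            if name.startswith("xmlns:") and value
        }
        # about="" resolves to the URL of the page itself.
        subject = urljoin(self.base_url, attrs.get("about") or "")
        for curie in attrs["typeof"].split():
            prefix, _, local = curie.partition(":")
            if prefix in prefixes:
                self.triples.append((subject, RDF_TYPE, prefixes[prefix] + local))


def extract_types(html, base_url):
    """Return the rdf:type triples found in an HTML string."""
    parser = TypeofExtractor(base_url)
    parser.feed(html)
    return parser.triples


snippet = ('<span about="" typeof="ed:LearningResource" '
           'xmlns:ed="http://example.org/#">Educational</span>')

for triple in extract_types(snippet, "http://example.com/lesson.html"):
    print("<%s> <%s> <%s> ." % triple)
```

Run against a page at the hypothetical address `http://example.com/lesson.html`, this prints the same statement as above, with the empty subject replaced by the page's own URL.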
There is one big question here: what should http://example.org/# actually be? There are lots of efforts to create vocabularies out there, so we should clearly reuse a term from one of those efforts. If we reuse one that defines a hierarchy including things like Course, Lecture, etc, as refinements of Learning Resource, that may also provide interesting information for improving the search experience.
This markup won’t be visible in Google, but it will allow crawlers and software to start determining what people think is educational. And that seems like progress to me. Reasonable first step? Fundamentally flawed? Inexcusably lame? Feedback welcome.
[1] I should note that the question is “How does someone online say that a resource is educational?” because a) you want to allow people to make the judgement about other resources online, and b) you care about who’s saying the resource is educational. Please pardon my reductionism.
I spent most of last week in Alexandria, Virginia, talking about DiscoverEd (and hearing others talk about their work, including Duraspace and Handles) with the Learning Registry project. Learning Registry is a project of the US government that’s focusing on how to make federal learning resources more accessible to educators. We were invited to discuss DiscoverEd face to face after it became clear that some of the issues we’ve been addressing — searching metadata about OER, multiple parties making different assertions about the same resource — were going to be key for Learning Registry.
Learning Registry has been collecting ideas about the project using Idea Scale, and on Tuesday Aaron published a list of the most popular ideas to date. Reading the list, it’s clear that there are a few themes. First, people are interested in using structured data to help them search. Whether it’s a microformat to identify a resource as educational, extraction of metadata, adding information about specific properties (seat time, cost, etc), several ideas centered around making use of structured data to improve the search experience.
While that’s not too surprising, it was also interesting to note that the idea of distributed curation came up, whether people called it that or not. The microformat suggestion involves, at its most basic level, the ability to say “I think this is educational”. It’s not a far step from that to “I think this other resource (created by someone else) is educational.” People also suggested using sitemaps as the basis for listing educational resources. All of this makes me think that the idea of curating and collating resources is going to be an important part of how people find information in the future (whether they’re explicitly curating, or implicitly by posting a link to Twitter, it’s all about filtering information to the set you’re interested in).
It’s always nice to hear that you’re on the right track with a project; seeing the ideas suggested for Learning Registry reinforces my belief that we’re looking at the right areas for improving the OER search experience. (It’s also awesome that people want license information incorporated into search results. What a great idea!)