nathan, December 3rd, 2010

I spent most of yesterday in a meeting discussing ways to make search better for open educational resources. Preparing my short presentation for the day, I thought again about one of the challenges of doing this at web scale: how do you determine what’s an educational resource? In DiscoverEd we rely on curators to tell us that a resource is educational, but that requires us to start with lists of resources from curators; it’d be nice to start following links, adding things if they’re educational and moving on if they aren’t. If you want to build OER search that operates at web scale, this is one of the important questions, because it influences what gets into the index and what’s excluded [1]. Note that the question is not “what is an open educational resource”; the “open” part is handled by marking the resource with a CC license. With reasonable search filters you can start with the pool of CC-licensed educational resources, and further restrict it to Attribution or Attribution-ShareAlike licensed works if that’s what you need.

Creative Commons licenses work in a decentralized manner by including a bit of RDFa with the license badge generated by the chooser. But no similar badge exists for OER or educational resources, at least partly because it’s hard to agree on what the one definition of OER is. But what if we just tried to say, “I’m publishing this, and I think it’s educational.” Maybe we can do that. After seeing the Xpert Project tweet about a microformat/RDFa/etc to improve discoverability, I decided to try my hand at a first draft.

<span about="" typeof="ed:LearningResource" xmlns:ed="">Educational</span>

This tag generates the triple:

<> rdf:type ed:LearningResource .

Literally, “this web page is a Learning Resource”. Of course this could be written as a link, image, or even made invisible to the user.
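
As a rough illustration of how a crawler might read this markup, here is a minimal Python sketch that pulls `about`/`typeof` assertions out of a page and checks whether the page is marked as a Learning Resource. It is not a full RDFa parser: it only looks at prefixes declared on the same element and ignores nesting. The namespace `http://example.org/ed#` is a hypothetical placeholder (the vocabulary is still an open question), and the names `TypeofCollector` and `is_marked_educational` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical placeholder namespace; the real vocabulary is undecided.
ED_NS = "http://example.org/ed#"


class TypeofCollector(HTMLParser):
    """Collect (subject, rdf:type, class) triples from about/typeof attributes."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Simplification: only elements carrying both attributes are handled.
        if "typeof" not in attrs or "about" not in attrs:
            return
        # Simplification: only prefixes declared on this same element.
        prefixes = {
            name[len("xmlns:"):]: value
            for name, value in attrs.items()
            if name.startswith("xmlns:") and value
        }
        # about="" resolves to the page itself; other values may point elsewhere.
        subject = urljoin(self.page_url, attrs["about"] or "")
        for curie in (attrs["typeof"] or "").split():
            prefix, _, local = curie.partition(":")
            if prefix in prefixes:
                self.triples.append((subject, RDF_TYPE, prefixes[prefix] + local))


def is_marked_educational(html, page_url):
    """Return True if the page asserts that it is itself a LearningResource."""
    collector = TypeofCollector(page_url)
    collector.feed(html)
    return (page_url, RDF_TYPE, ED_NS + "LearningResource") in collector.triples


page = (
    '<span about="" typeof="ed:LearningResource" '
    'xmlns:ed="http://example.org/ed#">Educational</span>'
)
print(is_marked_educational(page, "http://example.org/lesson"))
```

Because the subject comes from `about`, the same collector would also pick up assertions a page makes about other URLs; a real crawler would need to decide how much to trust those.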

There is one big question here: what should the URI behind ed:LearningResource actually be? There are lots of efforts to create vocabularies out there, so we should clearly reuse a term from one of those efforts. If we reuse one that defines a hierarchy including things like Course, Lecture, etc, as refinements of Learning Resource, that may also provide interesting information for improving the search experience.

This markup won’t be visible in Google, but it will allow crawlers and software to start determining what people think is educational. And that seems like progress to me. Reasonable first step? Fundamentally flawed? Inexcusably lame? Feedback welcome.

[1] I should note that the question is really “How does someone online say that a resource is educational?” because a) you want to allow people to make the judgement about other resources online, and b) you care about who’s saying the resource is educational. Please pardon my reductionism.

  1. Pat says:

    I would agree that the vocabulary is part of the problem – and whether DC and DC Terms does enough is open to question.

    You could also extend the example above to include related items in the data – then things would be interesting for a crawler and the triples.

    I also think part of this is a cataloguing thing, as people need to know the system exists before they can use it, and some people may have resources that are going to take them ages to return to and recatalogue.

    Perhaps a web crawler and a crowdsource would work better / allow for scale?

  2. Nathan Yergler says:

    One reason I like this sort of solution is that it allows third-parties to publish machine readable information about resources. So if I’m passionate about a subject area, I could publish a list of the best OER on it, and annotate them as “these resources over there are Learning Resources”. Ideally you’d include additional information like language, education level, and subject, I think. Is that what you’re referring to by related items, or do you mean “here are other, independent resources that are related to this one?”

    Concur that there is an uptake problem: as you say, people need to know it exists, and before that it actually has to exist ;).

    Can you say more about what you mean by crowdsourcing?

  3. I disliked the idea at first glance: it would be more useful for people to provide more specific info (most of the things in section 5.x of IEEE LOM, excluding the generic ones that aren’t education-specific), which an education-focused search engine or other tool could use to decide whether to include a resource. But I suppose a super simple assertion that something is educational might make for quicker adoption, even of more specific assertions (more adoption leads to more learning and more tools, a beneficent cycle), some via refining LearningResource, and would be useful for schema publication (e.g. typicalLearningTime could have domain LearningResource).

    So, bravo!

    You’re right about the big question, too. Though assuming no dominant and good vocabulary exists, reuse could be worse than creating a new vocabulary with a few key properties that the right parties can quickly implement.

    (Though commenting mostly from ignorance.)

  4. Nathan Yergler says:

    Providing more concrete properties like language, education level, etc is definitely preferable to just saying “educational”. The idea of creating a constrained vocabulary that doesn’t attempt to bite off too much (and which can later have equivalence assertions added) is a good one, which would probably mesh well with other community efforts.

  5. I’d like to back this discussion up a little. What exactly is the problem we’re trying to solve in search? Is the problem that the content that folks are trying to find in search is not categorized properly? If so, your “TypeOf:” mechanism should serve quite well. But, I argue that the main problem is that the content exists in formatted silos that don’t link to anything in the web. In essence, there is no way to determine relevancy or importance because this content stands alone.

    Google usually relies on some sort of relevancy ranking to bring the user to the most appropriate search results. Google’s problem, when it comes to fine-grained educational resources, is that the resources are usually packaged objects (PDFs, PPTs, DOCs, etc.) and not websites. They are part of the deep web and don’t live at the surface. Additionally, since Google uses PageRank (based on web links) to determine relevancy, it tends to miss these types of content because they aren’t linking to other websites. What Google needs, then, is a new way to determine relevancy or importance for this type of content. Or does it?

    Instead of forcing Google to change its behavior, let’s think of another way to solve the problem. One way to fix this is for content publishers to build their educational resources into the web, not outside of the web. When a resource is built into the web and references other content in the web, it becomes a part of the network, part of the relevancy system that already exists.

    Connexions is well-known for building its educational resources into the web using their own CNXML markup. I did a quick test (and this may not mean anything significant) where I found THE most popular learning module on Connexions: “independent variable”. It’s a very short module that essentially just points to the Wikipedia definition for independent variable. I then did a Google search for “independent variable” and guess what… it returned Wikipedia and Connexions as the #1 and #2 results. Funny.

    However, it seems that most other Connexions modules don’t link outside the domain. What would happen if they did? Stand alone educational resources will continue to stand alone unless they are built into the existing infrastructure. Maybe my argument is for more and better tools to create educational resources as websites. We need to make it easier to publish content on the web, not upload objects to the web.

  6. [...] do at Creative Commons, whether modeling license attributes, work registration, or domain-specific descriptions that add value to licensed works. Nice to see in depth academic backing for this [...]

  7. Nathan Yergler says:

    Deep-web resources are obviously a very real issue for web search (both generally and for hypothetical OER search), but completely out of scope for this proposal. And linking amongst OER, both for saying “this is related/relevant” and for saying “I built on this to make my OER”, is behavior we want to encourage.

    But there’s a fundamental question, at a lower level than either of these, that I’m talking about: if I’m building an education search tool (whatever that may be), how do I decide to include resources from Connexions, and not include, for example, a MySpace page for a band called The Independent Variables?

    If we look for parallels with Creative Commons licenses (which is perhaps dangerous), we find that the easy way to limit the set of resources under consideration for a CC license search is to look for the license mark (rel=”license”); publishers are responsible for marking their work with this additional piece of information. This proposal is an attempt to answer the question of “what resources should be included in education search” using a similar principle: publishers annotate resources (theirs or others) with the assertion that they are educational.

    It’s *possible* that with more linking, etc, you could develop an algorithm that was able to make this determination with equal (or “better”) accuracy. But that requires creators to change the resources themselves (as opposed to the publishing template used by their CMS, for example), and still requires search engines to develop an algorithm to understand the relationship.

    I’m not certain this is actually sufficient, or the best way to spell it, just wanted to write it as a starting point for discussion. I agree that developing educational resources that are “born digital” or “born on the web” (and perhaps developing the tools to support that development, if needed) is an essential requirement for realizing the potential of OER.

  8. I agree with everything until I start thinking about the definition of a copyright license compared to the definition of an educational resource. I’m not sure those two things are really congruent. What are the descriptive characteristics of a license?

    1) certifier: who recognizes this as a copyright license (and where is this recognized)?
    2) authority: who is trusted to create and distribute this license (e.g., Creative Commons)?

    Are there other characteristics that are important here? Essentially, the way to make a congruent example here is to determine a body that will recognize content as being an educational resource and determine authorities that will create educational resources. Granted, CC became its own trusted authority in the copyright space and colleges and universities would be natural trusted authorities in the educational resource space. But we still need to identify a certifier or recognizing body.

    Perhaps this overcomplicates things, but if there aren’t any criteria for what counts as an educational resource, then there is no point in labeling it as such. So I jump back to my previous argument that we need to focus on changing content creation behavior.

  9. Nathan Yergler says:

    I’m actually not sure that I agree that the license assertion relies upon the certifier and authority. No one certifies that a photo I post on Flickr is really mine to CC license, definitely not Creative Commons. It’s up to the creator/publisher to mark their work. I think that knowing how to “spell” the assertion that “I think this work is educational” is the first step towards saying, “we [a school district, professional society, or even individual] have a rubric/tool for determining if a resource meets our educational standards, and this resource over here does.” I think there’s agreement that the latter is useful, and that this idea provides the basis for it, but that doesn’t necessarily mean that this idea is useful on its own :).