Indexing License Metadata in Tracker, Week 1

jakin, June 12th, 2007

Week 1 of Google Summer of Code is complete and already I’m seeing much progress. There’s a mess of formats to embed licenses into and a mess of ways to embed them. My first task has been straightening out where licenses are embedded in each format and how exactly to go about extracting them. Here’s where I’m at:

For each format: form of metadata, location of metadata, extraction with Tracker, and test content.

MP3
  • Form of metadata: XMP; native ID3 tags
  • Location of metadata: for ID3v2.4, the PRIV, XMP field; WCOP tag
  • Extraction with Tracker: Extracting MP3 tags has moved from an ID3 parser to handing off the work to GStreamer/MPlayer/Totem. As far as I can tell, this prevents me from extracting the XMP.
  • Test content: XMP embedded with Exempi

PDF
  • Form of metadata: XMP
  • Location of metadata: metadata field
  • Extraction with Tracker: Extend the current PDF extractor (which uses Poppler) to read the metadata field. Reading the metadata field isn’t wrapped in Poppler’s glib bindings, but I have written and submitted a patch.
  • Test content: XMP embedded with Exempi

OGG
  • Form of metadata: XMP; native comment field
  • Location of metadata: XMP comment field; LICENSE comment field
  • Extraction with Tracker: Extend the GStreamer extractor to check for the presence of an XMP comment field. GStreamer places this within the EXTENDED_COMMENTS tag (requires GStreamer 0.10.10).
  • Test content: XMP embedded with vorbiscomment

JPEG
  • Form of metadata: XMP
  • Location of metadata: Exif XML Packet field
  • Extraction with Tracker: Extend the ImageMagick extractor, using 'convert file.jpg xmp:-' to read XMP
  • Test content: XMP embedded with Exempi

PNG
  • Form of metadata: XMP
  • Location of metadata: iTXt, XML:com.adobe.xmp field
  • Extraction with Tracker: Extend the PNG extractor, adding a check for XML:com.adobe.xmp. (For backwards compatibility, the ability to read iTXt in libpng is disabled by default until version 1.3.)
  • Test content: XMP embedded with Exempi

HTML
  • Form of metadata: RDFa
  • Location of metadata: <a rel="license" href="…"></a>
  • Extraction with Tracker: Write a new HTML extractor, using libxml2, and scan for RDFa
  • Test content: Various actual sites, including creativecommons.org

SVG
  • Form of metadata: RDF
  • Location of metadata: /svg/metadata/rdf
  • Extraction with Tracker: I could specifically parse the XML, checking for the RDF schema used by Inkscape. Should I check for XMP also?
  • Test content: Inkscape

Any XML
  • Form of metadata: XMP
  • Location of metadata: Wherever valid
  • Extraction with Tracker: Write a generic XML extractor (and/or extractor for each particular format), scanning with libxml2

OpenOffice.org (OASIS)
  • Form of metadata: OO.org CC License Add-In
  • Location of metadata: SoC Project is working on the spec
  • Test content: OO.org Add-In

MS Office
  • Form of metadata: DocumentSummaryInformation
  • Location of metadata: Infile, CreativeCommons_LicenseURL property
  • Extraction with Tracker: Extend existing msoffice extractor
  • Test content: MSOffice Add-in
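
To make the HTML entry above concrete, here is a minimal sketch of scanning a page for <a rel="license"> links. It is not the planned Tracker extractor: it swaps libxml2 for Python's standard html.parser so it can run on its own, and the names LicenseLinkParser and find_license_urls are hypothetical.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values from <a> elements whose rel includes "license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # rel holds space-separated tokens; compare them case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "license" in rel_tokens and href:
            self.licenses.append(href)


def find_license_urls(html_text):
    """Return every license URL linked from the given HTML text."""
    parser = LicenseLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.licenses
```

Calling find_license_urls() on a page's source returns the license URLs in document order, or an empty list when the page carries no such link.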

If this is all well and good, I’d like to help bring the embedding specifications on the CC Wiki up to date.

As far as coding goes, I wrote the code for Tracker to check for and extract metadata from XMP sidecar files. The XMP is parsed by Exempi, Hubert’s XMP library. The timing of Adobe’s release of their XMP Toolkit and Hubert’s subsequent release of Exempi 1.99.x has been an early boon to the project. The ‘license’ tag in the CC namespace is the only metadata extracted at the moment.
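
As a rough illustration of what pulling the ‘license’ tag out of a sidecar involves, here is a minimal sketch that reads it from the text of an XMP packet. It is not the Tracker code and does not use Exempi: it relies on Python's standard xml.etree.ElementTree instead. The function name extract_cc_license and the two namespace URIs it checks are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: namespace URIs under which a CC license property may appear.
CC_NAMESPACES = (
    "http://creativecommons.org/ns#",
    "http://web.resource.org/cc/",
)


def extract_cc_license(xmp_text):
    """Return the first CC license URL found in an XMP packet, or None."""
    root = ET.fromstring(xmp_text)
    for ns in CC_NAMESPACES:
        name = "{%s}license" % ns
        for elem in root.iter():
            # Property written as an attribute on rdf:Description.
            if name in elem.attrib:
                return elem.attrib[name]
            # Property written as a child element, either with an
            # rdf:resource attribute or with the URL as element text.
            if elem.tag == name:
                value = elem.get("{%s}resource" % RDF_NS) or (elem.text or "").strip()
                if value:
                    return value
    return None
```

The function takes the sidecar's contents as a string, so how the sidecar file is located next to the indexed file is left out of the sketch.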

I’ve also been hacking the extractors of the above list of formats to determine the feasibility and processes of extracting license metadata from each.

At this point, feedback on the above would be much appreciated. If all is well, I can get my XMP sidecar code pushed into Tracker’s Subversion repository soon.

Happy hacking, indeed.

6 Responses to “Indexing License Metadata in Tracker, Week 1”

  1. What is the correct way to do MP3 tag embedding these days? Last year I did GSoC and finished my project. Unfortunately, it could not be patched into Banshee because they were waiting for Gnome CVS to move to SVN. When the move did happen, Banshee moved to GStreamer tag reading. As far as I can tell, GStreamer doesn’t handle the tags we need for MP3 license tag parsing.

    My point: has anyone gotten involved with the GStreamer project to fix this, or does anyone have more info on it? Until GStreamer can do what we need, license handling won’t be in Banshee =/

  2. Cassio Melo says:

    Great job Jason! Does MS Addin put CC metadata into documents? Cheers.

  3. Jason Kivlighn says:

    I just downloaded and installed the MS Add-in. Apparently the CC metadata is added to the Custom properties (the DocumentSummaryInformation section) of the file. Here’s a sample dump, using libgsf, of the CC-related properties:

    prop 'CreativeCommons_Derivatives'
    = "Share Alike"
    prop 'CreativeCommons_Licensed'
    = TRUE
    prop 'CreativeCommons_CommercialUse'
    = "Yes"
    prop 'CreativeCommons_LicenseURL'
    = "http://creativecommons.org/licenses/by-sa/2.5/"
    prop 'CreativeCommons_Jurisdiction'
    = ""

  4. Jason Kivlighn says:

    I know that in GStreamer there are tags GST_TAG_LICENSE and GST_TAG_COPYRIGHT. For MP3s, GST_TAG_COPYRIGHT is taken from the TCOP tag. I see that GST_TAG_LICENSE isn’t used, though I take it that it should be the value of the WCOP tag. It seems trivial to make this update. I’ll look into GStreamer for more info and maybe go for a patch, unless somebody has more info.

  5. Jason Kivlighn says:

    GStreamer’s id3v2 tag parser doesn’t handle WZZZ tags (which include the license URL tag, WCOP). Here’s a patch to fix that:

    http://bugzilla.gnome.org/show_bug.cgi?id=447000

    I’ll see how that goes.

  6. Jason Kivlighn says:

    I should also point out that my latest progress is tracked at:

    http://wiki.creativecommons.org/Tracker_CC_Indexing
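
Comments 4 and 5 above turn on reading the WCOP tag, which carries the license URL in an MP3's ID3v2 data. The following is a minimal standalone sketch of that lookup, for illustration only: it is not GStreamer code and not the patch linked in comment 5. The names synchsafe_to_int and read_wcop are hypothetical, and the sketch assumes an ID3v2.3 or ID3v2.4 tag with no extended header, unsynchronisation, or frame compression.

```python
def synchsafe_to_int(data):
    """Decode a synchsafe integer (7 significant bits per byte)."""
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
    return value


def read_wcop(tag_bytes):
    """Return the WCOP URL from raw ID3v2.3/2.4 tag bytes, or None."""
    if len(tag_bytes) < 10 or tag_bytes[:3] != b"ID3":
        return None
    major = tag_bytes[3]
    if major not in (3, 4):
        return None
    if tag_bytes[5] & 0x40:
        # Extended header present; not handled in this sketch.
        return None
    # The tag size in the header is always synchsafe and excludes the header.
    end = min(10 + synchsafe_to_int(tag_bytes[6:10]), len(tag_bytes))
    pos = 10
    while pos + 10 <= end:
        frame_id = tag_bytes[pos:pos + 4]
        if frame_id[:1] == b"\x00":
            break  # reached padding
        raw_size = tag_bytes[pos + 4:pos + 8]
        # Frame sizes are synchsafe in v2.4, plain big-endian in v2.3.
        if major == 4:
            size = synchsafe_to_int(raw_size)
        else:
            size = int.from_bytes(raw_size, "big")
        body = tag_bytes[pos + 10:pos + 10 + size]
        if frame_id == b"WCOP":
            # URL link frames hold the URL as ISO-8859-1 text; anything
            # after a terminating null byte is ignored.
            return body.split(b"\x00", 1)[0].decode("latin-1")
        pos += 10 + size
    return None
```

Passing the leading bytes of an MP3 file to read_wcop() yields the license URL when a WCOP frame is present and None otherwise.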