Indexing License Metadata in Tracker, Week 1
Jason Kivlighn, June 12th, 2007
Week 1 of Google Summer of Code is complete and already I’m seeing much progress. There’s a mess of formats to embed licenses into and a mess of ways to embed them. My first task has been straightening out where licenses are embedded in each format and how exactly to go about extracting them. Here’s where I’m at:
| Format | Form of Metadata | Location of Metadata | Extraction with Tracker | Test content |
| MP3 | | | Extracting MP3 tags has moved from an ID3 parser to handing off the work to GStreamer/MPlayer/Totem. As far as I can tell, this prevents me from extracting the XMP. | XMP embedded with Exempi |
| PDF | XMP | metadata field | Extend the current PDF extractor (which uses Poppler) to read the metadata field. Reading the metadata field isn’t wrapped in Poppler’s glib bindings yet, but I have written and submitted a patch. | XMP embedded with Exempi |
| OGG | | | Extend the GStreamer extractor to check for the presence of an XMP comment field. GStreamer places this within the EXTENDED_COMMENTS tag (requires GStreamer 0.10.10). | XMP embedded with vorbiscomment |
| JPEG | XMP | Exif XML Packet field | Extend the ImageMagick extractor, using ‘convert file.jpg xmp:-’ to read XMP | XMP embedded with Exempi |
| PNG | XMP | iTXt, XML:com.adobe.xmp field | Extend the PNG extractor, adding a check for XML:com.adobe.xmp. (For backwards compatibility, the ability to read iTXt in libpng is disabled by default until version 1.3.) | XMP embedded with Exempi |
| HTML | RDFa | <a rel="license" href="…"></a> | Write a new HTML extractor, using libxml2, and scan for RDFa | Various actual sites, including creativecommons.org |
| SVG | RDF | /svg/metadata/rdf | I could specifically parse the XML, checking for the RDF schema used by Inkscape. Should I check for XMP also??? | Inkscape |
| Any XML | XMP | Wherever valid | Write a generic XML extractor (and/or extractor for each particular format), scanning with libxml2 | |
| OpenOffice.org (OASIS) | OO.org CC License Add-In SoC Project is working on the spec | | | OO.org Add-In |
| MS Office | | DocumentSummaryInformation Infile, CreativeCommons_LicenseURL property | Extend existing msoffice extractor | MSOffice Add-in |
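To make the HTML row concrete, here is a rough sketch of scanning a page for `rel="license"` links. It is written in Python with the standard library’s `html.parser` rather than libxml2, purely for illustration; the class and function names are made up for this example and are not part of Tracker.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values of <a> elements whose rel attribute contains "license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # rel is a space-separated token list, e.g. rel="license nofollow"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "license" in rel_tokens and href:
            self.licenses.append(href)


def find_license_urls(html_text):
    """Return every license URL found in the given HTML text, in document order."""
    parser = LicenseLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.licenses
```

Calling `find_license_urls()` on a page’s source returns a list of license URLs, which is empty when the page carries no such link. A real extractor would probably also want to resolve relative URLs.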
If this is all well and good, I’d like to help update the CC Wiki with updated embedding specifications.
As far as coding goes, I wrote the code for Tracker to check for and extract metadata from XMP sidecar files. The XMP is parsed by Hubert’s XMP library, Exempi. The timing of Adobe’s release of their XMP Toolkit and Hubert’s subsequent release of Exempi 1.99.x has been an early boon to the project. The ‘license’ tag in the CC namespace is the only metadata extracted at the moment.
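For readers who want a feel for what the sidecar lookup involves, here is a minimal Python sketch. It is not the Tracker code, which is C and uses Exempi; it simply parses the XMP packet as XML. Two things are assumptions on my part: the CC namespace URI (`http://creativecommons.org/ns#`) and the convention that the sidecar is the file name with its extension swapped for `.xmp`. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CC_NS = "http://creativecommons.org/ns#"  # assumption: CC namespace URI


def license_from_xmp(xmp_text):
    """Return the cc:license value found in an XMP packet, or None."""
    root = ET.fromstring(xmp_text)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # Form 1: property written as an attribute, cc:license="..."
        value = desc.get(f"{{{CC_NS}}}license")
        if value:
            return value
        # Form 2: property written as a child element, either with
        # rdf:resource="..." or with the URL as element text
        child = desc.find(f"{{{CC_NS}}}license")
        if child is not None:
            value = child.get(f"{{{RDF_NS}}}resource") or (child.text or "").strip()
            if value:
                return value
    return None


def license_from_sidecar(media_path):
    """Look for a sidecar next to media_path and return its license, or None."""
    # Assumption: "song.mp3" has its sidecar at "song.xmp".
    sidecar = Path(media_path).with_suffix(".xmp")
    if not sidecar.is_file():
        return None
    return license_from_xmp(sidecar.read_text(encoding="utf-8-sig"))
```

`license_from_sidecar("song.mp3")` would return the license URL if `song.xmp` exists and carries one, and `None` otherwise.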
I’ve also been hacking the extractors of the above list of formats to determine the feasibility and processes of extracting license metadata from each.
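As one example of such a feasibility check, the PNG case can be tried without libpng at all by walking the chunks directly. The sketch below is Python for illustration only, with a made-up function name. It looks for an iTXt chunk whose keyword is `XML:com.adobe.xmp`, the spelling used in Adobe’s XMP specification, and returns its text. It does not verify CRCs.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
XMP_KEYWORD = b"XML:com.adobe.xmp"


def xmp_from_png(data):
    """Return the XMP packet stored in a PNG byte string, or None if absent."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG stream")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk is: 4-byte length, 4-byte type, data, 4-byte CRC
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        pos += 12 + length
        if chunk_type == b"IEND":
            break
        if chunk_type != b"iTXt":
            continue
        # iTXt layout: keyword NUL flag method language NUL translated NUL text
        keyword, _, rest = body.partition(b"\x00")
        if keyword != XMP_KEYWORD or len(rest) < 2:
            continue
        compressed = rest[0] == 1
        _language, _, rest = rest[2:].partition(b"\x00")
        _translated, _, text = rest.partition(b"\x00")
        if compressed:
            text = zlib.decompress(text)
        return text.decode("utf-8")
    return None
```

The returned string is the raw XMP packet, so it can be handed to whatever XMP parser is already in use.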
Feedback on the above would be much appreciated. If all is well, I can get my XMP sidecar code pushed into Tracker’s Subversion repository soon.
Happy hacking, indeed.

Luke Hoersten
June 12th, 2007 at 14:20 +0000
What is the correct way to do MP3 tag embedding these days? Last year I did GSoC and finished my project. Unfortunately, it could not be patched into Banshee because they were waiting for GNOME CVS to move to SVN. When the move did happen, Banshee moved to GStreamer tag reading. As far as I can tell, GStreamer doesn’t handle the tags we need for MP3 license tag parsing.
My point: has anyone gotten involved with the GStreamer project to fix this, or does anyone have more info on it? Until GStreamer can do what we need, license handling won’t be in Banshee =/
Cassio Melo
June 12th, 2007 at 17:30 +0000
Great job Jason! Does MS Addin put CC metadata into documents? Cheers.
Jason Kivlighn
June 12th, 2007 at 18:48 +0000
I just downloaded and installed the MS Add-in. Apparently the CC metadata is added to the Custom properties (the DocumentSummaryInformation section) of the file. Here’s a sample dump, using libgsf, of the CC-related properties:
prop ‘CreativeCommons_Derivatives’
= “Share Alike”
prop ‘CreativeCommons_Licensed’
= TRUE
prop ‘CreativeCommons_CommercialUse’
= “Yes”
prop ‘CreativeCommons_LicenseURL’
= “http://creativecommons.org/licenses/by-sa/2.5/”
prop ‘CreativeCommons_Jurisdiction’
= “”
Jason Kivlighn
June 12th, 2007 at 23:41 +0000
I know that in GStreamer there are tags GST_TAG_LICENSE and GST_TAG_COPYRIGHT. For MP3s, GST_TAG_COPYRIGHT is taken from the TCOP tag. I see that GST_TAG_LICENSE isn’t used, though I take it that it should be the value of the WCOP tag. It seems trivial to make this update. I’ll look into GStreamer for more info and maybe go for a patch, unless somebody has more info.
Jason Kivlighn
June 13th, 2007 at 05:53 +0000
GStreamer’s id3v2 tag parser doesn’t handle the URL link frames (W000 through WZZZ, which include the license URL frame, WCOP). Here’s a patch to fix that:
http://bugzilla.gnome.org/show_bug.cgi?id=447000
I’ll see how that goes.
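To show what reading WCOP involves, here is a small Python sketch of pulling URL link frames out of an ID3v2.3 or ID3v2.4 tag. This is not the GStreamer patch, which is C, and the function names are invented for the example. It relies on two facts from the ID3v2 spec: URL link frames hold a bare ISO-8859-1 URL with no text-encoding byte, and WXXX has a different layout, so it is skipped. Tags with an extended header or unsynchronisation are not handled.

```python
def _syncsafe(raw):
    """Decode a 4-byte syncsafe integer (7 significant bits per byte)."""
    return (raw[0] << 21) | (raw[1] << 14) | (raw[2] << 7) | raw[3]


def id3v2_url_frames(data):
    """Return {frame_id: url} for the URL link frames at the start of data."""
    if len(data) < 10 or data[:3] != b"ID3":
        return {}
    major, flags = data[3], data[5]
    # Sketch only: bail out on versions or header flags not handled here
    if major not in (3, 4) or flags & 0xC0:
        return {}
    tag_end = min(10 + _syncsafe(data[6:10]), len(data))
    pos = 10
    urls = {}
    while pos + 10 <= tag_end:
        frame_id = data[pos:pos + 4]
        if frame_id == b"\x00\x00\x00\x00":
            break  # reached padding
        raw_size = data[pos + 4:pos + 8]
        # v2.4 frame sizes are syncsafe; v2.3 sizes are plain big-endian
        size = _syncsafe(raw_size) if major == 4 else int.from_bytes(raw_size, "big")
        body = data[pos + 10:pos + 10 + size]
        pos += 10 + size
        if frame_id[:1] == b"W" and frame_id != b"WXXX":
            url = body.split(b"\x00", 1)[0].decode("latin-1")
            urls[frame_id.decode("latin-1")] = url
    return urls
```

With this, `id3v2_url_frames(open(path, "rb").read()).get("WCOP")` gives the license URL when the file has one.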
Jason Kivlighn
June 14th, 2007 at 20:53 +0000
I should also point out that my latest progress is tracked at:
http://wiki.creativecommons.org/Tracker_CC_Indexing