I attended the first annual ACAP conference this week (ACAP stands for Automated Content Access Protocol; their website is http://www.the-acap.org/). They are defining a standard mechanism that lets publishers communicate permissions information about their content in a form that can be automatically recognized and interpreted by any automated process that would like to use it.
Currently the only mechanism available to publishers is the Robots Exclusion Protocol (REP; some people just call it robots.txt, after the file placed in the root directory of a website that defines which files should or shouldn’t be visited by crawlers). There is also the Robots META tag, which allows HTML authors to indicate to visiting robots whether a specific document may be indexed or its links followed.
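For readers who haven’t seen them, here is what the two mechanisms look like in practice (the paths and site below are illustrative, not from any real publisher):

```
# robots.txt - served from the site root, e.g. http://example.com/robots.txt
User-agent: *           # these rules apply to all crawlers
Disallow: /private/     # do not fetch anything under /private/

User-agent: Googlebot   # rules for one specific crawler
Disallow: /drafts/
```

And the META tag version, placed in the `<head>` of an individual page:

```html
<meta name="robots" content="noindex, nofollow">
```

Note how coarse both are: the first works at the file and directory level, the second at the whole-page level - which is exactly the granularity problem ACAP is trying to address.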
These mechanisms don’t give publishers very much control over how their content is used, and they would like much finer-grained control - both at the content level (not just whole files) and at the usage level (how the content can be used). A while back Yahoo announced a “robots-nocontent” tag that allows marking certain sections of a page as unrelated to its main content, so that search engines should ignore them. It is an interesting addition to REP, but it really doesn’t do much to address publishers’ concerns - it is focused on control over indexing, not delivery (e.g. caching, summarization), and it hasn’t been picked up by other search engines.
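For concreteness, Yahoo’s tag is applied as a class attribute on the markup to be excluded (the surrounding HTML here is just a sketch):

```html
<!-- Yahoo's crawler skips this block when deciding what the page is about -->
<div class="robots-nocontent">
  navigation links, ads, legal boilerplate, and so on
</div>
```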
So what happened at the conference? Attendance was good, with lots of people from various publishers - heavily European, but not exclusively. Google, Yahoo and MSN sent representatives, but it was clear that this was a conference led by publishers. There was a certain amount of “Google/Yahoo/MSN” bashing going on, mostly because they haven’t been willing to sign up as members of ACAP.
It is clear that the lack of control publishers currently have over their content on the Internet is a problem, and it is preventing them from putting more of their content online (books, periodicals etc.). It is also clear that the two groups (search engines and publishers) have very different mind-sets about what is important. Publishers want to control all aspects of the delivery and use of their content, since that content is what they make money from. They worry that aggregators give away the store: by creating their own summaries and caching pages, search engines lower the value of the actual content. Search engines want access to as much content as possible, so they can index it, summarize it and make things as easy as possible to find on the web. They make money from people accessing this aggregated information.
Then of course there is the whole issue of copyright - publishers’ entire business model is based on the fact that content owners have a prescribed set of rights that give them control over their content, its delivery and its usage. Search engines believe that copyright is a bit antiquated, and that by helping people find content they are doing the publishers a service. I think they are both right - but they come from such different worlds and mind-sets that they have trouble understanding each other, or addressing each other’s concerns. They need to recognize that they are in a symbiotic relationship and figure out how to work together.
It also seems to me that they may be worrying about “yesterday’s” technology - with the advent of tools for extracting parts of a page, ad-enabled syndication and mashups, the issues will only become thornier, and any solution will need to take those emerging capabilities into account too. ACAP could become very influential as a forum where the creators of these technologies and publishers sit down, understand each other and work out solutions to these issues. If ACAP can make that happen, then it will have real influence over how content is delivered on the web.