microformatique - a blog about microformats and “data at the edges” : from little things, big things grow

RDFa Primer 1.0

RDFa is a syntax for expressing this structured data in XHTML. The rendered, hypertext data of XHTML is reused by the RDFa markup, so that publishers don’t repeat themselves. The underlying abstract representation is RDF, which lets publishers build their own vocabulary, extend others, and evolve their vocabulary with maximal interoperability over time. The expressed structure is closely tied to the data, so that rendered data can be copied and pasted along with its relevant structure.

I’m far from an expert in RDF. To me, RDFa is an approach to getting many of the advantages of microformats, coupled with those of RDF.
Microformats and RDF mecha-transformer? Or bastard love child of both?

I really don’t know.

Technorati Tags: microformats, RDF, RDFa, semweb

Tim O’Reilly on the “small s semantic web”

I do hope Tim O’Reilly is aware of the “lowercase semantic web”, and the work that those in the microformats community have been doing. His recent post is an interesting survey of the kinds of informal semantic web projects which are between them implementing a kind of semantic web (what might be called a “more semantic web”).

Sadly, there is no reference to microformats, which I’d estimate now appear on tens of millions of pages, but there are references to projects like “freebase” (may I call it vapour? For now, for all the excitement among the usual suspects, all it seems one can do is sign up for an invite, and the FAQ link doesn’t even work - c’mon people).

But the overall approach I think is right. We have a working web now. There is a tonne of data on it - an unimaginable amount really. Let’s see what we can do with that, and how we can incrementally improve that.

Technorati Tags: semantic web, semweb, microformats

Semantics in HTML Part III - Towards a semantic web

Part I - Traditional HTML Semantics
Part II - Standardizing Vocabularies
Part III - Directions in HTML Semantics (this article)

Fundamental to any science or engineering discipline [is] a common vocabulary for expressing its concepts, and a language for relating them together

Brad Appleton

The World Wide Web is a simple thing really. It is, at the bottom, HTTP and HTML. Throw in some image formats (delivered via HTTP), CSS for styling (delivered via HTTP), Javascript for interaction design, and that’s more or less it.

The jewel in the crown, at least from the perspective of content developers, is HTML. HTML is used to markup, and so convey, the vast majority of the web’s information.

Today’s web is a web of HTML, and it is difficult to imagine a web much different any time soon. An almost unimaginable amount of time and money has been invested by individuals, organizations and companies in acquiring the skills and technology to publish and consume HTML. Even incremental changes to this landscape, such as the development of XHTML (a semantic and syntactic near-equivalent of HTML, grounded in XML rather than SGML) and of CSS for developing web page presentation, have taken on the order of a decade to get any kind of significant adoption by even professional web developers.

In this context, while not impossible (disruptive technologies do replace well entrenched, perfectly serviceable ones - think of air traffic over rail or road for long distance travel, CDs over LPs, digital music formats and players over CDs, and so on), it is difficult to imagine the web of HTML being replaced by “a better web” within any reasonable time frame (this alone should make Adobe pause in considering the role Apollo will play). And this goes equally well for “The Semantic Web”, an ambitious project to recast the web as a web of data, largely for machine consumption, using new markup formats centered on the Resource Description Framework (RDF), “a common framework for expressing … information so it can be exchanged between applications without loss of meaning.”

Whether or not such a project comes to pass, the web of HTML will be with us for a long time yet. Various projects are under way to further mature HTML - both within the W3C (in December 2006, the W3C announced a new HTML working group (chartered in March of 2007) to oversee the future development of HTML), as well as through the work of ad hoc groups like the WhatWG, whose work on HTML5 maps out an alternative, though not necessarily competing vision for the future of HTML.

What all those with more than a passing interest in the future of HTML (and indeed a great many professional developers today) devote not insignificant attention to is the question of semantics in HTML. In the first of these articles, I paid attention to the nature of the semantics in today’s HTML (HTML 4.01, XHTML 1+) - the kinds of “built in” semantics HTML provides, through elements and attributes. In the second, I looked at current processes and mechanisms for extending the semantics of today’s HTML - in particular, the one truly successful project for doing so to date, microformats, and foreshadowed the subject of this, the final installment in the series - future developments in the semantics of HTML.

Future Semantics

So far we’ve seen that there are three sources of semantics in HTML

The built in semantics of HTML itself - its elements and attributes
The ad hoc semantics of developers inventing their own vocabularies, which is typically “injected” into HTML largely using the class and id attributes of HTML
Semi structured approaches to developing richer semantics, in particular the microformats project.

It would make sense that future semantic developments of HTML would come from these or similar sources or approaches. In this article I want to focus on each in turn, and consider the benefits and shortcomings of each approach to developing richer semantics for HTML.

I’ll begin with the second approach, “bottom up” semantics, which I considered in the first article, and have paid no small amount of attention to with previous research. In short, despite the success of bottom up ontologies, what Thomas Vander Wal terms “folksonomies”, where common vocabularies for describing things emerge through ad hoc usage (well known examples are Flickr’s tags, and Del.icio.us), vocabularies for describing common data on the web simply haven’t emerged. This is not just an assertion, as my previous research indicates. It should in fact not come as a surprise, because class values, for example, are “hidden”, while tags at del.icio.us or flickr, by comparison are visible giving rise to a positive feedback loop - when I as a user see a tag for a particular kind of thing, I am more likely to use it myself for similar kinds of things. Over time, particular terms appear to “win”, and become the conventionally accepted tag for that kind of thing. With class and id values on the other hand, we simply don’t get the network effect to anoint particular words as the names of things.

In short, I’d argue that looking to emergent semantics for future vocabularies for HTML is a futile exercise - at least without sensible scaffolding to help the process. I’ll turn to just what that scaffolding might be shortly.
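To make the "hidden" nature of class values a little more concrete, here is a minimal sketch of how one might tally the class values used across a handful of pages, and so see whether any common vocabulary is emerging. It is only an illustration, not the method used in the research mentioned above; the names ClassCounter and count_class_values are hypothetical, and the pages are assumed to be already fetched as strings.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts every value that appears in a class attribute."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # A class attribute may hold several space-separated values.
            if name == "class" and value:
                self.counts.update(value.split())


def count_class_values(pages):
    """Return a Counter of class values over an iterable of HTML strings."""
    totals = Counter()
    for html in pages:
        parser = ClassCounter()
        parser.feed(html)
        parser.close()
        totals.update(parser.counts)
    return totals


# Hypothetical sample pages, standing in for real crawled documents.
PAGES = [
    '<div class="header nav"><p class="intro">Hi</p></div>',
    '<div class="header"><span class="Header">Title</span></div>',
]
```

Calling count_class_values(PAGES).most_common() lists the values from most to least used. Note that the sketch treats "header" and "Header" as different names, which is exactly the kind of drift a visible, shared tag cloud tends to discourage.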

The primary source of semantics in HTML is of course that built into HTML itself - the elements and attributes of HTML. It would seem to be a reasonable argument then that the best source of developments in HTML semantics should come through the language itself. If observation leads us to the conclusion that certain constructs are particularly common on the web - whether they be structural ones like document sections, data constructs like addresses, or rhetorical constructs like irony, then it makes sense to include these very common constructs in HTML using the raw materials of HTML itself.

This is certainly one approach HTML5 takes to developing the semantics of that language. HTML5 proposes new elements such as section, nav and footer. HTML5 also proposes other mechanisms for extending the semantics of HTML, in particular a mechanism for anointing particular values for the class attribute. I’ll turn to that in the final section of this article.

It’s not clear at this stage whether this is a direction the new HTML working group at the W3C will take in developing future versions of HTML. It may be particularly challenging in the context of backwards compatibility. But certainly the XHTML2 project took this approach as one mechanism for extending HTML’s semantics (however, it may well be argued that in fact XHTML2 is no more HTML in any meaningful sense than say Docbook is). Similarly to HTML5, XHTML2 introduces new elements, such as section, and new attributes, one in particular the role attribute.

Interestingly, XHTML effectively introduces no new elements or attributes to the language (with the exception of Ruby in XHTML 1.1). We’ll return to why that might be in a moment.

Given that two major projects to overhaul HTML, both within and without the W3C, have embraced the mechanism of adding elements and attributes to the language in order to extend HTML’s semantics, the initial conclusion one might draw is that this would be the best possible model for extending the semantics of HTML. After all, some very smart, hardworking people with a lot of theoretical and practical experience in markup languages, parsers, browsers, and so on have come to this conclusion.

Respectfully, I’d argue that it is a really bad mechanism for doing so.

It both goes too far, and not nearly far enough, both breaking hundreds of millions of installed browsers, many of which will simply never be upgraded, while actually not solving the problem at all. Because the problem is not that the semantics of HTML are impoverished and need enriching (that is certainly a significant problem), the problem is that there is no mechanism for enriching the semantics of HTML without redefining the language, with all the attendant problems of backwards compatibility of sites, and software. If we are going to go to the trouble of making such high impact changes (and I’m not sure it’s necessary), then by not fixing this problem, we simply ensure that it will happen in perpetuity, each time we feel that the semantics of HTML is insufficiently rich. Which of course it will always be.

Whatever choices even the wisest, most diligent custodians of HTML make in terms of adding elements and attributes to the language, there will always be the need for more. The proliferation of XML based languages, or the very broad use of class and id values as evidenced by my research of late 2005, demonstrates that there is far too wide a possible set of vocabularies for a single language to embody using the built in mechanisms of elements and attributes alone. What HTML needs is a mechanism for extending the semantics of the language without changing the language itself. In the last part of this article, I want to consider what these mechanisms may be.

Language is a virus

HTML is a language, but not quite in the sense that English or French or Swahili are. Languages whose primary purpose is to communicate with machines - which I’d argue HTML is - require stability, to the point of rigidity, unambiguity (because software is very bad at deducing meaning from language), and constrained semantics. Programming languages, like C, or Fortran, tend to have very long half lives, their syntax and semantics changing very slowly, if at all, and almost always with a strong emphasis on backwards compatibility. HTML, though a language for communicating primarily with machines, is a curious hybrid - because the purpose of this human to machine communication is, at least in part, for those machines then to communicate to other humans. As a consequence, natural language semantics are embedded far deeper into HTML than they typically are into what were once termed 3GLs (or third generation programming languages). The problem is that natural language semantics are far more fluid (terms emerge rapidly, over a period of years, or even less), and far broader than the semantics required in order to communicate directly with hardware.

This presents the particular problem I outlined a moment ago, of the project to enrich HTML by innovations within the language being both too broad and too narrow. There’s little doubt that the further development of the semantic capacity of HTML is fundamentally important, but the real challenge is how to do it.

Perhaps the use of the term “semantic capacity” in the previous paragraph, as well of course my significant investment in microformats, and the previous articles in this series all point in the direction of how I would suggest projects to extend the semantics of HTML should go.

Just as CSS separated the presentation of a web page from its structure, content or semantics (various people will argue which of those three is the more accurate), HTML needs a mechanism to separate the semantics of a document from its structure. In fact, it largely already has several such mechanisms, which is not to say that no innovation is required, but rather, more attention needs to be paid to these existing, widely used, and more than a little successful mechanisms.

Before I continue, I need to make clear that I don’t think this is an argument in favor of RDF, inline RDF, or a non HTML mechanism. In this sense, it’s different to CSS, where a new language was developed to separate out the appearance of a page. Rather, the argument is for making HTML essentially semantically neutral, much like div and span elements (while not simply abandoning the existing built in semantics of the language), and providing a mechanism within HTML to enable it to be “semanticked” (sorry) in the same way that it can be styled.

Which all sounds nice in principle, but what about in practice? How might that work? I’d argue HTML needs both a mechanism that is part of the language, and processes to help guide emergent vocabularies, in order to both enable and develop richer semantics.

As we’ve seen, both in this article and in the previous one in the series, HTML itself (through the class, id, rel, rev attributes in particular), HTML5 (the class attribute, with a less anarchic approach to allowing values than HTML, where within syntactic constraints, anything goes) and XHTML2 (through the class and role attributes) all provide mechanisms to extend the semantics of the language.

It remains an interesting open question whether the class, id, rel, rev and possibly other existing attributes of HTML suffice to enable the adequate extension of HTML without the need for further innovation within the language, and I hope that the new W3C HTML working group pays attention to that issue (and that the WhatWG might do so as well). At present, one very common use case that has been discovered by those working with microformats is the need for a mechanism for marking up content that is both unambiguous and standardized (for the benefit of machine communication) and also “human friendly”. A good, but far from unique, example is dates. ISO 8601, adopted by the W3C as the standard format for dates on the web, is an unambiguous but far from human-friendly date format. Humans write dates in all manner of (at times even ambiguous) ways - for example, it is unclear whether 5.12.2007 is May 12th or December 5th. The microformats project has developed the abbr design pattern for marking up such data. One problem is that it arguably stretches the semantics of the abbr element considerably. It may be that what is required is an innovation within HTML (perhaps as “simple” as explicitly enabling the use of the abbr element for precisely this kind of purpose), the development of a new attribute which may be used with some or all HTML elements, or even a whole new HTML element to solve this markup problem.
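As a rough illustration of the abbr design pattern from the consuming side, here is a minimal sketch that pulls the machine readable date out of the title attribute while keeping the human friendly text. It assumes the title holds a plain ISO 8601 calendar date (YYYY-MM-DD); the names AbbrDateParser and extract_dates, and the sample markup, are hypothetical and not part of any microformats tooling.

```python
from datetime import date
from html.parser import HTMLParser


class AbbrDateParser(HTMLParser):
    """Collects (machine_date, human_text) pairs from <abbr title="..."> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._title = None  # title of the abbr we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "abbr":
            self._title = dict(attrs).get("title")
            self._text = []

    def handle_data(self, data):
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "abbr" and self._title is not None:
            try:
                machine = date.fromisoformat(self._title)
            except ValueError:
                machine = None  # title was not a YYYY-MM-DD date; skip it
            if machine is not None:
                self.pairs.append((machine, "".join(self._text)))
            self._title = None


def extract_dates(html):
    """Return a list of (datetime.date, visible text) for each dated abbr."""
    parser = AbbrDateParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Hypothetical markup: the reader sees 5.12.2007, the machine sees December 5th.
SAMPLE = '<p>Starts <abbr class="dtstart" title="2007-12-05">5.12.2007</abbr></p>'
```

Here extract_dates(SAMPLE) yields the date December 5th, 2007 paired with the text "5.12.2007", so the ambiguity in the visible form never reaches the software.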

But in addition to mechanisms, the failure of bottom up evolution to develop meaningful, widespread vocabularies for even very common markup constructs such as page headers, or navigation, indicates that processes are required to enable the development of the actual semantic content that these mechanisms of HTML will enable. In the second article in this series, I looked in considerable detail at the microformats project, in part because I have quite a bit of interest in it myself, but in particular because it is the project which has been by far the most successful innovator of semantics for the web. On the order of tens of millions of pages are now published on the web using one or more microformats. The lesson I drew from that article is that microformats embrace a number of principles and practices which seem to have underpinned their success.

Microformats: solve a specific problem; start as simple as possible; design for humans first, machines second; reuse building blocks from widely adopted standards; modularity / embeddability; enable and encourage decentralized development, content, services

An important question which remains is, is the microformats project alone sufficient to provide all the required semantics for the web? Partly this is a question of whether the processes scale. But it is also a sociological question as to whether there is “one true” process which will give rise to all possible outcomes within a space in the most efficient manner. I also argued in the previous article that microformats to date have concerned themselves with only two of at least four distinct categories of semantics. Of these four categories I identified by analyzing HTML in the first two articles, structural, content/data, rhetorical, and relational, microformats concern themselves with content/data and relational semantics. This may simply be accidental, or reflect the interests of the people who initially worked on microformats, as well as those who were then attracted to the project.

All of these questions lead me to believe that the microformats project alone will not provide a platform, organization, or specific process for developing all the required semantics of the HTML web. Nor indeed would those associated with that project even argue that this is a goal or ambition of the project. One of the successful features of the project is its focus, as clearly stated: “microformats are not … a panacea for all taxonomies, ontologies, and other such abstractions”.

But as a successful project focussed on extending the semantics of HTML, it serves as an excellent model for the kind of process which can be successful.

So what are the key lessons any project, whether internal to an organization to help standardize internally used semantics, all the way through loose affiliations of interested people within a domain, to much more structured organizations like WhatWG, and all the way to the gold standard, the W3C itself, can learn from the success (and teething pains) of the microformats project, when it comes to the process of developing new or richer semantics for the web?

It seems to me that three aspects of the microformats process are central.

They solve focussed, existing, real world problems, not theoretical ones, by building on existing work, whether on the web, or in other related fields.
They are open, and take on board the input of anyone who wishes to get involved in a specific format, with the minimum of fuss - it’s as simple as joining a mailing list.
They are iterative - and so help enable emergent consensus, not unlike the IETF’s “rough consensus and running code” model.

Where else?

So if the microformats project is unlikely to tackle in particular structural and rhetorical semantics, how else might the semantics which fall into these categories be developed and enhanced? In some ways, these two areas of semantics fall at two ends of a spectrum. Rhetoric has a very long scholarly and philosophical tradition, while, outside the reasonably narrow constraints of traditional publishing (from which ironically HTML has taken very little by way of its technical vocabulary), the conventions of web design are new, and emerging, and have much less by way of consensus when it comes to naming constructs. As such, in some ways the project of creating a vocabulary of common rhetorical conventions should pose no great challenge, though having it adopted may prove much more of a challenge. A far greater challenge, and one for which there would be much more immediate benefit, is in developing a vocabulary to describe even the very common structural and user interface features of web pages and sites.

This is in fact something which I have been pondering, and writing on for some time. I also know through conversations with many of our peers that it is a challenge that a great many of them (you) think it would be important to solve. For some time, I have been of the strong opinion that design patterns, and pattern languages more generally, as commonly applied to architecture, object oriented analysis and design, and other areas of computer science provide a framework for developing what Brad Appleton argues is “[f]undamental to any science or engineering discipline … a common vocabulary for expressing its concepts, and a language for relating them together”. To that end I started a project almost 18 months ago to develop a pattern language approach to solving the problem of richer semantics for HTML when it comes to web page and site structure and architecture. My involvement with microformats has meant that I’ve learned a lot about how such a project might work, and I hope to reanimate the project (at least the public aspects of it), beginning with a presentation on my ideas to the Information Architecture Summit, who have graciously extended an invitation to speak on this subject.

An earlier article of mine goes into this whole area in considerable detail and so rather than repeat it here, I’ll simply link to it. If these ideas interest you, I hope you might like to help work on that project, or if you’ve not already done so, investigate microformats.

I began this long article with the quote from Brad Appleton, and finish with it, because to me it captures the whole tangled issue of semantics on the web, that I’ve been trying to unravel a little with these articles, succinctly, and clearly. Semantics is about language. As a maturing profession, to move forward, we who develop [for] the web quite simply need this “common vocabulary for expressing its concepts, and a language for relating them together”.

Technorati Tags: semantics, patterns, microformats, HTML, CSS, RDF, Semantic Web, semweb

RDF versus microformats

A discussion which doesn’t seem to want to go away is one (probably mis)characterized as RDF versus microformats (or more broadly the “uppercase” Semantic Web versus the “lowercase” semantic web - essentially Tim Berners-Lee’s ambitious project to reconstruct the web versus today’s web with richer semantics).

I must say that I am largely convinced, as much through experience as simply rhetoric, that the path of least resistance typically characterizes technology adoption, which is a fundamental motivator for the microformats project. However, it is also true that disruptive technologies, which require considerable investment in new skills for their adopters do appear and obsolete existing technologies. This has happened on the web with CSS, which despite the considerable cost of adoption, has, among professional developers at the very least, essentially entirely replaced the use of HTML for marking up web page appearance.

Will this happen with “The Semantic Web”? I suspect at the very least it will be a very long haul and big ask, based on the experience of trying to get even professional web developers to adopt slightly different HTML practices to create valid HTML and accessible sites. The Semantic Web project is particularly dependent on the network effect of machine readable data, which means it is very dependent on widespread adoption before it is of any real value at all.

The issue has recently been vigorously discussed by Ian Davis, with comments by Tantek Çelik, and a thoughtful, detailed rejoinder by Ian.

Technorati Tags: microformats, RDF, CSS

Got Wine?

Who’d have thought those oenophiles were also some of the folks most interested in standardized data formats, and indeed microformats? Well, it comes as little surprise, with Dan Cederholm and Dan Benjamin’s microformatted outCork’d as a leading Wine 2.0 (yeah, there’s a group that calls themselves that) site.

I’ve just stumbled upon a wine formats wiki (“blending wine with the semantic web”), for discussing and developing formats for marking up wine related information online, and they seem, very sensibly, to be reusing microformats where appropriate. Good work.

Technorati Tags: microformats, wine

A world sans metadata

Via the Touchstone blog, I came across this article by R. Todd Stephens (Ph. D. no less) Life Without Metadata. Worth a look. I wonder whether all of what the author talks about actually constitutes data, or metadata (are the labels on a soup can data or metadata?).

I also note that a search at the site, DM Review, returns no hits for “microformats”.

Anyone at DM Review who finds a link in from this blog may like to contact me, and I can fix that for you.

john

What happened to the purdy pages?

I’ve just switched over to the Bartelme theme by Scott Wallick. It’s full of microformats, like hAtom, and much lighter weight. I like the IA a little better too. I’ll spruce it up a little if I get a moment, and I need to add the technorati blog reactions widget I added to the old theme.

john

Upcoming microformats presentations

Despite not getting to SxSW this year (in fact, because of other speaking engagements), I’ve got a number of presentations coming up over the next 6 weeks.

First up, at the IA Summit in Las Vegas, from 23-26 March, I’ll be doing a one day workshop with Thomas Vander Wal (it’ll be great to catch up with him again), Karen Loasby and Margaret Hanley on Designing with Structured Data (http://www.iasummit.org/2007/preconferencesession/designing_with_structured_data.html); my part of the day is a hands on microformats session. Then I’ll give a conference session, WebPatterns: design patterns in web site architecture and User Interaction (yes, I am still working away on that project).

Then, a couple of weeks later, at Web 2.0 Expo, I’ll be talking about microformats. There’s also a barcamp style conference running in parallel, which I am sure I’ll get involved with.

Technorati Tags: microformats, web2expo, iasummit, webpatterns, patterns

Yahoo Tech Developer job

Yahoo! has definitely been one of the strongest adopters of microformats, across a range of their sites. One of the very earliest was Yahoo! Tech, who use hCard and hReview extensively.

Yahoo! Tech is looking for an experienced front end developer, with microformats experience.

Yahoo! Tech offers consumers a rich environment for obtaining product data, reviews, and other information about tech gadgets such as computers, cell phones, software, games, and home audio and video equipment. If you’re a skilled backend engineer, you can be a part of this growing team.

Tech engineers work in a team dedicated to addressing functionality, performance, security, and application stability, in a highly-integrated environment that uses web services to consolidate data from numerous other Yahoo! teams. Knowledge and understanding of web services (especially the REST model), XML, and the MySQL database, along with a firm understanding of good software design and development principles, are critical factors for success.

Responsibilities include the following:

Work as part of a team to identify and resolve software and networking issues on the tech.yahoo.com site.
Translate product requirements into functional specifications and build software that meets those requirements.
Work with and coordinate activities with Tech as well as other Yahoo! teams that provide data or services to Yahoo! Tech.
Contribute to the continual improvement of our development processes
Maintain up-to-date documentation on processes and code

Minimal job qualifications:

Demonstrated software design and development experience
Development skills in PHP, CSS, HTML, Microformats, Accessibility, JavaScript
Strong knowledge of XML/XSL handling
Outstanding communication and interpersonal skills

For more information, visit http://careers.yahoo.com

Technorati Tags: microformats, jobs, yahoo

Sucks or Rocks

Sucks or Rocks is a fun site I came across the other day. It takes keywords, then uses Yahoo’s search API to find references to those words - and then tries to determine, based on context, whether each is a positive or negative reference.

Sounds a lot like the kind of thing that would be trivial to do with VoteLinks.
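To show why, here is a minimal sketch of a VoteLinks tally. VoteLinks marks a link's stance with rev="vote-for", rev="vote-against" or rev="vote-abstain", so no guessing from context is needed. Everything else here is an assumption for illustration: the names VoteLinkParser, tally_votes and verdict are hypothetical, the sample markup is invented, and this is not how Sucks or Rocks actually works.

```python
from collections import Counter
from html.parser import HTMLParser

VOTE_VALUES = {"vote-for", "vote-against", "vote-abstain"}


class VoteLinkParser(HTMLParser):
    """Collects VoteLinks votes per link target."""

    def __init__(self):
        super().__init__()
        self.votes = {}  # href -> Counter of vote values

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # rev may be missing or valueless, so fall back to an empty string.
        for value in (attributes.get("rev") or "").split():
            if href and value in VOTE_VALUES:
                self.votes.setdefault(href, Counter())[value] += 1


def tally_votes(html):
    """Return {href: Counter} for every vote link found in the HTML string."""
    parser = VoteLinkParser()
    parser.feed(html)
    parser.close()
    return parser.votes


def verdict(counts):
    """Turn one target's vote counts into a Sucks or Rocks style answer."""
    if counts["vote-for"] > counts["vote-against"]:
        return "rocks"
    if counts["vote-for"] < counts["vote-against"]:
        return "sucks"
    return "undecided"


# Hypothetical markup gathered from a few pages linking to the same target.
SAMPLE = (
    '<a rev="vote-for" href="http://example.com/a">love it</a> '
    '<a rev="vote-against" href="http://example.com/a">meh</a> '
    '<a rev="vote-for" href="http://example.com/a">yes</a> '
    '<a href="http://example.com/b">no opinion stated</a>'
)
```

With this sample, verdict(tally_votes(SAMPLE)["http://example.com/a"]) comes out as "rocks", and the plain link to the second address is simply ignored because it carries no vote.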

Technorati Tags: microformats, votelinks
