NATHAN ASHBY-KUHLMAN > Blog Entry

If only reporters could add semantic markup

Adrian Holovaty urges news Web sites to follow Mark Pilgrim’s example and create an archive of citations. It would let readers get a list of every story from your news organization that quotes a certain person or organization, and readers could also see which people or organizations are quoted most or least.

Doing something like this properly depends on using HTML’s cite tag to delimit the name of every person or organization cited in every single story. Then a computer can generate a citation index automatically.
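
To make that concrete, here is a minimal sketch of what such a script might look like, using only Python's standard library. It assumes each story is available as an HTML string keyed by some story ID; the names `CiteCollector` and `build_citation_index` are made up for illustration and are not from Adrian's or Mark's sites.

```python
from collections import defaultdict
from html.parser import HTMLParser


class CiteCollector(HTMLParser):
    """Collects the text inside every <cite> element of one story."""

    def __init__(self):
        super().__init__()
        self.names = []      # cited names found so far, in document order
        self._depth = 0      # how many <cite> elements we are currently inside
        self._buffer = []    # text fragments of the cite being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <CITE> also matches.
        if tag == "cite":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "cite" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace so "Jane  Doe" equals "Jane Doe".
                name = " ".join("".join(self._buffer).split())
                if name:
                    self.names.append(name)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def build_citation_index(stories):
    """Map each cited name to the sorted list of story IDs that cite it.

    `stories` is assumed to be a dict of {story_id: html_string}.
    """
    index = defaultdict(set)
    for story_id, html in stories.items():
        collector = CiteCollector()
        collector.feed(html)
        collector.close()
        for name in collector.names:
            index[name].add(story_id)
    return {name: sorted(ids) for name, ids in index.items()}
```

Ranking sources from most to least quoted is then one more line, something like `sorted(index, key=lambda name: len(index[name]), reverse=True)`. The hard part is not the script. It is getting the cite tags into the stories in the first place.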

There’s just one big problem. This isn’t something online producers at any news organization have enough time to do. Perhaps I’m biased, since I’m speaking as an overworked online producer, but when some online news sites can’t even spare the time to fix paragraph breaks, how would they ever be able to add cite tags?

The only answer I see is giving control of semantic markup like this to the people who have time to do it well on every story, day in and day out: reporters. The reason envelope-pushing bloggers like Mark Pilgrim have time to add these tags is that they can add them as they write. There’s no post-processing required, because the semantic markup goes straight into HTML that is destined for the Web.

At news organizations that repurpose print content, however, markup like this would have to be done as a post-processing step by online personnel, because print production systems don’t understand it. Where I work, once a typical front-page story has been written and edited, it is sent from a Harris database into a flat file for Quark. Once the print page is done, an online producer sends the finished text through a converter into a second flat file which is FTPed to a staging server that moves it into our Vignette-based Web content management system’s database. Then, once it’s approved, our public Web server can display it as HTML. That’s four file-format conversions from what a reporter wrote to the HTML readers’ Web browsers receive.

The point is that for news Web sites to adopt markup like the cite tag that adds meaning to their Web documents, the markup needs to be added by reporters and editors and flow harmlessly through the print edition production system before arriving on the Web. So until the day when newspapers’ dead-tree editions are of secondary importance to online editions and print staff are the ones doing the format conversions, I have a modest proposal: Publishers need to scrap print production systems and replace them with ones that handle this kind of markup, ignoring but preserving it until it arrives on the Web. In fact, this is exactly the kind of thing the XML-based News Industry Text Format (NITF) is designed to do: encode news articles so they can be displayed in print or online.
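
As a rough illustration of "ignoring but preserving," here is a small Python sketch in which a story is stored once with the reporter's cite tags in place, and each output decides what to do with them. This is not NITF and not how any real production system works; `render_for_print` and `render_for_web` are hypothetical names, and the single-paragraph story is a simplifying assumption.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Keeps a story's text and drops every tag around it."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def render_for_print(story_markup):
    """Print output: ignore the semantic tags, keep the words."""
    parser = _TextOnly()
    parser.feed(story_markup)
    parser.close()
    return "".join(parser.parts)


def render_for_web(story_markup):
    """Web output: pass the reporter's markup through untouched."""
    return "<p>" + story_markup + "</p>"


# The story is stored once, exactly as the reporter marked it up.
story = "The budget will pass, <cite>Jane Doe</cite> said."
print(render_for_print(story))
print(render_for_web(story))
```

The stored copy never loses its tags. Only the print rendering discards them, and only at the last step, so nobody downstream has to add them back.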

But until all of the systems at work understand NITF, online producers like me are stuck fixing the subheads that the computer assumed were the main headlines, hunting down photos because the print systems do not automatically record which pictures go with a story, and categorizing stories into the right sections. We’re too busy to add cite tags. That’s something the reporters should do, if only they could. And I might just have time to write the script to build that citation index — if I didn’t have to do the half of my job that computers should be doing. If only they could.

This page last modified on Wednesday, January 11, 2006 at 11:12 pm