Article URLs week: Day 2

JULY 29, 2003, 9:23 am

To continue Article URLs week, as promised yesterday, I’ve found some sites serving up truly awful URLs and a few others using truly respectable ones.

Here are three sites that fail miserably:

CranstonHerald.com: F
http://news.mywebpal.com/news_tool_v2.cfm?show=localnews&pnpID=491&NewsID=474896&CategoryID=10327&on=0
Not only does this URL contain an absurd amount of query string garbage, it isn’t even hosted on the newspaper’s own domain.
derbydailyrep.com: F
http://www.derbydailyrep.com/display/inn_news/news2.txt
This article will be overwritten in 24 hours with the next day’s news2.txt. Moving articles from free circulation into paid archives after some period is one thing, but posting different articles at old URLs is a recipe for totally confusing Web readers.
NorthJersey.com: F
http://www.northjersey.com/page.php?qstr=eXJpcnk3ZjczN2Y3dnFlZUVFeXk3JmZnYmVsN2Y3dnFlZUVFeXk2NDA3OTA4JnlyaXJ5N2Y3MTdmN3ZxZWVFRXl5Mg==
This one is a little long, don’t you think? With 62 apparent possible values (digits plus upper- and lower-case letters) for each of 90 characters, this URL could uniquely specify one of about 10¹⁶¹ documents, which is far more than the estimated number of particles in the observable universe.
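
The NorthJersey.com estimate can be checked with a few lines of Python. This is only a rough sketch: the alphabet size of 62 and the length of 90 are the figures quoted above, and the variable names are made up for illustration.

```python
import math

ALPHABET_SIZE = 62   # digits plus upper- and lower-case letters, as assumed above
LENGTH = 90          # characters in the qstr value, ignoring the trailing "=="

# Exact number of distinct strings of that length over that alphabet.
combinations = ALPHABET_SIZE ** LENGTH

# Power of ten of the same magnitude: floor(90 * log10(62)).
exponent = math.floor(LENGTH * math.log10(ALPHABET_SIZE))

print(f"62^90 is roughly 10^{exponent}")
print(f"written out, it has {len(str(combinations))} decimal digits")
```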

And here are three sites worthy of emulation:

Enquirer.com: A-
http://www.enquirer.com/editions/2003/07/29/loc_wwwloc1lastcall29.html
This URL does contain some superfluous parts and abbreviates its section as loc rather than putting the article in a local news folder. However, it formats dates beautifully and — something all sites should do — is hackable to that day’s front page (but not higher).
freep.com: A
http://www.freep.com/sports/lions/ford29_20030729.htm
This URL invites hacking to the parent Lions and sports sections. The only redundant characters are the repeated day of the month. The only thing preventing this from being an A+ is that the URL does violate the principle of hierarchy by putting the date after the article’s slug. Since articles appear within days, the date should go farther to the left.
HonoluluAdvertiser.com: A
http://the.honoluluadvertiser.com/article/2003/Jul/29/bz/bz02a.html
Although I would prefer a slug rather than the 02a identifier and the word “business” rather than the abbreviation bz, this URL is wonderfully hackable, to the parent business section and to the parent date (which lists articles in all sections published that day).
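
To make “hackable” concrete, here is a minimal Python sketch of what a reader does by hand when trimming a URL: drop one path segment at a time and see where each parent leads. The function name parent_urls is hypothetical, and whether each parent actually resolves depends entirely on the site.

```python
from urllib.parse import urlsplit, urlunsplit

def parent_urls(url):
    """Return the URLs reached by trimming one path segment at a time."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    # Walk upward from the article's folder to the site root.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if not path.endswith("/"):
            path += "/"   # parents name folders, so end them with a slash
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents

for parent in parent_urls("http://www.freep.com/sports/lions/ford29_20030729.htm"):
    print(parent)
```

On a hackable site, each printed URL is a real section page; on the sites graded F above, the same exercise produces nothing useful because the article's identity lives in the query string.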

We’ll continue looking at more news sites’ article URLs Wednesday.

Comment by Jirka (ji_bo BLA BLA yahoo.com), posted July 29, 2003, 4:46 pm

Good work, Nathan. I like your last three posts (in fact, I planned to start rating sites' URLs myself, but I still don't have my weblog).

The question is: are you "eating your own dog food"? :-)

Let's take the URL of your last post: http://www.ashbykuhlman.net/blog/2003/07/28/0847. Immediately, my understanding was that (1) you have your own site http://www.ashbykuhlman.net with weblog http://www.ashbykuhlman.net/blog/ being just a part of it, (2) "0847" is the number of your posts up to now.

Well, no.

First, URL http://www.ashbykuhlman.net/blog/ leads to the same page as http://www.ashbykuhlman.net/ so we can say the word blog is kind of redundant "garbage" here. :-) Maybe you want to have the word "blog" in all your URLs and/or maybe you plan to use URL http://www.ashbykuhlman.net/ for something more general in the future. However, there are plenty of blogs without the word blog in their URLs and they're OK (like scripting.com). So although to say that the word "blog" in your URLs is garbage is probably too strong, having two access URLs for the same resource is definitely confusing.

(And what's more confusing is that URL http://www.ashbykuhlman.net/blog - i.e. the one without the slash at the end - leads to a 404 error. I've probably never seen this - everybody lets readers skip final slashes and redirects their browsers to the appropriate URL if it exists. By the way, your other URLs like http://www.ashbykuhlman.net/blog/2003/07/29 or http://www.ashbykuhlman.net/blog/2003/07 - i.e. the ones without slashes at the end of the URL - don't return a 404 error. Kind of inconsistent...)

Second, the string "0847". When I moved from your newest post to the previous one, the number was "2227". OK, one can realize the number is the time of the post. Still, it's pretty confusing because it's not immediately clear what the number means. I already mentioned scripting.com, so here's another example of a clearer approach: http://scriptingnews.userland.com/2003/07/29#When:10:08:00AM - there's no confusion there.

Anyway, I'll keep reading your blog to learn more interesting things. :-)

Comment by Nathan Ashby-Kuhlman, posted July 29, 2003, 5:26 pm

Jirka, you raise some great points.

First, I fixed the problem where http://www.ashbykuhlman.net/blog, without the slash, brought up a 404 error. That was a configuration mistake I hadn’t intended.

I agree http://www.ashbykuhlman.net/blog/ is redundant with http://www.ashbykuhlman.net. Yes, the point of the “blog” part of the URL was to allow uses of the site other than just blogging (although I’m not really doing any of that). As opposed to personal sites, I think online news sites have fewer purposes other than publishing articles, but you’re still right that the “blog” part is “garbage” by my own standard. Maybe I’ll remove it.

I also am growing to dislike the four-digit timestamps you originally thought were ID numbers. I’ve been stating preferences the past few days for using slugs/words to identify news sites’ articles rather than long numbers, and here on my own site I am strongly considering switching to URLs like this: http://simon.incutio.com/archive/2003/07/28/phpXpath.

Comment by Steven Jarvis, posted July 29, 2003, 9:56 pm

Great series, Nathan! I've got a devil's advocate question for you: why do URLs need to be hackable? My wife (who is remarkably non-websavvy) would never in a thousand years think about hacking an URL. I'd say the same is true for at least 90% (and probably much higher than that) of the audience of news websites. *I* like hackable URLs, and I agree in general that they should be hierarchical, if only because I like at least the appearance (such as that given by liberal use of mod_rewrite) of a well-organized site. Isn't hacking an URL really just a fall-back point when the site's navigation fails you?

And I promise this question isn't prompted by the guilt at the state of the URLs at my work site (i.e., http://www.nwanews.com/times/story_news.php?storyid=108869 where the storyid is meaningless even to me). Really. ;)

Comment by Nathan Ashby-Kuhlman, posted July 30, 2003, 3:38 am

Steven, that’s an important question, and I think there are two ways to answer it.

First, the practical answer: What’s wrong with designing for the 10 percent who know the trick? All print newspapers use page numbers, just as all news sites use URLs, but that doesn’t mean all print readers use the front-page “index” to find out what page number editorials or comics are on today. Some people (like me) just prefer to browse rather than going directly to something specific. But the print newspaper keeps the direct navigation available for those who find it useful.

Or consider phone numbers: you can generally still pinpoint a landline to a specific town or neighborhood using its area code and exchange. Even if few people use the organizational trick often, the organization is still superior to randomly assigned numbers. As long as hackable URLs that serve Web-savvy readers do not somehow interfere with serving less Web-savvy readers, it benefits the greater good to use them. There’s also something to be said for evangelism. The more sites use hackable URLs, the more Web readers might catch on to trying them. I do see a day coming when news is delivered primarily online, and how limiting the medium would be if much of the audience still only knew its “beginner” features!

The second answer is more philosophical. Hacking a URL shouldn’t just be a fall-back point to the site’s navigation, but an ever-consistent reflection of the site’s navigational hierarchy. The point, then, is not whether anyone actually does hack URLs but whether they make sense to hack. For example, the URL of my work site’s baseball section (http://www.tcpalm.com/tcp/baseball) doesn’t tell me it’s a child of the sports section (http://www.tcpalm.com/tcp/sports/). The fact that the URL is not hackable is really just a clue to a confusing (CMS-imposed) navigational hierarchy.

Comment by Steven, posted July 31, 2003, 9:51 am

Nathan,

As to your first answer, there is nothing wrong with designing URLs for the 10% of us who hack them as a means of navigation (I know *I* certainly appreciate it), and having alternate means of accessing a site's content is almost always a good thing.

I think the end of your second answer goes a short way toward answering the question of why most news sites have non-hackable URLs: the limitations of the various CMSes that power these sites. Whether home-grown or commercial, most do not produce hackable URLs, and cost (especially for the commercially available CMSes) is no indication where human-readable URLs are concerned.

Administrators of those news sites who start to think about the value of human-readable (and -hackable) URLs might be able to positively influence the vendors who create those CMSes (in the case of commercial CMSes) or get their own staff to work on modifying an in-house CMS (for those who have custom CMSes) to create such URLs. However, I think most commercial CMS vendors have a long list of other problems that would be better addressed, such as producing valid and accessible (X)HTML. That being said, good URLs are an important part of a whole news site.

Comment by Julie, posted August 1, 2003, 6:47 pm

Right on, Nathan. And I would add to your practical and philosophical reasons a psychological one, less pressing but not altogether insignificant:

Your URLs should be neat and orderly because they make an impression about your organization on those who actively view or use them and those who receive them in e-mails or IM. It's the same reason you show up for a job interview in a suit and tie.

http://www.cnn.com/2003/US/07/30/airline.warning/index.html

Impression: Clear. Organized. They really have their act together.

http://torontostar.com/NASApp/cs/ContentServer?GXHC_gx_session_id_=bdcda2ebdf22a959&pagename=thestar/Layout/Article_Type1&c=Article&cid=1059689420236&call_pageid=968332188492&col=968793972154

Impression: It's a miracle they can find their own stories. They just showed up at the interview wearing mustard-stained T-shirts and wrinkled cargo pants.

Unfortunately, I suspect Steven's point is a good one. Since most of the garbage is tied to poorly designed CMSes that often have even bigger issues, for most afflicted sites the problem is unlikely to go away any time soon. Then again, admitting they have a problem is half the battle ;)

Comment by Steven, posted August 4, 2003, 4:07 pm

Julie, I *absolutely* agree with you about the impression an URL gives. I have the same issues with poor grammar and spelling. All show whether the creator has paid attention to detail or not. As for the CMS issue, yeah, I think clean, useful URLs rank lower on the scale of importance than valid and semantic code, though I also believe that, unfortunately, it's very difficult to win that first half of the battle. ;)

Comment by David Blomquist, posted August 6, 2003, 10:13 pm

Nathan, thanks for the A grade on freep.com's naming convention. Your hierarchy makes sense, but there is a reason why we attach the date to the file name as we do: Believe it or not, freep.com is still produced with Pantheon Builder, and those of you who remember that beloved product know that it doesn't natively support an environment in which destination folders rotate daily (e.g. /2003/07/01/sports/lions). So attaching the date onto the file name was the best workaround we could concoct.

Comment by Nathan Ashby-Kuhlman, posted August 7, 2003, 8:58 am

David, it’s interesting to me how often Pantheon Builder has come up in the comments on this series. I can only imagine how full some of the folders on your Web server are by now!

If you wanted to, there might be a further workaround you could do. It looks like your server runs Apache, and if so, you could use mod_rewrite to make your public URLs independent of the way the files are stored internally. For example, a configuration line with a regular expression like this:
RewriteRule ^sports/lions/([0-9]{4})/([0-9]{2})/(0?([0-9]{1,2}))/([A-Za-z0-9]+)$ sports/lions/$5$4_$1$2$3.htm
would return the file located internally at
/sports/lions/lnote7_20030807.htm
whenever someone accessed the URL
/sports/lions/2003/08/07/lnote
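
For anyone who wants to try the mapping without touching a server, here is a rough Python sketch of the same substitution. It only imitates the regular expression; it is not how Apache evaluates rules, and the names PUBLIC_URL and internal_path are made up for illustration.

```python
import re

# Pattern modeled on the rule above; the trailing $ anchors the end of the path.
PUBLIC_URL = re.compile(
    r"^sports/lions/([0-9]{4})/([0-9]{2})/(0?([0-9]{1,2}))/([A-Za-z0-9]+)$"
)

def internal_path(public_path):
    """Map a date-first public path to the slug+day_date file name, or None."""
    m = PUBLIC_URL.match(public_path)
    if m is None:
        return None
    # Group 3 is the day as written (e.g. "07"); group 4 is the day without
    # its leading zero (e.g. "7"), which the file names append to the slug.
    year, month, day_padded, day, slug = m.groups()
    return f"sports/lions/{slug}{day}_{year}{month}{day_padded}.htm"

print(internal_path("sports/lions/2003/08/07/lnote"))
print(internal_path("sports/lions/2003/07/29/ford"))
```

The second call reproduces the file name of the Lions article graded in the post, which suggests the pattern fits the existing naming convention at least for these two cases.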

Comment by David Blomquist, posted August 8, 2003, 11:43 pm

Yes, indeed you could do that, Nathan -- but not with Pantheon Builder, because Builder can only produce indexes using the actual file names (well, there are some exceptions, but it lacks regular expressions, so they're not really worth considering). So you'd have to write a widget to massage the file names going out as well as coming in. And frankly, I'm not sure it's worth the work.

Why? Well, I don't see much evidence in the user logs of people trying to hack URLs this way. You'd think it would be useful, but -- at least in the Free Press experience, and at a smaller site where I worked before coming to Detroit -- the vast majority of users just don't drill topically through the site. (The singular exception is sports, and there, the traditional reverse chron index seems to do the job.)

This is why we devote virtually all of our overnight editorial production time to the home page and main sports index, and why we archive these pages as "back issues." They are like the display windows at Marshall Field or Macy's, and every bit as important.

I'm not dissing a logical URL structure -- there are very good arguments for it, and you make them. I'm just saying that on my list of priorities, it isn't the first place I'm inclined to throw very scarce programming time.

Comment by David Marsh, posted August 11, 2003, 12:38 am

I have a few questions regarding creating permanent URLs.
By placing documents in sections like "http://domain.com/products/memory" doesn't it restrict the document from being associated with another section or product in this case?
By placing documents in a date hierarchy "http://domain.com/archives/2003/08/10" doesn't it restrict the document to a particular date? What happens if the document is updated with new information? The URL does not then indicate that the content has some new fresh updates to it. It may be considered old information and may be harder to find if searching for documents by hacking the URL.

Why not give each article/document/blog entry a unique title as the only identifying feature of the URL? Then it allows the site to change its hierarchy and taxonomy without affecting the URL. Documents can be re-classified or updated without creating any confusion by placing any structure or hierarchy in the URL.
Take Google as an example. When I am looking for information I am never thinking of a particular website or the URL hierarchy that may be used for the document I am interested in. I type some keywords and select a document I am interested in.
I agree the URL should be readable but it has no need for hierarchy in it. Having a hierarchy in the URL might give me some clues as to how the document was categorised at the time it was published but it may not be relevant anymore if the document has been reclassified or if the document has been updated or amended.
Keep it simple. Give it a unique title which I think should be a string from 1-50 characters.
I would love to hear any comments on whether I am on the right track here or not or some good reasons for creating hierarchies and how to manage these hierarchies in URLs.

Comment by Nathan Ashby-Kuhlman, posted August 11, 2003, 1:05 am

David (Marsh), maybe I can clarify things by saying that I’m not trying to propose a URL scheme for any kind of content on any kind of site. I am trying to propose a URL scheme for news articles on news sites. The key part of your argument as I read it is that having hierarchy in URLs can become very confusing later on if the document is reclassified or updated or amended. But on news sites, that almost never happens! News articles are published one day and maybe follow-ups are published the next. Sometimes articles are updated or revised within a single day, as more information about a news event gradually emerges, but news organizations almost never go back and alter older coverage (and as anyone who’s read “1984” knows, well they shouldn’t).

As long as original documents don’t get modified, whatever hierarchy they get put into at the time of their creation can stand for their entire lifetime without reorganization. So that’s why I think the hierarchy — particularly including the publication date — would work well on news sites. In particular, creating permanently unique titles on news sites would be almost impossible. For example, the news coverage this past week would have used up all possible permutations of “Arnold Schwarzenegger” “California” and “recall” very quickly. For non-news sites, though, I think you make a good point.
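
As a small illustration of why the date helps, here is a hedged Python sketch in which a slug only has to be unique within its publication day. The function name article_path, the slug “recall” and the two dates are hypothetical examples, not any site's actual scheme.

```python
from datetime import date

def article_path(published, slug):
    """Build a date-first path; the slug only has to be unique within its day."""
    return f"/{published:%Y/%m/%d}/{slug}"

# The same short slug can be reused on different days without colliding.
first = article_path(date(2003, 8, 7), "recall")
later = article_path(date(2003, 8, 11), "recall")
print(first)
print(later)
```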

Comment by David Marsh, posted August 12, 2003, 12:13 am

I agree with your comments on creating a URL scheme for news sites and your proposed format. It is exactly what I would use and will use in the future. Thanks.

I was hoping to get some feedback on whether a hierarchy approach can work at all in terms of maintaining permanent URLs and how to maintain reclassified data that may now be tied to a different URL if the URL for the content were to be derived from the taxonomy/hierarchy of the content.

I'm just not sure. On one hand showing a hierarchy in the URL for content is a great way to give a person some feedback on their current position within the site hierarchy but on the other it ties that content to particular hierarchy forever.

My proposed flat structure with just a title seems a little inadequate but I can't see a good compromise.
