AppSwitching Diary

Sunday, April 20, 2003

Date and time formats in RSS 2.0

There are two main ways of timestamping an item in an RSS 2.0 feed. You can either use the <pubDate> tag specified in Dave Winer's RSS 2.0 specification, or you can use the <dc:date> tag from the Dublin Core schema. They have identical meanings. But they use different timestamp formats.

RSS 2.0 <pubDate> uses the date and time format defined in RFC 822, a standard for internet messages that dates from 1982 (but using four digits for the year instead of the two specified in the original, pre-Y2K document). This is a very natural format, rendering timestamps as, for example, "Sat, 19 Apr 2003 23:40:26 GMT". Dublin Core's <dc:date>, on the other hand, uses a W3C submission based on ISO 8601, the International Standard for the representation of dates and times. This format is not as user-friendly, rendering the same example as "2003-04-19T23:40:26-00:00". For some reason, PHP's strtotime() function doesn't parse it, either, so if you or your subscribers are going to want to read the timestamp using PHP, you'll save them some trouble if you use <pubDate> instead.
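To make the two formats concrete, here is a minimal sketch that renders one Unix timestamp both ways using PHP's gmdate(). The function names rss_pub_date() and dc_date() are my own inventions for illustration, and the sketch writes the zero offset as "+00:00" rather than the "-00:00" shown above.

```php
<?php
// Format one Unix timestamp both ways. Function names are illustrative only.

// RFC 822 style with a four-digit year, as used by <pubDate>.
function rss_pub_date(int $timestamp): string
{
    return gmdate('D, d M Y H:i:s', $timestamp) . ' GMT';
}

// W3C date-time profile of ISO 8601, as used by <dc:date>.
function dc_date(int $timestamp): string
{
    return gmdate('Y-m-d\TH:i:s', $timestamp) . '+00:00';
}

$when = gmmktime(23, 40, 26, 4, 19, 2003); // 19 April 2003, 23:40:26 GMT
echo rss_pub_date($when), "\n"; // Sat, 19 Apr 2003 23:40:26 GMT
echo dc_date($when), "\n";      // 2003-04-19T23:40:26+00:00
```

Both functions work in GMT throughout, which sidesteps any dependence on the server's local timezone setting.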

Whichever you use, bear in mind that the timestamp provided can be either the creation date or the availability date (ie the publication date), which might be a date in the future rather than the past. The same tags may also be used in the <channel> element of the feed. Here, they are usually taken as being when the feed was last modified, but that's not formally required. Oddly, there's no commonly accepted tag for marking 'last modified' for items where it makes sense to record this information in addition to a 'first published' timestamp (for example in a feed of glossary definitions). Which just goes to show some of the difficulties you get into when you start trying to do really meaningful semantic markup.
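On the reading side, a subscriber has to cope with whichever tag the publisher chose. The following is a rough sketch, assuming a PHP release with SimpleXML and a strtotime() that accepts ISO 8601 strings (releases since this entry was written do). item_timestamp() is a hypothetical helper, not part of any RSS library.

```php
<?php
// Sketch: read an item's timestamp from either <pubDate> or <dc:date>.
// item_timestamp() is a hypothetical helper name.
function item_timestamp(SimpleXMLElement $item): ?int
{
    // Prefer the RSS 2.0 element if it is present.
    $raw = trim((string) $item->pubDate);
    if ($raw === '') {
        // Fall back to Dublin Core, which lives in its own namespace.
        $dc  = $item->children('http://purl.org/dc/elements/1.1/');
        $raw = trim((string) $dc->date);
    }
    if ($raw === '') {
        return null; // no timestamp of either kind
    }
    $ts = strtotime($raw);
    return $ts === false ? null : $ts;
}
```

The function returns null rather than 0 when nothing usable is found, so a missing date can't be mistaken for the start of the Unix epoch.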

PS: This is the first of the occasional articles and tips I mentioned in my previous posting, although the redesign I mentioned there is still pending at the moment. It seemed like a good idea to publish this anyway, since the information doesn't seem to be easily available anywhere else on the Web. I'm glad to see that there are moves afoot to put forward RSS 2.0 as a submission to the IETF, and I hope that will also encourage the emergence of readily available best-practice guidance that people can turn to when developing practical implementations.

posted by Phil 5:40 AM (GMT) | comments | link

Friday, March 28, 2003

Upcoming changes

Keeping an occasional diary of tips and observations about the experience of developing the Loosely Coupled website has been an interesting exercise, but it's ended up becoming rather sporadic. We're currently approaching a redesign of the Loosely Coupled site, and so as part of that we've decided to cease keeping the AppSwitching Diary as a separate weblog. The existing entries will remain online with the same permalinks, but any new material will either be added to the main Loosely Coupled weblog, or else published as separate, standalone articles.

As part of the redesign, we'll be turning this page into a more useful overview of the various articles that have appeared over the past nine months, grouping them according to various subject groups that will make it easier to find items on specific topics, such as a method for Adding external content management to Blogger Pro, instructions for Publishing comments online using Xara Modules, advice on Creating password-protected directories and How to remember all your passwords.

The new presentation format will be better suited to adding new articles, which we certainly intend to do, though on a less frequent basis than in the weblog. The most popular entries have tended to be those which give advice and instructions for specific tasks, and certainly we've accumulated quite a lot of knowledge that we could pass on. One of the enabling factors in giving up the weblog format has been the in-house development of an XML-based publishing system that we can use for publishing articles, and that's one potential topic we could cover, for example.

The RSS feed for this blog will continue to carry details of those articles, so it's worth continuing to subscribe, but new material will be infrequent, so another alternative will be subscribing to a new composite feed for the entire Loosely Coupled site, which will begin shortly. We'll post a last entry here with details of the new arrangements when they're finalized.

posted by Phil 2:20 PM (GMT) | comments | link

Friday, February 07, 2003

Filtered feeds

Although I couldn't find Network World's RSS feed last week, a helpful comment from the site's executive editor Adam Gaffin quickly pointed me in the right direction.

I immediately put the feed into Loosely Coupled's news page, taking the place of InfoWorld's feed, which has not been updated since that site had a redesign a few weeks back. This seems to be an oversight by the title's designers rather than a deliberate move. Jon Udell's weblog is also part of the InfoWorld site (and also features on our news page). Today, it got a new look to bring it into line with the look-and-feel of the rest of the site. It forms part of a new section called "iDiscuss", and right there on the iDiscuss home page, there's a box highlighting InfoWorld's RSS feeds. The Web Services feed I'd been using is there at the bottom of the list, but at the time I write this, it's still showing the same stale old stories from late January. Someone still needs to crank it back into life.

I will welcome its return, because filtered feeds are so much more convenient than global site feeds. The Web's biggest problem is oversupply of unfiltered information, and sites like Loosely Coupled are only useful to their readers if they stay as focussed as possible on their chosen topic. Being able to scrape InfoWorld's ready-filtered feed of web services stories has been a big help in that respect.

Coincidentally, I got a friendly email from CNET this week in reply to a syndication enquiry I'd sent in many months ago, which I'd long since despaired of ever getting answered. "We began to promote our existing RSS feeds ... this year," it noted, complete with a pointer to the full list. It's good to see CNET encouraging takeup of its RSS feeds, but disappointing that they don't provide a way of filtering by topics of your own choice. If Network World can do this with its search engine, surely it's not beyond the wit of CNET to adapt its own much-vaunted search capabilities to the task?

Of course, they might well riposte that, if I'm so eager to filter feed results, why don't I do it myself instead of overloading their search servers? Which is a fair comment. That capability is now on the development list, as part of a shake-up of the news page to order stories by date rather than simply republishing each feed separately — which by the way will mean that it will at last conform to Dave Winer's definition of a news aggregator, although as others have noted, that's not the only definition in town. Categorization in particular is an important secondary attribute, and, as I've said before, I can see some exciting potential arising if the idea of aggregators republishing filtered aggregations catches on.
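As a rough idea of what filtering on my own side might look like, here is a sketch that merges items from several feeds, keeps those mentioning any of a list of keywords, and orders the result by date. It assumes each feed has already been parsed into an array of items with 'title', 'description' and 'timestamp' keys. That array shape and the function name filter_and_merge() are assumptions for illustration, not a description of the actual news page code.

```php
<?php
// Hypothetical helper: merge items from several feeds, keep those that
// mention any keyword, and order the result newest first.
function filter_and_merge(array $feeds, array $keywords): array
{
    $merged = [];
    foreach ($feeds as $items) {
        foreach ($items as $item) {
            $text = $item['title'] . ' ' . $item['description'];
            foreach ($keywords as $keyword) {
                // stripos() is case-insensitive; compare with !== false
                // so that a match at position 0 still counts.
                if (stripos($text, $keyword) !== false) {
                    $merged[] = $item;
                    break; // one matching keyword is enough
                }
            }
        }
    }
    // Sort by timestamp, newest first.
    usort($merged, function ($a, $b) {
        return $b['timestamp'] <=> $a['timestamp'];
    });
    return $merged;
}
```

Simple substring matching like this is crude compared with a publisher's own search engine, but it puts no load on anyone else's servers.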

posted by Phil 2:33 PM (GMT) | comments | link

Tuesday, January 28, 2003

Liberal in what you accept

After a fair bit of weekend tinkering (and weekday catching up), the Loosely Coupled glossary now embodies my homage to the maxim attributed to Internet pioneer Jon Postel, "Be liberal in what you accept, and conservative in what you send."

The glossary will accept queries by any of three routes, but will always respond with a single answer conforming to a predictable format if it has a definition to offer. If it has no corresponding definition, it responds with a standardized page that allows further searching. Each of the three query alternatives conforms to recognised standards, and all of it is designed to make it as easy and convenient as possible to use the glossary to find what you want to know.

All I need to do now, of course, having created the infrastructure, is populate the glossary with a few more definitions, at which point it will be promoted more aggressively via the main Loosely Coupled weblog and related content pages.

The three routes are:

Via Javascript bookmarklet, as described currently on the glossary highlights page. The bookmarklet code is easily editable for those who want to vary it, or to adapt it for browser platforms other than Internet Explorer. The default setup has the extra term "&df=s" which selects a short format suitable for a popup window. Omitting this (by deleting the characters +'&df=s' from the code) will result in a full page format appearing instead.

Via a simple HTTP-GET query, as for example in the following form:

When creating the form, set the name of the input text as q, and the action as http://looselycoupled.com/glossary/lookup.php, like this:
```
<form method="get"
action="http://looselycoupled.com/glossary/lookup.php">
<input type="text" size="24" name="q" value="" />
<input type="submit" value="lookup" /></form>
```
By typing the term you want to search for straight into your browser's address bar, preceded by http://looselycoupled.com/glossary/, for example: http://looselycoupled.com/glossary/UDDI or http://looselycoupled.com/glossary/loose coupling.

In all cases, the glossary is forgiving of excess trailing or leading spaces, and upper or lower case, making it easy to highlight or cut-and-paste words and phrases for looking up on the fly.

This has all been achieved thanks to the magic of PHP scripting, combined with nothing more complex than some XML files to store the definition data and Apache's Unix filesystem and 404 page-not-found error handling. The original inspiration for this approach came from Joe Gregorio's Well-Formed Web, although my actual implementation only gives the appearance of following his guidelines, as I've opted to hide the raw XML files so I can experiment with some proprietary elements away from public view.
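To give a flavour of the "liberal in what you accept" part, here is a minimal sketch of how a lookup script might reduce all three routes to one normalized term. It assumes the form and bookmarklet supply a q parameter, and that the address-bar route arrives as a request path beginning with /glossary/ (for instance via the 404 handler). The function name glossary_term() and the choice of lower-casing as the normal form are assumptions, not the actual implementation.

```php
<?php
// Hypothetical helper: reduce any of the three query routes to one
// normalized glossary term.
function glossary_term(array $query, string $requestUri): string
{
    $prefix = '/glossary/';
    if (isset($query['q']) && is_string($query['q'])) {
        // Form or bookmarklet route: the term is in the q parameter.
        $raw = $query['q'];
    } else {
        // Address-bar route: the term is whatever follows /glossary/.
        $path = (string) parse_url($requestUri, PHP_URL_PATH);
        $raw  = strpos($path, $prefix) === 0
            ? rawurldecode((string) substr($path, strlen($prefix)))
            : '';
    }
    // Collapse runs of whitespace, trim both ends, ignore case.
    $collapsed = (string) preg_replace('/\s+/', ' ', $raw);
    return strtolower(trim($collapsed));
}
```

In a real script the two arguments would typically be $_GET and $_SERVER['REQUEST_URI']. Passing them in explicitly keeps the function easy to try out on its own.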

I still have some minor enhancements to add to the published pages, but having finished the main functional coding I wanted to take the opportunity to make a posting while the thoughts are still fresh, and also to break the dearth of postings so far this week. Tomorrow I lay down my coding tools and return to my preferred role here as writer and publisher.

posted by Phil 2:52 PM (GMT) | comments | link

Friday, January 24, 2003

Spelling out acronyms

As an experiment, I'm publishing a separate RSS 2.0 file for every definition in the Loosely Coupled glossary. I'm not sure where this will lead, but I'm sure it will lead somewhere. One potential application is to provide a ready reference for spelling out acronyms. Because there's currently no easily accessible authority that journalists, analysts, marketing people and others can use to check up what terms like HTTP or BPML stand for, incorrect versions tend to proliferate. Indeed, as one early visitor to the glossary discovered this week, we ourselves blundered with our initial rendition of HTTP.

Having tightened up the checking procedures to make sure a similar error won't occur in the future, I'm ready to make a commitment that you'll always be able to trust that the spelt-out version of an acronym will be authoritative when you look it up in the Loosely Coupled glossary. I'm hoping I'll be able to add a field to the RSS feed that includes the spelt-out version, but for the moment I'm going to hold off as I want to think some more about the structure of those RSS files. In the meantime, the spelt-out version is in parentheses at the beginning of the definition (see below for a tip on how to extract this using PHP).

There is a problem in being punctiliously accurate, of course. It means that common misrenderings — such as giving the M in BPML as 'markup' when it should be 'modeling', or our initial substitution of 'transport' instead of the correct word 'transfer' for the second T in HTTP — will produce a 'not found' response when looked up in the glossary.

This problem reminds me of the discussion that ensued after Mark Pilgrim's O'Reilly article this week, where he recommended parsing RSS feeds in a way that was forgiving of badly formed XML. Purists maintain that aggregators shouldn't encourage bad behavior by feed publishers, whereas Mark took the position (quite rightly in my view) that, in order to provide the service their users expect, they have no choice but to do so. Likewise, I'm going to need to add some mechanism for handling common mis-spellings of popular acronyms to the glossary. That again would be a useful addition to the RSS file, but it needs to be done in a way that won't lead to confusion between the correct and incorrect renderings, so I'll have to give some more thought to how it might work.

As I mentioned above, the correct rendering has a consistent format and position at the beginning of the definition. This means it can be extracted by looking for the opening and closing parentheses. However, take care if using PHP when you specify the function for finding the opening parenthesis. As the first character, it's at position 0 in the description, but of course if there's no parenthesis (ie the glossary term is a word rather than an acronym) then the function will return false, which PHP treats as equal to 0 in an ordinary comparison. So to make sure a 'not found' doesn't produce the same result as 'found at position 0', in PHP you have to use three equals signs (ie 'identical to') rather than the normal two (ie 'equals') to evaluate the expression, as shown in the following example:

```
if (strpos($description, "(") === 0) {
  // the description has an opening parenthesis at position 0
  $end_paren = strpos($description, ")");
  // find the position of the closing parenthesis
  if ($end_paren !== false) {
    // the spelt-out version is the text in between
    $spelt = substr($description, 1, $end_paren - 1);
    print $spelt; // print it out
  }
}
```
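To see the pitfall in action, here is a small sketch that wraps the same logic in a function and then compares the two operators on a description with no parenthesis. spelt_out() is a hypothetical name, and the sample description is made up for the demonstration.

```php
<?php
// spelt_out() is a hypothetical wrapper around the snippet above.
function spelt_out(string $description): ?string
{
    if (strpos($description, '(') !== 0) {
        return null; // no leading parenthesis: not an acronym entry
    }
    $end_paren = strpos($description, ')');
    if ($end_paren === false) {
        return null; // no closing parenthesis: nothing safe to extract
    }
    return substr($description, 1, $end_paren - 1);
}

// A made-up description with no parenthesis at all.
$word = 'A sample definition of an ordinary word.';
var_dump(strpos($word, '(') == 0);  // bool(true): false equals 0 when compared loosely
var_dump(strpos($word, '(') === 0); // bool(false): false is not identical to 0
```

With the two-equals version, every plain-word entry would be treated as if it began with a parenthesis.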

posted by Phil 2:31 PM (GMT) | comments | link

Archives update

The cause of this problem is now closer to being identified, although it is not yet fully resolved. I had another repeat last week, this time when publishing a new weblog entry, and Blogger's Steve Jenson was able to find the corresponding records in the publishing logs. He reported back: "Your first publish took 2,560 seconds, or 43 minutes, to finally time out due to a network failure (which no one else experienced so I can only assume it happened on your hosting provider's side), and your other two publishes, which start before this faulty one has finished, each took around 3 seconds and were successful."

Separately, I've noticed when making DOS-FTP uploads direct to the server that it occasionally freezes for no apparent reason. On one of these occasions this week I ended the session and opened a new session. Lo and behold, the file I had been attempting to upload a new version of was sitting there as a 0k directory entry. So I think Steve is right; there's a glitch on my hosting provider's side that intermittently puts FTP transfers into limbo for extended periods, sometimes during a write operation, resulting in an empty file.

I'll report it to the provider's helpdesk and see if they can make any progress on sorting it out, but in the meantime this new evidence about what's happening is helping me to reduce the number of times the problem occurs. Now if I see a "transferring files" message in Blogger's editing console I know that it means the process has temporarily gone into limbo, and instead of panicking and immediately attempting to republish again, I just monitor the situation until I'm sure the process has completed. Meanwhile, if I want to update the archive template, I do it on my alternative server and copy the files across rather than risking it on the affected server.
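One way to take some of the guesswork out of that monitoring would be a script that lists any zero-length files. The sketch below assumes it runs somewhere with direct filesystem access to the archive directory, such as on the server itself or against a local mirror. The function name empty_files() is hypothetical.

```php
<?php
// Hypothetical check: list the files in a directory that are zero bytes
// long, which is how an interrupted write shows up.
function empty_files(string $dir): array
{
    clearstatcache(); // make sure sizes are read fresh, not from PHP's cache
    $empty = [];
    foreach (glob(rtrim($dir, '/') . '/*') ?: [] as $path) {
        if (is_file($path) && filesize($path) === 0) {
            $empty[] = basename($path);
        }
    }
    sort($empty);
    return $empty;
}
```

An empty result after a publish would suggest the transfer completed. A non-empty one names the pages that need restoring.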

This has cut down on the number of problems, but it is more of a pain than it should be, so I'm hoping to reach a permanent resolution soon. Steve says there will be a new version of Blogger within days, and that it will be much easier to debug publishing problems with the new software. In the meantime, I'm well impressed by Steve's commitment and effort to getting to the bottom of this.

posted by Phil 4:32 AM (GMT) | comments | link

Friday, January 10, 2003

Disappearing archives

I thought I had fixed the problem of blank archive pages, which I reported here on Wednesday, but I'm sorry to say the problem continued to recur throughout the day, long after I'd thought it fixed. I apologize if you were inconvenienced by the temporary absence of a large portion of the Loosely Coupled weblog archive throughout yesterday. The pages were restored this morning.

What I didn't realize on Wednesday — after I had restored the original archive pages — was that Blogger was still attempting to complete the 'republish archives' command from earlier on. So although I restored all the missing pages at 2:22am Pacific time, Blogger was soon back again wiping them clean. According to the timestamps on my directory listing, Blogger was busy republishing pages all day, not ceasing until more than twelve hours later, by which time it had erased the contents of practically every archive page. Evidently, eating my archives just the once was not enough for Blogger. Its insatiable hunger kept on driving it back for seconds until it had virtually scraped the blog clean.

Naturally, this has thoroughly shaken my faith in Blogger's archive republishing function. Although I'll report the problem, I can't see myself using the function on my live pages again — the risks are too great. In the short term, I can use my workaround. Later on, I'll substitute an alternative that I can trust.

posted by Phil 9:49 AM (GMT) | comments | link

Wednesday, January 08, 2003

Blogger ate my archives

Visitors trying to access archive pages for the Loosely Coupled weblog last night were mostly greeted by a blank screen after Blogger deleted their contents during a routine republishing exercise. The purpose of the exercise was to amend the copyright notice at the footer of each page, redating it for the new year and at the same time updating the company name. That meant amending the page template stored in the Blogger system and then using the "republish all" function to recreate each weekly archive page using the updated template.

But instead of replacing the old version of each page with the updated contents, Blogger's publishing process only got halfway through the task, with disastrous consequences. It successfully opened the old file and deleted the contents, but never got around to writing the new contents, leaving the files as empty shells. So any visitor clicking on a URL for one of those pages saw just an empty, white space in their browser.

Having seen that this was happening, I naturally attempted to rerun the publishing process to correct it. But with each fresh attempt, Blogger wiped out a further batch of archive pages. My blog was being eaten alive before my very eyes, and there seemed to be nothing I could do about it.

Although this is the first time I've experienced this problem when republishing archive pages, it has come up before when posting new entries using Blogger. Several months have passed since then and it still hasn't been fixed, which tells me that it's not a universal problem with all Blogger users, and therefore it must be specific to the way Blogger interacts with my hosted server. I decided to put that theory to the test, by attempting to republish the archives to an alternative server hosted with a different provider. All I had to do was go into Blogger's setup console and change the server IP address, the server path for the blog and archive files, and the server username and password. Then I republished again, and it worked first time. After copying the resulting files back to my main server, I've restored the missing pages.

While doing so, I noticed that the FTP process is much faster on my alternative server, hosted at Hostcentric, than at my main server, hosted at Jumpline. This pretty much confirms my suspicion that the publishing glitch is caused by a timing problem between Blogger's publishing engine and Jumpline's FTP program. But of course that in turn means that the chances of getting it fixed any time soon may be quite slim. With a claimed 30,000 domains hosted, Jumpline hardly represents a big slice of Blogger's customers, and vice-versa. So neither support team has a big incentive for getting in touch with the other and resolving this issue. On balance, I imagine that Blogger has the greater incentive, since Jumpline is likely to be using the same tools as other hosting providers, and so fixing the problem here may also help resolve it in other instances. I'll report it to both helpdesks, anyway. But there are a couple of other options I should be considering as well.

Moving to a new hosting provider is one possibility. I'm reluctant to do so after all the effort I've spent educating Jumpline about DNS hosting — a topic that Jumpline at least is more understanding of than Hostcentric. But I should at least research alternatives, as it's always a good idea to have a fallback waiting in the wings.

The second possibility is to decouple Blogger's weblog publishing functionality from the final page publishing process. I'll need to think through how this could be done in a way that doesn't mess up the URLs and permalinks, but in principle it should be possible to set up Blogger to publish the raw weblog content to a set of staging files, and then have a separate process that reads the content from those staging files and writes out the published pages. Decoupling Blogger from the final publishing act would mean I could make changes to the page 'skins' without having to use Blogger to republish the results, and would also allow me to add features that Blogger doesn't support (for example, varying the skins according to the archive date, or adding a page-specific contents index at the top of each page).
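In outline, the separate publishing process could be as simple as the sketch below. It assumes Blogger has written the raw entries to a staging file and that the skin is a string containing a {content} placeholder. The function name publish_page(), the placeholder and the ".tmp" suffix are all assumptions for illustration. The page is written to a temporary file and then renamed into place, so a stalled write should not leave the live page as an empty shell.

```php
<?php
// Hypothetical publishing step: wrap staged content in a skin and write
// the finished page, never replacing a live page with an empty one.
function publish_page(string $stagingFile, string $skin, string $target): bool
{
    if (!is_readable($stagingFile)) {
        return false; // staging file missing or unreadable
    }
    $content = file_get_contents($stagingFile);
    if ($content === false || trim($content) === '') {
        return false; // nothing staged: leave the live page alone
    }
    // Drop the staged content into the skin's placeholder.
    $page = str_replace('{content}', $content, $skin);

    // Write to a temporary file first, then rename it over the target.
    $tmp = $target . '.tmp';
    if (file_put_contents($tmp, $page) === false) {
        return false;
    }
    return rename($tmp, $target);
}
```

Changing a skin then means rerunning this step over the staging files, without asking Blogger to republish anything.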

I seem to be moving more and more towards using Blogger purely for editing and ordering my blog entries, while using a dedicated publishing system to do everything else (such as generating RSS feeds, contents pages, a commenting system and now the blog pages themselves). I may reach a point where I discover that there's a better platform than Blogger for this type of approach. In the meantime, grappling with these problems and fixes is helping me to figure out more and more about what it really means to be loosely coupled.

posted by Phil 3:43 AM (GMT) | comments | link

Building a website using plug-in online services: the Loosely Coupled experience
