Author Archives: daniel

Longer and longer URLs

The Age and SMH have joined News.com in embedding story names in their URLs.

So at The Age before it was
http://www.theage.com.au/articles/2004/11/01/1099262789668.html — now it’s
http://www.theage.com.au/news/National/TAB-locks-superglued/2004/11/02/1099262825340.html

News.com used to have stuff like
http://news.com.com/2100-1028_3-5435183.html — but now it’s
http://news.com.com/Fahrenheit+94711+expands+election-eve+pay-TV+airing/2100-1028_3-5435183.html

(Note how the slash in the News.com article title screws up the text embedded in the URL: “9/11” comes out as “94711”.)

It’s probably good for the search engines but it’s hopeless for passing URLs via email, as they now spill out over more than one line.

(If they want to maximise hits, what Fairfax should prioritise is countering Google News’s opinion that The Age and SMH are subscriber-only.)

PS. Thursday 8am. For those of us who want to quote SMH/Age URLs to people, you can still chop out all the embedded text, and replace it with “articles” so the example above becomes: http://www.theage.com.au/articles/2004/11/02/1099262825340.html
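If you quote a lot of these, the chopping can be scripted. Here is a minimal Python sketch; the function name and the regular expression are my own guesses based purely on the two example URLs above, and treating www.smh.com.au the same way as The Age is an assumption, so it may well miss other URL shapes.

```python
import re

# Hypothetical pattern, inferred only from the example URLs in this post:
# <site>/news/<section and headline text>/<yyyy>/<mm>/<dd>/<id>.html
_LONG_FORM = re.compile(
    r"^(http://www\.(?:theage|smh)\.com\.au)/news/.+?/(\d{4}/\d{2}/\d{2}/\d+\.html)$"
)


def shorten_fairfax_url(url):
    """Swap the embedded section/headline text for "articles".

    URLs that don't look like the long form are returned unchanged.
    """
    match = _LONG_FORM.match(url)
    if match is None:
        return url
    site, date_and_id = match.groups()
    return f"{site}/articles/{date_and_id}"
```

Feeding it the long Age URL from the top of this post gives back the short “articles” version quoted in the PS.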

RSS

XML feeds are the fashionable thing these days. Something like it almost showed up with Active Channels in IE4, but it’s taken RSS (and to a lesser extent, Atom) to grab a foothold for it to really take off. Anything half-decent has it, and the number of hits that most blogs get from RSS readers is ever-increasing.

One of the questions to ponder when setting up a feed for people is this: do you provide your full content (at least of recent items), kit and caboodle, in the RSS feed, or just summaries? Pushing everything out uses more traffic (not a problem unless your site is very well-read) but gives people the convenience of reading everything in their RSS reader. Conversely, if you’re trying to get people onto your site (for whatever reason; getting people to see your adverts is important for commercial sites) you’d probably lean towards summaries.

My blog provides everything, because it was set up this way when I was playing with it, and when I inadvertently switched it to summaries during a WordPress upgrade, people used to reading it all quite rightly complained, so I switched it back.

This site uses summaries. (WordPress provides the first X words of an entry, or a specific Summary field if it’s filled in.) While we’re not commercial exactly, it would be nice to get enough Google ad revenue to at least cover the hosting fees. For this, you need people visiting.
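For the curious, the “Summary field if there is one, otherwise the first X words” rule is simple enough to sketch in Python. This is not WordPress’s code: the function name, the default of 50 words and the trailing “...” are all made up for illustration.

```python
def feed_summary(body, summary="", max_words=50):
    """Pick the text to put in the feed for one entry.

    Prefers a hand-written summary; otherwise falls back to the first
    max_words words of the body. max_words=50 is an arbitrary choice.
    """
    if summary.strip():
        return summary.strip()
    words = body.split()
    if len(words) <= max_words:
        return " ".join(words)
    # Mark the cut so readers can tell there is more on the site.
    return " ".join(words[:max_words]) + " ..."
```

Tweaking the auto-summaries to carry more text then amounts to raising the word limit.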

We’ve had some comments about this, expressing the view that this is a Bad Thing as it discourages readers who like to read everything from their RSS readers. That’s probably true for some people: unless the summary (human or computer-provided) is compelling enough, they won’t visit. But do they bother to visit if they can read everything from the RSS reader? Maybe if there are pictures or they feel compelled to leave a comment. A visit is only a click away, after all.

For now, we’ve decided, like an aging 80s rocker clearing out his CD collection, to keep the Status Quo, but do a little tweaking of the feed to provide more text in the auto-summaries. Hopefully there’s enough interesting content appearing here to keep people coming back.

URL design

It pays to keep your URLs clean. Preferably just directories, no trailing filenames, and certainly no default.aspx type stuff on the end. Why? Because you’re aiming at humans, most of whom can’t remember that kind of stuff, and don’t want to be bothered typing it.

Everything that Jakob Nielsen wrote five years ago still applies. You want URLs that are memorable, easily typable, short enough to send in emails without getting chopped up, that don’t automatically add weird parameters screwing up bookmarks and browser autocomplete, and that can be passed on by word of mouth.

Hey Joe, look at this site. www dot geekrant dot org

wins over

Hey Joe, look at this site. h t t p colon slash slash w w w dot geekrant dot org slash index dot php

every time.

This stuff is not hard. For Apache people, .htaccess works wonders. For the IIS crowd, fiddle with the default page settings. There is no excuse for www.microsoft.com/windows forwarding to www.microsoft.com/windows/default.mspx. Anybody who bookmarks that will be in for a shock the next time Microsoft moves to a new scripting technology and changes its file types.

Hide the default/index.html/asp/aspx/cgi/php/whatever from your users by linking back to your index pages without using the filename, e.g. “./” for the root of this directory, “../” for the parent directory, and so on. This also helps with what Nielsen calls “hackable URLs”.
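If your links are generated by code, one way to keep the filenames out is to scrub them as the links are written. A rough Python sketch follows; the function name and the list of “default document” names are my assumptions, so adjust them to whatever your server actually serves for a bare directory.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed default-document names. The lookbehind makes sure we only match a
# whole filename directly after a slash (so "/myindex.php" is left alone).
_DEFAULT_DOC = re.compile(
    r"(?<=/)(?:index|default)\.(?:html?|asp|aspx|mspx|cgi|php)$",
    re.IGNORECASE,
)


def clean_url(url):
    """Drop a trailing index/default filename, leaving the directory URL."""
    parts = urlsplit(url)
    path = _DEFAULT_DOC.sub("", parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

With this, http://www.geekrant.org/index.php comes out as http://www.geekrant.org/ and ordinary pages such as /about.html pass through untouched.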

Redesign your 404 pages by all means: give them options to go to the home page, or search, or a site map. But don’t make your 404s jump to a special page, changing the URL. Do you know how irritating it is to get a 404 that’s hidden what you typed, so you don’t know what you got wrong?

Though it’s become kinda fashionable to chop it, I still lean towards including www on the front in URLs, because it means you can put it in written form without the http:// and there’s no doubt what you’re talking about.

PS. Which browser vendor will be the first to hide http:// in the address bar when it’s not needed? Newbies really don’t need to be trying to type that every time, especially as no browser requires it to be entered.

PPS. Yeah I still call them URLs, not URIs. As the W3C says, an http URI is a URL. So there.

Blog spamming

At the time of writing, my main blog is under a sustained comment spamming attack. Over 50 spam comments today, all targeting the one old post, promoting a poker web site. At least one other WordPress-based blogger is getting them, so it’s not just me. And what’s interesting is they’re from a variety of different IP addresses, so assuming that’s not spoofed, it looks like the attack is coming from multiple zombies.

(Links in text deleted)

Author : poker (IP: 195.172.182.228 , 195.172.182.228)
E-mail : byob@y7263o.com
URL : http://www.poker-w.com
Whois : http://ws.arin.net/cgi-bin/whois.pl?queryinput=195.172.182.228
Comment:
7263 JUST A FEW LINKSFOR YOU TO CHECK OU WHEN YOU GET A CHANCE
Online poker
texas holdem poker
texas hold em

When I first saw this type of comment spam, I thought huh? What’s the point? Who is going to see such comments and click on them? Particularly in this case, with dozens of the same spams hitting one particular post. But the point is getting links to your sites into the search engines, and up the rankings. Whether it works or not I don’t know.

WordPress has a fair bit of flexibility when it comes to catching comment spam. The most useful generic setting is the number of links in a comment. A surprising number of comment spams have heaps of links. You can also nominate keywords (though in 1.2 there was a bug where, if the final keyword on the list had a CR after it, every comment got caught). Caught comments go to moderation, so they never see the light of day. Handy for comment spam and for moderating particular users/IP addresses too.
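To make the two checks concrete, here is a small Python sketch of a link-count and keyword filter. It is not WordPress’s code: the names and the limit of two links are invented. It also shows one plausible way a trailing-CR bug could arise (my guess, not a diagnosis of 1.2): a naive split leaves an empty keyword at the end of the list, and an empty string is “found” in every comment.

```python
import re

LINK_RE = re.compile(r"https?://", re.IGNORECASE)


def parse_keywords(raw):
    """Split a one-per-line keyword list, dropping blank entries.

    Skipping blanks matters: "poker\n".split("\n") gives ["poker", ""],
    and "" is a substring of every comment.
    """
    return [kw.strip().lower() for kw in raw.splitlines() if kw.strip()]


def needs_moderation(comment, keywords, max_links=2):
    """Return True if the comment should be held for moderation."""
    text = comment.lower()
    # Rule 1: too many links (max_links=2 is an arbitrary threshold).
    if len(LINK_RE.findall(text)) > max_links:
        return True
    # Rule 2: any nominated keyword appears in the text.
    return any(kw in text for kw in keywords)
```

With the keyword list "poker\n", a comment reading “Online poker” is held and “nice post” goes straight through.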

Comment spammers, like other spammers, are getting cleverer. Hopefully the blogging community (and in particular those who write and update blogging software) will stay one step ahead of them.

Update Friday 07:30: The attack appears to be widening to more blog posts, and branching out to Viagra and weight-loss, but is still showing signs of being from the same source. To counter it, I have shut down comment posting on entries more than 60 days old using Scott Hanson’s Auto Shutoff Comments plugin.

Defined: Wikipedia on blog comment spam.

Possible solution for WP?: Modification to comments code that ensures it can only be called from the form, not remotely. I’ll try this when I get the chance.

Update Friday 13:00: The patch above doesn’t work for this particular attack. Looks like this one spoofs the referrer… which makes sense, any decent spammer would think of that.
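Why a referrer check is so easy to beat: the header is supplied by whoever makes the request. The Python sketch below is only an illustration of the idea, not the actual WordPress patch; the function name, the plain-dict headers and the example.org host are all made up.

```python
from urllib.parse import urlsplit


def came_from_own_form(headers, own_host):
    """Return True if the Referer header names a page on own_host."""
    referer = headers.get("Referer", "")
    return urlsplit(referer).hostname == own_host


# A browser submitting the real comment form sends the post's own address...
browser = {"Referer": "http://www.example.org/2004/11/some-post/"}
# ...but a script can put exactly the same value in the header itself.
spam_bot = {"Referer": "http://www.example.org/2004/11/some-post/"}
```

Both requests pass the check, so it only stops the laziest of spammers: the ones that send no referrer at all, or the wrong one.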

Any GeoCities users

For anybody who dabbles in GeoCities, they’re doing a little cleanup which means rarely accessed or updated sites may get the flick:

“We noticed that you haven’t updated your web site in a while. If you wish to keep your web site, we encourage you to update it within the next 30 days so that it will not be deleted due to inactivity. If your web site is deleted, visitors will no longer be able to access your web site and all files will be permanently deleted.”

I took a look at my site (which has bugger all on it) and got this warning:

[Screenshot: GeoCities inactive warning]

If you’ve got a site you occasionally glance at, now would be a good time to tinker a bit. And grab a copy of whatever’s on it, if you don’t already have it.

Do you really really want to open the file?

I know the spread of macro viruses via consumer products is a dangerous thing, and obviously Microsoft in particular have had to take action to help slow them down. But I’m not convinced the plethora of dialog boxes that now adorns every application is really the way to go.

For instance, if you open an MDB in Access 2003 that was created in Access 2000, you are likely to get no less than three separate security dialogs asking if you’re sure, if you’re really sure you want to open the file.

I’ve been using Access for some years, but I don’t know what an “unsafe expression” is. I created the MDB I’m opening, and it’s just got tables in it. No macros, no VBA modules, not even a report or query. There’s nothing unsafe in it. So I said No, don’t block the unsafe stuff you imagine is in this file. Give it all to me.

Having said no, I don’t want them blocked, it then complains that it can’t block them. Obviously it doesn’t trust me to answer sensibly; it really wants to block those imaginary unsafe items. But it can’t without sending me off to Windows Update to install Jet 4 SP 8 or later.

I had to really concentrate to work out what the Yes/No options at the bottom of the dialog are for. They’re nothing to do with blocking the alleged unsafe expressions, or installing the service pack. Nope. What it’s asking is if I still want to open the file.

Having ascertained that I don’t care about the unsafe expressions that don’t exist, and I still want to open the file… it asks me just one more time, by suggesting the bleeding obvious: “This file may not be safe if it contains code that was intended to harm your computer.” Well duh, no kidding.

The cunningly placed Cancel button on the left could easily lead one to click that by default. But finding and clicking the Open button finally really opens the file.

Now, why did I want to look at this file again?

Speed up Acrobat

Method 1: Install Acrobat Reader 6, then trim out all the extraneous plugins. The same method apparently works with Acrobat 5, if you still have that.

Method 2: Downgrade to Acrobat 5.05, which does all the essential PDF-reading stuff, but is smaller and quicker.

OldVersion.com is a boon for finding old versions of freeware. For years I used ICQ98, since it was tiny and advert-free. I’m such an early adopter I have a 7 digit ICQ number 🙂 Now I generally use Trillian, since it’s multi-IM-standard, so I can talk to people on MSN and ICQ and Yahoo with only one multi-megabyte IM program sitting in memory.

Excel to HTML

I can’t believe how stupid Excel (2002/XP) was with the table of browsers the other day.

The plan was to get the numbers into Excel, copy/paste into a FrontPage table to strip back the formatting, then paste into WordPress.

Nup, bloody monstrous Excel tags right the way through it, which FrontPage couldn’t override, and evidently no easy way to strip them. No combination of Paste Special would work. So for example, instead of <td></td> we got:

<td align="right" x:num="1.15E-2" style="color: windowtext; font-size: 10.0pt; font-weight: 400; font-style: normal; text-decoration: none; font-family: Arial; text-align: general; vertical-align: bottom; white-space: nowrap; border: medium none; padding-left: 1px; padding-right: 1px; padding-top: 1px"></td>

I kid you not. Now, I know about round-trip HTML, though I have my doubts that anybody uses it: firstly because it looks like crap in a web browser, and secondly because if you want to edit it later, you’ll just keep an XLS copy. Besides, it’s badly implemented. The cell above was using the “Normal” style. It shouldn’t have had all the formatting crap embedded in it.

Word XP actually has a Save As Filtered HTML option to strip out all this crap. Excel XP doesn’t. (I haven’t checked Excel 2003 yet.)

Plan 2 was to save it as HTML, load it into FrontPage and crop the HTML to paste into WordPress. Nup, trying to re-open it in FrontPage just threw it back to Excel. WTF?! Opening in UltraEdit (my preferred text editor) just revealed the same tags as above.

How can two Microsoft products that are part of the same suite, same version, operate so disastrously badly with one another, for something as simple as copying a table?

Plan 3? Oh bugger it, it’s only a few lines, just write it by hand.

If it were more I’d go install and run that clear The Useless Crap Out Of The HTML filter thing (oh look, they could do with clearing the crap out of their URLs too), but it refuses to install unless you have Office 2000. Wonderful.

Next time (after swearing a bit) I’ll probably save to CSV and then do a global replace from commas to table tags.

Surely there must be an easier way?
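One candidate for an easier way is to script the CSV route rather than doing a global replace by hand. A minimal Python sketch, assuming a plain CSV export with no header handling or styling (the function name is made up):

```python
import csv
import html
import io


def csv_to_html_table(csv_text):
    """Turn CSV text into a bare <table> with one <tr> per row."""
    lines = ["<table>"]
    for row in csv.reader(io.StringIO(csv_text)):
        # Escape &, < and > so cell contents can't break the markup.
        cells = "".join(f"<td>{html.escape(cell)}</td>" for cell in row)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

The advantage over replacing commas with table tags is that the csv module copes with quoted cells that themselves contain commas, which a blind search-and-replace would split in two.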

File listings from zips

Need to get a file list out of a zip file? WinZip is a fine product, but don’t muck about with their recommended method: setting up a text file printer driver to print a list to, then having to chop out the paper-style headings and linefeeds.

No… Instead go to Info-Zip and grab their command-line zip package.

unzip -l filespec.zip [Optional filespec if you don't want them all] > filelist.txt

Easy.
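If you would rather not install anything and happen to have Python around, its standard zipfile module can do much the same job. A minimal sketch (the function name is mine; unlike unzip -l it returns names only, with no sizes or dates):

```python
import fnmatch
import zipfile


def list_zip(path, pattern="*"):
    """Return the names stored in a zip file that match a wildcard pattern."""
    with zipfile.ZipFile(path) as zf:
        # fnmatchcase keeps the matching case-sensitive on every platform.
        return [name for name in zf.namelist() if fnmatch.fnmatchcase(name, pattern)]
```

Writing "\n".join(list_zip("filespec.zip")) to filelist.txt gives a one-name-per-line listing.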

Browser wars 2

Who’s winning this time round? Is Firefox having any impact?

Here’s the stats for my most heavily trafficked site, top 15 agents:

Hits Percent User Agent
30815 11.88% Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
23816 9.18% Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
14375 5.54% Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1
10701 4.12% Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
8880 3.42% Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET
8330 3.21% Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)
8092 3.12% Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
6784 2.61% Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/
5190 2.00% Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/2004
5144 1.98% Program Shareware 1.0.0
4832 1.86% Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1
3825 1.47% Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko
3314 1.28% Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7) Gecko
2982 1.15% Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FunWebProd
2362 0.91% Atomz/1.0

Interesting that after all these years, IE5.5 is still the top hitting browser.

Gecko is Firefox and Mozilla and their derivatives. Probably a few copies of Netscape 7 floating around as well.

Atomz and Yahoo are spiders, obviously, though I’m not sure why Yahoo decided it would be good to tell us their spider is Mozilla compatible, ‘cos I bet it isn’t. Google comes through every so often, but doesn’t appear in the top 15 provided by my web site’s default report.

I have no idea what “Program Shareware 1.0.0” is. Any ideas, anybody?

No sign at all of Mac users, or indeed any OS other than Windows. Maybe if the list showed the top 50…

Getting these into some basic groups, we have:

Hits Percent User Agent
82008 31.60% MSIE 6
30815 11.88% MSIE 5.5
12329 4.75% Gecko

I could show you the two party preferred figures, but I scarcely need to: IE still rules the roost, though I’d bet Gecko/Firefox is slowly gaining momentum.

(Obviously I’m going to have to look beyond the top 15, because there must be an awful lot of minority combinations of OS/browser out there.)
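For a longer list, the grouping is easier to script than to add up by hand. Here is a rough Python sketch using the top-15 rows above (user agent strings truncated exactly as the report shows them). The substring rules are my own crude classification, not anything the stats package does.

```python
from collections import Counter

# (hits, user agent) rows copied from the top-15 report above.
TOP_15 = [
    (30815, "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"),
    (23816, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"),
    (14375, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1"),
    (10701, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"),
    (8880, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET"),
    (8330, "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"),
    (8092, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"),
    (6784, "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/"),
    (5190, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/2004"),
    (5144, "Program Shareware 1.0.0"),
    (4832, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1"),
    (3825, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko"),
    (3314, "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7) Gecko"),
    (2982, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FunWebProd"),
    (2362, "Atomz/1.0"),
]


def classify(user_agent):
    """Crude substring-based grouping; the rules are assumptions."""
    if "MSIE 6" in user_agent:
        return "MSIE 6"
    if "MSIE 5.5" in user_agent:
        return "MSIE 5.5"
    # Caveat: Safari/KHTML agents say "like Gecko", so a longer list
    # would need an extra rule before this one.
    if "Gecko" in user_agent:
        return "Gecko"
    return "Other"


def group_hits(rows):
    """Sum hits per browser family."""
    totals = Counter()
    for hits, agent in rows:
        totals[classify(agent)] += hits
    return totals
```

Running group_hits(TOP_15) reproduces the three totals in the grouped table, and lumps the spiders and the mystery “Program Shareware” into Other.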

It’ll be interesting to see how this pans out over the next few months.