Mudcat Café message #1751403

The Mudcat Café™
Thread #91906 Message #1751403
Posted By: Jim Dixon
01-Jun-06 - 08:47 PM
Thread Name: Googling for Mudcat
Subject: RE: Googling for Mudcat

As I understand it, search engines such as Google use programs called web crawlers to scan web sites and build an index.

The web crawler starts with the home page—that would be http://www.mudcat.org/ in this case—indexes that page, and then follows all the links it can find on that page, and indexes those pages, then follows all the links on those pages, and so on.
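For anyone who likes to see it spelled out, here is a rough Python sketch of that "index a page, then follow its links" loop. It is only my illustration of the general idea, not how Google or any real crawler is written. The tiny SITE dictionary and its page names are made up, and nothing is fetched from the network.

```python
from collections import deque

# Hypothetical miniature site: each page maps to the links found on it.
SITE = {
    "/": ["/threads", "/about"],
    "/threads": ["/thread/3", "/thread/2"],
    "/about": ["/"],
    "/thread/2": ["/thread/1"],
    "/thread/3": [],
    "/thread/1": [],
}

def crawl(site, start="/"):
    """Return pages in the order a breadth-first crawler would index them."""
    indexed = []
    seen = {start}            # pages already queued, so none is visited twice
    queue = deque([start])
    while queue:
        page = queue.popleft()
        indexed.append(page)  # stand-in for "index this page"
        for link in site.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed

if __name__ == "__main__":
    print(crawl(SITE))
```

The "seen" set is what keeps the crawler from going around in circles when two pages link to each other, as "/" and "/about" do here.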

Some web crawlers have limits to the number of levels of links they will follow. Some web crawlers don't index the whole page, but limit their search to the first (insert arbitrary number) lines of text on each page. I don't know whether Google has any limits. If it does, its limits are probably higher than those of any other search engine.
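Those two kinds of limits could look something like this in Python. Again, this is a sketch under my own assumptions: the PAGES data and the names max_depth and max_lines are invented for the example, and I have no idea what limits any real search engine uses.

```python
from collections import deque

# Hypothetical pages: each has some lines of text and a list of links.
PAGES = {
    "/": {"lines": ["Welcome", "to the", "forum"], "links": ["/a"]},
    "/a": {"lines": ["a1", "a2", "a3", "a4"], "links": ["/b"]},
    "/b": {"lines": ["b1"], "links": ["/c"]},
    "/c": {"lines": ["c1", "c2"], "links": []},
}

def limited_crawl(pages, start="/", max_depth=2, max_lines=2):
    """Index at most max_lines lines per page, following links max_depth levels deep."""
    index = {}
    seen = {start}
    queue = deque([(start, 0)])   # (page, number of links followed to reach it)
    while queue:
        url, depth = queue.popleft()
        index[url] = pages[url]["lines"][:max_lines]  # keep only the first few lines
        if depth == max_depth:
            continue              # level limit reached: don't follow this page's links
        for link in pages[url]["links"]:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return index

if __name__ == "__main__":
    for url, lines in limited_crawl(PAGES).items():
        print(url, lines)
```

With the defaults above, "/c" never gets indexed because it sits three links away from the home page, and only the first two lines of "/a" make it into the index.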

At any given moment, there are lots of old threads on Mudcat that can't be reached this way. To view them, you have to type something into a search box. Web crawlers aren't smart enough to figure out what they should type into a search box in order to view every existing thread.

Come to think of it, neither am I. The only way I know to view every existing thread is to start with http://www.mudcat.org/thread.cfm?threadid=1 and increment the number from 1 up to whatever we are at now—at least 91914. Alternatively, you could start with http://www.mudcat.org/Detail.CFM?messages__Message_ID=1 and increment the number up to 1751399 or so.
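In Python, counting through the numbers that way might be sketched as below. The function name numbered_urls is my own invention, the upper limits are just the figures I quoted above, and the code only builds the addresses without fetching anything. I haven't checked whether every number actually leads to a page.

```python
THREAD_URL = "http://www.mudcat.org/thread.cfm?threadid={}"
MESSAGE_URL = "http://www.mudcat.org/Detail.CFM?messages__Message_ID={}"

def numbered_urls(template, last_id, first_id=1):
    """Yield one URL for each ID from first_id through last_id, inclusive."""
    for n in range(first_id, last_id + 1):
        yield template.format(n)

if __name__ == "__main__":
    # Show the first three thread addresses and the first two message addresses.
    for url in numbered_urls(THREAD_URL, 3):
        print(url)
    for url in numbered_urls(MESSAGE_URL, 2):
        print(url)
```

Calling numbered_urls(THREAD_URL, 91914) would walk through every thread number up to where we are now.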

I only know that because I happen to know a bit about how Mudcat works. I don't think a web crawler would figure it out. "Threadid=nnnnn" could mean anything.

Of course, a programmer at Google, with a little investigation (or if someone clued him in), could easily write a special program that would search Mudcat this way, but I doubt that searching Mudcat is high enough on Google's priority list to warrant a specially written program.