The Mudcat Café TM
Thread #47837   Message #1106121
Posted By: JohnInKansas
31-Jan-04 - 05:15 PM
Thread Name: Help: search engines
Subject: RE: Help: search engines
So Kendall's "searching my profile" is what some call "the vanity search." I think we had a thread some time back about what people get when they put their own name into Google. Some of our folk were quite amused - some not so much so.

The real clinker in the search engine scam business is that the crawlers that are used to search the web can only read html. Nothing that's out there in database formats will ever be found by most of them, and unfortunately most of the output by "intelligent life forms" (catters excepted?) seems to end up stuffed into some sort of data file. Google, and most others, can only find a link to such stuff if someone talks about it in an html site posting.

Example: DigiTrad is a database. Google can't/won't look inside it. Randomly, someone may post a link to something in the database on an html site, and Google may find that link, but for the most part DigiTrad is "off limits" to most of the popular search engines.
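
For anyone who wants to see the mechanism rather than take my word for it, here is a rough Python sketch of how a link-following crawler behaves. Everything in it is made up for illustration: the page names, the "db/song?id=..." addresses, and the three-record database are assumptions, not the real DigiTrad layout or anything Google actually runs. The crawler starts from one html page and follows only the links it can read in html, so a database record turns up only if some page happens to link to it.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an html page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical toy web: two plain html pages (assumed names).
STATIC_PAGES = {
    "forum/index.html": '<a href="forum/thread1.html">a thread</a>',
    "forum/thread1.html": 'Try <a href="db/song?id=2">this song</a>.',
}

# Hypothetical database records. They exist only as query results,
# so no html page lists them unless somebody posts a link.
DATABASE_RECORDS = {"db/song?id=1", "db/song?id=2", "db/song?id=3"}


def crawl(seeds):
    """Breadth-first crawl that follows only links found in html."""
    seen = set()
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        # A database record has no html for the crawler to read,
        # so the crawl goes no further from there.
        parser = LinkExtractor()
        parser.feed(STATIC_PAGES.get(url, ""))
        queue.extend(parser.links)
    return seen


found = crawl(["forum/index.html"])
reachable_records = found & DATABASE_RECORDS
hidden_records = DATABASE_RECORDS - found

print("records the crawler found:", sorted(reachable_records))
print("records it never saw:", sorted(hidden_records))
```

In this toy setup the crawler finds only the one record that a forum post linked to; the other two stay invisible no matter how long it runs, which is the same situation as the "incidental links" described above.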

Example: Go to the specialized search engine at ArtCyclopedia and put in the name of your favorite (legitimate) artist. (Try Renoir or Freud if you can't think of one.) You'll get a result showing all the web museums with works by the artist. Pick a work, and do a Google "image" search for the same piece. You will find a link in Google to one of the museums with about the same frequency as you find Google links to DigiTrad. You will find all the poster shops that sell copies of the work, because they post in html, on html sites, but the museums are NOT indexed. Only incidental links to their stuff, when someone comments on an html site, will be found by Google.

On top of the limitation that the crawlers can't read database information, Google appears to stick to its policy of not initiating searches in .org or .edu sites. Thus you get links to stuff on them only when someone talks about someone who talked about something that was talked about … that is, when the crawler follows a thread of "random postings" that leads it to one of these sites. University library card files are a typical example of a type of resource about as "exempt" from being tapped by Google as the DigiTrad.

The somewhat cynical, but not inaccurate, assessment is that the common search engines don't search the information – they only search the gossip about the "real information" that's out there.

Estimates vary widely as to how much of the web is actually accessible to/through the popular search engines, but I haven't seen a credible estimate that puts it at higher than about 18%. (And I think that's an incredibly optimistic estimate.)

This isn't really a complaint about the search engines. They can be very useful; but their limitations need to be kept in mind when you really need information instead of "gossip."

John