The Mudcat Café TM
Thread #171833   Message #4157015
Posted By: cnd
07-Nov-22 - 08:36 AM
Thread Name: Archiving The Mudcat
Subject: Archiving The Mudcat
During the 'Cat's most recent sabbatical, I was frustrated to discover that the vast majority of Mudcat's discussion pages remain unarchived. While the DigiTrad database has been picked up and mirrored by several sites at various points in time, what remains (in my opinion) the main body of the site lives only here. This discussion is also partially inspired by SRS's recent contributions to the Tech: Tape to MP3 conversion thread.

In my opinion, the reasons for the difficulty are as follows:
- The pagination: in other words, the way the message count in a thread's link changes each time a new post is added to an existing thread
- The way the pages come up in a Google (or browser) search

To explain a little further:

When you click on a discussion thread from the main page, the link format is as follows:
https://mudcat.org/thread.cfm?threadid=171822&messages=4
The "&messages" part following the threadid lets a user see whether a thread has new messages they haven't read yet: when the message count changes, the link changes with it, so most modern browsers show it in blue (unvisited) rather than purple (visited). The number after "threadid=" (six digits in this example) is the unique number each thread possesses.

The browser search link format is as follows:
https://mudcat.org/thread.cfm?threadid=43909,43909
This doesn't always seem to happen. For example, a Google search of "mudcat The Illiterate's Alphabet (Sid Kipper)" returns a link without the number repeated after the comma (i.e., https://mudcat.org/thread.cfm?threadid=119676). I'm not certain what the repeated number does.
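
For anyone who wants to experiment, here is a rough Python sketch of how the link variants above could be boiled down to one form per thread. The function name canonical_thread_url is made up for illustration, and the sketch assumes that the bare "threadid=" link (the form the Google example returns) is a fair stand-in for the other variants. I have not confirmed how the server treats the dropped parts.

```python
from urllib.parse import urlsplit, parse_qs


def canonical_thread_url(url):
    """Reduce any of the thread link variants to a bare threadid link.

    Assumption: dropping "&messages=N" and the repeated ",NNNNN"
    still points at the same thread.
    """
    query = parse_qs(urlsplit(url).query)
    raw = query.get("threadid", [""])[0]
    # Keep only the part before any comma ("43909,43909" -> "43909").
    thread_id = raw.split(",")[0].strip()
    if not thread_id.isdigit():
        raise ValueError(f"no numeric threadid in {url!r}")
    return f"https://mudcat.org/thread.cfm?threadid={thread_id}"


if __name__ == "__main__":
    for link in (
        "https://mudcat.org/thread.cfm?threadid=171822&messages=4",
        "https://mudcat.org/thread.cfm?threadid=43909,43909",
        "https://mudcat.org/thread.cfm?threadid=119676",
    ):
        print(canonical_thread_url(link))
```

Using one form per thread would at least make it possible to keep a single list of what has and hasn't been saved.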

All this to say, the variety of link formats makes it difficult for web services such as archive.org to index the website, and difficult to tell whether a page has actually been archived at all. During periods when the 'Cat is regrettably down, or, God forbid, if the site ceases to be hosted, a vast quantity of content would become unavailable.
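
On the question of whether a page has been archived at all, the following is a minimal sketch of how one might ask archive.org about a single link. It assumes the Wayback Machine's availability lookup at archive.org/wayback/available and the JSON layout I understand it to return ("archived_snapshots" containing a "closest" entry). The function names are my own, and a snapshot stored under one link variant may not be reported for another.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url):
    """Build the lookup address for one page."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': page_url})}"


def closest_snapshot(response_text):
    """Return the snapshot address from a lookup response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def is_archived(page_url):
    """Ask archive.org over the network whether any snapshot exists."""
    with urlopen(availability_query(page_url), timeout=30) as resp:
        return closest_snapshot(resp.read().decode("utf-8")) is not None


if __name__ == "__main__":
    print(is_archived("https://mudcat.org/thread.cfm?threadid=119676"))
```

Checking each link variant of a thread separately would show which form, if any, the archive actually holds.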

Another factor is that links to Mudcat aren't often posted elsewhere on the WWW. When they are, services such as archive.org can find them, since they periodically "crawl" other archived sites in search of new links. And, for whatever reason, the 'Cat seems to have avoided the auto crawls. I'm unsure if this is due to some sort of privacy/web setting, or something else entirely.

I don't have any easy solutions to this; it's just something I thought I would bring to people's attention. Maybe one of the more tech-savvy people among us can come up with an automated solution, or knows of a quick and easy one? Personally, I have been archiving pages important to me since late 2020 so I can reference them in the event of outages. But manually logging each and every one of the hundreds of thousands of threads would be a monumental undertaking.