Archiving The Mudcat |
Subject: Archiving The Mudcat From: cnd Date: 07 Nov 22 - 08:36 AM During the 'Cat's most recent sabbatical, I was frustrated to discover that the vast majority of Mudcat's discussion pages remain unarchived. While the DigiTrad database has been picked up and mirrored by several online at various points in time, what remains (in my opinion) the main body of the site lives online only here. This discussion is also partially inspired by SRS's recent contributions to the Tech: Tape to MP3 conversion thread. In my opinion, the reasons for the difficulty are as follows: - The pagination: in other words, the method of counting when a new post is entered to an existing page - The way the pages come up on a Google (or browser) search To explain a little further: When you click on a discussion thread from the main page, the link format is as follows: https://mudcat.org/thread.cfm?threadid=171822&messages=4 The "&messages" part following the threadid lets a user know whether the thread has new messages they haven't read yet, by turning the link from purple (visited) to blue (unvisited) on most modern browsers. The digits after "threadid=" (six in this example) are the unique number each thread possesses. The browser search link format is as follows: https://mudcat.org/thread.cfm?threadid=43909,43909 This doesn't seem to always happen (for example, a Google search of "mudcat The Illiterate's Alphabet (Sid Kipper)" returns a link without the numbers repeated after the comma, i.e. https://mudcat.org/thread.cfm?threadid=119676). I'm not certain what this functionality does. All this to say, the variety of linking formats makes it difficult for web services such as archive.org to index the website, and difficult to find out whether a page has actually been archived at all. During periods when the 'Cat is regrettably down, or, God forbid, the site ceases to be hosted, a vast quantity of content would become unavailable.
Another factor is that links to Mudcat aren't often posted elsewhere on the WWW; when they are, archive.org, for example, can find them, since it periodically "crawls" other archived sites in search of new links. And, for whatever reason, the 'Cat seems to have avoided the auto crawls. I'm unsure if this is some sort of privacy/web setting, or something else entirely. I don't have any easy solutions to this; it's just something I thought I would bring to people's attention. Maybe one of the more tech-savvy people among us can come up with an automated solution, or knows of a quick and easy one? Personally, I have been archiving pages important to me since late 2020 so I can reference them in the event of outages. But manually logging each and every one of the hundreds of thousands of threads would be a monumental undertaking. |
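The link variants described in the opening post can be collapsed to one form before checking whether a thread has been archived. Below is a minimal Python sketch, assuming the threadid value alone identifies a thread (the post suggests this but does not confirm it). The function names are made up for illustration. The lookup address is the Wayback Machine's public availability endpoint, which this sketch only builds and does not call.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def canonical_thread_url(url):
    """Reduce any of the link variants to thread.cfm?threadid=<id>.

    Assumption: "&messages=N" and the repeated ",<id>" suffix do not
    change which thread is shown, so both can be dropped.
    """
    query = parse_qs(urlparse(url).query)
    raw_id = query.get("threadid", [""])[0]
    thread_id = raw_id.split(",")[0].strip()  # "43909,43909" -> "43909"
    if not thread_id.isdigit():
        raise ValueError(f"no numeric threadid in {url!r}")
    return f"https://mudcat.org/thread.cfm?threadid={thread_id}"


def wayback_lookup_url(url):
    """Build (but do not fetch) an availability query for the canonical link."""
    params = urlencode({"url": canonical_thread_url(url)})
    return f"https://archive.org/wayback/available?{params}"


if __name__ == "__main__":
    for link in (
        "https://mudcat.org/thread.cfm?threadid=171822&messages=4",
        "https://mudcat.org/thread.cfm?threadid=43909,43909",
        "https://mudcat.org/thread.cfm?threadid=119676",
    ):
        print(canonical_thread_url(link))
```

Whether archive.org stores a thread under this short form or under one of the longer variants is not known from this thread, so a real check would probably have to try each variant.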
Subject: RE: Archiving The Mudcat From: Stilly River Sage Date: 07 Nov 22 - 08:54 AM There is Mudcat content at Archive.org and at the Library of Congress, but not everything. (And my search isn't landing on it right now.) The Wayback Machine at Archive.org will help you find a lot of things, or maybe just the name of a thread you're trying to recall, as a starting point. It doesn't go into the layers much. 1,290 crawls through Mudcat is an indication that they are interested in this material. There is a subscription service at the Internet Archive called Archive It, but I don't know if that would serve. |
Subject: RE: Archiving The Mudcat From: cnd Date: 07 Nov 22 - 09:00 AM Thanks for those links. I didn't mean to imply they weren't interested in it; more a suggestion that their service is behind current output and that it would be prudent to get ahead of things. |
Subject: RE: Archiving The Mudcat From: pattyClink Date: 07 Nov 22 - 09:26 AM This is an important subject. I hope we can find a way to make sure Mudcat appears on searches without Herculean efforts. I am loath to blame us, though. MOST searches with the current engines provide poor results in comparison to the past; they have been deranged by SEO and algorithms to boost commercial interests, have they not? |
Subject: RE: Archiving The Mudcat From: pattyClink Date: 07 Nov 22 - 09:53 AM Not to contradict you, cnd. I agree with all you have laid out. Just so frustrated with failing search capabilities that I guess I am curious how much of a factor that is. |
Subject: RE: Archiving The Mudcat From: Stilly River Sage Date: 07 Nov 22 - 10:36 AM When people post about search difficulties I link to one of the threads that discusses methods that do work, like this one. There is some discussion and links to follow on Joe's FAQ. |
Subject: RE: Archiving The Mudcat From: MaJoC the Filk Date: 07 Nov 22 - 11:32 AM One thing I've been gently contemplating is something along the lines of the caching mirrors set up for the Astronomical Database Service. I admit I set up UKADS at Nottingham, but I was following a known-good recipe from ADS Central in the States somewhere, and it was on a Sun system (which shows how long ago it was). The externally-visible structure of the site was also somewhat simpler than the 'Cat .... which is where the OP came in. But the basic idea sounds, erm, sound: a mirror site which caches what it's called on to relay. In the ADS's case, they intentionally threw away older and less-requested pages, but if we didn't do that we'd end up over time with a full mirror. €0.02 from the demented keyboard of: MaJoC the Filk |
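The mirror idea above (keep a copy of everything relayed, never throw anything away) can be sketched in a few lines of Python. This is only an illustration and not how the ADS mirrors worked. `CachingMirror` and `fetch_from_origin` are hypothetical names, and the fetch function is assumed to raise `OSError` (or a subclass such as `ConnectionError`) when the origin site is down.

```python
class CachingMirror:
    """Relay requests to the origin site and keep every page ever relayed."""

    def __init__(self, fetch_from_origin):
        # fetch_from_origin(url) -> page text; hypothetical, supplied by the caller
        self._fetch = fetch_from_origin
        self._pages = {}  # nothing is ever evicted, so this grows toward a full mirror

    def get(self, url):
        try:
            # Refresh the stored copy whenever the origin answers.
            self._pages[url] = self._fetch(url)
        except OSError:
            if url not in self._pages:
                raise  # never relayed before, so there is no copy to fall back on
        return self._pages[url]

    def cached_urls(self):
        return sorted(self._pages)
```

A real mirror would probably also answer from its copy without contacting the origin every time, and would keep the pages on disk. Both are left out here to keep the sketch short.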
Subject: RE: Archiving The Mudcat From: Joe Offer Date: 07 Nov 22 - 11:47 AM Here is the link for the Library of Congress archive of Mudcat: Same software as at archive.org, but I think it is a more complete archive. If it could be tweaked to make it more complete, let me know. We have a nice relationship with the American Folklife Center at the Library of Congress. -Joe- |
Subject: RE: Archiving The Mudcat From: DaveRo Date: 08 Nov 22 - 03:00 AM cnd wrote: Personally, I have been archiving pages important to me since late 2020 so I can reference them in the event of outages. There is a facility in my Mudcat Browser Tools browser addon that does exactly that. |
Subject: RE: Archiving The Mudcat From: Richard Mellish Date: 08 Nov 22 - 05:41 AM I echo the concern. It seems that many of our eggs are in some other baskets, but not all of them. We should have a complete archive, updated at frequent intervals. And that raises the additional question of how frequent. |
Subject: RE: Archiving The Mudcat From: MoorleyMan Date: 08 Nov 22 - 10:17 AM I'm speakin' as a mere mortal without fancy browser tools or add-ons or apps or masses of tech knowhow... Now I guess what I'm askin' is - if (as I understand it) www.archive.org is a repository for, or access point for, individual song threads (especially useful when the Cat is down), is there an index somewhere that would enable me to find a specific song/thread? Since each thread has a numeric ID, it seems logical that they're sortable - or is that a dumb question? |
Subject: RE: Archiving The Mudcat From: Bill D Date: 08 Nov 22 - 04:34 PM I often find stuff thru Google faster than directly at Mudcat..and often more hits. |
Subject: RE: Archiving The Mudcat From: Joe Offer Date: 08 Nov 22 - 09:54 PM I think it's pretty hard to search the archives of Mudcat at archive.org or the Library of Congress, so that's a problem - I need to learn how to search those archives. But Google is pretty good at it. If Mudcat is down, I Google what I'm seeking. In the top line of each entry in Google results, there are three dots. Click them, and lots of information will appear - choose the "cached" option at the bottom, and you will see Google's cache of the thread. Even when Mudcat is down, I can find stuff on Mudcat through Google, almost as fast as I can find it on Mudcat. And it's my job to help people find stuff at Mudcat - and I like doing it. If you have trouble, ask me. -Joe- joe@mudcat.org |
Subject: RE: Archiving The Mudcat From: GUEST,Ed Date: 09 Nov 22 - 09:10 AM Joe, Many thanks for the information regarding Google's cached copy. Never knew that before. I'd very much wanted to find an early thread recently, and assumed that I'd have to wait for Mudcat to come back online. Thanks again, Ed |
Subject: RE: Archiving The Mudcat From: DaveRo Date: 09 Nov 22 - 04:18 PM Stilly River Sage wrote: There is a subscription service at the Internet Archive called Archive It, but I don't know if that would serve. Archive It FAQ I don't think it's appropriate. It's aimed at organisations. Certainly it would be expensive. The Web Archive aka the Wayback Machine is a set of incomplete snapshots of the mudcat website at different moments over the years. If mudcat disappears it would be totally impracticable to reconstruct it from these glimpses. In some ways mudcat - the discussion threads anyway - would be easy to archive. It's not very dynamic, though posts can and do change. Stuff mainly gets added - it's a big ever-growing heap. Just save the new or amended posts periodically - weekly or monthly - to a read-only database. Store the data in the cloud, e.g. in AWS S3. A search facility could be added for when mudcat is down or only once it's clear that mudcat is never coming back up. But it couldn't be done without Max's agreement and cooperation - it's his website. And it needs someone, or preferably a group of people, to do it and fund it, maybe through crowdfunding. I don't know anything about how mudcat is run. But it's obviously unreliable and it does concern me that people are spending a lot of effort posting lyrics and things into it, maybe under the impression that it'll be available for posterity. |
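The "save the new or amended posts periodically" step above might look something like the following Python sketch. It uses SQLite rather than a cloud store purely so the example runs on its own. The table layout, the function names, and the use of "author + date" as a post key are all assumptions for illustration, since nothing in this thread says how Mudcat identifies posts internally.

```python
import hashlib
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) an append-only archive of post versions."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS post_versions (
               thread_id INTEGER NOT NULL,
               post_key  TEXT NOT NULL,   -- hypothetical key, e.g. author + date
               digest    TEXT NOT NULL,   -- SHA-256 of the post text
               body      TEXT NOT NULL,
               saved_at  TEXT NOT NULL,
               PRIMARY KEY (thread_id, post_key, digest)
           )"""
    )
    return db


def save_snapshot(db, thread_id, posts, saved_at):
    """Store only posts that are new or whose text changed; return how many."""
    added = 0
    for post_key, body in posts.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        cur = db.execute(
            "INSERT OR IGNORE INTO post_versions VALUES (?, ?, ?, ?, ?)",
            (thread_id, post_key, digest, body, saved_at),
        )
        added += cur.rowcount  # 1 if a row was written, 0 if this version was already stored
    db.commit()
    return added


if __name__ == "__main__":
    db = open_archive()
    posts = {"cnd 07 Nov 22 - 08:36 AM": "original text"}
    print(save_snapshot(db, 171822, posts, "2022-11-09"))
    print(save_snapshot(db, 171822, posts, "2022-11-16"))
```

Earlier versions are never overwritten, so an amended post shows up as an extra row alongside the original. Getting the posts out of Mudcat in the first place is the part that would need Max's cooperation, and it is not covered here.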