Mudcat Café message #1936714

The Mudcat Café™
Thread #97655 Message #1936714
Posted By: Mick Pearce (MCP)
14-Jan-07 - 07:41 PM
Thread Name: Digital Tradition Upgrade?
Subject: RE: Digital Tradition Upgrade?

I've mentioned my approach to the design before, but I'll give a description of what I've been doing with the DT.

I still think the way forward is to use a relational database with the semantic structure of the song files. In my database I have tables for:

  Songs - with titles and identifier fields (Child, Laws, Roud, DT#)

  Keywords for Songs - this links the Song table to all the keywords for the song

  SongLines - the entire DT data, but with type added to each line eg Title, Song Text, Notes etc (I think I have 12 types)

  Tunes for Songs - this links the song table to all the tunes given for it

  Tunes - the individual tune names with links to the abc

  Abc - the abc lines for each tune.
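
As a rough illustration of the table layout listed above, here is a minimal Java sketch that models each table as a record. All class, field and enum names are assumptions made for the example (the post does not give the real column names), and only three of the roughly twelve line types are named in the post, so the enum is deliberately incomplete.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class DtModelSketch {
    // Hypothetical subset: the post names only Title, Song Text and Notes.
    enum LineType { TITLE, SONG_TEXT, NOTES }

    // One record per table; identifiers are kept as strings for simplicity.
    record Song(int id, String title, String child, String laws, String roud, String dtNumber) {}
    record SongKeyword(int songId, String keyword) {}
    record SongLine(int songId, int lineNo, LineType type, String text) {}
    record SongTune(int songId, int tuneId) {}
    record Tune(int id, String name) {}
    record AbcLine(int tuneId, int lineNo, String text) {}

    /** Returns the lines of one song that have the given type, ordered by line number. */
    static List<String> linesOfType(List<SongLine> lines, int songId, LineType type) {
        return lines.stream()
                .filter(l -> l.songId() == songId && l.type() == type)
                .sorted(Comparator.comparingInt(SongLine::lineNo))
                .map(SongLine::text)
                .collect(Collectors.toList());
    }
}
```

Keeping a type on every line is what lets a viewer show notes and song text separately: calling linesOfType once with NOTES and once with SONG_TEXT gives the two lists.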

Both the text data and the tune data were generated programmatically from the original DT files (in fact I think it's not the last release I've got loaded but the one before). There was a bit of manual processing afterwards, mostly when the heuristics for separating song notes from song lines in the DT failed: I added a switch button to my viewer so that I could go through the files and have selected lines switched from text to notes or vice versa. There were also a handful of multivoice SongWright files that I didn't convert to abc - the extra programming to handle them didn't seem worth it for the few files affected.

I've written a Java browser to access the database (I'm using Java for intended platform independence). The browser part displays all the titles in a tree (title name under start letter nodes) and I make use of the line type information when displaying selected songs - I keep the song notes separate from the song text. I can display the score: I'm currently using abcm2ps to generate the score and I've got a version of the Java postscript interpreter Toastscript that I've modified to allow it to display and print the score from within the browser. I may automatically display the music in the browser later.

I have implemented a simple karaoke midi player in the browser that displays the song-syllables along with the tune, one line at a time. The midi is generated using abc2midi, so you can see I'm not totally Java yet! My personal opinion is that you don't necessarily need the karaoke to happen on the score the way the current DT browser does. While that's nice, it's a bugger to do unless you're implementing the score writer yourself (which I'm not, certainly not for now; even later I'd prefer to generate postscript and forgo bouncing ball/highlighting the score), and I'm of the opinion that if you can read the music you don't need the karaoke and if you use the karaoke you don't need the score.

For searching I allow searches against the following items:
  Identifier: Child, Laws, Roud, DT, exact titles, exact keywords
  Substrings of Titles or keywords
  Substrings of single lines of song text or song's notes
  Substrings in any line of a song text, any line of a song's notes, any line as in the original DT for the song

The search item can be a list of phrases separated by && (for and) or || (for or), each phrase optionally starting with ~ to negate the test (match -> no match, string in item -> string not in item). It's not particularly elegant, but I chose this because in Java it's almost trivial to split the individual items out.

At present I can only apply the tests to a single item at a time - eg I couldn't search for (Child=1) AND ('Nic Jones' In Song Notes). (Though I return the search result titles as a node on the Browse tree, so I could in principle apply a new search to songs only within a node - I don't at the moment, but it shouldn't be too hard to do. I do allow the results of several searches to be added to the result node, or the old entries to be cleared out first.)

At present I force all '&&' or all '||' to be used in a test, but that's because I couldn't be bothered handling parentheses to set the precedence of (A&&B)||C versus A&&(B||C). I could allow both now if I was willing to accept the default operator precedence, but choose not to. In theory there's no limit to how many operators can appear, but some of the whole song searches could take a long time with a lot of conditions. (All these searches essentially generate only one of three types of query: a simple one against the Song table; a simple join of Song with Keywords or Song with the DT Lines; or a sequence of joins of Song with DTLines linked by INTERSECT or UNION.)

Also the conjunctions are not order dependent: William && Mary would find both 'William loves Mary' and 'Mary loves William' (though it would be trivial to allow something like 'William > Mary' to pick out 'William loves Mary' by creating a simple LIKE/regexp condition in the generated search; in fact I like that idea so much I might do it! I'm not in favour of general regexp expressions being allowed though - it's too easy for people to get them wrong!).
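
To make the search syntax described above concrete, here is a minimal Java sketch of splitting a search item on && or || and honouring a leading ~. It is not the code from the actual browser: the class name SearchExpression and its methods are assumptions for the example, and it tests phrases against a single in-memory string rather than generating SQL. Like the post, it refuses an item that mixes the two operators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SearchExpression {
    /** One phrase of the search item, with its optional ~ negation. */
    record Term(String phrase, boolean negated) {}

    private final List<Term> terms = new ArrayList<>();
    private final boolean matchAll; // true for && (or a single phrase), false for ||

    SearchExpression(String input) {
        boolean hasAnd = input.contains("&&");
        boolean hasOr = input.contains("||");
        if (hasAnd && hasOr) {
            throw new IllegalArgumentException("Mixing && and || is not supported");
        }
        matchAll = !hasOr;
        // String.split takes a regex, so the | characters must be escaped.
        String separator = hasOr ? "\\|\\|" : "&&";
        for (String part : input.split(separator)) {
            String phrase = part.trim();
            boolean negated = phrase.startsWith("~");
            if (negated) {
                phrase = phrase.substring(1).trim();
            }
            if (!phrase.isEmpty()) {
                terms.add(new Term(phrase, negated));
            }
        }
    }

    /** Case-insensitive substring test of every phrase against one item. */
    boolean matches(String item) {
        String haystack = item.toLowerCase(Locale.ROOT);
        boolean result = matchAll; // neutral start value: true for AND, false for OR
        for (Term t : terms) {
            boolean found = haystack.contains(t.phrase().toLowerCase(Locale.ROOT));
            boolean hit = found != t.negated(); // ~ flips match and no match
            if (matchAll) {
                result &= hit;
            } else {
                result |= hit;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SearchExpression e = new SearchExpression("William && ~Mary");
        System.out.println(e.matches("Sweet William"));
        System.out.println(e.matches("William and Mary"));
    }
}
```

Because every phrase is tested independently, the order of the phrases makes no difference, which is the order-independence noted above.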

So examples of searches I can do are:
  Child#:          Child#: 4||8        finds all entries marked as Child #4 or Child #8
  Title:           William && ~Mary    finds song titles containing William but not Mary
  Song line:       Dilston && ancient  finds songs with both of these words in the same line
  All song lines:  Dilston && ancient  finds songs with 'Dilston' in any line of the song and 'ancient' in any line of the song
  (the last will find DERWENTWATER'S FAREWELL; the previous won't find anything)
  
Text searches are currently case insensitive, but the code generating the SQL query can already generate a case sensitive search if I ever add that as an option in the search strings.
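
A minimal sketch of how the whole-song query type (one SELECT per phrase, linked by INTERSECT or UNION) and the case-sensitivity switch might be generated in Java. The table and column names (Songs, SongLines, SongId, LineText) and the class WholeSongQuerySketch are assumptions for the example, not the real schema. Negated phrases are left out, and LIKE wildcard characters inside a phrase are not escaped.

```java
import java.util.Collections;
import java.util.Locale;

class WholeSongQuerySketch {
    /**
     * Builds one SELECT per phrase and links them with INTERSECT (all phrases
     * must occur somewhere in the song) or UNION (any phrase may occur).
     * Each ? is meant to be bound to the value returned by likePattern.
     */
    static String build(int phraseCount, boolean matchAll, boolean caseSensitive) {
        if (phraseCount < 1) {
            throw new IllegalArgumentException("need at least one phrase");
        }
        // For a case-insensitive test the column is upper-cased in SQL and the
        // bound value is upper-cased in Java (see likePattern).
        String condition = caseSensitive ? "l.LineText LIKE ?" : "UPPER(l.LineText) LIKE ?";
        String select = "SELECT s.SongId FROM Songs s"
                + " JOIN SongLines l ON l.SongId = s.SongId"
                + " WHERE " + condition;
        String link = matchAll ? " INTERSECT " : " UNION ";
        return String.join(link, Collections.nCopies(phraseCount, select));
    }

    /** Wraps a phrase in % wildcards so that LIKE performs a substring match. */
    static String likePattern(String phrase, boolean caseSensitive) {
        String value = caseSensitive ? phrase : phrase.toUpperCase(Locale.ROOT);
        return "%" + value + "%";
    }

    public static void main(String[] args) {
        // Two phrases, all required, case-insensitive.
        System.out.println(build(2, true, false));
        System.out.println(likePattern("Dilston", false));
    }
}
```

The generated text would be passed to a JDBC PreparedStatement, with one bound pattern per ? placeholder. Using placeholders rather than pasting the phrases into the SQL avoids quoting problems with titles such as DERWENTWATER'S FAREWELL.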

This search set can run everything that my Windows DT version can do (slightly faster on some of the searches, but essentially the same order of magnitude), but can run the more complex searches too. Note that in the substring searches I search for any substring, not whole words. So 'Hall' will match 'shall' and 'halleluja' too. The database I'm using (Apache Derby - a pure Java database) doesn't support any whole word text searches. I could index all (non-trivial) words in the database and use those, but I don't find the absence of such searches a problem.

There are relational databases that support document handling better than this - Oracle, SQL Server and MySQL have support for word and phrase searches (at present I can't find phrases that span two lines, for example) and I may try to do this with my MySQL version of the database. The Java code can access this just by selecting a different database driver, so I shouldn't need to change anything else to run my current version. I currently have the data in Apache Derby, MySQL and MS Access, and could run the browser from any of them; I could also have it as a client to a server of my Derby DB. I'm usually running on Derby, though, as part of my heading towards an all Java system. I wanted to develop a system that used only free components, and for relational databases both MySQL and Derby are free and I can use either from Java; Derby just has a smaller footprint - about 2MB for the code.
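
The difference between the substring matching described above and a whole-word match can be shown with a short Java sketch. This is only an illustration of the distinction: the class WordMatchSketch is hypothetical, and running a whole-word test in Java on lines returned by a substring query is just one conceivable workaround, not something the browser is stated to do.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class WordMatchSketch {
    /** Case-insensitive substring test, the behaviour described in the post. */
    static boolean containsSubstring(String line, String phrase) {
        return line.toLowerCase(Locale.ROOT).contains(phrase.toLowerCase(Locale.ROOT));
    }

    /** Case-insensitive whole-word test using regex word boundaries. */
    static boolean containsWholeWord(String line, String word) {
        // Pattern.quote stops any regex metacharacters in the word from being interpreted.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        return p.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(containsSubstring("I shall go", "Hall"));
        System.out.println(containsWholeWord("I shall go", "Hall"));
    }
}
```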

I haven't added the editor form yet, but that's essentially just putting a lot of fields on a form (I've designed it, just not programmed it) and updating the database from it. My ultimate aim is to use it to add my own songs and tunes (I'll just flag original DT files separately). I'll also extend the tune info for a song to include links to external audio files (midi/wav/mp3) and let them be selected for the song along with the current tunes available.

This is more technical stuff than I'd normally post, but I thought you might be interested.

For the DT I think there are more significant things that can be done to simplify the administration of the system and increase the usefulness of the online database.

Having a proper Add Song form on Mudcat would be my choice for the most important change to the system (more so than the browser system). At the simplest I'd include Titles, Song Text and Notes plus the Identifier information. It could be stuffed as is into the online database - it needs no more than a flag field to distinguish it from the last released DT.

The searches available could include the songs from the start and the editors could easily locate the new songs for tidying up and possible approval for inclusion as release songs (or even deletion - keep an archive note - "Song entered but duplicate of..."). It would then be trivial to issue updates to the distributed versions of the DT. The song could remain 'open' for annotations - where people could post possible corrections to the song. These could be displayed along with the notes when displaying the song and again the editors could review these when preparing a new release (or just ongoing - you'll always know when there are open annotations) and either alter the song or leave them as notes. I'd do the same for tunes too.

But as I've said before, whatever way the DT goes is fine with me. It's a great resource and it's been provided free by people who put in the time and effort just for the love of it. I've written my own browser partly just to get the hang of Java and partly to get some things the way I'd prefer them. It's still a work in progress, though I have found Java very quick and easy for development - although the browser's not finished, I've probably not spent more than a week on creating it. Even so, the (Windows) version that came with the DT is fine (well, it's annoying having a new window open for each song you look at - that's one change I did make: the song select/search result tree is on the same page as the song display, and I only open a new window for the score view/print). And despite the disproportionate space I've given to my browser and its searches compared to my ideas for the online system, it's that last one that I'd really like to have considered.

Mick