The Mudcat Cafe



User Name Thread Name Subject Posted
Raedwulf BS: Wordle (488* d) RE: BS: Wordle 21 Feb 22


Since I wrote this a few weeks ago for something (private) elsewhere, this might (or might not) be of interest to some of you. Slightly edited for the different environment...

~~~~~~

To E, or not to E, sorry, what is the question?

The answer is E; the question, however you choose to investigate it, is: what is the most common letter in English? Interest in the topic of letter frequency is astonishingly old (well, I was astonished, anyway). The first known analysis was done by an Arab mathematician, Al-Kindi, who lived from 801 to 873 (he was one of those Arabic scholars who were tasked with translating Greek scientific & philosophical treatises into Arabic, and played an important part in the introduction of Indian i.e. Arabic numerals; you lot won't recall earlier entries on numbers & counting). It's not impossible that it may even go back to Roman times.

The reason? Cryptanalysis. Al-Kindi used his results to formally develop the method for breaking ciphers. Old Julius (Caesar, that is) has turned up more than once for various reasons in these entries (wot you 'as not seen!), notably for his calendar reforms. Although he didn't invent ciphers, which have been in use for anything up to 3,000 years, he is the first recorded user.

The Caesar cipher is a simple shift cipher, e.g. A becomes D, B becomes E. However complicated your cipher is, if it has a regular pattern where one letter is always represented by the same other letter, whether that comes from simple shifting or from a more complex key, then knowing the letter frequency of the language you're trying to decipher is vital to breaking it.
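For anyone who wants to see that in action, here is a rough Python sketch of a shift cipher and the crudest possible frequency attack on it. The function names (caesar_shift, guess_shift) are made up for illustration, and the attack simply assumes the commonest letter in the ciphertext stands for E, which is only a reasonable bet on messages long enough for E to actually come out top.

```python
import string
from collections import Counter

ALPHABET = string.ascii_uppercase


def caesar_shift(text: str, shift: int) -> str:
    """Shift each A-Z letter by `shift` places; leave everything else alone."""
    out = []
    for ch in text.upper():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)
    return "".join(out)


def guess_shift(ciphertext: str) -> int:
    """Guess the shift, assuming the commonest ciphertext letter stands for E."""
    counts = Counter(ch for ch in ciphertext.upper() if ch in ALPHABET)
    top_letter, _ = counts.most_common(1)[0]
    return (ALPHABET.index(top_letter) - ALPHABET.index("E")) % 26


message = "MEET ME HERE BEFORE ELEVEN"
secret = caesar_shift(message, 3)          # A becomes D, B becomes E
shift = guess_shift(secret)                # recovered from letter counts alone
recovered = caesar_shift(secret, -shift)   # shift back to read the message
```

On a short or unusual message the guess can easily be wrong, in which case you would try the next most likely letters in turn.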

You can determine letter frequency in various ways. Samuel Morse, when trying to assign the simplest dot-dash keying to the most frequent letters in his famous Code, simply counted up the number of letters in sets of printer's type. E came out top, 60 times more common than Z (12,000 vs 200). Essentially, this is a simple way of performing text analysis. The idiosyncrasy of this is that a relatively small number of words dominate in language – he, she, the, in, of, and & so on(!). It's not necessarily a fault; indeed, it's an advantage if the ciphered message is in plain language. But “the cat sat on the mat” presents an entirely different proposition to the abbreviated “cat sat mat”, where half the message is inferred, rather than included, without the clarity of the message being affected.
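The cat-and-mat point is easy to check for yourself. This is a minimal Python sketch of text analysis by plain counting; letter_counts and letter_shares are illustrative names of my own, and it only looks at the unaccented letters a to z.

```python
from collections import Counter


def letter_counts(text: str) -> Counter:
    """Count each a-z letter in the text, ignoring case, spaces and punctuation."""
    return Counter(ch for ch in text.lower() if "a" <= ch <= "z")


def letter_shares(text: str) -> dict:
    """Return each letter's share of all letters, commonest first."""
    counts = letter_counts(text)
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.most_common()}


full = letter_counts("the cat sat on the mat")
short = letter_counts("cat sat mat")

# The two "the"s supply every E and H in the full sentence,
# so the abbreviated version contains no E at all.
e_in_full = full["e"]
e_in_short = short["e"]
```

Drop the little words and E vanishes from the count entirely, which is why the kind of text you analyse matters as much as how much of it you have.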

Another way of doing it is to go through the main entries in a comprehensive dictionary. Someone did that with the Concise OED of 1995; E still came out top, but Q was last; E, at 11.16%, was 56.88 times more likely to appear than Q. The idiosyncrasy there is your results will vary according to your dictionary. The Wiki entry on the topic has E top with 11%, but J last with 0.21% (Q is now on 0.24%). I don't know which dikker their numbers are based on.

A third way is to include all variants of a word in your analysis, but that biases certain letters because of suffixes. If abstract is the main entry, then you have abstracts, abstracting, abstracted, abstractise... What do you mean I made that last one up!

If you choose to do it on initial letters of words, the frequency of certain combinations of letters such as 'th', 'sc', 'sh' and so on means you get an entirely different set of results. On text analysis, T is the most common initial letter (all those The's & Then's & These's!) followed by A (E is 15th); dictionary analysis gives you S, then C (E, 12th). But it's useful knowledge if you're setting up a filing system or anything else, such as a multi-volume encyclopaedia, that is split on an A to Z basis.
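The gap between the two methods comes down to whether you count every occurrence of a word or each distinct word once. A small Python sketch, using a made-up sample sentence and my own function names, shows the same split in miniature:

```python
import re
from collections import Counter


def words_in(text: str) -> list[str]:
    """Split text into lower-case words made of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())


def initial_counts(words) -> Counter:
    """Count the first letter of each word supplied."""
    return Counter(word[0] for word in words)


sample = "the ship and the sea then the shore"

# Text analysis: every occurrence counts, so the repeated "the" piles up under T.
running_text = initial_counts(words_in(sample))

# Dictionary-style analysis: each distinct word counts once, so S overtakes T.
dictionary_style = initial_counts(set(words_in(sample)))
```

One sentence proves nothing about English, of course; it only illustrates why the two approaches can disagree about the top initial.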

As for the least frequent, the last four are almost entirely consistent. On initial letter dictionary analysis, Y drops in; J, almost 3 times as likely to be an initial, is promoted to 6th last. Otherwise, it's always J, Q, X & Z, in various orders. Which explains a lot about Scrabble scoring!

As for what prompted this, it's all those of you (you know who you are!) who keep posting your Wordle scores. I finally took a look at it (& got it in 3 smug). You might just as well start with stone or stain or stair or store & go from there. Although my one & only go started with “taken”, which was handy, since “spiky” then led to “perky”!

Finally, in other languages, the proportions are naturally different, but the letters aren't necessarily. E is even more frequent in French (14.71%) & German (16.4%) than in English, and a whopping 18.91% in Dutch. You have to go to languages such as Portuguese, Esperanto, Polish, Turkish & Icelandic for that to change (to A, and only in Polish (4th) & Icelandic (5th) is E not 2nd). A, I, & S consistently appear in the top 7 in all of those, with N, R, & T most likely to fill the other 3 spots.

~~~~~

My own strategy is to start with words such as stone, train, notes, that use the most common letters. Whether (& how) you choose any of them is pretty irrelevant. Only blind luck will get you 1/6, and it takes nearly as blind luck to get 2/6 (I've done it once in 20 goes, with two more "if I'd chosen that word instead..."). On the first line, eliminating the most likely letters is as useful as scoring, given the random element. On Sunday, I think, the first 8 letters I chose were all fails; the last 2 on the second line were in the wrong place. I was slightly fortunate to get the word on the 3rd line, but it was also an educated guess, given what had been eliminated.

Whilst there is no logic to it, I try to avoid repeating "opening line" letters (and also opening letter) the following day. There's no reason why stone shouldn't be followed by stein, stair, stain, stale... But it doesn't seem to work that way; I try to change vowels (generally ignoring U, which is rather less common, for the first try). If the word was tacit (a few days ago), I probably went for something like crone or snore next day. It's strictly not logical, but I do it anyway! ;-)

The three important things to remember:

The same letter can appear more than once (this is not explained);
The dictionary is US (despite the creator being British), so rubbishy spellings like center & favor are valid, rather than centre & favour as they should be (and yes that's a deliberate match to the blue touch-paper! :p);
It is rumoured that plurals are not in the (known to be fixed) word list. There are plenty of 5 letter words terminating in S, but whilst bonus, I daresay, is included, sites, rates, teams, etc won't be. I can't speak for the truth of that, but in my limited experience I've never seen one (nor e.g. gives).
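On the first of those points, here is a Python sketch of how the colouring is usually described as coping with repeated letters. It is my assumption about the rules, not anything taken from the game itself, and score_guess is a made-up name. G is green, Y is yellow and a dot is grey: exact matches are marked first, then each remaining guess letter gets a yellow only while the answer still has an unmatched copy of it.

```python
from collections import Counter


def score_guess(guess: str, answer: str) -> str:
    """Mark a guess against the answer: G = right place, Y = wrong place, . = absent."""
    guess, answer = guess.lower(), answer.lower()
    marks = ["."] * len(guess)

    # Answer letters that were not matched exactly are available for yellows.
    spare = Counter(a for g, a in zip(guess, answer) if g != a)

    # First pass: exact matches.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            marks[i] = "G"

    # Second pass: a yellow only while a spare copy of that letter remains.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g != a and spare[g] > 0:
            marks[i] = "Y"
            spare[g] -= 1

    return "".join(marks)


first = score_guess("taken", "perky")    # "..YY."
second = score_guess("spiky", "perky")   # ".Y.GG"
repeats = score_guess("eerie", "perky")  # ".GG.."
```

In the last example only the E in the second position scores, because perky has just the one E and it has already been used up by the exact match.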

Lastly, although I've never set Wordle to hard mode, I've never played it any other way. I have friends who sometimes leave out letters they already know, but I don't see the point. The object is to guess the word in the fewest tries; leaving out letters you know are in the word guarantees that guess fails. So I don't. But maybe that's just me!




* And extra-lastly, someone above suggested a bunch of Q-words to start with. Given that that's in the bottom 4 for frequency & pretty much guarantees you a following U, the least likely vowel by some margin... Good luck with that as a strategy! ;-)

