The Mudcat Cafe

Tech: Google NGram Viewer

Subject: Tech: Google NGram Viewer
From: JohnInKansas
Date: 20 Dec 10 - 12:27 PM

While not an earthshaking development, Google has launched a new site that lets you look at a plot over time of how frequently a word (or words) of your choice has appeared in print.

It's at NGram, for those who'd like to experiment.

It appears to be based on the books scanned so far in their campaign to "digitize all the books there are." The default setting only shows usage back to 1800, and I didn't try to see whether you get anything useful before then; but it does let you put in multiple words to compare the relative popularity of different terms (different terms for the same thing might be interesting?), or to look at other "important" things you might want to know about a word.

Sort of fun for a few minutes.

Maybe of some use to the serious minded, but I probably don't qualify to make that evaluation.

John



Subject: RE: Tech: Google NGram Viewer
From: Bonnie Shaljean
Date: 20 Dec 10 - 12:42 PM

It sounds really neat, and is certainly attracting some press attention. I haven't had a chance to properly get into it yet, but will.

The New York Times and Scientific American have both written about it (these articles each have a Page 2 as well, clickable at the bottom):

http://www.nytimes.com/2010/12/17/books/17words.html?scp=1&sq=500%20billion%20wo

http://www.scientificamerican.com/article.cfm?id=google-books-culture

And there's another piece in Discover Magazine (please excuse non-clickie but the URL is so long I'm afraid it'll constipate the Link Maker):

http://blogs.discovermagazine.com/notrocketscience/2010/12/16/the-cultural-genome-google-books-reveals-traces-of-fame-censorship-and-changing-languages/



Subject: RE: Tech: Google NGram Viewer
From: bobad
Date: 20 Dec 10 - 12:47 PM

The word "fuck" has an interesting distribution; it's too bad the time line only goes back to 1800.



Subject: RE: Tech: Google NGram Viewer
From: Jim Dixon
Date: 20 Dec 10 - 01:32 PM

I graphed "shanty" and "chanty" together—because I've always been confused about which spelling was to be preferred—and I found that "chanty" peaked in 1800 and has been declining ever since, while "shanty" didn't appear until around 1820, peaked at around 1900, and has been generally declining except for a secondary peak around 1940, which probably coincides with the folksong revival.

Of course, "shanty" could mean a building, which makes the conclusion a bit doubtful.

I don't think you have to be particularly serious-minded to like this; you just have to be curious about words. In fact, maybe the more serious-minded you are, the sooner you get bored with it!



Subject: RE: Tech: Google NGram Viewer
From: Jim Dixon
Date: 20 Dec 10 - 01:51 PM

You *can* put in a date before 1800, or after 2000.

"Fuck" was almost as popular in the 1780s as it is now, while it just about disappeared from print between 1820 and 1960! Fascinating!



Subject: RE: Tech: Google NGram Viewer
From: JohnInKansas
Date: 20 Dec 10 - 02:03 PM

Bonnie's Discover Magazine Link seems to work okay in my preview.

I showed the site to LiK and she tried looking for "Digitalis" and a couple of other toxic herbals. Aside from a resolution to sniff things carefully for a few days (who knows what was on her mind?) she did get some blips on the chart before 1800, although the plot had a rather "sparse" appearance prior to 1800. It's hard to tell whether the words she picked just weren't used then, or whether Google just hasn't incorporated as many books from previous times.

John



Subject: RE: Tech: Google NGram Viewer
From: Jim Dixon
Date: 20 Dec 10 - 02:39 PM

You know who could really benefit from a tool like this: writers of historical fiction! They could use it to choose words that are appropriate to the period they are writing about.

Many's the time, while watching a movie about a historical event, I've had the realism spoiled for me by hearing a character use a word or expression that just seemed too modern to me.

With a little more massaging of the data, you might be able to create a program that would work something like a spell-checker, flagging words in a manuscript that are possibly inappropriate to a certain period, and maybe even suggesting a better word!

Maybe you could even use such a program to detect historical forgeries!
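
The spell-checker idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `FIRST_SEEN` table, its years, and the function name `flag_anachronisms` are all made up for the example, and a real version would have to derive first-appearance years from actual word-frequency data.

```python
import re

# Placeholder first-appearance years, invented for this sketch.
# They are NOT real data; a real table would come from n-gram counts.
FIRST_SEEN = {
    "okay": 1840,
    "teenager": 1920,
    "carriage": 1500,
    "telephone": 1880,
}

def flag_anachronisms(text, story_year, first_seen=FIRST_SEEN):
    """Return (word, first_year) pairs for words first seen after story_year."""
    flagged = []
    already_flagged = set()
    for word in re.findall(r"[a-z']+", text.lower()):
        year = first_seen.get(word)
        # Words missing from the table are skipped, not flagged.
        if year is not None and year > story_year and word not in already_flagged:
            already_flagged.add(word)
            flagged.append((word, year))
    return flagged

sample = "Okay, said the teenager, stepping down from the carriage."
print(flag_anachronisms(sample, 1815))  # [('okay', 1840), ('teenager', 1920)]
```

A word that is absent from the table is simply ignored here, so the checker can only be as good as the table behind it.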



Subject: RE: Tech: Google NGram Viewer
From: Jim Dixon
Date: 20 Dec 10 - 02:58 PM

JohnInKansas: Your wife wouldn't be a writer of historical murder mysteries, would she? It would be reassuring if she were.

I know from looking at lots of old books through Google, that the older a book is, the harder it is to scan and digitize the text without a lot of "scannos" (the term equivalent to "typos"—meaning errors introduced by inaccurate scanning).

I'll bet the data from before 1800 is less reliable than more recent data, and that's why Google adopted 1800 as an arbitrary default start date.

Typefaces were different then, paper was coarser, and old pages are more likely to have been damaged over the years. You might think you're counting instances of "bach" (the composer) but the computer is including a lot of instances where the common word "back" was misread as "bach."

Your chances of getting an accurate count would be better, I would expect, if you're looking for a longer word or phrase that doesn't resemble any other common word.

It's unlikely, I think, that any other word would be mistaken for "digitalis."
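
As a rough illustration of the "bach"/"back" point, a search word could be checked against a list of common words to see whether a single misread letter would turn one into the other. Everything here is an assumption for the sake of the sketch: the `CONFUSABLE` letter pairs and the tiny `COMMON` word list are invented, and nothing is known from this thread about how Google's scanning actually goes wrong.

```python
# Invented pairs of letters a scanner might confuse; purely illustrative.
CONFUSABLE = {("h", "k"), ("c", "e"), ("l", "i")}

def one_scanno_apart(a, b, confusable=CONFUSABLE):
    """True if a and b have the same length and differ in exactly one
    position, where the two differing letters form a confusable pair."""
    if len(a) != len(b):
        return False
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    if len(diffs) != 1:
        return False
    x, y = diffs[0]
    return (x, y) in confusable or (y, x) in confusable

def risky_lookalikes(word, common_words):
    """List the common words that one misread letter could turn into word."""
    return [w for w in common_words if one_scanno_apart(word, w)]

# Tiny stand-in for a real list of common words.
COMMON = ["back", "the", "and", "each"]

print(risky_lookalikes("bach", COMMON))       # ['back']
print(risky_lookalikes("digitalis", COMMON))  # []
```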



Subject: RE: Tech: Google NGram Viewer
From: Jack Campin
Date: 20 Dec 10 - 09:47 PM

You can see there's a problem with this by searching for "god" (I tried everything since 1500). There shouldn't be dramatic spikes in that - general trends, but not huge sudden oscillations. But there are. The corpus can't be big enough.
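
One way to put a number on the corpus-size worry is the textbook standard error of a proportion. The sketch below assumes every word in a year's books is an independent draw with the same chance `p` of being the search word, which real books certainly are not, so it shows only how fast pure sampling noise shrinks as the corpus grows. It does not establish what actually causes the spikes. The function name and the example figures are made up.

```python
import math

def standard_error(p, total_words):
    """Standard error of an observed word frequency, assuming each of
    total_words is an independent draw with probability p (a simplification)."""
    return math.sqrt(p * (1 - p) / total_words)

p = 0.001  # made-up true frequency: one word in a thousand
for total in (10_000, 1_000_000, 100_000_000):
    se = standard_error(p, total)
    print(f"{total:>11,} words: {p:.4f} +/- {se:.6f} ({se / p:.1%} of the value)")
```

Under this simplified model, a hundred times more text cuts the noise by a factor of ten, so thinly covered early years would be expected to look jumpier than later ones.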



Subject: RE: Tech: Google NGram Viewer
From: JohnInKansas
Date: 20 Dec 10 - 09:56 PM

I don't know what she may have in mind. She hasn't written much of anything "creative"*** that I know of since she took a writing class in college. She does read lots of cheap trashy novels (Elizabeth Peters, Anne Perry, Tim Harrison, Jonathan Kellerman are some on her bookshelf if somebody knows who they are). I suppose I might read a couple of them to see what ideas she's getting, but I get bored with all the heaving bosoms and such if they're not heaving right at me.

The suggestion appears above: "With a little more massaging of the data..." I believe I saw a link down at the bottom of the page that said you can download the database they're using for the plots/searches, so a clever person might be able to download and "personalize" it to suit more specific interests.
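
A minimal sketch of that kind of "personalizing" might look like the following. It assumes the downloaded files are plain text with one tab-separated record per line, where the first three fields are the word, the year, and the count for that year. That layout is an assumption to check against whatever Google actually publishes, and the sample rows, `load_counts`, and `peak_year` are invented for the example.

```python
import io
from collections import defaultdict

# Made-up sample rows in the ASSUMED layout: word <TAB> year <TAB> count.
SAMPLE = (
    "shanty\t1899\t120\n"
    "shanty\t1900\t150\n"
    "chanty\t1900\t30\n"
    "shanty\t1901\t140\n"
)

def load_counts(lines, wanted):
    """Collect {word: {year: count}} for the words in wanted."""
    counts = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        word, year, count = fields[0], int(fields[1]), int(fields[2])
        if word in wanted:
            counts[word][year] = count
    return counts

def peak_year(series):
    """Year with the highest count in a {year: count} mapping."""
    return max(series, key=series.get)

counts = load_counts(io.StringIO(SAMPLE), {"shanty", "chanty"})
print(peak_year(counts["shanty"]))  # 1900
```

For a real file, `io.StringIO(SAMPLE)` would be replaced by an opened file object. Raw counts like these would still need dividing by each year's total word count before years can be compared fairly.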

*** ... except for the little book she almost keeps track of her checks and debit card charges in.

John



Subject: RE: Tech: Google NGram Viewer
From: JohnInKansas
Date: 20 Dec 10 - 10:02 PM

"You can see there's a problem with this by searching for "god" ..."

The Dictionary of Ancient Deities that I picked up recently claims to list 10,000+ names that have been used for her. You probably can't expect them to be too accurate at tracking someone with that many aliases.

John


All original material is copyright © 1998 by the Mudcat Café Music Foundation, Inc. All photos, music, images, etc. are copyright © by their rightful owners. Every effort is taken to attribute appropriate copyright to images, content, music, etc. We are not a copyright resource.