Google Books Ngram Viewer Cheat Sheet

Page history last edited by Alan Liu 9 years, 6 months ago

Cheat Sheet of Parameters That Can Be Set for the Ngram Viewer

(excerpted and quoted with adaptation from About Google Ngram Viewer)

 

  • Wildcard search ("search phrase *")
    • When you put a * in place of a word, the Ngram Viewer will display the top ten substitutions. For instance, to find the most popular words following "University of", search for "University of *".
  • Inflection search ("search phrase_INF")
    • An inflection is the modification of a word to represent various grammatical categories such as aspect, case, gender, mood, number, person, tense and voice. You can search for inflections by appending _INF to a word in an ngram. For instance, searching "book_INF a hotel" will display results for "book", "booked", "books", and "booking".
  • Part-of-speech Tags
    • ("searchword_VERB")  verb
    • ("searchword_NOUN")  noun
    • ("searchword_ADJ")  adjective
    • ("searchword_ADV")  adverb
    • ("searchword_PRON")  pronoun
    • ("searchword_DET")  determiner or article
    • ("searchword_ADP")  an adposition: either a preposition or a postposition
    • ("searchword_NUM")  numeral
    • ("searchword_CONJ")  conjunction
    • ("searchword_PRT")  particle
    • ("_ROOT_")  root of the parse tree. This tag must stand alone, as must _START_ and _END_; it cannot be appended to a word.
    • Example: Consider the word tackle, which can be a verb ("tackle the problem") or a noun ("fishing tackle"). You can distinguish between these different forms by appending _VERB or _NOUN: "tackle_VERB", "tackle_NOUN".
    • The most frequent part-of-speech tags for a word can be retrieved with the wildcard functionality. For example, query "cook_*".
  • Stand-alone usage of Part-of-speech tags (a tag from the list above used on its own in the format "_TAG_")
    • For example, you can use the _DET_ tag to search for "read a book," "read the book," "read that book," "read this book," and so on with the single query "read _DET_ book".
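The two tag formats above (a tag appended to a word, and a tag standing alone in place of a word) can be sketched as a pair of small string helpers. This is only an illustration: the function names `tagged` and `standalone` are hypothetical, and the code builds query text without contacting the Ngram Viewer.

```python
# Part-of-speech tags from the list above that can be attached to a word.
WORD_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON",
             "DET", "ADP", "NUM", "CONJ", "PRT"}


def _check(tag: str) -> str:
    """Normalize a tag to upper case and reject tags not in the list."""
    tag = tag.upper()
    if tag not in WORD_TAGS:
        raise ValueError(f"unknown part-of-speech tag: {tag}")
    return tag


def tagged(word: str, tag: str) -> str:
    """Hypothetical helper: restrict a word to one part of speech."""
    return f"{word}_{_check(tag)}"


def standalone(tag: str) -> str:
    """Hypothetical helper: a tag that stands in for any word of that type."""
    return f"_{_check(tag)}_"


print(tagged("tackle", "verb"))          # tackle_VERB
print(f"read {standalone('det')} book")  # read _DET_ book
```

Tags are upper-cased before use because the query syntax shown in the examples writes them in capitals.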
  • Start and End of Sentences ("_START_") ("_END_")  
    • The Ngram Viewer tags sentence boundaries, allowing you to identify ngrams at the starts and ends of sentences with the _START_ and _END_ tags. For example: "_START_ President Lincoln".
  • Dependency Relations ("mainword=>dependentword")
    • Sometimes it helps to think about words in terms of dependencies rather than patterns. Let's say you want to know how often tasty modifies dessert. That is, you want to tally mentions of "tasty frozen dessert", "crunchy, tasty dessert", "tasty yet expensive dessert", and all the other instances in which the word tasty is applied to dessert. For that, the Ngram Viewer provides dependency relations with the => operator: "dessert=>tasty".
  • Root Word in Sentence ("_ROOT_=>searchword")
    • Every parsed sentence has a _ROOT_. Unlike other tags, _ROOT_ doesn't stand for a particular word or position in the sentence. It's the root of the parse tree constructed by analyzing the syntax; you can think of it as a placeholder for what the main verb of the sentence is modifying. So here's how to identify how often will was the main verb of a sentence: "_ROOT_=>will". This will return results in which "will" is the main verb, as in the sentence "Larry will decide.", but not "Larry said that he will decide.", since will isn't the main verb of the latter sentence.
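The direction of the => operator is easy to get backwards: in both examples above, the word being modified sits on the left and the modifier on the right. A minimal sketch that makes the order explicit follows; the helper names `depends` and `main_verb` are hypothetical and only assemble query strings.

```python
def depends(head: str, modifier: str) -> str:
    """Hypothetical helper: query for `modifier` modifying `head`."""
    return f"{head}=>{modifier}"


def main_verb(word: str) -> str:
    """Hypothetical helper: query for `word` attached to the sentence root."""
    return depends("_ROOT_", word)


print(depends("dessert", "tasty"))  # dessert=>tasty
print(main_verb("will"))            # _ROOT_=>will
```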
  • Ngram Compositions
    • The Ngram Viewer provides five operators that you can use to combine ngrams: +, -, /, *, and :.
      • + sums the expressions on either side, letting you combine multiple ngram time series into one.
      • - subtracts the expression on the right from the expression on the left, giving you a way to measure one ngram relative to another. Because users often want to search for hyphenated phrases, put spaces on either side of the - sign.
      • / divides the expression on the left by the expression on the right, which is useful for isolating the behavior of an ngram with respect to another.
      • * multiplies the expression on the left by the number on the right, making it easier to compare ngrams of very different frequencies. (Be sure to enclose the entire ngram in parentheses so that * isn't interpreted as a wildcard.)
      • : applies the ngram on the left to the corpus on the right, allowing you to compare ngrams across different corpora.
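One way to read the four arithmetic operators is as year-by-year operations on frequency series. The sketch below assumes that reading and uses made-up frequency values purely for illustration; it is not the Ngram Viewer's implementation, and the function names are hypothetical.

```python
from typing import List

Series = List[float]  # one relative frequency per year


def add(a: Series, b: Series) -> Series:
    """(a + b): combine two ngram series into one."""
    return [x + y for x, y in zip(a, b)]


def subtract(a: Series, b: Series) -> Series:
    """(a - b): measure one ngram relative to another."""
    return [x - y for x, y in zip(a, b)]


def divide(a: Series, b: Series) -> Series:
    """(a / b): ratio of two series; NaN where the divisor is zero."""
    return [x / y if y != 0 else float("nan") for x, y in zip(a, b)]


def scale(a: Series, k: float) -> Series:
    """((a) * k): multiply a series by a number."""
    return [x * k for x in a]


# Made-up values for two ngrams over two years (not real corpus data).
first = [0.5, 0.25]
second = [0.25, 0.25]

print(add(first, second))       # [0.75, 0.5]
print(subtract(first, second))  # [0.25, 0.0]
print(divide(first, second))    # [2.0, 1.0]
print(scale(first, 4))          # [2.0, 1.0]
```

Scaling is the operation that helps most when one ngram is far rarer than another, since it brings both curves into the same visible range.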
  • Searching inside Google Books
    • Below the graph, we show "interesting" year ranges for your query terms. Clicking on those will submit your query directly to Google Books. Note that the Ngram Viewer is case-sensitive, but Google Books search results are not.
  • Corpus Selection ["(searchword:eng_2012)", "(searchword:fre_2012)", etc.]
    • The : corpus selection operator lets you compare ngrams in different languages, or American versus British English (or fiction), or between the 2009 and 2012 versions of our book scans. Here's chat in English versus the same unigram in French: (chat:eng_2012) versus (chat:fre_2012)
    • Corpora: Below are descriptions of the corpora that can be searched with the Google Books Ngram Viewer. All corpora were generated in either July 2009 or July 2012; we will update these corpora as our book scanning continues, and the updated versions will have distinct persistent identifiers. Books with low OCR quality and serials were excluded.
| Informal corpus name | Shorthand | Persistent identifier | Description |
|---|---|---|---|
| American English 2012 | eng_us_2012 | googlebooks-eng-us-all-20120701 | Books predominantly in the English language that were published in the United States. |
| American English 2009 | eng_us_2009 | googlebooks-eng-us-all-20090715 | |
| British English 2012 | eng_gb_2012 | googlebooks-eng-gb-all-20120701 | Books predominantly in the English language that were published in Great Britain. |
| British English 2009 | eng_gb_2009 | googlebooks-eng-gb-all-20090715 | |
| Chinese 2012 | chi_sim_2012 | googlebooks-chi-sim-all-20120701 | Books predominantly in simplified Chinese script. |
| Chinese 2009 | chi_sim_2009 | googlebooks-chi-sim-all-20090715 | |
| English 2012 | eng_2012 | googlebooks-eng-all-20120701 | Books predominantly in the English language published in any country. |
| English 2009 | eng_2009 | googlebooks-eng-all-20090715 | |
| English Fiction 2012 | eng_fiction_2012 | googlebooks-eng-fiction-all-20120701 | Books predominantly in the English language that a library or publisher identified as fiction. |
| English Fiction 2009 | eng_fiction_2009 | googlebooks-eng-fiction-all-20090715 | |
| English One Million | eng_1m_2009 | googlebooks-eng-1M-20090715 | The "Google Million". All are in English with dates ranging from 1500 to 2008. No more than about 6000 books were chosen from any one year, which means that all of the scanned books from early years are present, and books from later years are randomly sampled. The random samplings reflect the subject distributions for the year (so there are more computer books in 2000 than 1980). |
| French 2012 | fre_2012 | googlebooks-fre-all-20120701 | Books predominantly in the French language. |
| French 2009 | fre_2009 | googlebooks-fre-all-20090715 | |
| German 2012 | ger_2012 | googlebooks-ger-all-20120701 | Books predominantly in the German language. |
| German 2009 | ger_2009 | googlebooks-ger-all-20090715 | |
| Hebrew 2012 | heb_2012 | googlebooks-heb-all-20120701 | Books predominantly in the Hebrew language. |
| Hebrew 2009 | heb_2009 | googlebooks-heb-all-20090715 | |
| Spanish 2012 | spa_2012 | googlebooks-spa-all-20120701 | Books predominantly in the Spanish language. |
| Spanish 2009 | spa_2009 | googlebooks-spa-all-20090715 | |
| Russian 2012 | rus_2012 | googlebooks-rus-all-20120701 | Books predominantly in the Russian language. |
| Russian 2009 | rus_2009 | googlebooks-rus-all-20090715 | |
| Italian 2012 | ita_2012 | googlebooks-ita-all-20120701 | Books predominantly in the Italian language. |
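The shorthands and persistent identifiers listed above follow a regular pattern, with English One Million as the one exception. The sketch below derives an identifier from a shorthand under that assumption and wraps an ngram in the : corpus selection syntax. The pattern is inferred only from the rows listed here, and the names `persistent_id` and `in_corpus` are hypothetical. The function does not check that a corpus actually exists, so a shorthand absent from the list would still produce a plausible-looking identifier.

```python
# Release dates taken from the identifiers listed above.
RELEASE_DATES = {"2009": "20090715", "2012": "20120701"}


def persistent_id(shorthand: str) -> str:
    """Derive the persistent identifier for a corpus shorthand (inferred pattern)."""
    if shorthand == "eng_1m_2009":
        # The one listed corpus that does not follow the "-all-" pattern.
        return "googlebooks-eng-1M-20090715"
    name, _, year = shorthand.rpartition("_")
    if not name or year not in RELEASE_DATES:
        raise ValueError(f"unrecognized corpus shorthand: {shorthand}")
    return f"googlebooks-{name.replace('_', '-')}-all-{RELEASE_DATES[year]}"


def in_corpus(ngram: str, shorthand: str) -> str:
    """Hypothetical helper: apply an ngram to a corpus with the : operator."""
    return f"({ngram}:{shorthand})"


print(persistent_id("eng_us_2012"))   # googlebooks-eng-us-all-20120701
print(persistent_id("chi_sim_2009"))  # googlebooks-chi-sim-all-20090715
print(in_corpus("chat", "fre_2012"))  # (chat:fre_2012)
```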
