Dictionary creation using Word Index - William Shakespeare Thesaurus cont.
 
 
  In the previous part we expanded the Shakespeare dictionary with words like unwash'd.
  But there are many other words in the *missing.csv file that could be added.
  Let’s talk about the Word Index.
  When you load a dictionary into the Editor, there is a whole section of functions on the head-word part that works with something called an Index.
 
  
 
  What is an Index? It is the list of words you may find at the end of a book, each pointing to the places in the text where that word was used.
  What does it have to do with a dictionary? Isn’t a dictionary in fact an index?
  Yes, it is. But a good thesaurus could have 50 thousand entries, and each entry has many words inside. Just think of the word said and all its synonyms like uttered
  or stated, or even a word like such (said can be used in that meaning as well).
 
  
 
  And we could have 50 thousand such bundles of related words, or even more.
  That easily gives us millions of words. Yes, many will be duplicates, but that’s how
  dictionaries are.
  The point is, if we want to do some operation on all of them, such as locating related
  words on the web, it would take a huge amount of time.
  If the software lets you do it on the entire dictionary, then after a few days you will get
  pretty bored looking at the screen fetching data, and you will just stop it. Of
  course you will get some enhancements to the dictionary, but you will have no idea
  what they were. They could be random words that you don’t even care about.
  That’s why the Index is there. The Index points to a very small part of the dictionary, but
  also the part we are interested in the most.
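
  To get a feel for the difference, here is a rough back-of-the-envelope sketch in Python. Only the 50 thousand entries come from the text. The average bundle size, the time per web lookup and the index size are our own assumptions, picked purely for illustration, so treat the result as an order of magnitude and nothing more.

```python
# Only ENTRIES comes from the article; the other figures are assumptions.
ENTRIES = 50_000            # thesaurus entries
WORDS_PER_ENTRY = 20        # assumed average number of words in one entry
SECONDS_PER_LOOKUP = 1.0    # assumed time for a single web lookup
INDEX_WORDS = 8_000         # assumed index size (a few thousand words)


def lookup_hours(word_count, seconds_per_lookup):
    """Hours needed to look up every word once, one after another."""
    return word_count * seconds_per_lookup / 3600


whole_dictionary_days = lookup_hours(ENTRIES * WORDS_PER_ENTRY, SECONDS_PER_LOOKUP) / 24
index_hours = lookup_hours(INDEX_WORDS, SECONDS_PER_LOOKUP)

print(f"Whole dictionary: about {whole_dictionary_days:.1f} days")
print(f"Index only: about {index_hours:.1f} hours")
```

  With these made-up numbers the whole dictionary keeps the machine busy for well over a week, while the index is done in an afternoon.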
  So what goes into the Index and how do you get one?
  Well, it could be a few words or it could be a few thousand words. And you don’t need to
  prepare such an index manually. Loading an entire book will populate the index with a few
  thousand unique words. (A normal novel could have 7000-9000 unique words.)
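
  How a loaded book turns into a list of unique words can be pictured with a few lines of Python. This is only a sketch of the idea, not the way the Editor does it internally. The function name build_index, the lower-casing and the rule for what counts as a word are all our assumptions.

```python
import re


def build_index(text: str) -> list[str]:
    """Return the unique words of a text, lower-cased and sorted."""
    # Assumption: a word is a run of letters, optionally joined by
    # apostrophes, so that forms like unwash'd stay in one piece.
    words = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return sorted(set(words))


sample = "He swooned, unwash'd and weary. He swooned again."
index = build_index(sample)
print(index)
# ['again', 'and', 'he', 'swooned', "unwash'd", 'weary']
```

  Eight words of text give six index words here, because he and swooned each appear twice. A whole novel shrinks in the same way.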
 
 
  I think you may see the point already. Working on a thematic thesaurus, such as a Shakespearean Thesaurus, we can load the works of William
  Shakespeare as the Index and thus limit the operations to the words we are really interested in.
 
  
 
  But let’s not go that far. We already have a lot of the dictionary done for us by the
  Dictionary Synthetizer, which does a more complete shakedown of the words.
  We still have the *unknown* words text file. As you remember, there are two
  parts to it: the No Synonyms part and the Unknown words part. In the previous chapter
  we talked about the Unknown words part - the words for which no dictionary could
  give you a good answer. But there was a good chunk of other words that are
  missing from our Thesaurus.
 
 
  We can look at the unknown.csv file again and import it into the Wordsheet. This time, let’s select the bottom part (unknown/misspelled words) and delete it so
  that we keep only the top part.
 
 
  The top part has words that are recognized, but for which no good synonyms were found. This
  could be a great candidate for our dictionary index!
  The bottom part has words that the lexicon didn’t recognize at all. We could use that part
  as well, in theory, but the chances of fetching synonyms are slim. Hence we will work on
  the top part. Export it to a new csv file.
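
  If you are curious what such an exported word list looks like to a program, here is a small Python sketch that reads one. We assume the simplest possible layout, one word in the first column of each row. The real export may carry more columns, and the function name read_index_words is ours, not the Editor’s.

```python
import csv
import io


def read_index_words(csv_file) -> list[str]:
    """Collect the first cell of every non-empty row, skipping repeats."""
    seen = set()
    words = []
    for row in csv.reader(csv_file):
        # Blank lines come back as empty rows; ignore them.
        if not row or not row[0].strip():
            continue
        word = row[0].strip().lower()
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


# A tiny in-memory stand-in for the exported csv file.
demo = io.StringIO("swooning\nbeseech\n\nswooning\n")
index_words = read_index_words(demo)
print(index_words)
# ['swooning', 'beseech']
```

  For a real file you would pass open("your-file.csv", newline="", encoding="utf-8") instead of the in-memory demo.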
 
  
  
  
 
  With our combined dictionary (from the first part) loaded in 
  the Dictionary Editor, load the doctored csv file as our index.
 
 
  With the Index loaded, we can now use the Previous/Next buttons to browse through
  the index. And to see that it works: the dictionary should give us no result for these words -
  they were thrown out, after all.
 
  
  
 
  Now let’s see what we can do. Press the Lookup 
  Single entry button.
  This will fetch results from various places on the 
  web.
  But do not save this! 
 
 
  Before you do anything else, click the Delete Entry button a few times to delete all the entries we just got (or use Delete Head-word). Why? Because this is
  not at all what we want to do! We don’t want to put swooning (a Shakespearean word) as the head-word and then fainting (a non-Shakespearean word) as its
  synonym. That would be a translation dictionary, not a thematic thesaurus. We want just the opposite!
 
  
 
  One of the functions on the toolbar says Reverse Index Lookup. And that is exactly what we need. It
  grabs a word from the Index, then finds its synonyms and reverses them. It puts those synonyms in as
  head-words (fainting) and the Index word (‘swooning’) as the synonym entry.
 
 
  In general, this requires a bit of a leap of faith. We assume that those synonyms and head-words are reversible. This would not always be true for general
  terms, but here, the words we loaded as the Index are words that were already thrown out as not very common. This is where the function works best - on an
  index that is very specific to the theme. If you use it on a very generic index, you will get very questionable entries, since for generic terms the
  head-word and synonym are not always reversible, sometimes giving ridiculous results.
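
  The reversal itself is easy to picture in code. The Python sketch below is our own illustration of the idea, not the Editor’s implementation. The function reverse_index_lookup, the find_synonyms parameter and the tiny FAKE_WEB table that stands in for the web lookup are all hypothetical, and the synonyms in it are made up for the example.

```python
from collections import defaultdict


def reverse_index_lookup(index_words, find_synonyms):
    """Turn 'index word -> synonyms' into 'synonym (head-word) -> index words'."""
    reversed_entries = defaultdict(list)
    for index_word in index_words:
        for synonym in find_synonyms(index_word):
            # Skip a word listed as its own synonym and avoid repeats.
            if synonym != index_word and index_word not in reversed_entries[synonym]:
                reversed_entries[synonym].append(index_word)
    return dict(reversed_entries)


# Hypothetical stand-in for the web lookup; the data is invented.
FAKE_WEB = {"swooning": ["fainting", "collapsing", "fainting"]}

result = reverse_index_lookup(["swooning"], lambda word: FAKE_WEB.get(word, []))
print(result)
# {'fainting': ['swooning'], 'collapsing': ['swooning']}
```

  Note that swooning never becomes a head-word here. It only shows up as a synonym under the modern words, which is exactly the direction a thematic thesaurus needs.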
   
  Let’s stay on the word swooning and use this option, then answer No when the question about rewinding the index pops up, because we want to
  work only on this word. After the word is processed, stop the fetching of other words and then go back to ‘swooning’
  by searching for it.
 
 
  Wait, it is not there! And that’s a good thing, because we didn’t want swooning as a head-word. Now search for fainting.


  Swooning is right there, and not just once. Now we need to clean it.


  There is a generic Clean Up button that takes care of duplicate entries but leaves swooning
  under both verbs and nouns. Or we can use Deep Clean, which also cleans duplicate entries across
  word types, leaving us (in this case) with a single entry.


  With this knowledge, we could let the Reverse Index Lookup run on the entire index. It will take time to fetch all the data, so the best option is
  probably to let it run overnight.


  It is quite possible that you created a grandiose thematic thesaurus, only to ruin it by accident with entries that do not belong there. And suddenly
  William Shakespeare gives you suggestions like ‘electronic’ or ‘iPhone’.
  No problem, you don’t have to start all over.


  It has happened to us a few times too.
  “Remove all synonyms that are NOT in index” is exactly the tool to fix that.
  In the case of our Shakespeare Thesaurus, we would load the entire “Complete works of
  William Shakespeare” as our index and let it proceed. It would take a bit of time, but it
  would remove any wrong words, yet still leave the thesaurus searchable by modern-term
  head-words.


  Knowing that you can fix an over-zealous thesaurus later, you can refine certain parts of the dictionary without too much planning. In the end you
  can always cut off the wrong words.


  To be continued…
 
 
  
 