Dictionary creation using Word Index - William Shakespeare Thesaurus cont.
In the previous part we expanded the Shakespeare dictionary with words like unwash'd.
But there are many other words in the *missing.csv file that could be added.
Let’s talk about the Word Index.
When you load a dictionary into the Editor, there is a whole section of functions on the head-word part that works with something called an Index.
What is an Index? It is the list of words you may find at the end of a book, each pointing to the places in the text where that word was used.
What does it have to do with a dictionary? Isn’t a dictionary in fact an index?
Yes, it is. But a good thesaurus could have 50 thousand entries, and each entry has many words inside. Just think of the word said and all its synonyms
like uttered and stated, or even a word like such (said can be used in that meaning as well).
And we could have 50 thousand such bundles of related words, or even more.
That easily gives us millions of words. Yes, many will be duplicates, but that’s
how dictionaries are.
The point is, if we want to do some operation on all of them, such as locating
related words on the web, it would take a huge amount of time.
If the software lets you do it on the entire dictionary, then after a few days you
will get pretty bored looking at the screen fetching data, and you will just
stop it. Of course you will get some enhancements to the dictionary, but you
will have no idea what they were. They could be random words that you
don’t even care about.
That’s why the Index is there. The Index points to a very small part of the
dictionary, but it is also the part we are most interested in.
So what is the Index and how do you get one?
Well, it could be a few words or it could be a few thousand words. And you don’t
need to prepare such an index manually. Loading an entire book will populate the
index with a few thousand unique words. (A normal novel could have 7000-9000
unique words.)
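To make the idea concrete, here is a minimal sketch of how a list of unique words could be pulled out of a text. It is not how the Editor does it internally; the function name build_index and the word-splitting rule are assumptions made for illustration only.

```python
import re

def build_index(text):
    """Return the sorted unique words of a text, lower-cased.

    Assumption: a word is a run of letters, optionally joined by
    apostrophes, so forms like "unwash'd" stay in one piece.
    """
    words = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return sorted(set(words))

# A tiny stand-in for a whole book.
sample = "To be, or not to be, that is the question: unwash'd hands."
index = build_index(sample)
print(len(index), index)
```

Repeated words such as "to" and "be" are counted once, which is why a whole novel collapses to only a few thousand index words.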
I think you may see the point already. Working on a thematic thesaurus, such as a Shakespearean Thesaurus, we can load the works of
William Shakespeare as the Index and thus limit the operations to the words we are really interested in.
But let’s not get that far. We already have a lot of the dictionary done for us
by the Dictionary Synthetizer, which does a more complete shakedown of
the words.
We still have the *unknown* words text file. As you remember, there are
two parts to it, the No Synonyms part and the Unknown words part. In the
previous chapter we talked about the unknown words part - the words
for which no dictionary could give you a good answer. But there was a good
chunk of other words that are missing from our Thesaurus.
We can look at the unknown.csv file again and import it into the Wordsheet. This time, let’s select the bottom part (unknown/misspelled words) and
delete it, so we keep only the top part.
The top part has words that are recognized, but for which no good synonyms were found.
This could be a great candidate for our dictionary index!
The bottom part has words that the lexicon didn’t recognize at all. We could use
that part as well, in theory, but the chance of fetching synonyms is slim. Hence
we will work on the top part. Export it to a new csv file.
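The same split can be pictured in a few lines of code. This is only a sketch under loud assumptions: the lexicon is a plain set of known words, the word lists are made up, and the exported csv holds one word per row. The real file layout and the Wordsheet export format may differ.

```python
import csv
import io

def split_missing(words, lexicon):
    """Split words into (recognized, unrecognized) by lexicon membership."""
    recognized = [w for w in words if w.lower() in lexicon]
    unrecognized = [w for w in words if w.lower() not in lexicon]
    return recognized, unrecognized

def to_csv(words):
    """Write one word per row and return the csv text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for w in words:
        writer.writerow([w])
    return buf.getvalue()

lexicon = {"swooning", "visage"}            # hypothetical lexicon
missing = ["swooning", "visage", "xqzt"]    # hypothetical file contents
top, bottom = split_missing(missing, lexicon)
index_csv = to_csv(top)                     # only the top part becomes the index
```

Here top plays the role of the part we keep, and bottom is the part we delete.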
With our combined dictionary (from the first part) loaded in the Dictionary Editor,
load the doctored csv file as our index.
With the Index loaded, we can now use the Previous/Next buttons to browse
through the index. And to check that it works: the dictionary should give us no result
for these words - they were thrown out, after all.
Now let’s see what we can do. Press the Lookup Single entry button.
This will fetch results from various places on the web.
But do not save this!
Before you do anything, let’s just click the Delete Entry button a few times to delete all the entries we just got. (Or use Delete Head-word.) Why?
Because this is not at all what we want to do! We don’t want to put swooning (a Shakespearean word) as the head-word and then fainting (a
non-Shakespearean word) as a synonym. That would be a translation dictionary, not a thematic thesaurus. We want just the opposite!
One of the functions on the toolbar says Reverse Index Lookup. And that is exactly what we
need. It grabs a word from the Index, then finds its synonyms and reverses them. It puts those
synonyms as head-words (fainting) and the Index word (‘swooning’) as the synonym entry.
In general, this requires a bit of a leap of faith. We assume that those synonyms and head-words are reversible. This would not always be true
for general terms, but here, the words we loaded as the Index are words that were already thrown out as not very common. This is where the
function works best - on an index that is very specific to the theme. If you use it on a very generic index, you will get very
questionable entries, since for generic terms the head-word and synonym are not always reversible, sometimes giving ridiculous results.
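The reversal itself can be sketched in a few lines. This is an illustration of the idea, not the Editor’s actual code: the function name reverse_index_lookup, the find_synonyms callback and the sample data are all assumptions, and the web lookup is replaced by a small stand-in table.

```python
from collections import defaultdict

def reverse_index_lookup(index_words, find_synonyms):
    """Make each found synonym a head-word and the index word its entry."""
    thesaurus = defaultdict(list)
    for word in index_words:
        for synonym in find_synonyms(word):
            # Skip self-references and avoid adding the same word twice.
            if synonym != word and word not in thesaurus[synonym]:
                thesaurus[synonym].append(word)
    return dict(thesaurus)

# Stand-in for the web lookup (hypothetical data).
SAMPLE = {"swooning": ["fainting"]}

def fake_lookup(word):
    return SAMPLE.get(word, [])

result = reverse_index_lookup(["swooning"], fake_lookup)
print(result)
```

Note that swooning never becomes a head-word here; it only appears as an entry under the modern word, which is the behaviour we are after.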
Let’s stay on the word swooning and use this option. When the question about rewinding the index pops up, answer No, because we
want to work only on this word. After the word is processed, stop the fetching of other words and then go back to ‘swooning’
by searching for it.
Wait, it is not there!
And that’s a good thing, because we didn’t want swooning as a head-word.
Now search for fainting.
Swooning is right there, and not just once. Now we need to clean it.
There is a generic Clean Up button that takes care of multiple entries, but it leaves
swooning in both verbs and nouns. Or we can use Deep Clean, which also cleans
multiple entries across the word types, leaving us with only a single entry (in this case).
With this knowledge, we could let the Reverse Index Lookup run on the entire index. It will take time to fetch all the data, so the best
option is probably to let it run overnight.
It is quite possible to create a grandiose thematic thesaurus, only to ruin it by accident with entries that do not belong there.
And suddenly William Shakespeare gives you suggestions like ‘electronic’ or iPhone.
No problem, you don’t have to start all over.
It did happen a few times to us too.
“Remove all synonyms that are NOT in index” is exactly the tool to fix that.
In the case of our Shakespeare Thesaurus, we would load the entire “Complete works
of William Shakespeare” as our index and let it proceed. It would take a bit of
time, but it would remove any wrong words, yet still leave the thesaurus searchable by
modern head-words.
Knowing that you can fix an over-zealous thesaurus later, you can refine certain parts of the dictionary without too much planning. In
the end you can always cut off the wrong words.
To be continued…