Dictionary creation using CSV - William Shakespeare Thesaurus
This is intended for word-geeks who want to look more in-depth at creating a specific dictionary (thesaurus) using CSV format.
Let’s just do it by following an example.
We are going to make a better Shakespearean thesaurus than we get by just loading TXT book.
First, we will start by downloading the The Complete Works of William Shakespeare from project Gutenberg in Plain Text UTF-8 format.
Open the txt file, it could be in notepad or in CQuill (Cquill will add it to the project) and remove any text that is not part of the book - that
is the beginning “This eBook is for the use of anyone anywhere…” because we don’t want Shakespearean dictionary to know words like
ebook, or ‘electronic’ and then also from the end which is a long paragraph about project Gutenberg. “*** END OF THE PROJECT
GUTENBERG EBOOK”, because again, we don’t want Will to talk like 21st century lawyer.
Now save the file (or export from CQuill to txt (Document Export As…)
The standard, easy method is to go to Dictionaries - Synthetize Thesaurus and we will start with that. This would automatically fill the
dictionary with as many words as it could guess.
At this moment, there is not much to it, just load the *.txt book you just cleaned in previous step and wait a few minutes. The system will
churn the words, and when it is finished, look at the bottom. Check the Save Unknown words to txt file, and then hit Save and name your
dictionary. For example myShakespeare.dic. The AI word cruncher created 51 thousand entries with still yet about 12 thousand unknown
words. (Your numbers may be different, as there may be improvements in the dictionary over time)
Saving will create the dictionary *.dic file and also save a file called
MyShakespeare_unknown.csv.txt
This is a txt file formated as standard csv - a comma separated table. You
can open it in notepad, or hit the Open button on the dialog and it will open
the file for you.
The “unknown” text file consist of two parts
No Synonyms part and Unknown/misspelled words following it.
The ‘no synonyms’ part are words that we know for sure to be English words. Some
are names and they would not have synonyms. Other are a various archaic forms.
But none of those are our concern right now.
The bottom part of the TXT file does list words that are completely unknown to
Vocabulary Synth, either misspelled, or in this case many would be specific to how
Shakespeare wrote. Like unwash’d or anchor’d.
The weird format you see is the CSV format. Except it is missing the head word,
hence it starts with a comma.
So line ,(),anchor’d means:
missing_head_word,(missing_POS_type),anchord’d
Note: The second item in parenthesis is POS type. JJ means Adjective.
In general, POS types for the dictionary can be:
•
(noun) or (NN) - Noun
•
(verb) or (VB) - Verb)
•
(adj) or (JJ) - Adjective
•
(adv) or (RB) - Adverb
In order to fill the missing head-word we would have to write in the document:
unwashed,(VB),unwash'd
You can certainly do it in the notepad, but CQuill has a pretty awesome spreadsheet document. So why don’t we fire it up! It will make things
much easier.
Create new Wordsheet, then go to menu Document and use Import CSV.
Alternatively you can just copy and paste the lines from the notepad to the Wordsheet and it will format them as a table.
So here we have a huge table (12000 or so lines - yours may vary). I don’t think we are going to manually fill all the missing entries, but
we can at least fill some by search and replace.
There seems to be quite a few shortened words where instead of -ed, are ‘d, like unwash’d
That’s easy to fix- but remember, we don’t want to replace words on the right side, These
are the Shakespearen words that we WANT. We are missing English head-words so we
need to create them in the first column. Let’s use the power of Wordsheets.
First select the whole column, by clicking on its header [C]
Now hit Ctrl+C to copy all the cells bellow. We need them in the first column, but for now, let’s just paste them temporarily to some free column
on the right side. Just select another empty column and hit CTR+V. Be patient, you are working with 20k lines!
So now we have a column where we can mess with the words and turn them into today’s English. With the [D] column still selected go to menu
Wordsheet and select Replace Selection.
We are going to replace ‘d with ed
Note the Delete cell check box. This indicates that if there is no match (the word has no
‘d in it) we also want to delete that cell. Now run it.
The text in column D will disappear, but wait! Scroll down to find the replaced items.
These are our new head words. Now we can copy the column D over to the column A where it belongs (again select the very top header [D],
then CTR+C, , select header [A], CTRL+V), then delete the column [D] as we don’t need it anymore (select the header [D] and press DEL)
Let’s just explain what we had done. The first column is the head word, that is the word you are searching the dictionary for. In
Shakespearean Thesaurus, these would be English words. The second is the POS file - the 0 you see in the table is not zero, these are
empty () parenthesis.
The few lines we successfully replaced would mean that if we search our thematic thesaurus for word like renewed, the thesaurus will offer
us renew’d
As you go through the lines you may see that replacing ‘d was not
enough. In general Shakespeare seemed displeased with e, but there
are also other short versions like: whatsoe’er
We will go about it the same way, copy the entire column C into column D as a temporary place (you can also copy it to a new Wordsheet
if this is what you prefer!) and then replace let say ‘red with ered.
But after that, how do we merge the newly fixed column [D] with the already somehow populated column [A]? That’s easy, just paste it
over as before. If the cell doesn’t exist (and we instructed the replace command to Delete cells that do no match) it won’t be pasted over.
A deleted cell does not have an empty string - it has nothing, it doesn’t exist, so it cannot be pasted over anything. Just try it, it works!
The same way we can replace ‘ry, ‘st, ‘red, ‘ring, ‘er one at the time, then merge the column with the A column. Now if you look at the A
column, there are still some words that may have ‘ in them. If you are going to replace those directly in the column A, make sure you
uncheck the Delete cell, so you won’t delete the words you already fixed before! You can manually change things here and there, delete
entries that are completely useless and then just save the CSV file by going to Document - Export CSV.
You may have noticed that we didn’t care about the missing POS type and that’s because if it is not specified (or a bogus entry, like empty
parenthesis we have right now), the import process will try to determine the type itself.
You could also experiment trying to replace eth with es. For example accuseth > accuses, but you should look at the results, because it
would also replace words like Elizabeth and mess up with other words.
Let’s go back to our Synthetize Thematic Thesaurus. Here we will use the Import CSV button, but before that, let’s talk about the options:
Unlike the TXT import, the CSV has a few peculiar options that don’t
say much. To be honest, they were mostly added when we were
building some of our thesauri (plural of thesaurus apparently).
If nothing else, it could be a note for future ourselves.
CSV Import options:
Normal
It creates the headword, and fill the entry with the other words on the line in CSV file - so this option creates one entry per line in the
dictionary. No trickery.
The CVS line: happy,(JJ), jolly, merry
Creates:
•
happy,(JJ), jolly, merry
Symmetrical self-reference
The idea is that if the word happy has synonym jolly and merry, then both merry and jolly should have synonym happy. This works in many
cases, unless the entry is a bit free spirited. For example the 1911 Roget’s thesaurus that we used as part of our General Thesaurus has
only some 1400 entries, but each entry has ten, twenty or sometimes more words in them. Symmetrical reference would magically create
40000 entries out of it. Sadly, many would be pretty big stretch, mostly because the 1911 thesaurus takes synonyms more as “somehow”
similar words, but not entirely.
Hence, this option should be used if you can vouch for the word’s in entry (or for example if there is only one entry like in our case. Both
whatsoe’er and whatsoever are obviously interchangeable.
The CVS line: happy,(JJ), jolly, merry
Creates:
•
happy,(JJ), jolly, merry
•
jolly,(JJ),happy
•
merry,(JJ),happy
Cyclical Reference
Now, you may be saying: shouldn’t jolly also have merry as synonym? Yes and no. In this case it would work fine and the entry : jolly,(JJ),
happy, merry would work. But there are many cases where this wouldn’t be true. The words in the entry are synonym to the head-word,
sometimes even symmetrical, but not all of them will be synonyms to each other. We are not going to look through 40 thousand entries, are
we?
This option would creates gigantic cross-referenced thesaurus with hundred thousand of words in it, but it would be a pretty sorry
thesaurus, giving you synonyms that would require a suspension of belief.
Now this option still exist, because there is a valid reason, for example if you are building a Rhyming dictionary. If the headword rhymes
with the entry words, so should all the entries in some way rhyme with each other.
Normally doing something like this would create a huge file (we are talking about Gigabytes), and that would be of no use, so there is
another trick - Pointer Reference. It basically creates only one entry per line, but then it goes word by word through the entries and creates
a fake - pointer only - headword that points to the original entry.
The CVS line: happy,(JJ), jolly, merry
Creates:
•
happy,(JJ), jolly, merry
•
jolly,(JJ), >[see happy]
•
merry,(JJ),>[see happy]
The file will be very small, but hugely criss-cross-referenced. The downside is that editing would be nearly impossible as most of the entries
just point somewhere else. There isn’t much more use besides the Rhyming dictionary, to be honest.
Reverse dictionary
This is like the symmetrical self reference, except it never creates the original entry:
The CVS line: happy,(JJ), jolly, merry
Creates:
•
jolly,(JJ),happy
•
merry,(JJ),happy
Why we would ever want something like that? Yes we do, and a lot of times! Imagine you are building a thematic thesaurus, and let’s take
Shakespeare as an example:
•
haggard,(JJ), wild, unmanageable, untrainable
Now you see, building thematic thesaurus with CSV like that actually makes zero sense. If we search for haggard, we will get synonyms
wild, unmanageable, untrainable, but that is not what we want at all. That would be a translation from Shakespearean to English. We want
to type ‘wild’ and see Shakespearean ‘haggard’. So Reverse Dictionary.
In fact we actually don’t want word unmanageable or untrainable as the synonyms in the dictionary at all, because Shakespeare never used
these words (He did use ‘wild’ though with many other meanings).
Hence the first entry where wild, unmanageable and untrainable is on the right side, like written above will and should NEVER be created in
a thematic dictionary, and if you need to have wild there, then you need to define it like so: wild,(JJ),wanton, flighty, frivolous and use
Reverse Dictionary option.
AI Expand Head Word
Here is a bit of trickery, let’s go back to our original CSV file and pick one entry
•
imprisoned,(JJ),imprison'd
The thing is, there are many other words that word imprison’d is synonym to (like jailed) which we don’t have in the CSV. But using this
option will create them anyway!
It will create entries like:
•
imprisoned,(JJ),imprison'd
•
jailed,(JJ),imprison’d
•
incarcerated,(JJ),imprison’d
Granted, we are using a bit of leap of faith here, but in general it works, and creates much more robust thesaurus for free. Let’s just try
Normal at first with our CSV file:
Click the Import from CSV and select the file we created in previous step.
So we got good 2719 entries (head-words) from our big CSV. As you
remember, most of the lines were empty without the head-word so they
don’t get imported.
This time it took a bit longer but we somehow created 24 thousand entries.
What!?
In fact the process created 145 thousand entries, but then many entries
were merged and cleaned.
We do use leap of faith here, hoping that the synonyms to the head-words
are reasonable. In most cases they are, but there could be questionable
items. You can’t have free lunch though.
Let’s try AI Expand Head Word option
Ok, let’s test test it. We can do it still in the same window.
Just type word “horrible” into the box and hit search.
As you can see, we have a pretty interesting hits.
If you tried the same word in the previous (Normal) option, you will get
zero hits. Of course we never had word ‘horrible’ in CSV file. So we’ve got
a lot for a very little using some trickery.
You may try some other words - mostly verbs as that is what we have in
the CSV file. Yes, there will be misses here and there, but we’ve got an
entire thesaurus build from almost nothing.
Wait, that’s actually not our thesaurus! Didn’t we use the Complete Works
of William Shakespeare to create the thesaurus? Because we are playing
right now only with the unrecognized words!
Hit the save button and select CQuill Partial Dictionary File *.dcc as the type.
The reason for *.dcc instead of *.dic is that dcc files won’t be seen by CQuill as
viable dictionaries to show in the Thesaurus selection. And that’s what we want,
this is still a partial dictionary.
Otherwise, they are the same file format.
Let’s sum up where we are:
We have myShakespeare.dic file created from the Complete works of William Shakespeare. These are words that the system
recognized. Then we took the *unknown file and exctracted a few thousand words that we could fix quickly without going through them
manually and we created shakespeare-extra.dcc
Now we need to put them together. At this moment I would suggest to make copy of the myShakespeare.dic as it may come in handy.
For the next part we use our Dictionary Editor
Fine, let’s save it
Once there, hit the Load button
and load the myShakespeare.dic
Now we can hit Merge button and then locate the dcc file we created from the CSV words to Merge the CSV created dictionary with this
one.
Now our dictionary grew with additional entries and much more words in those entries.
The same way we can add other words, filling the missing entries in the *unknown file, or creating a whole
new CSV file from words we can find online.
The result dictionary will be created
with William Shakespeare words, but
only the ones we recognize today.
The process will omit any unknown
words (and in the case of
Shakespeare, that would be a good
40%!)
Tip: To scroll faster through 20k lines, hold down Shift while turning the mouse wheel.
This dictionary has already a lot’s of words. Let’s try a few:
Now, let’s try again. And here we go - the entry ‘funny’ got a few new
words like crook’d.
Despite the fact that we didn’t fill much of the 20K
unknown words, the Shakespearean thesaurus is
already shaping up. But of course we can fix it
even further by trying to locate the missing
keywords from web.
To be continued….
We will try to refine the dictionary using Index, Web Crawler in the next chapter
We could already test the dictionary and type a few words in the search entry to see the
suggestions.
These would be all words that Shakespeare would use, but they are no surprises.
This is because the Vocabulary Synth doesn’t fully understand the way Shakespeare wrote and
as you can see, a good chunk (12k) words were thrown in the Unknown file. These would be
the typically peculiar words, that we could, at least partially, sneak in back to the thesaurus.