Monday, February 19, 2018

Syntax Char RNN for Context Encoding


Syntax Char RNN attempts to enhance naive Char RNN by encoding syntactic context along with character information. The result is an algorithm that, in selected cases, learns faster and delivers more interesting results than naive Char RNN. The relevant cases appear to be those that allow for more accurate parsing of the text.

This blog post describes the general idea and some findings, and links to the code.


As both a writer and a technologist I have for some time now been interested in the ability to programmatically generate language that is at once creative and meaningful. Two of my previous projects in this context are Poetry DB and Poem Crunch. I also wrote a novel that incorporates words and phrases generated by a Char RNN that had been trained on the story's text.

Andrej Karpathy's now-famous article on RNNs was a revelation when I first read it. It proved that Deep Learning can generate text in ways that at first appear almost magical. It has afforded me a lot of fun ever since.

However, in the context of generating meaningful text and language creatively, it ultimately falls short.

It is helpful to remember that Char RNN is essentially an autoencoder. Given a particular piece of text, let's say this blog post, training will build a model that, if fully successful, will be able to reproduce the original text exactly: it will generate exact copies of the original text from which it learned and created the model.

The reason for Char RNN's widespread employment in fun creative projects is its ability to introduce novelty by either tuning the temperature hyperparameter or, more commonly, as a side effect of imperfect learning.
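As a rough illustration of what the temperature hyperparameter does, the sketch below rescales a made-up set of output scores before turning them into probabilities. The function name and the example scores are assumptions for illustration only; they are not taken from the project's code.

```python
import math

def softmax_with_temperature(logits, temperature=0.8):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next characters.
logits = [2.0, 1.0, 0.1]
cool = softmax_with_temperature(logits, temperature=0.5)
warm = softmax_with_temperature(logits, temperature=1.5)
```

A lower temperature concentrates probability on the most likely character, so samples stay closer to the training text. A higher temperature spreads probability out, which introduces more novelty and more errors.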

To be sure, imperfect learning is the norm rather than the exception. For any text beyond a certain level of complexity, a naive Char RNN will reach a point during training when it can no longer improve its model.

This naturally leads to the question, can the Char RNN algorithm be enhanced?

Context encoding

Char RNN encodes individual characters, and the sequence of encodings can be learned using for example LSTM units to remember a certain length of sequence. Aside from the relative position of the character encodings, the neural network has no further contextual information to help it 'remember'.

What would happen if we added other contextual information to the character encodings? Would it learn better?

Parts of Speech

Parts of Speech are structural parts of sentences and a fairly intuitive candidate for the problem at hand. Although POS parsing hasn't always been very accurate, SyntaxNet and spaCy have been setting new benchmarks in recent times. Even so, accuracy might still be a problem (more on that later), but they certainly hold promise.

So how does POS parsing fit into Char RNN?

Let's take a look at the following sentence and its constituent parts.

Bob   bakes a  cake

We can see that the 'a' in 'bakes' and the 'a' in 'cake' are contextually different. The first is part of a verb and the second is part of a noun. If we were able to encode the character and POS together, for each character across the whole text, we would cover a sequence longer than is practical for an LSTM to remember. In other words, the model would understand syntactical structure in a more generic sense than with naive Char RNN.

[space] + SPACE
b + VERB
a + VERB
k + VERB
e + VERB
s + VERB
[space] + SPACE
a + DET
[space] + SPACE
c + NOUN
a + NOUN
k + NOUN
e + NOUN
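The pairing listed above can be sketched in a few lines of Python. The tokens here are tagged by hand and the function name is hypothetical; a real pipeline would take its tags from a POS parser.

```python
# Hand-tagged tokens for illustration; a parser would normally supply these.
tagged_tokens = [("bakes", "VERB"), ("a", "DET"), ("cake", "NOUN")]

def char_pos_pairs(tokens):
    """Expand word-level POS tags into one (character, tag) pair per character."""
    pairs = []
    for word, tag in tokens:
        pairs.append((" ", "SPACE"))  # the space that precedes each word
        pairs.extend((ch, tag) for ch in word)
    return pairs

pairs = char_pos_pairs(tagged_tokens)
for ch, tag in pairs:
    label = "[space]" if ch == " " else ch
    print(f"{label} + {tag}")
```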

One way of encoding character and POS together is to create, for each character, a new composite unit that captures both the character and the POS type. So if we create separate encodings for the characters and the POS categories, e.g. a = 1, b = 2, etc. and NOUN = 1, VERB = 2, etc., then we could do something along the lines of:

a + VERB = 1 + 2 = 3

However, this creates a new problem: duplicates. We'd end up with lots of cases that share the same final encoding (in this example, 3):

a + VERB = 1 + 2 = 3
b + NOUN = 2 + 1 = 3

A better solution would have to ensure the encodings are completely separate before sorting them back into consecutive indexes to ensure uniqueness.
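A minimal sketch of the duplicate problem and one possible fix follows. The id tables, the stride trick, and the function names are all assumptions made for illustration; the project's code may do this differently.

```python
# Toy id tables, following the numbering used in the text.
char_ids = {"a": 1, "b": 2, "k": 3}
pos_ids = {"NOUN": 1, "VERB": 2}

def additive_code(ch, tag):
    """Naive composite: different pairs can collide on the same sum."""
    return char_ids[ch] + pos_ids[tag]

def separated_code(ch, tag):
    """Shift the character id past the whole POS range so codes never overlap."""
    stride = max(pos_ids.values()) + 1
    return char_ids[ch] * stride + pos_ids[tag]

def consecutive_index(pairs):
    """Sort the separated codes and renumber them 0..n-1."""
    codes = sorted({separated_code(ch, tag) for ch, tag in pairs})
    rank = {code: i for i, code in enumerate(codes)}
    return {pair: rank[separated_code(*pair)] for pair in set(pairs)}

pairs = [("a", "VERB"), ("b", "NOUN"), ("a", "NOUN"), ("k", "VERB")]
index = consecutive_index(pairs)
```

With these toy tables, `additive_code` gives 3 for both `a + VERB` and `b + NOUN`, while `consecutive_index` assigns each distinct pair its own index.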

But the real problem with this solution is that, although we now have an encoding influenced by both characters and types, we've lost each unit's individual quality. In other words, the importance of a character encoded along with one type of POS unit is no longer properly weighted against a character of a different type of POS unit. Instead, it has simply become a composite type of its own.

An improved approach would be to encode both the character and the type as independent data, albeit of the same unit, and let the LSTM do the rest.

A Syntax Char RNN tensor might then look as follows:

[[ char, type ], [ char, type ] ... ]

However, pairing every character with a type heavily favours the type encoding over the character encoding, which in turn will skew the weightings.

A more balanced encoding might be:

[[ char, char, char, char, type ], [ char, char, char, type ] ... ]
              word 1                          word 2

A sense of unevenness remains, because some words are longer than others: why should each word receive just one type encoding? This is something left as a refactoring improvement for later.

For the time being, experimentation showed that the best results came from adding two type encodings per word, as follows:

[[ char, char, char, char, type, type ], [ char, char, char, type, type ] ... ]
                  word 1                              word 2
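The layout above could be produced along the following lines. This is only a sketch: the `<TAG>` marker format, the function names, and the omission of spaces are assumptions for illustration. Characters and type markers share one index space, in keeping with the idea of independent data of the same unit.

```python
TYPE_REPEATS = 2  # two type encodings per word, as described in the text

def interleave(tagged_tokens, type_repeats=TYPE_REPEATS):
    """Flatten tagged words into one stream: the characters, then the repeated tag."""
    stream = []
    for word, tag in tagged_tokens:
        stream.extend(word)                           # one symbol per character
        stream.extend([f"<{tag}>"] * type_repeats)    # hypothetical type marker
    return stream

def build_vocab(stream):
    """One shared index space for characters and type markers."""
    return {symbol: i for i, symbol in enumerate(sorted(set(stream)))}

tagged = [("bakes", "VERB"), ("cake", "NOUN")]
stream = interleave(tagged)
vocab = build_vocab(stream)
encoded = [vocab[symbol] for symbol in stream]
```

Changing `type_repeats` is one way to experiment with how strongly the type is weighted relative to the characters of a word.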


Char RNN can generate surprising turns of phrase and novel combinations of words, but longer extracts often read like gibberish. The hope was that context encoding might improve this state of affairs by strengthening the overall sentence structure represented in the model.

The SyntaxNet installation also installs DRAGNN and a language parsing model. Due to problems I had getting consistent results from SyntaxNet, I eventually settled on DRAGNN instead.



The first benchmark was based on the Tiny Shakespeare corpus. The following snippets are from checkpoints with equivalent validation loss, trained using the same hyperparameters (allowing for the proportionately longer sequence length in the Syntax Char RNN due to the additional type encodings).

Naive Char RNN (temperature: 0.8)

with him; there were not high against the nurse,
and i, as well as half of his brother should prove,
thou ask my father's power,
in this life of the world command betwixt
of greep and displent, rup in manrown:
and thou dost command thy stors; and take our woes,
that star in the sea is well.
 Now is a cunning four gloves of all violer on
himself and my friend traitor's jointure by us
to be holy part that were her horse:' the miles
for this seat with me in the island from scards
shall have stone your highness' weech with you.
 And unjust while i was born to take
with hardness from my cousin when i forget from me.

the unrelery, reign'd with a virtuous tongue,
to blush of his harms, and as sweet, if they
cape of england's purple say well true song hence,
shall appetite straight hath the law with mine?
 The composition should know thy face of my heart!

 Second huntsman:
i know him, and with my chast my mother.

Syntax Char RNN (DRAGNN parser; temperature: 0.8)

ITrue words, his son, fair despair is, and the guard brother; 
always you tell, 
though to see so many trees; i come!
 would you have no redress of joy of the march?
I, what i come up.
 i not a gentlemen cred; and as this old farewell.
 sir, she's a duke.
 so straight is the tyrant, shape, madness, he's weigh; 
which of the confail that e'er said and gentle sick ear together, 
we will see the backs.

Note: Syntax Char RNN output has been reformatted, but remains otherwise unaltered

I think it is easy to agree that the naive Char RNN generated text reads significantly better. There is a somewhat interesting punchiness to Syntax Char RNN's shorter dialogue sections, but that's about all it has in its favour.

This was, frankly, disappointing.

However, there is a reasonable chance that inaccurate parsing could be influencing the results. The DRAGNN model probably doesn't generalise well to Shakespearian English.

Would prose offer a better benchmark?

Jane Austen

The works of Jane Austen were used next. They would almost certainly parse more accurately.

The results this time were rather surprising. Syntax Char RNN raced away, reducing its loss pretty quickly. After one hour and fifteen minutes on my laptop CPU, Syntax Char RNN hit a temporary minimum of 0.771.

After the same time frame and with the same hyperparameters (again allowing for a slight adjustment of sequence length due to the extra type encodings for POS), naive Char RNN went as low as 1.025 - still nowhere near the Syntax Char RNN checkpoint.

I left naive Char RNN running overnight, and it still reached only 0.944 after just over 8 hours.

This was interesting.

What about the quality of the generated text?

Naive Char RNN (loss: 0.944; temperature: 0.8):

"i beg young man, nothing of myself, for i have promised to be whole 
that his usual observations meant to rain to yourself, married more word
--i believe i am sure you will allow that they were often communication 
of it, but that in that in as he was in the power of a hasty announte 
by print,  to have given me my friend; but at all," said lady russell, 
so then assisted to  his shall there must preferred them very ill--
considertaryingly, very pleasant, for my having forming fresh person's 
honour that you bowed really satisfied we go.

Syntax Char RNN (DRAGNN parser; loss: 0.771; temperature: 0.8)

Mr. Darcy was quite unrescrailous of place, which and want to carry 
by the attractions and fair at length; and if harriet's acknowledge by 
engaging it for a sorry day he must produce ithim. I could be 
mr. Knightley's compray of marrying in the rest of the disposition, 
and i was particularly successful. He has been much like her sister 
to julia, i wishin she believed he may want pause by the room with 
the same shade. "" but indeed you had obliged to endeavour to concern 
of a marriage of his yielding.""

The results are roughly comparable, and neither is special. If pushed, I'd say I prefer the latter over the former, as it reads a little better.


spaCy is an amazing set of tools made available by the good folks at Explosion AI. Unlike SyntaxNet, or even DRAGNN, it is a breeze to use.

The spaCy parser's data had an interesting effect on training. A training run with DRAGNN data reached a loss of 0.708 after just under 6 hours, then failed to go lower for the rest of its total run of over 16 hours. The spaCy parser achieved 0.686 after 5.5 hours, and its best loss of 0.650 after just under 8 hours.

Here are snippets from the relevant checkpoints.

Syntax Char RNN (spaCy parser; loss: 0.686; temperature: 0.8)

You beg your sudden affections and probability, that they could not 
be extended to herself, but his behaviour to her feelings were 
very happy as miss woodhouse--she was confusion to every power 
sentence, and it would be a sad trial of bannet, but even a great 
hours of her manners and her mother was not some partiality of 
malling to her from her something considering at mr. Knightley, 
she spoke for the room or her brother.

Syntax Char RNN (spaCy parser; loss: 0.650; temperature: 0.8)

In the room, when mrs. Elton could be too bad for friendship by it 
to their coming again. 
She was so much as she was striving into safety, and he knew 
that she had settled the moment which she said," i can not belong 
at hints, and where you have, as possible to your side, 
i never left you from it in devonshire; and if i am rendered 
as a woman," said emma," they are silent," said elizabeth," 
because he has been standing to darcy, and marianne?"" oh! 
No, no, well," said fitzwilliam, when the subject was for a week, 
that no time must satisfy her judgment.

Note: Lines have been wrapped, but formatting remains otherwise unaltered

Both are quite readable, except for the injudicious use of quote marks (a problem that is most likely the result of redundant spaces picked up during pre-processing).

Even allowing for over-fitting, it is quite clear that in the case of Jane Austen's text the snippets generated by Syntax Char RNN are more readable and cohesive than those from naive Char RNN. Among the former, those produced via spaCy also show a marked improvement over the results produced via DRAGNN parsing.

Since the only significant difference between the Syntax Char RNN runs trained on Jane Austen texts was the data from the two different parsers, these findings suggest that the difference in parsing accuracy between DRAGNN and spaCy likely accounts for the difference in performance and readability between the runs. This in turn suggests that a lack of accurate parsing accounts for the poor results achieved with Tiny Shakespeare.


The code is available on GitHub. Comments and suggestions welcome.

The majority of the work was around pre-processing (parse and prepare). For training and sampling I was able to build on the existing PyTorch Char RNN by Kyle Kastner (which in turn credits Kyle McDonald, Laurent Dinh and Sean Robertson). I altered the script interface, but the core process remains largely the same.

Concerns and Caveats

The approach and implementation aren't ideal. Below are some considerations.

  1. Pre-processing is complex. It has to make assumptions about the parser input, bring character and syntax encoding together, and try to remove data that can skew the weightings.
  2. Pre-processing can be slow. It can take anything from a few seconds to tens of minutes, depending on the size and complexity of the file.
  3. POS parsing is imperfect. The results suggest spaCy is doing a better job than DRAGNN, but even at its best it will have errors.
  4. Applicability is limited to text that is at least consistently parseable by an available parser. Poetry, archaic language, social media messages etc. likely fall outside this scope.
  5. Some POS units are known to cause skewing by introducing extra spaces. For example, "Mary's" becomes:
       's    : PART++POS
     All punctuation is separated out as well, for example:
       , : PUNCT++,
     The effect is that these parts of speech remain tokenised when encoded, resulting in redundant spaces on one side of each token. Unless they are subsequently removed, spaces become overrepresented in the resulting encoding, affecting weightings for all character representations ever so slightly. The code manages to remove some of these redundant spaces - with occasional side effects - but not all.
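As an illustration of the clean-up described in item 5, redundant spaces could be removed with a couple of regular expressions. This is a hypothetical sketch rather than the project's actual pre-processing code, and the list of clitics is an assumption.

```python
import re

def tidy_spaces(text):
    """Re-attach clitics and punctuation that tokenisation split off."""
    # Join possessives and contractions back onto the preceding word.
    text = re.sub(r"\s+('s|n't|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    # Remove the space in front of closing punctuation.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    return text

print(tidy_spaces("Mary 's cake , which Bob baked , was good ."))
```

Rules like these are blunt. Quote marks in particular cannot be re-attached reliably without knowing whether they open or close a quotation, which may explain some of the side effects mentioned above.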

To Do

Avenues to investigate, features to add.
  1. Reduce the number of redundant space encodings
  2. Calculate a more granular weighting for type
  3. Investigate further candidates for context encoding, over and above syntax
  4. Investigate more elegant ways of grouping encoding contexts
  5. Add validation loss for comparison to training loss
  6. Estimate parser accuracy for a specific text
  7. Run on GPUs!


The findings suggest that in some cases where parsing is accurate and consistent, Syntax Char RNN trains faster and achieves better results than naive Char RNN. This lends support to the hypothesis that accurate contextual encodings, over and above syntactic Parts of Speech, can improve Char RNN's autoencoding.

While they come with several caveats, the findings nonetheless warrant further experimentation and clarification.

Sunday, January 01, 2017

2016 - The Year in Books

2016. There may never be another year during which I read so many of the Great Classic Novels for the first time. Let me list them: War and Peace by Leo Tolstoy, The Brothers Karamazov by Fyodor Dostoyevsky, Ulysses by James Joyce, Don Quixote by Miguel de Cervantes Saavedra, Moby Dick by Herman Melville, and The Mill on the Floss by George Eliot. I also chucked in a few of the great plays for good measure: Hamlet, King Lear, and Twelfth Night (this one I'd read before, and it remains a favourite), all by William Shakespeare.

As a bonus, I also had the chance to read two of the most beautiful and startling philosophical treatises: On Liberty by John Stuart Mill and On the Genealogy of Morals by Friedrich Nietzsche.

But back to the novels. It is difficult to do justice to any of those great works individually, let alone all of them. Their collective influence on arts and culture in the West is practically immeasurable.

The epic scope and narrative invention of War and Peace are legendary, but they are even more breathtaking when actually read. The array of characters, the depth of their characterisation and the movements of history combine to provide rich nourishment for the soul, and reveal the sophistication and nobility of the Russian spirit.

Moby Dick was a real surprise for its intellectual ambition. One expects adventure on the high seas, and instead is given something much more: the enterprising American spirit as seen at once through its cultural links to Europe and Britain (Shakespeare looms large) and forging its own way, expanding, pondering the nature of its own spirit.

Ulysses is a juggernaut of linguistic invention and deliberate intellectual playfulness. It is perhaps the least accessible of these great classics, and perhaps also the most divisive, but its intellectual rewards are great and in a sense it remains ahead of the times.

But it is Don Quixote for which I want to reserve the most emphatic recommendation, in part because I believe it is the most easily overlooked, and too readily dismissed as antiquated or irrelevant. It is not. It is unique among nearly all of the great classics for being truly, laugh-out-loud funny. More than 400 years have not dimmed the humour. How much funnier still it must have seemed to contemporaneous ears who understood the subtler references that are lost to time and translation.

Don Quixote is not only funny, but also full of pathos. The main character centres in himself something of both the ridiculous and the sublime, and while we are treated to the former most of the time, the shape of the latter emerges over time, especially in Volume 2.

Personally, I found Volume 2 to be even better than Volume 1. Its latter two thirds are as funny as anything in Volume 1, and yet it also treats of more serious matters. I particularly marvelled at and appreciated the story's innovative reference to characters' knowledge of the first volume, published ten years before it. This is an ingenious device that seems more at home in the 20th or even the 21st century than in a novel from the early 17th century. If there can be any doubt that Don Quixote is inventive and linguistically imaginative, this fact alone should dispel it at once.

It is a pity that English readers (myself included) cannot appreciate the full craftiness of the language at work, in particular the contrast between the deluded knight errant's Old Castilian and his compatriots' modern Spanish.

All the other classics seem to take themselves a bit too seriously when we compare them to Don Quixote, and it is only when placed next to Shakespeare that we find a similar use of comic devices in great literature.

2016 marked 400 years since Shakespeare died, and all year long his works were commemorated with performances that are set to continue well into the New Year and beyond. How many of us knew that 2016 also marked 400 years since the passing of Miguel de Cervantes, the author of Don Quixote?

Remedy that neglect immediately, and place Don Quixote on your reading list!

2016: A Torch Gone Out

Let's wipe away 2016, but first, let's set the record straight. Was it really such a bad year? Such a sad year? Yes! It's not just the celebrities who passed away - although that had a lot to do with it.

Think about it: a terrible war in Syria, thousands upon thousands of refugees, threats of terrorism, and the sense that politics was slowly turning on its head: first Brexit, then Trump. With these undercurrents churning in our collective unconsciousness, a bit as if the poles were slowly switching, suddenly many of our cultural icons passed away. In a weird and distorted fashion, they must have seemed like the visible casualties of a known but unseen undercurrent. Vulnerable heroes who were unable to bear up any longer.

Or another sign of the uncertainty of our collective future. The old guard, whose hopes can no longer sustain this new world, leaving us to work it out.

Either way don't believe the statistics. It's not about the numbers. It's the context as much as the individual stories.

First David Bowie died. Pop stars' cultural reach is almost unparalleled, but David Bowie wasn't just a Justin Bieber or a Lady Gaga. He changed the rules of pop. Among pop stars he was an immortal.

And then there's Prince. And George Michael. Losing both of them is more than a mere annual tally of statistics. As individual stars they are almost peerless. If you speak to those who came of age in the 80s and ask them to pick their top 10 male solo artists, the triumvirate of Michael Jackson, Prince, and George Michael will make almost every list. In fact, many might pick them in their top 5, maybe even their top 3. Jackson passed away in 2009; now we've lost the other two. How is that not traumatic?

So let's forget the whole statistical mumbo jumbo. There are simply not that many David Bowies, Princes, and George Michaels to go around.

And dare I mention one more hidden knot in this already knotted ball: Freddie Mercury. Many are still mourning the man who died 25 years ago. Who can forget David Bowie and George Michael singing for him? Feeling the pressure, anyone?


We haven't even mentioned Leonard Cohen yet. Sure, his reach wasn't as broad as those pop singers', but on a song-by-song basis it went deeper. Cohen's was an intimate art. His poetic approach ensured that he touched the soul. His career spanned six decades. Who is left alive from that generation, a singer songwriter in the same league? A few, perhaps: Bob Dylan, Joni Mitchell, Neil Young, Paul Simon. Not many.

I can't speak for others, but there was certainly a feeling of "too soon" in the passing of many other beloved actors and cultural pioneers: Alan Rickman, Zaha Hadid, and yes, Carrie Fisher. And a sense of disbelief that some of the names and faces who have been around ever so long should have gone away: Zsa Zsa Gabor, Gene Wilder, Nancy Reagan, Richard Adams.

There is no doubt, however, that it is the wider political and social unease that has amplified the significance of those passing. And this confluence of factors means that 2016 really felt like the moment when a torch went out.

The old guard are passing the gate, and those of us left behind can only wonder: where do we go from here?