Monday, February 19, 2018

Syntax Char RNN for Context Encoding


Syntax Char RNN attempts to enhance naive Char RNN by encoding syntactic context along with character information. The result is an algorithm that, in selected cases, learns faster and delivers more interesting results than naive Char RNN. The relevant cases appear to be those that allow for more accurate parsing of the text.

This blog post describes the general idea, some findings, and a link to the code.


As both a writer and a technologist I have for some time now been interested in the ability to programmatically generate language that is at once creative and meaningful. Two of my previous projects in this context are Poetry DB and Poem Crunch. I also wrote a novel that incorporates words and phrases generated by a Char RNN that had been trained on the story's text.

Andrej Karpathy's now-famous article on RNNs was a revelation when I first read it. It proved that Deep Learning can generate text in ways that at first appear almost magical. It has afforded me a lot of fun ever since.

However, in the context of generating meaningful text and language creatively, it ultimately falls short.

It is helpful to remember that Char RNN is essentially an autoencoder. Given a particular piece of text, let's say this blog post, training will build a model that, if fully successful, will be able to reproduce the original text exactly: it will generate exact copies of the text from which it learned.

The reason for Char RNN's widespread employment in fun creative projects is its ability to introduce novelty, either by tuning the temperature hyperparameter or, more commonly, as a side effect of imperfect learning.
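To illustrate the temperature knob: it scales the network's output logits before the softmax, so lower temperatures sharpen the distribution towards the model's top choice, while higher ones flatten it and admit more novelty. A minimal sketch in plain Python, not the post's actual code:

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits scaled by 1/temperature.

    Low temperatures sharpen the distribution towards the top choice;
    high temperatures flatten it, introducing more novelty.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_with_temperature(logits, temperature=1.0):
    """Sample a character index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    r = random.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1
```

The samples later in this post all use a temperature of 0.8, a common compromise between coherence and surprise.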

To be sure, imperfect learning is the norm rather than the exception. For any text beyond a certain level of complexity, a naive Char RNN will reach a point during training when it can no longer improve its model.

This naturally leads to the question: can the Char RNN algorithm be enhanced?

Context encoding

Char RNN encodes individual characters, and the sequence of encodings can be learned using, for example, LSTM units, which can remember sequences up to a certain length. Aside from the relative position of the character encodings, the neural network has no further contextual information to help it 'remember'.

What would happen if we added other contextual information to the character encodings? Would it learn better?

Parts of Speech

Parts of Speech are structural parts of sentences and a fairly intuitive candidate for the problem at hand. Although POS parsing hasn't always been very accurate, SyntaxNet and spaCy have been setting new benchmarks in recent times. Even so, accuracy might still be a problem (more on that later), but they certainly hold promise.

So how does POS parsing fit into Char RNN?

Let's take a look at the following sentence and its constituent parts.

Bob bakes a cake

We can see that the 'a' in 'bakes' and the 'a' in 'cake' are contextually different. The first is part of a verb and the second is part of a noun. If we were able to encode the character and POS together, for each character across the whole text, we would capture structural information spanning sequences longer than is practical for an LSTM to remember. In other words, the model would understand syntactic structure in a more general sense than naive Char RNN does.

[space] + SPACE
b + VERB
a + VERB
k + VERB
e + VERB
s + VERB
[space] + SPACE
a + DET
[space] + SPACE
c + NOUN
a + NOUN
k + NOUN
e + NOUN
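A pairing like the one above can be produced by a small helper. The sketch below is illustrative plain Python: the POS tags are hand-assigned here, whereas the real pipeline would take them from a parser such as DRAGNN or spaCy.

```python
def char_pos_pairs(tagged_words):
    """Expand (word, pos) pairs into per-character (char, pos) units,
    inserting a (' ', 'SPACE') unit between words."""
    units = []
    for i, (word, pos) in enumerate(tagged_words):
        if i > 0:
            units.append((' ', 'SPACE'))
        for ch in word:
            units.append((ch, pos))
    return units

# Hand-tagged example sentence, for illustration only
pairs = char_pos_pairs([('Bob', 'NOUN'), ('bakes', 'VERB'),
                        ('a', 'DET'), ('cake', 'NOUN')])
```

Each 'a' in the output now carries its word's syntactic role: ('a', 'VERB'), ('a', 'DET') and ('a', 'NOUN') are distinct units.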

One way of achieving this is, for each character, to create a new composite unit that captures both the character and the POS type. So if we create separate encodings for the characters and the POS categories, e.g. a = 1, b = 2, etc. and NOUN = 1, VERB = 2, etc., then we could do something along the lines of:

a + VERB = 1 + 2 = 3

However, this creates a new problem, namely one of duplicates: we would end up with lots of cases that share the same final encoding (in this example, the final encoding is 3):

a + VERB = 1 + 2 = 3


b + NOUN = 2 + 1 = 3

A better solution would have to keep the character and type encodings completely separate, then sort the combined units back into consecutive indexes to ensure uniqueness.
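One way to guarantee that uniqueness, sketched below, is to index each distinct character-and-type pair directly rather than adding the two separate encodings (the names are illustrative, not the repo's code):

```python
def build_joint_vocab(units):
    """Map each distinct (char, pos) pair to its own consecutive index,
    so that 'a'+VERB and 'b'+NOUN can never collide."""
    vocab = {}
    for unit in units:
        if unit not in vocab:
            vocab[unit] = len(vocab)
    return vocab

units = [('a', 'VERB'), ('b', 'NOUN'), ('a', 'NOUN'), ('a', 'VERB')]
vocab = build_joint_vocab(units)
```

As the next paragraph notes, though, this still collapses character and type into a single composite symbol.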

But the real problem with this solution is that, although we now have an encoding influenced by both characters and types, we've lost each unit's individual quality. In other words, the importance of a character encoded along with one type of POS unit is no longer properly weighted against a character of a different type of POS unit. Instead, it has simply become a composite type of its own.

An improved approach would be to encode both the character and the type as independent data, albeit of the same unit, and let the LSTM do the rest.

A Syntax Char RNN tensor might then look as follows:

[[ char, type ], [ char, type ] ... ]

However, this heavily favours the type encoding over the character encoding, which in turn will skew the weightings.
A more balanced encoding might be:

[[ char, char, char, char, type ], [ char, char, char, type ] ... ]
              word 1                          word 2

A sense of unevenness remains, because some words are longer than others: why should each word receive just one type encoding? This is something left as a refactoring improvement for later.

For the time being, experimentation showed that the best results came from adding two type encodings per word, as follows:

[[ char, char, char, char, type, type ], [ char, char, char, type, type ] ... ]
                  word 1                              word 2
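The interleaving above can be sketched as follows. This is a simplified illustration (plain Python, hypothetical names); the real pre-processing also handles spaces, punctuation and the final conversion to integer indexes.

```python
def encode_words(tagged_words, type_tokens_per_word=2):
    """Flatten (word, pos) pairs into a character stream, appending the
    word's POS type token(s) after its characters."""
    sequence = []
    for word, pos in tagged_words:
        sequence.extend(list(word))                  # the characters
        sequence.extend([pos] * type_tokens_per_word)  # the type tokens
    return sequence

seq = encode_words([('bakes', 'VERB'), ('a', 'DET')])
```

The `type_tokens_per_word` parameter makes it easy to experiment with the character-to-type balance discussed above.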


Char RNN can generate surprising turns of phrase and novel combinations of words, but longer extracts often read like gibberish. The hope was that context encoding might improve this state of affairs by strengthening the overall sentence structure represented in the model.

The SyntaxNet installation also installs DRAGNN and a language parsing model. Due to problems I had getting consistent results from SyntaxNet, I eventually settled on DRAGNN instead.



Tiny Shakespeare

The first benchmark was based on the Tiny Shakespeare corpus. The following snippets are from checkpoints with equivalent validation loss, trained using the same hyperparameters (allowing for the proportionately longer sequence length in the Syntax Char RNN due to the additional type encodings).

Naive Char RNN (temperature: 0.8)

with him; there were not high against the nurse,
and i, as well as half of his brother should prove,
thou ask my father's power,
in this life of the world command betwixt
of greep and displent, rup in manrown:
and thou dost command thy stors; and take our woes,
that star in the sea is well.
 Now is a cunning four gloves of all violer on
himself and my friend traitor's jointure by us
to be holy part that were her horse:' the miles
for this seat with me in the island from scards
shall have stone your highness' weech with you.
 And unjust while i was born to take
with hardness from my cousin when i forget from me.

the unrelery, reign'd with a virtuous tongue,
to blush of his harms, and as sweet, if they
cape of england's purple say well true song hence,
shall appetite straight hath the law with mine?
 The composition should know thy face of my heart!

 Second huntsman:
i know him, and with my chast my mother.

Syntax Char RNN (DRAGNN parser; temperature: 0.8)

ITrue words, his son, fair despair is, and the guard brother; 
always you tell, 
though to see so many trees; i come!
 would you have no redress of joy of the march?
I, what i come up.
 i not a gentlemen cred; and as this old farewell.
 sir, she's a duke.
 so straight is the tyrant, shape, madness, he's weigh; 
which of the confail that e'er said and gentle sick ear together, 
we will see the backs.

Note: Syntax Char RNN output has been reformatted, but remains otherwise unaltered

I think it is easy to agree that the text generated by naive Char RNN reads significantly better. There is a somewhat interesting punchiness to Syntax Char RNN's shorter dialogue sections, but that's about all it has in its favour.

This was, frankly, disappointing.

However, there is a reasonable chance that inaccurate parsing could be influencing the results. The DRAGNN model probably doesn't generalise well to Shakespearian English.

Would prose offer a better benchmark?

Jane Austen

The works of Jane Austen were used next. They would almost certainly parse more accurately.

The results this time were rather surprising. Syntax Char RNN raced away, reducing its loss pretty quickly. After one hour and fifteen minutes on my laptop CPU, Syntax Char RNN hit a temporary minimum of 0.771.

After the same time frame and with the same hyperparameters (again allowing for a slight adjustment of sequence length due to the extra type encodings for POS), naive Char RNN went only as low as 1.025, still nowhere near the Syntax Char RNN checkpoint.

I left it running overnight and it still reached only 0.944 after just over 8 hours.

This was interesting.

What about the quality of the generated text?

Naive Char RNN (loss: 0.944; temperature: 0.8):

"i beg young man, nothing of myself, for i have promised to be whole 
that his usual observations meant to rain to yourself, married more word
--i believe i am sure you will allow that they were often communication 
of it, but that in that in as he was in the power of a hasty announte 
by print,  to have given me my friend; but at all," said lady russell, 
so then assisted to  his shall there must preferred them very ill--
considertaryingly, very pleasant, for my having forming fresh person's 
honour that you bowed really satisfied we go.

Syntax Char RNN (DRAGNN parser; loss: 0.771; temperature: 0.8)

Mr. Darcy was quite unrescrailous of place, which and want to carry 
by the attractions and fair at length; and if harriet's acknowledge by 
engaging it for a sorry day he must produce ithim. I could be 
mr. Knightley's compray of marrying in the rest of the disposition, 
and i was particularly successful. He has been much like her sister 
to julia, i wishin she believed he may want pause by the room with 
the same shade. "" but indeed you had obliged to endeavour to concern 
of a marriage of his yielding.""

The results are roughly comparable; neither is special. If pushed, I'd say I prefer the latter over the former: it reads a little better.


spaCy

spaCy is an amazing set of tools made available by the good folks at Explosion AI. Unlike SyntaxNet, or even DRAGNN, it is a breeze to use.

The spaCy parser's data had an interesting effect on training. A training run with DRAGNN data reached a loss of 0.708 after just under 6 hours, then failed to go lower for the rest of its total run of over 16 hours. The spaCy parser achieved 0.686 after 5.5 hours, and its best loss of 0.650 after just under 8 hours.

Here are snippets from the relevant checkpoints.

Syntax Char RNN (spaCy parser; loss: 0.686; temperature: 0.8)

You beg your sudden affections and probability, that they could not 
be extended to herself, but his behaviour to her feelings were 
very happy as miss woodhouse--she was confusion to every power 
sentence, and it would be a sad trial of bannet, but even a great 
hours of her manners and her mother was not some partiality of 
malling to her from her something considering at mr. Knightley, 
she spoke for the room or her brother.

Syntax Char RNN (spaCy parser; loss: 0.650; temperature: 0.8)

In the room, when mrs. Elton could be too bad for friendship by it 
to their coming again. 
She was so much as she was striving into safety, and he knew 
that she had settled the moment which she said," i can not belong 
at hints, and where you have, as possible to your side, 
i never left you from it in devonshire; and if i am rendered 
as a woman," said emma," they are silent," said elizabeth," 
because he has been standing to darcy, and marianne?"" oh! 
No, no, well," said fitzwilliam, when the subject was for a week, 
that no time must satisfy her judgment.

Note: Lines have been wrapped, but formatting remains otherwise unaltered

Both are quite readable, except for the injudicious use of quote marks (a problem that is most likely the result of redundant spaces picked up during pre-processing).

Even allowing for over-fitting, it is quite clear that in the case of Jane Austen's text the snippets generated by Syntax Char RNN are more readable and cohesive than those from naive Char RNN. Among the former, those produced via spaCy also show a marked improvement over the results produced via DRAGNN parsing.

Since the only significant difference between the Syntax Char RNN runs trained on the Jane Austen texts was the data from the two different parsers, these findings suggest that the difference in parsing accuracy between DRAGNN and spaCy likely accounts for the difference in performance and readability between the runs. This in turn suggests that inaccurate parsing accounts for the poor results achieved with Tiny Shakespeare.


The code is available on github. Comments and suggestions welcome.

The majority of the work was in pre-processing (parse and prepare). For training and sampling I was able to build on the existing PyTorch Char RNN by Kyle Kastner (which in turn credits Kyle McDonald, Laurent Dinh and Sean Robertson). I altered the script interface, but the core process remains largely the same.

Concerns and Caveats

The approach and implementation aren't ideal. Below are some considerations.

  1. Pre-processing is complex. It has to make assumptions about the parser input, bring character and syntax encoding together, and try to remove data that can skew the weightings.
  2. Pre-processing can be slow. It can take anything from a few seconds to tens of minutes, depending on the size and complexity of the file.
  3. POS parsing is imperfect. The results suggest spaCy is doing a better job than DRAGNN, but even at its best it will have errors.
  4. Applicability is limited to text that is at least consistently parseable by an available parser. Poetry, archaic language, social media messages etc. likely fall outside this scope.
  5. Some POS units are known to cause skewing by introducing extra spaces. For example "Mary's" becomes:
's    : PART++POS
All punctuation is separated out as well, for example:
, : PUNCT++,
The effect is that these parts of speech remain tokenised when encoded, resulting in redundant spaces on one side of each token. Unless they are subsequently removed, spaces become overrepresented in the resulting encoding, affecting weightings for all character representations ever so slightly. The code manages to remove some of these redundant spaces - with occasional side effects - but not all.
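A post-processing pass that re-attaches clitics and punctuation can remove some of these redundant spaces. A hypothetical sketch (the token list and regex are illustrative, not the repo's actual code):

```python
import re

def detokenize(tokens):
    """Join parser tokens, removing the redundant space the tokenizer
    leaves before clitics like 's and before closing punctuation."""
    text = ' '.join(tokens)
    # Attach possessive/clitic tokens and punctuation to the word on their left
    text = re.sub(r" ('s|n't|,|\.|;|:|!|\?)", r"\1", text)
    return text

detokenize(['Mary', "'s", 'cake', ',', 'please'])
```

A pass like this would still miss quote marks, whose left/right attachment depends on whether they open or close a quotation, which is consistent with the stray quote marks visible in the samples above.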

To Do

Avenues to investigate and features to add:
  1. Reduce the number of redundant space encodings
  2. Calculate a more granular weighting for type
  3. Investigate further candidates for context encoding, over and above syntax
  4. Investigate more elegant ways of grouping encoding contexts
  5. Add validation loss for comparison to training loss
  6. Estimate parser accuracy for a specific text
  7. Run on GPUs!


The findings suggest that in some cases where parsing is accurate and consistent, Syntax Char RNN trains faster and achieves better results than naive Char RNN. This lends support to the hypothesis that accurate contextual encodings, over and above syntactic Parts of Speech, can improve Char RNN's autoencoding.

While they come with several caveats, the findings nonetheless warrant further experimentation and clarification.