Sunday, July 22, 2018

Detecting Similarity of Textual Style and Content

similarity.py performs rudimentary detection of similarity in textual style and content. It uses the predictive capability of the pytorch-char-rnn model to check how likely each character in a provided input text is, judged against the character sequences learned by an existing trained model (trained on some other text).

The likelihoods are averaged across the provided input text to give a broad indication of how similar its style and content are to the original text on which the model was trained. The result is reported as a similarity score, expressed as a percentage (higher means more similar).
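To make the averaging step concrete, here is a minimal sketch. It is not the code in similarity.py: the function name, the `char_to_index` mapping and the `predicted_distributions` argument are all hypothetical, and it assumes the model has already produced one probability distribution per character position. Characters missing from the model's vocabulary are assumed to count as likelihood 0, which is one way texts with unfamiliar characters would end up with a lower score.

```python
def average_likelihood(text, char_to_index, predicted_distributions):
    """Mean probability assigned to the characters that actually occur.

    Hypothetical helper, not part of similarity.py.
    predicted_distributions[i] is assumed to be the model's probability
    distribution over the vocabulary for position i of `text`.
    """
    if len(text) != len(predicted_distributions):
        raise ValueError("need exactly one distribution per character")
    if not text:
        return 0.0

    total = 0.0
    for char, distribution in zip(text, predicted_distributions):
        index = char_to_index.get(char)
        # Assumption: a character the model never saw contributes 0.
        if index is not None:
            total += distribution[index]
    return total / len(text)


# Toy two-character vocabulary with made-up distributions.
vocab = {"a": 0, "b": 1}
distributions = [[0.75, 0.25], [0.5, 0.5]]
known = average_likelihood("ab", vocab, distributions)
unknown = average_likelihood("az", vocab, distributions)
```

In this toy case the text containing the out-of-vocabulary character "z" gets the lower average, because that position adds nothing to the total.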

For example, a sentence from the original modelled text should come up with high similarity, typically scoring over 97%. A text in the same language but written in a very different style might still score over 90%, though not as high.

An input text written in a totally different language should score significantly lower, e.g. 80-85%. If the two texts do not share the same set of characters, for example the Turkish alphabet compared to the Roman alphabet, the score will drop even further.

Under the hood the script actually detects variance, and then converts it to a similarity score for convenience. The lower the detected variance, the more like the original text the provided input text is.
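As an illustration of that conversion, the sketch below assumes the variance is a number between 0 and 1 and that the similarity is simply its complement expressed as a percentage. The actual formula used by similarity.py may well differ, and the function names here are hypothetical.

```python
def variance_to_similarity(variance):
    """Convert a variance in [0, 1] to a similarity percentage.

    Assumption: similarity = (1 - variance) * 100. This is only an
    illustrative mapping, not necessarily what similarity.py does.
    """
    clamped = min(max(variance, 0.0), 1.0)  # keep the result within 0..100
    return (1.0 - clamped) * 100.0


def format_similarity(variance):
    """Render the score in the same style as the script's output line."""
    return "Detected similarity: %.2f%%" % variance_to_similarity(variance)


report = format_similarity(0.25)  # made-up variance, for illustration only
```

The clamping is a defensive choice so that an out-of-range variance can never produce a score below 0% or above 100%.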

The script is provided as part of my pytorch-char-rnn repo.

Below are some examples:

Example 1: Compare English text from Jane Austen's Persuasion with a model trained on Jane Austen's fiction.

python2.7 similarity.py \
--text "Sir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who, for his own amusement, never took up any book but the Baronetage; there he found occupation for an idle hour" \
--checkpoint checkpoints/austen_checkpoint.cp \
--charfile data/austen_chars.pkl 
Parameters found at checkpoints/austen_checkpoint.cp... loading

Detected similarity: 99.15%

Example 2: Compare German text from the Bible with a model trained on Jane Austen's fiction.

python2.7 similarity.py \
--text "Am Anfang schuf Gott Himmel und Erde. Und die Erde war wüst und leer, und es war finster auf der Tiefe; und der Geist Gottes schwebte auf dem Wasser." \
--checkpoint checkpoints/austen_checkpoint.cp \
--charfile data/austen_chars.pkl 
Parameters found at checkpoints/austen_checkpoint.cp... loading

Detected similarity: 83.84%

In principle the technique can be improved by using a larger window for comparison. In other words, not just comparing character by character, but character sequence by character sequence across a moving window. A bit like an LSTM in reverse. It isn't clear whether all the information needed to make this possible is available, so I'll have to do a bit of digging around in the model's saved state.
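One simple way to picture the windowed idea, without touching the model's saved state at all, is to score each run of consecutive characters by the geometric mean of their individual likelihoods. This is purely a sketch under that assumption: the function name, the `window` parameter and the `floor` value are hypothetical, and it is a stand-in for the idea rather than the approach described above.

```python
import math


def windowed_scores(likelihoods, window, floor=1e-12):
    """Geometric mean of per-character likelihoods over a sliding window.

    Hypothetical helper. `likelihoods` is assumed to hold the probability
    the model gave to each character of the input text, in order.
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    # Work in log space; `floor` avoids log(0) for impossible characters.
    logs = [math.log(max(p, floor)) for p in likelihoods]

    scores = []
    for start in range(len(logs) - window + 1):
        mean_log = sum(logs[start:start + window]) / window
        scores.append(math.exp(mean_log))
    return scores


# Made-up likelihoods: two well-predicted characters, then two poor ones.
scores = windowed_scores([1.0, 1.0, 0.25, 0.25], window=2)
```

A geometric mean punishes a single badly predicted character within a window more heavily than a plain average would, which is closer in spirit to judging whole sequences rather than isolated characters.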

I'll leave that as an exercise for another day.