
How Experiment 2 Works

Updated 25 July 2007 (version 4)

Experiment 2, Simple Emotion Modelling, combines a statistically based classifier with a dynamical model. The Naive Bayes classifier employs single words and word pairs as features. It allocates user utterances into nice, nasty and neutral classes, labelled +1, -1 and 0 respectively. This numerical output drives a simple first-order dynamical system, whose state represents the simulated emotional state of the experiment's personification, Ditto the donkey.

Introduction

Experiment 2 was set up primarily to investigate whether an abstract concept such as pleasant or unpleasant emotional content of an utterance could be learnt by a statistical classifier using a simple feature set. It is also an experiment in affective computing.

In the original version of the experiment, a binary classifier was used. The first training set of nice and nasty utterances was selected from the log files of an online chatterbot, Mabel (now defunct), with some added utterances more relevant to the character of Ditto the donkey. Although performance with this training set was rather unsatisfactory, the experiment did perform the useful task of logging users' attempts to make Ditto happy or unhappy. These user utterances were in turn used as training examples to improve the performance. They were also employed as the basis of research into improved classification methods, which resulted in the second version of the experiment. The experiment was further refined by adding automatic guessing of unrecognised words (the third version), and subsequently by adding begin and end markers to each utterance (the fourth version, which is now online).

Cleaning

Each user utterance is first transformed into a standard form, a process we call cleaning. For the purposes of this experiment a 40-character alphabet is used: the lower-case letters a through z, the digits 0 through 9, and the four characters ? (question mark), ! (exclamation mark), ' (single quote or apostrophe) and _ (underscore, used as a more visible substitute for the space character). A number of string transformations are performed, but no spelling correction, grammar correction or other improvements are made.

In version 4, the beginning and end of an utterance are marked by the metawords _{_ and _}_ respectively.

Here are some examples of user utterances and their cleaned versions:

Original                           Cleaned
Hello Ditto. How are you today?    _{_hello_ditto_how_are_you_today_?_}_
YOUR STUPID!!!                     _{_your_stupid_!_!_!_}_
$%&@#}<[                           _{__}_
You are fantastic... NOT!          _{_you_are_fantastic_not_!_}_
do´nt...                           _{_do'nt_}_
Thats Interesting!                 _{_thats_interesting_!_}_
I"am from The Netherlands,.        _{_i'am_from_the_netherlands_}_
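The exact string transformations are not listed here, so the following Python sketch is only an approximation inferred from the examples above. It assumes that the double quote and acute accent are folded into the apostrophe, that ? and ! become separate words, and that every other character outside the 40-character alphabet acts as a separator. The function name clean is our own.

```python
import re

def clean(utterance):
    """Approximate the cleaning step; the rules are inferred from the examples, not taken from the site."""
    text = utterance.lower()
    # Assumption: quote-like characters are folded into the apostrophe.
    text = re.sub(r'["´]', "'", text)
    # Assumption: each ? or ! is split off as a word of its own.
    text = re.sub(r"([?!])", r" \1 ", text)
    # Anything else outside a-z, 0-9, ?, ! and ' is treated as a separator.
    text = re.sub(r"[^a-z0-9?!']", " ", text)
    # Join the words with underscores and add the version 4 begin and end metawords.
    return "_{_" + "_".join(text.split()) + "_}_"

print(clean("You are fantastic... NOT!"))
```

Under these assumptions the sketch reproduces all seven cleaned forms in the table, including the empty case _{__}_.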

Classification

In the first version of the experiment classification was into two classes, nice and nasty. Neutral or ambiguous utterances were forced into one class or another. The classifier used character bigrams as features, i.e. pairs of characters such as th and ck. With a 40-character alphabet there are 1,600 possible bigrams, and the feature set encompassed all of them. Despite this simplistic approach, reasonably good classification was achieved, with an accuracy approaching 0.7. (For comparison, random classification would give an expected accuracy of 0.5.) This very encouraging result spurred on the search for an improved classifier.

The current version of the experiment uses ternary classification into nice, nasty and neutral classes, represented internally by integers +1, -1 and 0 respectively. After considerable experimentation with different feature sets (variable and fixed-length substrings), it was decided to use words and word-pairs as the features. The criterion for a single word or word-pair to be included in the feature set is that it should appear more than once in the cleaned training set. This eliminates typographical errors and once-off misspellings. The features are extracted automatically and are mostly English words, correctly or incorrectly spelled.

For the purposes of this experiment, a word is defined as any sequence comprising an underscore character followed by one or more non-underscore characters, followed by an underscore character, e.g. _word_. A word-pair is defined similarly as an underscore character followed by one or more non-underscore characters, followed by an underscore character, followed by one or more non-underscore characters, followed by an underscore character, e.g. _two_words_. In version 4, the three metawords _{_, _}_ and _*_ are regarded as words. (The metawords cannot come from user input because the cleaning routine removes {, } and * characters.)
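As an illustration, the two definitions can be written directly as regular expressions. This is a minimal Python sketch, not the experiment's own code; the names WORD, WORD_PAIR and extract_features are ours. A lookahead is used so that overlapping matches (which share an underscore) are all found, and the result is a set, so only the presence of a feature is recorded, not its frequency or position.

```python
import re

# Underscore, one or more non-underscores, underscore.
WORD = re.compile(r"(?=(_[^_]+_))")
# The same, extended by a second run of non-underscores and a closing underscore.
WORD_PAIR = re.compile(r"(?=(_[^_]+_[^_]+_))")

def extract_features(cleaned):
    """Return the set of words and word-pairs found in a cleaned utterance."""
    words = WORD.findall(cleaned)
    pairs = WORD_PAIR.findall(cleaned)
    return set(words) | set(pairs)

print(sorted(extract_features("_{_you_are_}_")))
```

In the full system only those candidates that belong to the trained feature set would be kept.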

The advantage of using word-pairs as features is that a large amount of word order information is preserved in the mapping from utterances to features, information that is lost if single words are used as the only features. On the other hand, our research showed that single-word features on their own have a higher classification accuracy than word-pair features alone. By combining the two, the accuracy achieved is higher than for either set of features on its own.

A "bag-of-features" strategy is employed, with a Bernoulli event model: if a feature appears at least once in an utterance it scores 1; if not, it scores 0. This binary score is independent of the feature's frequency and position within the utterance.

The selection of training examples and their allocation to classes is done manually, so there is undoubtedly some chance that bias and human error will creep in. But this is unavoidable with supervised machine learning of subjective concepts.

Training and Testing

The following figures are correct as at July 2007, with word guessing incorporated (see below). The training set for this experiment currently comprises 7,060 examples: 2,531 nice, 2,567 nasty and 1,962 neutral utterances. The resulting feature set comprises 5,833 features. When calculating the feature probabilities for the Naive Bayes classifier from the training set, conventional Laplace smoothing is applied. Training and testing are done offline and the relevant files are then uploaded to update the online experiment.
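For readers unfamiliar with the technique, here is a minimal Python sketch of a Bernoulli Naive Bayes classifier with Laplace smoothing. It is a textbook formulation, not the experiment's actual code: the function names, the data layout (each example is a set of present features plus a label) and the unsmoothed class priors are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train(examples, features):
    """examples: list of (set of present features, label); features: set of all features."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)  # label -> feature -> number of examples containing it
    for present, label in examples:
        feature_counts[label].update(present & features)
    model = {}
    for label, n in class_counts.items():
        log_prior = math.log(n / len(examples))
        # Laplace smoothing: add 1 to the count and 2 (present/absent) to the total.
        probs = {f: (feature_counts[label][f] + 1) / (n + 2) for f in features}
        model[label] = (log_prior, probs)
    return model

def classify(model, present):
    """Return the label with the highest log posterior under the Bernoulli event model."""
    def score(label):
        log_prior, probs = model[label]
        # Every feature contributes, whether present (p) or absent (1 - p).
        return log_prior + sum(
            math.log(p if f in present else 1.0 - p) for f, p in probs.items()
        )
    return max(model, key=score)

# Toy data only; the real training set has 7,060 examples.
features = {"_nice_", "_stupid_", "_you_", "_hello_"}
examples = [
    ({"_nice_"}, 1), ({"_nice_", "_you_"}, 1),
    ({"_stupid_"}, -1), ({"_stupid_", "_you_"}, -1),
    ({"_you_"}, 0), ({"_hello_"}, 0),
]
model = train(examples, features)
print(classify(model, {"_nice_"}))
```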

Testing is performed using leave-one-out cross-validation. This means that one example is omitted and the system is trained on the 7,059 remaining examples; then the omitted example is classified by the newly trained system. The results — the program's classification and the true class, which may differ — are logged, and the train/test procedure is repeated for each of the 7,060 examples. The mean accuracy — the proportion of correct classifications — is 0.780 +/- 0.013 (99% confidence interval) in version 4.
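The procedure can be sketched in Python as follows. The train and classify arguments stand for any training and classification routines (hypothetical here), and the confidence interval is computed with the normal approximation to the binomial, which is an assumption on our part; the method actually used is not stated.

```python
import math

def leave_one_out_accuracy(examples, train, classify):
    """examples: list of (item, label); train(examples) -> model; classify(model, item) -> label."""
    correct = 0
    for i, (item, label) in enumerate(examples):
        # Train on everything except example i, then classify the omitted example.
        model = train(examples[:i] + examples[i + 1:])
        correct += classify(model, item) == label
    return correct / len(examples)

def confidence_half_width(accuracy, n, z=2.576):
    """Half-width of the interval; z = 2.576 corresponds to 99% under the normal approximation."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)

print(round(confidence_half_width(0.780, 7060), 3))
```

With an accuracy of 0.780 over 7,060 examples, this approximation gives a half-width of about 0.013, which agrees with the figure quoted above.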

In more detail, the classification results are as follows:

Result           Quantity
True nice           2,130
True nasty          2,079
True neutral        1,298
False nice            401
False nasty           488
False neutral         664

"True nice" means that the system classified an unseen utterance as nice, and this was the correct class. "False nice" means that the system's classification was nice, but the true class was either nasty or neutral.

Word Guessing

With the second version of the experiment running online, a study of the 7,060 examples used for training revealed that fully 22% of them contained words that occurred only once in the training set. Since the features are based on words that occur at least twice in the training set, this means that 22% of the training examples contain at least one word unrecognised by the system. This seems unsatisfactory.

One way around this would be to allow all words to become features, not just those that occur more than once. However, this leads to a much increased number of features (around 14,000 by the time word-pairs are included), which not only slows down the running but actually makes the classification less accurate.

In an attempt to improve the situation, a process of word guessing was introduced in the third version of the experiment. Before training, a lexicon (word list) is compiled from words that occur at least twice in the training set. Then words that occur only once are replaced by the metaword _*_, signifying an unrecognised word. The feature set is built from these modified examples, and comprises the lexicon plus the set of word-pairs that occur at least twice. Word-pairs may include metawords, for example _you_*_, or _{_there_, or _!_}_.
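The construction just described might look like this in Python. It is a sketch under assumptions: utterances are represented as lists of words without the surrounding underscores, "occurs at least twice" is taken to mean total occurrences across the training set, and the function names are ours.

```python
from collections import Counter

def build_lexicon(utterances):
    """Words that occur at least twice across the whole training set."""
    counts = Counter(word for utt in utterances for word in utt)
    return {word for word, n in counts.items() if n >= 2}

def mask_rare_words(utterances, lexicon):
    """Replace every word outside the lexicon by the metaword *."""
    return [[w if w in lexicon else "*" for w in utt] for utt in utterances]

def build_pair_features(masked):
    """Adjacent word-pairs that occur at least twice in the masked examples."""
    counts = Counter(pair for utt in masked for pair in zip(utt, utt[1:]))
    return {pair for pair, n in counts.items() if n >= 2}

# Toy training set, already cleaned and split into words.
training = [
    ["{", "you", "are", "nice", "}"],
    ["{", "you", "are", "horrid", "}"],
    ["{", "nice", "donkey", "}"],
]
lexicon = build_lexicon(training)
masked = mask_rare_words(training, lexicon)
pairs = build_pair_features(masked)
print(sorted(lexicon), sorted(pairs))
```

In this toy example horrid and donkey each occur once, so both become *, and the pair (*, }) then occurs twice and qualifies as a feature.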

During training, each utterance is first cleaned then submitted to the word guesser, which works as follows. Every word is checked to see if it is in the lexicon. If not, the lexicon is searched for the most similar word. The similarity calculation uses a Jaccard similarity measure (modified by Laplace smoothing) based on character bigrams, and produces a value lying between 0 and 1. If the most similar word has a similarity value greater than 0.5, the unrecognised word is replaced by that word; if not, it is replaced by the _*_ metaword. The threshold value of 0.5 is justified by regarding the similarity as a Bayesian belief value, in which case a probability of 0.5 corresponds to "don't know".
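The exact form of the smoothed Jaccard measure is not given, so the Python sketch below makes two assumptions: bigrams are taken from the word with its surrounding underscores, and smoothing adds 1 to the size of the intersection and 2 to the size of the union. The function names are hypothetical.

```python
def bigrams(word):
    """Set of character bigrams of the word, padded with underscores (an assumption)."""
    padded = f"_{word}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def similarity(a, b):
    """Jaccard similarity with an assumed Laplace-style correction; always strictly between 0 and 1."""
    x, y = bigrams(a), bigrams(b)
    return (len(x & y) + 1) / (len(x | y) + 2)

def guess(word, lexicon, threshold=0.5):
    """Return the word itself, its closest lexicon word, or the metaword *."""
    if word in lexicon:
        return word
    # Sorting makes the choice deterministic when two words are equally similar.
    best = max(sorted(lexicon), key=lambda w: similarity(word, w), default=None)
    if best is not None and similarity(word, best) > threshold:
        return best
    return "*"

lexicon = {"ditto", "receiving", "hello", "you"}
print(guess("dito", lexicon), guess("recieving", lexicon), guess("xyzzy", lexicon))
```

Under these assumptions dito scores 0.75 against ditto and recieving scores 8/15 (about 0.53) against receiving, so both clear the 0.5 threshold.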

This scheme works quite well in replacing typos and common misspellings; for example _dito_ would be correctly guessed as _ditto_ and _recieving_ as _receiving_.

Disappointingly, incorporating this word guessing scheme made little difference to the classification accuracy, which improved only from 0.766 in version 2 to 0.775 in version 3, with the same training set. Incorporating the begin and end metawords in version 4 increased the accuracy slightly more, to 0.780. (All these figures have a 99% confidence interval of +/-0.013.) A more positive way of looking at this result is to say that the classifier is already robust in the presence of unrecognised words, so guessing them correctly does not yield much improvement!

The word guesser is of course used not only in training but also in online operation, where it tries to guess previously unseen words as well as the ones occurring only once in the training set. It replaces them with the most similar lexicon word, or substitutes the _*_ metaword if it fails to find a close enough match.

Dynamical Model

When an utterance is received online from a user, it is first cleaned and passed through the word guesser. The feature subset is then extracted and sent to the Naive Bayes classifier, which produces a numerical output: +1 for nice, -1 for nasty, and 0 for neutral.

This numerical representation of the class, C, forms the input to a simple first-order dynamical system. The system's emotional state variable, E, ranges from -100 (extremely unhappy) to +100 (extremely happy). The initial state is E_0 = 0 (neither happy nor unhappy). With each utterance the new state E_{k+1} is calculated from the previous state, E_k, and C:

$$
E_{k+1} =
\begin{cases}
-30 + 0.7\,E_k & \text{if } C = -1 \\
0.8\,E_k & \text{if } C = 0 \\
30 + 0.7\,E_k & \text{if } C = +1
\end{cases}
$$

Thus if C is repeatedly -1, the sequence {E_k} tends to -100; if C is repeatedly 0, {E_k} tends to 0; and if C is repeatedly +1, {E_k} tends to 100.
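These limits are the fixed points of the three update rules. A short derivation, ignoring the rounding, random perturbation and clamping applied afterwards, and writing E^* for the fixed point (our notation):

$$
\begin{aligned}
C = +1:&\quad E^* = 30 + 0.7\,E^* \;\Rightarrow\; 0.3\,E^* = 30 \;\Rightarrow\; E^* = 100 \\
C = 0:&\quad E^* = 0.8\,E^* \;\Rightarrow\; 0.2\,E^* = 0 \;\Rightarrow\; E^* = 0 \\
C = -1:&\quad E^* = -30 + 0.7\,E^* \;\Rightarrow\; 0.3\,E^* = -30 \;\Rightarrow\; E^* = -100
\end{aligned}
$$

In each case $E_{k+1} - E^* = a\,(E_k - E^*)$ with $a = 0.7$ or $a = 0.8$, so the distance from the fixed point shrinks geometrically and the sequence converges from any starting state.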

The result is rounded to the nearest integer and, to provide the spice of a little unpredictability, a random number in [-5, 5] is added to the calculated value. The result is then constrained to lie in [-100, 100].
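A single update step might be sketched in Python as below. The function name and the noise parameter are our own; we also assume that the random number is an integer drawn uniformly from -5 to 5, which is not specified. Python's round breaks exact ties towards the even integer, which may differ from the original implementation.

```python
import random

def next_state(e, c, noise=None):
    """Return the new emotional state given the previous state e and class c in {-1, 0, +1}."""
    if c == 1:
        value = 30 + 0.7 * e
    elif c == -1:
        value = -30 + 0.7 * e
    else:
        value = 0.8 * e
    if noise is None:
        # Assumption: an integer perturbation, uniform on -5..5.
        noise = random.randint(-5, 5)
    value = round(value) + noise
    # Constrain the result to the range -100..100.
    return max(-100, min(100, value))

# Ditto starts neutral and hears three nice utterances in a row.
state = 0
for _ in range(3):
    state = next_state(state, +1)
    print(state)
```

Passing noise=0 switches the randomness off, which is convenient for checking the deterministic part of the rule.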

Conclusion

Experiment 2 demonstrates that it is possible for a simple statistical classifier to learn abstract concepts like emotional intent from textual input, using single words and word-pairs as features. This suggests that it should be possible to train similar classifiers to extract other affective and semantic information from user utterances, given suitable training sets.

Although the demonstrated classification accuracy is only 78%, there are many human tasks for which this order of accuracy is sufficient. (Three out of four successes would be considered excellent in many sporting activities, for instance; and a first-class degree at a British university traditionally corresponds to an overall mark of 70% or greater.) Our chosen challenge for the Convo system is not to improve the classification accuracy (a difficult task), but rather to design a conversational system that performs robustly, despite its imperfect comprehension of the user's intended meaning. We draw inspiration from the essentially flawed nature of human perception and cognition. "To err is human."

Experiment 2 also shows that a simple one-dimensional emotional spectrum can be adequately simulated by ternary input feeding a first-order dynamical system. It may be possible to extend this to multiple affective dimensions in future work.
