The title of this post is the title of my dissertation, which I recently defended and submitted — and with that, I finally got my PhD 🙂 It is published here. One important question, though: what does this even mean?
TL;DR: Interacting with machines is becoming commonplace in today’s world. However, one problem faced by machines in processing human language is ambiguity: a sentence can have more than one possible meaning, depending on the underlying syntactic structure, among other things. Luckily, speech has more cues (prosody) than text, and such cues can minimize this ambiguity. This work introduces novel methods of mapping prosody to syntax to allow machines to use prosody to identify the underlying syntactic structure of spoken sentences, and hence resolve ambiguity and better parse these sentences.
The main idea is that you can say the exact same words, but the way you say them (which is roughly prosody and intonation) can lead to different meanings. Prosody therefore provides cues within human speech that guide listeners to the intended meaning. Let me introduce the idea through this clip from the series “Friends”.
In both Monica’s and Rachel’s versions, the sentence is “got the keys”, but the intonation is different, leading to two different interpretations. Here is another example:
This is supposed to be a funny joke when read silently, because of the ambiguity in the expression “buy her flowers”, which can mean either that you buy the flowers she sells or that you buy flowers for her. However, if you read it aloud, you may find that you say it differently for each interpretation, making it less funny.
Therefore, cues in human speech, in terms of intonation and prosody, may help identify the intended meaning of a sentence that has more than one possible meaning, and more than one possible underlying syntactic structure. An important part of this dissertation was dedicated to sentences with such ambiguities, for example “I saw the boy with the telescope” (which means either that I have the telescope or that the boy has the telescope).
In my research, I asked a number of people to record sentences such as this telescope sentence, and then asked other people to identify the intended meaning. Accuracy was around 63% when subjects were presented with the audio of these ambiguous sentences, compared to around 49% when they were presented only with the text. This finding is consistent with previous literature, indicating that listeners can use the cues in speech to identify the meaning of an ambiguous sentence (e.g. Snedeker and Trueswell, 2003). Using the same recordings, I measured the duration of words and the silent intervals between them, and used these measurements as features for machine learning (decision trees). The machine learning system achieved accuracy comparable to humans (63–73%).
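To make this concrete, here is a toy sketch of the decision-tree setup. Everything here is invented for illustration — the feature set, the threshold behavior, and the numbers are not from my corpus; the real features and data are described in the dissertation:

```python
# Toy sketch: classifying the intended meaning of "I saw the boy with the
# telescope" from simple prosodic features. Feature values are INVENTED for
# illustration, not taken from the actual recordings.
from sklearn.tree import DecisionTreeClassifier

# Each row: [dur("saw"), pause after "boy", dur("with"), pause before "with"]
# Label 0 = "I have the telescope" (break before "with"),
# label 1 = "the boy has the telescope" (no break).
X = [
    [0.22, 0.15, 0.10, 0.30],
    [0.20, 0.02, 0.09, 0.01],
    [0.25, 0.18, 0.11, 0.28],
    [0.19, 0.03, 0.08, 0.02],
]
y = [0, 1, 0, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# A new utterance with a long pause before "with" -> instrument reading (0).
print(clf.predict([[0.21, 0.16, 0.10, 0.29]]))
```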
Going further, I wanted to check if prosody can be helpful in improving parsing in general. Parsing roughly means identifying the underlying syntactic structure for a sentence (or sequence of words). The idea is that if we have multiple possible parse trees, generated by different automatic parsers for a given sentence, can prosody help us decide which of these trees is the most likely?
The biggest challenge here is that prosody cannot be readily expressed in a form compatible with syntactic structure, which is typically represented as a recursive tree. Prosody has a structure of its own: parts of it can be observed as “prosodic breaks”, or how much “disjuncture” there is between words, which is reflected in word durations and the pauses/silent intervals between them, in addition to patterns of pitch and intensity, among other acoustic properties. All of these unfold linearly in time, while in syntax words can be nested and embedded recursively, phrase within phrase, and two consecutive words may belong to completely different phrases.
Therefore, I believe the following part is the main contribution of this dissertation: to propose a simple, linear representation of syntactic structure, that is compatible with prosody.
To understand this representation, we need to be aware that there are two types of parsing (in terms of the type of syntactic structure used): constituency parsing and dependency parsing. Constituency parsing is based on phrase structure grammar, which is widely used across linguistics (mainly by grouping words into phrases, e.g. noun phrase, verb phrase, sentence, etc.). Dependency parsing is more popular in computational linguistics, and most modern parsers are of the dependency type. The representation proposed here is based on dependency parsing.
Based on dependency structure, the idea is very simple: for any pair of consecutive words, identify where the head of each word is (for example, in the sentence above, both “the” and “morning” depend on the word “flight” to the right, so “flight” is the head of both words). Based on the data from the Switchboard corpus, there are only 12 possible configurations, as shown below (the exact algorithm for calculating the numbers is in chapter 4 of the dissertation).
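Here is a minimal sketch of the idea in Python. The category names (“in-pair”, “left”, “right”, “root”) are my own shorthand for this post, not necessarily the labels used in the dissertation, and the head indices are supplied by hand rather than by a parser:

```python
# For each pair of adjacent words, classify where each word's head lies:
# within the pair itself, further to the left, or further to the right.

def head_direction(i, head, pair):
    """Locate word i's head relative to the adjacent pair of positions."""
    if head == -1:
        return "root"          # the word is the root of the sentence
    if head in pair:
        return "in-pair"       # its head is the other word of the pair
    return "left" if head < i else "right"

def configurations(heads):
    """heads[i] is the index of word i's head (-1 marks the root)."""
    configs = []
    for i in range(len(heads) - 1):
        pair = (i, i + 1)
        configs.append((head_direction(i, heads[i], pair),
                        head_direction(i + 1, heads[i + 1], pair)))
    return configs

# "the morning flight": "the" and "morning" both depend on "flight" (index 2),
# and "flight" is the root. For the pair ("the", "morning"), neither word
# heads the other -- both heads lie further to the right.
print(configurations([2, 2, -1]))
# -> [('right', 'right'), ('in-pair', 'root')]
```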
When analyzing these configurations against the distribution of prosodic breaks, we start to see some interesting patterns (also in chapter 4). If one of the two words in the pair depends on the other, the likelihood of a prosodic break between them is very small. However, if there is no direct dependency between the two words, a prosodic break is more likely. The likelihood is highest when the left word of the pair depends on a word further to the left and the right word depends on a word further to the right, as in the example below, “is that || what you’re saying”, where a prosodic break is most likely between “that” and “what”.
These dependency configurations are related to prosody then, so this is how we use them to improve parsing:
The goal was to parse all the sentences in the Switchboard corpus of conversational speech using a number of parsers (spaCy, Google SyntaxNet, and ClearNLP), and build an ensemble system to choose which parse hypothesis from the three parsers is the most likely for any given sentence. I fed the dependency configurations, along with the prosodic information (word durations and the silent intervals between them), into a neural network (RNN/LSTM) that scores each parse hypothesis, and the ensemble system chooses the parse with the highest score. Success here means that the ensemble system consistently chooses the better parse from the output of the three parsers, so that the overall parsing metric is better than that of each individual parser.
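The ensemble step itself is just an argmax over scored hypotheses. Below is a toy sketch of that step; the scorer here is a deliberately trivial stand-in (it rewards parses whose direct dependencies coincide with short pauses, in line with the chapter-4 pattern), whereas the real system uses a trained RNN/LSTM over much richer features:

```python
# Toy ensemble sketch: each parser proposes a head sequence for the sentence,
# a scorer rates each hypothesis, and we keep the highest-scoring parse.

def toy_score(heads, pauses):
    """heads[i]: head index of word i (-1 = root); pauses[i]: silence (sec)
    after word i. Rewards pairs where 'direct dependency' and 'no long pause'
    agree -- a stand-in for the dissertation's LSTM scorer."""
    score = 0
    for i in range(len(heads) - 1):
        direct_dep = heads[i] == i + 1 or heads[i + 1] == i
        long_pause = pauses[i] > 0.2
        if direct_dep != long_pause:   # dep & no pause, or no dep & pause
            score += 1
    return score

def choose_parse(hypotheses, pauses):
    """Pick the hypothesis the scorer likes best."""
    return max(hypotheses, key=lambda h: toy_score(h, pauses))

# Two hypothetical parses of a 3-word utterance with a long pause after
# word 0. hyp_a links words 0 and 1 directly across the pause; hyp_b does not.
hyp_a = [1, -1, 1]
hyp_b = [2, 2, -1]
pauses = [0.35, 0.05]
print(choose_parse([hyp_a, hyp_b], pauses))   # -> [2, 2, -1]
```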
The main finding was that we could achieve an improvement of around 1% in the main parsing metric (UAS, the unlabeled attachment score) when using dependency configurations, compared with the best individual parser. We achieved a further improvement of 0.4–0.5% when combining prosodic information with the dependency configurations. Overall, this suggests that the approach can use prosody to improve parsing performance, achieving parsing performance higher than that of any individual parser.
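For readers unfamiliar with the metric: UAS is simply the fraction of words whose predicted head matches the gold-standard head. A minimal sketch, with made-up head sequences:

```python
# UAS (unlabeled attachment score): fraction of words attached to the
# correct head, ignoring dependency labels.

def uas(gold_heads, pred_heads):
    assert len(gold_heads) == len(pred_heads)
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

# Hypothetical 4-word sentence where the parser gets 3 of 4 heads right.
print(uas([2, 2, -1, 2], [2, 0, -1, 2]))   # -> 0.75
```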
Finally, the main takeaway of this research is that prosody can help automatically resolve ambiguous sentences and identify the more likely parse for a given sentence, based on the novel approach of dependency configurations, which provides a simple representation of syntactic structure that is compatible with prosody. There are a number of other insights that we’ll talk about in future posts. Please let me know in the comments if something needs further clarification.