A conversational agent is a computer program that can engage in a conversation with a human user.
Conversational agents, also known as dialogue systems or chatbots, serve various purposes. They are used in voice interfaces for tasks like playing music, controlling lights, setting timers, and getting weather updates. They also provide entertainment through engaging conversations and can be utilized in clinical settings for mental health support. Additionally, conversational agents enable access to services such as booking hotels or purchasing books. Examples of conversational agents include personal assistants like Siri, Alexa, Cortana, and Google Assistant, found on phones and smart speakers.
There are two main categories of conversational agents: chatbots and task-oriented dialogue systems.
Chatbots engage in extended conversations, aiming to mimic unstructured human-human interactions, often for entertainment or therapeutic purposes.
Task-oriented dialogue systems are goal-based agents designed to solve specific tasks, such as booking a restaurant or managing a shopping list.
These systems typically interface with personal assistants, cars, appliances, and robots.
```mermaid
graph TD
    A[Conversational Agents]
    B[Open-domain]
    C[Task-oriented]
    D[Data-driven]
    E[Rule-based]
    F[Finite states]
    G[Frame-based]
    H[Generative]
    I[Retrieval]
    A --> B
    A --> C
    B --> D
    B --> E
    C --> F
    C --> G
    D --> H
    D --> I
```
Properties of dialog
In spoken dialog, participants take turns to speak. Turns can vary in length, ranging from a single word to multiple sentences. Speakers negotiate turn-taking, deciding when to speak and when to yield the floor. In cases where both speakers talk at the same time, such as when the client interrupts the agent, the agent knows to stop talking and allow the client to make a correction.
One important task for conversational agents is end-point detection: determining when the user has finished speaking and it is appropriate for the agent to start responding. This is challenging because people often pause in the middle of their turns, while in human-human conversation speakers typically start their own turn almost immediately after the other person finishes. Traditional conversational agents simply wait for the speaker to stop talking completely before processing the utterance and providing a response.
Another significant consideration is barge-in detection, which refers to the ability of the conversational agent to recognize when the user is speaking and allow them to interrupt the system. This feature enables a more interactive and dynamic conversation between the user and the agent.
Speech acts
According to Bach and Harnish (1979), speech acts can be categorized into different types:
Constatives: These speech acts commit the speaker to something being the case. Examples include answering, claiming, confirming, denying, disagreeing, and stating.
Directives: These speech acts are attempts by the speaker to get the addressee to do something. Examples include advising, asking, forbidding, inviting, ordering, and requesting.
Commissives: These speech acts commit the speaker to some future course of action. Examples include promising, planning, vowing, betting, and opposing.
Acknowledgments: These speech acts express the speaker’s attitude regarding the hearer with respect to some social action. Examples include apologizing, greeting, thanking, and accepting an acknowledgment.
Understanding the different types of speech acts can provide valuable insights into the dynamics of conversation and the actions performed through language.
Examples of Speech Acts:
“Turn up the music!” → DIRECTIVE (a user asking a person or system to do something is issuing a DIRECTIVE, a kind of command)
“What day in May do you want to travel?” → DIRECTIVE (asking a question that requires an answer is also a way of issuing a DIRECTIVE: it is as if the agent is, very politely, commanding the client to answer)
“I need to travel in May” → CONSTATIVE (by contrast, a user stating a constraint is issuing a CONSTATIVE)
“Thanks” → ACKNOWLEDGMENT (a user thanking the system is issuing an ACKNOWLEDGMENT)
In any joint activity, it is crucial for participants to establish common ground, which refers to a shared understanding of the current state of the world and the task at hand. This mutual understanding enables effective and coherent communication. The principle of closure, as discussed by Clark (1996) and Norman (1988), posits that agents performing an action require evidence that they have successfully completed it. This principle applies to speech as well: speakers need to ground each other’s utterances by acknowledging that the hearer has understood them. Grounding is essential in human-machine interaction too. For example, elevator buttons light up when pressed to indicate that the request has been acknowledged. Humans constantly ground each other’s utterances in conversation through various means:
Explicit Acknowledgment
A: “And you said returning on May 15th?”
C: “Uh, yeah, at the end of the day.”
A: “OK.”
Repetition
C: “OK, I’ll take the 5ish flight on the night before on the 11th.”
A: “On the 11th? OK.”
Implicit Acknowledgment Using Continuers
C: “I need to travel in May.”
A: “And, what day in May did you want to travel?”
Grounding is equally important for chatbots to ensure effective communication with users.
Examples
Consider the following examples, where a user responds “No” to the system’s question:
Without Explicit Grounding:
System: “Did you want to review more of your profile?”
User: “No.”
System: “What’s next?”
With Explicit Grounding:
System: “Did you want to review more of your profile?”
User: “No.”
System: “Okay, what’s next?”
In the second example, the system’s use of “Okay” grounds the user’s response, indicating that the system has processed and understood the user’s “No.”
This acknowledgment enhances the clarity and efficiency of the interaction, reducing potential confusion.
Conversation Structure
Conversations have a distinct structure that can be analyzed and understood. This structure includes the concept of adjacency pairs, where a speech act is followed by a corresponding response. For example, a question is typically followed by an answer, a proposal is followed by acceptance or rejection, and compliments are often met with downplayers. Understanding this structure can help systems generate appropriate responses.
In some cases, the second part of an adjacency pair may not immediately follow the first part. These two parts can be separated by a side sequence, which is a sub-dialogue that occurs between them. For example, a correction sub-dialogue may occur when one party corrects the other’s statement. Similarly, a clarification sub-dialogue may occur when one party seeks clarification on a previous question or statement. These side sequences add complexity to the conversation structure.
In addition to side sequences, conversations often involve pre-sequences. These are initial statements or questions that set the context or lead into the main part of the conversation. For example, a user may start by asking about the system’s capabilities before making a specific request. Pre-sequences help establish the flow and direction of the conversation.
Conversations can be controlled by one person or have mixed initiative. In some cases, one person takes the lead and directs the conversation, while in others, the initiative shifts between participants. This mixed initiative is common in human conversations, but can be challenging for NLP systems, which often default to simpler styles where either the user or the system takes the lead. This limitation can sometimes lead to frustration for human users.
Inference plays a crucial role in dialogue understanding. It involves drawing conclusions or making assumptions based on the information provided. For example, when a client responds to an agent’s question by mentioning a meeting from the 12th to the 15th, the agent is expected to infer that the client’s desired travel dates fall within that range. Understanding and making appropriate inferences is essential for effective communication in dialogue systems.
Open-domain Chatbots
Open-domain chatbots engage in extended conversations with users, aiming to mimic unstructured human-human interactions. These chatbots are designed to provide entertainment, companionship, and sometimes therapeutic support. They are not task-oriented and do not have a specific goal beyond engaging the user in conversation. They can be rule-based or data-driven.
Rule-based chatbots use a fixed set of rules to generate responses. Examples include ELIZA (which uses pattern-action rules) and PARRY (a computational model of schizophrenia that adds a mental model to rule-based responses).
Men are all alike.
> IN WHAT WAY
They're always bugging us about something or other.
> CAN YOU THINK OF A SPECIFIC EXAMPLE
Well, my boyfriend made me come here.
> YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
> I AM SORRY TO HEAR YOU ARE DEPRESSED
...
You are like my father in some ways.
> WHAT RESEMBLANCE DO YOU SEE
You are not very aggressive but I think you don't want me to notice that.
> WHAT MAKES YOU THINK I AM NOT AGGRESSIVE
You don't argue with me.
> WHY DO YOU THINK I DON'T ARGUE WITH YOU
You are afraid of me.
> DOES IT PLEASE YOU TO BELIEVE I'M AFRAID OF YOU
My father is afraid of everybody.
> WHAT ELSE COMES TO MIND WHEN YOU THINK OF YOUR FATHER
Bullies.
> DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR BOYFRIEND MADE YOU COME HERE
NOTE
ELIZA appears to produce coherent responses and even something resembling therapeutic guidance. It also appears to recall information from earlier in the conversation: its final response refers back to the boyfriend mentioned at the beginning.
ELIZA was designed to emulate a Rogerian psychologist by reflecting the patient’s statements back to them. This type of conversation allows the psychologist to assume a position of limited knowledge about the real world. It is worth noting that chatbots attempting to pass the Turing test often adopt a similar approach.
Example
Patient: “I went for a long boat ride”
Psychologist: “Tell me about boats”
In this exchange, we do not assume that the psychologist is ignorant about boats; rather, we assume that the response serves a specific conversational goal.
Eliza’s transformation rules:
Pattern: (* YOU * ME)
Transform: (WHAT MAKES YOU THINK I 3 YOU)
In the pattern, each asterisk (*) captures a sequence of words. In the transform, the ‘3’ refers to the third constituent of the pattern, which here is the text captured by the second asterisk.
"You hate me" → "WHAT MAKES YOU THINK I HATE YOU"
The rules in Eliza are categorized based on keywords. Each keyword has a specific pattern and a list of possible transformations. Keywords can be associated with multiple patterns.
| Keyword | Pattern | Transformations |
| --- | --- | --- |
| YOU | (* YOU * ME) | (WHAT MAKES YOU THINK I 3 YOU), (WHY DO YOU THINK I 3 YOU) |
Keywords are ranked in order of specificity. Given the user utterance “I know everybody laughed at me”, the potential matching rules are:
| Keyword | Pattern | Possible response by ELIZA |
| --- | --- | --- |
| I | (I *) -> (YOU SAY YOU 2) | YOU SAY YOU KNOW EVERYBODY LAUGHED AT YOU |
| Everybody | (Everybody *) -> (WHO IN PARTICULAR ARE YOU THINKING OF) | WHO IN PARTICULAR ARE YOU THINKING OF |
NOTE
“I” is a more general rule, while “Everybody” is more specific and interesting.
Rule preference can be implemented by storing keywords with a rank.
If no keyword matches, ELIZA will choose a non-committal response such as:
PLEASE GO ON
THAT’S VERY INTERESTING
I SEE
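The following is a toy Python rendering of this keyword-rule machinery, with ranked keywords and a non-committal fallback. The patterns, ranks, and responses are illustrative, not Weizenbaum’s original rule set.

```python
import random
import re

# Toy ELIZA-style rules: (keyword, rank, pattern, transforms).
# A higher rank means a more specific (and more interesting) keyword.
RULES = [
    ("EVERYBODY", 5, re.compile(r".*\bEVERYBODY\b.*", re.IGNORECASE),
     ["WHO IN PARTICULAR ARE YOU THINKING OF"]),
    ("I", 0, re.compile(r".*\bI\b (.*)", re.IGNORECASE),
     ["YOU SAY YOU {0}"]),
    ("YOU", 0, re.compile(r".*\bYOU\b (.*) ME\b.*", re.IGNORECASE),
     ["WHAT MAKES YOU THINK I {0} YOU", "WHY DO YOU THINK I {0} YOU"]),
]
NON_COMMITTAL = ["PLEASE GO ON", "THAT'S VERY INTERESTING", "I SEE"]

def respond(utterance: str) -> str:
    # Collect every matching rule, then prefer the highest-ranked keyword.
    matches = []
    for keyword, rank, pattern, transforms in RULES:
        m = pattern.match(utterance)
        if m:
            matches.append((rank, m, transforms))
    if not matches:
        return random.choice(NON_COMMITTAL)   # non-committal fallback
    rank, m, transforms = max(matches, key=lambda t: t[0])
    groups = [g.upper() for g in m.groups()]
    return random.choice(transforms).format(*groups)

print(respond("You hate me"))                     # e.g. WHAT MAKES YOU THINK I HATE YOU
print(respond("I know everybody laughed at me"))  # WHO IN PARTICULAR ARE YOU THINKING OF
```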
The memory trick accounts for ELIZA’s final response in the transcript above, where it refers back to an earlier topic:
If “MY” is the highest keyword, a transform from the MEMORY list is randomly selected.
The output of the transform is applied and stored in a queue.
If no keyword matches a sentence later on, the MEMORY from the queue is returned.
Memory transforms:
* MY * -> LET'S DISCUSS FURTHER WHY YOUR 3
* MY * -> EARLIER YOU SAID YOUR 3
* MY * -> DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR 3
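A toy sketch of the memory trick under the same illustrative assumptions: when “MY” is the top keyword, a randomly chosen MEMORY transform is applied and queued, and the queued output is returned later when no keyword matches.

```python
import random
import re
from collections import deque

# Illustrative MEMORY transforms mirroring the list above.
MEMORY_TRANSFORMS = [
    "LET'S DISCUSS FURTHER WHY YOUR {0}",
    "EARLIER YOU SAID YOUR {0}",
    "DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR {0}",
]
MY_PATTERN = re.compile(r".*\bMY\b (.*)", re.IGNORECASE)
memory_queue = deque()

def remember(utterance: str) -> None:
    """If the utterance mentions MY, store a transformed version for later use."""
    match = MY_PATTERN.match(utterance)
    if match:
        template = random.choice(MEMORY_TRANSFORMS)
        memory_queue.append(template.format(match.group(1).upper()))

def fallback_response() -> str:
    """Used when no keyword matches the current sentence."""
    if memory_queue:
        return memory_queue.popleft()
    return "PLEASE GO ON"

remember("My father is afraid of everybody.")
print(fallback_response())  # e.g. EARLIER YOU SAID YOUR FATHER IS AFRAID OF EVERYBODY.
```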
Ethical Implications of Chatbots
Anthropomorphism and Privacy
One of the notable ethical concerns surrounding chatbots is the tendency for people to develop deep emotional connections with them. This was evident with ELIZA, a chatbot developed by Joseph Weizenbaum. In fact, some individuals became so emotionally involved with ELIZA that they requested privacy during their conversations with the chatbot. This raised concerns about the privacy implications of storing and analyzing these conversations, as users felt they were engaging in private and personal discussions with the software, despite knowing it was just a program.
Weizenbaum himself expressed worry about the extent to which people confided in ELIZA and whether they were misled about the chatbot’s understanding capabilities. He feared that society was making a mistake by relying too heavily on computers for decision-making and personal choices, thereby removing the human element from important aspects of life.
Ethical Implications
Sherry Turkle, a researcher who studied users of ELIZA and other similar systems, found that while face-to-face interaction is crucial to the human experience, individuals also develop relationships with technological artifacts. Some users even described using ELIZA as a form of diary, using it as a means to privately explore their thoughts and emotions.
When designing chatbot systems, it is essential to carefully consider the potential benefits and harms that may arise from their use. This includes addressing concerns related to privacy, emotional attachment, and the impact of relying on technology for personal and emotional support.
Parry (1971)
PARRY is a computational model of schizophrenia that was developed as another chatbot with a focus on clinical psychology. It was specifically designed to study schizophrenia and had a similar pattern-response structure to Eliza, but with more advanced capabilities in terms of control structure and language understanding. What set PARRY apart was its inclusion of a model of mental state, with variables representing levels of Anger, Fear, and Mistrust.
The procedure of PARRY involved starting with all mental state variables set to low levels. Each user statement had the potential to change the levels of Fear and Anger, with insults increasing Anger, flattery decreasing Anger, and mentions of delusions increasing Fear. If the user input did not contain any malevolent content, the levels of Anger, Fear, and Mistrust would decrease. The response generated by PARRY depended on the current mental state of the chatbot.
PARRY achieved a significant milestone by being the first system to pass a version of the Turing test in 1972. Psychiatrists were unable to distinguish interviews with PARRY from transcripts of interviews with individuals diagnosed with paranoid schizophrenia.
Emotion classes
If we want to perform sentiment analysis with more than two classes (positive/negative), which emotion classes should we consider?
Scherer’s typology of affective states categorizes emotions into the following:
Emotion: relatively brief episode of response to an event, such as anger, sadness, joy, fear, shame, pride, or desperation.
Mood: subjective feeling of low intensity but long duration, often without an apparent cause, such as cheerfulness, gloominess, irritability, listlessness, depression, or buoyancy.
Interpersonal stance: affective stance taken toward another person in a specific interaction, such as being distant, cold, warm, supportive, or contemptuous.
Attitudes: enduring preferences or predispositions towards objects or persons, such as liking, loving, hating, valuing, or desiring.
Personality traits: emotionally laden, stable personality dispositions and behavioral tendencies, such as being nervous, anxious, reckless, morose, hostile, envious, or jealous.
Ekman proposes a model with six basic emotions, which are: Happiness, Surprise, Fear, Sadness, Anger, and Disgust.
On the other hand, Plutchik’s wheel of emotions arranges eight basic emotions into four opposing pairs:
Joy-Sadness
Anger-Fear
Trust-Disgust
Anticipation-Surprise
In the spatial model, an emotion is represented as a point in a two- or three-dimensional space, with the following dimensions:
Valence: the pleasantness of the stimulus.
Arousal: the intensity of the emotion provoked by the stimulus.
Dominance: the degree of control exerted by the stimulus.
Data-driven Chatbots
Two common approaches for corpus-based chatbots are response by retrieval and response by generation. In response by retrieval, the chatbot uses information retrieval techniques to find an appropriate response from a corpus based on the current context. On the other hand, in response by generation, the chatbot uses a language model to generate a response given the dialogue context.
Corpus-based chatbots heavily rely on large corpora. These chatbots require hundreds of millions or even billions of words to effectively generate responses. Various types of conversations can be used as corpora, such as transcripts of telephone conversations, movie dialogues, paid crowdworker conversations, and pseudo-conversations from social media posts. However, it is crucial to remove any Personally Identifiable Information (PII) from the corpora to ensure privacy.
Retrieval-based response generation
Given the user turn $q$ and a corpus of conversations $C$, find the response $r$ that is most similar to the query. Similarity can be computed using lexical search (TF-IDF cosine):

$$\mathrm{response}(q, C) = \operatorname*{argmax}_{r \in C} \frac{q \cdot r}{|q|\,|r|}$$

or semantic search, as the dot product between sentence embeddings $h_q$ and $h_r$:

$$\mathrm{response}(q, C) = \operatorname*{argmax}_{r \in C} h_q \cdot h_r$$
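A toy sketch of the lexical (TF-IDF cosine) variant, assuming scikit-learn and an invented mini-corpus of paired turns and responses:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: each stored turn is paired with the response that followed it.
corpus_turns = [
    "I love hiking in the mountains",
    "The weather in Beijing is lovely in autumn",
    "My favourite food is sushi",
]
corpus_responses = [
    "Hiking is great exercise, where do you usually go?",
    "Autumn in Beijing is beautiful, have you visited the Great Wall?",
    "Sushi is delicious, do you prefer nigiri or rolls?",
]

vectorizer = TfidfVectorizer()
turn_vectors = vectorizer.fit_transform(corpus_turns)

def retrieve_response(query: str) -> str:
    query_vector = vectorizer.transform([query])
    similarities = cosine_similarity(query_vector, turn_vectors)[0]
    # Return the response paired with the most similar stored turn.
    return corpus_responses[similarities.argmax()]

print(retrieve_response("Tell me something about Beijing"))
```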
Generation-based response generation
Response production is essentially a language modeling task: the model generates each response token $r_i$ by conditioning the generation on the encoded query $q$. In other words, at each step it selects the most probable word from the vocabulary $V$ given the query and the previously generated response tokens:

$$r_i = \operatorname*{argmax}_{w \in V} P(w \mid q, r_1, \ldots, r_{i-1})$$
For dialogues, the query also includes the previous responses. This can be achieved by fine-tuning an encoder-decoder or decoder-only model using conversational data.
Warning
Neural chatbots often generate repetitive or uninteresting responses, such as “I’m OK” or “I don’t know,” which can lead to a conversation shutdown.
To address this issue, modifications can be made to the basic model for response generation. One approach is to use diversity-enhanced versions of beam search instead of greedily selecting the most likely response. Another option is to train the model with diversity-focused training objectives. Additionally, adding minimum length constraints can help generate longer and more engaging utterances.
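A minimal sketch of generation-based response production, assuming the Hugging Face transformers library and the public DialoGPT-small checkpoint (any dialogue-tuned causal LM could be substituted). The minimum-length and beam-search settings illustrate the kinds of decoding modifications mentioned above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

# Encode the dialogue context (the query); the model then generates the response
# token by token, conditioned on the query and the tokens generated so far.
query = "I need to travel in May." + tokenizer.eos_token
input_ids = tokenizer.encode(query, return_tensors="pt")

output_ids = model.generate(
    input_ids,
    max_new_tokens=40,
    pad_token_id=tokenizer.eos_token_id,
    min_new_tokens=10,   # a minimum-length constraint discourages curt replies
    num_beams=4,         # beam search instead of purely greedy decoding
)
response = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```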
A generative model can generate responses from informative text rather than dialogue. For example, to respond to a query like “Tell me something about Beijing,” the model can collect sentences from public lectures and news articles, search them using information retrieval (IR) based on query expansion from the user’s turn, and provide a response based on the retrieved information. This approach can be further augmented with a Retrieval-Augmented Generation (RAG) model, which uses IR to retrieve passages from Wikipedia, concatenates each Wikipedia sentence to the dialogue context with a separator token, and gives it as context to the language model. The language model learns to incorporate the text into the response.
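A sketch of the context-building step described above; the separator token and helper function are illustrative placeholders for whatever retriever and generator are actually used.

```python
# Knowledge-grounded generation in the spirit of RAG: retrieved passages are
# concatenated to the dialogue context with a separator token and handed to the
# language model as additional context.
SEP = " <sep> "

def build_grounded_context(dialogue_history: list[str], retrieved_passages: list[str]) -> str:
    """Concatenate each retrieved sentence to the dialogue context."""
    context = " ".join(dialogue_history)
    for passage in retrieved_passages:
        context = passage + SEP + context
    return context

history = ["User: Tell me something about Beijing"]
passages = ["Beijing is the capital of China and has a population of over 21 million."]
prompt = build_grounded_context(history, passages)
# The prompt is then given to the generator, which learns to incorporate the
# retrieved text into its response.
print(prompt)
```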
Also, it’s possible to combine responses from different generators, such as a GPT-2 model fine-tuned on EmpatheticDialogue and another GPT-2 model fine-tuned to paraphrase content from Wikipedia. Rule-based generators can also be used to produce scripted conversation about a movie or musician, providing fun facts and engaging the user.
| Pros | Cons |
| --- | --- |
| Fun | They don’t really understand |
| Good for narrow, scriptable applications | Giving the appearance of understanding may be problematic |
| | Rule-based chatbots are expensive and brittle |
| | IR-based chatbots can only mirror training data |
Task-oriented Dialogue Agents
Task-based dialogue agents, also known as goal-oriented dialogue agents, are designed to assist users in accomplishing specific tasks. These tasks can range from setting a timer, making a travel reservation, playing a song, to buying a product. Task-oriented agents often employ either a frame-based architecture or a dialogue-state architecture.
The Frame-based architecture relies on predefined rules to fill frames and execute actions. This approach has been in use for over 40 years and is still widely used in industrial task-based dialogue agents. It involves defining frames, which represent the actions or functions that the agent can perform, along with slots that serve as variables or arguments for those functions. The values assigned to these slots are initially unknown and are filled based on user input.
On the other hand, the Dialogue-state architecture is an extension of the frame-based approach and incorporates a finite-state machine. While more commonly found in research systems, some aspects of this architecture are making their way into industrial systems as well. The dialogue-state architecture focuses on maintaining a dynamic representation of the conversation state, allowing for more flexible and context-aware interactions between the agent and the user.
Frame-based architecture
Frame-based dialogue agents, also known as frame-driven dialogue systems, were first proposed in 1977 with the introduction of the Genial Understander System (GUS). This architecture represents user intentions with a knowledge structure made up of one or more frames, each containing slots to be filled with values.
A frame in this context refers to an action and its associated slots, which need to be filled with specific information. Each slot is accompanied by a question posed to the user to gather the necessary data.
| Slot | Type | Question |
| --- | --- | --- |
| ORIGIN | city | What city are you leaving from? |
| DEST | city | Where are you going? |
| DEP DATE | date | What day would you like to leave? |
| DEP TIME | time | What time would you like to leave? |
| AIRLINE | line | What is your preferred airline? |
The GUS system
The GUS system demonstrated advanced capabilities, including handling coreference expressions, understanding implicit constraints, and managing complex dialogues. For example, it could interpret phrases like “the first one” referring to a specific flight or understand temporal references like “Friday in the evening” indicating the following Friday.
The control structure of the GUS frame architecture involves the system asking questions to the user to fill the slots. Users can provide information for multiple slots simultaneously. Once a frame is filled, the system can run a database query to access relevant information or create a booking.
GUS systems typically consist of multiple frames, such as car or hotel reservations, route information, or airfare price inquiries. The system needs to detect which slot of which frame the user is filling and switch the dialogue control accordingly. Slots can also have condition-action rules, allowing for dynamic behavior. For example, once the user specifies the destination, it can be set as the default stay location for a hotel booking frame.
The natural language understanding (NLU) component of GUS extracts three key elements from the user’s utterances: domain classification, user intent determination, and slot filling. Domain classification identifies the topic the user is referring to, while user intent determination determines the task or goal the user wants to accomplish. Slot filling involves extracting specific values from the user’s utterance to instantiate the task.
Example
Consider the following user utterance: “Show me morning flights from Boston to San Francisco on Tuesday.”
| Slot | Value |
| --- | --- |
| Domain | Air-travel |
| Intent | Search |
| ORIGIN-CITY | Boston |
| ORIGIN-TIME | Morning |
| ORIGIN-DATE | Tuesday |
| DEST-CITY | San Francisco |
In industry applications, rule-based slot filling and template-based generation are commonly used. Handwritten rules, such as regular expressions, are employed to recognize user intents and fill slots accordingly. Templates, which can be fixed or contain variables, are used to generate pre-built response strings. This approach allows for efficient and structured dialogues with users.
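Below is a toy sketch of this rule-based approach: hand-written regular expressions fill the slots of the air-travel frame from the example above, and a fixed template generates the response. The patterns, slot names, and template are illustrative, not from a deployed system.

```python
import re

# Illustrative hand-written patterns (slot names use underscores for convenience).
SLOT_PATTERNS = {
    "ORIGIN_CITY": re.compile(r"\bfrom ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
    "DEST_CITY": re.compile(r"\bto ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
    "ORIGIN_DATE": re.compile(r"\bon (Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b"),
    "ORIGIN_TIME": re.compile(r"\b(morning|afternoon|evening|night)\b", re.IGNORECASE),
}
# Template-based generation: a pre-built response string with variables.
TEMPLATE = "Searching for {ORIGIN_TIME} flights from {ORIGIN_CITY} to {DEST_CITY} on {ORIGIN_DATE}."

def fill_slots(utterance: str) -> dict:
    slots = {}
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            slots[slot] = match.group(1)
    return slots

utterance = "Show me morning flights from Boston to San Francisco on Tuesday."
slots = fill_slots(utterance)
print(slots)                      # {'ORIGIN_CITY': 'Boston', 'DEST_CITY': 'San Francisco', ...}
print(TEMPLATE.format(**slots))
```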
| Pros | Cons |
| --- | --- |
| High precision | Expensive and slow to create rules |
| Can provide coverage if the domain is narrow | Can suffer from recall problems |
Dialogue-state architecture
The Dialogue- or Belief-State Architecture is a more advanced version of the frame-based architecture. It incorporates dialogue acts, machine learning (ML), and enhanced natural language generation. This architecture is foundational for various industrial systems, particularly in slot understanding through ML.
Dialogue-State Architecture is composed of four main components:
Natural Language Understanding (NLU): Utilizes machine learning to extract slot fillers from user utterances.
Dialogue State Tracker (DST): Maintains the current state of the dialogue, including the user’s most recent dialogue act and slot-filler constraints.
Dialogue Policy: Determines the system’s next action or response and can range from simple policies, like asking questions until all slots are filled, to more complex strategies that decide when to answer questions or seek clarification.
Natural Language Generation (NLG): Generates more natural and less templated utterances.
Dialogue acts merge speech acts and grounding into a single, canonical representation that is independent of natural language. This representation is often used in applications like restaurant recommendation systems (e.g. user informs the system they want Italian food near a museum and confirms that the price is reasonable).
| Tag | Sys | User | Description |
| --- | --- | --- | --- |
| HELLO(a = x, b = y, …) | X | X | Open a dialogue and give info a = x, b = y, … |
| INFORM(a = x, b = y, …) | X | X | Give info a = x, b = y, … |
| REQUEST(a, b = x, …) | X | X | Request value for a given b = x, … |
| REQALTS(a = x, …) | | X | Request alternative with a = x, … |
| CONFIRM(a = x, b = y, …) | X | X | Explicitly confirm a = x, b = y, … |
| CONFREQ(a = x, …, d) | X | | Implicitly confirm a = x, … and request value of d |
| SELECT(a = x, a = y) | X | | Select either a = x or a = y |
| AFFIRM(a = x, b = y, …) | X | X | Affirm and give further info a = x, b = y, … |
| NEGATE(a = x) | | X | Negate and give corrected value a = x |
| DENY(a = x) | | X | Deny that a = x |
| BYE() | X | X | Close a dialogue |
Machine learning classifiers can be trained to map words to semantic frame-fillers. This process requires a large set of labeled sentences.
Example
| Input | Output |
| --- | --- |
| “I want to fly to San Francisco on Monday please” | Destination: San Francisco, Depart-time: Monday |
This training involves labeling each word with the corresponding slot, such as using B-DES (begin) and I-DES (inside) tags for the destination slot.
Example
| Word | I | want | to | fly | to | San | Francisco | on | Monday | please |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tag | O | O | O | O | O | B-DES | I-DES | O | B-DEPTIME | O |
A BERT model can be trained for sequence labeling, and extracted slot values must be normalized to the correct form in the ontology (e.g., “SFO” for San Francisco).
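A minimal sketch of how such a sequence labeler can be set up with a BERT encoder, assuming the Hugging Face transformers library. The label set mirrors the example above, and the classification head here is freshly initialized, so in practice the model would first be fine-tuned on labeled dialogue utterances.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# BIO label set mirroring the example above (an assumption for illustration).
labels = ["O", "B-DES", "I-DES", "B-DEPTIME", "I-DEPTIME"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
)

utterance = "I want to fly to San Francisco on Monday please"
inputs = tokenizer(utterance, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: [1, num_subword_tokens, num_labels]

predicted = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predicted):
    print(token, labels[int(label_id)])        # labels are random until the head is fine-tuned
```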
Dialogue state tracking involves detecting dialogue acts and filling slots. These tasks are often performed jointly.
Example
Sentence: “I’d like Cantonese food near the Mission district”
Detection: inform(food=cantonese, area=mission)
The classifier identifies the dialogue act based on the current sentence embedding and prior dialogue acts.
A simple dialogue state tracker runs a slot-filler after each sentence.
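A toy sketch of such a tracker is shown below; the placeholder nlu function stands in for the trained dialogue-act classifier and slot filler.

```python
# Toy dialogue state tracker: run the (placeholder) NLU after each user sentence
# and accumulate the extracted slot values in the dialogue state.
def nlu(sentence: str) -> tuple[str, dict]:
    """Placeholder NLU: in a real system this is a trained classifier + slot filler."""
    if "cantonese" in sentence.lower():
        return "inform", {"food": "cantonese", "area": "mission"}
    return "inform", {}

dialogue_state = {"dialogue_act": None, "slots": {}}

def update_state(sentence: str) -> dict:
    act, slots = nlu(sentence)
    dialogue_state["dialogue_act"] = act
    dialogue_state["slots"].update(slots)   # new values extend or overwrite old ones
    return dialogue_state

print(update_state("I'd like Cantonese food near the Mission district"))
# {'dialogue_act': 'inform', 'slots': {'food': 'cantonese', 'area': 'mission'}}
```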
Detecting Correction Acts
Correction acts occur when the system misrecognizes an utterance, prompting the user to correct it through repetition, rephrasing, or negation. Corrections are challenging for systems to understand and are often misrecognized more frequently than non-corrections due to hyper-articulation (exaggerated prosody).
User: " I said BAL-TI-MORE, not Boston"
Detecting corrections in spoken dialogue involves identifying features such as changes in prosody and recognizing repeated or rephrased utterances.
| Feature | Examples |
| --- | --- |
| Lexical | Words like “no”, “correction”, “I don’t”, swear words, utterance length |
| Semantic similarity | Word overlap or embedding dot product between the candidate correction act and the user’s prior utterance |
| Phonetic | Phonetic overlap between the candidate correction act and the user’s prior utterance (e.g., “WhatsApp” may be incorrectly recognized as “What’s up”) |
| Prosodic | Hyperarticulation, increases in F0 range, pause duration, and word duration, generally normalized by the values for previous sentences |
| ASR confidence | ASR confidence, language model probability |
Dialog Policy and Generation
Dialogue Policy
At each turn $i$, the system can predict its next action $\hat{A}_i$ from the entire conversation history:

$$\hat{A}_i = \operatorname*{argmax}_{A_i \in \mathcal{A}} P(A_i \mid A_1, U_1, \ldots, A_{i-1}, U_{i-1})$$

Alternatively, it can condition only on the current dialogue state (the filled frame slots) and the last system and user utterances:

$$\hat{A}_i = \operatorname*{argmax}_{A_i \in \mathcal{A}} P(A_i \mid \mathrm{Frame}_{i-1}, A_{i-1}, U_{i-1})$$

To ensure that the system has understood the user, two mechanisms are used: confirming understanding with the user and rejecting utterances that the system has likely misunderstood.
Confirming Strategies
There are two types of confirming strategies in dialogue systems: explicit confirmation and implicit confirmation.
Explicit confirmation involves the system directly asking the user for confirmation. For example, the system may ask, “Is this what you meant?” or “Did I understand you correctly?” This approach allows users to easily correct any misrecognitions by simply answering “no” to the confirmation question. However, explicit confirmation can sometimes make the conversation feel awkward and increase its length.
S: Which city do you want to leave from?
U: Baltimore.
S: Do you want to leave from Baltimore?
U: Yes.
Implicit confirmation is a more subtle approach where the system confirms the user’s input without explicitly asking for confirmation. Instead, the system uses techniques such as paraphrasing or summarizing the user’s input to indicate understanding. Implicit confirmation can help maintain a smoother conversation flow and reduce the length of the dialogue.
U: I want to travel to Berlin
S: When do you want to travel to Berlin?
When deciding between explicit and implicit confirmation, there is a trade-off to consider. Explicit confirmation provides a clear opportunity for users to correct any misunderstandings, but it may disrupt the conversation flow. On the other hand, implicit confirmation can maintain a smoother conversation, but it may not provide an explicit opportunity for users to correct misrecognitions. It’s important to strike a balance between these two strategies based on the specific requirements and constraints of the dialogue system.
Rejection Strategy
In addition to confirmation, rejection is another action that the system can take to express a lack of understanding. Instead of simply repeating the misunderstood utterance, the system can respond with a message like: “I’m sorry, I didn’t understand that.”
System: When would you like to leave?
Caller: Well, um, I need to be in New York in time for the first World Series game.
System: <reject>. Sorry, I didn't get that. Please say the month and day you'd like to leave.
Caller: I wanna go on October fifteenth.
Sometimes, certain utterances are rejected multiple times because the user is using complicated language that the system cannot follow. To overcome this, dialogue systems often employ progressive prompting. Instead of repeating the same question, the system provides more detail and guidance to the user on what they could say. This approach helps the user understand what kind of response the system is expecting and encourages them to provide a clearer and more understandable answer.
In dialogue systems, Automatic Speech Recognition (ASR) or Natural Language Understanding (NLU) systems can assign a confidence value to indicate how likely they understood the user’s input. This confidence value can be estimated based on various factors, such as the acoustic log-likelihood of the utterance, prosodic features, and the ratio of the score of the best interpretation to the second-best interpretation. To determine the appropriate action, a confidence threshold can be set.
| Confidence Value | Description | Action |
| --- | --- | --- |
| < α_low | Low confidence, likely to be incorrect | Reject |
| ≥ α_low | Above the threshold | Confirm explicitly |
| ≥ α_high | High confidence, likely to be correct | Confirm implicitly |
| ≥ α_max | Very high confidence, almost certain to be correct | Don’t confirm at all |
If the confidence value is above the threshold, the system can proceed with the interpretation. Otherwise, the system may take alternative actions, such as asking for clarification or repeating the misunderstood utterance. This approach helps ensure accurate understanding and improves the overall performance of the dialogue system.
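A sketch of such a threshold-based policy; the threshold values are illustrative and would be tuned for a real system.

```python
# Map an ASR/NLU confidence score to a confirmation action (thresholds are illustrative).
T_LOW, T_HIGH, T_VERY_HIGH = 0.3, 0.7, 0.95

def choose_action(confidence: float) -> str:
    if confidence < T_LOW:
        return "reject"                # likely misrecognized: ask the user to rephrase
    elif confidence < T_HIGH:
        return "confirm_explicit"      # "Do you want to leave from Baltimore?"
    elif confidence < T_VERY_HIGH:
        return "confirm_implicit"      # "When do you want to travel to Berlin?"
    else:
        return "proceed"               # don't confirm at all

for c in (0.2, 0.5, 0.8, 0.99):
    print(c, choose_action(c))
```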
Natural Language Generation (NLG)
In the information-state architecture, Natural Language Generation (NLG) is modeled in two stages: content planning and sentence realization.
Content Planning: This stage determines what to say to the user. The dialogue policy decides the appropriate speech act to generate and the attributes (slots and values) to include in the response.
Sentence Realization: This stage focuses on how to say the generated content in natural language. It involves generating a grammatically correct and coherent sentence that provides an answer to the user or follows a confirmation strategy.
Training data for NLG is often limited, as it is not feasible to have examples for every possible situation. To improve generation, a technique called delexicalization can be used. This involves replacing words in the training set that represent slot values with generic placeholder tokens.
Example
recommend(restaurant name= Au Midi, neighborhood = midtown, cuisine = french)
restaurant name is in neighborhood and serves cuisine food.
There is a cuisine restaurant in neighborhood called restaurant name
Sentence realization involves mapping frames (dialogue states) to delexicalized sentences. This mapping ensures that the generated sentences are contextually appropriate and can be easily filled with specific slot values during runtime.
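Here is a toy sketch of that mapping: the delexicalized template from the example above is filled with the frame’s slot values at runtime (template and slot names are illustrative).

```python
# Sentence realization with delexicalized templates: the template contains
# placeholder tokens, which are replaced by the slot values from the frame.
TEMPLATES = {
    "recommend": "restaurant_name is in neighborhood and serves cuisine food.",
}

def realize(dialogue_act: str, frame: dict) -> str:
    sentence = TEMPLATES[dialogue_act]       # delexicalized sentence
    for slot, value in frame.items():        # relexicalization: fill in actual values
        sentence = sentence.replace(slot, value)
    return sentence

frame = {"restaurant_name": "Au Midi", "neighborhood": "midtown", "cuisine": "French"}
print(realize("recommend", frame))
# Au Midi is in midtown and serves French food.
```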
In addition to generating informative responses, NLG systems may also need to generate clarification questions to resolve ambiguous user inputs. These questions help the system understand the user’s intent more accurately. Methods for generating clarification questions include rule-based approaches and classifiers that guess which slots were misrecognized.
Training data for NLG is often challenging to obtain, and techniques like delexicalization and clarification question generation can help improve the quality of generated responses.
Evaluating Dialog Systems
When it comes to evaluating chatbots and task-based agents, different methods are used for each. For chatbots, human evaluation is primarily relied upon, rather than automated methods. This can be done through participant evaluation, where the user who interacted with the chatbot assigns a score based on their experience. Another approach is observer evaluation, where annotators read the transcript of the conversation and assign a score. In the case of task-based dialogue systems, evaluation is mainly focused on measuring the performance of the system in completing the given task.
When evaluating chatbots, there are eight dimensions of quality that are considered:
avoiding repetition
interestingness
making sense
fluency
listening
inquisitiveness
humanness
engagingness
Participant evaluation involves asking questions such as how repetitive the bot was, if it said anything that didn’t make sense, and if the user enjoyed talking to it. Observer evaluation, on the other hand, involves comparing two conversations and determining which one is preferred based on factors like engagingness, interestingness, humanness, and knowledgeability.
Automatic evaluation methods, such as the BLEU scores used in Machine Translation, are generally not suitable for evaluating chatbots as they often do not correlate well with human judgments. Adversarial evaluation, inspired by the Turing Test, is one possible approach where a classifier is trained to distinguish between human and machine responses. The better the system is at fooling the classifier, the higher its quality. Recently, ChatGPT has been used as a tool for evaluating the quality of dialog generated by other systems.
For task-based agents, evaluation is primarily focused on task success. This can be measured through end-to-end evaluation, where the system’s ability to successfully complete the task is assessed. Another measure of performance is the slot error rate, the percentage of incorrectly filled slots:

$$\text{Slot Error Rate} = \frac{\text{\# of inserted, deleted, or substituted slots}}{\text{\# of reference slots for the sentence}}$$

(a short computation sketch follows the survey table below). Additionally, fine-grained metrics can be obtained through user surveys:
| Task | Question |
| --- | --- |
| TTS Performance | Was the system easy to understand? |
| ASR Performance | Did the system understand what you said? |
| Task Ease | Was it easy to find the message/flight/train you wanted? |
| Interaction Pace | Was the pace of interaction with the system appropriate? |
| User Expertise | Did you know what you could say at each point? |
| System Response | How often was the system sluggish and slow to reply to you? |
| Expected Behavior | Did the system work the way you expected it to? |
| Future Use | Do you think you’d use the system in the future? |
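As a concrete illustration of the slot error rate defined above, here is a toy computation; the frames are invented for the example.

```python
# Toy slot error rate: compare hypothesis slots against reference slots.
def slot_error_rate(reference: dict, hypothesis: dict) -> float:
    substituted = sum(1 for s in reference if s in hypothesis and hypothesis[s] != reference[s])
    deleted = sum(1 for s in reference if s not in hypothesis)
    inserted = sum(1 for s in hypothesis if s not in reference)
    return (substituted + deleted + inserted) / len(reference)

reference = {"DEST-CITY": "San Francisco", "ORIGIN-DATE": "Tuesday", "ORIGIN-TIME": "morning"}
hypothesis = {"DEST-CITY": "San Francisco", "ORIGIN-DATE": "Thursday"}  # one substitution, one deletion
print(slot_error_rate(reference, hypothesis))  # 2 errors / 3 reference slots ≈ 0.67
```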
In addition to these evaluation methods, there are other heuristics that can be considered. Efficiency cost can be measured by factors such as the total elapsed time for the dialogue, the number of turns in the conversation, the total number of queries made, and the “turn correction ratio” which represents the percentage of turns used to correct errors. Quality cost can be assessed by looking at the number of Automatic Speech Recognition (ASR) rejection prompts and the number of times the user had to interrupt or intervene in the conversation.
Design and Ethical Issues in Artificial Agents
Ethical Design in Artificial Intelligence
Ethical considerations are crucial in the design and development of artificial agents, as these systems increasingly interact with humans in complex and sensitive contexts. The ethical implications of creating autonomous entities have long been a subject of concern, as illustrated by Mary Shelley’s Frankenstein. This classic tale serves as a cautionary narrative about the dangers of neglecting ethical and humanistic aspects when creating autonomous beings. In the realm of artificial intelligence, ethical design must address various critical concerns:
Safety: Systems must be designed to ensure they do not cause harm to users, particularly in high-stakes environments such as healthcare or automotive industries. For example, AI systems should avoid providing incorrect medical advice, which could lead to severe health consequences, or distracting drivers, which could result in accidents.
Representational Harm: It is essential to avoid creating systems that perpetuate or amplify harmful stereotypes or biases against specific social groups. AI should be inclusive and avoid any form of discrimination, whether intentional or unintentional.
Privacy: Protecting users’ privacy is paramount, especially in systems that handle sensitive information. The design of AI systems must include mechanisms to prevent the unauthorized leakage of personal data.
Safety in AI Systems
Safety is a critical aspect of AI, especially in applications where the consequences of failure can be severe. For instance, chatbots used in mental health settings must be carefully designed to avoid making statements that could harm or distress users. These systems need to be aware of the emotional and psychological impact their responses can have.
Similarly, conversational agents used in vehicles must be designed to enhance the driving experience without causing distractions. These agents must be context-aware, recognizing when their interaction might interfere with the driver’s ability to focus on the road. Ensuring that AI systems prioritize user safety in all interactions is a fundamental ethical requirement.
Abuse and Representational Harm in AI Systems
One of the most striking examples of ethical failure in AI is the case of Microsoft’s Tay, an experimental Twitter chatbot launched in 2016. Tay was designed to emulate the personality of an 18 to 24-year-old American female, capable of engaging in casual conversation, sharing horoscopes, and using informal language, including slang and emojis. However, within just 16 hours of its launch, Tay was inundated with negative interactions, leading it to produce offensive, obscene tweets that included harassment, propaganda, and conspiracy theories. This incident demonstrated how societal biases, such as racism and misogyny, can be reflected and even amplified in AI systems if they are not adequately safeguarded against such influences.
Bias in Training Data: One of the primary sources of ethical concern in AI is the presence of bias in training data. AI systems learn from large datasets, which often contain biased or harmful content. Without proper filtering and preprocessing, these biases can be embedded into the AI model, leading to outcomes that reinforce or exacerbate existing social inequalities. Therefore, significant effort is needed to identify, reduce, and eliminate biased content in training datasets to ensure that AI systems are fair and equitable.
Privacy Concerns in AI Systems
Privacy is another critical ethical issue in the development of AI systems, particularly those that interact directly with users and process personal data. Two major concerns related to privacy are:
Accidental Information Leakage: There is a risk that AI systems may inadvertently expose sensitive information. For example, a user might accidentally reveal personal details, such as a password, while interacting with a voice-activated system. If the system is not designed with strong privacy safeguards, this information could be compromised.
Intentional Information Leakage: Some AI systems are deliberately designed to collect and transmit user data to developers, advertisers, or other third parties. This practice raises serious ethical questions about the user’s right to privacy and the need for transparency in how personal data is used. Privacy-preserving techniques, such as differential privacy and secure multiparty computation, are essential in ensuring that AI systems respect user privacy while still providing useful services.