

Chatbot effectiveness evaluation
 
 

Hi,
First, thanks for the site and this forum, it’s an invaluable source of chatbot information (this is my first post here; I thought it’s the best place for this “thanks”).

I have recently created a chatbot program and implemented various algorithms which can be used to generate chatbot output based on user input. Some algorithms generate fewer possible responses which are more accurate, while others generate more responses which are less accurate (but they still sometimes surprise me very positively).

I’d like to compare these algorithms and then build an optimal chain of them, configured with the most suitable options. I have dug through some articles on chatbot evaluation methods (found at CiteSeer as well as on some university sites) and what struck me is that all these methods are subjective. All of them are more or less based on the Turing test (with a specific scenario or not) and they all rely on user experience, which is unstable (even the same person can have a different experience depending on mood, etc.).

I thought about an algorithm which compares bot answers to model answers using some kind of “distance” (cosine, Levenshtein, whatever)—I would have a test scenario with questions and model answers and use this scenario for every algorithm. Unfortunately I didn’t find such a method described anywhere, and I don’t think it’s that visionary. So I started to think that such methods don’t exist because they wouldn’t be feasible for this kind of comparison: the nature of the problem is so indefinite that it’s hard to tell programmatically whether a response is good or not (maybe that’s why my less accurate algorithms sometimes amuse me).
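Just to make the idea concrete, here is a minimal sketch of what I have in mind (the names are placeholders, and difflib’s ratio just stands in for a proper cosine/Levenshtein measure):

from difflib import SequenceMatcher

def similarity(a, b):
    # crude 0..1 string similarity; swap in cosine/Levenshtein as preferred
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score_algorithm(answer_fn, scenario):
    # scenario: list of (question, model_answer) pairs
    # answer_fn: one algorithm/configuration, mapping a question to a bot answer
    scores = [similarity(answer_fn(q), model) for q, model in scenario]
    return sum(scores) / len(scores)

Running every algorithm (and every configuration) over the same scenario would then give directly comparable numbers.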

So, are there any standard numerical/statistical chatbot evaluation methods similar to the one described above? If not, maybe I’ll try to formulate one (grin).

Thanks,
Seweryn

 

 
  [ # 1 ]

That’s an interesting question, Seweryn. You might find some answers in the field of psychology rather than computer science or linguistics. So-called “soft” sciences like that have more methods for pinning down subjective evaluations, I believe.

Rather than directly comparing the output of different algorithms, have you thought about tracking some aspect of their internal state? For example, if a chatbot is supposed to simulate “mood”, you could check the values of the variables that register mood under the influence of a variety of inputs that would be expected to alter the robot’s mood in similar ways.

 

 
  [ # 2 ]

So if I understand you correctly, you want a program that

(a) assigns a confidence to each algorithm used to produce an output
(b) for each iteration, runs the algorithms with a set of confidences which determine how the algorithms’ outputs combine into the final output
(c) measures the final output against the known solution, producing some effective error (a chi-squared type value or otherwise)
(d) adjusts the confidences accordingly to reduce the error
(e) runs again and again and again
(f) once the error is within some threshold, the confidences are deemed correct.
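In very rough Python, that loop might look something like this sketch (everything here is hypothetical: algorithms is a list of callables producing outputs, and combine and error are whatever fits your bot):

def tune_confidences(algorithms, test_cases, combine, error,
                     lr=0.1, threshold=0.05, max_iters=100):
    confidences = [1.0 / len(algorithms)] * len(algorithms)
    for _ in range(max_iters):                               # (e) run again and again
        total_err = 0.0
        blame = [0.0] * len(algorithms)
        for inp, known in test_cases:
            outputs = [alg(inp) for alg in algorithms]       # (a)/(b)
            total_err += error(combine(outputs, confidences), known)   # (c)
            for i, out in enumerate(outputs):
                blame[i] += error(out, known)                # crude credit assignment
        if total_err / len(test_cases) < threshold:          # (f)
            return confidences
        for i in range(len(confidences)):                    # (d) shift confidence away from the worst offenders
            confidences[i] = max(0.0, confidences[i] - lr * blame[i] / len(test_cases))
        total = sum(confidences) or 1.0
        confidences = [c / total for c in confidences]
    return confidences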

Frankly I wonder if one optimization would be appropriate for all inputs. I’m doubtful. I think there needs to be some sort of estimation of the goal of the input to direct which confidence sets to apply to the algorithms used in formulating the output. That is, is the input a question looking for a certain solution (one combination of confidences), or a statement to be verified or contradicted (another set of confidences), etc.?

Maybe it would help to know more about how your algorithms arrive at an output.

(Welcome to the forum by the way!)

 

 
  [ # 3 ]

Taking that line of reasoning a step further, what about evaluating the chatbot algorithms’ performance using the same techniques that you would use to evaluate the performance of a student?

For example, with multiple choice questions, the chatbot would have to read a question and a list of possible answers, and then decide which answer is closest to the one that it would give.
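As a rough sketch of that idea (the bot callable and the similarity measure here are just placeholders):

from difflib import SequenceMatcher

def pick_choice(bot, question, options):
    # the bot answers in its own words, then picks the listed option
    # most similar to what it would have said
    own_answer = bot(question)
    return max(options, key=lambda opt: SequenceMatcher(None, own_answer, opt).ratio())

Grading then reduces to counting how often the picked option matches the answer key.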

Even for short answer and creative writing tests, there are formal marking guides laid down to help examiners grade students’ answers consistently.

Another possible scenario is something like the Jeopardy quiz show. IBM has been developing an artificial intelligence and natural language processing system called Watson which is able to compete against human players in quiz shows, and win. http://www.research.ibm.com/deepqa

 

 
  [ # 4 ]

@Andrew
I found a few “psychological” chatbot evaluation methods which aren’t binary like the Turing test (human or not): they are based on users rating how “correct” each response is, they introduce rating scales (1-5 or other) and then aggregate those response ratings into an overall chatbot score. But still, they’re based on user ratings and are therefore subjective, depend on mood and so on, so I thought that maybe other methods exist.

@CR
First I want to have a method to tell that one algorithm is better than another, a method other than me looking at the responses and judging. If such a method is numerical, then I could also tell how much better one algorithm is than another.
Then I can run many such tests to:
1. Optimize the configuration values of each algorithm to get the optimum of each.
2. Combine the algorithms and set their confidence levels to get my global optimum.
Thanks for the hint about dynamic confidence sets; I didn’t think about it earlier, maybe I’ll give it a try.

My chatbot is very simple: it holds a database of known conversations, and when user input arrives it matches the input against the database, then selects the best answer. I have a few algorithms for matching input to the database, with various options, then a few filters which can boost potential answers. Finally the best answer is chosen.
These matching algorithms are more or less accurate (and produce fewer or more potential answers, which can then be boosted). I wanted to measure how effective these matching algorithms and their options are, as well as the boosting methods.
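Roughly, the pipeline looks like this sketch (the names are made up, and difflib stands in for my real matching algorithms and filters):

from difflib import SequenceMatcher

def match_candidates(user_input, conversations):
    # conversations: list of phrase lists; a candidate reply is the phrase
    # that followed the matched phrase in a recorded conversation
    candidates = []
    for conv in conversations:
        for i in range(len(conv) - 1):
            score = SequenceMatcher(None, user_input, conv[i]).ratio()
            candidates.append((score, conv[i + 1]))
    return candidates

def respond(user_input, conversations, filters=()):
    candidates = match_candidates(user_input, conversations)
    for boost in filters:   # each filter may raise or lower a candidate's score
        candidates = [(boost(user_input, reply, score), reply) for score, reply in candidates]
    return max(candidates)[1] if candidates else None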

 

 
  [ # 5 ]

Seweryn, this reminds me of one of my old experiments from a few years ago.

It was very basic also, but it was kind of fun and entertaining to chat with.

It worked like this:  The database had a set of “L & R” pairs.  That is, each pair had what I called a “left” and a “right”.

The conversation would always start with the user having the left.

For example, if I said: “I’m hungry”, since that’s a left, it would look for a good ‘right’ for it. If it found none, there would be no response, and you could then enter:

  R: well, get something to eat !

Then, next time it would have that response for that input.  Also, you could switch ‘sides’ at the start of a conversation, such that the PC was on the ‘left’ side.  So it could then say “I’m hungry” and it would take your response and add it as a right.

It worked by picking keywords.  Another example, the pair…

              L(“It’s hot outside”), R(“What about going to the beach”)

It knew the connection between this L & R by the word “hot”. Then the user would enter:

              user: I want a hot cup of coco.
              pc: What about going to the beach?

LOL . .  BUT, you could reply with
              !“cup of coco”: sounds tasty!

And it would know that it responded inappropriately (the “!”), and that “sounds tasty” is the proper response BECAUSE although ‘hot’ was in the input, “cup of coco” was also.
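In rough modern terms it might be sketched like this (a loose reconstruction from memory, not the original code; the data structures and the “!” handling are my guesses):

pairs = []   # list of (keyword set, right-hand response)

def learn(left, right):
    pairs.append((set(left.lower().split()), right))

def reply(user_input):
    words = set(user_input.lower().split())
    best = max(pairs, key=lambda p: len(p[0] & words), default=None)
    if best and best[0] & words:
        return best[1]
    return None   # no response; the user can then teach one with "R: ..."

learn("It's hot outside", "What about going to the beach?")
reply("I want a hot cup of coco")   # -> "What about going to the beach?"
# a "!" correction would effectively add a new pair keyed on the quoted phrase:
learn("cup of coco", "sounds tasty!")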

Anyway, a primitive but entertaining little experiment from years ago!

 

 
  [ # 6 ]

@Victor
Yes, my bot is something like that, but instead of holding L/R pairs it holds whole conversations which took place in the past. It matches the input to one of the phrases and then picks the next phrase in the conversation as an answer; if that answer made sense in the recorded conversation, it can make sense now.

But returning to the topic, it seems that I’m going to do some subjective tests, as further research didn’t bring any clues about programmatic chatbot evaluation methods.

 

 
  [ # 7 ]

Hi Seweryn,
Welcome to the forums.

I thought about an algorithm which compares bot answers to model answers using some kind of “distance” (cosine, Levenshtein, whatever)—I would have a test scenario with questions and model answers and use this scenario for every algorithm.

Let’s say you have a few of these questions and answers. I suppose they would be considered ‘model’ answers because they were relevant at the time of the conversation.

Human A: How’s it going today?
Human B: Man! Can you believe that traffic? I sat for an hour out there.

Human A: How’s it going today?
Human C: Not too good! I received some bad news today.

Human A: How’s it going today?
Human D: Leave me alone. I need coffee.

Here are three realistic snippets of conversations with the same question and a variety of answers. Each of them is valid.
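If you did go the “distance” route, one crude way to allow for that variety (just a sketch, not a standard method) would be to keep several model answers per question and score the bot’s reply against the closest one:

from difflib import SequenceMatcher

def best_match_score(bot_answer, model_answers):
    # score against whichever of the accepted answers is closest
    return max(SequenceMatcher(None, bot_answer.lower(), m.lower()).ratio()
               for m in model_answers)

models = ["Man! Can you believe that traffic? I sat for an hour out there.",
          "Not too good! I received some bad news today.",
          "Leave me alone. I need coffee."]
best_match_score("Pretty rough, the traffic was terrible.", models)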

My only concern about using ‘model’ answers is that one could spend a lot of time creating numerous model answers for each inquiry.

However, I’m a big fan of looking outside the box and conducting research. I look forward to reading about your findings.

Good luck,
Chuck

 

 
  [ # 8 ]

@Chuck
Good point with those different responses to the same question. I came to a similar conclusion after I couldn’t find any chatbot comparison algorithm: only a human can judge whether an answer is relevant or not, because such judgement requires abilities similar to those needed to generate the answer.

For now I’m OK with that and am going to evaluate my algorithms “by hand”, but evaluating one chatbot with another could be an interesting experiment and would automate the evaluation process a little.

 

 