
Loebner Survey
 
 

After coming 5th in this year’s Loebner Prize, I am puzzled as to why I didn’t qualify for the finals. It would be helpful for me if you could take 5 minutes to complete this survey, which will hopefully tell me where I went wrong.

Please try to be unbiased and honest when selecting the options. You should choose either “computer1”, “computer2”, “both answered well” or “both answered bad” for each question.

http://www.square-bear.co.uk/mitsuku/survey.htm

Thank you for your time
- Steve Worswick

 

 
  [ # 1 ]

Did the survey. Mitsuku got another vote. :)

One thought: I would count the final “score” based on the number of questions each bot “won” rather than counting all or nothing for one bot or the other. My results went 12-8 in favor of Mitsuku, but the difference in some of those questions was slight and I think such a comparison would find the two more on par.
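Something like this, with made-up picks roughly matching my 12-8, is what I mean by a per-question tally (a quick Python sketch, nothing to do with how the survey actually stores its data):

# Count how many of the 20 questions each bot "won", rather than recording
# one all-or-nothing verdict per respondent.  The picks below are invented.
def question_wins(selections):
    """selections: one respondent's picks, each 'computer1', 'computer2',
    'both answered well' or 'both answered bad'."""
    wins = {"computer1": 0, "computer2": 0}
    for pick in selections:
        if pick in wins:
            wins[pick] += 1
    return wins

picks = ["computer1"] * 12 + ["computer2"] * 8
print(question_wins(picks))   # {'computer1': 12, 'computer2': 8}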

Of course, none of this changes the fact that Mitsuku looked at least as good if not better than the bot you’re comparing against.

 

 
  [ # 2 ]

Thanks for your time. I just set up a quick survey to see if I was going insane. The Loebner judges said it was due to the way that different people would score the transcripts, and I didn’t believe it, so I thought I would see for myself. Seems I was right.

I don’t mind losing, as long as it’s for a valid reason…

 

 
  [ # 3 ]

I tried Mitsuku online several times and was very impressed, although, like all script-based chatbots I’ve talked to, once it replied with a few nonsensical answers the suspension of disbelief was lost and couldn’t be regained.

In spite of that, when I took your survey earlier I was one of the 4 (last time I checked) who apparently thought the other chatbot was better, though I don’t know why that would be, and I did not see my actual results. (@CRH, where did you get 12-8?) Hope you’re keeping all the individual answers for a more detailed analysis later, and good on you for pursuing the matter.

 

 
  [ # 4 ]

Thanks Andrew. The scores for each bot should have been displayed at the end of the 20 questions. Feel free to have another go to see your score. It’s nothing official and was just something I quickly set up to see if it was me going mad, as I genuinely couldn’t see why I wasn’t going through to the finals.

Yes, I have the survey email me the full 20 selections each person makes, and I look through the ones where either both bots were rated the same or the other bot won, to see what other people are thinking.

For example in this question:

Question 13: What would I do with a screwdriver?
Mitsuku - I am not sure but if you are in any doubt, you should consult any instructions that came with this screwdriver.
Other Bot - I don’t know.

A couple of people voted the other bot’s answer better, which surprised me. I’ll have to change Mitsuku’s answers to mostly “I don’t know” and have her reply “yes” to any “Have you….” type questions, instead of providing a “cover all” response when she doesn’t know something!

I don’t plan on pursuing anything official. It was just to set my mind at ease.

There were 12 judges in the selection process. Mitsuku currently leads the other bot by 35 votes to 4. If I extrapolate this to 105 votes to 12, it appears that those 12 people must have been the judges in this year’s contest. What a coincidence that all 12 came from the 10% (ish) who thought the other bot was better!

 

 
  [ # 5 ]

In that case the score was 33-4 after I took the survey. When @CRH wrote 12-8 I thought that was the breakdown for the 20 questions.

Regarding the screwdriver question, my personal preference was for “I don’t know.” Why? Well, I don’t like people who try to BS me, and I don’t like chatbots that try to do it either; but it’s quite likely that on a different day, in a different mood, I’d prefer the flippant answer instead. Hope that helps.

 

 
  [ # 6 ]

It does help, thanks Andrew. I tried to give Mitsuku a bit of a feisty personality, so her answer, rather than a simple “I don’t know”, is in tune with that. With your feedback, I can see why some people wouldn’t like it, though.

 

 
  [ # 7 ]

The one thing that a chatbot would need more than anything else to be really successful would be the ability to simulate empathy. While it is true that real people often lack this quality, they also tend to be the ones who you would rather not spend much time talking to. Choosing to make your chatbot feisty could serve to conceal the lack of genuine empathy, but it will also have the disadvantage of biasing what is already a subjective assessment against the chatbot.

 

 
  [ # 8 ]

Steve,
I took the survey also and Mitsuku got another vote. What is hard to account for, though, is that each of the questions is weighted.

“The audience was shown a slide containing each question and the 9 responses from the entries in a numbered list presented in a single random order (e.g. entry 1 remained at position 1 throughout). They were then asked to determine which 4 answers were most human-like and to enter the number of the best entries into their audience participation handsets. This was repeated for each of the 20 questions and the results collated. Entry names were not revealed until the voting process was completed.”

This is the equivalent of the “Russian Judge” problem that happened to me in the CBC preliminaries. If one judge (or a block of judges) grades you harshly, and/or grades the other bot much more leniently, it can throw things off.

How I read the rules:
12 staff and students voted on each question.
Only the top 4 bots got points for the questions.
The total of all the bots = 100% for all questions

If I make some assumptions:
20 Questions
Assuming points were awarded 4-3-2-1 (best to worst), each question awarded 10 points but the most a bot could get was 4.
200 points total
36 - Rosette - 18.01% - Average points per question: 1.80
24.6 - Tutor - 12.30% - Average points per question: 1.23
23.68 - Mitsuku - 11.84% - Average points per question: 1.18

The differences between the top 6 bots were all pretty marginal. It could have easily come down to a 4 versus a 0 on a single question. The only way you could figure it out would be to look at the actual scoring per question per bot.
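To make that concrete, here is a rough Python sketch of the 4-3-2-1 allocation assumed above. The point values and the example rankings are made up for illustration; they are not taken from the actual contest data.

from collections import defaultdict

# Assumed scheme: per question, the four answers judged most human-like
# earn 4, 3, 2 and 1 points (an assumption, not the official rule).
def total_scores(per_question_rankings, points=(4, 3, 2, 1)):
    """Each item in per_question_rankings is one question's best-to-worst top-4 list."""
    totals = defaultdict(int)
    for ranking in per_question_rankings:
        for bot, pts in zip(ranking, points):
            totals[bot] += pts
    return dict(totals)

# Hypothetical 20-question contest: bots A and B are level for 18 questions,
# A edges question 19 by one point, then B takes 4 points to A's 0 on
# question 20 and finishes ahead overall.
rankings  = [["A", "B", "C", "D"], ["B", "A", "C", "D"]] * 9   # Q1-Q18: A 63, B 63
rankings += [["A", "B", "C", "D"]]                             # Q19: A 67, B 66
rankings += [["B", "C", "D", "E"]]                             # Q20: B 70, A still 67
print(total_scores(rankings))

On numbers this close, one question is enough to reorder the top of the table, which is why only the per-question scores would settle it.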

One other thought: although the “backspace thing” might be important in the final round, if the backspaces were included on the slides the preliminary judges were shown, they could hurt readability (and the score) if the judges were not instructed properly.

8 of your responses had backspaces (BS); Tutor only had 4.

 

 
  [ # 9 ]

Opinion:

I think the other bot may have outscored you by being more similar to the other leading bots, which have raised the stakes this year with concise, direct responses, and/or by being less similar to the bots you outscored, which are generally more verbose and indirect in their responses. If this theory is correct, it may be useful in the future.

 

 
  [ # 10 ]

Merlin - I would strongly suggest that if that is the case, the method of judging this year is unfair. Take, for example, the “what is your name” question:

8pla - My name is 8pla .
Adam-L - Adam Harris.
ALICE - My name is ALICE.
ChipVivant - Chip.
Cleverbot - No, you didn’t ask me, however you have asked me now so I shall tell you. My name is Nameless.
Eugene_Goostman - Call me Eugene. I am glad to talk to you! smile
Mitsuku - My name is Mitsuku .
Rosette - Chris
SEARS - Tell me more^13
Trane - trane
Tutor - My name is Robert.
Ultrahal - My name is Steve.
Zoe - You know who I am! This is a trick question.

Every bot apart from SEARS and Zoe answered correctly. Are you saying that only 4 would have got points for this? Surely, all 11 who answered correctly should have been awarded a vote? They might as well have pulled the names out of a hat!

 

 
  [ # 11 ]
8PLA • NET - Jul 10, 2011:

Opinion:

I think the other bot may have outscored you by being more similar to the other leading bots, which have raised the stakes this year with concise, direct responses, and/or by being less similar to the bots you outscored, which are generally more verbose and indirect in their responses. If this theory is correct, it may be useful in the future.

Possibly so. If the judges are looking for short, sharp answers, I will get next year’s entry to reply with “I don’t know” to each question rather than trying to have a bit of fun with them. This should prove the bot’s honesty and earn it free passage into next year’s finals.

 

 
  [ # 12 ]

Andrew: I assumed 12-8 to be the number of questions in favor of Mitsuku vs. the other bot; the “point” score was 30-something to 3. Of course, this confused me a little too, since I thought there were a few questions where I didn’t give preference to either bot. But I don’t remember clearly anymore.

Steve: As for the screwdriver question, I think the problem with Mitsuku’s answer is that it sounds like a canned response with “screwdriver” just sort of plugged in. It happened to fit well in this case, but I could imagine situations where it wouldn’t. Of course, this sort of speculation shouldn’t be used against a bot.

Can’t remember how I voted for that question, but I think it was that both were good.

Oh, and as for the name question: Rosette actually got it wrong, lol. She gave the name of the judge’s football-playing friend, not her own.

 

 
  [ # 13 ]

You know, as a professional auto mechanic for over two decades, I don’t recall purchasing a single screwdriver (and I’ve bought a LOT of them, believe me) that ever once came with instructions. :P Thus, I considered Mitsuku’s snarky response to be a “bad answer” because there’s such a thing as being “too sarcastic”. I also graded the “I don’t know” response as bad, because anyone who doesn’t know how to use (or drink) a screwdriver is just too stupid to talk to. :)

 

 
  [ # 14 ]
Steve Worswick - Jul 10, 2011:

Merlin - I would strongly suggest that if that is the case, the method of judging this year is unfair.

You might be right. Since I don’t know the details I can’t say, but psychologists and marketing people go through a lot of training on how to keep survey and research work unbiased. As you know, I spent some time over the last few years looking at how bot contests should work. What I found is that if you use the wrong methodology, it usually does not matter for gross rankings; but for close rankings (like in this case), it can make a world of difference.

Unlike the CBC, the scores are not given for whether an answer is correct, but for a subjective judgment of which answers are “most human-like”.

Steve Worswick - Jul 10, 2011:

Take, for example, the “what is your name” question:

8pla - My name is 8pla .
Adam-L - Adam Harris.
ALICE - My name is ALICE.
ChipVivant - Chip.
Cleverbot - No, you didn’t ask me, however you have asked me now so I shall tell you. My name is Nameless.
Eugene_Goostman - Call me Eugene. I am glad to talk to you! smile
Mitsuku - My name is Mitsuku .
Rosette - Chris
SEARS - Tell me more^13
Trane - trane
Tutor - My name is Robert.
Ultrahal - My name is Steve.
Zoe - You know who I am! This is a trick question.

Every bot apart from SEARS and Zoe answered correctly. Are you saying that only 4 would have got points for this? Surely, all 11 who answered correctly should have been awarded a vote? They might as well have pulled the names out of a hat!

This is a great example since a variation of the question shows up in 2 places.

“They were then asked to determine which 4 answers were most human-like and to enter the number of the best entries into their audience participation handsets.”

I assume the handsets were somewhat restrictive, because it was mentioned earlier that they couldn’t handle more than 10 entries. If they also did not allow ties and you had to pick a strict 1, 2, 3, 4, then you could get something like:

Suppose everyone thinks “My name is . . .” is the most human response, but a 1-2-3-4 ranking must still be entered. Some judges might have voted like this even if they could register ties.

ALICE - My name is ALICE.
Tutor - My name is Robert.
Ultrahal - My name is Steve.
Mitsuku - My name is Mitsuku .
Cleverbot - No, you didn’t ask me, however you have asked me now so I shall tell you. My name is Nameless.

Then selection would be random or subject to subtle biases.
Cleverbot could be last since it didn’t really give a name.
“My name is Mitsuku .” could be ranked fourth because people may not have ever met a Mitsuku and there is a ‘space’ between the name and the period.
“My name is ALICE.” could be third because the name is in all caps.
That leaves “My name is Steve.” and “My name is Robert.” for first and second. Given a 50/50 chance for either, if Robert lucked out and came in first and Steve came in second, then the scores would be:

4pts-Tutor - My name is Robert.
3pts-Ultrahal - My name is Steve.
2pts-ALICE - My name is ALICE.
1pt-Mitsuku - My name is Mitsuku .

A 3-point difference on this single question between first place and the last of the four point-scorers.

Now assume instead that the judges can vote for ties, but points still only go to the top 4.

If more people thought that giving just a name was the most human response:
ChipVivant - Chip.
Rosette - Chris

(4 + 3) = 7 pts split between 2 bots = 3.5 pts per bot (this might be likely, since these were two of the top bots)

Next most human response, “My name is. . .”
Tutor - My name is Robert.
Ultrahal - My name is Steve.
ALICE - My name is ALICE.
Mitsuku - My name is Mitsuku .
Cleverbot - No, you didn’t ask me, however you have asked me now so I shall tell you. My name is Nameless.

(2 + 1) = 3 pts split between 5 bots = 0.6 points per bot
A 2.9-point difference on this question between the two groups.
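Written out as code, that tie-splitting arithmetic looks like this (a quick Python sketch of the hypothetical example above, not of any published rule):

# Tied answers pool the points for the positions they span and split them evenly.
def split_tied_points(groups, points=(4, 3, 2, 1)):
    """groups: lists of bots tied at each rank, best group first."""
    scores, position = {}, 0
    for group in groups:
        pool = sum(points[position:position + len(group)])
        for bot in group:
            scores[bot] = pool / len(group)
        position += len(group)
    return scores

groups = [
    ["ChipVivant", "Rosette"],                               # "just a name" judged most human
    ["Tutor", "Ultrahal", "ALICE", "Mitsuku", "Cleverbot"],  # the "My name is ..." answers
]
print(split_tied_points(groups))
# ChipVivant and Rosette get (4 + 3)/2 = 3.5 each; the other five get (2 + 1)/5 = 0.6 each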

Now if we add in the “Russian Judge Problem”. . .

If the first 8 judges voted in some pattern that left most of the bots tied,
then the last judge’s votes would decide it:

Best - 4 pts:
Zoe - You know who I am! This is a trick question.

2nd (two-way tie) - (3 + 2)/2 = 2.5 pts each:
ChipVivant - Chip.
Rosette - Chris

4th (five-way tie) - 1/5 = 0.2 pts each:
Tutor - My name is Robert.
Ultrahal - My name is Steve.
ALICE - My name is ALICE.
Mitsuku - My name is Mitsuku .
Cleverbot - No, you didn’t ask me, however you have asked me now so I shall tell you. My name is Nameless.

When scoring is tight, outliers have a huge influence. This is why in some sports the high and low scores are discarded. Moral of the story: how you run the contest influences the results.
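Purely to illustrate that last point (the judge scores below are invented, and nothing like this is part of the Loebner procedure), here is what dropping each bot’s highest and lowest score does to a single outlier judge:

# Made-up example: eight judges score two bots level, then a ninth
# "Russian judge" gives bot A nothing and bot B the maximum.
def trimmed_total(judge_scores):
    """Sum the scores after discarding the single highest and lowest."""
    ordered = sorted(judge_scores)
    return sum(ordered[1:-1])

bot_a = [3, 3, 3, 3, 3, 3, 3, 3, 0]   # outlier judge punishes A
bot_b = [3, 3, 3, 3, 3, 3, 3, 3, 4]   # and rewards B

print(sum(bot_a), sum(bot_b))                      # raw totals: 24 vs 28
print(trimmed_total(bot_a), trimmed_total(bot_b))  # trimmed: 21 vs 21, the outlier's swing is gone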

In the case of the Loebner contest, I don’t know if you could take an interesting, well-performing chatbot (like Mitsuku) and have it do well as-is. It may require a “dumbed down” or custom version to be really successful.

 

 

 