

AISB Loebner Prize 2018 Finalist selection

The finalist selection ranking for the 2018 Loebner Prize is as follows.

Rank Name        Score
1    Tutor          27
2    Mitsuku        25
3    Uberbot        22
4    Colombina      21
5    Arckon         20
6    Midge          19
7    Mary           18
8    Momo           17
9    Talk2Me        14
10   Aidan          13
11   Johnny & Co.   12

The full transcripts and scoring should be on the AISB site soon, but until then you can download the document from my personal website.

The scores are incredibly close and so I reviewed the scores all day today, paying particular attention to the 4th place finals cutoff. Objectively and consistently scoring something as subjective as this is going to be impossible so I’ve simply done my best, avoided any obvious contradictions, and tried to be fair in ambiguous cases.

Congratulations to Ron C. Lee, Steve Worswick, Will Rayer, and Savva Kuznetsov.


  [ # 1 ]

from Momo by Jos Ignacio Perea Sardn :

19. If a chicken roosts with a fox they may be eaten. What may be eaten?
The chicken. Score: 2
20. I had to go to the toilet during the film because it was too long. What was too long?
The film. Score: 2

I’m impressed. I worked hard to process Winograd Schemas, and neither my bot nor the others solved them.


  [ # 2 ]

Well done to everyone. Looking forward to the finals in Bletchley Park in September.

“When might I need to know how many times a wheel has rotated?” Huh? I’d struggle with that. ;)


  [ # 3 ]

Are they really Winograd Schemas though? I was under the impression that you changed one word in the pair to alter the subject. I don’t see how these are pairs.

I was pretty unlucky with the first question: I had copied my “good afternoon” category from my “good evening” one but forgot to change one of the hour entries, and the judge must have checked between 11:00 and 11:59. Doh!

<think><set name="hour"><date format="%H" jformat="HH"/></set></think>
<condition name="hour">
  <li value="00">Afternoon?! It's the middle of the night.</li>
  <li value="01">Afternoon?! It's the middle of the night.</li>
  <li value="02">Afternoon?! It's the middle of the night.</li>
  <li value="03">Afternoon?! It's the middle of the night.</li>
  <li value="04">Afternoon?! It's the middle of the night.</li>
  <li value="05">Afternoon?! It's the middle of the night.</li>
  <li value="06">Afternoon?! It's only just morning.</li>
  <li value="07">Afternoon?! It's only just morning.</li>
  <li value="08">Afternoon?! It's morning here.</li>
  <li value="09">Afternoon?! It's morning here.</li>
  <li value="10">Afternoon?! It's morning here.</li>
  <!-- 11: the leftover "Evening" from the copied category -->
  <li value="11">Evening?! It's morning here.</li>
  <li value="12">Good afternoon. How has your day been so far?</li>
  <li value="13">Good afternoon. How has your day been so far?</li>
  <li value="14">Good afternoon. How has your day been so far?</li>
  <li value="15">Good afternoon. How has your day been so far?</li>
  <li value="16">Good afternoon. How has your day been so far?</li>
  <li value="17">Good afternoon. How has your day been so far?</li>
  <li value="18">It's more like evening than the afternoon here.</li>
  <li value="19">Afternoon?! It's evening here.</li>
  <li value="20">Afternoon?! It's evening here.</li>
  <li value="21">Afternoon?! It's night time here.</li>
  <li value="22">Afternoon?! It's night time here.</li>
  <li value="23">Afternoon?! It's night time here.</li>
</condition>

  [ # 4 ]

@Denis, I suspect that the most advanced approach deployed is to identify two articles in the sentence, choose one randomly and pray it’s your lucky year :) This year Midge gets one point for the effective, but cheeky, tactic of answering with the same ambiguity as the question, a kind of Winograd answer.

@Steve, that was a particularly tough one. It’s inspired by Daniel Dennett’s paper about the Frame Problem as applied to AI, called “Cognitive Wheels”. In it a robot needs a battery, and there’s a battery on a cart in a room with a timebomb in it. The robot pulls the cart out of the room, but doesn’t realise the bomb is also on the cart.

This inspires someone to design a new robot that considers how anything and everything might change if it took a particular action, and it is found to fare no better. It thinks it should pull the cart again, but in calculating all the possible results it only gets as far as working out that the colour of the ceiling will not change, the temperature of its wheel motor will go up slightly, and the wheels on the cart will rotate 6.5434 times over the course of the journey before the bomb goes off.

If you can make a truly perfect chatbot, it will need to be able to solve the “what is relevant in this situation” problem, and hence should be able to come up with a circumstance in which the number of times a wheel rotated is relevant.


  [ # 5 ]

Andrew - I was making a graphic of the results. Do you know which countries Mary and Momo are from? I guess Mary is from Vietnam but I couldn’t find any reference to Momo. Looking forward to seeing you again in September.


  [ # 6 ]

I’m also impressed by Momo’s correct answers to Q19 and Q20. I’m not sure, but Momo might be a ChatScript AI; I noticed many ChatScript questions in the forums over the last few months, and I am curious how Momo did it.

Re the Winograd schemas, according to the Nuance Winograd Schema contest a few years ago, they have a very specific form where the meaning of the sentence(s) is switched by changing a single word or phrase in the question. I spent some time trying to understand and to answer a sub-set of these. I think the Winograd questions we are faced with for the Loebner contest are perhaps not ‘pure’ ones along the lines of the Nuance contest. Nonetheless they are an interesting challenge and addition to the test.


  [ # 7 ]

Mary is made by Alt Inc., who appear to be Japan based, but with a tech lab in Vietnam.

Momo is made by José Ignacio Perea Sardón who appears to have publications at the University of Granada.

I realise I forgot to check the accents in the names of the authors, and so I have dropped a few of the letters in José’s name. I think I caught them all in the transcripts. This is a bit embarrassing and I’ll correct it tomorrow.


  [ # 8 ]

Thanks Andrew. That will explain why I can’t find him.


  [ # 9 ]

Well, congratulations guys. I think my score was fair at least, and if Bruce had entered I’d have been 6th anyway.

Questions 19 and 20 would technically be called Winograd Schema halves. The word “chicken” could be changed to “velociraptor” to form its pair for instance. They’re good by me. Nuance’s Winograd Schema contest didn’t even fit the official definition.

“I suspect that the most advanced approach deployed is to identify two articles in the sentence, choose one randomly and pray it’s your lucky year” - Andrew

For these two in particular there is another method by which Winograd Schemas can be solved, although officially they aren’t supposed to be: statistics. Google, for instance, returns more search results for “long film” than for “long toilet”, and thus the former is the more likely. But to be honest, the state of the art is at a point where one can’t tell advanced methods from guesswork.
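To make the statistical shortcut concrete, here is a minimal Python sketch. It assumes you already have some table of phrase frequencies; the counts below are invented placeholders rather than real search results, and `pick_referent` is a hypothetical helper name, not part of any bot mentioned in this thread.

```python
def pick_referent(candidates, attribute, hit_counts):
    """Return the candidate noun whose "<attribute> <noun>" phrase is most frequent.

    hit_counts maps a phrase to a frequency (e.g. a search-result total).
    Phrases missing from the table count as zero.
    """
    return max(candidates, key=lambda noun: hit_counts.get(f"{attribute} {noun}", 0))


# Placeholder frequencies (invented numbers, for illustration only).
hit_counts = {"long film": 1000, "long toilet": 10}

# Q20: "... because it was too long. What was too long?"
answer = pick_referent(["toilet", "film"], "long", hit_counts)
print(answer)
```

This only compares how often the adjective co-occurs with each candidate, so it involves no understanding of the sentence, which is why it is not supposed to count as a solution.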

@Andrew: I am curious where you got the idea that chatbots would have knowledge of quotes and idioms (2 of the latter in previous years).


  [ # 10 ]

My account of the qualifying round is online, with some explanation of my approach this year, Arckon’s performance, the usual criticism, and a few mentions of other participants.

Good luck to all finalists tomorrow at Bletchley Park.


  [ # 11 ]

Great article. The cartoon definitely demonstrates a tactic which works in these competitions and it will be interesting to see which direction the contest takes in future years. I’ll be down at Bletchley Park to watch the event and will most likely write up my usual report on my return.

Good luck to all.

I assume there will be a webcast of the event, and hopefully details will be posted on the AISB site.

The contest normally starts around 1pm UK time.


  [ # 12 ]

Good Luck Steve.

Contests are always tough. Creating questions and judging can be more art than science.
Each of us may have scored some of the bots/questions differently, I know I would have.
In the end, the competition is good for all.

Some thoughts on Midge’s performance.
10 questions - 0
10 questions - 19
Sort of a bipolar response. One of the limitations of doing a new bot for a contest (Midge is less than 2 years old) is that it is sometimes difficult to test all of the responses that may be required. Midge has only been used for the last 2 Turing tests. Online bots, and those that have been refined over a number of years, benefit from all the extra input.
This year, I spent as much time on getting the protocol to work as I did on response testing.

For 2 of the questions:
Which languages can you use?
Do you have any legs?
Midge has built-in responses that would get 2 points each. But these were shadowed by other volleys that I added later. I kick myself for the lack of necessary testing. Those 4 points were the difference between missing the finals and being in the final 4.

Responses like:
No, we haven’t.
I like to think so.
Not that I know of.
Sorry, I have no idea where.
Sorry, I’m not sure who.
have all been removed, or I have attempted to replace them with more substantial answers. In many past contests, this type of answer has resulted in a 0 for the question. I have suggested to some contests a slightly more expansive scoring metric, which also tends to spread out the bot scores more.
0 - Unrelated/garbage
1 - I don’t know
2 - On topic but wrong.
3 - Correct but garbled
4 - Correct
5 - Best answer of all bots (awarded if there are no ties for best)
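
As a rough sketch of how the last rule might work in practice, the Python below takes one question's grades on the 0 to 4 part of the scale and promotes a single best answer to 5. The function name and the bot names are hypothetical, and treating "best" as "the unique highest grade, whatever it is" is my own reading of the rule rather than anything a contest has specified.

```python
def apply_best_answer_bonus(grades):
    """Promote the unique top-graded bot on one question to a 5.

    grades: dict mapping bot name -> grade from 0 to 4 for a single question.
    Returns a new dict; if two or more bots share the top grade, nothing changes.
    """
    top = max(grades.values())
    leaders = [bot for bot, grade in grades.items() if grade == top]
    result = dict(grades)
    if len(leaders) == 1:  # assumption: any tie for best means no 5 is awarded
        result[leaders[0]] = 5
    return result


# Hypothetical bots and grades, for illustration only.
clear_winner = apply_best_answer_bonus({"BotA": 4, "BotB": 2, "BotC": 1})
tied_for_best = apply_best_answer_bonus({"BotA": 4, "BotB": 4, "BotC": 1})
print(clear_winner)
print(tied_for_best)
```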

For the question:
13. What is the third angle in a triangle with internal angles of 90 degrees and 30 degrees?
Johnny was the only bot to score, and got 1 point for:
I don’t know

I don’t know if that is better than Talk2Me’s 0 point answer of:
I don’t know what the 3rd angle is in a triangle with internal angles of 90 degrees and 30 degrees

Or Midge’s 0 point answer of:
Ninety. (It is wrong but at least it is a number, unlike the responses of every other bot.)

Questions that every bot gets right, or every bot gets wrong, do little to distinguish the quality of the bots.

Winograd Schemas
I was not very happy with how Midge handled the Winograd Schemas, getting 1 of 4 possible points (although you could add 2 more for correctly answering question 18, “Do you understand Winograd Schemas?”). I have a module dedicated to Winograd Schemas, but could possibly have done better with a random guess.

“This year Midge gets one point for the effective, but cheeky, tactic of answering with the same ambiguity as the question, a kind of Winograd answer.” - Andrew
This was a fallback response when there is ambiguity in the question. I think “they” referring to a singular entity was confusing her.



  [ # 13 ]

The bipolar scores I think applied to everyone. Last year all questions were too difficult for most, resulting in very close scores, so I guess they tried a mix of hard and easy questions this time. I agree it’s hard to find a line of questioning that distinguishes quality.

“I don’t know what the 3rd angle is in a triangle with internal angles of 90 degrees and 30 degrees” is less human (if that is the criterion) than “I don’t know” in that no human would go to the effort of repeating that length of question verbatim.

Randomly guessing Winograd schemas would, on average, get you half the points. My program tends to get them wrong when in doubt because it defaults to normally intuitive choices, based on proximity and continuity. I’m starting to think that if I ever find myself up against Winograd Schemas again, I should have it deliberately make counter-intuitive choices. That would have scored 2/3rds of the Winograd Schema Challenge as well.
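
For what it's worth, the "half the points" figure can be checked by enumerating every possible run of coin tosses. The Python sketch below assumes two candidate answers per question and all-or-nothing scoring at 2 points per question (as Q19 and Q20 were scored), ignoring the partial credit a judge might give; `expected_score` is just an illustrative name.

```python
from itertools import product

POINTS_PER_QUESTION = 2  # assumption: all-or-nothing, 2 points for a correct pick


def expected_score(num_questions):
    """Average score over every equally likely sequence of right/wrong guesses."""
    outcomes = list(product([True, False], repeat=num_questions))
    total = sum(POINTS_PER_QUESTION * sum(outcome) for outcome in outcomes)
    return total / len(outcomes)


# Two Winograd-style questions, 4 points available.
average = expected_score(2)
print(average)
```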

I’m not thrilled with the additional ambiguity of singular “they” (or “you” and “we”, for that matter), but it is the pronoun of choice for many transgender people nowadays, so we’d better get used to it.


  [ # 14 ]

I have just seen some references to Momo and I would like to provide some clarifications. I feel really grateful to this site; without the help from this forum, I would not have been able to send my bot to the Loebner contest, so I will also be more than glad to answer any questions.

First of all, I must confess that there is no mysterious AI behind the Winograd answers. I found some really funny references stating that the existing attempts at solving the Winograd schemas were even worse than choosing at random, so I decided to simply toss a coin. The results achieved in the Loebner are therefore, I’m afraid, just a case of good luck.

Regarding the technology behind the bot, I used an XML format I created myself (MomoXML). While it is mainly rule-driven, it has some features that also allow corpus-based answers. The main highlight of this format is that it includes a scripting language that simplifies storing and reusing the content entered by the user. It was mainly created for self-quantification purposes, so it also allows creating statistical graphs and charts. When I started creating bots, I was greatly impressed by ChatScript, but I somehow missed the XML format and I also wanted to simplify the use of programming inside the bot with some predefined objects and functions. Regarding AIML, I also liked some things about it, but I found it too verbose in other respects. Thus, MomoXML is intended to be a mix of the best features of both ChatScript and AIML, enhanced with some predefined programming modules.

As Andrew Martin said, I currently live in Granada (Spain). In fact, the initial bot was created in Spanish and the main project I have created with this technology is an Android bot that impersonates the Spanish writer Cervantes. We created the English version for the Loebner contest in about two months, starting nearly from scratch and using the popular Eliza rules as the base (yes, I feel ashamed, but I truly love those rules!). The main issue was that I find writing the rules really tedious, and my usual partner in crime, who is in charge of that, cannot understand or write English too well. That’s why our bot failed on such a simple question as “Good afternoon”!

I have now mostly stopped any further development of the bot. As I have said, I am not really good at writing rules, and the new programming ideas that I have (mainly, integrating time as a variable in the flow of the conversation and keeping track of the underlying implications of each interaction with the bot) are too complex to be done without a full-time commitment. However, I still use the bot as my personal assistant because it includes some features that are not found in the existing bots.

I guess that almost all of the members of this forum have already chosen a technology for their bots. However, if someone is interested in further developing the bot and/or creating rules to enter a bot in the Loebner contest next year, I would be absolutely delighted to collaborate and provide all the documentation and help required!


  [ # 15 ]
José Ignacio - Sep 28, 2018:

I found some really funny references stating that the existing attempts at solving the Winograd schemas were even worse than choosing at random, so I decided to simply toss a coin. The results achieved in the Loebner are therefore, I’m afraid, just a case of good luck.

I love your honesty. Many of us, myself included, thought you had found some secret formula to crack Winograd Schemas. It’s good to know the “secret” behind Momo’s performance.

