
Loebner Prize Report 2016
 
 

I’ve put together some notes on my experience of this year’s Loebner Prize final.

I arrived at Bletchley Park at around 9:45 am and, after signing in, went to the Education Centre at Block B, where this year’s contest was to be held.
The 4 AI laptops were up and running and after saying hello to Hugh and the organisers, I set up Tutor and Mitsuku on their allocated machines. The installation process went extremely well and I had both bots installed and configured to run within 10 minutes.

This year, the AISB were installing some extra node.js software on the machines, which I believe was related to the live webcast as well as to removing the network delay that plagued the competition in 2014. Unfortunately, this software caused lots of problems: I was unable to test that Mitsuku and Tutor were working without network delays when talking via the judge program until around 1:20pm; the contest was due to start at 1:30pm.

Eventually, all the problems with it were resolved and the contest started around 1:50pm, 20 minutes after the scheduled start time.

This year’s judges were:

Joanne Pransky: World’s First Robot Psychiatrist - http://www.robot.md
David Boyle: Author of “Alan Turing: Unlocking the Enigma”
Jow Hewitt: Brand Strategist, Landor - http://landor.com
Tom Cheshire: Technology Correspondent with Sky News - http://news.sky.com/technology

The human confederates were:

Lisa Guthrie - Kingston University
Emily Donovan - 2MUL
Memo Akten - Goldsmiths
Prashant Aparajeya - Goldsmiths

The details of the judges and confederates were withheld, as the last page of the handout was deliberately unavailable until after the contest. I suggested this last year, as otherwise it would be easy for a judge to do something like:

Judge: What is your name?
Entity: I am called Maria
Judge: Well according to the handout, none of the confederates are called Maria and so you must be a bot.

Round 1 got underway with no hitches. However, about 10 minutes into the round, the organisers noticed that Rose had a “debug error” message displayed on the screen. She appeared to be responding to the judge’s messages, and her replies were being displayed to the judge, so they decided to leave her running.

In round 2, I noticed from watching the output on the webcast that Arckon was producing lots of dduupplliiccaattee letters. I can only assume this was due to the program not clearing the program folder quickly enough after processing each character. I mentioned it to the organisers, but there was little they could do, as it seemed to be the program itself. Unfortunately, this meant Arckon was producing nonsense and didn’t stand much of a chance this year.

Rose had the same error message on screen as round 1 but still seemed to be working.

Round 3 saw the judge talking to Rose leave the room about 5 minutes into the round, saying he had something he needed to do. I voiced my opinion that this was unfair towards Rose and that we should wait for him to return and restart the round, but it appears that Rose wasn’t responding at all at this point. Dr Keedwell tried to contact Bruce, and I posted a message on chatbots.org and Twitter mentioning this, but I don’t believe Bruce saw it.

Round 4 saw Mitsuku fail to respond to any of the judge’s input. I checked the program folder and could see the judge was talking to her, but Mitsuku refused to respond. Fortunately, as I was on site, I could soon remedy the fault, which was down to a config file in the organisers’ node.js setup that needed amending between rounds, and she started working again after about 5 minutes.

The results saw three of the judges score Mitsuku in top place, while one judge ranked Rose as the top bot. This meant that Mitsuku won with a score of 1.25 (the best possible score is 1; lower is better).

The final placings were:

1 - Mitsuku
2 - Tutor
3 - Rose
4 - Arckon

Sky News were filming at the event and broadcast several times during the day on their TV channel. A company from Channel 4 television here in the UK was also filming for a documentary. The event was well attended by the public, who seemed to enjoy the exhibits: a NAO robot, software that let them make their own collages from internet photos, and even the opportunity to take a Turing Test with a chatbot themselves.

Suggestions for next year:

1 - Do a mock set-up of the computers before the contest day. This would allow plenty of time to sort out any issues, as it looked like the contest was going to be delayed by quite a while this year. Luckily, it was only 20 minutes.

2 - The programmers should be available on standby to assist the organisers in case of difficulties, or at least have a representative at the contest. Had I not been on site to fix Mitsuku, she would have missed an entire round. Technical issues with the programs meant that, in reality, it was a contest between Mitsuku and Tutor this year, and although I am naturally overjoyed to have won, it would have been better to win with all 4 bots showing their full potential.

 

 
  [ # 1 ]

So, a lot of malfunctions all around. I was standing by at my mailbox during the first half hour, and had told the organisation as much, but they didn’t contact me. Even so, there’s nothing in my programming that could have produced such results, so I couldn’t have fixed it either.
I could tell that the scrambling of messages occurred on both sides, as Arckon constantly commented on receiving invalid “words” (i.e. consisting of a single consonant or without vowels). Even if I could compensate for out-of-sequence input, I cannot influence how a simple incrementally numbered output “What does ellho mean?” ends up as “Whdoat eslloh eanme?” on the webcast. That said, I had restricted the organisation from testing the communication with anything more than “Hello”, in order not to affect the later conversation; otherwise the problem might have shown up earlier. Whether it can be solved is a matter I will work out with Dr Keedwell. I suppose next year I could add a “test mode”.

So, a mixed congratulations from me as I do not feel involved with the outcome of this contest. I did expect Mitsuku or Rose to win even if things had been different, so I do believe it is a deserved victory and I hope to be entertained by Mitsuku’s transcripts.

 

 
  [ # 2 ]

Here’s a theory:
My program can only ever ask a single question per output. In Sky News’ video, however, I see this output:

WhWhaatt adroees yokuo whnaeampeigd? mean
?

This appears to be a mix of its default responses “What are you named?” and “What does kowheaig mean?”, or whatever the judge’s input was scrambled into (from “know what a misspelling is?”, presumably). We also see two question marks, which are impossible by hardcoded procedure. Hence, somewhere down the pipeline, one message was held in transit and released simultaneously with the second message once resources opened up, mixing the two and appearing to output double letters. My theory is that whatever problems Steve mentioned with the node.js software were not fixed on my program’s computer, because I’d told them not to mess with it. The question is whether I was too cautious or not cautious enough.
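To make that concrete, here is a toy reconstruction in Python (mine, not the contest software) of the effect: two per-character messages held in transit, then released together and drained alternately, so that every letter appears doubled.

    from itertools import zip_longest

    # Hypothetical reconstruction of the garbling: two buffered messages
    # released at once and read one character from each in turn.
    msg_a = "What are you named?"
    msg_b = "What does kowheaig mean?"

    interleaved = "".join(
        ch
        for pair in zip_longest(msg_a, msg_b, fillvalue="")
        for ch in pair
    )
    print(interleaved)  # starts "WWhhaatt  adroee..." - seemingly doubled letters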

On another note, I suspect Bruce was sound asleep in his time zone during the contest, and I don’t think the prize money covers a US-UK flight, so I’d prefer to let creators choose to be absent at their own risk. A somewhat risky alternative for getting in touch with creators when their bot crashes might be to call for help through the webcast itself.

It’s good that you thought of withholding the confederates’ names. I vaguely recall that they used to tell the confederates to use an alias during conversation; perhaps that would be easier next time.

 

 
  [ # 3 ]

Last year, Lisa had similar problems for 2 rounds.

After I checked the logs, I found that she was receiving the input and writing the output correctly (I had her keep a local log alongside the output via the Loebner protocol). It looked like something was locking the output directory and preventing Lisa from writing to it. 2 rounds went that way, with almost no output reaching the judges (although the program did generate output).

My guess is that it has something to do with the webcast program, since the problem never seems to show up in the preliminary rounds and many of the finals have had issues over the past few years. I had a number of discussions with the Loebner people, and they told me nothing locks the output directories.

It is probably long past time that the Loebner people re-think the interface protocol and methodology. Unfortunately, in the past they have been unwilling to do so.

If every year has some problem that restricts the competitors, the contest is not highlighting the best of the technology. 

 

 
  [ # 4 ]

Yes, I was sound asleep. And until they send me logs, I can’t tell whether Rose was generating output that wasn’t being delivered (like Steve’s case) or not. Sad to see that things are still a mess and that the new software they add just continues to add problems.

 

 
  [ # 5 ]

There should be no duplicate letters even if the comm program does *not* remove previous sub-directories, since there is a table of sub-directories keyed on the unique numerical prefix, and the first two actions after detecting a new key press are: test whether it is in the table → yes: do nothing → no: enter it in the table and show it in the window.

One way of debugging is to remove the rmdir command and use the list of sub-directories as a log.
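A minimal sketch of that check in Python (assuming each key press arrives as a sub-directory named “<number>.<char>.<sender>”, the format quoted later in this thread):

    import os

    seen = set()  # table of numerical prefixes already shown in the window

    def poll(comm_dir):
        for name in sorted(os.listdir(comm_dir)):
            prefix = name.split(".")[0]        # unique numerical prefix
            if prefix in seen:                 # yes: in the table, do nothing
                continue
            seen.add(prefix)                   # no: enter it in the table...
            print(name.split(".")[1], end="", flush=True)  # ...and show it
            # debug variant: never rmdir, so the directories double as a log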

 

 
  [ # 6 ]

I agree that it would be better to set up a day (or days) in advance. However, this year the computers were hired, and Bletchley has a relatively late opening time and an early closing time.
I wonder how many people viewed the competition online. I’m not sure it’s worth the effort.
Last year’s buffering problem was the same for the humans but didn’t cause a problem for the judges or confederates. An “intelligent” bot should be able to deal with it.

 

 
  [ # 7 ]
Merlin - Sep 20, 2016:

My guess is that it has something to do with the webcast program since it never seems to show up in the preliminary rounds and many of the finals have had issues over the past few years.

In the qualifying round, the chatbots and the judge program reside on the same computer and communicate without a network. If I’d known there were still issues last year, or that there was going to be new software, I would not have been so trusting that 2014’s network problems were fixed and that all would work as well as in the qualifying round. Clearly I am not paranoid enough.

All I know is that it’s apparently nothing to do with my programming. It operated perfectly in asking what all this garble was that it was getting. However, I will make sure to raise the program’s intelligence so that it sets off an incessant siren, holds the facility hostage and demands to see a network technician the next time it detects technical difficulties.

The webcast needn’t stay on my account. It was a nice idea but it’s only complicating matters.

 

 
  [ # 8 ]

The node.js software seemed to be acting as a “middle man” between the judge program and the program communications folder on the AI desktops. I’m not sure the node.js is entirely to blame, as the 2 AIML bots handled it ok. The round 4 error for Mitsuku was due to the round number not being manually changed in the config.js script between rounds 3 and 4.

However, if the software was purely for the webcast, it added an extra level of complexity and I would advise removing it. For future contests, perhaps simply point a camera at the judge’s screen and broadcast via YouTube or Facebook Live.

The contest ran extremely smoothly in 2015 with no network issues I am aware of. All the problems of 2014 had been resolved.

By mock set up, I mean create the contest setup in the AISB offices (or wherever) with the PCs, network switches, bots and judge program. Then once working, the whole kit can be transferred to Bletchley, with all the software pre-installed, configured and working. This would surely be less stressful than trying to set it all up on the day with a deadline to meet.

 

 
  [ # 9 ]

The contest ran extremely smoothly in 2015 with no network issues I am aware of. All the problems of 2014 had been resolved.

Lisa did have network problems in 2015; rounds 2 and 3 were a disaster. As in earlier years, the bot got the input but could not write to the output folder. I provided detailed thoughts on what I believed was going on. The AISB said they would look into it and try to resolve it. The use of Node.js may have been an attempt to eliminate the problem.

Partial email from me:

The old Judge program that I have has a loop that goes:
      Use “glob” to get the written directories (I don’t know how glob works or whether it could lock a directory during reading)
      Count the characters received
      For each character:
        Remove the directory with “rmdir”
        Split the directory string to extract the character
        Send the character to the screen
        Write the character to the log
  Since there are 4 things going on in the critical path with the rmdir, I can see how, if the program does not correctly release the directory until the end of the loop, a conflict/locked directory could exist.

As for my own part, I am confident that I have identified the issue, and if I were to enter and make the finals next year, I could put in a couple more redundant checks that would eliminate the problem for my bot.

But if there is a bot/OS/judge program interaction, then some other bot may stumble on it next year. Since the Loebner protocol is non-standard, any effort I put into it is not usable for anything else. Since there have been so many negative comments about the protocol, I am willing to invest some more time for the sake of the community and to help you try to make it more stable. On the other hand, if no one has the desire, then we can all call it a day and move on to more productive tasks.
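For reference, a runnable Python rendering of the loop in that email might look like this (the directory layout and names are my assumptions, not the actual judge program):

    import glob, os

    def judge_loop(comm_dir="comm", log_path="judge.log"):
        # Use "glob" to get the written directories
        dirs = sorted(glob.glob(os.path.join(comm_dir, "*")))
        # Count the characters received, then handle each one
        for d in dirs:
            os.rmdir(d)                             # remove the directory
            ch = os.path.basename(d).split(".")[1]  # extract the character
            print(ch, end="", flush=True)           # send it to the screen
            with open(log_path, "a") as log:        # write it to the log
                log.write(ch)
        # If rmdir's handle isn't released until the loop ends, a bot creating
        # the next directory could find the folder locked.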

2013:

Bruce Wilcox - Oct 8, 2013:

I have received Rose’s logs from Loebner 2013.

Round 2: the connection obviously broke between the judge and the machine after the 1st message, so nothing further happened. Various restarts were tried; Rose saw the judge’s messages and replied, but the judge didn’t see Rose’s output. This happened to Suzette once during her Loebner outing: several restarts were tried, and it wasn’t until the judge machine was restarted that communication worked again. I see no evidence that the judge’s machine was ever restarted.

Round 4: Rose was NOT restarted from scratch, so the conversation continued from the prior judge’s conversation, with Rose unaware that this was a new conversation.

Don Patrick - Oct 9, 2013:

Bruce, I’ve come across the restart problem myself. Whenever you reset a chatbot, the Judge Program MUST also be reset to synchronise the communication through the LPP subdirectories (the numbered letters).

2014:

Steve Worswick - Nov 20, 2014:

Some of the lags I saw were up to 10 seconds.

It’s a shame the official logs don’t show what actually happened, as it makes our bots look pretty dumb in the public eye.

Bruce Wilcox - Nov 20, 2014:

Normally I look forward to debugging the logs to see what I can improve. This year, Rose’s logs are an unmitigated pile of carp, and similarly the online logs of the humans are meaningless to me. Rose may have been the unanimous pick of the judges, but I have no clue why. She could equally have been the unanimous fourth-place pick.

2015:
Lisa locked out for rounds 2 and 3.

2016:
Errors in all 4 rounds (see Steve’s post)

Hugh Loebner - Nov 17, 2014:

The bottom line is that the AISB will (I fervently hope) be in charge of the contest, and they are the people to whom to address your requests for change.  I did not require that the LPP be used in the future, and there have been mutterings about changing the protocol at some point in the future.

Shall we formally ask the AISB for a change in networking/protocol (again)?
Most of the entrants are on Chatbots.org; if we can come up with a better solution, wouldn’t it be better for everyone?

 

 
  [ # 10 ]

My ideal interface would be for the programs to run in the background on the judge computers, for inputs to be sent whole and only on pressing enter, and for the judge program to be revised to leave some memory resources for the other programs. We know Hugh’s reasons, of course, but the matter is indeed up to the AISB. I do believe that, after all these years of malfunctions, the solution is to do away with the unnecessary technicalities.
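For comparison, sending whole messages is nearly trivial; a hypothetical Python sketch over an ordinary socket (the host and port are made up):

    import socket

    def send_line(line, host="localhost", port=5005):
        # one message per enter press, newline-terminated; no per-character
        # directories, so nothing to lock or deliver out of order
        with socket.create_connection((host, port)) as conn:
            conn.sendall((line + "\n").encode("utf-8"))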

 

 
  [ # 11 ]

Just wondering, Merlin, Bruce: how do you number your outputs? Like this?

0000000001.h.other
0000000002.e.other
0000000003.l.other
0000000004.l.other
0000000005.o.other
 

 
  [ # 12 ]

Not quite.

I take 100,000 plus the time in milliseconds as a base, and increment from there.
That way, even on a restart, the number should be bigger than any before it.
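One way to read that scheme, as a sketch on my part (the “.other” suffix follows the directory format quoted in # 11):

    import os, time

    # Seed the counter from the millisecond clock so the numbers stay
    # unique even across a restart of the program.
    counter = 100_000 + int(time.time() * 1000)

    def emit_char(ch, comm_dir="comm"):
        global counter
        counter += 1  # increment from the base
        os.makedirs(os.path.join(comm_dir, str(counter) + "." + ch + ".other"))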

 

 
  [ # 13 ]

That rules out the problem being with not resetting the judge program then.

 

 
  [ # 14 ]

So I got the log files back from the Loebners, and largely they make no sense to me.  See attachment.

Judge1 has 10 volleys of pure junk from the judge, and then all is quiet. Maybe Rose died from boredom, I don’t know. Rose is started anew for round 2, but Rose sees nothing. The third round is NOT even labelled as a judge conversation, so I don’t know how it occurred, but it makes clear why a judge would have voted for Rose (and I presume it was the cyberpsychologist). There are a couple of places where data from the judge seems to have been delivered out of order.

At the end of that conversation, 10 minutes later, a new conversation seems to start up, without Rose having been shut down. This would presumably be round 4. Rose hears and replies, but apparently the judge cannot see Rose’s answers.

File Attachments
loebnerHuh.pdf  (File Size: 53KB - Downloads: 174)
 

 
  [ # 15 ]

Bruce - I don’t know if it’s any use, but the judge who scored Rose in top place was David Boyle, who had the conversation about “un oeuf”. He told Mitsuku the same joke.

I looked at your pdf.
The first log containing junk (and the internal message) seems to be a test log from the organisers, as round 1 didn’t start until 1:50pm.

The log that makes the most sense in your pdf seems to be a mixture of round 1 with David Boyle (I can see the “un oeuf” question, which he also put to Mitsuku) and round 2 with Joanne Pransky (I recognise the “bucket list” question, which was also asked of Mitsuku).
However, I noticed it’s timed at 3:44pm, which was the start of round 4.

The round 3 log probably doesn’t exist, as the judge stopped talking to both the human and Rose, saying he had something he needed to do.

I guess the last log in the pdf is round 4, as you say, since the judge stayed to talk this time.

 
