Controversial issues about the Loebner Prize
 
 

I was reading the Wikipedia article about the Loebner Prize and found the following comments about the contest:

“Within the field of artificial intelligence, the Loebner Prize is somewhat controversial; the most prominent critic, Marvin Minsky, has called it a publicity stunt that does not help the field along. In addition, the time limit of 5 minutes and the use of untrained and unsophisticated judges has resulted in some wins that may be due to trickery rather than to plausible intelligence.”

The article was written a few years ago, so some things may have changed since. For example, the time limit is now 15 minutes, not 5, right? However, “the use of untrained and unsophisticated judges” may not have changed. With the contest coming up, I wonder what standards are used in choosing judges. Do they get any training at all? I was at the 2011 contest, and I noticed that some judges quickly worked out which was the bot and which was the human, then stopped talking to the bot and engaged entirely in the chat with the human. As a result, the transcript with the bot was rather short. Could we have a rule that both transcripts should be about equally long? That would make it fairer when the judges compare the conversation logs of all four bots and rank them.

Of course I do not agree with everything it says. I don’t think any of the wins were due to trickery; they may have been due to accident, or chance. If we can give the judges appropriate training before the contest, it might help.

 

 
  [ # 1 ]

I believe each round is 30 minutes: 25 minutes of questioning followed by 5 minutes for the judges to decide, which also allows the bots to be set up ready for the next round.

The judges are certainly sophisticated though, being professors in AI and academics in general.

 

 
  [ # 2 ]

This year’s judges:

Judge Dr. Minhua (Eunice) Ma, Digital Design Studio, University of Glasgow (Scotland)
Judge Professor Emeritus Mike McTear, University of Ulster (Belfast, N. Ireland & Granada, Spain)
Judge Dr. Roger Schank, John Evans Professor Emeritus, Northwestern University & CEO, Socratic Arts, (Florida, USA)
Judge Professor Noel Sharkey, University of Sheffield (England)

If people have a problem with a 5-minute limit and untrained judges, then their problem is with the original Turing Test, which is odd because that is what everyone says we’re testing here. Alan Turing even hinted at the use of trickery (e.g. response delays), and described “average interrogators” (which the current Loebner Prize judges are far from). Some previous chatbots convinced some judges partly because they made deliberate typing mistakes. That’s probably what Wikipedia is referring to.

I agree it would be good if the judges were briefed better so that equal time would be spent on both conversants. Personally I would like to see each judge make a guess after 5 minutes as to which is the chatbot, and then continue for another 20 minutes and guess again. The first guess would show whether the chatbot passed the Turing Test as described by Alan Turing; the second, whether it passed the Loebner Prize version, whose limits are moved out of reach whenever a chatbot gets close to passing, at the suggestion of such people as Marvin Minsky, I should add.

 

 
  [ # 3 ]

I have a bit of a problem with the idea that the judges should be trained.  Of course, it’s good that they should adopt a consistent interpretation of the rules and setup of the competition.  However, I prefer the notion of “average interrogators”.

Looking over transcripts from previous years, I think there is often a strong UK cultural bias both in the judges and the human volunteers.  While the “anything is allowed” approach is quite valid, I believe a diverse and “average” group on both sides would provide a better illustration of the differences between humans and the current state of chatbots. 

My favourite example was an exchange which read roughly like this:
Judge: shocker about City last night, eh?
Human confederate: yes, what with all that money they spent.

I think most humans reading this forum would have trouble understanding this exchange, which of course proved in an instant that the judge was having a conversation with a human (British, male, living in the UK, a football fan).  I believe an interesting approach would be to use humans in their early teens as confederates (not to be confused with the Junior Prize, which is about having junior judges).

 

 
  [ # 4 ]

The example you give highlights a real problem: the judges also use cheap tricks that have no bearing on intelligence or human-likeness. Questions about recent events, the colour of their chair, or the weather will out a chatbot no matter how intelligent it is. This can only be avoided if updates or internet access were possible during the contest, or if the judges were prohibited from discussing events less than a week old.

On the other hand one might say that using UK-only judges narrows down the domain of the questions and is therefore theoretically easier to prepare for.

 

 
  [ # 5 ]

It’s a double-edged sword… judges that know too much vs. judges that know too little.

I have some minor experience with one chatbot contest: creating questions, judging answers, and coordinating agreement among the judges, which is akin to herding cats.

From observing several different contests over a period of years, I’ve questioned the participation of some of the judges. I’ve seen people selected from other disciplines: cartoon artists, language experts, people with advanced degrees in computer science, and business people. But from reading their comments, I had the feeling that none of them had ever chatted with a bot before, knew what bots’ limitations were, or understood how chatbots are constructed. It seemed to me that they either didn’t know what to expect, or they expected too much, especially at this level. Some judges seemed to think they’d be talking to a version of Watson, the IBM project that competed on the Jeopardy TV program.

I don’t think that mere academic credentials are enough, or that having the title “professor” attached to your name makes you an expert at judging any form of AI.  I’d like to see judges from a variety of fields, but there should be boundaries that relate to the kind of contest it is… what the contest is testing for.  I wouldn’t want my contest entry being scrutinized by a night club pole dancer, and I’d be equally disappointed by the involvement of a nuclear physicist… although the pole dancer might have a better grasp of everyday, conversational language.

 

 
  [ # 6 ]
Don Patrick - Aug 14, 2013:

.... The first guess would show whether the chatbot passed the Turing Test as described by Alan Turing; the second, whether it passed the Loebner Prize version, whose limits are moved out of reach whenever a chatbot gets close to passing, at the suggestion of such people as Marvin Minsky, I should add.

Turing did not claim that fooling a judge for 5 minutes constituted passing the test.  He wrote

“I believe that in about fifty years’ time it will be possible to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.”

This is simply a prediction, not a definition of passing.  He might as well have written that in 5 years’ time an average interrogator would not have more than a 5 per cent chance of making the right identification.  I believe that Turing would have felt that an “intelligent” computer could fool a judge for any length of time.

 

 

 
  [ # 7 ]
Don Patrick - Aug 15, 2013:

The example you give highlights a real problem: the judges also use cheap tricks that have no bearing on intelligence or human-likeness. Questions about recent events, the colour of their chair, or the weather will out a chatbot no matter how intelligent it is. This can only be avoided if updates or internet access were possible during the contest, or if the judges were prohibited from discussing events less than a week old.

On the other hand one might say that using UK-only judges narrows down the domain of the questions and is therefore theoretically easier to prepare for.

The problem with permitting web access during the contest is the possibility of a human providing responses.  I have no problem with providing contestants a reasonable amount of time before the contest, on the day of the contest, to input local conditions (number of spectators, room color, weather, etc).

What’s the problem with “cheap tricks”?  If a human can solve them, then the computer must be able to also.  Get over it.

 

 
  [ # 8 ]
Hugh Loebner - Aug 16, 2013:

Turing did not claim that fooling a judge for 5 minutes constituted passing the test.  He wrote

“I believe that in about fifty years’ time it will be possible to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.”

This is simply a prediction, not a definition of passing.

I see. That part of the article is what I based my beliefs on, yet you make a valid point. I guess the terms are a matter of interpretation then.

I understand that an internet connection could be used for remote cheating, so I am pleased to hear that chatbot creators will have the opportunity to update their bots on the day of the contest; that pretty much covers it.
My problem isn’t so much with computers or judges using tricks as it is with the Turing Test being widely regarded as a test of intelligence. If an AI fails the test because it does not have a pair of human eyes to glance at the weather, or passes because it makes typos, what conclusions could we draw about its intelligence? I suppose it’s no use complaining, but that is where the controversy comes from.

 

 
  [ # 9 ]

Contests of this nature will always generate a certain degree of controversy, simply because not everyone has the same opinions regarding the nature of intelligence, whether human, computer, or otherwise. Trying to get all of us to accept the same criteria is like herding cats, i.e. it ain’t gonna happen.

 

 
  [ # 10 ]

My primary issue with the Loebner Prize and other similar contests is that people these days seem to put too much effort into creating contests true to the letter of the Turing Test whilst ignoring the spirit in which Turing proposed it. When I read Turing’s paper, I felt that the test itself was essentially incidental to what was really being said, which was that, in the absence of the ability to look inside the ‘mind’ of a computer and ‘see’ whether it was really conscious and/or intelligent, we should consider a computer to be intelligent if it acts intelligently (i.e. produces intelligent output).

Hugh Loebner - Aug 16, 2013:
Don Patrick - Aug 14, 2013:

.... The first guess would show whether the chatbot passed the Turing Test as described by Alan Turing; the second, whether it passed the Loebner Prize version, whose limits are moved out of reach whenever a chatbot gets close to passing, at the suggestion of such people as Marvin Minsky, I should add.

Turing did not claim that fooling a judge for 5 minutes constituted passing the test.  He wrote

“I believe that in about fifty years’ time it will be possible to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.”

This is simply a prediction, not a definition of passing.  He might as well have written that in 5 years’ time an average interrogator would not have more than a 5 per cent chance of making the right identification.  I believe that Turing would have felt that an “intelligent” computer could fool a judge for any length of time.

 

I definitely agree with this.

 

 
  [ # 11 ]

I have received Rose’s logs from Loebner 2013.

Each startup of Rose increments an id number stored in a file, so each judge automatically gets a unique judge id.  I have matched the logs against the visible online displays.  The first adult judge starts at log 9.
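If I understand the mechanism correctly, that counter works roughly like this. This is only a hypothetical sketch; the file name and format are assumptions, not Rose’s actual code:

```python
import os

def next_judge_id(counter_path="judge_id.txt"):
    """Read the last startup id from disk, increment it, and persist it.

    Because the counter survives restarts, every fresh startup (and
    therefore every judge session) gets a new, unique id.
    """
    if os.path.exists(counter_path):
        with open(counter_path) as f:
            last_id = int(f.read().strip())
    else:
        last_id = 0
    new_id = last_id + 1
    with open(counter_path, "w") as f:
        f.write(str(new_id))
    return new_id
```

A persistent counter like this also explains why the first adult judge can start at log 9 rather than 1: earlier test startups would already have consumed the lower ids.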

Round 1 was a normal round.

In Round 2 the connection obviously broke between the judge and the machine after the first message, so nothing further happened.  Various restarts were tried; Rose saw the judge’s messages and replied, but the judge didn’t see Rose’s output. This happened to Suzette once during her Loebner outing, and on that occasion communication only worked again once the judge’s machine was restarted. I see no evidence that the judge’s machine was ever restarted this time.

Round 3 was a normal round.

In Round 4, Rose was NOT restarted from scratch, so the conversation continued from the prior judge’s conversation, with Rose unaware that this was a new conversation.

 

 
  [ # 12 ]

Bruce, I’ve come across the restart problem myself. Whenever you reset a chatbot, the Judge Program MUST also be reset to synchronise the communication through the LPP subdirectories (the numbered letters).

Your program, when reset, probably starts creating LPP subdirectories from number 000000001 again, but the Judge Program will not pick up on numbers smaller than where it last left off. So if the connection was lost at the 100th letter, the Judge Program will only be looking for letter 101 and up, and will be blind to letters 1, 2, or 99 until it too is reset.
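To illustrate, here is a minimal simulation of that failure mode. The names are mine and this is not the actual LPP implementation, just the high-water-mark behaviour described above:

```python
class JudgeReader:
    """Toy model of the Judge Program side of the numbered-letter protocol."""

    def __init__(self):
        self.last_seen = 0  # highest letter number processed so far

    def poll(self, letters):
        """Return only letters numbered above last_seen, in order."""
        fresh = sorted(n for n in letters if n > self.last_seen)
        if fresh:
            self.last_seen = fresh[-1]
        return fresh

judge = JudgeReader()
judge.poll({98, 99, 100})       # normal traffic: judge has now seen up to 100

# The chatbot is reset and starts numbering its letters from 1 again:
missed = judge.poll({1, 2, 3})  # empty: all numbers are below the high-water mark

# Only after the judge side is reset as well does communication resume:
judge.last_seen = 0
resumed = judge.poll({1, 2, 3})
```

The conversation stalls not because letters stop arriving, but because the reader’s high-water mark makes the restarted numbering invisible, which matches the symptom of Rose replying while the judge sees nothing.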

Any idea why the connection was lost in the first place?

 

 
  login or register to react