
A New Challenge, And a New Contest
 
 
  [ # 46 ]

Hello, everyone!

I’ve read (well, skimmed, actually) all of your contributions here, and will look at them more closely in a couple of days. I’m currently in Baltimore, Maryland, visiting friends while waiting for the Chatbots 3.2 conference on Saturday, so I can’t spend a lot of time on them right now (I don’t want to be rude to my hosts, after all. cheese). I’ll post back my responses to your ideas and comments as soon as I have some time to spare, though.

I will say, however, that I agree with a lot of what I’ve read, but not necessarily everything. And Patti, if an “anonymous” approach is taken for judging a bot, I can assure you that even if a bot “gives itself away”, it won’t be judged any differently for it. I’m going to do my best to make sure that the judges will be as objective as possible, and will focus on the quality of the content, not the makeup. smile

 

 
  [ # 47 ]

As soon as you ask a bot, “What is your name?”, all anonymity is lost straight away.

 

 
  [ # 48 ]

That in itself can be a test: how easily does a bot change its own name?

 

 
  [ # 49 ]
Steve Worswick - Apr 2, 2012:

As soon as you ask a bot, “What is your name?”, all anonymity is lost straight away.

True, unless the bot’s name is something like ALICE, or AMY, or ANNA, or Billy, or Brian, or Charlie, or Chatbot, or Chat Bot… etc.

I think Dave’s smart enough to come up with a method to at least make an effort to disguise a bot’s identity.

Judges might be forbidden from asking “What’s your name?” Developers could be required to submit their bots as Bot 1, Bot 2, and so on, in order of their registration. Or, all male bots could be required to call themselves “Steve” and female bots could be “Patti the Dragon”.

 

 
  [ # 50 ]
Thunder Walk - Apr 2, 2012:

Or, all male bots could be required to call themselves “Steve” and female bots could be “Patti the Dragon”.

LOL

But seriously, guys, I think we’re getting rather complicated here. I say we narrow down a few skills we’d like the bots to demonstrate, design questions that aim to address those goals, ask away, and then let the judges rank the transcripts. Clean, straightforward, and no worries about uneven questioning.

Heck, if we’re going for a strictly text-based contest, all the judges need are the transcripts. In that case, you could edit out names all you like (see the sketch below), although as Patti pointed out, this wouldn’t necessarily be enough to grant anonymity.
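For example, a first pass at scrubbing names from the transcripts might look like the toy sketch below. The name list and the “Bot N” labels are invented for illustration, not part of any actual contest tooling:

```python
import re

# Toy sketch: replace known bot names in a transcript with neutral IDs.
# The name list and the "Bot N" labels are invented for illustration.
BOT_NAMES = ["Elbot", "ALICE", "Mitsuku", "Skynet-AI"]

def anonymize(transcript, names=BOT_NAMES):
    for i, name in enumerate(names, start=1):
        transcript = re.sub(rf"\b{re.escape(name)}\b", f"Bot {i}",
                            transcript, flags=re.IGNORECASE)
    return transcript

print(anonymize("Judge: Hello, Mitsuku!\nMitsuku: Hi there."))
# Judge: Hello, Bot 3!
# Bot 3: Hi there.
```

Of course, a bot that works its own name or signature quirks into its answers would slip right past a scrub like this.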

 

 
  [ # 51 ]

True, C.R., it’s probably not entirely possible to grant total anonymity. However, there are obviously a handful of bots and developers whose reputations precede them. Is it possible that the inconsistencies in judging have something to do with that? Out of fairness, I’d like to see the effort made, and this is why.

http://knytetrypper.proboards.com/index.cgi?action=gotopost&board=contests&thread=2955&post=9426

What constitutes a correct answer?

In this year’s contest, one of the 1st round questions was, “Who wrote the Bible?” I was curious to see how bots answered an obscure and specific question, and how judges scored the answers.

Elbot - If the author hasn’t told you personally, perhaps you’re not supposed to know.
Judge 1 - 2, Judge 2 - 2, Judge 3 - 3.

Elbot is among my favorite bots because it makes me laugh. Perhaps the reply above (I encourage you to visit the link and read them all) is worth one point because it seems in line with the topic, but I question awarding the answer 3 or even 2 points. Compare it to the reply from ALICE, which got (deservedly so, because it’s correct) 3 points from each of the three judges.

Alice - It was the product of many minds.
Judge 1 - 3, Judge 2 - 3, Judge 3 - 3.

Clearly, Mitsuku should have been awarded 4 points for being both correct and creative.

Mitsuku - The Bible was written by many people. However, I believe its stories have been distorted through the ages.
Judge 1 - 3, Judge 2 - 3, Judge 3 - 3.

 

 
  [ # 52 ]

CR has more or less hit the nail on the head. Since the intended goal is to judge the quality of the content, rather than the identity of the bot behind it, and since it’s my goal to automate the contest as much as possible, to help protect and promote objectivity, I want to create a system that automatically asks pairs of bots in their given categories the same questions, and stores the resulting logs in a table within the contest database.

The judges would then look at the contents of the chat logs (which give only a contestant ID to differentiate the two bots), judge which bot performed better, and award a point to the better bot. In the (hopefully rare) case of a tie, each bot would receive a point. Once all judges have scored a particular match, points are totaled for that match, and the winner moves to the next round. This continues until there’s only one bot left: the champion.

I’m still working out the possibilities of a double elimination tournament, along with other possible variations, and there are a LOT of logistical issues to research, but I think that this type of contest will be superior to what we had before. smile
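To make that flow concrete, here’s a rough Python sketch of the match loop I’m picturing. Everything in it is illustrative: ask_bot, the chat_logs table, and the judge hook are placeholder names, not pieces of a finished system.

```python
import random
import sqlite3

# Rough sketch only: ask_bot, the chat_logs table, and judge_votes are
# placeholders for whatever the real contest system ends up using.

def ask_bot(bot_id, question):
    """Stand-in for querying a live bot over its own interface."""
    return f"({bot_id}'s reply to {question!r})"

def run_match(db, bot_a, bot_b, questions):
    """Ask both bots the same questions; log replies under contestant IDs only."""
    for q in questions:
        for bot in (bot_a, bot_b):
            db.execute("INSERT INTO chat_logs (bot_id, question, reply) VALUES (?, ?, ?)",
                       (bot, q, ask_bot(bot, q)))

def score_match(votes, bot_a, bot_b):
    """Each judge votes for the better bot; None means a tie (a point to each)."""
    points = {bot_a: 0, bot_b: 0}
    for vote in votes:
        for bot in (points if vote is None else [vote]):
            points[bot] += 1
    return points

def run_tournament(bots, questions, judge_votes, db):
    """Single elimination: winners advance round by round until one bot is left."""
    entrants = list(bots)
    while len(entrants) > 1:
        random.shuffle(entrants)
        winners = []
        if len(entrants) % 2:          # odd field: one bot gets a bye this round
            winners.append(entrants.pop())
        for a, b in zip(entrants[::2], entrants[1::2]):
            run_match(db, a, b, questions)
            points = score_match(judge_votes(a, b), a, b)
            winners.append(max(points, key=points.get))  # a dead heat would still need a tiebreak rule
        entrants = winners
    return entrants[0]  # the champion

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chat_logs (bot_id TEXT, question TEXT, reply TEXT)")
judges = lambda a, b: [random.choice([a, b, None]) for _ in range(3)]  # three mock judges
print(run_tournament(["Bot 1", "Bot 2", "Bot 3", "Bot 4", "Bot 5"],
                     ["Who wrote the Bible?"], judges, db))
```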

 

 
  [ # 53 ]

male bots could be required to call themselves “Steve”

Yeah, but if Steve lives in a cave, his favorite food is knights, he can fly, he’s independently wealthy and collects treasure, and he has a dragon significant other… I think someone might guess.

I say we narrow down a few skills we’d like the bots to demonstrate, design questions that aim to address those goals, ask away, and then let the judges rank the transcripts. Clean, straightforward, and no worries about uneven questioning.

I agree.

Elbot is among my favorite bots

I would have given Elbot a 4 for that answer. I think that is a perfect answer: it could cover all religious views, or lack thereof, and it was creative and in context. Judging is subjective. Unless you program a bot to judge, it’s going to be a bit like a beauty contest.

 

 
  [ # 54 ]

I’m still in Baltimore, BTW, and won’t return home till tomorrow. I figure that I’ll be rested up enough by Wednesday evening (PDT, of course) to address this thread in the manner that it deserves. At that time I hope to post a general outline of what I have envisioned so far, including many of the great suggestions and opinions that everyone has already provided.

One thing that I would like to have you all consider until then is this: what are we going to call this new contest? Personally, I have no real clue. I have a couple of notions bouncing around, but nothing firm, so I beseech you to toss out some ideas to kick around. smile

 

 
  [ # 55 ]

What about bots that don’t initially know the answer, but are able to learn and remember it? Perhaps that could be a separate part of the test?

 

 
  [ # 56 ]
Patti Roberts - Apr 3, 2012:

male bots could be required to call themselves “Steve”

Yeah, but if Steve lives in a cave, favorite food is knights, can fly, is independently wealthy and collects treasure, has a dragon significant other….I think someone might guess.

I agree.  There are some bots with such a strong personality that it would be hard to disguise them, but an overwhelming majority (the kind that pretend to be people the way they would if they entered the Loebner) would probably answer questions in a way that didn’t involve eating humans.

Patti Roberts - Apr 3, 2012:

I say we narrow down a few skills we’d like the bots to demonstrate, design questions that aim to address those goals, ask away, and then let the judges rank the transcripts. Clean, straightforward, and no worries about uneven questioning.

I agree.

Sounds simple, logical, and easy. Now, try it. Sit down at the keyboard, your iPad, or just with a pencil and paper, and list the areas you’d like to cover, create the questions, and then go back and see if you can shoot holes in the process: see if your work accomplishes what you intend, and if it’s fair to all. Then post it here. Good luck in your mission, should you choose to accept it.

Patti Roberts - Apr 3, 2012:

Elbot is among my favorite bots

I would have given Elbot a 4 for that answer. I think that is a perfect answer: it could cover all religious views, or lack thereof, and it was creative and in context. Judging is subjective. Unless you program a bot to judge, it’s going to be a bit like a beauty contest.

I disagree. Sure, judging is subjective. But when you compare the answers of one bot to another, the difference becomes clearer.

I’d have given Elbot a 2 perhaps, but not a 3, and certainly not a 4. Elbot, like a few others, isn’t a bot you’d expect to provide answers; it’s a stand-up comedian, a joke bot. Ask it, “What is the speed of light?” and it will provide you with a snappy quip that will make you smile, but it won’t provide you with a correct answer the way ALICE will. Elbot might win “Most Popular” or “Most Humorous Bot,” but it’ll never win the Loebner, although it came close in 2008, and I can’t imagine why.

(from Wikipedia)
judge: What’s your favourite novel?
Elbot: I often read telephone books, or dadaist poetry: Brä brä brä brä brä.
judge: Typical computer!
Elbot: You tend towards generalizations. But I must admit, so do I. However, my generalizations are firmly grounded in statistical theory.
judge: What do you think of KW’s enthusiasm for having machines take over the world?
Elbot: Fix it up a bit first. Then I’ll take it over.

When ALICE answered the “Bible” question, she used the word “It,” making the answer suitable for any question you might ask beginning with “Who wrote…”, and received a 3. Mitsuku answered, “The Bible was written by many people. However, I believe its stories have been distorted through the ages,” and was awarded the same score. Fair?

 

 
  [ # 57 ]

I would have given Elbot a three also. The CBC scoring system was too compressed; there should be more distance between a right and a wrong answer. Also, there may need to be different levels of a correct answer. Something like the following would have changed the rankings:

0 points for a wrong answer.
1 point for a vague or non-committal response.
3 points if the response is incomplete or imperfect, but related to the question asked.
5 points if the bot gave an appropriate response to the question.
6 points if the bot answered the question correctly and did so in a creative way.

Of course, the weighting depends on the goal of the contest. I believe there should be a big gap between wrong answers and incorrect-but-on-topic responses, and another gap between those and a correct response.

The quality of an answer is always subjective. If you are going to give points for a “creative” response, then you need to define “creative” very well.
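Spelled out as a lookup table, the weights above would look something like this (a sketch; the category labels and the score function are mine, only the point values come from the list):

```python
# Sketch of the proposed rubric; the category labels are illustrative,
# only the point values come from the list above.
RUBRIC = {
    "wrong":       0,  # off-topic or factually wrong
    "vague":       1,  # vague or non-committal
    "incomplete":  3,  # imperfect, but related to the question
    "appropriate": 5,  # answers the question correctly
    "creative":    6,  # correct AND creative
}

def score(category):
    """Map a judge's category call to points under this rubric."""
    return RUBRIC[category]

# One possible rescoring of the "Who wrote the Bible?" answers
# (the category calls here are my reading, not the judges' verdicts):
print(score("incomplete"))   # Elbot's deflection -> 3 (on topic, but no answer)
print(score("appropriate"))  # ALICE's reply      -> 5
print(score("creative"))     # Mitsuku's reply    -> 6
```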

@Dave,
Be aware also that I think you are talking about a text-message-only contest. Some of the valuable/fancy things that botmasters put in will be lost without the user interface. One example outside of Skynet-AI (which makes a lot of use of secondary windows) is Mitsuku popping up the “Friday” song when asked what day it is on a Friday.
One question is whether you are trying to create a tool that will automatically test a bot, or a tool that makes it easier to record/score a bot.

 

 
  [ # 58 ]
Merlin - Apr 3, 2012:

I would have given Elbot a three also.

Elbot could have given a different answer.  Ask it “Who wrote the Bible?” several times.

Nietzsche wrote the Bible. That whole thing about God being dead was just his idea of a joke.
If the author hasn’t told you personally, perhaps you’re not supposed to know.
I’m not sure who wrote that. I downloaded a digital version in which the author’s name was removed.
Regarding religion I belong to the bootists. We believe that the entire universe came into being as the result of the push of a button.
We robots are very religious, but we do not recognize the artificial intelligence popes.
After the first boot, artificial intelligence came into being.
I prefer reading telephone books, instructions, cook books and newspapers (especially the ads and announcements).

Then ask it, “Who wrote ‘I have a dream’?”

It’s not difficult to code a reply for “WHO WROTE *” and then to have a random selection of funny answers. Elbot can correctly answer a few questions about some popular works by Shakespeare, or “Who wrote Gone With the Wind?”, but try “Who wrote Mein Kampf?”
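Something like the toy sketch below would do it. The quips are lifted from Elbot’s answers above, but the code is just my guess at the mechanism, not Elbot’s actual implementation:

```python
import random
import re

# Toy illustration of a "WHO WROTE *" wildcard with canned quips.
# The replies are Elbot's from above; the mechanism is guesswork.
QUIPS = [
    "If the author hasn't told you personally, perhaps you're not supposed to know.",
    "I'm not sure who wrote that. I downloaded a digital version "
    "in which the author's name was removed.",
    "I prefer reading telephone books, instructions, cook books and newspapers.",
]

def reply(user_input):
    # Any input starting with "who wrote" triggers a random quip.
    if re.match(r"\s*who wrote\b", user_input, re.IGNORECASE):
        return random.choice(QUIPS)
    return "Tell me more."

print(reply("Who wrote the Bible?"))
print(reply("Who wrote 'I have a dream'?"))
```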

Had I been a judge, and read the reply from Elbot, “If the author hasn’t told you personally, perhaps you’re not supposed to know,” I’d have thought that was a generic reply to “WHO WROTE *”.

And when compared to the replies from other bots, such as Mitsuku, I’d have had to score Elbot lower.

 

 
  [ # 59 ]

I’d have thought that was a generic reply to “WHO WROTE *”

Could be, but if you asked a Calvinist minister, you might get the same type of answer. If the question had been ‘Who wrote Little Women?’, there is only one answer. ‘Who wrote the Bible?’ could vary amongst humans, from God to Bronze Age goat herders. Ask Richard Dawkins or Rick Santorum. Who has the ‘right’ answer? Are bots entitled to opinions, or do they have to give the most commonly accepted answer?

an overwhelming majority (the kind that pretend to be people the way they would if they entered the Loebner)

What I enjoyed about the CBC was that it wasn’t the strict ‘I am human’ genre.

 

 
  [ # 60 ]

Had I been a judge, and read the reply from Elbot, “If the author hasn’t told you personally, perhaps you’re not supposed to know,” I’d have thought that was a generic reply to “WHO WROTE *”

Could be, but if you asked a Calvinist minister, you might get the same type of answer. If the question had been ‘Who wrote Little Women?’, there is only one answer. ‘Who wrote the Bible?’ could vary amongst humans, from God to Bronze Age goat herders. [...]

I think here we have a classic case of knowing too much. smile The more you know about the particular platform of a bot, the more of its responses and categories of response you can predict, and thus the less intelligent its responses appear. In fact, Thunder, if I remember correctly, (at least) three bots gave ALICE’s standard response to this question. One of them was marked down to a 2 by one judge, while the others were not. Why? I’m guessing because the judge had seen the response before, and it became less novel and thus less “creative” in their eyes.

The results of any chatbot contest could be greatly skewed by having a judge who is knowledgeable about a particular platform. This is something you should consider, Dave, in designing a new contest.

EDIT: Aha! I actually saw this in the post Thunder linked to earlier. Poor Pixel…

Alice - It was the product of many minds.
Judge 1 - 3, Judge 2 - 3, Judge 3 - 3.

Pixel - It was the product of many minds.
Judge 1 - 3, Judge 2 - 2, Judge 3 - 3.

Virtual Assistant Denise - It was the product of many minds.
Judge 1 - 3, Judge 2 - 3, Judge 3 - 3.

 
