A New Challenge, And a New Contest
 
 
  [ # 31 ]

Great idea, Dave, for keeping the spirit of the CBC alive.

So this contest only deals with the quality of a bot's responses; that is, this is NOT to be an "imitation game"? That is, making your bot seem human? (Which, to me, is a huge waste of time: we don't try to make our calculators seem human; they do a better job at math, and that is all we care about.)

Some areas I can think of are:

-small talk

-static global knowledge Q/A (Watson)

-conversation-specific Q/A (the answer comes from information provided in the specific conversation)

-Inference. I don't know of any bot that can do natural language inference yet (I myself won't be able to start on this until mid-2014, perhaps sooner for simpler statements, but for connecting many complex NL utterances it'll be a while :) )

-serious, no-nonsense knowledge search

-'Semiza' (my 'semantic Eliza' side project… so a bit like Eliza, but with a lot of semantic connections to previous statements). What do you do if you don't fully understand the entire input (but only a substring)? In 'serious mode', the bot should admit that it doesn't understand, so you know. Or should it not admit it and just respond based on a partial match of the user input?

 

 
  [ # 32 ]

Erwin, Dave, dear all

I agree we should create a new chatbot challenge and have some ideas to share; here is a point-by-point list.

Chatbots are difficult to evaluate. Just fooling a human is not a good measure, and going the other way, making them mimic a question-and-answer engine (using pre-defined questions), is not fair either!

I think the agents deserve better and richer scoring, and the final score should be a balanced mean of all the sub-scores.

Here are (in my opinion) the points on which I would evaluate this.

Intelligent Agent Platform
Based on technical specifications such as manuals, data sheets, public information, etc.

1 - Quality of the platform (how easy it is to code a particular behavior, answer, or analysis)
2 - Multilingual capabilities of the platform
3 - Extensibility of agent data access (native = built-in), i.e. how it interfaces with databases, web services, and other data sources
4 - Flexibility of the pattern-matching mechanisms (hard-coded, AI-trainable, plug-ins, etc.)
5 - Natural language capabilities (analysis and generation)
6 - Speed of response, memory footprint, multi-user capability, session memory, number of concurrent users, etc.

Then I would evaluate a set of behaviors in actual implementations of several different types of agents (targets).

A) General chat (entertainment agent)
B) Specific Purpose Agent (targeted agent)
C) Query Answering Agent (based on a specific context, working as a help desk)
D) Artificial Intelligence, Inference and Cognitive Capabilities

Multilingual
English, Spanish and any other language are welcome, as long as we get at least 3 judges.

For all of them, we should not only look at the quality of the responses but rather measure the quality of the conversational behavior during the conversation: how the bot acts upon mistakes and misunderstandings, and how the chat flows.

To achieve this, I suggest a different task/test for each one.

A) General chat (entertainment agent)
Specify a free conversation with no turn limit, only a time limit (a few minutes, e.g. 5).
The judge should talk to each agent freely, targeting only a few pre-stated subjects such as money and finance, work, personal and family talk, nature, math and logic, emotional matters, etc.

The score should be based on the judge's assessment of each conversation turn:
1 - Agent understood/recognized the entry, giving a good response or taking a successful initiative.
2 - Agent missed the entry but successfully continued the conversation, holding the theme or context.
3 - Agent successfully rephrased the entry and tried to understand, asking for clarification or suggesting something.
4 - Total failure (the agent didn't get a clue).
5 - The agent's answer was unexpected; it may have tried to evade the fact that it didn't understand.
6 - The agent got a garbled or badly typed entry from the judge and tried to guess what was actually meant, or even answered it correctly!

At the final stage there will be an F-score-like rating.
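As a rough illustration of how per-turn judgments like the six above could be collapsed into a single number, here is a minimal sketch in Python. The weights given to each outcome category are my own assumptions, purely for illustration, not part of any agreed rubric:

    import statistics

    # Hypothetical weights for the six per-turn outcome categories listed above.
    # The exact values are illustrative assumptions only.
    TURN_WEIGHTS = {
        1: 1.0,   # understood, good response or successful initiative
        2: 0.6,   # missed the entry but kept the theme/context
        3: 0.7,   # rephrased / asked for clarification
        4: 0.0,   # total failure
        5: 0.2,   # unexpected answer, evaded the misunderstanding
        6: 0.8,   # recovered from a garbled or badly typed entry
    }

    def conversation_score(turn_labels):
        """Average the weights of the outcome labels a judge assigned to each turn."""
        return statistics.mean(TURN_WEIGHTS[label] for label in turn_labels)

    # Example: one judged conversation, labelled turn by turn.
    print(conversation_score([1, 1, 3, 2, 4, 1]))   # roughly 0.72

Something like this would also make it easier to compare bots across judges, since every conversation collapses onto the same 0-to-1 scale.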

B) Specific Purpose Agent
There should be a target to reach: the bot should be asked to fulfill something simple, like getting some information from the judge, and the judge should be allowed to make mistakes, mistype, answer badly, even rudely. The agent should be able to overcome these obstacles in a polite, correct way and get the goal done.
Score should be based on:
a) Number of turns to get the goal done.
b) Quality (subjective) of the way the agent treated the human (0: bad, 1: difficult, 2: normal, 3: good, 4: very good)
c) Number of correct and failed turn-pairs
d) Robustness (number of good interpretations of the judge's mistyped or erroneous entries)


C) Query Answering (specific knowledge)
The botmasters should get some reference material (text) in which the answers are contained, or from which they can be deduced.
The judges should have a number of specific goals, like getting certain answers; they can ask in whatever way they like, even over multiple turns, allowing the agent to refine the questions.
Score might be based upon:
a) Number of turns to achieve each goal (or fail to achieve it)
b) Number of goals achieved successfully
c) Quality (subjective) of the way the agent treated the human (0: bad, 1: difficult, 2: normal, 3: good, 4: very good)
d) Robustness (number of good interpretations of the judge's mistyped or erroneous entries)
e) Whether the processing of the reference material was unattended, semi-supervised, or manually supervised by the botmaster
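To make the "balanced mean of all the sub-scores" mentioned at the top of this post concrete, here is a minimal sketch of one way the criteria above might be normalized and combined. The criteria names, scales, and equal weighting are my own assumptions for illustration, not part of the proposal itself:

    # Each criterion is normalized onto 0..1 and averaged with equal weight.
    def normalize(value, worst, best):
        """Map a raw value onto 0..1, where `best` maps to 1.0."""
        span = best - worst
        return max(0.0, min(1.0, (value - worst) / span))

    def balanced_score(turns_used, max_turns, goals_met, goals_total,
                       treatment_quality, robustness_hits, robustness_chances):
        sub_scores = [
            normalize(turns_used, max_turns, 1),            # fewer turns is better
            goals_met / goals_total,                        # goals achieved successfully
            treatment_quality / 4.0,                        # subjective 0..4 scale
            robustness_hits / max(1, robustness_chances),   # recovered bad entries
        ]
        return sum(sub_scores) / len(sub_scores)

    # Example: 6 of at most 12 turns, 3 of 4 goals met, "good" treatment (3 of 4),
    # and 2 of 3 garbled entries handled well.
    print(balanced_score(6, 12, 3, 4, 3, 2, 3))

Whether all criteria should really carry equal weight is exactly the kind of thing the judges would have to agree on beforehand.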

D) Artificial Intelligence, Inference and Cognitive Capabilities
This is the most challenging part: the agent should be able to do some reasoning about relations, resolve anaphoric references, have human-like memory (even forgetting things), associate memories, and deduce new things and relations.

It should be able to ask for missing information in order to achieve a goal such as an answer, or even find out what is missing or wrong in a statement. For example, the judge might tell the agent a story, prompted by the agent's requests, and the agent should be able to follow the conversation successfully and answer some of the judge's questions, or spontaneously deduce associations or new discoveries. How the scoring would be done is complicated, and I have not thought about it yet; if anyone can help, you're welcome!

Hope this helps us get to a better challenge!

PS:

In my opinion, judges should also be botmasters, because they know how difficult it is to achieve each challenge!

Obviously judges won't judge their own bots (that would be an ethical conflict), but they could still enter their bots for the other judges.

Agents should be anonymous to the judges, and there should not be a distinctive question that tells one agent from another.

I am also willing to participate with my English/Spanish agent.

 

 
  [ # 33 ]

Victor and Andres, you both make some good points. I like the idea of a free-flowing conversation that revolves around certain pre-selected topics, but the judging for that would be a very subjective thing, and I think a lot of careful consideration would have to go into the judging rules/guidelines to make sure that all bots are treated fairly.

I'm not certain that I would want to judge a chatbot based on its platform, though. It shouldn't matter whether a conversational chatbot is an AIML chatbot running on the Pandorabots server, or a bot that's using ChatScript. What goes on "behind the scenes" shouldn't matter. The only distinction that I think should be made is whether it's a conversational chatbot, a virtual assistant, or possibly an expert system (though I'm still on the fence about that). I don't want to upset the many folks here who are working on various NLP projects, but I think we need to discuss whether NLP chatbots should have their own category. I say this because even with NLP bots, their "primary role" is likely going to be either conversation or assistance. Granted, almost everyone here who has an NLP bot is still at the stage where the bot's primary output is something other than conversation or assistance (such as breaking down the structure of the input and giving a report on the grammar, parsing, etc.), but the long-term goals seem to point to the bot taking one or the other of these roles. If anyone can give an example of any other roles that deserve a different category, please let me know. :)

I'll get to the idea of judges and botmasters in a little while. My time now is limited for several reasons. :)

 

 
  [ # 34 ]
Dave Morton - Mar 23, 2012:

Personally, I'd like to try to persuade Dr. Peter Norvig to act as a judge, or maybe even Dr. Michio Kaku, or Professor Brian Cox, but that would be like asking Queen Elizabeth to preside over your dinner party. :P

Don't underestimate our network, Dave. I know lots of people in the field, many of them personally. Here are a few names:
-Peter Plantec (author of Virtual Humans)
-Ray Kurzweil, co-founder of Singularity University

Let me think about it…

 

 
  [ # 35 ]
Patti Roberts - Mar 23, 2012:

The CBC has had a drop in entries each year. It used to have over 40 entries to start with; now, even with prize money, there were only half that.

I think it's time to organize a contest professionally. That means a LARGE sponsor, allowing for a business model in which the people organizing such a contest can draw a salary from it. That would make it possible to hire professional organizers and to provide much better promotion and fulfillment. It's a matter of time. Chatbots.org might want to have a role in it. Let's see…

 

 
  [ # 36 ]

That would be simply awesome, Erwin. :)

I already have some ideas about how chatbots.org can become involved, but I’ll discuss them with you privately first, if you don’t mind. I’m looking forward to tomorrow’s Skype conversation.

 

 
  [ # 37 ]
Andres Hohendahl - Mar 26, 2012:

Erwin, Dave, dear all

I agree we should create a new chatbot challenge and have some ideas to share; here is a point-by-point list.

So many great ideas. Would you like to be involved in such a new contest? You would be able to add value for sure!

 

 
  [ # 38 ]

With all due respect…

I don't think that presenting a large volume of ideas constitutes a better list of ideas.  It's tempting to try to include everything under the Sun, but it's easy to get drawn off-track by over-complicating things.  Any contest is going to be difficult and messy.  The direction should be to simplify the rules and guidelines for submission and scoring so that participation is as wide as possible.

You’re going to run into trouble if you start awarding points for multi-lingual bots.  How many are there?  How many speak all languages well?  Who would decide that?

One requirement for the CBC was, "Only English speaking chatterbots are allowed to enter."  Visitors to my bots know English words well enough, but they don't always spell them correctly, and their sentence structure is sometimes different from standard English.  Their sentences and questions aren't really hard to understand; they're just grammatically incorrect.  There are even differences in the language as it's spoken in the U.S. and G.B.

You’re going to have to come to an agreement on terminology.  For example, I don’t have any idea what “memory footprint” means.  Would it mean the same to you as it means to someone in a different country?

I could go on and on, but I think I’ve made my point.

 

 
  [ # 39 ]

I have been off-line for a few weeks, and will be checking in more frequently starting next week. But a few years back, Thunder and I did some work giving feedback on the CBC before Wendell took it over again.

Here are some of my notes:

Suggestion:

Questions should be grammatically correct.

Questions should be pre-checked via an on-line grammar checker, such as the one at:
http://www.link.cs.cmu.edu/link/submit-sentence-4.html
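A pre-check like that could even be scripted. Purely as a sketch: this uses the language_tool_python wrapper around LanguageTool as a stand-in, since I don't know of a public API for the CMU link parser page above; the package choice and the function name are my own assumptions.

    import language_tool_python

    def flag_ungrammatical(questions):
        """Return the questions a grammar checker objects to, with its messages."""
        tool = language_tool_python.LanguageTool('en-US')
        problems = {}
        for q in questions:
            matches = tool.check(q)            # one Match per suspected issue
            if matches:
                problems[q] = [m.message for m in matches]
        tool.close()
        return problems

    print(flag_ungrammatical([
        "What is you're favorite color?",
        "Where were you born?",
    ]))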

Suggestion:
Multiple judges should not ask identical questions delayed over time.

If a judge asks a bot a question, then a second judge asks identical questions later, the bot master has the ability to add new responses to handle the inputs. Depending on the length of time between the judges it could cause an unfair situation (some bot masters would have more time to add new responses). In addition, this weights the contest toward bots that do better to the specific questions of these judges.

In the 2010 contest, 2 judges asked an identical set of 10 questions. Every bot scored better with the second judge than with the first. Some of this can be attributed to the criteria used by the individual judge to score the bots, but other differences in scoring were directly related to improved responses.

Suggestion:
Stop awarding points for appearance and personality.

At the time, extra points were given for the following:

    Interface/Avatar:
    1) Animated interface that can speak and lip-sync the responses - 3 points
    2) Animated interface, no voice or lip sync - 2 points
    3) Average interface - 1 point
    4) Simple input/output text box - 0 points

Chatbots are about chatting. If there’s a desire to award extra points for animated heads, or lip syncing to voice, that’s a beauty contest and should be a separate category like Most Popular.

    Personality:
    1) Interesting, engaging personality - 2 points
    2) Average personality - 1 point
    3) Poor or no visible personality - 0 points

The evaluation of personality is subjective, and when it comes to judging, I'd bet no two judges could agree. I've sent friends the URLs of various chatbots I thought were funny or interesting, and because their experience was different from mine, they wouldn't understand my claims. Sometimes, because of what they said or asked, the conversations never got off the ground. It wasn't the visitor's fault, or the bot's fault. They were just different conversations. Replies I thought were clever or funny were deemed offensive or obnoxious by people not involved in chatbots.

Suggestion:
Judges should look for consistency and contradictions.

Whether points should be given (or deducted) for inconsistency, or how it should be dealt with, is probably a matter for debate. However, since there’s no mention of it in the rules, it’s an important element that’s probably been overlooked since the contest’s inception.

Along the lines of employing follow-up questions, if a bot is asked about, or happens to mention, an affinity for dancing, or something like "music you can dance to," a judge might be inspired to ask if the bot has legs. Or if a bot is asked if it likes to juggle, or enjoys playing cards, it would seem the next logical question would be, "Do you have hands… or arms?"

While bots might be confused about their physical attributes and abilities, such as having eyes along with the ability to see, a bot might also be tested for consistency in other areas. Wouldn’t a favorite book have been written by a favorite author… a favorite song the product of a favorite singer or band… a favorite team involved with a favorite sport?

If there is an inconsistency, is the bot able to justify it, or explain it to the judge’s satisfaction?

 

 
  [ # 40 ]

One of the ideas that I explored a few years back was the concept of micro-contests, each focused on a specific area. I thought this might speed focused development in a short time frame. This "BOT OF THE MONTH" club would allow for many more specialized contests, and would be similar to what the 3D rendering community does. Any interest?

 

 
  [ # 41 ]

The idea of a “blind” competition is better.

You simply vote and never get to know who you've voted for!

 

 
  [ # 42 ]
Andres Hohendahl - Mar 26, 2012:

Agents should be anonymous to the judges, and there should not be a distinctive question that tells one agent from another.

This is one idea I like.  I sometimes wonder if the inconsistencies in scoring might have to do with a bot’s (or botmaster/developer’s) reputation.

 

 
  [ # 43 ]
Thunder Walk - Mar 28, 2012:

You’re going to have to come to an agreement on terminology.  For example, I don’t have any idea what “memory footprint” means.

It just means how much RAM the system takes.

To judge a bot based on how much RAM it uses?  Makes no sense.
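For what it's worth, a bot's "memory footprint" is easy enough to measure. Here is a minimal sketch using the psutil package (the package choice is my own assumption, just to illustrate the term, not an endorsement of scoring on it):

    import psutil

    def memory_footprint_mb(pid=None):
        """Resident set size (RAM actually in use) of a process, in megabytes.
        With pid=None, psutil measures the current process."""
        return psutil.Process(pid).memory_info().rss / (1024 * 1024)

    print(f"current footprint: {memory_footprint_mb():.1f} MB")

Which, as said above, still tells you nothing about how well the bot chats.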

 

 
  [ # 44 ]

The idea of a “blind” competition is better.

You simply vote and never get to know who you've voted for!

I would have to remove so much stuff from my bot to keep him from giving himself away that I wouldn't bother with the contest. If you talk to the bots, you know who is who. They become like old friends.

 

 
  [ # 45 ]
Patti Roberts - Mar 29, 2012:

The idea of a “blind” competition is better.

You simply vote and never get to know who you've voted for!

I would have to remove so much stuff from my bot to keep him from giving himself away that I wouldn't bother with the contest. If you talk to the bots, you know who is who. They become like old friends.

That's probably true for a lot of bots.  Alice clones are filled with "identifiers," often even making reference to Dr. Wallace or the Alice Foundation.  Those replies still surprise me, and pop up in the oddest and most unexpected places.  And then, I frequently read other exchanges in the chatlogs where someone tells one of my bots, "Every bot says that."

In the past, the CBC main page listed each bot by name alphabetically along with the “creator,” and that’s how judges located them.  I think the contest organizers should simply do the best they can to hide the bot’s identity from the judges.

 
