While passing the Turing test may be the holy grail of NLP, it seems to me that there are other grails along the way that deserve some recognition. Looking over the Loebner Prize and the CBC, I think too much emphasis ultimately falls on the knowledge base underlying the system. Specifically, if one knows that one's bot is going to be asked “simple” questions on any subject of general knowledge, the range of material it has to cover is pretty overwhelming.
My purpose, however, is not to diss the existing contests but rather to suggest that it would be nice to have a contest focused instead on the general conversational skills of each chatbot. (Perhaps such a contest already exists and I am unaware of it?)
Basically, the idea would be to have judges who attempt to interact with a given bot conversationally, without trying to trick it in any way. Rather, they would play along as if they accepted prima facie that they were interacting with an intelligent and aware “being”.
If the bot is programmed for general conversation, the judge could start by talking about the weather and then go wherever the conversation led. If the bot focuses on a specific subject, the judge would play along within that area. For example, in interacting with Eliza as psychoanalyst, the judge would score the bot on how well the analysis went, how varied and insightful Eliza’s responses were, and so on.
I guess the reason I am suggesting such a contest is that I am working on a program that will attempt to have a philosophical conversation with the user. I’m therefore unlikely to spend much time programming it to deal in any sophisticated manner with a user who babbles on about some unrelated subject like sports.
How would such a contest be scored? Perhaps others who find this idea interesting can weigh in on this thread. Here are some criteria that come to mind:
(1) Grammatical and semantic accuracy, especially when referring to material previously input by the user.
(2) Accuracy in responding to what the user is saying.
(3) Depth of comments, such that the bot gives at least some impression of intelligence and insight.
(4) Longevity, i.e., how long a bot “lasts” before it becomes painfully obvious that it’s not as smart as one might have hoped.
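
To make the scoring question a bit more concrete, here is one rough way the four criteria above could be combined into a single number. This is only a sketch to get the discussion going: the 0-10 rating scale, the equal weights, the idea of measuring longevity as the fraction of turns completed before the bot breaks down, and all of the names (`JudgeRatings`, `longevity_score`, `overall_score`) are my own assumptions, not part of any existing contest.

```python
from dataclasses import dataclass

# Hypothetical weights: nothing above says how the criteria should be
# balanced, so they are weighted equally here.
WEIGHTS = {"accuracy": 0.25, "relevance": 0.25, "depth": 0.25, "longevity": 0.25}


@dataclass
class JudgeRatings:
    """One judge's ratings for one conversation (assumed 0-10 scales)."""
    accuracy: float              # criterion 1: grammatical and semantic accuracy
    relevance: float             # criterion 2: responding to what the user said
    depth: float                 # criterion 3: impression of intelligence and insight
    turns_before_breakdown: int  # criterion 4: turns until the illusion failed
    total_turns: int             # length of the whole conversation in turns


def longevity_score(r: JudgeRatings) -> float:
    """Map criterion 4 onto the same 0-10 scale as the other criteria."""
    if r.total_turns <= 0:
        raise ValueError("total_turns must be positive")
    if not 0 <= r.turns_before_breakdown <= r.total_turns:
        raise ValueError("turns_before_breakdown must be between 0 and total_turns")
    return 10.0 * r.turns_before_breakdown / r.total_turns


def overall_score(r: JudgeRatings) -> float:
    """Weighted average of the four criteria, on a 0-10 scale."""
    for name in ("accuracy", "relevance", "depth"):
        value = getattr(r, name)
        if not 0 <= value <= 10:
            raise ValueError(f"{name} must be between 0 and 10")
    return (
        WEIGHTS["accuracy"] * r.accuracy
        + WEIGHTS["relevance"] * r.relevance
        + WEIGHTS["depth"] * r.depth
        + WEIGHTS["longevity"] * longevity_score(r)
    )


if __name__ == "__main__":
    # A bot that held up for 15 of 30 turns.
    ratings = JudgeRatings(accuracy=8, relevance=6, depth=4,
                           turns_before_breakdown=15, total_turns=30)
    print(overall_score(ratings))
```

A contest for specialized bots might well weight depth more heavily than the other criteria; changing the values in `WEIGHTS` (keeping them summing to 1) would be enough to try that out.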