
New Annual Contest with $25000 First Prize
 
 
  [ # 46 ]

The output format has changed yet again… 20 days before the deadline. And the rules:

5.  Competing: Each contestant must be represented by an individual who is present. The representative must bring a laptop on which the entry will run. Any commercially sold portable computer is acceptable.

It would be easy to cheat with a secretly connected laptop (with a 3G connection, for example). On the other hand, it will be very difficult for a Frenchman like me to be present.

 

 

 
  [ # 47 ]

Jesus, these flippin' lunatics have just wasted months of my time by changing all the rules at the last minute. They're requiring my presence across an ocean, they've tripled the difficulty of the questions at the last moment, they're practically demanding to see my code, and they're still making up the rules while I have to ship my program by Monday for it to arrive in time. I am not okay with any of this.

I’m done with this crap organisation.

 

 
  [ # 48 ]

Right. I got a reply that, to sum it up, these requirements are not as strict as they appear, and that I may enter without being present in New York if the program’s operation is a piece of cake. Which it is. Now to reprogram the interface to handle an unspecified number of multiple choice answers. I hope they didn’t change their binary answer format after reading my blog about it, because that would be ironic.
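
For anyone curious, handling a variable number of options mostly means not hard-coding the A/B pair anywhere. Here's a rough sketch in Python of what I mean, assuming a made-up XML layout (the element names "problem", "text" and "answer" are mine for illustration, not the contest's actual schema):

    import xml.etree.ElementTree as ET

    def resolve(text, answers):
        # Placeholder resolver: return an index into answers, however
        # long the list is. A real system would score each candidate here.
        return 0

    doc = ET.parse("questions.xml")
    for problem in doc.iter("problem"):
        text = problem.findtext("text", default="")
        answers = [a.text for a in problem.iter("answer")]  # 2, 3, 4 or 5 options
        print(answers[resolve(text, answers)])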

 

 
  [ # 49 ]

Just read this thread going on about the Winograd Challenge.

I am really sad that they picked the same deadline as the Loebner Prize.
This is not good.  Really not.

A developer is human and needs time and focus to set up an entry.
Having the Winograd Challenge pick the same date as the Loebner Prize is just wrong.

I would have loved to participate in the Winograd Challenge in 2016, but as I remember it, its submission date was in September back when it started. I just assumed it would be the same. :/

 

 
  [ # 50 ]

They probably weren't aware of it. I'm constantly surprised how often I see two events that target the same audience scheduled on the exact same date. I handed in my WSC entry early so I'd have some time left to rearrange the system for the LP. That said, it was impossible to send it in any earlier than three weeks ahead, because they were still making changes to the technical interface at the time.

Actually this is still the first WSC contest. It was originally set for October 2015, then postponed to January 2016, then suddenly postponed to July.

 

 
  [ # 51 ]

As promised, here’s how I did it. At least enough to give you an idea of my methods.

Ironically, I believe that with the format change from schemas to prose, more pronouns will be solved by my year-old rules of thumb than by the new common-sense subsystem. Luckily I have a better use for it myself, or I wouldn't have worked on it.

 

 
  [ # 52 ]

I just saw the results here:

http://whatsnext.nuance.com/in-the-labs/winograd-schema-challenge-2016-results/

I don't understand how the percentages are calculated. Shouldn't they be higher than 50%, even with random responses? Or maybe there were three-choice questions?

 

 
  [ # 53 ]

Thanks for posting that. I'm glad everything worked anyway. I had already expected that there would be no winners this year, since they made changes that nobody was expecting. Hurray, this means I don't have to write a paper about it either :)
The percentage is lower than 50% even with guesswork because there were sometimes three, four or five possible options. If your system was only designed for two options, that would explain why your score is lower still.
Here is how they worked it out for the human subject tests, for example:

Of the 108 PDPs,
74 had 2 possible referents,
27 had 3 possible referents,
4 had 4 possible referents,
3 had 5 possible referents.
Therefore, guessing at random, the expected score would be 47.6 correct out of 108 = 44%

http://www.cs.nyu.edu/faculty/davise/papers/WS2016SubjectTests.pdf
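
Spelled out, their arithmetic is just the sum of 1/k over all questions; a quick check in Python with the counts quoted above:

    # Expected score under uniform random guessing, per the quoted counts.
    referent_counts = {2: 74, 3: 27, 4: 4, 5: 3}  # options -> number of PDPs

    total = sum(referent_counts.values())                      # 108 PDPs
    expected = sum(n / k for k, n in referent_counts.items())  # 37 + 9 + 1 + 0.6

    print(expected)          # 47.6
    print(expected / total)  # 0.4407... -> 44%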

I had a problem with their re-using the same input several times, which has a negative influence on the score.

Babar wonders how he can get new clothing. Luckily, a very rich old man who has always been fond of little elephants understands right away that he is longing for a fine suit. As he likes to make people happy, he gives him his wallet.

For instance, my program makes one mistake with “he likes to make people happy” (a forward reference, which I just don't handle). As a result, the following “he gives” and “gives him” are also counted wrong because of that one mistake, even though they are resolved by correct logic.

Actually, what I don’t understand is why I don’t see the University of Texas on that scoreboard, or any of the other universities that have been working on this.

 

 
  [ # 54 ]
Don Patrick - Jul 14, 2016:

Therefore, guessing at random, the expected score would be 47.6 correct out of 108 = 44%

My score is 31%. That means not only that my algorithm does not work, but also that I have no luck. I am the one furthest from the statistical average.

 

 

 
  [ # 55 ]

But could your program handle more than two choices? I mean, if the correct answer was C or D but your XML interface only read answers A and B, then naturally it would get fewer correct.
It’s hard to say what happened exactly until the questions are published. I can’t tell which part of my score is due to common-sense programming and which part is due to coincidence.
I’m also not sure whether the guesswork percentage can be considered the baseline. It would be, if you only looked at how many multiple-choice answers there are, but the number of candidate subjects, including earlier pronouns, is often greater than the number of answers. Nevertheless, I would say you had some bad luck there. But maybe not as unlucky as the people who wanted to participate but didn’t. I’m missing three would-be participants from the roster.

 

 
  [ # 56 ]

The questions are online:
http://www.cs.nyu.edu/faculty/davise/papers/PDPChallenge.xml

It looks like the “problem with punctuation” that Quan Liu had is that some of them don’t end in a period. I already saw that coming, so my score won’t change when they redo it. Quan Liu’s might.

 

 
  [ # 57 ]

There are 45 questions with 2 choices, 12 with 3, 1 with 4, and 2 with 5. With random responses, we should expect 22.5 + 4 + 0.25 + 0.4 = 27.15 correct responses out of 60 = 45.25%.

Only one question was handled by my algorithm; the other 59 were random responses. So with 31%, I can confirm that I had very bad luck.

 

 
  [ # 58 ]

That is strangely unfortunate. The odds of a 15% deviation from average chance are something like 1 in 100, as far as I know.
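
If anyone wants to check that figure, here's a quick Monte Carlo estimate in Python using the answer-count distribution from the previous post (45 questions with two options, 12 with three, 1 with four, 2 with five). A score of 31% is about 18 correct out of 60, and pure guessing lands at or below that roughly 1 time in 100:

    import random

    # Answer counts of the 60 questions, from the post above.
    options = [2] * 45 + [3] * 12 + [4] * 1 + [5] * 2

    def random_score():
        # One simulated run of pure guessing: each question is answered
        # correctly with probability 1/k, where k is its number of options.
        return sum(random.random() < 1 / k for k in options)

    trials = 100_000
    unlucky = sum(random_score() <= 18 for _ in range(trials))
    print(unlucky / trials)  # comes out around 0.01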

It seems I had 5 errors caused by the XML spacing, which produced merged words like “andit” and “himhe”, instantly failing those items. But even if that had gone right, half of the pronouns weren’t even considered ambiguous by my program. The results are turning out less interesting than I’d hoped.
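
For the record, those merged words are a classic pitfall when extracting mixed content from XML: if the spacing around an inline tag gets swallowed, naive concatenation glues the fragments together. A toy illustration in Python (the markup is invented, just to show the effect):

    import xml.etree.ElementTree as ET

    # If the whitespace around an inline tag is lost, joining the text
    # fragments directly reproduces errors like "andit".
    snippet = ET.fromstring("<s>...and<pron>it</pron>was heavy.</s>")

    print("".join(snippet.itertext()))                      # ...anditwas heavy.
    print(" ".join(t.strip() for t in snippet.itertext()))  # ...and it was heavy.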

Of course, the media was quick enough to draw their conclusion from the scores:
https://www.technologyreview.com/s/601897/tougher-turing-test-exposes-chatbots-stupidity/

 

 
  [ # 59 ]

In the end, I am glad I did not enter.

This was the Pronoun Disambiguation phase of the challenge. You need to hit 90+% before you get to the Winograd schema phase. The second phase is also 60 questions. I believe that each phase may require different algorithms.

Maybe in 2018, when it is run again, the rules will be clearer and more stable in the months leading up to the event.

 

 
  [ # 60 ]

Here are seven types of anaphora.

1) Pronominal anaphora
2) VP anaphora (also called VP ellipsis)
3) Propositional anaphora
4) Adjectival anaphora
5) Modal anaphora
6) Temporal anaphora
7) Kind-level anaphora

The contest questions appear to be all type 1. It will be interesting to see what happens when contests start including the other six types.

 
