
About the “new” AIML set
 
 

I know that there are at least a few AIML botmasters here, so I thought I would pass this along.

I have no idea just how “new” this AIML set actually is (I only just discovered it a few days ago, so it’s “new” to me), but I’ve given it a bit of a look-see, and I’m quite pleased with it, and here’s why:

1.) It seems that they’ve taken the time and effort to separate ALICE’s code from everything else, and placed it in its own set of discrete files. I know that the earlier AAA set was supposed to do this (to an extent), but this more recent set has done a better job of “pulling ALICE out”, saving the botmaster considerable time and effort.

2.) Much of the redundant code has been removed, leaving less bloat and waste. True, this AIML set is larger than the last one, but for a good reason, which brings me to number 3:

3.) One of the ‘best’ improvements to this new AIML set is the addition of an AIML translation of MindPixel. For those of you who don’t know what MindPixel is, I strongly recommend that you read up on that link I just gave out. Basically, it’s a huge list of questions with ‘yes’ or ‘no’ answers (and varying degrees of certainty in-between). For even the most dedicated botmaster, this ‘database’ represents probably a year or more of ‘heavy coding’ (at least 6 hours per day, 5 days per week) to create. And since I truly suck at coming up with spontaneous content, it would likely take me a lot longer. Granted, some of the questions are somewhat silly (“SHOULD SOFTWARE BE FREE”), and there are a few typos (“SHOULE YOU MEASURE TWICE AND CUT”), but all in all, I see this as a great benefit.

4.) And finally, this new set is a little more up-to-date than previous sets. One of the more tedious chores that a botmaster has is keeping the AIML current, with regards to political leaders and current events. While this set isn’t “up to the minute” by any stretch of the imagination, it’s far better than the AAA.


One of the plans I have for this new MindPixel data is to add a certain amount of randomness to the responses. Right now, if you were to ask a bot with this set installed “Should software be free”, all you’ll get, every time, is “I am certain”. I think it would seem a lot more “human” to have a more varied answer, personally. Right now I have a category for “yes” that chooses randomly between around 50 affirmative answers. My “no” category has around 30 or so, but is expected to grow (only a couple of the “no” answers are simple re-wordings of a “yes” answer). It’s the categories for the varying degrees of certainty (“maybe”, “it’s likely”, “I doubt it”, etc.) that are going to give me the most grief, but I’m confident that the end result will be much improved.
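For anyone who hasn’t built one of these before, a randomized “yes” category along those lines would look something like this in AIML (the category name and the particular answers here are just illustrative, not the actual contents of my files):

```xml
<category>
  <pattern>MPYESANSWER</pattern>
  <template>
    <!-- <random> picks one <li> at random each time the category fires -->
    <random>
      <li>Yes.</li>
      <li>Absolutely.</li>
      <li>Of course.</li>
      <li>Certainly.</li>
    </random>
  </template>
</category>
```

Any category that should answer “yes” can then redirect here, so the variety only has to be written once.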

Anyway, I thought I would share this. smile

 

 
  [ # 1 ]

Probably off-topic but too extraordinary to not mention: both the MindPixel project and a competing project called OpenMind have a number of unusual things in common. MindPixel was the brainchild of Chris McKinstry and OpenMind was created by Pushpinder Singh (among others) from MIT. They both committed suicide within a few weeks of each other, using the same method, early in 2006.

http://en.wikipedia.org/wiki/Chris_McKinstry

http://en.wikipedia.org/wiki/Open_Mind_Common_Sense

http://en.wikipedia.org/wiki/Mindpixel

 

 
  [ # 2 ]

I think that’s fairly relevant, Andrew. No worries. smile

 

 
  [ # 3 ]

Dave,
Dr Wallace announced it on the forum at the end of May.
http://www.chatbots.org/ai_zone/viewthread/534/

I have looked at the data, and I would suggest using MP0-MP3 and part of MP4 as your “yes” category. I don’t know if the negative assertions will have as much value for you.

 

 
  [ # 4 ]

Yeah, May wasn’t a particularly good month for me, I’m afraid. I’m not surprised that I missed the post.

After looking through the files, I’ve decided to use <SRAI> tags for categories like MPYESANSWER, MPNOANSWER, and several others, based on the ‘certainty level’ of its original response. Since I haven’t gone all the way through the files, I don’t know how many categories I’ll end up with, but I expect the number to grow to over a dozen. Once I get the categories written, I’ll bundle up the whole thing and submit it to the GoogleCode page.
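In other words, each MindPixel category becomes a one-line redirect to a shared answer category, something like this (the input pattern here is just an example, not a real entry from the files):

```xml
<!-- The original MP category answered with a fixed string; -->
<!-- redirecting via <srai> lets a shared category vary the reply. -->
<category>
  <pattern>IS THE SKY BLUE</pattern>
  <template><srai>MPYESANSWER</srai></template>
</category>
```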

 

 
  [ # 5 ]

As I recall, in the MP files there are also some spelling errors and weird duplicates that you’ll want to look for.


<category>
<pattern>GENDER</pattern>
<template>Yes.</template>
</category>
<category>
<pattern>GENDER</pattern>
<template>Yes.</template>
</category>

Some of the patterns may not be appropriate for all bots.

 

 
  [ # 6 ]
Merlin - Aug 17, 2011:

As I recall, in the MP files there are also some spelling errors and weird duplicates that you’ll want to look for.


<category>
<pattern>GENDER</pattern>
<template>Yes.</template>
</category>
<category>
<pattern>GENDER</pattern>
<template>Yes.</template>
</category>

Some of the patterns may not be appropriate for all bots.

The typos and (most of) the duplicates will be taken care of when I submit the files. As to which categories are suitable for which bots, I think I’ll leave that up to the individual botmasters. smile

 

 
  [ # 7 ]

Dave, thanks for the kind review.

The Mindpixel GAC-80K data originated with a conversation I had with Chris.  He claimed that the MP database held millions of assertions whose truth value had been voted on by its users.  Knowing Zipf’s Law, I suggested that perhaps there was a subset that had been voted more than others, so he sorted the data by vote count and came up with the 80,000 most-voted assertions.  The idea was that if “The sky is blue” started out with a value of 1 (because someone said it was always true), then if enough users voted on it, it would eventually converge to something like 0.5 (“sometimes”).  Mindpixel was sort of like a Wikipedia-style crowd-sourcing approach to truth-values.  Our assumption was that the top 80K would have the most reliable truth values.

The original data came with truth values ranging from 0.0 (“No”) to 1.0 (“Yes”).  To migrate to AIML, I divided these values into 55 intervals of about 0.02 each.  Then I assigned them natural language responses from “Yes” to “No” that seemed likely to correspond to these interval values.  For the record, they are:


1. “Yes.”
2. “Absolutely.”
3. “Affirmative.”
4. “Certainly.”
5. “Positively.”
6. “Definitely.”
7. “Exactly.”
8. “Indubitably.”
9. “Naturally.”
10. “Of course.”
11. “Precisely.”
12. “Undoubtedly.”
13. “Unquestionably.”
14. “Beyond a doubt.”
15. “Most assuredly.”
16. “I am certain.”
17. “Always.”
18. “Mostly.”
19. “Highly likely.”
20. “Usually.”
21. “Likely.”
22. “That could be the case.”
23. “I think so.”
24. “That may be true.”
25. “Within the realm of possibility.”
26. “Sometimes.”
27. “Maybe.”
28. “Possibly.”
29. “Conceivably.”
30. “I am uncertain.”
31. “That’s feasible.”
32. “For all I know.”
33. “I can imagine it.”
34. “It may not be true.”
35. “I don’t know if that’s true or not.”
36. “Not very often.”
37. “Not likely.”
38. “Not to my knowledge.”
39. “Doubtful.”
40. “Seldom.”
41. “Unlikely.”
42. “Not likely.”
43. “Doesn’t seem likely.”
44. “If it is, I don’t know it.”
45. “Does not seem possible.”
46. “I don’t believe so.”
47. “I don’t think so.”
48. “There is no reason to think so.”
49. “Negative.”
50. “Never.”
51. “Absolutely not.”
52. “No way.”
53. “Not at all.”
54. “Not by any means.”
55. “No.”
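Put another way, assuming 55 equal-width intervals over [0, 1], the response index for a truth value v works out to roughly:

```latex
i \;=\; \min\bigl(\lfloor 55\,(1 - v) \rfloor + 1,\; 55\bigr)
```

so v = 1.0 maps to #1 (“Yes.”), v = 0.5 lands near #28 (“Possibly.”), and v = 0.0 maps to #55 (“No.”).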

The reason you are seeing duplicates is that the input patterns have been run through the AIML reductions.  Thus if the data contained “Are you a man or woman”, “Are you female”, and “What is your gender”, these would all be reduced to “Gender”.  (Of course, the data in theory *shouldn’t* have this particular question, since it is not a yes-no question, but people put those in anyway.)  In any case the duplicates should be removed.
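As a concrete sketch of what those reductions look like (the actual reduction categories in the set may be worded differently), each variant simply <srai>s down to the canonical pattern:

```xml
<category>
  <pattern>WHAT IS YOUR GENDER</pattern>
  <template><srai>GENDER</srai></template>
</category>
<category>
  <pattern>ARE YOU FEMALE</pattern>
  <template><srai>GENDER</srai></template>
</category>
```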

I did put some effort into eliminating nonsense and offensive inputs, as well as particularly long inputs, from the data.  That’s why the AIML category count is somewhat less than the original 80,000.

I included in the Google Code wiki a description of the file loading order.  If you load the MP files before the ALICE categories in Pandorabots, the ALICE categories will overwrite any duplicate MP patterns.  This way if ALICE (or your bot’s AIML) includes a response to GENDER, it will be selected over “Yes”.
(see http://code.google.com/p/aiml-en-us-foundation-alice/wiki/AIMLFileLoadingOrder)

—-

I’ve been getting into using Mercurial and Google Code to maintain the AIML sets.  It is a somewhat deep subject, not easy to learn in a day (unless perhaps you are already a version-control guru).  If you do put in the time however you’ll be able to do lots of cool stuff, like clone your own repository of the ALICE AIML and make modifications to it, then submit them back to me for inclusion in the “official” set.

An even simpler option is to put change suggestions into the Issue Tracking system in Google Code.  Right now there are numerous change and improvement suggestions to the ALICE set scattered all over, and it would be great to at least collect these in a central resource.

 

 

 

 
  [ # 8 ]

Thank you for the list of answers, Rich. That will make my job a lot easier. As I earlier stated, I’ll make the altered MP files available to the public when I complete them; perhaps not as a replacement, but most likely as an alternative. smile

I find it interesting that I have nearly as many ways to say “yes” as the MP files have for total answers. smile

 

 