
Chatbots, AI, wildcards, random phrases and Pattern Matching
 
 
  [ # 31 ]

Merlin, take a good look at the file I wrote a few years ago for converting wikitext markup into XML, which I have attached to this post. According to the documentation on the Wikimedia website there have been over 30 attempts to write a good library to replace the appalling mess of PHP that currently converts Wikipedia pages into HTML. By my reckoning I accomplished more in the program that I wrote for that purpose than anyone else has so far.

But that’s neither here nor there. The program was written in FLEX, the GNU lexical analyser generator, and it’s about as sophisticated as RE processing gets. I’ve also written complex and beautiful programs using POSIX REs, Advanced REs, Extended REs and their ilk in PostgreSQL, Perl and JavaScript, and I regularly use Perl Compatible REs in my C programs. However, in the case of the wikitext parser, I gave it up as a bad idea because, although it worked well, it was turning into just as big a mess as the unmaintainable Wikimedia code that it was intended to replace.

To your credit, you’ve taken the trouble to do a bit of research to back up your claims. Unfortunately you have still not come up with anything to show that, outside a very narrow range of applications, the use of regular expressions for complex parsing tasks isn’t doomed to fail unless it is propped up by an ever-growing tangle of bolted-on (and more often than not mutually incompatible) kludges.

Most of those kludges “allow” you to hand-code the complex operations that the computer ought to be able to code for you. It’s like having to get out of the car and walk to your destination when you should be able to drive all the way. Or maybe you don’t have a driver’s licence yet and all you’re competent to do is back the car out of the garage. I’m going to go with that until you manage to prove otherwise.

Anyway, I haven’t revisited the wikitext problem with my CFG parser yet, though I have already implemented a number of fast and elegant parsers for much more complex problems using it. I even published all the source code for one of them in this forum, the parser for discourse analysis which you can download from here for comparison: http://www.chatbots.org/ai_zone/viewreply/7811/

The video presentation from Peter Norvig was very interesting, so thanks for that Merlin. In it he is pretty much saying the exact same thing that Laura and CR have been saying in this thread, but with visible hand waving. Did you watch the other videos in the series? In the next one he explains how the solution to the problems just described is to use Probabilistic Context Free Grammars. (No mention of Probabilistic Regular Expressions or supporting frameworks anywhere, though maybe Larry hasn’t figured out a way to bolt those on yet. Give him time, he is the king of kludges after all.)

In fact Professor Norvig is also wrong about this. Probabilistic “anything” has been very fashionable for more than a decade because of the ready availability of crunchable data from the internet; that, and it’s so easy that even MBAs can understand it and open their cheque books to fund the research. However, James Allen showed very handily as far back as 1996 that all it was good for was speeding up parsing algorithms by a small but significant factor, and that it still didn’t solve any of the real problems (e.g. ambiguity resolution) satisfactorily by itself (Chapter 7, “Natural Language Understanding”). As Professor Norvig is such a busy man, I guess he can be excused for being a little out of touch.

While linguists generally are still arguing vehemently about exactly how language works, there is one thing that they do all agree on: the use of context-free grammar will be part of the solution. For the very latest theories on the subject, I invite you to do a bit of reading about “The Simpler Syntax Hypothesis”, which was published a couple of years ago. The first chapter is available on the internet for free download. Another good book is the one that Jan pointed out last week, “Basic English Syntax with Exercises” by Mark Newson, which can be downloaded in its entirety (though personally, I think that chapter 3 is rubbish).

File Attachments
wikitext.l.zip  (File Size: 10KB - Downloads: 109)
 

 
  [ # 32 ]

This is the most fascinating and entertaining forum that I have ever had the pleasure of participating in, thanks to brilliant minds like Andrew and doubting minds like Merlin.  tongue rolleye

 

 
  [ # 33 ]

We do tend to have some stimulating, often inspirational, sometimes spirited debates here, Laura. That’s for sure. smile

 

 
  [ # 34 ]
Andrew Smith - Dec 15, 2011:

The video presentation from Peter Norvig was very interesting, so thanks for that Merlin. In it he is pretty much saying the exact same thing that Laura and CR have been saying in this thread, but with visible hand waving.

Ha ha, this forum has a way of drawing you in, even when you’ve been out of the loop for a few days… Apparently I’ve been posting in spirit! wink

I’ll have to take a closer look and weigh in later. smile

 

 
  [ # 35 ]
Andrew Smith - Dec 15, 2011:

The video presentation from Peter Norvig was very interesting, so thanks for that Merlin. In it he is pretty much saying the exact same thing that Laura and CR have been saying in this thread, but with visible hand waving. Did you watch the other videos in the series? In the next one he explains how the solution to the problems just described is to use Probabilistic Context Free Grammars. (No mention of Probabilistic Regular Expressions or supporting frameworks anywhere, though maybe Larry hasn’t figured out a way to bolt those on yet. Give him time, he is the king of kludges after all.)

In fact Professor Norvig is also wrong about this. Probabilistic “anything” has been very fashionable for more than a decade because of the ready availability of crunchable data from the internet, that and it’s so easy that even MBA’s can understand it and open their cheque books to fund the research. However James Allen showed very handily as far back as 1996 that all it was good for was speeding up parsing algorithms by a small but significant factor, and that it still didn’t solve any of the real problems (e.g. ambiguity resolution) satisfactorily by itself. (Chapter 7, “Natural Language Understanding”). As Professor Norvig is such a busy man, I guess he can be excused for being a little out of touch.

I am taking the Stanford AI course, so I have been through all the videos. It is a bit easier to do from the course page. Here is the link to the start of the NLP unit.
On the forum we have mostly talked about grammar vs pattern matching. One of the two professors giving the course is Peter Norvig.

Peter Norvig is Director of Research at Google Inc. He is also a Fellow of the American Association for Artificial Intelligence and the Association for Computing Machinery. Norvig is co-author of the popular textbook Artificial Intelligence: A Modern Approach. Prior to joining Google he was the head of the Computational Sciences Division at NASA Ames Research Center.

It has been interesting to get the Google perspective (probabilistic automated learning). In other units he describes how it is used to do language translation. I would say the approach is good for more than just “speeding up parsing algorithms by a small but significant factor”.

Other parts of the course relate to other AI problems, though most of the approaches deal with probability/Bayes’ rule. The other professor is Sebastian Thrun.

Sebastian Thrun is a Research Professor of Computer Science at Stanford University, a Google Fellow, a member of the National Academy of Engineering and the German Academy of Sciences. Thrun is best known for his research in robotics and machine learning.

Sebastian and his team won the DARPA Grand Challenge (self-driving cars) using probabilistic models.

Andrew Smith - Dec 15, 2011:

To your credit, you’ve taken the trouble to do a bit of research to back up your claims. Unfortunately you have still not come up with anything to show that outside a very narrow range of applications, the use of regular expressions for complex parsing tasks isn’t doomed to fail, unless it is propped up by an ever growing tangle of bolted on (and more often than not) mutually incompatible kludges.

I do believe I have pushed regular expressions farther than almost anyone else. I have the empirical knowledge of creating my own framework (JAIL - JavaScript Artificial Intelligence Language), and the practical experience of building a complete chatbot (Skynet-AI) from scratch to test my theories. I even successfully replicated the results of the MIT “Student” program, which was written in LISP, and have built a real-time basic part-of-speech tool.
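Just to give a flavour of what I mean by a regex-driven part-of-speech pass, here is a throwaway sketch - not the actual JAIL code, and the word lists and suffix rules are invented purely for illustration:

// Toy illustration only - invented word lists and suffix rules.
// A first pass that guesses part of speech from closed-class word lists
// and simple suffix patterns, all driven by regular expressions.
var DET  = /^(a|an|the|this|that|these|those)$/i;
var PRON = /^(i|you|he|she|it|we|they|me|him|her|us|them)$/i;
var PREP = /^(in|on|at|with|from|to|of|for|by)$/i;
var VERBS = /^(ate|eat|eats|is|are|was|love|loves|bought)$/i;
var NOUNS = /^(dinner|fork|friend|ketchup|apple|apples|computer|car)$/i;
var VERB_SUFFIX = /(ed|ing|ize|ise)$/i;
var NOUN_SUFFIX = /(tion|ment|ness|ship)$/i;

function tag(word) {
  if (DET.test(word))  return 'DET';
  if (PRON.test(word)) return 'PRON';
  if (PREP.test(word)) return 'PREP';
  if (VERBS.test(word)) return 'VERB';
  if (NOUNS.test(word)) return 'NOUN';
  if (VERB_SUFFIX.test(word)) return 'VERB';
  if (NOUN_SUFFIX.test(word)) return 'NOUN';
  return 'UNKNOWN';
}

function tagSentence(text) {
  return (text.toLowerCase().match(/[a-z']+/g) || []).map(function (w) {
    return w + '/' + tag(w);
  }).join(' ');
}

// tagSentence("I ate dinner with a fork")
//   -> "i/PRON ate/VERB dinner/NOUN with/PREP a/DET fork/NOUN"

Obviously a real tagger needs much bigger word lists and a way to resolve words that match more than one category, but the regex machinery scales to that surprisingly well.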

Andrew Smith - Dec 15, 2011:

I have already implemented a number of fast and elegant parsers for much more complex problems using it. I even published all the source code for one of them in this forum, the parser for discourse analysis which you can download from here for comparison: http://www.chatbots.org/ai_zone/viewreply/7811/

I think I will take a few minutes today to try to replicate your discourse analysis parser via regex.  smile

 

 
  [ # 36 ]

When all you’ve got is a hammer, everything looks like a nail.

 

 
  [ # 37 ]

I use RegEx only where it is practical and makes sense. Just like I have been told over the years, “why use JS for that when you can use X instead?” My answer has always been and still is, “I am proficient in JS and it makes sense for this application”.

Of course, those are the same people who told me that Flash was taking over from JavaScript and would eventually overtake HTML.  smile

 

 
  [ # 38 ]
Andrew Smith - Dec 15, 2011:

When all you’ve got is a hammer, everything looks like a nail.

Ah, but in the hands of a true master, that hammer can be used in oh, so many ways; even, possibly, to create ice sculptures. smile

 

 
  [ # 39 ]

I don’t know the breakdown of how much ‘true NLU’ (ok, I made up a term smile ) will involve grammar and how much ‘world knowledge’, but I suspect world knowledge will be an absolutely huge portion. 

Input 1 - “I ate dinner with a fork”
Input 2 - “I ate dinner with a friend”
Input 3 - “I ate dinner with some ketchup”


Input 1 parsing
————————

Possibility 1 - I ate both dinner and a fork
Possibility 2 - I ate dinner, and I used a fork to do it (to eat)
Possibility 3 - I ate dinner in the company of a fork.

Input 2 parsing
———————-

Possibility 1 - I ate dinner and a friend was in it.
Possibility 2 - I ate dinner, and I used a friend to do it (to eat)
Possibility 3 - I ate dinner, in the company of my friend.

Input 3 parsing
————————
Possibility 1 - I ate dinner which had some ketchup on it, in it.
Possibility 2 - I ate dinner, and I used some ketchup to do it (to eat)
Possibility 3 - I ate dinner in the company of some ketchup.


Reasoning Evaluation of all possible grammatically correct parse trees
———————————————————————————————————

Input 1/parse 1: forks are utensils, no knowledge of humans eating utensils, thus low confidence in this option.
Input 1/parse 2: forks are utensils, and we -do- have knowledge that utensils are used to eat, thus higher confidence in this option.
Input 1/parse 3: forks are utensils, no knowledge of humans preferring the company of utensils, thus low confidence in this option.

Input 2/parse 1: most of the world today isn’t into cannibalism, thus low confidence here.
Input 2/parse 2: perhaps we have no arms or legs and a friend spoon fed us, ok, that is possible, but that goes into ‘conversation state/user-specifics’—see below.
Input 2/parse 3: people like to dine with others, thus higher confidence here.

Input 3/parse 1: many people like ketchup, especially my wife, so high confidence here.
Input 3/parse 2: ketchup isn’t really used to do the actual ACT of eating, that I know of, thus low confidence.
Input 3/parse 3: ketchup isn’t a human being, thus low confidence.

Conclusion: NLU needs NLP + common sense reasoning.  But is it best to achieve this common sense reasoning using statistical methods, or rule-based ones? Myself, I will employ both (rule-based a bit more so, and statistics as a ‘second opinion’ or ‘tie breaker’).
But statistics alone to achieve this?... I doubt it. 
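To make that a little more concrete, here is a very rough sketch of the kind of rule-based evaluation I mean - toy knowledge base and invented confidence numbers, nothing like real code, just to show the idea:

// Toy sketch - invented facts and scores, only to illustrate scoring the
// three attachments of "with X" in "I ate dinner with a X" against a tiny
// bit of world knowledge.
var knowledge = {
  fork:    { isUtensil: true,  isPerson: false, isFood: false },
  friend:  { isUtensil: false, isPerson: true,  isFood: false },
  ketchup: { isUtensil: false, isPerson: false, isFood: true  }
};

function scoreParses(noun) {
  var k = knowledge[noun] || {};
  return [
    { reading: 'I ate dinner and the ' + noun + ' was part of it',
      confidence: k.isFood ? 0.8 : 0.05 },      // food and condiments can be eaten
    { reading: 'I used the ' + noun + ' to eat dinner',
      confidence: k.isUtensil ? 0.9 : 0.05 },   // utensils are instruments of eating
    { reading: 'I ate dinner in the company of the ' + noun,
      confidence: k.isPerson ? 0.9 : 0.05 }     // people are dinner companions
  ].sort(function (a, b) { return b.confidence - a.confidence; });
}

// scoreParses('fork')[0].reading    -> "I used the fork to eat dinner"
// scoreParses('friend')[0].reading  -> "I ate dinner in the company of the friend"
// scoreParses('ketchup')[0].reading -> "I ate dinner and the ketchup was part of it"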

Now, in addition to parse trees, and in addition to evaluating those possible trees in the context of common sense and ‘world knowledge’, we also need another two (perhaps more) types of evaluations.

The context, or the specific ‘state’ of the particular conversation, should also be able to override those default evaluations.  Example: the above common sense rules are great, but what if the user entered something funny like ‘The car smiled’?  Well, ordinarily this makes no sense and a low confidence should be assigned to that particular parse tree.  But what if the bot is chatting with a child, and the child is telling it what happened in one of the stories he/she read recently?  Taking that state or context into account, we actually boost the confidence.

conversation state/user-specifics
—————————————————

Demographics of the user - yet another source for evaluating among the possible parses and ‘promoting’ one.  This helps in word sense disambiguation.  Also ‘teen talk’: “It was sick” coming from someone under 19 can actually mean something positive, but my mom saying “It is sick” pretty much always means a bad thing.

In short, it takes a LOT more than just grammar, and more even than common sense reasoning, but many, many sources of evaluation, just to decide which parse tree is the one the user really meant.  Then there is what to do with the input once you think you know what it means.  In my design, if the system hasn’t yet decided which tree is best, it actually carries more than one forward, and then how much it was actually able to *do with the input* becomes another source of evaluation and a factor in the overall ‘confidence calculation’.

Take the above example (Input 2/parse 2): in this particular conversation, with that specific user, perhaps the user needs help when eating, so in light of that, promote that possible parse.

So: perhaps the pattern-matching/NLP is only a small portion of reaching NLU.  The ‘real meat’ is, after generating all those perhaps thousands of possible interpretations/trees, evaluating each of them (at many levels, from common sense reasoning/knowledge up to the specifics of that particular user, that particular conversation, etc.).  Again, I’m not saying your approach will fail, Steve (and in fact, if you *DO* pull it off, you’ll be the most intelligent person I know), and in the end it may simply be a ‘gut feel’ that NLP is the way to go, and you’ll prove us wrong.
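Roughly, combining those sources might look like this - made-up weights and hypothetical field names, purely to illustrate many evaluation sources feeding one confidence number:

// Toy sketch - invented weights and sources, only to show several independent
// evaluations being folded into one overall confidence for a candidate parse.
function overallConfidence(parse, context) {
  var sources = [
    parse.commonSenseScore,                             // world knowledge, as above
    context.childStoryMode ? parse.fictionBonus : 0,    // conversation state can override the default
    context.userNeedsHelpEating && parse.instrumentIsPerson ? 0.3 : 0,  // user specifics
    parse.amountSystemCouldDoWithIt                     // how much the bot could actually do with it
  ];
  return sources.reduce(function (sum, s) { return sum + s; }, 0) / sources.length;
}

// overallConfidence(
//   { commonSenseScore: 0.1, fictionBonus: 0.6, instrumentIsPerson: false, amountSystemCouldDoWithIt: 0.3 },
//   { childStoryMode: true, userNeedsHelpEating: false })
//   -> 0.25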

Dave Morton - Dec 15, 2011:

Ah, but in the hands of a true master, that hammer can be used in oh, so many ways; even, possibly, to create ice sculptures. smile

very true, and imagine what Newton could have done with a computer.

 

 
  [ # 40 ]
Victor Shulist - Dec 15, 2011:

...imagine what Newton could have done with a computer.

If it were a computer with Windows on it, he probably wouldn’t have gotten past Solitaire. raspberry

 

 
  [ # 41 ]

Victor:

I believe a good knowledge base is an important part of the parsing and confidence evaluation process.

Human: I love apples.

Parsing returns: apple~fruit/food/~company/computer/technology/

Bot: I am not sure if you are talking about fruit or computers?

Human: The computer.

Bot: yes, I agree that Apple is a great company and makes a superior computer.

So as you can see, with just a little word association knowledge, the bot can build a convincing and intelligent reply.
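In skeleton form, the flow is something like this (a simplified sketch, not my actual code, with a made-up association table):

// Skeleton only.  A word whose associations span more than one sense group
// triggers a clarifying question; otherwise the bot picks the obvious sense.
var associations = {
  apple: ['fruit', 'food', 'company', 'computer', 'technology']
};

var FOOD_SENSES = ['fruit', 'food'];
var TECH_SENSES = ['company', 'computer', 'technology'];

function reply(word) {
  var senses = associations[word] || [];
  var isFood = senses.some(function (s) { return FOOD_SENSES.indexOf(s) !== -1; });
  var isTech = senses.some(function (s) { return TECH_SENSES.indexOf(s) !== -1; });
  if (isFood && isTech) {
    return 'I am not sure if you are talking about fruit or computers?';
  }
  if (isTech) return 'Yes, I agree that Apple is a great company and makes a superior computer.';
  if (isFood) return 'Apples are a delicious fruit.';
  return 'Tell me more.';
}

// reply('apple') -> "I am not sure if you are talking about fruit or computers?"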

Dave:

Yes, and if Newton had an iPad, he would drop it from a tree to test gravity. smile

 

 
  [ # 42 ]
Laura Patterson - Dec 15, 2011:

So as you can see, with just a little word association knowledge, the bot can build a convincing and intelligent reply.

 

100% Agreement.

Laura Patterson - Dec 15, 2011:

Human: I love apples.

 

I would also suggest the bot take into account 2 things about that input:

a) lowercase ‘a’ in apples

b) apples is plural... not many people say “I love apples” to mean Apple Inc.  They’d be more likely to say

I love Apple computers
or
I love Apple laptops
or just

I love Apple.

No, they shouldn’t *have to* capitalize… but every bit of help and hints and tips the bot can get its hands on, the better… so if they -do- use a capital A, it suggests more strongly that it is Apple Inc… and if they use the plural ‘s’, that, to me, highly suggests the fruit apple.

At first reading it, I never thought for an instant you meant Apple Inc.

But, again, it also goes back to my comments above - include as many sources as possible. If, for example, the bot knows that it is talking to a farmer who is not really “into” computers, *and* you used the plural form of apple, *and* you didn’t capitalize, then it is HIGHLY likely you meant the fruit, and not the company.

Without these (perhaps somewhat probabilistic) assumptions, the system, given any input of say 30 words, with each word having, what, about 20 definitions, would be asking A LOT of questions every single time: a pretty unusable system.

Also, a very nice feature would be:

“from now on when I say “apples”, assume “Apple Inc” unless specified otherwise”

Then that would go into the “user-specifics” source of parse tree evaluations.

The point to take away from this is that there are many sources to draw upon.  I figure, so far, in my design anyway, probably a half dozen by the time I’m done smile
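Crudely, the capitalization/plural/user-profile hints could combine something like this (toy weights and hypothetical profile values, nothing more than a sketch):

// Toy sketch - invented weights, just to show several hints being combined
// to bias the sense of "apple(s)" toward the fruit or the company.
function appleSense(tokenAsTyped, userProfile) {
  var score = 0;                                    // positive -> Apple Inc, negative -> the fruit
  if (/^A/.test(tokenAsTyped))  score += 2;         // capital A hints at the proper noun
  if (/s$/i.test(tokenAsTyped)) score -= 2;         // plural strongly hints at the fruit
  if (userProfile === 'farmer')     score -= 1;     // user specifics nudge it further
  if (userProfile === 'programmer') score += 1;
  return score > 0 ? 'Apple Inc' : 'the fruit';
}

// appleSense('apples', 'farmer')    -> "the fruit"
// appleSense('Apple', 'programmer') -> "Apple Inc"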

 

 

 
  [ # 43 ]

When a user submits their input to my Bot, the first thing that happens is the text is converted to lower case. Next, the input is split into an array of words using the spacing as a delimiter. At the same time an array is created so that the processor can step through the words as pairs, looking for compound words like New York. If a compound is not found, then each individual word in the pair is processed against the knowledge base of (at last count) about 116,714 word associations. Plurals are handled at the front end of processing by removing the “s” from the word(s), marking the variable, then passing the word through the NLP to build the tree. The modifiers and associative nouns are combined with the knowledge base data, where the “secret sauce” is added before formulating the reply.
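Very roughly, that front end looks like this (a heavily simplified sketch, not the production code, with a tiny sample compound list):

// Heavily simplified front end: lowercase, split on spaces, scan word pairs
// for compounds, strip and flag plurals.  The knowledge base lookup and NLP
// tree building would then run over the tokens this returns.
var compounds = { 'new york': true };   // sample entry only

function preprocess(input) {
  var words = input.toLowerCase().split(/\s+/);
  var tokens = [];
  for (var i = 0; i < words.length; i++) {
    var pair = (words[i] + ' ' + (words[i + 1] || '')).trim();
    if (compounds.hasOwnProperty(pair)) {
      tokens.push({ word: pair, plural: false });          // found a compound, consume both words
      i++;
    } else {
      var w = words[i];
      var plural = /s$/.test(w) && w.length > 3;           // crude plural check
      tokens.push({ word: plural ? w.slice(0, -1) : w, plural: plural });
    }
  }
  return tokens;
}

// preprocess("I love New York apples") returns tokens for
//   'i', 'love', 'new york' and 'apple' (with plural flagged true on the last one)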

Since my Bot functions as much more than a chatterbox, there is another layer of task-oriented processing that takes place as well. The whole process of parsing and pre-processing of the input is performed without ever leaving the client browser. As a matter of fact, aside from the Internet co-browsing functionality, there is actually no need for an Internet connection. There is no noticeable latency from input to reply. It’s a very efficient system with controlled error checking.

I have said more than I had originally intended to in this forum about my project, since it is a commercial venture in partnership with others. I have been so impressed with the openness of many of the members here that I wanted to share what I could. I really enjoy the brainstorming that takes place here and I have benefited greatly from my participation. I am not claiming any great breakthrough in my approach with this project, because that is not my goal. If sharing what I can helps another member with their project, then it was well worth the effort.

Please excuse any misspellings or grammar mistakes, as I have posted this using my iPhone. Handy device, but not so much for making long technical posts on forums. wink

 

 
  [ # 44 ]

Interesting - *almost* exactly the same ‘first step’ as I do in GLI.  Except that just before everything goes to lowercase (step 2), there is step 1: the original is kept.  So ‘I bought a Toyota’ is converted to ‘i bought a toyota’; for processing purposes the direct object is ‘toyota’ and it takes on associations of car, company, vehicle etc., but it also has a flag set: ‘initial-letter-capitalized’ = ‘true’.  So the system *works* with it as all lowercase, but at the same time other logic, if it needs to know, can consult that ‘initial-letter-capitalized’ flag.  So dcomp.noun.1.val = ‘toyota’ is the variable that holds the first direct object of the sentence, but also dcomp.noun.1.initial-letter-capitalized = true.  It also retains the original as dcomp.noun.1.orig.val = Toyota.  The flag is there for speed.  In other words, the system both -is- and -is-not- case sensitive… well, not for the most part, but it is if that is deemed necessary.  If a word -is- capitalized, it is a hint that it may be a proper noun.
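In plain JavaScript terms the idea is something like this (a simplified illustration, not the real GLI structures or variable names):

// Simplified illustration: match against the lowercased form, but keep the
// original spelling and a capitalization flag for any logic that needs the hint.
function makeToken(original) {
  return {
    val: original.toLowerCase(),                        // what the matching logic works with
    orig: original,                                     // the word exactly as the user typed it
    initialLetterCapitalized: /^[A-Z]/.test(original)   // cheap proper-noun hint, kept for speed
  };
}

// makeToken('Toyota')
//   -> { val: 'toyota', orig: 'Toyota', initialLetterCapitalized: true }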

 

 
  [ # 45 ]

Laura, is it possible to store the original string in a “raw” state, to use at various stages of the processing, for things like Case Comparison, etc., for situations such as Victor has mentioned? It seems to me that this may prove to be a useful thing, at some point. smile

As for the level of sharing, I think it’s wonderful that you feel inclined to do so, and I, for one, appreciate it. And it’s not as if you’ve posted anything that will cause you problems later with copyrights or anything like that. smile

 
