
CAESAR
 
 

Hey folks,

I posted in another thread here the other day and mentioned I would create a topic regarding the details of CAESAR once I’d figured out what I can and can’t state about it. So here I am :)

First, a bit of background. Some of the old-timers around here (*cheeky grin*) might remember I posted around 18+ months ago regarding a project called ALF that my colleagues and I had been working on for some time. This went through a number of iterations, and we finally decided to take what we had learned and pursue an actual set of real goals.

Enter CAESAR.

While CAESAR isn’t strictly a chatbot project, the chatbot aspects come for free, and, well, you folks like to hear about this stuff, right?

So what is it?
CAESAR is what I believe will be the first general-purpose, dare I say it, strong AI (assuming we are successful): able to reason, free of environment limitations, and capable of solving general problems and providing solutions for them.

Sounds far-fetched, I know, but this is so far the culmination of around 4 years of work by the team, and I myself have been chasing AI for years. In school I used to build little robots and code up BBC Micros (showing my age) to control and interact with environments, so it’s safe to say I’ve been heading towards this point for many, many years.

That said, even in the event that strong AI isn’t achieved, the technology, even this far in, is very useful, especially in the domain of the semantic web. So even if we fail, we succeed… paradoxical, I know :)

What have we done so far?
Currently there are a number of systems in place that support the main “guts” of CAESAR; we want it to be able to learn facts and information unsupervised.

Crawler
Our first port of call was data, lots and lots of data. For this we developed a web crawler with specific properties for CAESAR: it churns through the web, executes on-page JavaScript while crawling, and performs some preprocessing on the page data for performance. Many crawlers out there today do not process on-page JavaScript, so information can be lost, quite a bit of it in fact, and we deemed that unacceptable.

So far we have 200M pages and rising. That may not seem like a great deal, but many of these pages are crawled daily, as content changes constantly, and for now at least the data that the 200M set provides is more than ample for development.
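
The crawler itself is proprietary, but to give a flavour of the JavaScript point above, here’s a minimal sketch of a fetch step that renders on-page scripts before handing the HTML on. Playwright is purely my stand-in here for illustration, not our actual stack:

```python
# Minimal sketch of a JS-executing fetch step (Playwright is an
# illustrative choice; the real crawler's internals are not public).
from playwright.sync_api import sync_playwright

def fetch_rendered(url: str) -> str:
    """Return the page HTML *after* on-page JavaScript has run."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for scripts to settle
        html = page.content()                     # the post-execution DOM
        browser.close()
    return html
```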

Post-Crawl Processing - Structured Documents
Once a page has been crawled it is processed into what we call a “structured document”, where particles of interest are highlighted, assembled into a manageable tree, and stored for later processing. Elements such as headings, paragraphs, sentences, bullet lists, and hyperlinks are all processed. Data that is deemed unimportant is discarded; other data is simply flagged as a “potential” information element.
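
As a simplified stand-in for that pass (not our production code), you could walk the rendered HTML and keep just the element types listed above:

```python
# Toy "structured document" pass: keep only the particles of interest
# the post lists; everything else is discarded as unimportant.
from bs4 import BeautifulSoup

INTERESTING = ["h1", "h2", "h3", "p", "li", "a"]

def to_structured_document(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    doc = []
    for el in soup.find_all(INTERESTING):
        text = el.get_text(" ", strip=True)
        if not text:
            continue  # deemed unimportant: discard
        doc.append({
            "type": el.name,              # heading / paragraph / list item / link
            "text": text,
            "href": el.get("href"),       # only set for hyperlinks
            "potential": el.name == "a",  # flagged as a "potential" element
        })
    return doc
```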

Extraction
The extraction process is the second most complicated and involved process in the system, after the Rationalizer, and is still undergoing tweaks and improvements after 3 years of work.

This link in the chain takes the structured documents and passes the text of each structured element through our NLP module. This module cleans up text, performs spelling corrections, and spits out a number of different data sets: POS tags, typed dependency trees, phrasal structures, NER, co-reference, sentence/paragraph polarities, and other needed data.
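
To make that concrete, here’s roughly the shape of output such a module produces, with spaCy as a stand-in (I can’t name our actual pipeline, and co-reference and polarity would need extra components):

```python
# Sketch of the datasets an NLP module like this might emit,
# with spaCy standing in for the real (unnamed) pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")

def analyse(sentence: str) -> dict:
    doc = nlp(sentence)
    return {
        "pos": [(t.text, t.pos_) for t in doc],                # POS tags
        "deps": [(t.text, t.dep_, t.head.text) for t in doc],  # typed dependencies
        "ner": [(e.text, e.label_) for e in doc.ents],         # named entities
        # co-reference and polarity would come from additional
        # components (e.g. a coref model, a sentiment classifier)
    }

print(analyse("Microsoft acquired Skype in 2011."))
```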

Rationalizer
The Rationalizer is the key component of our extraction system, and I have to guard its inner workings (this is a commercial project, after all). This component processes the datasets produced by the extractor and organises the information into subject, object, action, and property chains.

This then provides a common dataset which can be applied to any language, scenario, problem, or input, giving you a reliable, structured, dependable representation.
The process can also be “run in reverse”, presenting you with the original input (albeit with better structure, grammar, and spelling, depending on your English skills).
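
I can’t show the Rationalizer itself, but a crude stand-in for the subject/action/object/property organisation, built on a spaCy dependency parse like the sketch above, might start like this (all names illustrative):

```python
# Crude illustration of organising a dependency parse into
# subject / action / object / property chains. Not the real Rationalizer.
import spacy

nlp = spacy.load("en_core_web_sm")

def rationalize(doc) -> list[dict]:
    chains = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
        for s in subjects:
            for o in objects:
                props = [a.text for a in o.children if a.dep_ == "amod"]
                chains.append({"subject": s.text, "action": token.lemma_,
                               "object": o.text, "properties": props})
    return chains

print(rationalize(nlp("Microsoft acquired Skype in 2011.")))
```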

Typed Relations & Classifications
Once you have the Rationalizer output, classifying and determining relations between objects becomes quite simple. New relations and classifications are created where needed; old ones are updated if new facts have arisen that force a change.
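
As a toy illustration of that create-or-update behaviour (the names and facts below are hard-coded purely for the example):

```python
# Illustrative typed-relation store: new facts create entries,
# change-forcing facts overwrite old ones. A big simplification.
relations: dict[tuple[str, str], str] = {}

def assert_fact(subject: str, relation: str, obj: str) -> None:
    key = (subject, relation)
    if relations.get(key) != obj:
        relations[key] = obj  # create new, or update on a new fact

assert_fact("Skype", "acquired_by", "eBay")
assert_fact("Skype", "acquired_by", "Microsoft")  # newer fact forces an update
```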

Inference & Deduction
With enough data, unsupervised Horn clause generation from your Rationalizer and Typed Relation/Classification data is surprisingly simple, with few general seed rules required. From that, “common sense” inferences such as “if A is C and B is A, then B is C” are possible with minimal processing and effort.
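
For the curious, that transitivity rule is easy to demonstrate with a tiny forward-chaining loop over (X, Y) pairs read as “X is Y” (a toy, nothing like our actual engine):

```python
# Forward chaining for the "B is A, A is C => B is C" rule.
def infer(facts: set[tuple[str, str]]) -> set[tuple[str, str]]:
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (b, a) in list(derived):
            for (a2, c) in list(derived):
                if a == a2 and (b, c) not in derived:
                    derived.add((b, c))  # B is A and A is C, so B is C
                    changed = True
    return derived

print(infer({("A", "C"), ("B", "A")}))  # now also contains ("B", "C")
```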

——

Well, that’s a good hour spent. I’ll add further details of what we have tomorrow or in a few days’ time, as I should really get back to the grindstone.

Any questions, observations etc., post them up and I’ll reply where I can :)

 

 
  [ # 1 ]

Great work. Is it close to being released? How much will it cost? Will I be able to specify my domain of interest? Will I be able to structure the outputs?

 

 
  [ # 2 ]

Inference & Deduction
With enough data, unsupervised Horn clause generation from your Rationalizer and Typed Relation/Classification data is surprisingly simple, with few general seed rules required. From that, “common sense” inferences such as “if A is C and B is A, then B is C” are possible with minimal processing and effort.

this reminds me of Lenat’s work on Eurisko.

Eurisko (Gr., “I discover”) is a program written by Douglas Lenat in RLL-1, a representation language itself written in the Lisp programming language. A sequel to Automated Mathematician, it consists of heuristics, i.e. rules of thumb, including heuristics describing how to use and change its own heuristics. Lenat was frustrated by Automated Mathematician’s constraint to a single domain and so developed Eurisko; his frustration with the effort of encoding domain knowledge for Eurisko led to his subsequent (and, as of 2008, continuing) development of Cyc. Lenat envisions ultimately coupling the Cyc knowledge base with the Eurisko discovery engine.

 

 
  [ # 3 ]

Thanks. We have a roadmap for development over the next 5 years, each stage of which fulfils a sub-project of sorts, i.e. big-data fact extraction, Q&A, analysis for marketing, and the biggie: web search.

So it’s never going to be “released” as such, but we are planning to offer APIs which 3rd parties can use for various tasks, as well as our own commercial and public services.

The APIs will allow you to specify your domains of interest, rules for processing, your outputs, etc.

Obviously these haven’t even been designed yet, but that’s the route we are planning to take.

On the current timeline, we’re hoping to crack a particular golden egg over the course of the next week or so, to do with semantic and thematic role labelling, which we are very close to as of today.

We want totally unsupervised SRL with a very minimal set of rules in the language domain of the parsing, able to extract facts, subjects, objects, entities etc. from text that may have ambiguous properties, yet achieve an accuracy of 80%+. Of course this is a large task and will be refined over the foreseeable future, but we should have a good solid foundation and algorithm in place over the next week. This should be another real first, as I believe current SRL techniques require a large training corpus and are only semi-unsupervised (systems that use FrameNet, for example).

For those that like a bit of technical detail, the key we found to this puzzle was to do with dependency parse trees and phrase trees. Using the two in a particular way (noun phrases, verb phrases) together with the dependency grammar trees, it is possible to extract SRL to a high degree with nothing more than a POS tagger, a grammar parser, and our algorithm :)
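
I obviously can’t publish the algorithm, but here’s a hedged sketch of the general direction, combining noun phrases with dependency labels to assign coarse roles (the mapping below is illustrative, not our rules):

```python
# Rough SRL sketch: noun phrases + dependency labels -> coarse roles.
# The role mapping is illustrative only, not the actual algorithm.
import spacy

nlp = spacy.load("en_core_web_sm")

ROLE_BY_DEP = {"nsubj": "AGENT", "nsubjpass": "PATIENT",
               "dobj": "PATIENT", "pobj": "OBLIQUE"}

def rough_srl(sentence: str) -> list[tuple[str, str, str]]:
    roles = []
    for chunk in nlp(sentence).noun_chunks:  # noun phrases from the parser
        role = ROLE_BY_DEP.get(chunk.root.dep_)
        if not role:
            continue
        head = chunk.root.head
        if head.pos_ == "ADP":               # climb past a preposition to its verb
            head = head.head
        roles.append((chunk.text, role, head.lemma_))
    return roles

print(rough_srl("Microsoft acquired Skype for 8.5 billion dollars."))
```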

I’m hoping that with this module completed, we should be able to put together a simple Q&A demo online for folks to try out, which can be queried in natural language such as “Who acquired Skype for billions?”, the answer of course being “Microsoft, for $8.5 billion”.
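
Under the hood, once facts sit in subject/action/object form, a “who” question is little more than a lookup. A toy illustration with one hard-coded fact:

```python
# Toy Q&A over extracted chains; the fact below is hard-coded.
facts = [{"subject": "Microsoft", "action": "acquire",
          "object": "Skype", "properties": ["8.5 billion dollars"]}]

def answer_who(action, obj):
    for f in facts:
        if f["action"] == action and f["object"] == obj:
            return f"{f['subject']}, for {f['properties'][0]}"
    return None

print(answer_who("acquire", "Skype"))  # Microsoft, for 8.5 billion dollars
```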

 

 
  [ # 4 ]

I’m assuming by SRL you mean Statistical Relational Learning, such as is done by NELL (Never-Ending Language Learning) at Carnegie Mellon University? That’s an interesting project which has been running for a few years now, almost completely unsupervised.

http://rtw.ml.cmu.edu/rtw

http://en.wikipedia.org/wiki/Statistical_relational_learning

 

 
  [ # 5 ]

Yup, that’s correct. I know of NELL and it’s very impressive.

We are attempting to do the same, but hopefully with better overall accuracy, as there are a few “niggles” with NELL after a large data set has been “read”, according to its developers.

I guess time will tell if we achieve our goal.

 

 
  [ # 6 ]

Hi Dan,

We haven’t heard from you in ages and I was wondering how the CAESAR project is going. Your last reports from this time last year sounded very promising, and while it is not unusual for projects of this type to take (many) years to reach fruition, I was wondering if you had any updates for us to think about.

 

 
  [ # 7 ]

Hey, I’m a student and want to build a generalized bot for university enquiries, but I don’t get any help. Which platform should I use? Please help.

 

 
  [ # 8 ]

Why bump up a 2-year-old thread with your unrelated post?

 

 
  [ # 9 ]

It’s annoying, isn’t it? But for what it’s worth, I’d still like to know what happened to Dan and his CAESAR project. They were investing so much effort in it, it must have come to something.

 

 