Summarization AI
 
 

I’ve been trying to get my NLP AI to summarise documents semantically rather than textually, and so far the results have been blowing up in my face. That is to say I have a decent topic extractor but the AI’s misunderstandings and language generation make a pretty big mess of writing. But that’s not the problem. The problem is I need some testable standards to gauge more clearly how well it’s doing.

So I was wondering if any of you could post or link a single page of text on a single topic, with a not too intellectual level of language, that is well summarisable (containing lots of excess info and preferably a conclusion).

Secondly, do you know any good online summarisers? Most that I’ve seen tend to gather sentences that contain the most frequently used word in the document, often overlooking the point the writer was making. SMMRY is one.
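
To make the complaint concrete, here is a minimal Python sketch of the word-frequency approach described above. It is not SMMRY's actual algorithm; the stop-word list and the function names are assumptions made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
              "is", "it", "that", "was", "on", "for"}

def naive_split(text):
    # Split on sentence-final punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def frequency_summary(text, max_sentences=2):
    sentences = naive_split(text)
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    counts = Counter(words)

    def score(sentence):
        # A sentence scores the summed document frequency of its words.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(counts[t] for t in tokens if t not in STOP_WORDS)

    # Pick the highest-scoring sentences, then restore document order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]
```

Given "Cats sleep a lot. Cats like fish and cats like milk. The weather was nice. So the cat is a lazy animal." and max_sentences=1, this picks the fish-and-milk sentence because it repeats the most frequent words, and skips the conclusion.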

 

 
  [ # 1 ]

I have addressed a lot of questions about automatic summarization on Quora already.  You may be referring to abstraction-based summarization or abstractive summarization, see below.  The only algorithm I can think of for relevance scoring, other than statistical relevance, would be the computationally intensive comparing of “semantic fingerprints” using computer vision algorithms, which may be similar to what the brain is actually doing.

The more basic problem I’m having is sentence recognition, in particular sentence recognition based on grammar, rather than punctuation.  I would love to have an API that can recognize sentences based purely on grammar.  In my work with Twitter summarization, I have seen that real people almost never communicate in proper written English; so, AI needs to not only understand fragments, but also be able to summarize them.

- What is the best tool to summarize a text document?

- Where do I need to start for making project on article summarize tool (natural language processing) in not more than 40 days?

- Is it worth writing your own summarization algorithm or are there any APIs that are good enough?

- Is it possible to build a AI agent that asks/answer questions based on current state of affairs rather than humans posting questions/answers?

- What are Paper.li competitors?

- What are some good research publications on (or resources to learn about) abstraction based document summarization?

- What is a killer text summarization API that will be able to summarize an article to a headline sized piece of text?

- What are the best open source resources for automatic document summarization?

- What is the state of the art in text synthesis?

- What’s a good* problem to hack at as a personal project in the area of summarization?

- How could a text summarization application be endowed with “personalities”?

 

 
  [ # 2 ]

Thank you Marcus, as usual you are a great source of information.

You’re right, I mean to do abstraction-based summarization instead of just extraction. Seeing as my AI was already built to convert sentences to basic facts and paraphrase somewhat, I thought I’d try and turn what I had into a useful tool. Right now I’m only having it filter out questions and negative, uncertain or inaccurate facts, but I want to take it a lot further and take writing conventions into account to dig out the most important facts.

As to your problem, do you mean you want to split poorly punctuated phrases like “Hi how are you I am fine” into “Hi - how are you - I am fine”? I could post what basic grammar rules I use for that sort of thing if you’d like, although they were designed to split into individual clauses. For example, one rule is to always split between two reference words, e.g. “I told him - I am awesome”. At which point a summariser could omit the redundant reporting clause “I told him -”.
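
A rough Python sketch of that one rule, splitting wherever two reference words are adjacent. The word list and the function name are assumptions for illustration, not the actual rules used in my program.

```python
# Hypothetical, deliberately short list of reference words (pronouns).
REFERENCE_WORDS = {"i", "you", "he", "she", "it", "we", "they",
                   "me", "him", "her", "us", "them"}

def split_between_reference_words(text):
    clauses, current = [], []
    previous = None
    for word in text.split():
        key = word.lower().strip(".,!?")
        # Two reference words in a row mark a clause boundary.
        if previous in REFERENCE_WORDS and key in REFERENCE_WORDS and current:
            clauses.append(" ".join(current))
            current = []
        current.append(word)
        previous = key
    if current:
        clauses.append(" ".join(current))
    return clauses
```

On “I told him I am awesome” this gives two clauses, “I told him” and “I am awesome”. On “Hi how are you I am fine” it only finds the split after “you”; splitting off the greeting would need a separate rule. It will also mis-split phrases such as “I told him her name”, so on its own it is only a starting point.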

 

 
  [ # 3 ]

I am considering preliminary experiments, in order to get Uberbot reading and understanding textual input. I need to start at a basic level. I found this which looks like a good source of basic textual input:

http://research.microsoft.com/en-us/um/redmond/projects/mctest/

I don’t know of any online summarizers. There is an online ‘link’ grammar parser which splits a valid or nearly valid sentence into its grammatical parts, but I realize that’s not exactly what you are after. I hope this helps.

 

 
  [ # 4 ]

My position is that an AI ought to be able to both “read”, a blog for instance, as well as “write” something like a blog.  Of course, this would involve deconstruction and reconstruction, or automatic summarization and natural language generation.  For instance, how might an AI “digest” a given website, and then answer natural language questions based on that content?

One problem is that all major chatbot engines are based off markup language and pattern matching.  For instance, there is no major conversational engine available based primarily on the tasks of natural language processing, nor any major conversational engine based on semantic web technology, in other words subject-predicate-object triples.

Nor is there any standard software available for the conversion of either parsed NLP corpora or parsed semantic web corpora into major chatbot languages.  Further, there is no standard way for the creation of “documents” from chatbot languages, which would necessitate a form of continuity, or longitudinal topic coherence, notoriously absent from chatbot dialog.  One solution might be “topical” templates, such as those used for generating “automatic books”.

Regarding sentences, the conversion of partial or incorrect sentences into proper written English is called normalization.  However, it seems highly inefficient to have to convert every incomplete utterance into proper English.  That said, I’m still in search of a high volume sentence filter, based on grammar and not punctuation.  I want a simple stop/go, red light/green light, sentence filter - for automatically rejecting everything that is not a proper sentence.  Ultimately, I want to be able to parse any website and reject everything that is not a complete sentence - for which punctuation is usually an inadequate rule.

In short, the goal I am looking at now is an AI that can “read” any website - and then not only answer any natural language question based on that content, but also “write” a coherent summary of that website (think Turing test), to a blog post for instance.

 

 
  [ # 5 ]

Thanks Will, that looks like useful test material for language comprehension, which isn’t what I’m looking for towards summarisation, but certainly useful. Makes me think I should test with English language children’s books, except I live in the wrong country for that. Does anyone have the text of a short children’s story to summarise? (Not the Jabberwocky :) )

I’m afraid I can’t help you Marcus. What you describe is the kind of AI I’m working towards (NLP, semantic analysis, fact extraction, question answering, conversation, summarisation, NLG, and then some), but I haven’t found any available comparable program either, other than individual components with compatibility issues. As you can tell from my list there are just too many complex aspects to work on, so projects tend to specialise in one. My NLG is just a crude grammatical template.

I don’t know that there are grammar parsers that distinguish INcorrect grammar. I’ve read that Twitter does not lend itself to language processing as well as to keyword mining. Probably because people hardly use verbs in Tweets.

 

 
  [ # 6 ]

> http://www.gutenberg.org

>> Project Gutenberg offers over 46,000 free ebooks: choose among free epub books, free kindle books, download them or read them online.

 

 
  [ # 7 ]

That’s a great resource :D. I’m sure I’ll find some coherently summarisable stories there. They’ve got everything in plain text too, exactly what I need. Thank you :)

 

 
  [ # 8 ]

09 Jan 2015: Robot Journalist Finds New Work on Wall Street

On the natural language generation side, in addition to automatedinsights.com and narrativescience.com, there are now two new players:

=> arria.com

=> onlyboth.com

 

 
  [ # 9 ]

I am always impressed by those journalist AIs, capable of turning boring numbers into decent phrases :)

On the topic of text processing, although it still won’t do much to parse Twitter comments, I recently found this chap who’s been trying to make a standard for sentence splitting at punctuation (distinguishing abbreviations etc). He offers the code open source on Github, and the page contains a handy checklist of edge cases:
https://www.tm-town.com/natural-language-processing

One of those small but fascinating little problems. I thought it might be useful to some.
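
For anyone who wants to play with the problem, here is a minimal Python sketch of punctuation-based splitting that skips periods belonging to abbreviations. It is not the code from the linked project; the abbreviation list and function name are assumptions, and it covers only a fraction of the edge cases on that checklist.

```python
import re

# Hypothetical, incomplete abbreviation list for illustration only.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "etc", "e.g", "i.e", "vs"}

def split_sentences(text):
    sentences, start = [], 0
    # Candidate boundary: ., ! or ? followed by whitespace and a capital letter.
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        end = match.end()
        last_word = text[start:end].split()[-1].rstrip(".!?").lower()
        if match.group() == "." and last_word in ABBREVIATIONS:
            continue  # The period belongs to an abbreviation, not a sentence end.
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # Whatever remains is the last sentence.
    return sentences
```

“Dr. Smith went home. He was tired.” comes out as two sentences rather than three. A sentence that genuinely ends in “etc.” would be glued to the next one, which is exactly the kind of case that makes this harder than it looks.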

 

 
  [ # 10 ]
Marcus Endicott - Jan 9, 2015:

My position is that an AI ought to be able to both “read”, a blog for instance, as well as “write” something like a blog.  Of course, this would involve deconstruction and reconstruction, or automatic summarization and natural language generation.  For instance, how might an AI “digest” a given website, and then answer natural language questions based on that content?

In short, the goal I am looking at now is an AI that can “read” any website - and then not only answer any natural language question based on that content, but also “write” a coherent summary of that website (think Turing test), to a blog post for instance.

The problems I am dealing with in this one regard are:

1.) AI that can “read” any website:

Just getting clean, well-formed text is not itself trivial. Deciding what to extract is the really hard part, though.  I ask myself: ‘What do you need the AI to “remember” to provide a good AI experience?’

The simplest case might be to look up a fact and answer a question about it. Is “it” a person, and if so, a he or a she? This type of information is perfect for NLP of follow-up input like “tell me some more about HIM”. If the previous subject was a “HIM”, then great, but if the AI just looked up a fact about a “KEYWORD”, say ‘Johnny Bravo’, then the lookup would need to set proper noun(s), gender, and person/place/thing assignments in order to answer NLP input containing pronouns (he, she, herself, it, that, they, each, who, whose, etc.).  This ties the conversation together and provides continuity between a lookup and the conversation.

This one issue may resolve to how to extract and preserve the pronoun relationships from text analyzed on the fly.
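
As a rough illustration of the bookkeeping involved, here is a small Python sketch. The class names, fields, and pronoun table are all assumptions invented for this example, not how my bot actually stores things.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entity:
    name: str
    kind: str                     # "person", "place" or "thing"
    gender: Optional[str] = None  # "male", "female", or None if unknown

# Simplified table of which pronouns can refer to which category.
PRONOUNS = {
    "male": {"he", "him", "his", "himself"},
    "female": {"she", "her", "hers", "herself"},
    "thing": {"it", "its", "itself", "that"},
}

class ConversationContext:
    def __init__(self):
        self.recent = []  # Entities in order of mention, newest last.

    def remember(self, entity):
        # Called after every lookup so later pronouns have something to bind to.
        self.recent.append(entity)

    def resolve(self, pronoun):
        pronoun = pronoun.lower()
        # Search from the most recent mention backwards.
        for entity in reversed(self.recent):
            key = entity.gender if entity.kind == "person" else "thing"
            if pronoun in PRONOUNS.get(key, set()):
                return entity
        return None  # No suitable antecedent has been seen yet.
```

After a lookup stores Entity("Johnny Bravo", "person", "male"), a later “tell me some more about HIM” resolves to that entry even if other facts were looked up in between. Picking the most recent matching entity is the crudest possible strategy and will fail on anything resembling a Winograd schema.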

I’ve got some of this tackled, and so far it is worth the effort, since I am no great fan of the static-db type bot, and this type of processing makes for more AI-like behavior.

As for writing a coherent summary, the coherence is likely to be directly related to the success in the initial “reading” of a web page(s).

 

 

 
  [ # 11 ]

Translating reference words is the biggest hurdle I’ve encountered in text processing, particularly for the frequently used “it” and “that”. It often causes my program to assign the wrong subjects to the wrong facts. I have a decent system for it, but linguistic distinctions alone aren’t cutting it. There is a reason the Winograd Schema Challenge uses this issue as a central mechanism.
Here is an example of why: http://artistdetective.com/winogradmermaid.swf

For the time being I’ve stooped to a hybrid, in which the program filters information semantically, but copies the original sentences when they contain something important, instead of rephrasing them. This hides most of the semantic mistakes it makes, but doesn’t do much to make the text more compact.

 

 
  [ # 12 ]
Don Patrick - Jan 27, 2015:

For the time being I’ve stooped to a hybrid, in which the program filters information semantically, but copies the original sentences when they contain something important, instead of rephrasing them.

How do you decide that a sentence contains something “important”- using proper noun detection?

 

 
  [ # 13 ]
Carl B - Jan 27, 2015:

How do you decide that a sentence contains something “important”?

That is the question, and I’m still working on a good answer. I might have better phrased it as “containing something not un-important”, as so far I only filter out all unimportant facts (unknown, uncertain, negative, questioning, e.g. featuring the word “maybe”).
Further hints are provided by the main topic (which I extract by document frequency and grammatical roles), paragraph headers, and writing conventions, things I have yet to work on. For instance, between the initial statement and the conclusion of a paragraph there is often a great deal of elaboration that can be left out, and important conclusions or arguments may be introduced with linking words like “So,” or “because”. I hope.
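
A simplified Python sketch of the idea, working on raw sentences rather than on extracted facts as my program does. The word lists and function names are assumptions for illustration only.

```python
import re

# Hypothetical cue-word lists; real ones would need to be far more complete.
UNCERTAIN = {"maybe", "perhaps", "possibly", "might"}
NEGATIVE = {"not", "no", "never"}

def tokens(sentence):
    # Normalise curly apostrophes so "don’t" and "don't" are treated alike.
    return re.findall(r"[a-z']+", sentence.lower().replace("’", "'"))

def is_unimportant(sentence):
    words = set(tokens(sentence))
    return (sentence.strip().endswith("?")               # questioning
            or bool(words & UNCERTAIN)                   # uncertain
            or bool(words & NEGATIVE)                    # negative
            or any(w.endswith("n't") for w in words))    # contracted negatives

def has_linking_word(sentence):
    words = tokens(sentence)
    # "So" only counts at the start of a sentence; "because" counts anywhere.
    return bool(words) and (words[0] == "so" or "because" in words)

def filter_sentences(sentences):
    # Drop unimportant sentences; flag the rest that carry a linking word.
    return [(s, has_linking_word(s)) for s in sentences if not is_unimportant(s)]
```

Run on “Cats sleep most of the day.”, “Maybe they dream.”, “Do cats like milk?”, “Cats do not like water.” and “So cats are lazy because they sleep a lot.”, only the first and last sentences survive, and the last is flagged as a likely conclusion.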

I don’t categorise in terms of nouns though. A text could be about “cats” in general, or about the colour green, so I don’t find nouns a meaningful distinction. I prefer to distinguish the roles of words as subjects, objects, actions and locations, with subjects tending to be the most important elements.

 

 
  [ # 14 ]

Hey Don.

If you are still looking for a good summation API there are a few available here https://www.mashape.com/

Vince

 

 
  [ # 15 ]

There are some interesting APIs on there, but I find the site hard to navigate (did you mean numerical summation?)

I’ve decided to put language generation aside for a moment, reinvent the wheel, and am currently analysing some texts for hints to importance. What seems to be the case is that these hints are not so much found in word frequencies or stated facts, but mostly in writing style. Extracting sentences about the main topic(s) holds up for a large part if you include translations of “it/he/they”, but is too indiscriminate in that it also includes trivial statements about the main topic that break the “story arc”, as writers call it. Having solid story arcs is also important in language generation, by the way.

I’ll set my program up to just highlight particular sentences using HTML format, and see what that gives. To work!
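
Something along these lines, as a first Python sketch. The function name and the choice of the mark tag are assumptions; the test for importance is passed in as a function so it can be swapped out while experimenting.

```python
import html

def highlight_sentences(sentences, is_important):
    parts = []
    for sentence in sentences:
        # Escape first so characters like & or < in the text cannot break the page.
        escaped = html.escape(sentence)
        if is_important(sentence):
            parts.append(f"<mark>{escaped}</mark>")
        else:
            parts.append(escaped)
    return "<p>" + " ".join(parts) + "</p>"
```

For example, highlight_sentences(["Cats & dogs play.", "So cats win."], lambda s: s.startswith("So")) wraps only the second sentence in a mark tag, which makes it easy to eyeball in a browser what the program considers important.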

 
