

POS tagging input
 
 

I am considering further work on POS tagging to generate data from documents. I have not read anything about the document-reader aspect of CS yet, as I am more concerned with the POS tagger itself.

The documentation says that functions like partofspeech and decodepos can return 64-bit POS info for the word at a location. I assume this means a location in an input sentence, but since the arguments are either a role or a location, I am not sure how to pass in the sentence. It says to look in Dictionary.h for more on the bit info, but I have searched my CS directory recursively and cannot find this file.

I have used :prepare on input like “It is a mammal belonging to the horse family”. The POS section inaccurately tags “belonging” as a noun, but the concept section recognizes both the noun and verb aspects when it recognizes the canonical form. Will there be any way to get this double recognition when working with the POS parser?

I am posting the full prepare output here:

TokenControl: DO_SUBSTITUTE_SYSTEM DO_NUMBER_MERGE DO_PROPERNAME_MERGE DO_DATE_MERGE DO_SPELLCHECK DO_PARSE


Original User Input: it is a mammal belonging to the horse family
Tokenized into: it is a mammal belonging to the horse family
Actual used input: it is a mammal belonging to the horse family

Xref: 1:it   2:is o5   3:a >5   4:mammal >5   5:belonging   6:to   7:the >9   8:horse >9   9:family  
Fragments: 1:it   2:is   3:a   4:mammal   5:belonging   6:to   7:the   8:horse   9:family  
Tagged POS 9 words: it (MAINSUBJECT Pronoun_subject)  is/be (MAINVERB Verb_present_3ps)  a (Determiner)  mammal (Adjective_noun)  belonging/belong (MAINOBJECT Noun_singular)  to (Particle Preposition)  the/a (Determiner)  horse (Adjective_noun)  family (APPOSITIVE Noun_singular) 
  MainSentence: Subj: it   Verb: is   Obj: [ a mammal] belonging PRESENT


Concepts:

1: it raw=  +~pronoun(1)  +~pronoun_subject(1)  +~pronoun_bits(1)  +~kindergarten(1)  +~mainsubject(1)  +it(1)  +~it_words(1)  //
1: it canonical=  // 

2: is raw=  +~verb_present_3ps(2)  +~verb_bits(2)  +~verb(2)  +~kindergarten(2)  +~mainverb(2)  +is(2)  +~linkingverb(2)  +~auxverblist(2)
.  +~wordnetpropogate(2)  +~equals(2)  //
2: be canonical=  +be(2)  +~tobe(2)  +~be_verbs(2)  +~states_of_being(2)  +~static_verbs(2)  +~usefulfactverb(2)  // 

3: a raw=  +~determiner(3)  +~determiner_bits(3)  +~kindergarten(3)  +a(3)  +~determinerlist(3)  +~vowels(3)  +~letters(3)  //
3: a canonical=  // 

4: mammal raw=  +~adjective(4)  +~adjective_noun(4)  +~grade3_4(4)  +mammal(4)  +~animals_generic(4)  +~mammals(4)  +~beings(4)  +~tool(4)
.  +~animate_thing(4)  +~objects(4)  +~nounlist(4)  +~animals(4)  +~rideable(4)  +~functions(4)  +~eatable(4)  +~burnable(4)  +~animal_kingdoms(4)
.  +being~1(4)  +~nounroot(4)  //
4: mammal canonical=  // 

5: belonging raw=  +~noun_abstract(5)  +~noun(5)  +~noun_singular(5)  +~singular(5)  +~normal_noun_bits(5)  +~noun_bits(5)
.  +~kindergarten(5)  +~mainobject(5)  +belonging(5)  +~feeling_attached(5)  +~feeling_words(5)  +~emotions(5)  +~sensations(5)
.  +~attributes(5)  +~nounlist(5)  +~goodness(5)  +~nounroot(5)  //
5: belong canonical=  +belong(5)  +~own(5)  +~possess(5)  +~possession_verbs(5)  +~social_verbs(5)  +~animate_verbs(5)  +~verbs(5)
.  +~active_verbs(5)  +~use_intentionverbs(5)  +~static_verbs(5)  +~do_with_titles(5)  // 

6: to raw=  +~lowercase_title(6)  +~particle(6)  +~preposition(6)  +~kindergarten(6)  +~locationword(6)  +~locatedentity(6)
.  +~there(6)  +to(6)  +~directionpreposition(6)  +~spacepreposition(6)  +~prepositionroot(6)  +~focus(6)  +~directions(6)  //
6: to canonical=  // 

7: the raw=  +~lowercase_title(7)  +~determiner(7)  +~determiner_bits(7)  +~kindergarten(7)  +the(7)  +~determinerlist(7)  //
7: a canonical=  +a(7)  +~vowels(7)  +~letters(7)  // 

8: horse raw=  +~adjective(8)  +~adjective_noun(8)  +~kindergarten(8)  +horse(8)  +~sizes(8)  +~soundmaker(8)  +~vehicles_land(8)
.  +~vehicle(8)  +~tool(8)  +~rideable(8)  +~functions(8)  +~enterable(8)  +~auto_dealer(8)  +~store_type(8)  +~store(8)  +~attributes(8)
.  +~nounlist(8)  +~artifacts(8)  +~objects(8)  +~human_data(8)  +~herbivore(8)  +~hobbies_animals(8)  +~hobby(8)
.  +~entertainment_stuff(8)  +~pet_animals(8)  +~pet_store(8)  +~animals(8)  +~eatable(8)  +~burnable(8)  +~beings(8)  +~animate_thing(8)
.  +~animals_generic(8)  +~animal_kingdoms(8)  +~mammals(8)  +being~1(8)  +~nounroot(8)  //
8: horse canonical=  // 

9: family raw=  +~noun_abstract(9)  +~noun(9)  +~noun_singular(9)  +~singular(9)  +~normal_noun_bits(9)  +~noun_bits(9)
.  +~kindergarten(9)  +~appositive(9)  +~sentenceend(9)  +family(9)  +~related_list(9)  +~societal_data(9)  +~human_data(9)
.  +~stronggoodness(9)  +~goodness(9)  +~nounroot(9)  +~life_taxonomy(9)  +being~1(9)  //
9: family canonical=  // 

  sequences=
+it_be(1-2)


+belong_to(5-6)

 

+to(2)  +~directionpreposition(2)  +~spacepreposition(2)  +~prepositionroot(2)  +~focus(2)
.  +~directions(2)

After parse TokenFlags: PRESENT USERINPUT

 

 
  [ # 1 ]

The file you want is dictionarySystem.h, which describes the values for D->properties bits (partofspeech)
and the pos-tagger roles and states in roles[].

You do not pass in the sentence; you pass in the request type and an integer location (DecodePos), or a location reference (partofspeech), which can be an integer location or a match variable.
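
For example, a minimal sketch of that call pattern (the rule and output wording are hypothetical; only ^partofspeech and ^decodepos come from the documentation):

#! it is a mammal belonging to the horse family
u: POSINFO ( _belonging )
   $$bits = ^partofspeech(_0)          # location reference here is the match variable _0
   Raw bits: $$bits   Decoded: ^decodepos(POS 5)   # request type plus integer location; the POS literal spelling is an assumption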

The system parsed the sentence incorrectly, but it was a plausible parse. I will have to make that parse less plausible.

 

 
  [ # 2 ]

I am also interested in the POS tagging, which I understand is a work in progress. 

I think the tagging works by finding all of the groups the words belong to but then isn’t there a process of pruning and applying rules to narrow the tagged POS to the most plausible?  Are the rules “hard-coded” or can we play with some files and test different applications of the rules?

I have a case:
:prepare A sparrow has brown and grey feathers.
...
A bad parse:
Tagged POS 7 words: a (Determiner)  sparrow (MAINSUBJECT Noun_singular)  has/have (MAINVERB Verb_present_3ps)  brown (MAINOBJECT Noun_singular)  and (CONJUNCT_NOUN Conjunction_coordinate)  gray (<Clause MAINOBJECT Noun_adjective Clause>)  feathers/feather (MAINOBJECT Noun_plural)
...
4: brown raw=  +~noun_abstract(4)  +~noun(4)  +~noun_singular(4)  +~singular(4)  +~normal_noun_bits(4)  +~noun_bits(4)
.  +~kindergarten(4)  +~mainobject(4)  +brown(4)  +~basiccolors(4)  +~mybrown(4)  +~palettecolor(4)  +T~mixcolors (4)  +~cooking_verbs(4)
.  +~special_activities_verbs(4)  +~animate_verbs(4)  +~verbs(4)  +~active_verbs(4)  +~race(4)  +~racial(4)  +~brown(4)  +~colors(4)  +~color_adjectives(4)
.  +~esthetic_adjectives(4)  +~physical_properties_adjectives(4)  +~adjectives(4)  +~nounroot(4)  //
4: brown canonical=  //  +Brown(4)  +Brown_University(4)  +~university(4)  +~college(4)
...
It appears that it understands that brown could be in ~color_adjectives(4) or ~adjectives(4), but it decided it was a noun.
...
For gray:
6: gray raw=  +~noun_abstract(6)  +~noun(6)  +~adjective(6)  +~nounphrasewords(6)  +~prepphrasewords(6)  +~adjective_normal(6)
.  +~noun_adjective(6)  +~noun_bits(6)  +~kindergarten(6)  +~mainobject(6)  +~clause(6)  +gray(6)  +~mygrey(6)  +~palettecolor(6)  +T~mixcolors (6)  +~gray(6)
.  +~colors(6)  +~color_adjectives(6)  +~esthetic_adjectives(6)  +~physical_properties_adjectives(6)  +~adjectives(6)
.  +~mammals(6)  +~beings(6)  +~tool(6)  +~animate_thing(6)  +~objects(6)  +~nounlist(6)  +~animals(6)  +~rideable(6)  +~functions(6)  +~eatable(6)
.  +~burnable(6)  +~animals_generic(6)  +~animal_kingdoms(6)  +~weakbadness(6)  +~badness(6)  +~nounroot(6)  //
6: gray canonical=  //
...
gray is included in ~adjective_normal while brown is not. Also, gray is correctly marked as belonging to my custom concepts ~mygrey and ~prepphrasewords (which includes ~adjectives), while brown is not.

?????

:prepare A sparrow has dull or shiny feathers.
...
Tagged POS 7 words: a (Determiner)  sparrow (MAINSUBJECT Noun_singular)  has/have (MAINVERB Verb_present_3ps)  shiny (OBJECT_COMPLEMENT Adjective_normal)  or (CONJUNCT_ADJECTIVE Conjunction_coordinate)  dull (MAININDIRECTOBJECT Adjective_normal)  feathers/feather (MAINOBJECT Noun_plural)
...
Both shiny and dull are flagged as ~adjective_normal. 

:prepare isn’t POS tagging fun?
...
Tagged POS 5 words: is/be (Aux_be_present_3ps)  not (Adverb)  POS (MAINSUBJECT Noun_proper_singular)  tagging/tag (MAINVERB Verb_present_participle)  fun (MAINOBJECT Noun_singular)
...
Now I’m not even sure I know enough English grammar to be doing this… is “fun” an adjective? It very well might be correctly tagged as a noun in this case.

So basically the issue seems to be with “brown”: if the :prepare output does not show “brown” as belonging to my custom concept ~prepphrasewords, then I can’t successfully use ~prepphrasewords to match brown in a pattern, right?
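
For reference, a minimal sketch of that dependency (the concept and rule are hypothetical; ~adjectives and ~preposition are taken from the :prepare output above). The rule can only capture “brown” if :prepare actually marks word 4 with ~prepphrasewords:

concept: ~prepphrasewords ( ~adjectives ~preposition )

#! A sparrow has brown and grey feathers
u: COLORWORD ( _~prepphrasewords )  Marked as ~prepphrasewords: '_0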

 

 
  [ # 3 ]

Thanks Bruce. I will check out that file. I am currently using the following patterns to respond to strings with a basic sentence structure. The idea is to slowly expand them to support complex sentence structure. I was thinking of using the match variables to compare with concept sets and thus analyze the input, so I can deal with sentences from, say, a wiki document. Do you think this could be a viable approach?

#! A dog has been fighting the cats
u: ({~determinerlist} _~noun_bits {~aux_verb} {~aux_verb} _[~verb_bits ~noun_bits] {~determinerlist} _{[~noun_bits ~pronoun_bits ~adjective ~verb_bits]}) _0 subject _1 verb _2 object

#! They have been fighting me
u: (_~pronoun_bits {~aux_verb} {~aux_verb} _[~verb_bits ~noun_bits] {~determinerlist} _[~noun_bits ~pronoun_bits ~adjective ~verb_bits]) _0 pronoun _1 verb _2 object

The reason for including verbs and nouns together is that verbs like “fights” can be incorrectly tagged as a noun instead of a verb, but I know that if it is in that match position it is a verb. I realize the pattern does not currently support possessives or adverbs, but one step at a time. I also realize the second pattern could be merged with the first; I am just currently separating them to avoid accepting determiners before the pronoun. The current patterns are also too general; I’ll need to separate the noun, verb, adjective, etc. forms in the third match.
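
Roughly, the data-generation half would look like this (a sketch only, untested; I am assuming ^createfact is the right way to store the captured triple for later ^query calls):

#! A dog has been fighting the cats
u: STORE_SVO ({~determinerlist} _~noun_bits {~aux_verb} {~aux_verb} _[~verb_bits ~noun_bits] {~determinerlist} _{[~noun_bits ~pronoun_bits ~adjective ~verb_bits]})
   ^createfact(_0 _1 _2)   # subject / verb / object captures stored as one fact
   Stored: _0 _1 _2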

 

 
  [ # 4 ]

I think it could work if your goal is to attempt to identify your own MainSubject, MainVerb and MainObject and if you think that sentence structure is important in helping to identify them.

I am doing something similar in identifying my own prepositional phrases, since some of the sentences I am trying to parse have prepositional phrases that are not recognized by ChatScript. Sometimes the Wikipedia sentences are long and laborious. ChatScript has some limitations.

“Pattern matching: a sequence is limited to 5 words in a row and will do both original and canonical forms.” - Bruce, in the Basic User Manual.pdf

“ChatScript has a complexity limit, and will not accept sentences near 255 words. Also, sentences containing more than 7 of some particular concept will not mark after the first 7. So a sentence with 8 nouns will not have a ~noun after the 7th. If the nouns were a mixture of singular and plural, then it will represent them, up to the 7 limit.” - Bruce, in POSParser.pdf

I am attempting to strip out the prepositional phrases and then send the sentence back to ChatScript POS tagging for another try. Many times, after a failed parse by ChatScript, removing the prepositional phrases results in a successful parse, and therefore I can use the MainSubject, MainVerb, and MainObject identified by the ChatScript POS tagger.
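The resubmission step itself is simple; a sketch (the $$stripped variable is hypothetical and would be built elsewhere by deleting the prepositional phrases, and I am assuming ^input is the right way to queue the shortened sentence):

u: RETRY_PARSE ( ~prepositionroot )   # fires when the sentence contains a preposition
   # $$stripped = the sentence with its prepositional phrases removed (built elsewhere)
   ^input($$stripped)   # queue it so ChatScript pos-tags and parses it as the next input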

I have posted the start of a simple POS tagger that loops through each word in a sentence and determines its membership in basic POS concepts. You could possibly do something similar, where you identify each pronoun, noun, and verb and determine your own subject, verb, and object without having to match all of the possible combinations of sentence structures.

It is possible to turn off the POS tagging in ChatScript:
“You can suppress the parser by setting $token to NOT have the request to parse | #DO_PARSE. If the parser is not run, then the pos-tags of all words will still be partially performed, reducing the set of word interpretations as much as it can without risking losing any correct pos-tag. But you will not get any parse information.” - POSParser.pdf

I am not sure whether this would keep the POS tags you are looking for, i.e. the ones that are later removed when full parsing is enabled and the parser chooses the most likely POS tag.
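
For reference, clearing the parse bit would look roughly like this (a sketch: the flag names are taken from the TokenControl line in the :prepare output above, and whether the control variable is spelled $token or $cs_token depends on the CS version):

u: NOPARSE ( disable parsing )
   # rebuild the token control variable without #DO_PARSE so only pos restriction runs
   $token = #DO_SUBSTITUTE_SYSTEM
   $token += #DO_NUMBER_MERGE
   $token += #DO_PROPERNAME_MERGE
   $token += #DO_DATE_MERGE
   $token += #DO_SPELLCHECK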


 

 
  [ # 5 ]

Coordinate conjunctions are weak in CS parsing.

There are 2 phases that can intermix. The first is rules that try to remove obviously illegal pos values. These rules are in
LIVEDATA/ENGLISH. I have a standard test, :pennmatch raw, which takes a bit over 90K tokens in 3800 pennbank sentences and reports those that have been incorrectly pruned (REGRESS/PENNTAGS/penn.txt). Its goal is to maximally prune while NEVER removing a valid pos-tag. It currently prunes badly 2% of the time and prunes 47K ambiguous words down to 15.7K.

The next phase is hard-coded parsing. It tries to find a sentence structure that works, using a garden-path algorithm. Along the way it can assign pos, rerun rules, assign roles, and sometimes override prior pos decisions when it finds a conflict. It is run as :pennmatch and is currently 94% right in pos-tagging.

:pos this is a sentence
displays what it is doing as it pos/parses.

 

 
  [ # 6 ]

“Will there be any way to get this double recognition when working with the pos parser?”

It depends. If the concept does not require a part of speech, then it won’t care how the word pos-tags. If the concept does constrain by part of speech, then the pos-tagger’s job is to constrain, and if the word doesn’t match the pos type, then the concept won’t trigger.
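
As a concrete illustration (a hypothetical rule; ~own comes from the canonical-form concept output earlier in the thread): whether this fires on “belonging” depends on whether ~own constrains by part of speech, as described above; if it does, the noun tagging will prevent the match.

#! It is a mammal belonging to the horse family
u: OWNERSHIP ( _~own )  Ownership sense found on: '_0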

 

 