
How to implement a ChatScript bot that talks Portuguese
 
 

Following the example of the thread “How to implement a Chatscript bot that talk Spanish”, I created this thread to share my questions about how to adapt ChatScript to build a bot for a foreign language, specifically Portuguese.

So, to start this thread, here are my questions.

For a foreign language (any language other than English) you have to make some adaptations in ChatScript.

The dictionary that comes with ChatScript is based on the Princeton WordNet and exists only in English.

The dictionary provides many features that are very useful in ChatScript, such as spell checking, ontologies, parts of speech, etc. So, if you don't want to lose these features, you need to make some adaptations in ChatScript.

The first option is to build your own dictionary modeled on the ChatScript dictionary. That involves several things:

      Erase the contents of DICT/ENGLISH, because they will not be useful to you (or translate them into your language; for Portuguese that is infeasible because English is very different from Portuguese).

      File names must have the format x.txt (where x is the name of the file).

      If your language has diacritics, you must save the files in UTF-8.

      In $cs_token I set the constant #DO_SUBSTITUTE_SYSTEM.

      The bit flags (parts of speech, for example VERB_PRESENT, VERB_PAST, ...) are different in other languages. How can I rename the bit flags?

Second, if I don't want to build a dictionary, what other options do I have?

      I thought that making a canonical file listing all the variants of each word together with their canonical forms would be an acceptable (or at least a limited) solution, but it just does not work. There must be other steps that I don't know about. What else do you have to do to make your canonical.txt work with a foreign language?


I hope to get help from the members of this forum, and to help other people too.

Thanks,
Oberdan Alves

 

 
  [ # 1 ]

The dict is the way to go, at least a basic one. It is much more useful than the other alternatives. I recommend splitting it into two files: verbs (with their most useful conjugations) in one, and every other kind of word in the other. Or put both in the same file, but put the verbs first, because they will be the larger part.

There are several bits in a dict definition, but basically you must discard all the POSTDEFAULT bits
and all the “meanings” part.

E.g. nouns:
dinero ( NOUN_ABSTRACT NOUN NOUN_SINGULAR COMMON4 COMMON2 NOUN_NODETERMINER KINDERGARTEN )

E.g. verbs must always start like this:
ser ( VERB VERB_PRESENT .........................) Here you can put several verb bits, which you need to adapt from English to Portuguese. How is up to you to decide: take the English verb bits and reuse them to define tenses in Portuguese.

E.g. in English CS forms the future tense like this: “will be”. That is why there is no such thing as a VERB_FUTURE verb bit, and you must use another verb bit to indicate the future. THIS APPLIES IF THE FUTURE TENSE in Portuguese is a verb conjugation without any other modal word, which is the case in Spanish.

Keep things in order. It is a big task, but you can start with a 2000 word DICT, let's say 20 verbs with their conjugations and 500 other words.
It will help you a lot if you know regex (regular expressions) and some tool for creating macros; that is up to you and your programming skills.
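A small script can take over much of the repetitive work mentioned above. The following is only a sketch in Python, assuming entries shaped like the `word ( FLAG FLAG ... )` examples in this post; the helper names `format_entry` and `write_dict` are hypothetical, and a real DICT file may need more fields than shown here.

```python
from pathlib import Path


def format_entry(word, flags):
    """Return one line shaped like the examples above: 'word ( FLAG FLAG )'."""
    return f"{word} ( {' '.join(flags)} )"


def write_dict(path, entries):
    """Write verbs first, then all other words, one entry per line, in UTF-8.

    `entries` is a list of (word, [flag names]) pairs. This layout is an
    assumption based on the advice in this thread, not the official format.
    """
    verbs = [e for e in entries if "VERB" in e[1]]
    others = [e for e in entries if "VERB" not in e[1]]
    lines = [format_entry(word, flags) for word, flags in verbs + others]
    # encoding="utf-8" keeps diacritics such as ç and ã intact on disk.
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Illustrative entries only; the flag choices are up to the dictionary author.
entries = [
    ("dinheiro", ["NOUN", "NOUN_SINGULAR"]),
    ("ser", ["VERB", "VERB_INFINITIVE"]),
    ("coração", ["NOUN", "NOUN_SINGULAR"]),
]
```

Calling `write_dict("x.txt", entries)` would put `ser` first and the two nouns after it.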

These are most of the verb bits; use them wisely:


VERB_PRESENT
VERB_PRESENT_3PS
VERB_PAST
VERB_INFINITIVE
VERB_PRESENT_PARTICIPLE
VERB_PAST_PARTICIPLE

#define VERB_NOOBJECT                           0x0000000000008000ULL
#define VERB_INDIRECTOBJECT                     0x0000000000010000ULL
#define VERB_DIRECTOBJECT                       0x0000000000020000ULL
#define VERB_TAKES_GERUND                       0x0000000000040000ULL
#define VERB_TAKES_ADJECTIVE                    0x0000000000080000ULL
#define VERB_TAKES_INDIRECT_THEN_TOINFINITIVE   0x0000000000100000ULL // proto 24
#define VERB_TAKES_INDIRECT_THEN_VERBINFINITIVE 0x0000000000200000ULL // proto 25
#define VERB_TAKES_TOINFINITIVE                 0x0000000000400000ULL // proto 28
#define VERB_TAKES_VERBINFINITIVE               0x0000000000800000ULL // proto 32, 35

#define PHRASAL_VERB                            0x0000000000004000ULL // accepts particles - when lacking INSEPARABLE and MUST_SEPARABLE, can do either
#define MUST_BE_SEPARATE_PHRASAL_VERB           0x0000000000001000ULL // phrasal MUST separate - “take my mother into” but not “take into my mother”
#define INSEPARABLE_PHRASAL_VERB                0x0000000000000800ULL // cannot be split apart ever
#define SEPARABLE_PHRASAL_VERB                  0x0000000000002000ULL // can be separated
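To make the idea of "bits" concrete: each value in the list above has exactly one bit set, so several properties can be combined in one number and tested independently. The sketch below is in Python rather than C, copies four of the values from the list, and uses a hypothetical helper `has_flag`; it only illustrates the arithmetic and is not ChatScript code.

```python
# Values copied from the list above (without the C "ULL" suffix).
VERB_NOOBJECT = 0x0000000000008000
VERB_INDIRECTOBJECT = 0x0000000000010000
VERB_DIRECTOBJECT = 0x0000000000020000
PHRASAL_VERB = 0x0000000000004000

ALL_FLAGS = [VERB_NOOBJECT, VERB_INDIRECTOBJECT, VERB_DIRECTOBJECT, PHRASAL_VERB]


def is_single_bit(value):
    """True when exactly one bit is set, which is what makes a value a flag."""
    return bin(value).count("1") == 1


def has_flag(properties, flag):
    """True when `flag` is set inside the combined `properties` word."""
    return (properties & flag) != 0


# Combine two properties for one hypothetical verb entry with bitwise OR.
props = VERB_NOOBJECT | VERB_DIRECTOBJECT
```

Because every flag occupies its own bit, a 64-bit word has room for at most 64 distinct flags, which is why reusing the English bits for new meanings runs out of room quickly.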


I'm a bit concerned about describing the way I did this as if it were the only way to do it. Find your own way and ask Bruce for further help. It is a huge task, really. Good luck.

PS: AFAIK POS tagging can't be applied to other languages, but you can work around it using “forward/backward” matching. I asked Bruce for a Spanish spell checker and wrote an explanation file with the basic rules, but most of the problems came from the accented characters and ñ. Such characters are even more common in Portuguese, and there are more of them. That is a huge problem and should be your main aim: build a basic Portuguese dict and test that your Portuguese spell checker can correct most of the words. You will have problems with the foreign characters (all the accented vowels, ñ, etc.); Bruce could help you with that.

 

 
  [ # 2 ]

Thanks Eduardo Bedoya for your explanation!

Some more questions:

Did you use the bit flag names as they are, or did you “rename” (actually alias) them as Bruce explained below?

I aliased VERB_PRESENT to VERBO_PRESENTE, but when I use the alias in the dict (e.g. andar (VERB VERBO_PRESENTE)), ChatScript shows the following message: “Verb andar lacks tenses”.

How did you alias the bit flags?

Bruce and you said that POS tagging cannot be used for a foreign language. So, if I create a concept like “concept: ~verbo_presente (VERBO_PRESENTE)”, I cannot use it in my patterns, like “u: ( I ~verbo_presente hoje) esta certo”. Is that right? (Assuming that I have aliased the bit flag.)

Bruce Wilcox - Nov 15, 2015:

Those bits are also available. But if you merely change names in dictionarySystem.h, the files will not compile, because the names are used in the English language parser code (which you are not using). That's why I recommend merely aliasing things. You COULD add #defines at the end of the file like this:


#define VERB_PRETERITE   INSEPARABLE_PHRASAL_VERB
which will not disturb existing code.  In a topic file you need to do:
concept: ~verb_preterite (VERB_PRETERITE)
which enables you to do things analogous to matching:
u: ( I * ~verb_present * home)
presuming you have coded via a table that your word has the VERB_PRETERITE property, e.g.
table: ^spanishverbs(^base ^preteriteform ...)
  ^addproperty(^preteriteform VERB_PRETERITE)
DATA:
somespanishword somespanishpreteriteformofthatword

 

 
  [ # 3 ]

Hi,

No, I tried to keep it as simple as possible, because fancier stuff may work well at the beginning, but then, when testing more complex patterns (like reverse matching, or checking whether a match is contained in a certain concept), it starts to show problems. That was my own experience: the engine had some flaws here and there, things that worked well with English characters but not with foreign ones, despite the UTF-8. So I decided to work around them, or to do things as simply as possible, and not bother Bruce with every SPANISH-ONLY bug. He helped me a lot with the Spanish spell checker, which still has some flaws with more complex patterns, but as I said, I worked around them.

So I used almost all of the verb bits, but without aliasing. I never asked about renaming the bits; I don't know how to rename them or how the engine would then recognize the new bit names. It seems very uncertain to me, and Bruce should answer that. I just used the bits with their original names and remembered the new meaning I had assigned to each.

I did not use tables and did not create concepts with flags; I only made the dict. I do use three kinds of flags to constrain each verb match. This is how Bruce suggested doing it. I suggest you ask again and try to figure out a way to do it that fits Portuguese verb conjugations:

“A concept cannot be triggered by a pair of words directly.
But you CAN create a topic that will do that. It can be run early in your control script or be run as $cs_prepass which happens before $cs_controlmain
u: ( _~verb_infinitive _0?~verb_noobject ) ^mark(~dualconcept1 _0)
and other such rules”

Perhaps Bruce could shed more light on this matter, which is also very important.

E.g. it's very likely that you will need not only a TENSE flag but also a PERSON flag, and perhaps another flag, and that you will select verbs that match all three. Note that you have to be able to combine those flagged verbs in pattern rules afterwards. I suggest you try to find the easiest yet most powerful way that fits Portuguese.

Good luck.

 

 
  [ # 4 ]

Thank you Eduardo Bedoya!

I didn't quite understand this part of your explanation: “I do use three kinds of flags to constrain each verb match, this is how Bruce suggested to do it”.

Could you explain a little bit more about it?

 

 
  [ # 5 ]

I hope Bruce can shed more light on this. I did not take that route, so I don't know how well it works in practice.

I asked Bruce this:

So, using the #kind and #tense, I would be able to group all tenses of any verb.
I would like to ask you to make it possible for CS to create ~concepts that select words having a pair of #definitions (#kind and #tense).

I ended up doing it another way, but the idea was:

You have a concept ~verb_infinitive [uno dos tres]
and another concept ~verb_noobject [dos]

So the rule below must always fire at the beginning:
u: ( _~verb_infinitive _0?~verb_noobject ) ^mark(~dualconcept1 _0)

~dualconcept1 = dos
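The effect of that rule is an intersection: a word gets the dual mark only if it belongs to both concepts. The sketch below models this in Python with plain sets, using the same toy words; `mark_dual` is a hypothetical name, and this is an illustration of the logic, not ChatScript syntax.

```python
# Toy concept sets, taken from the example above.
verb_infinitive = {"uno", "dos", "tres"}
verb_noobject = {"dos"}


def mark_dual(words, first, second):
    """Return the input words that are members of both concept sets."""
    return {w for w in words if w in first and w in second}


# Every word that would receive the ~dualconcept1 mark.
dualconcept1 = mark_dual(verb_infinitive, verb_infinitive, verb_noobject)
```

With TENSE and PERSON as the two sets, the same intersection would pick out, for example, only the verb forms that are both present tense and first person.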

Do you see what this could do with verb conjugations? The flags would be TENSE and PERSON.

I repeat, I ended up doing it a different way, but I suggest you ask Bruce more about this idea.

It could be better, I don't know, but now you know one way to flag verbs, that's for sure.
Hope it helped.

 

 
  [ # 6 ]

So, I think that creating a TABLE and using CONCEPTS is more suitable, because it is clear that I don't have enough bit flags to cover both TENSE and PERSON in verbs.

I think a good solution would be something like the one below.

What do you think, Eduardo Bedoya?

concept: ~verbo_presente_1ps ( )
concept: ~verbo_presente_2ps ( )
concept: ~verbo_presente_3ps ( )
concept: ~verbo_presente_1pp ( )
concept: ~verbo_presente_2pp ( )
concept: ~verbo_presente_3pp ( )
concept: ~verbo_presente ( ~verb_present )

table: ^addverbos(^canonical ^pessoa1ps ^pessoa2ps ^pessoa3ps ^pessoa1pp ^pessoa2pp ^pessoa3pp)
if(^canonical!=*){
  ^addproperty(^canonical VERB VERB_INFINITIVE)
}
if(^pessoa1ps!=*){
  ^addproperty(^pessoa1ps VERB VERB_PRESENT)
  ^canon(^pessoa1ps ^canonical)
  ^createfact(^pessoa1ps member ~verbo_presente_1ps)
}
if(^pessoa2ps!=*){
  ^addproperty(^pessoa2ps VERB VERB_PRESENT)
  ^canon(^pessoa2ps ^canonical)
  ^createfact(^pessoa2ps member ~verbo_presente_2ps)
}
if(^pessoa3ps!=*){
  ^addproperty(^pessoa3ps VERB VERB_PRESENT)
  ^canon(^pessoa3ps ^canonical)
  ^createfact(^pessoa3ps member ~verbo_presente_3ps)
}
if(^pessoa1pp!=*){
  ^addproperty(^pessoa1pp VERB VERB_PRESENT)
  ^canon(^pessoa1pp ^canonical)
  ^createfact(^pessoa1pp member ~verbo_presente_1pp)
}
if(^pessoa2pp!=*){
  ^addproperty(^pessoa2pp VERB VERB_PRESENT)
  ^canon(^pessoa2pp ^canonical)
  ^createfact(^pessoa2pp member ~verbo_presente_2pp)
}
if(^pessoa3pp!=*){
  ^addproperty(^pessoa3pp VERB VERB_PRESENT)
  ^canon(^pessoa3pp ^canonical)
  ^createfact(^pessoa3pp member ~verbo_presente_3pp)
}
DATA:
ser   sou es e somos sois sao
estar   estou estas esta estamos estais estao


caneta: > :prepare sou
TokenControl: DO_SUBSTITUTE_SYSTEM DO_NUMBER_MERGE DO_PROPERNAME_MERGE DO_DATE_MERGE DO_INTERJECTION_SPLITTING


Original User Input: sou
Tokenized into: sou
Actual used input: sou


Concepts:

1: sou (raw):  +~verb_present +~verbo_presente +~verb_bits +~verb +sou +~verbo_presente_1ps //
1: ser (canonical):  +ser //

Sequences:

After parse TokenFlags: USERINPUT
JUNIOR:
caneta: >
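Typing the DATA rows by hand for many verbs is error-prone, so they could be generated. The following Python sketch assumes the column order of the table above (canonical form, then the six persons) and writes `*` for a missing form, since that is the value the table code skips. The helper names `table_row` and `index_forms` are hypothetical.

```python
PERSONS = ["1ps", "2ps", "3ps", "1pp", "2pp", "3pp"]


def table_row(canonical, forms):
    """Build one DATA row: the canonical form followed by six present forms.

    `forms` maps a person label to the conjugated form; a missing person
    becomes '*', which the table code above treats as "no form".
    """
    cells = [forms.get(person, "*") for person in PERSONS]
    return " ".join([canonical] + cells)


def index_forms(verbs):
    """Map each conjugated form to (canonical form, concept it should join)."""
    index = {}
    for canonical, forms in verbs.items():
        for person, form in forms.items():
            index[form] = (canonical, f"~verbo_presente_{person}")
    return index


# The two verbs from the DATA section above (written without accents there).
verbs = {
    "ser": dict(zip(PERSONS, "sou es e somos sois sao".split())),
    "estar": dict(zip(PERSONS, "estou estas esta estamos estais estao".split())),
}
```

`index_forms(verbs)["sou"]` gives the same pairing that the `:prepare sou` output above reports: canonical form `ser` and membership in `~verbo_presente_1ps`.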

 

 
  [ # 7 ]

Even though you are making the conjugations for Portuguese, I suggest that you post your ideas here in English; otherwise neither I nor anyone else will quite understand what they mean.

 

 
  [ # 8 ]

Eduardo Bedoya, you are right. I'll try to always include the English translation.

But, as I said, we don't have enough bit flags to cover all the verb tenses in our languages (Spanish, Portuguese), right? So I created a concept for each TENSE and PERSON.

What did you do to deal with this problem?

In Portuguese we have about thirteen verb tenses and six personal pronouns for each tense (similar to Spanish).

Thanks!

 

 
  [ # 9 ]

Yes, it is almost the same in Spanish. But group the tenses that are present (this could include the infinitive), those that are preterite perfect, those that are preterite, the conditional and the future; that gives about 7 or 8 tenses. I realized there is no real use in being so specific about tenses, because when writing pattern rules we almost always need the verb matches to be broader rather than narrower. (This is the best advice I can give you. It took me a long time to get the grasp of the basic and most of the advanced pattern rules, and only then could I really see how the tenses would actually be used in pattern rules.) I did it that way: keep it simple and leave out as much accentuation as you can. Before listing the verb tenses, test the spell checker with every foreign character and with combinations of them; then you will see the problems. Do that first, it is very important.
PS: I did make 6 persons, and a third bit group as well.
I have to say I never used tables (they are a little obscure to me), so I can't give advice on that. Good luck.
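The advice to merge tenses can be written down as a simple mapping from fine-grained tenses to broader groups. The Python sketch below is illustrative only: the tense labels and the particular grouping are assumptions made for the example, not the grouping actually used for the Spanish bot, and `concept_name` and `concept_count` are hypothetical helpers.

```python
PERSONS = ["1ps", "2ps", "3ps", "1pp", "2pp", "3pp"]

# Hypothetical fine-grained tense labels mapped to broader groups.
BROAD_GROUP = {
    "infinitive": "present",
    "present": "present",
    "preterite_perfect": "preterite_perfect",
    "preterite_imperfect": "preterite",
    "preterite_pluperfect": "preterite",
    "conditional": "conditional",
    "future": "future",
}


def concept_name(tense, person):
    """Name of the broad concept a form with this tense and person joins."""
    return f"~verb_{BROAD_GROUP[tense]}_{person}"


def concept_count(groups=BROAD_GROUP, persons=PERSONS):
    """Number of distinct concepts needed after merging tenses."""
    return len(set(groups.values())) * len(persons)
```

With thirteen tenses and six persons, one concept per combination would mean 13 × 6 = 78 concepts; the five groups in this sketch need 5 × 6 = 30.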

 

 
  [ # 10 ]

Hey guys.  CS 6.7, just released, supports a POS tagger that can handle Spanish and Portuguese. This allows ChatScript to have both the POS tags for the words (you have to use their tags) and the lemma form of each word, so your patterns can be written the typical way, with lemmas, and still match other forms of the word.

I have not tried to find foreign spell checkers, but in theory one could also be wired in via the provided mechanism. You still have to write your own translated concept sets, but I'm not sure how much need you have for a custom dictionary given this capability.
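To show what "written with lemmas and matching other forms" means, here is a toy model in Python. The lookup table `LEMMA` stands in for whatever an external tagger would return, and `lemmatize` and `matches` are hypothetical helpers; none of this is ChatScript's actual interface to the tagger.

```python
# Stand-in for a tagger's output: conjugated form -> lemma (canonical form).
LEMMA = {
    "sou": "ser",
    "somos": "ser",
    "estou": "estar",
    "estamos": "estar",
}


def lemmatize(tokens):
    """Replace each token by its lemma when one is known."""
    return [LEMMA.get(token, token) for token in tokens]


def matches(pattern_lemmas, sentence):
    """True when the pattern lemmas occur in order in the sentence's lemmas.

    Other words may sit in between, similar to a wildcard gap in a pattern.
    """
    lemmas = iter(lemmatize(sentence.lower().split()))
    # `p in lemmas` consumes the iterator, so order is enforced.
    return all(p in lemmas for p in pattern_lemmas)
```

A pattern written once with the lemma `ser` then matches both "eu sou feliz" and "nós somos amigos", without listing every conjugated form in the pattern.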

 

 
  [ # 11 ]

Hi Bruce, do you really mean a part-of-speech tagger? Do you mean CS will differentiate between a word used as an adverb and as an adjective in Spanish?
Bruce, could you please explain a little more about how this would work? What do you mean by:
“This would allow ChatScript to have both postags corresponding to the words (you have to use their tags) and to get the lemma form of a word so your patterns can do the typical of being written with lemmas and matching other forms of the word”
Do you mean that concepts can now be narrowed by a pair of def_bits, like VERB_PRESENT and VERB_CONJUGATE1?
What do you mean by “lemmas”? Do you mean the canonical form?
I have another question. AFAIK spell checking only works by comparing an input word with the most similar existing word in the DICT; please correct me if I'm wrong.

Thanks in advance, Bruce.
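The "most similar existing word" idea in the question above can be sketched with Python's standard `difflib` module. This is only an illustration of closest-match lookup against a word list; it is not how ChatScript's spell checker is implemented, and the word list, the `correct` helper and the cutoff value are assumptions.

```python
import difflib


def correct(word, dictionary, cutoff=0.75):
    """Return the dictionary word most similar to `word`.

    Known words are returned unchanged. If nothing is similar enough
    (similarity ratio below `cutoff`), the input is returned as-is.
    """
    if word in dictionary:
        return word
    candidates = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return candidates[0] if candidates else word


# A tiny stand-in for the DICT word list.
dictionary = ["dinheiro", "coração", "estamos"]
```

For example, `correct("dinhero", dictionary)` picks `dinheiro`, while a string with no close neighbour is left alone. The quality of such a check depends entirely on how complete the word list is.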

 

 
  [ # 12 ]

POS tagging labels a word as an adverb or an adjective in Spanish.  And the Spanish POS tagger returns not just the tag but also the canonical form, which is what CS usually uses in pattern matching.

Topic files are entirely your job to write in Spanish. Obviously the English ones (other than the control script) are meaningless.
DICT files can also be stripped back to nothing so no useless concepts get marked. Spell checking by CS is difficult for foreign languages, and foreign spell checkers would make more sense.

 

 

 
  [ # 13 ]

I erased all the “postdefault:” parts of the Spanish DICT. Are they useless with this new version?
In my Spanish DICT, CS 6.5a also had problems detecting my ~pronouns. These are all the pronoun def_bits I found (are these def_bits also tags? Please correct me if I'm wrong):

PRONOUN_SUBJECT
PRONOUN_OBJECT   << this wasn't recognized as ~pronoun at all

PRONOUN_SINGULAR
PRONOUN_PLURAL

PRONOUN_INDIRECTOBJECT
PRONOUN_REFLEXIVE
PRONOUN_POSSESSIVE   << this was recognized as ~pronoun_possessive but not as ~pronoun

ANIMATE_BEING
OBJECT_AS_ADJECTIVE

CS 6.5a was not able to recognize ~PRONOUN_OBJECT, so I set the def_bits of all my Spanish pronouns like this, e.g.:
mí ( PRONOUN_SUBJECT KINDERGARTEN ANIMATE_BEING OBJECT_AS_ADJECTIVE )

Bruce, could you please tell me how I should set the def_bits for pronouns in this new CS version?
Could you please post a small example of how, in this new version, a word should be set up by creating a concept, and how it should be set up by creating the word and its definition bits in the DICT?

Thanks in advance.

 

 

 
  [ # 14 ]

Regarding the CS spell checker: the Spanish spell checker worked quite well. It only had some trouble with some of the foreign characters (á é í ó ú ñ), but it is a great help (thx Bruce), and I don't see how a third-party spell checker could help when you are running CS in server mode. (That is what you meant by a foreign spell checker, right?) In fact I was telling Oberdan about the importance of the spell checker: it has to come first, before everything else, if you want to understand user input.
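One common way to cope with users who type without accents is to compare words after stripping the diacritics. The Python sketch below uses the standard `unicodedata` module; it is a general technique shown for illustration, not something ChatScript is known to do, and `fold_accents`, `build_index` and `lookup` are hypothetical names.

```python
import unicodedata


def fold_accents(text):
    """Remove diacritics: 'coração' -> 'coracao', 'ñ' -> 'n'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(words):
    """Group dictionary words by their accent-free spelling."""
    index = {}
    for word in words:
        index.setdefault(fold_accents(word), []).append(word)
    return index


def lookup(word, index):
    """Return all dictionary words that equal `word` once accents are ignored."""
    return index.get(fold_accents(word), [])
```

Folding loses information: with this approach `está` and `esta` land in the same bucket, so the caller still has to pick between them from context.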

 

 
  [ # 15 ]

Hi Bruce,
I'm particularly interested in knowing in how many ways the word “si” or “no” can be interpreted by the parser, because I'm having a hard time telling whether the user accepted or declined in their message. Any ideas? Thanks in advance.

 
