

WordNet in Spanish
 
 

Hi everyone!

I’m currently using ChatScript as part of a system being developed for my master’s thesis, handling natural language queries to a Linked Data System, and providing “social interaction” rather than just the regular response.

I have written the topic files, and the system is kind of working, although I am having a couple of issues, the first of which has to do with the dictionary. I would like to use the Spanish WordNet to replace the default English dictionary, but I’m not sure how to do it (I think the second problem is related to this one, so I don’t want to tackle it just yet).

If I have understood it correctly, I’ll need to create a “SPANISH” folder inside the “DICT” folder, and place the data there. Is there any automated tool to generate those files from the WordNet? And how do I tell ChatScript to use those files instead of those in the “ENGLISH” folder? I could try to write a script that would just replace the English words with their Spanish translation from the WordNet, but I don’t know if that would actually work.

I apologise if this is a very simple question, but I don’t seem to be able to figure this out.

Thanks!
Alberto

 

 
  [ # 1 ]

Generally speaking, WordNet is not freely available in other languages, hence I do not supply or support them (although I would be happy to for free WordNets). In addition to WordNet (needed for spell checking), CS has code to support conjugation recognition of words. WordNet only has the roots. One would have to write C code to handle foreign-language conjugation. And CS supplies some 1800 predefined concepts in English. One would presumably want to recreate those in the foreign language. So the task is daunting if done in full. A different approach is to autotranslate the Spanish input to English using Google Translate, so that CS takes English as input, and then merely output Spanish.

 

 
  [ # 2 ]

Hi Alberto, as Bruce says, there is no WordNet available for free in Spanish!

I struggled in the past to get a copy and permission to use it, but ELDA is inflexible and charges a lot for it: over 15k, and even then, paying does not entitle you to build anything with it!

Also, you should know that inflection (conjugation), as Bruce says, is complicated in Spanish and other highly inflected languages, unlike English. There is also a whole bunch of parasynthetic forms used in Spanish; most of them are spontaneous and are not in any dictionary. I faced this problem too while developing my chat platform, and decided to bear the cost of a development to do just this: correct spelling with a huge inflected dictionary, capable of recognizing even possibly parasynthetic words like “superbuenísimo”, “reboludón”, “infrahumanos”, etc.

This development, I confess, was a complicated task and deserved a PhD thesis' worth of work. I started it in 2010 and concluded it last December, in support of my university engineering degree.

The result? A system many times (10-600x) better (faster and more precise) than the state-of-the-art spellcheckers and commercial morphological analyzers, including ispell, aspell, Microsoft Word and OpenOffice (which uses a variant of aspell), with the particularity that it withstands and recognizes parasynthetic words like those mentioned above. The development is not yet open source, but the thesis is freely published on my website, and may be posted here in the future if you want it. The main problem is the corpus and the ownership of the dictionary.

Also, I am aware that I want free stuff while not releasing my 5 years of hard work for free, but somewhere in this circle I must earn my living, and giving the work away is not the way for me, because most linguistic data (especially the most valuable, like Spanish) is not free and, as far as I can see, if you give it away for free you never get anything in return. Even open-source packages are continuously compiled, packaged and sold under cross-licensing by many companies, and no punishment is possible because of the cost of enforcing openness across the internet: the domains and those companies migrate from country to country all the time, and the jurisdiction for claiming a license infringement is hard to nail down! So you get f#cked if you give it away, so cash in whenever you can.

My system will be available as a REST service in the future, tied to my chat dialogue platform for Spanish.

best!


 

 

 
  [ # 3 ]

@Bruce

There is actually a translation of WordNet into Spanish, under GPLv3: http://grial.uab.es/fproj.php?id=11&idioma=in . The download page seems to be down, but we contacted them and they sent us the data.

Assuming I can do the verb conjugation before passing the data to ChatScript, how bad an idea would it be to just replace the words in the DICT/ENGLISH files with their Spanish translations (using WordNet for the words and an automated method for the concepts)? It doesn’t sound very… orthodox, but, as you have said, rewriting the system for the particularities of Spanish is something I have neither the resources nor the knowledge to do.

@Andres

I’ve seen your work, but at this point we are not interested in changing any module of our system, since we are aiming for a release before July. We also make a point of focusing on FOSS, as well as releasing as much of our work as possible under open licenses. Also, keep in mind the system is already working, and using Spanish for the dictionary is not strictly needed, but it will mostly help with maintaining and extending the functionality in the future. I will take a look at your thesis though.

Thanks for your answers!

 

 
  [ # 4 ]

You should probably email me the wordnet file, so I can confirm it is format compatible.
I’m not sure what it means to do the verb conjugation before passing the data to CS. To do verb conjugation, you’d have to perform sentence splitting and POS tagging. Then presumably you’d have to pass two parallel sentences, the original and the root sentence. Not just verb roots but also, at a minimum, nouns (plural into singular) and possibly adverbs and adjectives (comparatives, if they exist)...

 

 
  [ # 5 ]

I’ve sent an email to the UAB group, cc’ing you, since my tutor considered it better practice to ask them first.

About the verbs, I have found a Python library that seems to be able to identify Spanish verbs and return their infinitive form. So, the idea was to split the sentence entered by the user, check each word, and replace any verb with its infinitive form, then pass this sentence to ChatScript. (For example, for “podrías explicarme que es un bucle”, I’ll give “poder explicar que es un bucle” to ChatScript.)
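A minimal sketch of that preprocessing step, for illustration only. The thread does not name the Python library, so the `INFINITIVES` table below is a hypothetical stand-in for its lookup, and `strip_enclitic` and `to_root_sentence` are made-up helper names. It only handles the example sentence and a few similar forms.

```python
# Hypothetical stand-in for the (unnamed) library's verb lookup:
# maps a conjugated form to its infinitive. A real system would
# query a morphological analyzer instead of a hand-written table.
INFINITIVES = {
    "podrías": "poder",
    "puedes": "poder",
    "explicas": "explicar",
}

# Pronouns that Spanish can attach to the end of an infinitive.
ENCLITICS = ("me", "te", "se", "lo", "la", "le", "nos", "los", "las", "les")


def strip_enclitic(word):
    """Drop a pronoun attached to an infinitive, e.g. 'explicarme' -> 'explicar'."""
    for pronoun in ENCLITICS:
        stem = word[: -len(pronoun)]
        # Only strip when what remains still looks like an infinitive.
        if word.endswith(pronoun) and stem.endswith(("ar", "er", "ir")):
            return stem
    return word


def to_root_sentence(sentence):
    """Replace known verb forms with their infinitive, word by word."""
    roots = []
    for word in sentence.lower().split():
        word = INFINITIVES.get(word, word)
        roots.append(strip_enclitic(word))
    return " ".join(roots)


print(to_root_sentence("podrías explicarme que es un bucle"))
```

Anything the table does not know (such as "es" here) passes through unchanged, which matches the example sentence in the post but also shows how much depends on the coverage of the lookup.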

I was hoping this, along with having Spanish dictionaries, would be enough for ChatScript to then do the POS tagging and so on. I understand that this way, at the very least, the tagging will lose precision, but we are willing to eat that loss, since we do not expect it to affect the user experience. And, actually, evaluating how the user experience varies with respect to a previous system is part of the project.

Is this possible, or am I missing something here, and the result will not be what I expect?

Thank you again for your time

 

 
  [ # 6 ]

You are missing something. POS tagging starts by knowing how to conjugate words back to the base form, which is then checked against the dictionary to confirm it is correct. This gives you the form or forms that a word might be. E.g.,
“The dog walks” has “dog”, which can be a singular noun or a verb in various tenses, and “walks”, which can be a plural noun or a verb in the present third person. For English, CS has knowledge of auxiliary verbs (has, be, etc.), which are not appropriately marked in WordNet: it lists verbs but not auxiliary verbs. And WordNet does not contain prepositions, conjunctions, pronouns or determiners, so those are also supplemental data I add. The result is that 50% of words have ambiguous POS values at this point.

In CS, rules then winnow the legal choices based on proximate information; for example, “the dog” will remove the verb meaning of “dog” because it follows a determiner. Are the rules of Spanish the same? If so, that will get you to about 33% ambiguous POS tagging. After the winnowing, actual grammar rules are applied to try to parse the sentence and winnow further. And anything that cannot be resolved otherwise is resolved by the probability of each POS tag based on an English corpus. Currently this results in a 94% accuracy rate.
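To make the "candidates, then winnow" idea concrete, here is a toy sketch in Python. It is not ChatScript's actual rule engine: the `LEXICON`, the single determiner rule, and the function names are all assumptions made up for illustration.

```python
# Toy lexicon: every part of speech a word might take (illustrative only).
LEXICON = {
    "the": {"determiner"},
    "dog": {"noun", "verb"},
    "walks": {"noun", "verb"},
}


def candidate_tags(words):
    """Look up all possible tags per word; copy the sets so winnowing
    does not modify the lexicon itself."""
    return [set(LEXICON.get(w.lower(), {"unknown"})) for w in words]


def winnow(tags):
    """One example rule: a word right after an unambiguous determiner
    cannot be a verb, so drop that reading."""
    for i in range(1, len(tags)):
        if tags[i - 1] == {"determiner"} and len(tags[i]) > 1:
            tags[i].discard("verb")
    return tags


def ambiguity(tags):
    """Fraction of words that still have more than one candidate tag."""
    return sum(len(t) > 1 for t in tags) / len(tags)


words = "The dog walks".split()
tags = candidate_tags(words)
before = ambiguity(tags)
tags = winnow(tags)
after = ambiguity(tags)
print(before, after, tags)
```

In this toy case "dog" is resolved to a noun while "walks" stays ambiguous, which is the kind of leftover that the later parsing and probability steps described above would have to settle. A Spanish version would need its own lexicon and its own rules.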

But plural nouns in Spanish will not be recognized, and currently there is no supplemental data for pronouns, etc. And I don’t know that the winnowing rules apply equally to Spanish, and I assume the grammar does not either. So I expect you will not have great results. You should probably leave spell checking turned off.

 

 
  [ # 7 ]

I see…

So, what are my options then, other than just trying and seeing if the results are acceptable? I certainly expect them to be better than just using the English tagger, but I cannot afford the rest of the work right now, and I do not know how to proceed.

(I do have spell checking disabled, by the way)

Also, Ana from the GRIAL group has pointed me to http://grial.uab.es/descarregues.php?idioma=in for the WordNet download.

 

 
  [ # 8 ]

I have examined the Spanish WordNet files. They do not break the data into separate files for nouns, verbs, adjectives, and adverbs as Princeton WordNet does, and the data is in XML, which is not the format of the data from Princeton. My tools will not work without major revision. Not going to happen.
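For anyone attempting the conversion themselves, the first step (grouping the XML entries by part of speech) might look roughly like the sketch below. The thread does not show the Spanish WordNet's XML schema, so the `<synset pos="...">` and `<lemma>` element names in `SAMPLE` are purely hypothetical, and this only groups words; it does not produce Princeton's actual data file format.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical XML layout, invented for this example only.
SAMPLE = """
<wordnet>
  <synset id="s1" pos="n"><lemma>perro</lemma><lemma>can</lemma></synset>
  <synset id="s2" pos="v"><lemma>caminar</lemma></synset>
  <synset id="s3" pos="a"><lemma>bueno</lemma></synset>
</wordnet>
"""

# Map the (assumed) one-letter codes to the four groups Princeton uses.
POS_NAMES = {"n": "noun", "v": "verb", "a": "adj", "r": "adv"}


def split_by_pos(xml_text):
    """Group lemmas by part of speech, one list per group."""
    groups = defaultdict(list)
    root = ET.fromstring(xml_text.strip())
    for synset in root.iter("synset"):
        pos = POS_NAMES.get(synset.get("pos"), "other")
        groups[pos].extend(lemma.text for lemma in synset.iter("lemma"))
    return dict(groups)


groups = split_by_pos(SAMPLE)
for pos, lemmas in groups.items():
    print(pos, lemmas)
```

Each group could then be written to its own file, but matching the exact layout that the ChatScript dictionary tools expect is the "major revision" part and is not covered here.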

 

 
  [ # 9 ]

Ok, I’ll have to try to take a different approach.

Thanks for your time and your explanations.

PS: Sorry for the late response; I’ve been busy with exams and this slipped my mind.

 

 
  [ # 10 ]

Hi Bruce, congrats on the prize. Hi Andres.

Hi Alberto Mardomingo, I was pursuing the same task and got stuck on the same Spanish WordNet issue. Andres Hohendahl also offered to send me a manual for his Spanish chatbot (I sent him an email, with no answer). It is bad to know that Bruce’s tools will not work with the current WordNet dictionary. I propose something: if you are trying to somehow create a Spanish dictionary, why don’t we split the task, or share knowledge about common issues that may appear when implementing a Spanish ChatScript chatbot, so it can be achieved in less time?

 

 
  [ # 11 ]

Hi, I did not receive your email. Check my mail address: my name dot my family name at gmail com. See ya.

 

 