Improving wordnet?
 
 

I’ve been doing some bulk tests on wordnet, comparing it against various other (much smaller) sources I have found over the years, with some interesting results that I would like your opinion about.
-first: the letter ‘z’ appears to be missing in my wordnet db. Under ‘roman alphabet’, relationship ‘member holonym’, you have all the letters except ‘z’, which I think is an error. Is this also the case in your copy? And if not, which version do you have and from where?
-secondly: the first thing I tested were the ‘antonym’ relationships. It turns out that quite a few values are missing. Are these also missing in your copy? And, I guess this one is for the linguistic purists over here: what would you do with these values? Do you consider them correct (so they can be added to the data), or something else?

missing relationships:
all - nothing
before - after
create - destroy
happy - sad
in - out
loud - quiet
neat - messy
over - under
real - fake
right-side-up - upside-down
tough - easy
ancient - modern
every - none
dumb - smart
crazy - sane
behind - ahead
arrive - depart
any - none

 

 
  [ # 1 ]

I’m confused about where to look for this missing ‘z’. I’m looking at the text files of WordNet included with NLTK. Searching index.noun, the letter ‘z’ is present and accounted for, between ‘yves_tanguy’ and ‘z-axis’. If I look up roman_alphabet (indexed as 06497872) and then check that index in data.noun, I find an entry as follows:

06497872 10 n 02 Roman_alphabet 0 Latin_alphabet 0 028 06497459 n 0000 06825863 n 0000 %m 06831177 n 0000 %m 06831284 n 0000 %m 06831391 n 0000 %m 06831498 n 0000 %m 06831605 n 0000 %m 06831712 n 0000 %m 06831819 n 0000 %m 06831926 n 0000 %m 06832033 n 0000 %m 06832140 n 0000 %m 06832248 n 0000 %m 06832356 n 0000 %m 06832464 n 0000 %m 06832572 n 0000 %m 06832680 n 0000 %m 06832788 n 0000 %m 06832896 n 0000 %m 06833004 n 0000 %m 06833112 n 0000 %m 06833220 n 0000 %m 06833328 n 0000 %m 06833436 n 0000 %m 06833544 n 0000 %m 06833663 n 0000 %m 06833776 n 0000 %m 06833890 n 0000 the alphabet evolved by the ancient Romans which serves for writing most of the languages of western Europe 

The index 06833890 corresponds to ‘z’ and 06831177 corresponds to ‘a’, so I’m guessing ‘z’ is included in my version??
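Counting the ‘%m’ pointers by eye is error-prone, so here is a minimal Python sketch of how a data.noun record could be parsed, assuming the field layout documented for the WordNet database files (hexadecimal word count, decimal pointer count, four fields per pointer, gloss after a ‘|’). The function name and the SAMPLE record are my own: SAMPLE is an abbreviated version of the entry above, with the pointer count reduced to 003 and the ‘@’ hypernym symbol assumed for the first pointer.

```python
def parse_data_line(line):
    """Parse one synset record from a WordNet data.* file."""
    body, _, gloss = line.partition("|")
    fields = body.split()
    offset, ss_type = fields[0], fields[2]
    w_cnt = int(fields[3], 16)  # word count is stored as hexadecimal
    # Every word is followed by a lex_id, so words sit at every second field.
    words = [fields[4 + 2 * i] for i in range(w_cnt)]
    p_start = 4 + 2 * w_cnt
    p_cnt = int(fields[p_start])  # pointer count is stored as decimal
    pointers = []
    for i in range(p_cnt):
        base = p_start + 1 + 4 * i
        symbol, target, pos, _source_target = fields[base:base + 4]
        pointers.append((symbol, target, pos))
    return {
        "offset": offset,
        "pos": ss_type,
        "words": words,
        "pointers": pointers,
        "gloss": gloss.strip(),
    }


# Abbreviated, illustrative record (not the full 28-pointer entry).
SAMPLE = (
    "06497872 10 n 02 Roman_alphabet 0 Latin_alphabet 0 003 "
    "@ 06497459 n 0000 %m 06831177 n 0000 %m 06833890 n 0000 "
    "| the alphabet evolved by the ancient Romans"
)

record = parse_data_line(SAMPLE)
# '%m' marks a member meronym pointer.
member_meronyms = [target for symbol, target, _ in record["pointers"] if symbol == "%m"]
print(member_meronyms)
```

Run against the full record from data.noun, the same filter should list one offset per letter, which makes a missing ‘z’ (06833890 in my copy) easy to spot.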

Now on to antonyms. Some antonyms are only defined at the level of lemmas. Here’s an example:

>>> from nltk.corpus import wordnet as wn
>>> alls = wn.synsets('all')
>>> alls[0].antonyms()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Synset' object has no attribute 'antonyms'
>>> alls[0].lemmas[0].antonyms()
[Lemma('no.a.01.no'), Lemma('some.a.01.some')]

As you can see, directly trying to find the antonym of the synset failed, but succeeded for the lemma. Could this be why you’re not finding those antonyms? Here’s what I found for your examples (including all synsets and all lemmas):

all: no, some, partially
before: (none found); after: (none found)
create: (none found); destroy: (none found)
happy: unhappy
in: (none found); out: safe <—odd one; perhaps by way of “knocked out” or “forbidden”?
loud: soft, piano, softly
neat: (none found); messy: (none found)
over: (none found); under: (none found)
real: unreal, nominal, insubstantial
right-side-up: (none found); upside-down: (none found)
tough: tender
ancient: (none found); modern: old style, nonmodern
every: (none found); none: (none found)
dumb: (none found); smart: stupid
crazy: (none found); sane: insane
behind: (none found); ahead: back, backward
arrive: leave
any: (none found); none: (none found)

The code snippet I wrote to search for these (if anyone’s interested) is as follows:

# For input word 'wd'
# (older NLTK: lemmas and name are attributes; in NLTK 3.x they are methods)
syns = wn.synsets(wd)
lemlist = []
for x in syns:
    lemlist.extend(x.lemmas)

antlist = []
for x in lemlist:
    antlist.extend(x.antonyms())

antwords = [x.name for x in antlist]

I imagine more antonyms could be found by expanding the base of synonyms. For example, if I include hyponyms in my search for “create”, I find the antonym “disassemble”. However, out of 58 hyponyms (and their 129 lemmas), that’s the only antonym that comes up. So clearly most synsets have not been tied to antonyms.
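For anyone on a newer NLTK, the lemma-level lookup and the hyponym expansion described here could be sketched roughly as follows. This is only a sketch: the function name and the include_hyponyms flag are my own, it assumes the NLTK 3.x interface where lemmas(), antonyms(), hyponyms() and name() are methods, and it only adds the direct hyponyms of each sense. The WordNet reader is passed in as an argument so the function itself does not import NLTK.

```python
def antonym_words(word, wordnet, include_hyponyms=False):
    """Collect antonym words for `word` through lemma-level antonym links.

    `wordnet` is any object with the NLTK 3.x WordNet reader interface,
    for example `nltk.corpus.wordnet`.
    """
    synsets = list(wordnet.synsets(word))
    if include_hyponyms:
        # Widen the search with the direct hyponyms of every sense.
        for synset in list(synsets):
            synsets.extend(synset.hyponyms())
    found = set()
    for synset in synsets:
        for lemma in synset.lemmas():
            for antonym in lemma.antonyms():
                found.add(antonym.name())
    return sorted(found)
```

With NLTK and the WordNet corpus installed, it would be called as antonym_words('create', wn, include_hyponyms=True) after the usual "from nltk.corpus import wordnet as wn". The results depend on the WordNet version installed.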

 

 
  [ # 2 ]

Ha! Oh, dear, I figured out what you meant with your first point. Actually, I don’t come up with any holonyms for ‘roman alphabet’...

>>> ra = wn.synsets('roman_alphabet')
>>> ra
[Synset('roman_alphabet.n.01')]
>>> ra[0].member_holonyms()
[] 

However I do find all the letters when I look under meronyms…

>>> ra[0].member_meronyms()
[Synset('j.n.02'), Synset('s.n.05'), Synset('b.n.06'), Synset('q.n.01'), Synset('i.n.03'), Synset('o.n.02'), Synset('a.n.06'), Synset('m.n.06'), Synset('k.n.06'), Synset('z.n.02'), Synset('e.n.05'), Synset('d.n.03'), Synset('y.n.02'), Synset('h.n.04'), Synset('x.n.02'), Synset('v.n.04'), Synset('g.n.09'), Synset('t.n.04'), Synset('r.n.03'), Synset('p.n.02'), Synset('f.n.04'), Synset('c.n.11'), Synset('n.n.05'), Synset('w.n.04'), Synset('l.n.04'), Synset('u.n.03')]
>>> c = ra[0].member_meronyms()
>>> len(c)
26 
 

 
  [ # 3 ]

Yes, sorry about that, I meant ‘meronyms’.
Well, this is annoying. I take it that you have the linux (3.0) version of wordnet? I’m going to download the windows version again and see if I still can’t find the ‘z’ (it is missing both in my ms-sql database and in the standard query tool that came with the original db).

About the antonyms: yes, a few can be found through synonyms, but most can’t in my db. That is annoying, but I can make the relationships myself.

 

 
  [ # 4 ]

Interesting find: I downloaded the linux database and copied the files over the windows version (the original distribution files, also the latest version) and all of a sudden there is lots more data available. So the windows and linux distributions are different, with the linux version containing the correct data-set.
I am probably going to contact the distributors about this.

 

 
  [ # 5 ]

I have the WordNet database bundled with NLTK for Ubuntu 10.04 (Lucid Lynx). I think I downloaded my current version about two years ago, if that helps. I can try and look up the version tonight.

 

 
  [ # 6 ]

No need. I downloaded the latest versions from the official wordnet site and there is a clear difference between the two. Both appear to have been released a long time ago (2005, I think). The Ubuntu distribution most likely used the linux wordnet version, which appears to be the correct one. I have sent an email to the wordnet maintainers about this.

 

 
  [ # 7 ]

This makes me laugh, because just the other day I started using the WordNet 3.0 SQLite database, and I like it.

select se1.rank,w2.lemma
from word w1
left join sense se1 on w1.wordid = se1.wordid
left join synset sy1 on se1.synsetid = sy1.synsetid
left join semlinkref on sy1.synsetid = semlinkref.synset1id
left join synset sy2 on semlinkref.synset2id = sy2.synsetid
left join sense se2 on sy2.synsetid = se2.synsetid
left join word w2 on se2.wordid = w2.wordid
where w1.lemma = 'create'
and sy1.pos = 'v'
and semlinkref.linkid = 2
order by se1.rank asc;
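In case it helps anyone calling this from code, here is a rough sketch of the same query run through Python's sqlite3 module with bound parameters instead of literals. The table and column names are the ones used in the query above; the in-memory toy rows and the function names are made up purely so the example runs without the real database, and what linkid 2 stands for is not stated here, so it is passed through as-is. Plain joins are used because the conditions in the where clause already discard the rows a left join would have kept.

```python
import sqlite3

QUERY = """
SELECT se1.rank, w2.lemma
FROM word w1
JOIN sense se1 ON w1.wordid = se1.wordid
JOIN synset sy1 ON se1.synsetid = sy1.synsetid
JOIN semlinkref ON sy1.synsetid = semlinkref.synset1id
JOIN synset sy2 ON semlinkref.synset2id = sy2.synsetid
JOIN sense se2 ON sy2.synsetid = se2.synsetid
JOIN word w2 ON se2.wordid = w2.wordid
WHERE w1.lemma = ? AND sy1.pos = ? AND semlinkref.linkid = ?
ORDER BY se1.rank ASC
"""


def linked_lemmas(conn, lemma, pos, linkid):
    """Return (rank, lemma) rows for words linked to `lemma` by `linkid`."""
    return conn.execute(QUERY, (lemma, pos, linkid)).fetchall()


def build_toy_db():
    """Build a tiny in-memory stand-in with made-up rows (not real WordNet data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE word (wordid INTEGER PRIMARY KEY, lemma TEXT);
        CREATE TABLE synset (synsetid INTEGER PRIMARY KEY, pos TEXT);
        CREATE TABLE sense (wordid INTEGER, synsetid INTEGER, rank INTEGER);
        CREATE TABLE semlinkref (synset1id INTEGER, synset2id INTEGER, linkid INTEGER);
        INSERT INTO word VALUES (1, 'create'), (2, 'build');
        INSERT INTO synset VALUES (10, 'v'), (20, 'v');
        INSERT INTO sense VALUES (1, 10, 0), (2, 20, 0);
        INSERT INTO semlinkref VALUES (10, 20, 2);
    """)
    return conn


conn = build_toy_db()
print(linked_lemmas(conn, "create", "v", 2))
```

Against the real file you would replace build_toy_db() with sqlite3.connect() on wherever your copy of the database lives.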

 

 
  [ # 8 ]

I have just released updated versions of my wordnet databases. It turned out that all the db’s I know of are built on the 2005 dataset, while the latest release is from 2007. The sql scripts can be downloaded from: http://bragisoft.com/download/  (resources section, at the bottom of the page).

I don’t know how the SQLite db was created (perhaps it’s included with linux, which appears to be better updated), but if it was created with the wordnet sql builder tool, it’s probably best to update.

 

 
  [ # 9 ]

@Jan
Jan, can you please tell me how I could achieve WSD (word sense disambiguation) using wordnet?
E.g. “The cool blue is hot”, “The grilled bass is off”, “I deposited some cash on bank”, etc.
Can this be achieved using wordnet?

 

 
  [ # 10 ]

Muhammad,

WordNet groups the particular “senses” of a word into synsets; each sense is defined via part of speech, synonyms, antonyms, and other word connections. Each word sense also has a unique identifying label. In your first example above, you could narrow down the sense of “blue” by first using a parser to recognize that “blue” must be a noun in this context and then reducing the candidate senses to only those that are nouns. To further refine your results, various word similarity tools could be used to decide which of the remaining senses the neighboring words are most closely related to.

If you have a knowledge base to draw from, you could reduce the number of senses still further by checking if synonyms or hypernyms for each sense of “blue” correspond to facts in your knowledge base. In your second example, replacing “bass” with “fish” and with “instrument” should easily produce database matches for only the first!

WSD is an area of active research and there are toolkits available that use wordnet along with various statistical approaches to tackle the problem. I suggest googling around for such packages in your language of choice.
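To make the “filter by part of speech, then compare against neighboring words” idea a bit more concrete, here is a toy Python sketch. It uses plain word overlap between the sentence and each sense’s gloss (a simplified Lesk-style comparison) as a stand-in for a real similarity tool. Everything in TOY_SENSES (labels, glosses) is made up for illustration and is not WordNet data, and the part of speech is assumed to come from a parser you already have.

```python
import re

# Hypothetical toy sense inventory; labels and glosses are invented examples.
TOY_SENSES = {
    "bass": [
        {"label": "bass_fish", "pos": "n",
         "gloss": "an edible fish that is caught, cooked and grilled"},
        {"label": "bass_instrument", "pos": "n",
         "gloss": "a low pitched musical instrument or voice"},
        {"label": "bass_low", "pos": "a",
         "gloss": "having a low pitch or range"},
    ],
}

STOPWORDS = {"a", "an", "the", "is", "of", "or", "and", "that", "on", "i"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def disambiguate(word, pos, sentence, inventory=TOY_SENSES):
    """Pick the sense of `word` whose gloss shares the most words with the sentence."""
    context = tokens(sentence) - {word}
    # Step 1: keep only the senses with the part of speech found by the parser.
    candidates = [s for s in inventory.get(word, []) if s["pos"] == pos]
    if not candidates:
        return None
    # Step 2: score each remaining sense by gloss/context word overlap.
    best = max(candidates, key=lambda s: len(context & tokens(s["gloss"])))
    return best["label"]


print(disambiguate("bass", "n", "The grilled bass is off"))
```

With real data the glosses would come from wordnet itself, and simple overlap will often tie or fail on short sentences, which is where the statistical toolkits come in.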

 

 
  [ # 11 ]

CR pretty much answered your question, I think. Wordnet is just a resource; it won’t do any word sense disambiguation for you, but you can query it in order to get the job done.

There are various versions of wordnet available. The original data comes in plain text files combined with index files. There are libraries available in many different languages that help you access this raw data.
In addition, you can also use an sql-based system, which allows you to query the data in different ways.  I imported the data into my own system, which gives me more flexibility, faster access (compared to sql at least) and a uniform way to access (read/write) all data sources in exactly the same way.
From there, I guess it all depends on your algorithm design skills.

 

 
 