|
1
|
- Keith Alcock
- TCS DevSIG
- 1 June 2004
|
|
2
|
- Standard WordNet
- Description
- Application
- Challenges
- Optimized WordNet
- Improvements
- Parts
- WordNet Builder
- WordNet C Database
- WordNet Relationship Browser
- Demo
- Developer topics
- Code generation
- Unit testing
- Trees
- Jagged arrays
- Design patterns
- Iterator
- Singleton
- Composite
- MVC
- Observer
- Template method
- Command
- Memory management
- Smalltalk/C Interop
|
|
3
|
- http://www.cogsci.princeton.edu/~wn/
- “WordNet® is an online lexical reference system whose design is
inspired by current psycholinguistic theories of human lexical memory.
English nouns, verbs, adjectives and adverbs are organized into synonym
sets, each representing one underlying lexical concept. Different
relations link the synonym sets.”
- Fong
- A synonym set (synset) network connected by semantic relations.
|
|
4
|
- Synset example
- cat (lemma or lexeme)
- computerized tomography, computed tomography, CT, computerized axial
tomography, computed axial tomography, CAT (n)
- cat, true cat (n)
- big cat, cat (n)
- Caterpillar, cat (n)
- cat-o’-nine-tails, cat (n)
- kat, khat, qat, quat, cat, Arabian tea, African tea (n)
- cat (n)
- guy, cat, hombre, bozo (n)
- vomit, vomit up, purge, cast, sick, cat, be sick, disgorge, regorge,
retch, puke, barf, spew, spue, chuck, upchuck, honk, regurgitate,
throw up (v)
- cat (v)
|
|
5
|
- Relationships
- Antonym, opposite
- Hypernym, ISA base class
- Hyponym, ISA derived class
- Holonym, HASA whole
- Meronym, HASA part
- Entailment, e.g., snoring entails sleeping
- 25 in all including attribute, cause, derivation, participle,
pertainym, etc.
|
|
6
|
|
|
7
|
|
|
8
|
|
|
9
|
- WordNet
- bin
- dict
- doc
- include
- lib
- man
- src
- dict
- lexnames
- cntlist.rev
- cntlist
- sentidx.vrb
- sents.vrb
- frames.vrb
- adj.exc
- adv.exc
- noun.exc
- verb.exc
- noun.dat
- verb.dat
- adj.dat
- adv.dat
- sense.idx
- adj.idx
- adv.idx
- noun.idx
- verb.idx
|
|
10
|
- idx files
- code n 3 4 @ ~ + ; 3 2 06256978 05959940 05961454
- dat files
- 05961454 10 n 02 code 2 computer_code 0 013 @ 05959763 n 0000 ;c
05762229 n 0000 + 00961643 v 0102 ~ 05961859 n 0000 ~ 05962069 n 0000 ~
05962617 n 0000 ~ 05962737 n 0000 ~ 05963127 n 0000 ~ 05963298 n 0000 ~
05963472 n 0000 ~ 05963624 n 0000 ~ 06162514 n 0000 ~ 06178338 n 0000 |
(computer science) the symbolic arrangement of data or instructions in
a computer program or the set of such instructions
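The idx line above can be read field by field. The following is a minimal sketch, assuming the field layout described in WordNet's wndb manual page (lemma, part of speech, synset count, pointer count, pointer symbols, sense count, tagged sense count, synset offsets). The `IndexEntry` type, the size limits, and `parseIndexLine` are hypothetical names introduced for illustration and are not part of WordNet's own library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_POINTER_SYMBOLS 32   /* assumed limits for this sketch */
#define MAX_OFFSETS         128

/* Hypothetical record for one idx line. */
typedef struct {
    char lemma[128];
    char pos;                                    /* n, v, a or r */
    int  synsetCount;
    int  pointerCount;
    char pointerSymbols[MAX_POINTER_SYMBOLS][4]; /* e.g. "@", "~", "+", ";" */
    int  senseCount;                             /* repeats synsetCount */
    int  tagSenseCount;
    long offsets[MAX_OFFSETS];                   /* byte offsets into the dat file */
} IndexEntry;

/* Parses one idx line. Returns 1 on success and 0 on malformed input. */
int parseIndexLine(const char *line, IndexEntry *entry)
{
    char buffer[2048];
    char *token;
    int i;

    assert(line != NULL && entry != NULL);
    if (strlen(line) >= sizeof buffer) return 0;
    strcpy(buffer, line);                        /* strtok modifies its input */

    token = strtok(buffer, " \n");
    if (token == NULL || strlen(token) >= sizeof entry->lemma) return 0;
    strcpy(entry->lemma, token);

    if ((token = strtok(NULL, " \n")) == NULL) return 0;
    entry->pos = token[0];

    if ((token = strtok(NULL, " \n")) == NULL) return 0;
    entry->synsetCount = atoi(token);
    if ((token = strtok(NULL, " \n")) == NULL) return 0;
    entry->pointerCount = atoi(token);
    if (entry->synsetCount < 0 || entry->synsetCount > MAX_OFFSETS ||
        entry->pointerCount < 0 || entry->pointerCount > MAX_POINTER_SYMBOLS)
        return 0;

    for (i = 0; i < entry->pointerCount; i++) {
        token = strtok(NULL, " \n");
        if (token == NULL || strlen(token) >= sizeof entry->pointerSymbols[i])
            return 0;
        strcpy(entry->pointerSymbols[i], token);
    }

    if ((token = strtok(NULL, " \n")) == NULL) return 0;
    entry->senseCount = atoi(token);
    if ((token = strtok(NULL, " \n")) == NULL) return 0;
    entry->tagSenseCount = atoi(token);

    for (i = 0; i < entry->synsetCount; i++) {
        if ((token = strtok(NULL, " \n")) == NULL) return 0;
        entry->offsets[i] = strtol(token, NULL, 10);  /* base 10: leading zeros are not octal */
    }
    return 1;
}
```

Each parsed offset is the position of a synset record in the matching dat file, which is why hand edits that shift any record invalidate the index.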
|
|
11
|
- Word sense disambiguation
- Discourse structure
- Text summarization and annotation
- Information extraction and retrieval
- Automatic indexing
- Automatic correction of word errors in text
- Semantic bleaching
- Determination of opposition
|
|
12
|
- Semantic distance
- Hirst-St. Onge
- Leacock-Chodorow
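For orientation, the Leacock-Chodorow measure is commonly stated as follows, where len(c1, c2) is the length of the shortest path between two concepts in the hypernym hierarchy and D is the maximum depth of that hierarchy. This is the usual textbook form and not necessarily the exact variant used in this project.

```latex
\mathrm{sim}_{\mathrm{LC}}(c_1, c_2) = -\log \frac{\mathrm{len}(c_1, c_2)}{2D}
```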
|
|
13
|
|
|
14
|
|
|
15
|
- Advantages
- is relatively platform neutral and therefore portable
- can be searched with relatively simple tools like grep or from within
various text editors or word processors
- does not require that the entire database resides in memory
- is at least minimally human readable
|
|
16
|
- Disadvantages
- is impractical to edit by hand because of the offset fields
- allows programmers to bypass the logical structure
- requires substantial and slow disk I/O to access data
- scatters information across multiple files
- involves substantial indexing at runtime
- calls dynamic memory management functions at runtime
|
|
17
|
|
|
18
|
- Updating pointers and indexes is easier to do programmatically than by
hand.
- Bypassing the API and accessing data directly is more difficult with
this arrangement.
- No data resides in files other than the executable (exe or dll), which
itself loads almost instantly.
- Both data and functions are contained in a single file and cannot be
separated.
- Indexing is performed during conversion to C code, not at database
startup time.
- The database requires no dynamically allocated memory for itself. All data is available all the time.
|
|
19
|
|
|
20
|
- WordNet Builder
- Combines all files
- Adds lemmas, words, and glosses to trie
- Converts to indexed jagged arrays
- Alphabetizes lemmas and marks part of speech
- Calculates longest word, longest gloss, fanout
|
|
21
|
- WnCDb
- Provides API access to data
- Summary data
- Lemmas
- Synset words
- Synset children
- Calculates distances
- Searches for paths
|
|
22
|
- WordNet Relationship Browser
- Allows adjustment and disabling of relationships
- Shows parts of speech, lemmas, and synset index
- Filters based on any of these in any order
- Displays synset index, part of speech, words, and gloss
- Allows selection of source and target synsets
- Shows path, tree, graph, and histogram
- Provides navigation support for these
|
|
23
|
|
|
24
|
- WordNet Builder is a code munger
- Input is not exactly code but rather data
- Purpose differs from that of most generators
- Optimizes speed
- Optimizes storage
- Enforces API
- Somewhat related to generation of data tier
- Give-and-go process
- Smalltalk drives down court and hands data to C compiler
- C compiler gives back optimized access through DLL
- Smalltalk makes the basket
|
|
25
|
- Automated
- Manual
- Run Smalltalk debugger inside C debugger
|
|
26
|
|
|
27
|
|
|
28
|
|
|
29
|
- Iterator
- Shortest path is an array of (synset index, relationship number) pairs,
so a search can continue from where it left off
- Singleton
- WordNetLibrary instance is WordNetLibrary default
- Composite
- WordNetOutputFile
- WordNetMultipleOutputFile
- WordNetSingleOutputFile
- WordNetFilter
- WordNetMultipleFilter
- WordNetSingleFilter
- Model View Controller (MVC)
- Dolphin actually uses Model View Presenter (MVP)
- Observer
- Presenter subscribes to events of model and view
- Template Method
- E.g., WordNetBase>>new, WordNetInputFile>>read
- Command
- E.g., Presenter>>queryCommand
|
|
30
|
- Precalculate maximum size and allocate space statically
- char gloss[MAX_GLOSS_LENGTH+1];
- char *getWnCDbSynsetGloss(int synsetIndex);
- Provide access to precalculated value and have caller allocate space
- int getWnCDbSynsetCount();
- Callback multiple times
- int enumWnCDbSynsetWords(
SYNSET_WORD_COLLECTOR wordCollector,
int synsetIndex);
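The three strategies above can be sketched side by side. The function names, `gloss`, `MAX_GLOSS_LENGTH`, and `SYNSET_WORD_COLLECTOR` come from the slide; everything else is an assumption made for illustration: the sample data, the callback parameters (inferred from the Smalltalk descriptor on the next slide), the convention that a nonzero callback result means "continue", and the example collector `rememberLastWord`.

```c
#include <assert.h>
#include <string.h>

#define MAX_GLOSS_LENGTH 64   /* the Builder would precalculate this; value assumed */
#define SYNSET_COUNT     2    /* hypothetical database size */

/* Hypothetical stand-in data for two synsets. */
static const char *const glosses[SYNSET_COUNT] = {
    "feline mammal", "a set of instructions"
};
static const char *const words[] = { "cat", "true_cat", "code", "computer_code" };
static const int wordStart[SYNSET_COUNT + 1] = { 0, 2, 4 };

/* 1. Static buffer sized from the precalculated maximum. */
static char gloss[MAX_GLOSS_LENGTH + 1];

char *getWnCDbSynsetGloss(int synsetIndex)
{
    assert(synsetIndex >= 0 && synsetIndex < SYNSET_COUNT);
    strncpy(gloss, glosses[synsetIndex], MAX_GLOSS_LENGTH);
    gloss[MAX_GLOSS_LENGTH] = '\0';
    return gloss;             /* overwritten by the next call */
}

/* 2. Expose the precalculated value so the caller can allocate. */
int getWnCDbSynsetCount(void)
{
    return SYNSET_COUNT;
}

/* 3. Call back once per word, so no collection is ever allocated. */
typedef int (*SYNSET_WORD_COLLECTOR)(int pos, const char *word, int lexId);

int enumWnCDbSynsetWords(SYNSET_WORD_COLLECTOR wordCollector, int synsetIndex)
{
    int i, delivered = 0;

    assert(wordCollector != NULL);
    assert(synsetIndex >= 0 && synsetIndex < SYNSET_COUNT);
    for (i = wordStart[synsetIndex]; i < wordStart[synsetIndex + 1]; i++) {
        delivered++;
        if (!wordCollector('n', words[i], 0)) break;   /* zero stops early (assumed) */
    }
    return delivered;         /* number of words handed to the callback */
}

/* Example collector: keeps a copy of the most recent word it was given. */
static char lastWord[64];

int rememberLastWord(int pos, const char *word, int lexId)
{
    (void)pos;
    (void)lexId;
    strncpy(lastWord, word, sizeof lastWord - 1);
    lastWord[sizeof lastWord - 1] = '\0';
    return 1;                 /* keep enumerating */
}
```

The static buffer is the simplest for callers but is not reentrant; the callback form pushes storage decisions to the caller, which suits a Smalltalk client that collects words into its own stream.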
|
|
31
|
- WordNetLibrary>>enumSynsetWords: aBlock atSynsetIndex: anInteger
- | callback returnValue |
- callback := ExternalCallback block: aBlock descriptor:
(ExternalDescriptor
fromString: 'cdecl: sdword sdword lpstr sdword').
- returnValue := self basicEnumSynsetWords: callback asParameter
atSynsetIndex: anInteger.
- callback notNil ifTrue: [ callback free ].
- ^returnValue.
- WordNetLibrary>>basicEnumSynsetWords: aCallback atSynsetIndex:
anInteger
- <cdecl: sdword enumWnCDbSynsetWords lpvoid sdword>
- ^self invalidCall.
- WordNetSynsetData>>read
- | stream |
- stream := WriteStream on: String new.
- WordNetLibrary default enumSynsetWords: [ :tmpPos :tmpWord :tmpLexId |
- pos := tmpPos.
- stream…
- ] atSynsetIndex: index.
- words := stream contents.
|
|
32
|
- Non-numeric computing can involve lots of numbers
- One solution does not fit all
- The compiler is your friend
- Code generation is powerful and fun
- Design patterns are everywhere
- Applications don’t need to be monolithic
- Language tech & linguistics are fun
|