Portfolio: How to make a dictionary

Lecture Twelve, 15th January 2008

Computational Lexicography

1. Introduction
2. Learner's Diary
3. Tasks and Quizzes
4. Evaluation
5. References

1. Introduction

This was the last lecture with material relevant to the exam. The topic was computational lexicography, which describes how lexicographers handle words.

2. Learner's Diary

Criteria for good lexicography

Quantity

Completeness of coverage:
- extensional coverage: number of entries
- intensional coverage: number of types of lexical information

Quality

Correctness of information:
- Types of lexical information

Consistency of structure:
- Macrostructure
- Microstructure
- Mesostructure

Demonstration of the lexicographic workflow cycle

Lexical data acquisition

From corpus to lexicon

Corpus

Layer 1
Primary data (audio / video recording)

Layer 2
Secondary data (transcription, annotation, metadata)

Lexicon

Layer 1
Corpus lexicon (wordlist, concordance, HMM, ... )

Layer 2
Lexicon matrix (entries x data categories, no generalisations)

Layer 3
Lexicon with selected generalisations (procedurally optimised: semasiological, onomasiological)

Layer 4
Lexicon with generalisation hierarchies (general, type, default, inheritance)
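
The difference between the flat lexicon matrix (Layer 2) and a lexicon with generalisation hierarchies (Layer 4) can be illustrated with a small sketch. Everything in it is an assumption made for illustration and not taken from the lecture: the names TYPES, ENTRIES, lookup and expand_to_matrix, the data categories, and the two toy entries. Entries store only what deviates from their type, and all other values are inherited as defaults.

```python
# Layer 4 (hypothetical toy data): types carry default values and may name a parent.
TYPES = {
    "word": {"parent": None, "defaults": {"language": "en"}},
    "noun": {"parent": "word", "defaults": {"pos": "noun", "plural_suffix": "s"}},
}

# Entries store only the information that overrides the defaults of their type.
ENTRIES = {
    "night": {"type": "noun", "own": {}},
    "sheep": {"type": "noun", "own": {"plural_suffix": ""}},
}


def lookup(entry_name, category):
    """Return the value of a data category, using default inheritance."""
    entry = ENTRIES[entry_name]
    if category in entry["own"]:
        return entry["own"][category]      # entry-specific value wins
    type_name = entry["type"]
    while type_name is not None:           # walk up the type hierarchy
        node = TYPES[type_name]
        if category in node["defaults"]:
            return node["defaults"][category]
        type_name = node["parent"]
    return None                            # category not defined anywhere


def expand_to_matrix(categories):
    """Spell out every value explicitly: entries x data categories (Layer 2 view)."""
    return {name: {c: lookup(name, c) for c in categories} for name in ENTRIES}


matrix = expand_to_matrix(["pos", "plural_suffix", "language"])
for name, row in matrix.items():
    print(name, row)
```

The expanded matrix repeats the same values in every row, which is what "no generalisations" means for Layer 2. The hierarchy states each shared value only once.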

Concordances

What is a concordance?

A KWIC (KeyWord In Context) concordance is a special kind of preliminary, corpus-based dictionary in which each word in a text corpus is paired with its contexts of occurrence in this corpus.

Example:

My first sight of England was on a foggy March night in 1973 when I arrived on the midnight ferry from Calais. (Bill Bryson: Notes from a Small Island)

Alphabetically ordered KWIC
KWIC - Keywords with right-hand contexts

1973 when i arrived
a foggy march night
arrived on the midnight
calais
england was on a
ferry from calais
first sight of england
foggy march night in
from calais
i arrived on the
in 1973 when i
march night in 1973
midnight ferry from calais
my first sight of
night in 1973 when
of england was on
on a foggy march
on the midnight ferry
sight of england was
the midnight ferry from
was on a foggy
when i arrived on
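
The listing above can be reproduced with a few lines of Python. This is only a minimal sketch, not the programme used in the lecture. The function name right_context_kwic and the parameter n (the number of right-hand context words, here 3) are assumptions. Tokenisation is simplified to lower-casing the text and keeping runs of letters and digits.

```python
import re


def right_context_kwic(text, n=3):
    """Return each word followed by up to n right-hand context words, sorted."""
    # Simplified tokenisation: lower-case, keep only runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # One line per token position: the keyword plus its right-hand context.
    lines = [" ".join(tokens[i:i + n + 1]) for i in range(len(tokens))]
    return sorted(lines)


sentence = ("My first sight of England was on a foggy March night in 1973 "
            "when I arrived on the midnight ferry from Calais.")

kwic_lines = right_context_kwic(sentence)
for line in kwic_lines:
    print(line)
```

Keywords near the end of the sentence get shorter contexts because fewer words follow them. This is why "calais" appears on a line of its own.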

KWIC concordance construction

The KWIC procedure

1. Corpus creation: make a corpus of texts in electronic format

2. Tokenisation (pre-process each text):
- process punctuation marks
- break the text into context units (lines/sentences)

3. Keyword list extraction (all words in text)

4. Context collation (for each keyword)

5. Search for KWIC in corpus

6. Store output and format
- for printing, hypertext (CD, web)
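
The six steps above can be sketched as a small Python pipeline. It is a simplified illustration under stated assumptions, not the lecture's own implementation. The function names build_concordance and format_concordance and the two-sentence toy corpus are invented. Context units are taken to be sentences, split at ".", "!" and "?", and the output is plain tab-separated text.

```python
import re
from collections import defaultdict


def build_concordance(texts):
    """Map each keyword to the context units (sentences) in which it occurs."""
    concordance = defaultdict(list)
    for text in texts:                      # step 1: corpus in electronic format
        # Step 2: tokenisation, break the text into sentence-like context units.
        units = [u.strip() for u in re.split(r"[.!?]+", text) if u.strip()]
        for unit in units:
            words = re.findall(r"[a-z0-9']+", unit.lower())
            # Step 3: keyword extraction (each distinct word of the unit).
            # Steps 4 and 5: collate the unit under every keyword found in it.
            for keyword in sorted(set(words)):
                concordance[keyword].append(unit)
    return dict(concordance)


def format_concordance(concordance):
    """Step 6: one 'keyword<TAB>context' line per occurrence, sorted by keyword."""
    lines = []
    for keyword in sorted(concordance):
        for unit in concordance[keyword]:
            lines.append(f"{keyword}\t{unit}")
    return "\n".join(lines)


corpus = ["The ferry left Calais. The fog hid the ferry."]  # invented toy corpus
conc = build_concordance(corpus)
print(format_concordance(conc))
```

Storing whole sentences per keyword keeps the sketch short. A real concordancer would more likely store positions in the corpus and cut the context out only when formatting.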

3. Tasks and Quizzes

What is a KWIC concordance?

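A KWIC concordance is a computer-based, preliminary kind of dictionary in which every word is shown in the contexts in which it occurs in the text. A popular example of KWIC-style output is Google, whose result list shows each search term within a snippet of surrounding text.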

Which are the two main components of lexicon construction based on empirical data?

Corpus creation and lexicon creation are the two main components.

Which layers of abstraction are involved in corpus acquisition?

Layer 1 - Primary data (audio / video recording)
Layer 2 - Secondary data (transcription, annotation, metadata)

Which layers of abstraction are involved in lexicon construction? Describe them.

Layer 1 - Corpus lexicon (wordlist, concordance, HMM)
Layer 2 - Lexicon matrix (entries x data categories, no generalisations)
Layer 3 - Lexicon with selected generalisations (procedurally optimised: semasiological, onomasiological)
Layer 4 - Lexicon with generalisation hierarchies (general, type, default inheritance)

Which layer do standard dictionary types typically belong to?

Layer 4 - Lexicon with generalisation hierarchies (general, type, default inheritance)

What are the 6 main steps in KWIC concordance construction?

1. Corpus creation: make a corpus of texts in electronic format

2. Tokenisation (pre-process each text):
- process punctuation marks
- break the text into context units (lines/sentences)

3. Keyword list extraction (all words in text)

4. Context collation (for each keyword)

5. Search for KWIC in corpus

6. Store output and format
- for printing, hypertext (CD, web)

Describe the 6 stages of KWIC concordance construction

1. Corpus creation/collation – put the texts into an electronic format.

2. Tokenisation – convert all capital letters to lower case, remove the punctuation marks and break the text into context units.

3. Keyword list extraction – put the words into a list and sort them alphabetically, remove duplicate words.

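4. Context collation – place each keyword in its context, for example with three words on the left and three on the right. In general, split the text into units of length m+1+n: m words of left context, the keyword itself, and n words of right context.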

5. Keyword search – find every occurrence of each keyword in the corpus, so that words which occur more than once are listed with all of their contexts.

6. Output formatting – make the design user-friendly and bring together all information.
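
The units of length m+1+n from stage 4 can be made concrete with a short sketch. The function names kwic_windows and format_kwic are assumptions, as is the choice of m = n = 3. The token list is the beginning of the Bill Bryson example sentence, already lower-cased and without punctuation.

```python
def kwic_windows(tokens, m=3, n=3):
    """Return (left, keyword, right) triples with up to m left and n right words."""
    windows = []
    for i, keyword in enumerate(tokens):
        left = tokens[max(0, i - m):i]       # fewer than m words near the start
        right = tokens[i + 1:i + 1 + n]      # fewer than n words near the end
        windows.append((" ".join(left), keyword, " ".join(right)))
    return windows


def format_kwic(windows):
    """Sort by keyword and right-align the left context so keywords line up."""
    width = max((len(left) for left, _, _ in windows), default=0)
    rows = sorted(windows, key=lambda w: (w[1], w[2]))
    return [f"{left:>{width}}  {kw}  {right}" for left, kw, right in rows]


tokens = "my first sight of england was on a foggy march night".split()
windows = kwic_windows(tokens)
for row in format_kwic(windows):
    print(row)
```

Right-aligning the left context puts all keywords in one column, which is the usual layout of a printed KWIC concordance.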

What can a KWIC concordance be used for?

It can be used for grammatical descriptions or dictionaries.

4. Evaluation

The topic was interesting, but the KWIC programme was not easy to understand and was a little confusing from time to time.

5. References

http://wwwhomes.uni-bielefeld.de/~gibbon/Classes/Classes2007WS/ITL/index.html

Wednesday, 16 January 2008
