Lecture Twelve, 15th January 2008
Computational Lexicography
1. Introduction
2. Learner's Diary
3. Tasks and Quizzes
4. Evaluation
5. References
1. Introduction
This was the last lecture with material relevant for the exam. The topic was computational lexicography, which describes how lexicographers handle words.
2. Learner's Diary
Criteria for good lexicography
Quantity
Completeness of coverage:
- extensional coverage: number of entries
- intensional coverage: number of types of lexical information
Quality
Correctness of information:
- Types of lexical information
Consistency of structure:
- Macrostructure
- Microstructure
- Mesostructure
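The two quantity criteria above can be made concrete with a small sketch. The lexicon below is a hypothetical toy example (the entries and the category names `pos`, `plural` and `past` are assumptions, not taken from the lecture): extensional coverage is counted as the number of entries, intensional coverage as the number of distinct types of lexical information.

```python
# Hypothetical toy lexicon: entry -> {data category: value}
lexicon = {
    "ferry": {"pos": "noun", "plural": "ferries"},
    "night": {"pos": "noun", "plural": "nights"},
    "foggy": {"pos": "adjective"},
    "arrive": {"pos": "verb", "past": "arrived"},
}

# Extensional coverage: how many entries the lexicon has.
extensional_coverage = len(lexicon)

# Intensional coverage: how many different types of lexical
# information occur anywhere in the lexicon.
categories = set().union(*(info.keys() for info in lexicon.values()))
intensional_coverage = len(categories)

print(extensional_coverage, intensional_coverage, sorted(categories))
```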
Demonstration of the lexicographic workflow cycle
Lexical data acquisition
From corpus to lexicon
Corpus
Layer 1
Primary data (audio / video recording)
Layer 2
Secondary data (transcription, annotation, metadata)
Lexicon
Layer 1
Corpus lexicon (wordlist, concordance, HMM, ... )
Layer 2
Lexicon matrix (entries x data categories, no generalisations)
Layer 3
Lexicon with selected generalisations (procedurally optimised: semasiological, onomasiological)
Layer 4
Lexicon with generalisation hierarchies (general, type, default, inheritance)
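The difference between Layer 2 and Layer 4 can be sketched in a few lines. This is only an illustration under stated assumptions, not the formalism used in the lecture: the entries, the category names and the helper `lookup` are hypothetical. The matrix spells out every value, while the hierarchy stores a default on the type and only the exceptions on the entries.

```python
# Layer 2: lexicon matrix, one row per entry, one column per data category,
# every value spelled out, no generalisations (hypothetical toy data).
matrix = {
    "ferry": {"pos": "noun", "plural": "ferries"},
    "night": {"pos": "noun", "plural": "nights"},
    "sight": {"pos": "noun", "plural": "sights"},
}

# Layer 4: generalisation hierarchy with default inheritance.
# The type "noun" carries a default plural rule; entries store exceptions only.
types = {"noun": {"pos": "noun", "plural": lambda stem: stem + "s"}}
entries = {
    "ferry": {"type": "noun", "plural": "ferries"},  # overrides the default
    "night": {"type": "noun"},
    "sight": {"type": "noun"},
}

def lookup(word, category):
    """Return the entry's own value if present, else inherit from its type."""
    entry = entries[word]
    if category in entry:
        return entry[category]
    default = types[entry["type"]][category]
    return default(word) if callable(default) else default

# Expanding the hierarchy reproduces the full matrix.
expanded = {w: {c: lookup(w, c) for c in ("pos", "plural")} for w in entries}
print(expanded == matrix)
```

The hierarchy is more compact because regular information is stated once on the type and inherited by every entry that does not override it.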
Concordances
What is a concordance?
A KWIC (KeyWord In Context) concordance is a special kind of preliminary, corpus-based dictionary in which each word in a text corpus is paired with its contexts of occurrence in this corpus.
Example:
My first sight of England was on a foggy March night in 1973 when I arrived on the midnight ferry from Calais. (Bill Bryson: Notes from a Small Island)
Alphabetically ordered KWIC
KWIC - Keywords with right-hand contexts
1973 when i arrived
a foggy march night
arrived on the midnight
calais
england was on a
ferry from calais
first sight of england
foggy march night in
from calais
i arrived on the
in 1973 when i
march night in 1973
midnight ferry from calais
my first sight of
night in 1973 when
of england was on
on a foggy march
on the midnight ferry
sight of england was
the midnight ferry from
was on a foggy
when i arrived on
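The list above can be reproduced with a short script. This is a minimal sketch, assuming that tokenisation means lower-casing and dropping punctuation and that each keyword is shown with up to three words of right-hand context; the function name `right_context_kwic` is a hypothetical choice.

```python
import re

sentence = ("My first sight of England was on a foggy March night in 1973 "
            "when I arrived on the midnight ferry from Calais.")

def right_context_kwic(text, n=3):
    """Return one line per token: the keyword plus up to n following words,
    sorted alphabetically by keyword."""
    # Lower-case the text and keep only runs of letters and digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Each line starts at a keyword and extends n words to the right.
    lines = [" ".join(tokens[i:i + n + 1]) for i in range(len(tokens))]
    return sorted(lines)

kwic_lines = right_context_kwic(sentence)
for line in kwic_lines:
    print(line)
```

Because the lines are plain strings that begin with the keyword, an ordinary sort is enough to order them by keyword; digits sort before letters, which is why the line for 1973 comes first.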
KWIC concordance construction
The KWIC procedure
1. Corpus creation: make a corpus of texts in electronic format
2. Tokenisation (pre-process each text):
- process punctuation marks
- break the text into context units (lines/sentences)
3. Keyword list extraction (all words in text)
4. Context collation (for each keyword)
5. Search for KWIC in corpus
6. Store output and format
- for printing, hypertext (CD, web)
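The six steps above can be sketched as a small pipeline. This is not the KWIC programme shown in the lecture but a simplified illustration: the function names, the sentence-splitting rule and the toy corpus are assumptions, and the context is taken as m words to the left and n words to the right of each keyword.

```python
import re
from collections import defaultdict

def tokenise(text):
    """Step 2: break the text into context units (sentences),
    lower-case them and drop punctuation marks."""
    units = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"[a-z0-9']+", unit.lower()) for unit in units if unit]

def build_kwic(texts, m=3, n=3):
    """Steps 3-5: collect every word as a keyword and collate, for each
    occurrence, up to m words of left and n words of right context."""
    concordance = defaultdict(list)
    for text in texts:  # Step 1: the corpus is a list of electronic texts
        for unit in tokenise(text):
            for i, word in enumerate(unit):
                left = unit[max(0, i - m):i]
                right = unit[i + 1:i + 1 + n]
                concordance[word].append((left, right))
    # Keywords in alphabetical order.
    return dict(sorted(concordance.items()))

def format_kwic(concordance, width=30):
    """Step 6: one line per occurrence, keyword aligned after the left context."""
    lines = []
    for word, contexts in concordance.items():
        for left, right in contexts:
            lines.append(f"{' '.join(left):>{width}}  {word}  {' '.join(right)}")
    return "\n".join(lines)

# Hypothetical toy corpus, for illustration only.
corpus = ["I arrived on the midnight ferry. The ferry was late."]
kwic = build_kwic(corpus, m=2, n=2)
print(format_kwic(kwic))
```

Contexts never cross a sentence boundary here because each context unit is processed on its own; a line-based concordance would split on line breaks instead.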
3. Tasks and Quizzes
What is a KWIC concordance?
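A KWIC concordance is a computer-based, preliminary dictionary in which every word of a text is shown together with the context it occurs in. A familiar example of KWIC-style output is the result list of a search engine such as Google, where each hit is shown with a snippet of its context.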
Which are the two main components of lexicon construction based on empirical data?
Corpus creation and lexicon creation are the two main components.
Which layers of abstraction are involved in corpus acquisition?
Layer 1 - Primary data (audio / video recording)
Layer 2 - Secondary data (transcription, annotation, metadata)
Which layers of abstraction are involved in lexicon construction? Describe them.
Layer 1 - Corpus lexicon (wordlist, concordance, HMM)
Layer 2 - Lexicon matrix (entries x data categories, no generalisations)
Layer 3 - Lexicon with selected generalisations (procedurally optimised: semasiological, onomasiological)
Layer 4 - Lexicon with generalisation hierarchies (general, type, default inheritance)
Which layer do standard dictionary types typically belong to?
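Layer 3 - Lexicon with selected generalisations (procedurally optimised: semasiological, onomasiological). Standard dictionaries are organised either semasiologically (from word form to meaning) or onomasiologically (from meaning to word form, as in a thesaurus).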
What are the 6 main steps in KWIC concordance construction?
1. Corpus creation: make a corpus of texts in electronic format
2. Tokenisation (pre-process each text):
- process punctuation marks
- break the text into context units (lines/sentences)
3. Keyword list extraction (all words in text)
4. Context collation (for each keyword)
5. Search for KWIC in corpus
6. Store output and format
- for printing, hypertext (CD, web)
Describe the 6 stages of KWIC concordance construction
1. Corpus creation/collation – put the texts into an electronic format.
2. Tokenisation – convert all capital letters to lower case, remove all punctuation marks and break the text into context units.
3. Keyword list extraction – put the words into a list, sort them alphabetically and remove duplicate words.
4. Context collation – place each keyword in its context, for example with three words on the left and three on the right. In general, split the text into units of length m+1+n (m words of left context, the keyword, n words of right context).
5. Keyword search – look up each keyword in the corpus and collect all of its contexts, including every occurrence of words that appear more than once.
6. Output formatting – bring all the information together and present it in a user-friendly design.
What can a KWIC concordance be used for?
It can be used as a basis for grammatical descriptions or dictionaries.
4. Evaluation
The topic was interesting, but the KWIC programme was not easy to understand and was a little confusing from time to time.
5. References
http://wwwhomes.uni-bielefeld.de/~gibbon/Classes/Classes2007WS/ITL/index.html