Part of Human Scale Natural Language Processing.
(Notes are a work in progress, sorry thanks!)
Broadly, there are three ways of representing the meaning of words
(“lexical semantics”) computationally:
- Directed graphs (examples: WordNet, Roget’s
Thesaurus). Graph edges characterize the relationships between words
(synonym, antonym, belonging to a category, is-a, has-a, etc.);
similarity between words can then be judged by graph distance. Usually
constructed top-down “by hand,” but there are also hybrid data-driven
approaches, like ConceptNet. (See the WordNet sketch after this list.)
- Scored matrices of semantic features (examples: VADER,
Gilhooly
and Logie). Researchers identify relevant “features” of words and
then rate words numerically on those features, usually through
surveys (of undergraduate students, lol). Again, these can be augmented
with data-driven techniques; see e.g. the MRC
Psycholinguistic Database. (See the VADER sketch after this list.)
- Distributional approaches (examples: tf-idf, word2vec, and
ultimately ML models like BERT).
Semantic features are identified automatically through text analysis,
based on co-occurrence in a corpus. Usually “unsupervised” (i.e., there
is no “ground truth” that the model approximates), and usually require a
lot of data. See my
explanation of distributional word vectors, and the co-occurrence
sketch after this list.
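To make the graph-distance idea concrete, here’s a minimal sketch using NLTK’s WordNet interface (assuming nltk is installed and its WordNet data has been downloaded); the particular words are just illustrations:

```python
# Graph-based similarity: score word pairs by graph distance in WordNet.
import nltk

nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

# Take the first (most common) noun sense of each word.
dog = wn.synsets("dog", pos=wn.NOUN)[0]
cat = wn.synsets("cat", pos=wn.NOUN)[0]
car = wn.synsets("car", pos=wn.NOUN)[0]

# path_similarity scores a pair by the shortest path between the two senses
# in the is-a (hypernym/hyponym) graph: closer in the graph -> higher score.
print(dog.path_similarity(cat))  # relatively high
print(dog.path_similarity(car))  # relatively low
```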
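Likewise, a sketch of what a hand-scored feature lexicon looks like in practice, using the VADER sentiment lexicon bundled with NLTK (the words looked up are arbitrary examples):

```python
# Feature-matrix approach: look up human-assigned valence scores in VADER.
import nltk

nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

# Under the hood, the lexicon is just a dict mapping each rated word to the
# mean valence score its human raters gave it (None if the word isn't rated).
for word in ["wonderful", "terrible", "okay"]:
    print(word, sia.lexicon.get(word))

# The analyzer combines those per-word scores into sentence-level scores.
print(sia.polarity_scores("the movie was wonderful"))
```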
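And a toy sketch of the distributional idea, using scikit-learn: build a word-by-document count matrix from a made-up five-sentence corpus, then compare words by the cosine similarity of their count vectors. A real setup would use a far larger corpus and something like word2vec, but the principle is the same:

```python
# Distributional approach: words that occur in similar contexts get similar vectors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the dog chased the cat",
    "stock prices fell on monday",
    "investors sold stock on friday",
]

vectorizer = CountVectorizer()
# fit_transform gives a document-by-word count matrix; transpose it so each
# row is a word's vector of counts across documents.
counts = vectorizer.fit_transform(corpus).T.toarray()
vocab = vectorizer.vocabulary_  # word -> row index (after the transpose)

def word_vec(word):
    return counts[vocab[word]].reshape(1, -1)

print(cosine_similarity(word_vec("cat"), word_vec("dog")))    # similar contexts
print(cosine_similarity(word_vec("cat"), word_vec("stock")))  # unrelated contexts
```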
You can represent a graph as a matrix, so ultimately all of these
techniques represent the meaning of a word as a sequence of numbers
(i.e., a vector). For this reason, we can in practice use a lot of the
same computational/mathematical tricks (cosine similarity, clustering,
dimensionality reduction, etc.) for working with data derived from any
of these sources.
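As a tiny illustration of that point, here’s the same cosine-similarity calculation applied to two kinds of vectors; the numbers below are made up for the example, not real ratings or embeddings:

```python
# Cosine similarity doesn't care where the numbers came from.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# e.g., hand-rated features for two words: [concreteness, imageability, familiarity]
print(cosine([4.8, 5.1, 6.0], [4.5, 4.9, 5.8]))

# e.g., a short slice of two learned, word2vec-style embeddings
print(cosine([0.12, -0.40, 0.88, 0.05], [0.10, -0.35, 0.90, 0.01]))
```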