Part of Human Scale Natural Language Processing.
(Notes are a work in progress, sorry thanks!)
Broadly, there are three ways of representing the meaning of words
(“lexical semantics”) computationally:
- Directed graphs (examples: WordNet, Roget’s
Thesaurus). Graph edges characterize the relationships between words
(synonym, antonym, belonging to a category, is-a, has-a, etc.);
similarity between words can then be judged by graph distance. Usually
constructed top-down “by hand,” but there are also hybrid data-driven
approaches, like ConceptNet. (See the WordNet sketch after this list.)
- Scored matrices of semantic features (examples: VADER,
Gilhooly
and Logie). Researchers identify relevant “features” of words and
then rate words numerically on those features, usually through
surveys (of undergraduate students, lol). Again, these can be augmented
with data-driven techniques; see e.g. the MRC
Psycholinguistic Database. (See the VADER sketch after this list.)
- Distributional approaches (examples: tf-idf, word2vec, and
ultimately ML models like BERT).
Semantic features are identified automatically through text analysis,
based on co-occurrence in a corpus. Usually “unsupervised” (i.e., there
is no “ground truth” that the model approximates), and usually require a
lot of data. See my
explanation of distributional word vectors, and the co-occurrence
sketch after this list.
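To make the graph-distance idea concrete, here’s a minimal sketch using NLTK’s WordNet interface (assuming nltk is installed and its WordNet data has been downloaded); the particular words are just illustrations:

```python
# Graph-based similarity: score word pairs by graph distance in WordNet.
import nltk

nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

# Take the first (most common) noun sense of each word.
dog = wn.synsets("dog", pos=wn.NOUN)[0]
cat = wn.synsets("cat", pos=wn.NOUN)[0]
car = wn.synsets("car", pos=wn.NOUN)[0]

# path_similarity scores a pair by the shortest path between the two senses
# in the is-a (hypernym/hyponym) graph: closer in the graph -> higher score.
print(dog.path_similarity(cat))  # relatively high
print(dog.path_similarity(car))  # relatively low
```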
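Likewise, a sketch of what a hand-scored feature lexicon looks like in practice, using the VADER sentiment lexicon bundled with NLTK (the words looked up are arbitrary examples):

```python
# Feature-matrix approach: look up human-assigned valence scores in VADER.
import nltk

nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

# Under the hood, the lexicon is just a dict mapping each rated word to the
# mean valence score its human raters gave it (None if the word isn't rated).
for word in ["wonderful", "terrible", "okay"]:
    print(word, sia.lexicon.get(word))

# The analyzer combines those per-word scores into sentence-level scores.
print(sia.polarity_scores("the movie was wonderful"))
```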
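And a toy sketch of the distributional idea, using scikit-learn: build a word-by-document count matrix from a made-up five-sentence corpus, then compare words by the cosine similarity of their count vectors. A real setup would use a far larger corpus and something like word2vec, but the principle is the same:

```python
# Distributional approach: words that occur in similar contexts get similar vectors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the dog chased the cat",
    "stock prices fell on monday",
    "investors sold stock on friday",
]

vectorizer = CountVectorizer()
# fit_transform gives a document-by-word count matrix; transpose it so each
# row is a word's vector of counts across documents.
counts = vectorizer.fit_transform(corpus).T.toarray()
vocab = vectorizer.vocabulary_  # word -> row index (after the transpose)

def word_vec(word):
    return counts[vocab[word]].reshape(1, -1)

print(cosine_similarity(word_vec("cat"), word_vec("dog")))    # similar contexts
print(cosine_similarity(word_vec("cat"), word_vec("stock")))  # unrelated contexts
```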
You can represent a graph as a matrix, so ultimately all of these
techniques represent the meaning of a word as a sequence of numbers
(i.e., a vector). For this reason, we can in practice use a lot of the
same computational/mathematical tricks (cosine similarity, clustering,
dimensionality reduction, etc.) for working with data derived from any
of these sources.
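As a tiny illustration of that point, here’s the same cosine-similarity calculation applied to two kinds of vectors; the numbers below are made up for the example, not real ratings or embeddings:

```python
# Cosine similarity doesn't care where the numbers came from.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# e.g., hand-rated features for two words: [concreteness, imageability, familiarity]
print(cosine([4.8, 5.1, 6.0], [4.5, 4.9, 5.8]))

# e.g., a short slice of two learned, word2vec-style embeddings
print(cosine([0.12, -0.40, 0.88, 0.05], [0.10, -0.35, 0.90, 0.01]))
```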