Part of Human Scale Natural Language Processing.
We’re going to do a fairly simple part-of-speech tagging task. So I’m going to explain parts of speech to you. Here’s how the story goes: Every word in a sentence belongs to a “part of speech”; “part of speech” refers to the role that the word plays in the sentence. Importantly, a word’s part of speech is not a part of its dictionary definition, and a word can take on different parts of speech in different sentences. For example, the word love in the following sentence
I love cheese.
… is a verb, while in the following sentence
Love is a battlefield.
… love is a noun. The same word, in fact, can fill the role of two (or more) different parts of speech in the same sentence, e.g.
I love cheese, but what's more, I love love.
(In I love love, the first love is the verb, and the second love is a noun.)
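If you're writing code, one way to keep this straight is to attach the part of speech to each occurrence of a word (each token) rather than to the word itself. Here's a minimal sketch using plain Python tuples. The tag names (`VERB`, `NOUN`, `OTHER`, and so on) are labels I made up for illustration, not any official tagset.

```python
from collections import defaultdict

# Hand-tagged tokens for "I love cheese, but what's more, I love love."
# The tag names are illustrative labels, not an official tagset.
tagged = [
    ("I", "PRON"), ("love", "VERB"), ("cheese", "NOUN"),
    ("but", "CONJ"), ("what's", "OTHER"), ("more", "OTHER"),
    ("I", "PRON"), ("love", "VERB"), ("love", "NOUN"),
]

# Collect every tag that each word form appears with.
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word.lower()].add(tag)

print(sorted(tags_by_word["love"]))  # ['NOUN', 'VERB']
```

The same word form ends up with two tags, which is exactly why a plain word-to-tag dictionary isn't enough.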
Parts of speech are not linguistic universals (in fact, they’re not even universals of particular theories concerning any one language). Different languages have different parts of speech, and different conventions for assigning words to parts of speech. For now, we’re going to focus on English.
I won’t continue without pointing out that units like “sentence” and “word” are an invented technology of taxonomy whose purpose is to cheat the necessity, for yet another day, of glimpsing with vertigo into the neverending dark vortex of linguistic expression’s infinite expressiveness. As Al Filreis recalls:
“Oh yes, the sentence,” Creeley once told the critic Burton Hatlen, “that’s what we call it when we put someone in jail.”
And as a consequence, there are very few situations, if any at all, in which the boundaries of “the sentence” are not ambiguous. So of course, the “part of speech” of a “word” must also be ambiguous, and making the determination of a word’s role in the sentence is as much an artistic and political act as anything else. This is especially true when working with corpora (like ours) that contain poetry or other work whose purpose is precisely to break linguistic convention.
HOWEVER, we can also agree that identifying parts of speech is useful for certain ends (in our case, analysis and creative re-composition). So here are some general tips, which are all premised on lies, but hopefully useful lies.
Every sentence has a main verb, which characterizes the central action that the sentence describes. In the following sentence
Yesterday, the cat scratched my left arm.
… the main verb is scratched. What happened? Well, someone did some scratching. However, sentences may also have more than one clause, e.g.:
Yesterday, the cat scratched my left arm, and you laughed.
In this case, there are two verbs: scratched and laughed. (Clauses joined with "and", "but", or "or" are called coordinate clauses, meaning that you could easily break the entire sentence apart into two sentences by removing the conjunction.)
When you’re looking for verbs in a sentence, first you need to find the clauses (and, frustratingly, vice versa). Sometimes it’s a bit tricky, as in
Yesterday, the cat that you found on the street scratched my left arm.
In this case, there’s a verb found in what’s called a relative clause (“that you found on the street” in this case). A relative clause can’t stand on its own; it qualifies a noun, like an adjective with delusions of grandeur. But it still has a verb.
A verb also has a subject, meaning the thing that is performing the action. The subject is a noun somewhere else in the sentence, often just before it but not necessarily. In the sentence above, the cat is the subject of scratched; you is the subject of found.
Some verbs have objects as well, meaning the thing that the action was done to. Often this directly follows the verb, but again not always. In the sentence above, the object of scratched is my left arm; the object of found is less clear, because of how relative clauses are structured. If you paraphrase the task to yourself as a question (“what did you find?”), you’ll often end up with the object: in this case, the cat.
Some verbs do not have objects, like laughed in the sentence You laughed. If you ask yourself “what did you laugh?” an answer is not apparent, and the question itself is semantically and syntactically odd. You don’t laugh something, you just laugh. It’s true that you might laugh at something, or laugh with someone, but those are not “objects” in the purest sense, but other things that happen to be involved in the act of laughing (which is why prepositions like with and at occur before those nouns).
Verbs that have objects are called transitive; verbs without direct objects are called intransitive. Notably, whether a verb is transitive or not is not part of its dictionary definition, but instead a property of how the verb is used in a sentence. For example, the verb escape can be intransitive:
Don't lift the lid, the steam will escape!
… or transitive:
We escaped the danger unharmed.
English verbs have a number of forms that we have to take into account both for analysis and for gathering tagged words for text generation. Let’s look at a few of them.
The present participle is the form of the verb that we often find after forms of the verb to be, ending in -ing, to indicate an action that is currently happening, e.g.
The cat is scratching my left arm!
If you see a verb in this context, it’s almost certainly the present participle. Note that English also has something called a gerund, which is a way of transforming a verb into a noun by appending -ing:
Scratching must be pretty fun.
In this case, it’s true that scratching is a form of the verb scratch, but it is not itself a verb. You won’t get dinged if you happen to tag this as a present participle, though.
The past participle is the form of the verb that often appears along with forms of the verb to have, to indicate an action having taken place in the past. Often this ends with -ed, as in
This cat has scratched my left arm many, many times.
However, many past participles are irregular, such as eat’s past participle eaten:
Has the cat eaten anything today?
The simple past is used to indicate an action that happened in the past, without the contrastive aspect of completion that comes with to have plus the past participle:
The cat *chased* me around the apartment for hours.
Often, the simple past and the past participle are identical, but some irregular verbs have different forms. We’ve seen eaten as the past participle of eat; its simple past is ate:
The cat ate six scoops of kibble!
Because they share a form so frequently, it can be difficult to distinguish the simple past from the past participle. One way to make the distinction is this: if the subject of the sentence did it, then it’s the simple past; if they have done it, then it’s the past participle.
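That rule of thumb can be roughed out in code. The sketch below is an assumption-laden heuristic, not a real tagger: the function name `classify_past_form` and the four-token lookback window are my own inventions, and it will misfire whenever a form of to have shows up nearby as a main verb (as in "I have a cat that scratched me").

```python
# Forms of "to have" that can signal a following past participle.
HAVE_FORMS = {"have", "has", "had", "having"}

def classify_past_form(tokens, index):
    """Guess whether tokens[index] is a simple past or a past participle.

    Looks back up to four tokens for a form of "to have", so that
    intervening words (as in "Has the cat eaten") don't hide it.
    """
    window = tokens[max(0, index - 4):index]
    if any(tok.lower() in HAVE_FORMS for tok in window):
        return "past participle"
    return "simple past"

perfect = "Has the cat eaten anything today?".split()
simple = "The cat ate six scoops of kibble!".split()

print(classify_past_form(perfect, 3))  # past participle
print(classify_past_form(simple, 2))   # simple past
```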
Verbs in English are usually completely uninflected, i.e., they have no affix or change of form that marks them as special. This is the case for present tense verbs other than those in the third person singular, such as:
I eat whenever I feel like it.
Uninflected forms also follow the English negation construction (did not, does not) and modal constructions with want, can, should, etc., along with any other use of the infinitive. For example:
The cat did not scratch me just for fun.
I should eat something.
The cat wants to eat, but needs to scratch.
English has what’s called “subject-verb agreement” inflections on its verbs. This means that the form of the verb changes depending on the subject of the verb, and in particular, the person of the verb. Aside from to be, which is its own story, the form of the verb changes to agree with the person of the subject only for present tense verbs in the third person singular. That means when the subject of the verb is he, she, or it, or a singular noun, you have to add a little thingy to the end of the verb, usually -s or -es. For example:
Whenever I eat fish, the cat scratches my arm.
The cat eats ravenously after scratching my arm.
There are several irregular verbs where the “add -s” rule does not apply, such as to have (have is the uninflected form, while has is the third person singular present tense form):
Even though I have a very friendly demeanor, the cat has bloodlust.
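For text generation, the "add -s or -es" rule is simple enough to sketch as a function. This is a rough approximation under my own assumptions: the function name `third_person_singular` is hypothetical, the irregular table is a tiny sample, and the spelling rules cover only the most common cases.

```python
# Irregular third person singular present forms (a small, incomplete sample).
IRREGULAR_3SG = {"have": "has", "be": "is", "do": "does", "go": "goes"}

def third_person_singular(verb):
    """Return a guess at the third person singular present form of a verb."""
    if verb in IRREGULAR_3SG:
        return IRREGULAR_3SG[verb]
    # Sibilant endings take -es: scratch -> scratches, pass -> passes.
    if verb.endswith(("s", "sh", "ch", "x", "z")):
        return verb + "es"
    # Consonant + y becomes -ies: carry -> carries (but play -> plays).
    if verb.endswith("y") and len(verb) > 1 and verb[-2] not in "aeiou":
        return verb[:-1] + "ies"
    return verb + "s"

for verb in ["eat", "scratch", "have"]:
    print(verb, "->", third_person_singular(verb))
# eat -> eats
# scratch -> scratches
# have -> has
```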
Nouns are the entities in a sentence that are asserted to have performed actions, and to have had actions performed on them. The common phrasing here is that nouns are “people, places and things,” and that is true, as in the following sentence:
The cat scratched the professor in her apartment.
The cat, and the professor, and the apartment are indeed people, places, and things. But not all nouns need to meet those semantic criteria; they only need to fill a noun-like role in the sentence, which might also be filled with an abstraction:
Pain is the natural result of a cat's scratch.
Here, pain and result are nouns, though they are neither, strictly speaking, people nor places nor things. Again, what you’re looking for is not necessarily words that fit semantic criteria, but that are used in particular ways in the sentence. (For nouns, this means things that act as subjects, objects, indirect objects, objects of prepositions, etc.)
Nouns in English have number: they are obligatorily marked as either singular or plural with some kind of inflection. Usually the inflection is the affix -s or -es:
One cat makes a mess. Two cats make two messes.
But English also has many irregular plural forms:
One goose pecks one child. Two geese peck two children.
For our purposes in generating text, the distinction between singular and plural nouns is important, as singular nouns as subjects in present tense sentences require the third person singular present verb inflection, while plural nouns do not.
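Here's a small sketch of how number and agreement interact when generating a clause. Everything in it is an assumption for illustration: the helper names `pluralize` and `simple_clause` are hypothetical, the irregular plural table holds only the two examples from above, and the verb handling ignores irregular verbs entirely.

```python
# Irregular plurals (only the examples from the text; far from complete).
IRREGULAR_PLURALS = {"goose": "geese", "child": "children"}
SIBILANT_ENDINGS = ("s", "sh", "ch", "x", "z")

def pluralize(noun):
    """Return a guess at the plural form of a singular noun."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(SIBILANT_ENDINGS):
        return noun + "es"
    return noun + "s"

def simple_clause(noun, verb, plural=False):
    """Build a present tense "subject verb" clause with agreement."""
    subject = pluralize(noun) if plural else noun
    if plural:
        # Plural subjects take the uninflected verb form.
        verb_form = verb
    else:
        # Singular subjects take the third person singular -s or -es.
        verb_form = verb + ("es" if verb.endswith(SIBILANT_ENDINGS) else "s")
    return f"{subject} {verb_form}"

print(simple_clause("goose", "peck"))               # goose pecks
print(simple_clause("goose", "peck", plural=True))  # geese peck
```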
Nouns often belong to noun phrases: a noun phrase is the noun plus everything that goes along with it, such as any determiners (e.g. the, a, this, those), possessive pronouns (e.g. my, our, their), adjectives, and relative clauses or prepositional phrases that modify the noun. A common property of human language is that linguistic structures can be nested, and English noun phrases are no exception: a noun phrase may have many other noun phrases inside it. For example:
That orange cat in the laundry basket that showed up after the storm meowed.
The subject of this sentence (that orange cat…) is a noun phrase that contains within it another noun phrase (the laundry basket that showed up after the storm), which itself has another noun phrase inside it (the storm). One way of identifying a noun phrase is that it can be replaced with a pronoun or other circumlocution. For example:
"What happened to that orange cat in the laundry basket that showed up
after the storm?" "He meowed."
"What's so special about the laundry basket?" "That orange cat in it that
showed up after the storm meowed."
"Wow that storm sure was something!" "Yeah, that orange cat in the laundry
basket that showed up after it meowed."
Okay, the second and third examples there are kind of a stretch, but hopefully you get the idea. (I’m not asking you to tag noun phrases, but it’s helpful to know that a single noun phrase may contain many nouns, and each of those still counts as a noun.)
English has a part of speech called the preposition. These words are used to assert the fashion in which the following noun phrase is associated with the rest of the sentence in terms of its location, where “location” here can be understood in all of its possible metaphorical applications (spatial, temporal, conceptual, etc.). Prepositions are a mostly “closed class” of words, meaning that (unlike, say, nouns, verbs, and adjectives) it’s possible to list almost definitively all of the words belonging to the class, and there are no reliable ways to “convert” a word belonging to a different part of speech to a prepositional form (in the same way that, say, -ish can change a noun into an adjective, or -tion can turn a verb into a noun). Common prepositions include in, to, by, before, after, etc.
The sentence constituent that includes both the preposition and the phrase that follows it is called a prepositional phrase. For example, in the sentence below,
I found the damn cat under the couch.
… under is the preposition, and under the couch is the complete prepositional phrase. Prepositional phrases can modify either the action of a clause, e.g.,
We rushed to the cat food store.
… where to the cat food store describes the location and/or destination of the act of rushing; or they can modify a noun, as in:
The cat under the mat is less hungry than the cat currently scratching me.
… where under the mat is describing the location of the cat.
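Because prepositions are a mostly closed class, a plain set lookup gets you surprisingly far. The sketch below rests on a few assumptions: the `PREPOSITIONS` set is a partial list I picked by hand, `find_prepositions` is a hypothetical helper, and the tokenizing is just whitespace splitting. Membership alone also over-matches, since words like to and before have non-prepositional uses (the to in wants to eat marks an infinitive).

```python
# A partial, hand-picked list of common English prepositions.
PREPOSITIONS = {
    "in", "to", "by", "before", "after", "under", "on",
    "at", "with", "from", "of", "over", "near",
}

def find_prepositions(sentence):
    """Return (index, word) pairs for tokens found in PREPOSITIONS."""
    # Naive tokenizing: lowercase, drop final punctuation, split on spaces.
    tokens = sentence.lower().rstrip(".!?").split()
    return [(i, tok) for i, tok in enumerate(tokens) if tok in PREPOSITIONS]

print(find_prepositions("I found the damn cat under the couch."))
# [(5, 'under')]
print(find_prepositions("We rushed to the cat food store."))
# [(2, 'to')]
```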
Adjectives are words describing qualities. They are either asserted of a noun using the verb to be, or occur before the noun, and in that capacity serve to distinguish that noun as having that particular quality. The former is called a predicative use, while the latter is attributive. As an example of a predicative adjective:
I am angry; that cat is angry; everyone is angry.
Here, the adjective angry is a quality that is being asserted of I, that cat, and everyone using various forms of the verb to be. By contrast, attributive adjective use looks like this:
The calm cat loves skritches, while the angry cat prefers to scratch.
Here we have two cats, which are distinguished from one another based on their qualities (calm and angry). In this instance, the adjective comes before the noun (though after any determiners or possessive pronouns). You can have multiple attributive adjectives before a noun:
a hungry impatient regal orange cat
English, like many Germanic languages, has many compound nouns: nouns that are formed by the immediate juxtaposition of two or more other nouns. Some examples include: keyboard, basketball, leap year, video game, etc. Sometimes in English, the two component nouns in the compound are separated by a space; sometimes they are not. Sometimes (as with video game vs. videogame), the question of whether a space is needed is a hotly contested cultural issue. Some compound nouns have a space between words in part of the compound, but not in the other, such as basketball net. In any case, when the space is present, it can be difficult to distinguish a two-word compound noun from a noun with an attributive adjective. Is leap in leap year an adjective, or part of the noun compound?
One way to test this is to turn the word in question into a predicative adjective instead and feel out how it sounds. For example:
a calm cat → *that cat is calm* (sounds fine)
a basketball → *that ball is basket* (sounds weird?)
a basketball net → *that net is basketball* (also sounds weird??)
This usually works, but this test can be ambiguous when the first part of the compound is itself an adjective, e.g.:
a freshman → *that man is fresh* (sounds like a joke/pun)
the public domain → *that domain is public* (weird but comprehensible)
some dental floss → *that floss is dental* (sure, maybe)
… so it’s not bulletproof. Proceed with care but also don’t get too caught up in getting things right or wrong. It’ll all work out okay in the end.
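If you want to run this test over a whole list of candidate pairs, the rephrasing step is easy to automate, though the "does it sound weird?" judgment is still yours to make. A minimal sketch, where `predicative_test` is a hypothetical helper of my own:

```python
def predicative_test(modifier, head):
    """Rephrase "a <modifier> <head>" as "that <head> is <modifier>"."""
    return f"that {head} is {modifier}"

candidates = [
    ("calm", "cat"),
    ("basket", "ball"),
    ("basketball", "net"),
    ("dental", "floss"),
]

# Print each rephrasing so a human can judge how it sounds.
for modifier, head in candidates:
    print(f"{modifier} {head} -> {predicative_test(modifier, head)}")
```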
As adjectives modify nouns, adverbs modify other parts of speech, including verbs, prepositions, and adjectives themselves. The following sentence has a number of examples:
The somewhat hungry cat barely in the laundry basket will eat tomorrow.
Here somewhat is an adverb modifying hungry, barely modifies in, and tomorrow modifies the entire sentence.
Adverbs are most often identified as “words that end in -ly” and this is in fact a productive way to turn an adjective into an adverb. For example, calm becomes calmly, bare becomes barely, and so forth. However, not all words that end in -ly are adverbs (words like cowardly or lively, for example, are firmly adjectives). And not all adverbs end with -ly, including many very frequent adverbs like still, yet, also, then, here, there, likewise, always, once, never, often, etc.
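Here's how the -ly heuristic and its exceptions might look as code. This is a sketch under stated assumptions: `looks_like_adverb` is a hypothetical helper, both exception lists are short samples rather than complete inventories, and the suffix check will still be fooled by words like fly or family.

```python
# Words ending in -ly that are adjectives, not adverbs (a small sample).
LY_ADJECTIVES = {"cowardly", "lively", "friendly", "lovely", "lonely"}

# Frequent adverbs that do not end in -ly (from the list above).
NON_LY_ADVERBS = {
    "still", "yet", "also", "then", "here", "there",
    "likewise", "always", "once", "never", "often",
}

def looks_like_adverb(word):
    """Guess whether a word is an adverb from its form alone."""
    w = word.lower()
    if w in NON_LY_ADVERBS:
        return True
    if w in LY_ADJECTIVES:
        return False
    return w.endswith("ly")

for word in ["calmly", "cowardly", "never", "cat"]:
    print(word, looks_like_adverb(word))
# calmly True
# cowardly False
# never True
# cat False
```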