Human Scale Natural Language Processing

Essential information

SFPC, online, summer 2024. Instructor: Allison Parrish. Send me e-mail.

Important links: Google Drive folders for section 1 and section 2.

Description

Natural Language Processing (NLP) is a subfield of AI that drives pervasive technologies like spell check, search, bots, and content moderation. This multi-billion-dollar, energy-intensive industry increasingly dictates the shape of everyday language while perpetuating harmful biases. We will practice “human-scale” natural language processing by forgoing pre-existing datasets and models in favor of communally written texts. This includes exercises in which participants invent new textual categories and hand-tag each other’s writing. Participants will learn the basics of text processing, analysis, and generation in Python, including parsing, regular expressions, Markov chains, and vector similarity.

Many exercises will also be performed with analog media (cut-ups, free-writes, and the like). In addition, we will prioritize technical approaches that run well on low-end hardware rather than relying on carbon-intensive computation.
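To give a taste of one technique on the syllabus, here is a minimal Markov-chain text generator in Python. It is an illustrative sketch, not code from the class itself: the tiny corpus and the function names are invented for this example. The idea is human-scale by design: it needs only a short, hand-written text and the standard library.

```python
import random
from collections import defaultdict

def build_model(words, order=1):
    """Map each sequence of `order` words to the words that follow it."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        model[key].append(words[i + order])
    return model

def generate(model, length=12):
    """Random-walk through the model to produce new text."""
    key = random.choice(list(model.keys()))
    out = list(key)
    for _ in range(length):
        followers = model.get(tuple(out[-len(key):]))
        if not followers:
            break  # dead end: this sequence only appears at the corpus's end
        out.append(random.choice(followers))
    return " ".join(out)

# A communally written corpus would go here; this one is a placeholder.
corpus = "the sun sets and the moon rises and the stars appear".split()
print(generate(build_model(corpus)))
```

Because the model is just a dictionary of observed word sequences, you can inspect it directly and see exactly why the generator made each choice, which is harder to do with large pre-trained models.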

Course objectives

Ethos and methodology

This is a hands-on class, meaning that you will be writing code. Novice programmers will find plenty of code to re-use and re-assemble in the example notebooks. Programmers with more experience are encouraged to experiment with and build upon the material presented in class. The quality of student work is correlated most closely with curiosity and creative concepts, not with technical proficiency.

We are using the Python programming language. Python is widely used in many areas of computational practice, from academia to entrepreneurship; it runs on both supercomputer clusters and microcontrollers. Python is free and open source and has a vibrant community of contributors and enthusiasts. I think Python is a versatile and powerful language that is nonetheless friendly for beginners. If you’re interested in supplementing your Python instruction beyond the content of this class, I’ve linked to a number of resources below.

I strongly discourage the use of LLM-based programming assistants. The purpose of writing a computer program is to produce an unambiguous statement of your intent; you can’t do this unless you understand what you’re writing. Furthermore, evidence suggests that the use of LLM programming assistants is detrimental to both students of programming and software engineers. See, e.g., Vaithilingam et al., whose study shows that LLM-based code generation tools do not “improve the task completion time or success rate,” but do lead to “difficulties in understanding, editing, and debugging” that “significantly hinder” programmers’ “task-solving effectiveness.” Bring your questions to me, a co-teacher, or a fellow student before bringing them to ChatGPT.

We’ll be using Google’s collaboration tools in class (e.g., Google Drive, Google Docs, etc.). As such, you’ll need a Google account. Let me know if this is a problem for you, and we’ll work something out.

Finally, I will be conducting this class in the English language, and many of the code examples will refer to grammatical and linguistic properties of English specifically. Likewise, student contributions to the collective corpus should be in English. This is unfortunate but necessary for both collaboration and for the deep dive into linguistic structure that we’ll be doing in the class. On the final day of class, we’ll have a discussion about the wisdom of an “English-only” rule, and what in the class would have to change for it to be compatible with languages other than English, or many languages at once.

Schedule

Session 1: Text as material

Section 1: 2024-06-13; section 2: 2024-06-15.

Session 2: Text and procedure

Section 1: 2024-06-20; section 2: 2024-06-22.

Session 3: Language models

Section 1: 2024-06-27; section 2: 2024-06-29.

Session 4: Syntax

Section 1: 2024-07-11; section 2: 2024-07-13.

Session 5: Semantics

Section 1: 2024-07-18; section 2: 2024-07-20.

Resources for learning Python

We’re going to be thorough with the basics, but we’re also going to move fast. Fortunately, there are many resources out there for learning Python. You might benefit from going through some of them. I recommend:

Reading list

Due to time constraints, this class does not incorporate required readings or reading discussions. I’ve included below a brief bibliography of papers and articles that are relevant to the class. I’d be happy to discuss the content of any of these papers with you individually, or to lead a small extracurricular reading discussion group. (I’m also happy to provide PDFs of anything you don’t otherwise have access to.) Get in touch if you need more recommendations!

Please also check the online syllabi for Reading and writing electronic text, Computational letterforms and layout and Computational approaches to narrative, three related classes that I teach at NYU ITP/IMA.

Papers and articles

Baraka, Amiri. “Technology & Ethos.” Raise, Race, Rays, Raze; Essays since 1965, Random House, 1972, pp. 155–58.

Booten, Kyle, and Lillian-Yvonne Bertram. “Unbreathed Words: A Conversation with Lillian-Yvonne Bertram.” ASAP/Journal, vol. 7, no. 2, 2022, pp. 261–72.

Drucker, Johanna. “Why Distant Reading Isn’t.” PMLA, vol. 132, no. 3, May 2017, pp. 628–35. https://doi.org/10.1632/pmla.2017.132.3.628.

Giles, Harry Josephine. “Some Strategies of Bot Poetics.” Harry Josephine Giles, 6 Apr. 2016, https://harrygiles.org/2016/04/06/some-strategies-of-bot-poetics/.

Golumbia, David. “ChatGPT Should Not Exist.” Medium, 14 Dec. 2022, https://davidgolumbia.medium.com/chatgpt-should-not-exist-aab0867abace.

Hovy, Dirk, and Shannon L. Spruit. “The Social Impact of Natural Language Processing.” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, 2016, pp. 591–98. ACLWeb, https://doi.org/10.18653/v1/P16-2096.

Long, Karawynn. “Language Is a Poor Heuristic for Intelligence.” Nine Lives, 26 June 2023, https://buttondown.email/ninelives/archive/language-is-a-poor-heuristic-for-intelligence/.

McQuillan, Dan. “Predicted Benefits, Proven Harms: How AI’s Algorithmic Violence Emerged from Our Own Social Matrix.” The Sociological Review Magazine, June 2023. https://doi.org/10.51428/tsr.ekpj9730.

Morris, John. “How to Write Poems with a Computer.” Michigan Quarterly Review, vol. 6, no. 1, 1967, pp. 17–20.

Pipkin, Everest. “A Long History of Generated Poetics: Cutups from Dickinson to Melitzah.” Medium, 20 Sept. 2016, https://everestpipkin.medium.com/a-long-history-of-generated-poetics-cutups-from-dickinson-to-melitzah-fce498083233.

Soria, Claudia. “Decolonizing Minority Language Technology.” State of the Internet’s Languages Report, 1 Jan. 2020, https://internetlanguages.org/en/stories/decolonizing-minority-language/.

Trettien, Whitney Anne. Computers, Cut-Ups and Combinatory Volvelles: An Archaeology of Text-Generating Mechanisms. 2009. MIT, http://whitneyannetrettien.com/thesis/.

Whalen, Zach. “The Many Authors of The Several Houses of Brian, Spencer, Liam, Victoria, Brayden, Vincent, and Alex: Authorship, Agency, and Appropriation.” Journal of Creative Writing Studies, vol. 4, no. 1, 2019, p. 45.

Books

Bertram, Lillian-Yvonne. Travesty Generator. Noemi Press, 2019.

Funkhouser, Chris. Prehistoric Digital Poetry: An Archaeology of Forms, 1959-1995. University of Alabama Press, 2007.

Hartman, Charles O. Virtual Muse: Experiments in Computer Poetry. Wesleyan University Press, 1996.

Mac Low, Jackson, and Anne Tardos. Thing of Beauty: New and Selected Works. University of California Press, 2007.

Anything and everything from Counterpath’s Using Electricity series.