Modern Life Skills: Recognizing ML Weirdness

A few weeks ago, the CEO of a Chinese company got in some legal trouble. They had her dead to rights: a camera had seen her crossing a street against the light, facial recognition software had matched her image to a database, and so she was promptly issued a ticket.

The only problem was that it wasn’t her: it was a bus ad with her picture.

This will only get more common in the future. Humans already have experience with two kinds of pathological bureaucratic interaction: poorly-coded forms and intransigent bureaucrats.

It’s astonishing but true that in 2019, you can still fill out a form on a website and get rejected because your credit card number had dashes. Or didn’t have dashes. Or your phone number’s area code was encased in parentheses, or not.

This stuff is easy to solve, but it does take a tiny bit of effort. Writing a program to recognize different variants of phone numbers is literally an introductory exercise. It’ll take you under an hour of copy-and-paste even if you’ve never programmed before. And yet, still an issue.
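For a sense of how small that effort is, here is a minimal sketch in Python. The function name `normalize_us_phone` is made up for illustration, and it assumes ten-digit US-style numbers with an optional leading country code of 1; a real form would have to decide what to do about extensions and international numbers.

```python
import re
from typing import Optional


def normalize_us_phone(raw: str) -> Optional[str]:
    """Return the ten digits of a US-style phone number, or None if it can't be parsed.

    Hypothetical helper: it accepts dashes, dots, spaces, and parentheses,
    because it simply ignores everything that isn't a digit.
    """
    digits = re.sub(r"\D", "", raw)  # strip every non-digit character
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading US country code
    return digits if len(digits) == 10 else None


for variant in ["(212) 555-0123", "212-555-0123", "212.555.0123", "+1 212 555 0123"]:
    print(variant, "->", normalize_us_phone(variant))
```

The design choice is to throw away the punctuation instead of validating it, so the form never has an opinion about parentheses in the first place.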

Bureaucratic intransigence also seems solvable but still persists. My wife ran into this kind of problem when she changed her last name to mine: to get Form A, she needed to fill out Form B, and to get Form B, she needed Form A.

The problematic-website-forms issue is programming 101; the problematic City Hall-and-bank forms issue is just Interacting With Big Institutions 101. But thanks to the proliferation of machine learning, we’re reaching new frontiers in artificial stupidity, and since ML is being applied to so many fields, we’ll see it everywhere.

One of my favorite examples is that classic line from Uncle Tom’s Cabin:

In the digital version of the 1879 edition, Eliza squeezes up little Henry “in her anus” (pg. 7). And in the 1962 Harvard edition, Eva on seeing Mammy throws herself “into her anus” (pg. 176).

I don’t remember it being that kind of book.

What happened was obviously that we reached a new technological frontier: computers are smart enough to recognize letters, and smart enough to guess the identity of smudged letters based on which guess best matches a word. But their Bayesian prior (at least back in 2009, when this error happened) was set at the word level, regardless of other context. “Into her arms” is a cliche; “into her anus” is… less of a cliche. But “arms” and “anus” are both words, and if the letters look 90% like “anus” and 10% like “arms,” the computer will make the appropriate educated guess.

Computers require a certain amount of cleverness to even make this mistake, but they require a lot more cleverness to avoid it. More RAM, too: if you’re constrained by memory, one thing you can’t afford to do is guess the meaning of a word by looking at a hundred thousand adjacent words.
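Here is a toy version of that arithmetic in Python. It is only a sketch of the general idea, not the actual OCR pipeline: the 90/10 likelihoods come from the example above, while both priors (the flat “it’s a word” prior and the context-aware one) are numbers invented for illustration.

```python
def best_guess(likelihood, prior):
    """Pick the word with the highest posterior, where posterior is
    proportional to likelihood * prior (Bayes' rule, then normalized)."""
    scores = {word: likelihood[word] * prior[word] for word in likelihood}
    total = sum(scores.values())
    posterior = {word: score / total for word, score in scores.items()}
    return max(posterior, key=posterior.get), posterior


# How much the smudged letters look like each candidate (from the text above).
LIKELIHOOD = {"anus": 0.9, "arms": 0.1}

# Word-level prior: both candidates are real words, so neither is favored.
# (Assumed flat for illustration.)
WORD_PRIOR = {"anus": 0.5, "arms": 0.5}

# Context-aware prior: how plausible each word is after "into her".
# (Made-up numbers; a real system would estimate these from a large corpus.)
CONTEXT_PRIOR = {"anus": 0.001, "arms": 0.999}

print(best_guess(LIKELIHOOD, WORD_PRIOR)[0])     # the word-level guess
print(best_guess(LIKELIHOOD, CONTEXT_PRIOR)[0])  # the context-aware guess
```

With the flat prior, the letter shapes decide the outcome. With the context prior, the cliche wins even though the letters lean the other way, and the cost is that you need somewhere to store what counts as a cliche.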

As computing gets cheaper, the odds of these errors will decrease. But ML is getting popular even faster than it’s getting better, so the total number of these errors will rise. There’s already a nice collection of toy examples, and the world’s smartest CS PhDs are hard at work creating more.

One nice laboratory for spotting this problem will be the newly-legalized generic erectile dysfunction market. Roman has raised $90m, and Hims has raised just under $100m. Both companies plan on starting with generic ED meds and bootstrapping into broader healthcare brands.[1] They probably think the problem is getting consumer awareness, but the real issue is that they’re showing up twenty years late to an arms race between gray market online pharmacies and the spam filters.

Spam filters use Bayesian heuristics: what’s the probability that this email is legitimate, conditional on its containing a given word or phrase? And Viagra merchants have spent years trying to outwit the spam filters. It’s hard, because any sufficiently creative euphemism is, once it’s caught, an even stronger indicator of spamminess. As Paul Graham pointed out back in 2003, “v1agra” is an even spammier word than “Viagra,” because nobody was forwarding their friends jokes about “v1agra.”
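A rough sketch of why the euphemism backfires, in Python. This is plain Bayes’ rule with add-one smoothing, loosely in the spirit of that approach and not Graham’s exact formula, and every count below is invented for illustration.

```python
def spam_probability(word, spam_counts, ham_counts, n_spam, n_ham):
    """P(spam | email contains word), via Bayes' rule with add-one smoothing.

    spam_counts / ham_counts map a word to the number of spam / legitimate
    emails containing it; n_spam / n_ham are the total emails of each kind.
    """
    p_word_given_spam = (spam_counts.get(word, 0) + 1) / (n_spam + 2)
    p_word_given_ham = (ham_counts.get(word, 0) + 1) / (n_ham + 2)
    p_spam = n_spam / (n_spam + n_ham)
    p_ham = 1 - p_spam
    numerator = p_word_given_spam * p_spam
    return numerator / (numerator + p_word_given_ham * p_ham)


# Hypothetical training data: 1,000 spam emails and 1,000 legitimate ones.
N_SPAM, N_HAM = 1000, 1000
SPAM_COUNTS = {"viagra": 400, "v1agra": 150}
HAM_COUNTS = {"viagra": 20}  # forwarded jokes; nobody jokes about "v1agra"

for word in ["viagra", "v1agra"]:
    p = spam_probability(word, SPAM_COUNTS, HAM_COUNTS, N_SPAM, N_HAM)
    print(f"{word}: {p:.3f}")
```

Even though the misspelling shows up in fewer spam emails in this toy data, it never shows up in legitimate mail, so it scores as spammier than the real word.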

If you look at the business plan of a typical e-commerce CPG play, a big part of the plan is to get people onto your mailing list so you can upsell them later. The generic pills won’t make much margin, but there’s other stuff that will. That plan’s not going to work so well when every spam filter in the world has been trained on catching the brand names, the USAN name, the euphemism, the euphemism-for-a-euphemism, the ghostly-shadow-of-a-coded-reference, the entire universe of typos for all of the above, etc. Roman and Hims would have an easier time selling something completely illegal. (“We’ve raised this Series B to complete our long-planned pivot from the lucrative, albeit constrained, market in crystal meth and coups d’etat-as-a-service…”).

Don’t mention the p1llz

I’m sure they have a plan, but I’m not sure they have a prayer. Even today, my spam folder is full of pitches for the same stuff; presumably, today’s spammers have found all sorts of clever tricks to stay 0.01 steps ahead of Gmail. And unlike other marketers, the spammers don’t seem to hold conferences where they show off the latest tips and tricks.

What to Do

When you deal with a dumb form, you know roughly what to do:

  1. Carefully read the error message to determine whether this form thinks parentheses in an area code are mandatory or forbidden.
  2. If you’re a programmer, or you could be one, contemplate the fact that buying the Friedl Book will give you a superpower with approximately a 1,000x ROI.

When you deal with bureaucratic caprice, you also know what to do: ask for a manager, and iterate until your problem is solved.

But what do you do when you encounter a pathological algorithm? Fortunately, you have options here as well. ML weirdness isn’t just annoying or frustrating. It’s also topical and funny. So the solution is to know that your tweet will go viral. As computers get cleverer, their mistakes get even cleverer. We’re going to live in a pretty hysterical future.