The Data Business, At Three Resolutions

Plus! Diff Jobs; The Instagram Numbers; Trader Pay; The Distribution of Near-Misses; Benefits; Funding Cycles

The Diff April 8th 2024

The Data Business, At Three Resolutions

Byron Tau's Means of Control is a fun look at the world of government entities buying data from private companies for law enforcement and intelligence purposes. For anyone who's had experience working with data for hedge funds, or using it to target ads, it's an especially delightful read—it turns out that the same companies providing raw data that offered a better way to track intra-quarter trends in same-store sales were using that same data to help target drone strikes. Who knew!?

This kind of question actually comes up not infrequently in sales meetings. If you're selling data, there's a temptation to rhapsodize about how comprehensive and granular it is. And there's a tendency for some of the buyers to ask questions like: "Does that mean you're tracking me?" At a well-run data vendor, the answer is some form of "probably," or "probably not," coupled with "we definitely don't know, because the only reason we ever look at data about individuals is to remove outliers, and even then we have strict permissions and extensive logging to prevent exactly the kind of stalkerish behavior you're worried about." That's an honest and fair answer, but it only applies to specific kinds of data. You can think of mass data collection being applied at three resolutions:

  1. Trying to track everybody: this describes the vast majority of data used to inform investment decisions. The goal is to see broad changes in spending behavior and use those to inform fundamental theses, so datapoints only matter when they're statistically significant. If your spending panel shows insane outlier customer retention for Doordash in Boise, but it turns out that your panel has exactly two members in Boise and one of them orders a lot of takeout, your Boise-level data is worthless (though that customer might be incorporated into a broader report about mid-sized cities, or the Western US, etc.).
  2. Trying to track a subset of people: this describes the online advertising ecosystem. The goal is to show the right ad to the right person at the right time, so for any given advertiser the vast majority of data is worthless, but a tiny fraction of it is priceless. The magic of bid density means that there are nonlinear returns to having more data, and that these returns accrue at first to people who can get access and later on to people who can monopolize access.
  3. Tracking data to follow just one person. This is where things get fraught, and it's where a lot of privacy discourse is focused. Most people are not especially concerned that the products they see when they visit eBay or the movies they're suggested on Netflix are tailored based on their preferences. In fact, if this works well, it's invisible—you just end up thinking of Netflix as a site with entertaining shows, not as an intelligence agency that's building an elaborate psychological profile of you and monetizing it by selling you subscriptions. Using data to target individuals is a big part, but not the only part, of what law enforcement and intelligence are doing when they buy this data. There are occasional uses in finance, like tracking private jets to predict pending mergers.[1]
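
The Boise example in the first item can be made concrete with a minimal sketch. Everything here is hypothetical: the `retention_by_region` helper, the threshold of 30 panelists, and the toy panel are illustrative assumptions, not any vendor's actual method. The idea is only that a region-level number gets suppressed when too few panelists back it, while those same panelists still count toward a broader rollup.

```python
from collections import defaultdict

# Hypothetical threshold: below this many panelists, a regional number is noise.
MIN_PANELISTS = 30


def retention_by_region(panel, min_panelists=MIN_PANELISTS):
    """panel: iterable of (region, retained) pairs, one pair per panelist.

    Returns {region: retention rate}, with None where the sample is too
    small to report.
    """
    counts = defaultdict(lambda: [0, 0])  # region -> [retained, total]
    for region, retained in panel:
        counts[region][0] += int(retained)
        counts[region][1] += 1
    return {
        region: (kept / total if total >= min_panelists else None)
        for region, (kept, total) in counts.items()
    }


def overall_retention(panel):
    """Broader rollup: every panelist counts, even from suppressed regions."""
    panel = list(panel)
    return sum(int(retained) for _, retained in panel) / len(panel)


# Toy data (made up): two panelists in Boise, one hundred in Denver.
panel = [("Boise", True), ("Boise", True)]
panel += [("Denver", i % 2 == 0) for i in range(100)]

rates = retention_by_region(panel)
```

With this toy panel, Boise's perfect retention is reported as `None` rather than as a signal, but both Boise panelists still feed into `overall_retention(panel)`.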

Even though these are separate use cases, they're all ways to use similar kinds of data. In general, the backstory for a given dataset is that it was collected for the "target some people" use case, then got sold to data brokers who introduced it to hedge funds (for the "track everyone" case) and the government (more for the "track one person" case).

The result looks a lot more insidious than it is. There is a huge data infrastructure devoted to tracking people's spending, locations, interests, and thoughts. Some fraction of it—a smaller fraction than before, and a shrinking one—is available for purchase with minimal oversight. Companies may have policies about data use, but if a company sets up a location-tracking SDK, pays dozens of apps to include it, sells the data to a dozen data brokers who sell it to end users and to companies that blend it with other data and sell results, there's basically no way for anyone in that information supply chain to know exactly how the data is sourced and how it's being used. And that's especially acute at either end of the chain: end users have the least certainty about where their data comes from, while the people providing the data typically have no idea where it ends up.

For law enforcement, this kind of widespread data is a way to engineer serendipity. Suppose you're at a restaurant, catching up with a friend, and your friend happens to mention exactly where he buried the money he got from robbing a bank a few years back. And suppose that one table over, a police officer happens to overhear this. That's a lucky break! It will not be especially hard to get a search warrant and start digging. But suppose the police extend that analogy and suggest putting a network of unobtrusive microphones in every public place in the city and continuously listening in on conversations, using Whisper to assemble a transcript and searching it for suspicious words. This is still, in some technical sense, a way to overhear discussions of criminal behavior in a context where there's no expectation of privacy, but at scale its effect is different.

This general issue comes up reasonably often: the inability to cost-effectively implement one side's preferred policy ends up being a load-bearing feature of political compromises ($, Diff).[2] "Cops can eavesdrop on conversations in public" is something civil libertarians are okay with when the number of cop-ears is limited by the number of cops, but there's a very different environment with ubiquitous surveillance.

And it's not just that they get comprehensive information, but that they end up with a surplus of information, at which point choosing which crimes to investigate involves more discretion. If the percentage of crimes stopped goes from 10% to 90%, that's good; if it goes from a semi-random 10% to a 10% that's entirely at police discretion, that's a big change in the government's effective org chart, which can happen without any public debate.

There are not infrequent anecdotes in the news about using large-scale data purchases to deanonymize people, either to get them convicted of crimes or just to publicize parts of their life they'd prefer to keep secret. But these tend to come out on a lag: the data gets collected, then shared, then analyzed, then published. That means it's easy to misread where we are in the cycle: this kind of thing—deanonymizing Grindr users, parallel-constructing a case against a drug dealer who was originally identified through suspicious location data[3], identifying secret military bases through Strava accounts, etc.—all of this is starting to go away. It's a combination of changes in user behavior as more people wake up to privacy risks, and Google and Apple tightening up their policies on tracking.

The information still exists, and more of it's getting collected all the time, but it's increasingly first-party data that is used by ad-based companies to target offers rather than sold to independent businesses that will do their own targeting. So there are still privacy leaks—meet a pseudonymous Internet person in real life, and you don't need to ask for their name, because Facebook's "people you may know" feature will helpfully share it with you in a few days. This means that people seeking rich data profiles on individuals are dealing with big companies that have legal and PR departments. These companies cooperate with law enforcement to varying degrees; Apple famously refused to decrypt the phone of a mass shooter in 2016, but also complies with 70-90% of the US government's requests for data.

On the other side of the spectrum, there's still plenty of aggregated data available for investment or competitive analysis. It's less fraught, and even if Google and Facebook aren't going to give it up, there are plenty of companies that will share data, at the right price. Pure-play full-stack data businesses—the ones that directly collect data, and that monetize by selling it—are challenging to build and scale, but many other kinds of companies find that their business model works fine without this revenue source and much better with a new high-margin revenue line that repurposes the same information for a new use case.

In a way it's a classic infrastructure story. Chicago is a good place to trade S&P futures as a historical consequence of being a good place to slaughter cows—once it's the place where people convert agricultural commodities into cash and vice-versa, it's a good place to offer bets on the prices of those commodities, and with a critical mass of liquidity providers in one set of assets, there's a natural extension to supplying liquidity for other kinds of trading, too. Similarly, the data infrastructure got built to process huge amounts of only-nominally-opt-in data sharing through the app ecosystem. Those investments, in both infrastructure and people, can be repurposed for other kinds of data.[4] So data will still be collected, and you'll still periodically get unpleasant surprises about how little privacy you have and the escalating inconvenience of maintaining a given standard of that privacy. But even as the data world gets more opaque from the outside with the rise of walled gardens, it gets more transparent from the inside. The people collecting, selling, and analyzing the data know more about where it comes from and how it's being applied, even if the rest of the world doesn't.

Disclosure: Long META.

  1. This sounds like insider trading, and it does involve making a trade—probably involving buying short-dated out-of-the-money call options—based on foreknowledge of some material event. But in the US, at least, the legal standard involves someone giving a tip and getting a benefit; if you just happen to find out, you're generally in the clear. The fact that it's legal doesn't mean it isn't suspicious, and securities regulators will prudently wonder if the jet tracking was a cover story for something more nefarious. And yes, it is something that happens in the movie Wall Street, but Gordon Gekko arguably would have walked, while Bud Fox would almost certainly have done time. ↩︎

  2. The existence of this phenomenon probably explains some share of modern political disaffection. If The Wall doesn't get completely built and the student loans aren't entirely forgiven, people who supported those policies feel ripped off, even if the compromise their favored politicians accepted was that one side would get to make a grand gesture that had no real-world effect, and the other side would complain about it as if it had actually happened. ↩︎

  3. Think of someone who travels a lot within the same city, has a surprising number of brief rendezvous in parking lots, and sometimes visits a dangerous neighborhood and then drives exactly one mile per hour below the speed limit on the way home. ↩︎

  4. One fun feature of this, for both investors and ad targeters, is that when the best dataset goes away and the next-best is still available, you can treat the historical material as training data to recreate some of what you lost, and to understand the limitations of the new data. It's better to have loved and lost your ability to predict sales within two basis points than to have never loved it at all. ↩︎

Diff Jobs

Companies in the Diff network are actively looking for talent. See a sampling of current open roles below:

Even if you don't see an exact match for your skills and interests right now, we're happy to talk early so we can let you know if a good opportunity comes up.

If you’re at a company that's looking for talent, we should talk! Diff Jobs works with companies across fintech, hard tech, consumer software, enterprise software, and other areas—any company where finding unusually effective people is a top priority.


The Instagram Numbers

Meta has revealed Instagram's revenue in a legal filing. This confirms that Instagram grows faster than Meta as a whole, which was widely assumed, and gives us some real numbers: Instagram's compounded annual growth from 2018 through 2021 was 42%, an impressive number for a company doing tens of billions in revenue. But if you back Instagram out, revenue for the rest of the business, i.e. mostly ads on the Facebook app itself, is not too shabby, at 24% annualized growth on a much higher base.
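
For readers who want to redo the "back Instagram out" arithmetic on the filing's figures, here is a minimal sketch. The function names are hypothetical and the inputs are made-up index values, not Meta's or Instagram's actual revenue; it also assumes that "2018 through 2021" means three compounding periods.

```python
def cagr(start, end, years):
    """Compound annual growth rate between two revenue levels."""
    return (end / start) ** (1 / years) - 1


def cagr_excluding(total_start, total_end, part_start, part_end, years):
    """Growth rate of whatever is left after backing one segment out of the total."""
    return cagr(total_start - part_start, total_end - part_end, years)


# Made-up index values, chosen only to make the arithmetic easy to follow.
segment_growth = cagr(10, 80, 3)                          # about 1.0, i.e. 100% a year
rest_growth = cagr_excluding(100, 235.52, 10, 80, 3)      # about 0.2, i.e. 20% a year
```

The point of the second function is that a fast-growing segment flatters the total, so the remainder's growth has to be computed on the remainder's own base.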

There are plenty of one-time components to this growth: Meta had the tailwind of handling app tracking better than smaller competitors, plus the positive impact of Covid in 2020-21, so it shouldn't be blindly extrapolated. But it does demonstrate that a comparatively geriatric social media app, founded two decades ago, can still put up growth even as its userbase levels off. That should affect the terminal valuations for other apps where users aging out is a meaningful risk, like Instagram itself. Even though users get tired of products, the products keep collecting more data and getting better at meeting those users' needs. As it turns out, this can outrun attrition for a long time.

Trader Pay

France, like many European countries, has strict rules limiting employers' ability to fire workers without giving them some combination of a notice period and lump-sum compensation, and is currently debating whether or not these rules should continue to be extended to beleaguered members of the working class such as professional traders at investment banks ($, FT). It's an interesting clash of priorities: not only do businesses prefer being able to fire whomever they like, but the high variance of trader compensation, as well as the low job security, actually improves the financial system's stability—when an investment bank's revenues decline, its costs mechanically decline, too, since the biggest of those costs is paying people and the biggest, most variable bonuses are performance-based.

But there's another fun twist to the story: from the trader's perspective, high legally-mandated payouts for firing amount to a put option on their compensation. If they get a free put option, their incentive is to take more risk. Part of what banks are worried about is what it costs to fire an underperforming trader, but another thing it's prudent for them to worry about is how much it pays for a trader to underperform.
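
The put-option point can be illustrated with a stylized sketch. The numbers and the `expected_pay` helper are hypothetical, and a real severance formula would be more complicated; the assumption here is simply that a mandated payout acts as a floor under bad outcomes.

```python
def expected_pay(outcomes, severance_floor=0.0):
    """Average pay across equally likely outcomes.

    Any outcome below the floor is replaced by the floor, which is what a
    guaranteed payout on firing does in this stylized setup.
    """
    return sum(max(pay, severance_floor) for pay in outcomes) / len(outcomes)


# Hypothetical strategies with the same average pay (100) before any floor.
cautious = [90, 100, 110]
risky = [0, 100, 200]

no_floor = (expected_pay(cautious), expected_pay(risky))
with_floor = (expected_pay(cautious, 50), expected_pay(risky, 50))
```

Without a floor, the two strategies pay the same on average. With a floor of 50, the cautious strategy is unchanged while the risky one pays more on average, because the floor only ever cuts off the downside.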

The Distribution of Near-Misses

Apropos of several recent stories—another container ship lost power but was safely brought to anchor, a Boeing plane's engine cover fell off and the plane made an emergency landing—it's useful to have a statistical model of news. Specifically: there are some tasks that are intrinsically dangerous, like moving 100,000+ tons of goods on a ship with a crew of a few dozen, or putting hundreds of people in a metal cylinder and flinging them through the sky at hundreds of miles per hour. The people who do these create safety protocols, often starting with the question "how do we make sure nobody dies that particular horrible death ever again?" And these safety protocols involve many layers of redundancy—multiple sensors, backups for everything, external fallbacks when the onboard backup doesn't work, maintenance protocols, maintenance audits to ensure that the protocols are being followed, safety training, etc. The point of that redundancy is that, because these are fundamentally risky activities, you should assume that something will always go wrong, and build your safety protocol around maximizing the number of things that have to go wrong at once for a catastrophic failure to happen. That means the population of near-misses is far larger than the population of disasters, and that these near-misses are close to business-as-usual; some fraction of ships have to get tugged to safety, some fraction of planes have to turn around and go back. That is simply not newsworthy when the bigger risk isn't salient. So any time something like this happens, it will briefly feel like the world is falling apart. If it makes you feel better, just remind yourself that the number of outbreaks of rare diseases that you hear about is far lower than it was in 2020, and the number of railroad accidents in the news has declined massively since February of 2023. When the disaster isn't in the news any more, the near-misses don't make the news, either.
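
A rough way to see why near-misses swamp disasters is the sketch below. It rests on two loud assumptions: the safety layers fail independently of one another, and the 1% failure rate per layer is a made-up number. `failure_profile` is a hypothetical helper, not anyone's actual risk model.

```python
from math import prod


def failure_profile(layer_failure_probs):
    """Return (p_any_failure, p_catastrophe) for independent safety layers.

    A catastrophe requires every layer to fail at once; at least one layer
    failing without all of them failing is a near-miss.
    """
    p_catastrophe = prod(layer_failure_probs)
    p_no_failure = prod(1 - p for p in layer_failure_probs)
    return 1 - p_no_failure, p_catastrophe


# Hypothetical: three independent layers, each failing on 1% of trips.
p_any, p_cat = failure_profile([0.01, 0.01, 0.01])
p_near_miss = p_any - p_cat
```

Under these assumptions, roughly 3% of trips involve something going wrong, while about one in a million ends in catastrophe, so near-misses outnumber disasters by a factor in the tens of thousands.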


Benefits

Some employers are offering subsidized childcare to attract employees ($, WSJ). This is especially helpful for non-office jobs, where working from home to take care of a sick kid or temporarily adjusting one's schedule to accommodate child logistics isn't feasible. Those jobs are also jobs whose market pay is close to that of the childcare workers being subsidized, so it's tricky to get the economics right. The net result is probably that some companies will go all-in on childcare as a benefit that distinguishes them from competitors, while others will mostly ignore this. The childcare-centric companies are paying to subsidize care, but they're getting two benefits: a more reliable workforce day-to-day, and higher retention over time.

Funding Cycles

What historically propelled Silicon Valley's funding ecosystem, other than the density of technical talent, was angel investing: people who'd made enough money to do risky investments, and made it in fields that gave them the network to find the next generation of companies and the skill to evaluate them, and who were early enough in their own careers that they weren't just going to passively enjoy their money in retirement. That model has gotten a bit stretched as angel investing has professionalized and the wait from first check to liquidity has expanded. But Y Combinator is working on formalizing this by raising money from YC alumni to invest in the next generation of YC funds. Former founders are already well represented among LPs generally, so in a sense this is nothing new. But it's a good reminder that some economic arrangements get continuously recreated because they really do work.