When the Cost of Profiling Drops to an API Call

Source: simonwillison

The cost of turning someone’s public writing into a detailed personal profile used to be high enough that only well-resourced actors bothered. Compiling years of forum comments, filtering the noise, identifying consistent patterns across thousands of messages, synthesizing all of it into a coherent picture: these were research-grade tasks. Simon Willison’s experiment profiling Hacker News users based on their comment history demonstrates something more interesting than a neat LLM trick; it shows just how completely that barrier has collapsed.

How the Pipeline Works

Hacker News stores all comment data publicly, and Algolia provides a search API that makes bulk retrieval by username straightforward. A basic query looks like this:

https://hn.algolia.com/api/v1/search?tags=comment,author_USERNAME&hitsPerPage=1000

Paginate through the results, collect the comment_text field from each hit, and you have years of someone’s public thinking in a few dozen lines of Python. From there, the approach feeds that material to an LLM, likely via Willison’s own llm CLI tool, a utility he maintains that abstracts over multiple model backends including Claude, GPT-4o, and local models via Ollama. The prompt asks for synthesis: professional background, political orientation, areas of expertise, communication style, personal details revealed incidentally.
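The pagination step can be sketched roughly as follows. This is a minimal illustration, not Willison's code: the function names are hypothetical, the `page` parameter and the `hits` and `nbPages` response fields are assumed to follow Algolia's usual response shape, and the sketch ignores any cap the service may place on how deep a single query can page.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://hn.algolia.com/api/v1/search"


def build_search_url(username: str, page: int = 0, hits_per_page: int = 1000) -> str:
    """Build the query URL for one page of a user's comments."""
    params = {
        "tags": f"comment,author_{username}",
        "hitsPerPage": hits_per_page,
        "page": page,
    }
    # safe="," keeps the comma in the tags value readable, as in the URL above.
    return f"{API_BASE}?{urllib.parse.urlencode(params, safe=',')}"


def extract_comment_texts(payload: dict) -> list[str]:
    """Collect the non-empty comment_text values from one response page."""
    return [
        hit["comment_text"]
        for hit in payload.get("hits", [])
        if hit.get("comment_text")
    ]


def fetch_all_comments(username: str, max_pages: int = 50) -> list[str]:
    """Walk the result pages until the API reports there are no more."""
    comments: list[str] = []
    for page in range(max_pages):
        url = build_search_url(username, page)
        with urllib.request.urlopen(url, timeout=30) as resp:
            payload = json.load(resp)
        comments.extend(extract_comment_texts(payload))
        # nbPages is assumed to be the total page count for this query.
        if page + 1 >= payload.get("nbPages", 0):
            break
    return comments
```

Calling `fetch_all_comments("USERNAME")` would return a list of raw comment bodies. They arrive as HTML fragments, so some cleanup is usually needed before handing them to a model.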

The Algolia API supports sorting by date (through its search_by_date endpoint, which returns the newest results first) and filtering by object type through the tags parameter, so you can reconstruct a chronological narrative of someone’s evolving positions on a topic. Fetch all comments by a prolific security researcher, filter for anything mentioning exploit development, and you can watch their thinking develop over a decade. Willison’s llm tool supports piping content directly from stdin:

python fetch_hn_comments.py USERNAME | llm "Based on these Hacker News comments, build a \
detailed profile of this person. Include their likely profession, technical background, \
interests, political orientation, communication patterns, and any personal details revealed."
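The command above assumes a script named fetch_hn_comments.py, which the source does not show. One hypothetical shape for its output stage is sketched below: it takes Algolia hits (assumed to carry `comment_text` as an HTML fragment and `created_at_i` as a Unix timestamp), orders them oldest first, optionally keeps only comments mentioning a keyword, and produces dated plain-text lines. The helper names are illustrative only.

```python
import html
import re
from datetime import datetime, timezone
from typing import Optional

TAG_RE = re.compile(r"<[^>]+>")


def to_plain_text(comment_html: str) -> str:
    """Reduce an HTML comment fragment to plain text."""
    # Treat <p> as a paragraph break, then drop any remaining tags.
    with_breaks = comment_html.replace("<p>", "\n\n")
    # Unescape entities last so literal "<" typed by the user survives.
    return html.unescape(TAG_RE.sub("", with_breaks)).strip()


def chronological_lines(hits: list[dict], keyword: Optional[str] = None) -> list[str]:
    """Return dated comments, oldest first, optionally filtered by keyword."""
    lines = []
    for hit in sorted(hits, key=lambda h: h.get("created_at_i", 0)):
        text = to_plain_text(hit.get("comment_text") or "")
        if not text:
            continue
        # Case-insensitive substring match; a real filter might be smarter.
        if keyword and keyword.lower() not in text.lower():
            continue
        stamp = datetime.fromtimestamp(hit.get("created_at_i", 0), tz=timezone.utc)
        lines.append(f"[{stamp.date().isoformat()}] {text}")
    return lines
```

Printing `"\n\n".join(chronological_lines(hits))` to stdout would give the llm command its input. Keeping the dates in the text is a deliberate choice here, since it lets the model reason about how positions change over time.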

The complexity lives in the prompt and the volume of input, not the code. Willison has been building up this kind of pipeline across several projects, combining sqlite-utils, Datasette, and the llm tool into a flexible toolkit for LLM-assisted data analysis. The official Firebase HN API offers a complementary angle: fetch a user object at https://hacker-news.firebaseio.com/v0/user/USERNAME.json and you get their account creation time, karma, and a submitted list holding the IDs of their stories and comments, which you can then retrieve individually. Both APIs were designed for legitimate developer use; neither imposes meaningful obstacles to bulk profiling.
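A rough sketch of the Firebase route, assuming the user object carries `id`, `created` (a Unix timestamp), `karma`, and `submitted` fields, and that individual items live under `/v0/item/ID.json`. The function names are hypothetical, and the network helper is kept separate so the parsing logic can be exercised offline.

```python
import json
import urllib.request
from datetime import datetime, timezone

FIREBASE_BASE = "https://hacker-news.firebaseio.com/v0"


def user_url(username: str) -> str:
    """URL of the public user object."""
    return f"{FIREBASE_BASE}/user/{username}.json"


def item_url(item_id: int) -> str:
    """URL of a single story or comment."""
    return f"{FIREBASE_BASE}/item/{item_id}.json"


def summarize_user(user: dict, now: datetime) -> dict:
    """Derive account age and collect the IDs to fetch next."""
    created = datetime.fromtimestamp(user["created"], tz=timezone.utc)
    return {
        "id": user["id"],
        "karma": user.get("karma", 0),
        "account_age_days": (now - created).days,
        "item_ids": list(user.get("submitted", [])),
    }


def fetch_json(url: str):
    """Fetch and decode one JSON document (one request per call)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)
```

In use, `summarize_user(fetch_json(user_url("USERNAME")), datetime.now(timezone.utc))` gives the overview, and each entry in `item_ids` can be passed through `item_url` and `fetch_json` in turn. That is one request per item, which is slower than the Algolia route for prolific accounts.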

What an LLM Sees That a Human Skimming Would Miss

A person browsing someone’s HN history would notice obvious signals: which topics they engage with, which positions they defend, which technologies they seem to know well. An LLM processing the full corpus simultaneously does something qualitatively different; it correlates signals that are too diffuse to catch in a linear reading.

A person who never mentions their job title might still reveal it through the combination of which product announcements they engage with, which technical arguments strike them as obviously wrong, which salary ranges they treat as normal, and which organizational patterns they recognize on sight. Someone who carefully avoids explicit political statements might still disclose their priors through which characterizations of the opposing side they bother to correct. These patterns only emerge at scale; they are invisible in any individual comment but can be unmistakable across hundreds.

This is the mosaic effect applied to personal data. Individual pieces of innocuous public information combine into something more sensitive than any piece on its own. The intelligence community has understood this principle for decades, and privacy scholars have applied it to consumer data, but LLMs are the first tool that makes the synthesis cheap and fast enough to execute without institutional resources.

The Contextual Integrity Problem

Helen Nissenbaum’s contextual integrity framework, developed in her influential 2004 paper and expanded in her book Privacy in Context, gives a more precise account of what goes wrong here than the standard “but it’s public data” argument allows. The core claim is that privacy violations are not primarily about secrecy; they are about inappropriate flows of information across contexts. When you post an HN comment, you are sharing information within the context of a technical discussion forum, governed by norms around public argument and professional discourse. The implicit expectation is that your comment is read by people participating in that discussion.

Aggregating thousands of those comments into an LLM-generated psychological profile violates those contextual norms even though every piece of underlying data is technically public. The same information, flowing in a different context, becomes something the original poster never anticipated. You commented on a thread about database indexing strategies; you did not consent to that comment being folded into a dossier estimating your personality type and political alignment.

The research by Kosinski, Stillwell, and Graepel, published in PNAS in 2013, demonstrated this dynamic clearly with Facebook likes: patterns of public behavioral signals predict private attributes including sexual orientation, political views, religious beliefs, and personality traits with accuracy well above chance. That work required substantial labeled training data and statistical modeling expertise. The equivalent today requires a prompt and an API key.

The Same Infrastructure I Work With

I build Discord bots, which means I spend a lot of time reading and processing public messages programmatically. The pipeline Willison describes (fetching messages by user, concatenating them, and prompting an LLM for synthesis) is functionally identical to what a bot does when building user context for a conversation or summarizing a channel’s history for a new participant. The code looks the same; only the scale and the intent differ.

That proximity makes the privacy question concrete rather than abstract. When I build a bot that reads message history to give Claude context for a support interaction, I am making implicit assumptions about what users expect when they post publicly in a server. I try to keep that data scoped to the immediate purpose and avoid storing it beyond what the conversation requires. The capability to do more exists; the temptation to make the experience richer by using it exists alongside it. Resisting that temptation is a deliberate design choice, not a default behavior.
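To make that design choice concrete, here is an illustrative sketch only, not code from any real bot. `ScopedContext` and its methods are hypothetical names; the idea is simply a bounded, in-memory buffer that is never written to disk and is cleared when the conversation ends.

```python
from collections import deque


class ScopedContext:
    """Recent messages for one conversation, held in memory only."""

    def __init__(self, max_messages: int = 20):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._messages: deque[str] = deque(maxlen=max_messages)

    def add(self, author: str, text: str) -> None:
        """Record one message in the rolling window."""
        self._messages.append(f"{author}: {text}")

    def as_prompt_context(self) -> str:
        """Render the current window as text for the model prompt."""
        return "\n".join(self._messages)

    def close(self) -> None:
        """Discard everything once the conversation is over."""
        self._messages.clear()
```

The cap on `max_messages` is what keeps the data scoped to the immediate purpose; nothing in the class accumulates a per-user history across conversations.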

One thing that strikes me about Willison’s experiment is how it surfaces the gap between “this data is technically accessible” and “this data was shared with the expectation that it would be used this way.” Building tools that consume public data responsibly means taking that gap seriously, even when nothing technical prevents you from ignoring it.

What This Changes About Public Participation

The practical implication is that participating in public technical forums now carries a different kind of exposure than it did before capable LLMs became accessible. Individual comments were always public; the synthesized profile that emerges from ten years of them is new, or at least newly cheap to produce. The labor cost that previously made bulk profiling impractical was doing a lot of privacy work without anyone acknowledging it as such.

This is not an argument for retreating from public technical discussion. The value of open discourse in forums like HN is real, and withdrawing from it as a privacy measure would represent a genuine loss. The argument is narrower: the implicit social contract of posting publicly has changed, and most participants have not updated their mental model to reflect that change.

What remains unresolved is what the platforms hosting this data owe their users in terms of clarity about how it can be used. Terms of service and privacy policies written before LLM-assisted bulk analysis was practical may not reflect the actual bargain users are making when they post. That gap is not unique to HN; it applies to any open forum where “public” historically implied readable by participants rather than trivially profilable by anyone. Whether closing that gap requires stronger norms around how aggregated public data gets used, greater transparency from platforms, or changes to API access policies is a question that the technical ease of experiments like Willison’s makes increasingly hard to defer.
