What Profiling Hacker News Users Reveals About the New Aggregation Problem
Source: simonwillison
The aggregation problem has always been easier to describe than to demonstrate. A single comment about a frustrating workplace is innocuous. Five thousand comments about workplace frustration, health worries, career anxieties, and political opinions, spanning fifteen years of a person’s life, is something categorically different. Simon Willison’s recent post on profiling Hacker News users based on their comments makes that difference concrete in a way that abstract privacy arguments rarely do, and the technical barrier to reproducing it is essentially zero.
The Machinery
The Hacker News Algolia API is the path of least resistance to bulk HN data. It requires no authentication, has gentle rate limits, and returns structured JSON for every comment ever posted to the site:
curl "https://hn.algolia.com/api/v1/search_by_date?tags=comment,author_pg&hitsPerPage=1000"
Replace pg with any HN username and you get that person’s public comment history one page at a time, including the text, timestamp, parent story title, and story URL for each entry. For prolific users, this can mean thousands of comments reaching back to 2006 or 2007, when HN launched.
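To make the shape of that response concrete, here is a minimal Python sketch that builds the same URL and reduces a single hit to the fields listed above. The helper names comments_url and summarize_hit are hypothetical and the sample hit is invented. The field names other than comment_text and objectID (both of which appear elsewhere in this article) are assumptions about the Algolia response and should be checked against live output.

```python
from urllib.parse import urlencode

API_BASE = "https://hn.algolia.com/api/v1/search_by_date"


def comments_url(username, page=0, hits_per_page=1000):
    """Build the search_by_date URL for one page of a user's comments."""
    params = {
        "tags": f"comment,author_{username}",
        "hitsPerPage": hits_per_page,
        "page": page,
    }
    # safe="," keeps the comma in the tag list unescaped, as in the curl example.
    return f"{API_BASE}?{urlencode(params, safe=',')}"


def summarize_hit(hit):
    """Keep only the fields discussed in the text; missing keys become None."""
    return {
        "id": hit.get("objectID"),
        "text": hit.get("comment_text"),
        "timestamp": hit.get("created_at"),      # assumed field name
        "story_title": hit.get("story_title"),   # assumed field name
        "story_url": hit.get("story_url"),       # assumed field name
    }


# Invented hit, for illustration only; not real API output.
sample_hit = {
    "objectID": "12345",
    "comment_text": "An example comment.",
    "created_at": "2010-01-01T00:00:00Z",
    "story_title": "An example story",
    "story_url": "https://example.com/story",
    "author": "example_user",
}

print(comments_url("pg"))
print(summarize_hit(sample_hit))
```

Using .get() rather than direct indexing means an unexpected response shape degrades to None values instead of raising.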
Willison’s own hacker-news-to-sqlite tool, part of his broader Dogsheep ecosystem for personal data aggregation into SQLite, makes the collection step even simpler:
pip install hacker-news-to-sqlite
hacker-news-to-sqlite user hn.db USERNAME
That drops the user’s submissions and comments into an items table in SQLite, where Datasette or sqlite-utils can query and explore it. Then comes the part that changes the character of the exercise entirely. His llm CLI tool lets you pipe that corpus directly to any language model:
sqlite-utils query hn.db "SELECT text FROM items WHERE type = 'comment'" --csv \
| llm -s "Build a detailed profile of this person: their professional background, areas of expertise, opinions, and patterns in how they communicate."
The output is a behavioral portrait derived from public text that the subject may have written over many years, across many moods, in many contexts, never imagining it would be synthesized this way.
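One practical detail the pipeline above glosses over is that HN comment bodies are typically stored as HTML, and a long history may exceed what a model accepts in one prompt. The following is a rough standard-library sketch of a cleanup step, assuming the comments have already been read into a list of strings. The function names are hypothetical and max_chars is an arbitrary stand-in for a real context limit, which would be measured in tokens rather than characters.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def clean_comment(raw):
    """Convert one HTML comment body to plain text."""
    # Paragraph tags become blank lines; every other tag is dropped.
    text = re.sub(r"<p>", "\n\n", raw or "")
    text = TAG_RE.sub("", text)
    # Entities such as &amp; and &#x27; are decoded last.
    return html.unescape(text).strip()


def build_corpus(comments, max_chars=200_000):
    """Join cleaned comments, stopping before the budget is exceeded."""
    parts, used = [], 0
    for raw in comments:
        cleaned = clean_comment(raw)
        if not cleaned:
            continue
        cost = len(cleaned) + 2  # the +2 approximates the blank-line separator
        if used + cost > max_chars:
            break
        parts.append(cleaned)
        used += cost
    return "\n\n".join(parts)


example = ['I agree.<p>See <a href="https://example.com">this</a> &amp; that.']
print(build_corpus(example))
```

Truncating at a comment boundary, rather than mid-text, keeps each comment intact at the cost of leaving part of the budget unused.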
Why the API Design Matters
The Algolia API was designed for search, not for bulk export, but “search by date, no query, filter by author” is effectively a history dump. The API returns up to 1,000 items per page, wrapped in a hits array inside a larger JSON object, so a simple loop over page numbers that extracts that array (here with jq) retrieves the corpus page by page:
for page in $(seq 0 9); do
  # -s silences curl's progress meter; jq passes only the array of comments on
  curl -s "https://hn.algolia.com/api/v1/search_by_date?tags=comment,author_USERNAME&hitsPerPage=1000&page=$page" \
  | jq '.hits' \
  | sqlite-utils insert hn.db comments - --pk objectID --alter
done
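A fixed range of page numbers either stops short or keeps requesting pages past the end. The Python sketch below does the same job but stops at the first empty page. It assumes the response is a JSON object with the comments under a hits key. Whether the API will serve arbitrarily deep pages for a single query is not established here, so the loop also carries a hard cap. The names fetch_page and iter_comments are hypothetical, and the fetch function is injectable so the loop can be exercised without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://hn.algolia.com/api/v1/search_by_date"


def fetch_page(username, page, hits_per_page=1000):
    """Fetch one page of a user's comments. Performs a real HTTP request."""
    query = urlencode(
        {
            "tags": f"comment,author_{username}",
            "hitsPerPage": hits_per_page,
            "page": page,
        },
        safe=",",
    )
    with urlopen(f"{API_BASE}?{query}", timeout=30) as resp:
        return json.load(resp)


def iter_comments(username, fetch=fetch_page, max_pages=50):
    """Yield hits page by page until a page is empty or max_pages is hit."""
    for page in range(max_pages):
        hits = fetch(username, page).get("hits", [])
        if not hits:
            break
        yield from hits


# Usage (makes network requests, so it is left commented out):
# comments = list(iter_comments("USERNAME"))
```

Passing a stub as fetch makes the stopping logic testable offline, which also avoids hitting the API repeatedly while developing.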
This is not a hack or an exploit. It is the intended interface. The Firebase-based official HN API is slower for this use case because it returns a list of item IDs from the user’s profile and requires fetching each item individually. Algolia’s design makes bulk retrieval of a user’s history the low-friction path.
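The gap between the two APIs is easy to put in numbers. The sketch below is back-of-the-envelope arithmetic, not a benchmark. It assumes the Firebase API needs one request for the user record plus one per item ID listed there, and that Algolia returns up to 1,000 comments per request as described above. The function names are illustrative.

```python
import math


def algolia_requests(n_comments, hits_per_page=1000):
    """Requests needed when each page carries up to hits_per_page comments."""
    return max(1, math.ceil(n_comments / hits_per_page))


def firebase_requests(n_items):
    """One request for the user record, then one per item ID it lists."""
    return 1 + n_items


for n in (100, 5000):
    print(f"{n} comments: Algolia {algolia_requests(n)}, Firebase {firebase_requests(n)}")
```

For a user with 5,000 comments that works out to 5 requests against 5,001, which is the difference between a one-liner and a job that needs batching and retry logic.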
That design choice was made in a different era, before LLMs changed what was computationally feasible to do with unstructured text at scale. The two technologies were not designed to interact, but they interact seamlessly.
The Aggregation Problem, Operationalized
Privacy researchers have written about the aggregation problem (also called the mosaic effect) for decades. The idea is that combining individually harmless pieces of information can produce something genuinely harmful. Willison’s workflow turns that abstract concern into a reproducible three-step process anyone can run from a terminal.
In 2023, researchers at ETH Zurich published a study showing that GPT-4 could infer personal attributes such as location, occupation, sex, age, and income from Reddit comment histories, using only the text of the comments. There is little reason to think the approach would not carry over to HN. The technical barrier to doing that kind of inference is now a pip install and a curl command.
The standard counterargument is that HN comments are public. Users chose to post under persistent usernames and build a public record. The HN Terms of Service do not prohibit automated reading. All of this is true, and it is also insufficient, for three reasons.
First, the decision to post a comment in 2009 was made in a context where the realistic threat model was “someone might read this comment.” It was not made with the threat model of “someone will feed all my comments across fifteen years to a model that will synthesize my political evolution, health anxieties, career frustrations, and interpersonal patterns into a structured profile.” People cannot meaningfully consent to uses of their data that did not exist at the time of the original act.
Second, European data protection law does not stop at the source data. Under the GDPR, information inferred about an identifiable person is generally treated as personal data in its own right, and profiling is a named category of processing, whether or not the inputs were public. Creating such a profile therefore carries its own legal weight, which most people building these profiles for casual exploration are not thinking about.
Third, there is the chilling effect. Once users understand that their entire HN history can be instantly profiled, some fraction will self-censor or abandon persistent usernames. The long-form, candid, technical discourse that makes HN valuable depends on a degree of psychological safety that surveillance erodes, even when that surveillance is technically legal.
What the llm Tool Reveals About Intent
The llm tool itself has features that underscore how seriously Willison takes the exploratory, accountable framing of these experiments. By default, every prompt and response is logged to a local SQLite database; on Linux this is typically ~/.config/io.datasette.llm/logs.db, and llm logs path prints the location on your system. You can inspect your own inference history with llm logs or explore it with Datasette. The tool is transparent about what it is doing, and it is designed for personal use as much as for building pipelines.
The --schema flag requests structured JSON output, which means you can extract a structured profile object from a comment corpus rather than receiving freeform text:
sqlite-utils query hn.db "SELECT comment_text FROM comments LIMIT 500" --csv \
| llm --schema '{"type": "object", "properties": {"expertise_areas": {"type": "array", "items": {"type": "string"}}, "apparent_role": {"type": "string"}, "topics_of_interest": {"type": "array", "items": {"type": "string"}}}}' \
"Extract a structured profile from these comments"
Structured output makes it trivial to build downstream tooling, store profiles in a database, or compare profiles across users at scale. The distance between a curious experiment and an automated surveillance tool is shorter than it appears from the outside.
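Model output is not guaranteed to match the requested shape, so downstream code usually checks it before storing anything. The following sketch, using only the Python standard library, validates a response against the three fields in the schema above and writes it to a SQLite table. The profiles table, the function names, and the sample response are all hypothetical. Nothing here is part of the llm tool itself.

```python
import json
import sqlite3

LIST_FIELDS = ("expertise_areas", "topics_of_interest")


def validate_profile(raw_json):
    """Parse model output and check the three schema fields have the right types.

    The schema marks no field as required, so missing keys are accepted.
    """
    profile = json.loads(raw_json)
    if not isinstance(profile, dict):
        raise ValueError("profile must be a JSON object")
    for key in LIST_FIELDS:
        value = profile.get(key, [])
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            raise ValueError(f"{key} must be a list of strings")
    if not isinstance(profile.get("apparent_role", ""), str):
        raise ValueError("apparent_role must be a string")
    return profile


def store_profile(conn, username, profile):
    """Upsert one profile as a JSON string, keyed by username."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS profiles (username TEXT PRIMARY KEY, profile TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO profiles (username, profile) VALUES (?, ?)",
        (username, json.dumps(profile)),
    )
    conn.commit()


# Invented model response, for illustration only.
raw = '{"expertise_areas": ["databases"], "apparent_role": "engineer", "topics_of_interest": ["sqlite"]}'
conn = sqlite3.connect(":memory:")
store_profile(conn, "example_user", validate_profile(raw))
print(conn.execute("SELECT username, profile FROM profiles").fetchall())
```

That a validated, queryable table of profiles takes about forty lines is itself an illustration of how short the distance is.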
Willison is deliberate about applying these tools to himself first and being transparent about the dual-use implications. His broader work on demonstrating what is now possible with the llm CLI has genuine value for helping people understand the capabilities they are living with. Demonstration is a form of disclosure.
The Broader Pattern
This connects to a pattern showing up across AI tooling right now. The bottleneck for privacy-sensitive analysis used to be expertise. You needed to know how to work with APIs, write parsing code, understand data formats, and interpret unstructured text at scale. LLMs collapse most of those requirements. The Algolia API plus the llm CLI plus a few shell commands are now sufficient to generate a behavioral portrait of any HN user going back nearly two decades.
Similar dynamics apply to other platforms with public, persistent, attributable post histories. Reddit has its own search API. GitHub exposes years of commit messages, issue comments, and code reviews, often under real names. Stack Overflow’s data is available in public data dumps. The BigQuery public HN dataset is updated regularly and queryable with SQL at scale. The intersection of LLMs with these corpora is not a future concern; it is a present capability that most people using those platforms have not fully processed.
The right response to tools like this is not to ignore them or to restrict API access, which only slightly raises the barrier while making the data less useful for legitimate purposes. The response is to revise how we think about the privacy of public, attributable, persistent text. Writing a comment on HN in 2010 and writing a comment in 2026 are technically the same act, but they are not the same act in terms of the inferences that can be drawn from them over time, at scale, and in combination with everything else you have written.
The threat model has changed. The social norms and platform designs that govern these spaces were built for a different one, and the gap between those two things is where experiments like Willison’s are doing their most useful work.