
The Aggregation Problem at LLM Speed

Source: simonwillison

The aggregation problem is a well-established idea in privacy scholarship. Individually innocuous data points combine to produce something qualitatively more revealing than any single element. Your name is public. Your employer is findable. Your forum posts are archived. But when you concatenate a few hundred of those posts and hand them to a large language model, what comes back is something that none of the original disclosures implied you were sharing.

Simon Willison demonstrated this recently by building a tool that profiles Hacker News users from their comment history. The pipeline is short. The HN Algolia API exposes full-text search over every comment on the site, filterable by author:

GET https://hn.algolia.com/api/v1/search_by_date
  ?tags=comment,author_<username>
  &hitsPerPage=1000
  &page=0

Paginate through results, strip the HTML from the comment_text field, concatenate, and send to an LLM with a prompt asking for a professional and personality profile. Models now support context windows of 200K tokens, which means most users’ entire comment histories fit in a single API call. The cost per profile runs to a few cents at current pricing.
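The pagination step can be sketched independently of the HTTP client. The version below assumes that page-number pagination may stop returning hits past some depth (the exact cap for the HN index is not stated here), so it walks backwards in time using the API's `numericFilters` parameter on the `created_at_i` timestamp instead. The `fetch` callable and the name `iter_comments_by_time` are illustrative and not part of Willison's tool.

```python
from typing import Callable, Iterator, Optional

# A fetch function takes query parameters and returns the decoded JSON body.
# It is injected so the paging logic can be exercised without network access.
Fetch = Callable[[dict], dict]


def iter_comments_by_time(
    username: str, fetch: Fetch, page_size: int = 1000
) -> Iterator[dict]:
    """Yield a user's comments newest to oldest using a timestamp cursor."""
    before: Optional[int] = None  # only request comments strictly older than this
    while True:
        params = {
            "tags": f"comment,author_{username}",
            "hitsPerPage": page_size,
        }
        if before is not None:
            params["numericFilters"] = f"created_at_i<{before}"
        hits = fetch(params).get("hits", [])
        if not hits:
            return
        yield from hits
        # Move the cursor to the oldest timestamp seen so far.
        # Caveat: comments sharing that exact second across a page boundary
        # would be skipped by the strict "<" comparison.
        before = min(h["created_at_i"] for h in hits)
```

In practice `fetch` could be a small wrapper around `requests.get(...).json()` against the `search_by_date` endpoint shown above; that wiring is left out here to keep the sketch self-contained.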

Using the Anthropic Python SDK directly, the core of this looks like:

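import html
import re

import anthropic
import requests

TAG_RE = re.compile(r"<[^>]+>")  # matches any remaining HTML tag

def get_comments(username: str, max_pages: int = 5) -> list[str]:
    """Fetch a user's comments from the HN Algolia API, newest first."""
    comments = []
    for page in range(max_pages):
        resp = requests.get(
            "https://hn.algolia.com/api/v1/search_by_date",
            params={
                "tags": f"comment,author_{username}",
                "hitsPerPage": 1000,
                "page": page,
            },
            timeout=30,
        )
        resp.raise_for_status()
        hits = resp.json().get("hits", [])
        if not hits:
            break
        for h in hits:
            # comment_text can be null, so fall back to an empty string.
            raw = (h.get("comment_text") or "").replace("<p>", "\n")
            # Strip remaining tags first, then decode entities such as &lt;.
            text = html.unescape(TAG_RE.sub("", raw)).strip()
            if text:
                comments.append(text)
    return comments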

def profile_user(username: str) -> str:
    client = anthropic.Anthropic()
    text = "\n---\n".join(get_comments(username))
    msg = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                f"Based on these Hacker News comments by '{username}', write a detailed "
                "profile covering professional background, technical expertise, "
                "communication style, recurring interests, and any notable opinions "
                "or values that emerge consistently.\n\n"
                f"{text}"
            )
        }]
    )
    return msg.content[0].text
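A prolific commenter's history may not fit in the context window, in which case the request would be rejected rather than silently truncated. A minimal sketch of trimming to a budget follows. It assumes roughly four characters per token, which is a crude heuristic for English text and not the model's real tokenizer, and it leaves headroom below 200K tokens for the instructions and the response. The function name `fit_to_budget` and its defaults are illustrative.

```python
def fit_to_budget(
    comments: list[str],
    max_tokens: int = 180_000,     # assumed headroom under a 200K-token window
    chars_per_token: float = 4.0,  # rough heuristic, not a real tokenizer
    sep: str = "\n---\n",
) -> str:
    """Join comments in order, stopping before the character budget is exceeded."""
    budget = int(max_tokens * chars_per_token)
    kept: list[str] = []
    used = 0
    for comment in comments:
        # Account for the separator that precedes every comment but the first.
        cost = len(comment) + (len(sep) if kept else 0)
        if used + cost > budget:
            break
        kept.append(comment)
        used += cost
    return sep.join(kept)
```

Because the comments arrive newest first, this keeps the most recent material and drops the oldest. Whether that is the right bias for a profile is a design choice; sampling evenly across years would be an equally defensible one.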

This is plumbing, not engineering. The interesting question is what it tells us about the current moment.

What Changed

The HN comment archive has been public since the site launched. The Firebase-based HN API and the Algolia search index both predate LLMs by years. The BigQuery public dataset covering the full comment history has been available for bulk analysis since 2015. Researchers have been mining it with SQL for a decade. None of the underlying data access is new.

What changed is the inference gap. Classical analysis of comment histories required real effort: manual reading, topic modeling, careful feature engineering, and hand-labeled training data. Results were statistical and coarse. An LLM bypasses all of this. It reads comments much as a careful human reader would, draws the same lateral inferences, and produces a coherent narrative profile, but does it instantly, for anyone, at scale.

A 2023 study from ETH Zurich showed that GPT-4 could infer the approximate location of Reddit users from 1,000 comments with roughly 85% accuracy, and political affiliation with accuracy in the 70-80% range. Age estimates came within about five years on average. These are capabilities that would have required a dedicated research team and a labeled dataset to approximate with pre-transformer methods. They now exist as API calls.

The economics compound this. The Anthropic Batch API supports up to 10,000 requests per batch at a 50% cost reduction. At those rates, profiling every active HN commenter from the past six months, tens of thousands of accounts, becomes a single job costing in the hundreds of dollars. Nothing in the HN API or the model APIs prevents this; the Algolia API is rate-limited, but generously.
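The arithmetic behind that estimate can be sketched as follows. Every number in the example call is a placeholder chosen for illustration: the account count, the token counts per profile, and the per-million-token prices are assumptions, not quoted rates. Only the 50% batch discount comes from the text above.

```python
def batch_cost_usd(
    n_profiles: int,
    input_tokens: int,        # assumed average prompt size per profile
    output_tokens: int,       # assumed average response size per profile
    usd_per_mtok_in: float,   # hypothetical price per million input tokens
    usd_per_mtok_out: float,  # hypothetical price per million output tokens
    batch_discount: float = 0.5,
) -> float:
    """Estimate the total cost of profiling n_profiles accounts in batch mode."""
    per_profile = (
        input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out
    ) / 1_000_000
    return n_profiles * per_profile * (1 - batch_discount)


# Placeholder inputs: 30,000 accounts, 20K tokens in and 1.5K tokens out each.
estimate = batch_cost_usd(30_000, 20_000, 1_500, 1.0, 5.0)
print(f"estimated total: ${estimate:,.0f}")
```

Under these placeholder numbers the per-profile cost is a few cents before the discount and the total lands in the low hundreds of dollars, consistent with the order of magnitude claimed above. Real totals depend on the model chosen and on how long the comment histories actually are.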

Contextual Integrity

The philosopher Helen Nissenbaum developed the concept of contextual integrity to describe why privacy cannot be reduced to a simple public/private binary. Information shared in one context carries implicit norms about appropriate flows. A comment posted in a technical discussion on what language to use for systems programming carries expectations about who will read it and in what spirit. Those expectations are social, not legal, and they matter precisely because privacy as a lived experience depends on them.

When you aggregate 3,000 such comments and generate a psychological portrait, you are doing something that no participant in those original conversations understood themselves to be enabling. This is different from the argument that HN users should have no expectations because they posted publicly. The contextual integrity argument is more precise: the norms of the original context do not extend to the aggregated profile, even when each individual comment was explicitly public.

This framing has legal weight in some jurisdictions. GDPR treats pseudonymous data as personal data when re-identification is reasonably possible, and Article 22 restricts decisions based solely on automated processing, including profiling, when they produce legal or similarly significant effects. The “publicly available” status of HN comments is not a blanket exemption under EU law. In the United States, the Ninth Circuit’s hiQ Labs v. LinkedIn ruling held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act, but that ruling addresses access, not what you do with the data afterward.

The Self-Profile Case

Willison typically frames demonstrations like this as a way to surface capabilities openly rather than leave them as implicit background threats. The most sympathetic use case is self-profiling: running the analysis on your own account to understand what it reveals, much as you might pull a credit report on yourself. That’s a reasonable thing to want.

The problem is that the capability does not stay there. Employers, journalists, competitive intelligence analysts, and people with less defensible motivations all have access to the same API and the same models. The tool does not ask why you’re running it. Corporate profiling of public social media is a documented commercial industry; tools that lower the cost and raise the fidelity expand the range of actors who can afford to participate.

There’s also a subtler issue: the profile an LLM produces goes beyond what the user explicitly disclosed. Someone who posts frequently in security threads, occasionally mentions their employer, and once mentioned a conference they attended has not revealed much in any single comment. The LLM synthesizes across all of it and produces an inference about seniority, probable team, likely compensation band, and professional network. That inference is not pulled from any single comment. It is constructed, and the person who wrote those comments neither constructed it nor consented to its construction.

No Good Patch

There is no technical fix that resolves this cleanly. The data is public, the APIs are open, and the models will keep getting cheaper and more capable. Suggestions like injecting adversarial noise into your own comments to confuse profilers are both impractical for ordinary users and likely ineffective against models that can reason about text in context rather than matching patterns.

Platform-level mitigations, such as rate limiting on comment retrieval or requiring authentication for bulk access, could raise the friction. HN has almost none of these. The Algolia API is provided as a public good, and changing that would break legitimate uses including many developer tools, search clients, and analytics dashboards that the community has come to rely on.
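To make "rate limiting on comment retrieval" concrete, here is a minimal token-bucket sketch of the kind a platform might apply per client. It is a generic illustration, not a description of how HN or Algolia limits requests, and the class name, parameters, and injected clock are all hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so behaviour can be tested without sleeping
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits, else False."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A limiter like this slows bulk retrieval without blocking it: a patient profiler simply takes longer, which is why it raises friction without removing the capability.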

What demonstrations like Willison’s do is make the capability legible. Most people who post on HN assume a social context: a community of technical peers reading a comment in the thread where they left it. The LLM profile of your comment history is something else, a coherent synthetic document that nobody asked you to produce and that exists outside any thread. Knowing that this document can be generated, by anyone, for a few cents, in under a minute, changes what it means to post publicly. Whether it changes what you post is, for now, up to you.
