
Indexing the Web That Algorithms Left Behind

Source: hackernews

The Kagi Small Web page does something deceptively simple: it shows you recent posts from personal blogs and independent sites, updated continuously, with no algorithmic ranking and no SEO-optimized content farms in the mix. It looks like an RSS reader crossed with a news feed. What it actually represents is a considered position on what went wrong with the web and one team’s attempt to fix it through infrastructure rather than culture.

Kagi is a paid search engine built on the premise that if you’re not the product, the results get better. Founded by Vladimir Prelovac, it charges users a monthly subscription and competes on result quality rather than ad revenue. The Small Web initiative fits that thesis: it surfaces content that commercial incentives have made effectively invisible.

What the Personal Web Lost

In the early 2000s, finding interesting personal sites was a social act. Blogrolls were the mechanism: lists of links in every blog’s sidebar pointing to other blogs the author read. DMOZ, the Open Directory Project, maintained a human-curated index of the web organized by topic. Yahoo’s directory predated it. The web was small enough that curation by humans was plausible, and the signal for a good site was often another person recommending it.

PageRank changed the economics of discoverability. Google’s insight was that links were votes, and pages with more authoritative links ranked higher. For several years this worked extremely well for finding personal content too, because people genuinely linked to things they found interesting. Then SEO became an industry. Link schemes, content farms, keyword stuffing, and eventually entire businesses built around gaming PageRank made the link graph a less reliable signal. By the mid-2010s, searching Google for anything with commercial intent meant wading through affiliate content and thinly-veiled product pages. Searching for a genuine personal perspective on almost anything was harder.

Google Reader shutting down in 2013 is the symbolic inflection point, though the causes were already in motion. RSS had been the substrate of the blogosphere: the protocol that let readers follow dozens of sites without visiting each one. When Reader died, the ecosystem of RSS-aware readers thinned, and personal publishing migrated toward platforms with built-in audiences: Medium, Substack, Twitter, later Mastodon. The personal domain became less common. The blogroll all but disappeared.

How Kagi Approaches the Problem

Kagi Small Web is a curated feed backed by a maintained list of sources. Kagi keeps a collection of small, independent, non-commercial sites and crawls their RSS feeds and pages. The result is surfaced in two ways: the public feed at kagi.com/smallweb, and as a source that subscribers can boost in their personal search configuration. Kagi’s search settings let users explicitly weight domains up or down, so someone who wants personal blog posts to appear alongside or above corporate documentation can configure that.
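That per-domain weighting can be sketched as a simple reranking step. This is an illustrative sketch only, not Kagi's implementation; the domain names, multipliers, and scores are invented:

```python
# Sketch of user-configured domain weighting in ranking, in the
# spirit of "raise/lower this site" preferences. All values are
# hypothetical examples.
from urllib.parse import urlparse

USER_WEIGHTS = {           # hypothetical per-user configuration
    "myfriend.blog": 2.0,  # boosted: a personal blog
    "bigcorp.com": 0.3,    # demoted: corporate documentation
}

def rerank(results):
    """results: list of (url, base_score). Scale each score by the
    user's weight for that domain (default 1.0), then re-sort."""
    def weighted(item):
        url, score = item
        return score * USER_WEIGHTS.get(urlparse(url).hostname, 1.0)
    return sorted(results, key=weighted, reverse=True)

results = [("https://bigcorp.com/docs", 0.9),
           ("https://myfriend.blog/post", 0.5)]
for url, score in rerank(results):
    print(url)
```

With these weights the blog post (0.5 × 2.0 = 1.0) outranks the corporate page (0.9 × 0.3 = 0.27), even though its base score was lower.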

The technical approach is deliberately low-tech. RSS and Atom feeds are the primary ingestion mechanism. The feed page presents posts chronologically with no ranking beyond recency. There is no machine learning model deciding what you want to see. This is the point: the small web’s value is partly its resistance to optimization. As soon as you rank by engagement, you create pressure toward content that maximizes engagement, which is a different thing from content that is genuinely useful or interesting.
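The recency-only pipeline described above fits in a few lines: parse each source's feed, merge the entries, sort by date, done. A minimal sketch using only the standard library, with invented feed contents (not Kagi's actual code or sources):

```python
# Minimal sketch of chronological RSS ingestion: parse feeds, merge
# entries, and sort by publication date only -- no ranking model.
# Feed contents here are invented for illustration.
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

SAMPLE_FEEDS = [
    """<rss><channel><title>Blog A</title>
         <item><title>Post 1</title>
           <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate></item>
       </channel></rss>""",
    """<rss><channel><title>Blog B</title>
         <item><title>Post 2</title>
           <pubDate>Tue, 02 Jan 2024 09:00:00 GMT</pubDate></item>
       </channel></rss>""",
]

def collect_entries(feeds):
    entries = []
    for xml in feeds:
        root = ET.fromstring(xml)
        source = root.findtext("channel/title")
        for item in root.iter("item"):
            when = parsedate_to_datetime(item.findtext("pubDate"))
            entries.append((when, source, item.findtext("title")))
    # Recency is the only sort key: newest first, nothing else.
    return sorted(entries, reverse=True)

feed = collect_entries(SAMPLE_FEEDS)
for when, source, title in feed:
    print(f"{when:%Y-%m-%d}  {source}: {title}")
```

Everything an engagement-ranked feed does, this sketch deliberately omits: there is no score to optimize against.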

The comparison with Marginalia Search is instructive. Viktor Lofgren built Marginalia as an independent search engine that algorithmically favors non-commercial, text-heavy, low-JavaScript sites. Marginalia’s ranking actively penalizes the technical signals that correlate with commercial optimization: heavy tracking scripts, cookie consent dialogs, excessive JavaScript frameworks. The results feel like using the web circa 2005. You get personal pages, hobbyist wikis, old forum threads, university course pages. The approach is clever because it uses technical proxy signals to infer the kind of content Marginalia wants to surface, without requiring human curation at scale.
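The proxy-signal idea can be sketched as a weighted score over features extracted from a crawled page. The specific signals and weights below are invented for illustration; Marginalia's actual ranking is far more involved:

```python
# Illustrative sketch of proxy-signal scoring in the spirit of
# Marginalia: penalize markers that correlate with commercial
# optimization, reward markers of hand-made text-heavy pages.
# Signals and weights are invented, not Marginalia's.
PENALTIES = {
    "tracker_scripts": 3.0,   # analytics / ad-tech script tags
    "cookie_banner": 2.0,     # consent-dialog markup
    "js_framework": 1.5,      # heavy client-side frameworks
}
BONUSES = {
    "text_ratio": 2.0,        # high text-to-markup ratio
    "outbound_links": 0.5,    # links to other independent sites
}

def score_page(signals):
    """Higher score = more 'small web'. signals maps feature name
    to a count (or 0/1 flag) extracted from the crawled page."""
    score = 0.0
    for name, weight in PENALTIES.items():
        score -= weight * signals.get(name, 0)
    for name, weight in BONUSES.items():
        score += weight * signals.get(name, 0)
    return score

# A hand-written page with no tracking scores well...
print(score_page({"text_ratio": 1, "outbound_links": 4}))   # 4.0
# ...while a page heavy on ad-tech does not.
print(score_page({"tracker_scripts": 5, "cookie_banner": 1}))
```

The failure mode discussed next falls directly out of this structure: any legitimate site that trips the penalty features gets filtered along with the content farms.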

Kagi’s approach is curation. Marginalia’s approach is algorithmic inference. Both are attempts to solve the same problem but they make different tradeoffs. Curation is higher quality but doesn’t scale. Algorithmic inference scales but can be fooled and will miss sites that don’t fit its heuristics. A genuinely good personal blog running a modern static site generator with some analytics probably looks suspicious to Marginalia’s signals even if it produces excellent content.

The IndieWeb and Protocol-Level Solutions

The IndieWeb community has taken a different angle entirely: instead of trying to build a better index, build protocols that make personal sites interoperate like social networks. Webmention lets one site notify another when it links to it, creating a federated comment and reaction system. Micropub provides a standard API for posting to your own site from external clients. h-card and h-entry are microformat standards for marking up identity and content on personal pages.
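The Webmention flow has two steps: discover the target's advertised endpoint, then POST the source and target URLs as form data. A hedged sketch of that shape, with invented URLs and HTML and no request actually sent (a real client per the W3C spec also checks the HTTP `Link` header and should use a proper HTML parser):

```python
# Sketch of the Webmention flow: discover the target page's
# endpoint, then notify it with a form-encoded POST of source
# and target. URLs and HTML here are invented; nothing is sent.
import re
from urllib.parse import urlencode, urljoin

def discover_endpoint(target_url, html):
    """Find a <link rel="webmention"> endpoint in the target page
    and resolve it against the page URL."""
    m = re.search(r'<link[^>]+rel="webmention"[^>]+href="([^"]*)"', html)
    return urljoin(target_url, m.group(1)) if m else None

target = "https://example.org/some-post"
page = '<html><head><link rel="webmention" href="/webmention"></head></html>'
endpoint = discover_endpoint(target, page)

# The notification itself is a plain form-encoded POST body:
body = urlencode({"source": "https://my-blog.example/reply",
                  "target": target})
print(endpoint)  # https://example.org/webmention
print(body)
```

The protocol's appeal is exactly this minimalism: any site that can receive a POST can participate, no platform account required.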

The IndieWeb vision is that your personal domain is your identity, and the protocols allow you to participate in social conversations without surrendering your content to a platform. It is principled and technically coherent. It has not achieved mass adoption, which is the persistent problem with protocol-level solutions to what are partly social problems. The people who set up Webmention on their personal sites are already the people committed enough to maintain their own infrastructure. The gap to reach a broader audience remains.

The Gemini protocol represents an even more radical position: a new application-layer protocol deliberately simpler than HTTP, with a text-focused document format that makes it impossible to embed tracking, ads, or JavaScript. Geminispace is a genuine small web that exists in parallel to the HTTP web, accessible only with Gemini clients. It is coherent as a statement about values but is by construction invisible to ordinary web users.
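Gemini's radical simplicity is visible at the wire level: a request is a single URL line over TLS (port 1965), and a response is a status code and meta string followed by the body. A sketch of that framing, with an invented sample response and no actual connection made:

```python
# Sketch of the Gemini exchange format: one-line request, a
# status + meta header line, then the body. The sample response
# bytes are invented for illustration; no TLS connection is made.
def build_request(url):
    # The entire Gemini request: the URL followed by CRLF.
    return (url + "\r\n").encode("utf-8")

def parse_response(raw):
    header, _, body = raw.partition(b"\r\n")
    status, _, meta = header.decode("utf-8").partition(" ")
    return int(status), meta, body.decode("utf-8")

req = build_request("gemini://example.org/journal.gmi")
sample = b"20 text/gemini\r\n# My journal\nHand-written, no scripts.\n"
status, meta, body = parse_response(sample)
print(status, meta)  # 20 text/gemini
```

With no headers, cookies, or scripting in the protocol, there is simply nowhere for tracking or ads to live, which is the design's entire point.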

The Curation Problem at Scale

The honest limitation of Kagi’s approach is that human curation does not scale to the size of the web, and it encodes the biases of the curators. The sites that get added to Kagi’s small web list are the sites that Kagi’s team or users are aware of. That means English-language sites are overrepresented. Sites that are already linked from other small web communities are overrepresented. The genuinely obscure personal site with no social media presence and no links from known blogs is exactly the kind of site this effort is meant to surface, and it is exactly the kind of site least likely to be discovered for the list.

This is not unique to Kagi. DMOZ had the same problem before it was shut down. The original Yahoo directory had it. Human curation is always a projection of the curators’ social graph.

The RSS-as-ingestion approach helps: if a site publishes an RSS feed and submits it, or if Kagi finds it through links, it can be added. The HN discussion around the Small Web feature included people both submitting their own sites and asking how to get included, which is both a positive signal and a reminder that discoverability within the small web is itself a problem.

Why It Still Matters

The timing of this effort is not coincidental. AI-generated content is beginning to flood the web at a scale that makes the content farm era look modest. A system that can produce plausible SEO-optimized text on any topic at near-zero marginal cost will produce it. The result is a web where the signal-to-noise ratio in standard search continues to decline. The small web, made up of posts people wrote because they wanted to, not because they were optimizing for traffic, becomes more valuable as synthetic content becomes cheaper.

Kagi’s bet is that a paid search product can afford to surface that content even when it does not generate the engagement metrics that would justify it to an ad-supported model. The Small Web feed is essentially a statement that some content is worth surfacing because of what it is, not because of how many people clicked on it.

The personal web did not disappear after the social media era. It got smaller and quieter, but the people who maintained their own sites and wrote on their own domains kept doing it. The infrastructure around them atrophied: fewer RSS readers, less blogroll culture, no major directory. What Kagi and Marginalia and IndieWeb are all doing, in different ways, is rebuilding pieces of that infrastructure. None of them will individually restore what existed before. But the pattern of multiple independent projects converging on the same problem is a reasonable sign that the problem is real and that solutions are possible.
