<Rant alert>
Divide and conquer! One of the best-known strategies in ancient politics that is as alive as ever. This principle is probably behind every imperial expansion in history, since it's much easier to defeat and absorb smaller tribes that don't like each other than when they are united.
Sadly, we're seeing this principle at play today as it relates to data and information. Note that I am talking about siloed public data, not siloed private data. Private data typically sits behind a login or a paywall. Public data, on the other hand, is freely accessible and has often been created by users rather than painstakingly collected by the company running the respective service (let's call these companies publishers). Users generate this content fully aware (usually) that it is going to be public. By definition, public data is public (it's weird that I have to say this).
Quite legitimately, publishers of public content are trying to make money off it, although the business model is not always obvious or optimal. If you're lucky, there will be an API with such data and clear documentation. But often your only option is manual browsing or scraping. And this is where the problems really begin if you want to index or scrape a website. Scraping puts a load on the host's servers and bandwidth (which might be interpreted as a denial-of-service attack when done irresponsibly), and to protect themselves, such hosts use a robots.txt file, which tells web crawlers which of them may crawl which parts of the site. The problem is that rather than devise rules that would allow scraping without affecting performance, many publishers simply prohibit everyone but the major search engines from crawling the website.
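To make the robots.txt point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. Both robots.txt files, the crawler name `SmallStartupBot`, and the `example.com` URL are made up for illustration and are not taken from any real site. The first file mirrors the "search engines only" pattern described above; the second shows one possible alternative that admits everyone but asks for a pause between requests. Note that `Crawl-delay` is a widely recognized but non-standard directive, and not every crawler honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only two named search-engine crawlers get in.
SEARCH_ENGINES_ONLY = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""

# Hypothetical alternative: everyone may crawl, but is asked to wait
# 10 seconds between requests (Crawl-delay is a non-standard directive).
RATE_LIMITED_FOR_ALL = """\
User-agent: *
Crawl-delay: 10
Allow: /
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    url = "https://example.com/profiles/42"  # made-up public page
    for label, text in [("search engines only", SEARCH_ENGINES_ONLY),
                        ("rate limited for all", RATE_LIMITED_FOR_ALL)]:
        rules = load_rules(text)
        for agent in ("Googlebot", "SmallStartupBot"):
            print(label, agent,
                  "allowed:", rules.can_fetch(agent, url),
                  "delay:", rules.crawl_delay(agent))
```

Under the first policy a well-behaved small crawler has to stay out entirely, while under the second it is treated like everyone else and simply slows down. Either way, robots.txt is only a request: it works because polite crawlers choose to obey it.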
On the one hand, this seems perfectly fine: every website wants to be indexed by a search engine so that users can find it easily through web searches. But the real issue here is that this essentially excludes any other service from accessing the data, even if it is otherwise freely available. Representatives of HiQ, a company that sued LinkedIn over access to its public data, have expressed this rather well:
“HiQ believes that public data must remain public, and innovation on the internet should not be stifled by legal bullying or the anti-competitive hoarding of public data by a small group of powerful companies.”
And this is where big search engines essentially manage to divide and conquer: under the veil of search engine "indexing" they get exclusive access to tons of nominally public data that few others can easily index for their own purposes. Small players must negotiate access to each data repository separately (or even pay for a license), while the behemoths get it for free and grow bigger in the process, stifling innovation and competition.
Now, some may object that everyone has a right to make money off their data and determine who gets it for what price. At a high level, I agree. But I have some doubts about the validity of how sites implement this:
- The content I am talking about here is freely available by simply browsing. Hence it is public. Moreover, much content is user-generated and these users made a decision to have it publicly available (such as their LinkedIn profiles).
- Since large search engines get such data for free, they can extract a lot of useful information from it at zero cost. Why a smaller player should not be able to have the same level of access is not at all clear to me.
Granted, not everyone provides indexing benefits, but any public data used in a third-party app can (and should) be labeled appropriately, serving as a direct promotion channel. A user is a user, whether she accesses content via a search engine, browsing a website directly, or through an aggregator. It's the same pair of eyes, after all.
There are more nuances to all of this, of course, but the truth of the matter is that the internet is undergoing a phase of centralization, with the largest players using their monopolistic positions as an unfair advantage. The upside is that monopolists ignore many smaller, highly specialized niches, leaving opportunities for small players to evolve using different strategies. Our niche is startup technology due diligence. Don't let yourself be divided and conquered.