Over the last few weeks, I tried to understand how Google works. I came to the realization that, as a startup founder, it’s essential to know how to attract free traffic to your website. For instance, I run Nimonikku, a site for learning Japanese with small tools like a hiragana-to-romaji converter, and the site ranks for almost no search terms on Google. At best, Google sends 1-3 visitors a day. This makes Nimonikku unsustainable. Hence, I needed to act. After reading up on how to rank better on the most popular SEO sites like Moz and Ahrefs, later even reporting a private backlink network, and ranking on the first page of Google, I slowly started to understand how Google’s search engine works. Easy, right?
How does a Search Engine’s Algorithm work?
First, let’s define what a search engine must do. We can split it up into 2 parts:
- Front-end: Users enter their search query and expect relevant results.
- Back-end: So-called bots / crawlers / spiders sift through websites and index them according to their content.
Both parts sound rather trivial, but they aren’t. In order to find the best results for a search query, the engine determines the searcher’s real intent and matches the query with the index it built from crawling the web. As you can imagine, natural language processing plays a big role.
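To make the two parts more concrete, here is a minimal sketch of an inverted index, the classic data structure for matching a query against crawled pages. It is only an illustration under heavy assumptions: the page URLs and texts are made up, the tokenizer just splits on whitespace, and there is no ranking or intent detection at all.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(pages):
    """Back-end step: map every term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Front-end step: return the URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical pages standing in for crawled content.
pages = {
    "example.com/romaji": "convert hiragana to romaji online",
    "example.com/kanji": "learn kanji with flashcards",
    "example.com/kana": "hiragana and katakana chart",
}
index = build_index(pages)
print(search(index, "hiragana romaji"))
```

A real engine would add stemming, synonyms, and a ranking function on top of this lookup, which is where the hard part starts.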
However, that’s not all. Because of the internet’s boundless size, you must distribute your index across thousands of machines while answering search queries in less than a second. Have you ever seen a loading screen on Google or Bing?
For the back-end, you can already assume that indexing the whole web will take some time. According to Netcraft’s survey, there were 1.8 billion sites accessible in January 2018. That’s a lot of websites to load and process.
These sites aren’t static. Think of any news site, a blog like this one, or any Twitter feed. They change continuously, and the job of a search engine is to always deliver the most up-to-date results. This boils down to a scalability problem. Search engines must reduce the problem size by prioritizing high-value sites and discarding scammy and low-value ones.
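One way to picture this prioritization is a crawl queue ordered by urgency. The sketch below is purely an assumption on my part and not how any real search engine schedules its crawlers: the scoring formula, the site names, and the numbers are all hypothetical.

```python
import heapq


def crawl_priority(value_score, hours_since_last_crawl, change_rate):
    """Higher means more urgent: valuable pages that change often and were not visited recently."""
    return value_score * change_rate * hours_since_last_crawl


def schedule(sites, budget):
    """Return the names of the `budget` most urgent sites, most urgent first."""
    heap = []
    for name, (value, hours, rate) in sites.items():
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(heap, (-crawl_priority(value, hours, rate), name))
    return [heapq.heappop(heap)[1] for _ in range(min(budget, len(heap)))]


# Hypothetical sites: (value score 0..1, hours since last crawl, changes per hour).
sites = {
    "news-site": (0.9, 2, 4.0),
    "personal-blog": (0.6, 48, 0.02),
    "spam-farm": (0.0, 100, 10.0),
}
print(schedule(sites, 2))
```

With a value score of zero, the scammy site never gets ahead of the others, no matter how often it changes.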
But a solution to these two problems shouldn’t be that difficult to find, given the large number of open-source tools available on GitHub. Consequently, you wouldn’t start from square one like 10 or 20 years ago. And the affordability of servers from virtual private server companies gives you the opportunity to build a fleet of low-cost crawlers.
For instance, Linode offers virtual private servers for just $5/month with unlimited inbound traffic, perfect for downloading the whole web regularly.
It’s like owning your own car repair shop. You order your chassis from company A, your engine from company B, and your lights from company C. You put on your own glue, and if you succeed, you have manufactured a car. That’s one of the reasons why so many internet startups emerged over the past years. If you know where you can buy part X and know how to connect it to part Y, you have established a business, provided someone wants to pay for your product.
“The impressive thing is that free, open, and *good* solutions for all these ridiculously hard problems already exist and as a developer you just have to glue them together to produce something that 10 years ago would have taken a team of 100 people a year to make” (Cory Zue, @czue)
Why I want to create an alternative to Google
Isn’t Google already good enough? If I’m not happy with the results, why don’t I use Bing, DuckDuckGo, Baidu, or Yandex? If you have tried them yourself, you will know that the quality of their search results is rather poor. So, why do I want something else? Am I attempting mutiny? Hold off on grabbing your pitchforks just yet.
To tell the truth, I have a few reasons why Google should have more competitors. But first, we should look at the current competition. You probably know the biggest one already. It’s Bing. Have you ever tried it? Let’s just give them an A for effort.
If we try to search for a fairly popular term on Bing like “angular.io”, we will see one of its biggest problems. Can you spot it? The description is complete nonsense. Why is that so?
But why should there be anything other than Google if it already lives up to the task? Think about it: search engines are one of the most used tools by almost everyone who accesses the internet. Yet the level of competition, or the number of new startups entering the space, is minuscule if not nonexistent. How much money do you think is in that space? Google’s stock is trading at $1254 as of writing this post (end of August 2018). Shouldn’t there be a literal gold rush happening right now?
Let’s take a look at the list of “general search engines” aggregated on Wikipedia for multilanguage input. We can find the following:
- DuckDuckGo (powered by Bing)
- Ecosia (powered by Bing)
- Munax (discontinued)
- Qwant (powered by Bing)
- Yahoo (powered by Bing)
If we remove the entries that are merely powered by Bing, only Ask.com, Bing, Exalead, Gigablast, and Yandex remain. Do you call this competition? Especially when new alternative search engines take their results from Bing and slap on traits like “Privacy oriented” (duckduckgo.com) or “We plant trees” (ecosia.org). I don’t mind these motives. I just don’t call this “competition”.
A search engine’s main goal should be to deliver the most relevant and helpful website / resource for a search query. This means actually trying to figure out what the intent of a query is, then looking into an index, and selecting the most fitting entries.
If we look on Product Hunt, a site for hip and new projects, and search for “search engine”, we won’t be greeted with a prettier picture.
- Star Wars Search Engine: Find your most wanted Star Wars facts and figures
- X-Ray Search Engine: Find other people’s email
- Searchin - The Sneaker search engine: Come on!
The lack of competition makes Google a monopoly. This allows them to escape unscathed occasionally. Even the EU stepped in and fined Google for manipulating search results in favor of their own products. As pointed out in an article by the Independent, Margrethe Vestager, who is in charge of competition policy, said:
Google abused its market dominance as a search engine by promoting its own comparison shopping service in its search results, and demoting those of competitors.
If this wasn’t enough, take for example Google’s recent expansion into displaying more and more featured snippets. This practice results in fewer clicks for all the search results ranked below the first position. In and of itself, it’s good for consumers because Google satisfies the searcher’s query faster by cutting out the visit to someone else’s site. But if your website doesn’t attract any visitors, it might as well not exist.
In March of this year, Google even went so far as to remove alternative results altogether, as outlined by Moz’s article on Zero result search engine results pages (SERPs). Go figure. This has already been rolled back given the outcry it caused.
If we dive even deeper into Google’s search engine preferences, we see that entering a new space and getting listed in the search results becomes harder and harder. Big companies that have existed for a long time have created thousands of pages which rank higher than yours even though nobody linked to their pages. Hint: Quora. The mere fact that their website maintains a higher domain authority puts them above you, even though your content may be more useful to the searcher.
As another example, let’s take the search term “web analytics heatmap”. We want to find the web analytics heatmap site of Clifford Oravec, who also writes about bootstrapping companies. Yet, because nobody had linked to his page using these terms until now, it would never have shown up in Google’s search results for said term. Note: Because I linked to him with the preceding link, he may actually rank for that search term by the time you read this article. This is just awful.
Google thinks that it’s better to display blog posts about sites with lists of web analytics heatmaps rather than the sites that solve the problem itself. Finding sites that actually give me a solution for web analytics heatmaps was my real intent by searching for it. Nevertheless, Google couldn’t figure it out.
You may argue that a searcher could have looked for these articles and I would counter that they should have entered “best web analytics heatmaps” or even better “comparison of web analytics heatmap services” instead. These terms better reflect the real intent of these prevalent listicles. But, even if you go back 2, 3, 4 or even 5 pages, you won’t find gettamboo.com.
But, but, didn’t Google strive to be better at natural language processing and become the site for finding the most relevant websites?
Let’s get a little conspiratorial. Imagine that Google returned the most relevant sites even if there aren’t any backlinks with specifically engineered anchor texts. And let’s assume the domain authority of said sites didn’t influence the rankings of other sites on the same domain. What would that do to Google?
Their ad revenues would tank. Today, small website owners depend on ads to attract visitors from Google search, because sites that link to a great resource don’t know that using the phrase “Click here” doesn’t help the target site rank for the topics mentioned on the origin site. And if Google indeed ranked sites based on the context of a link, fewer people would need to pay for ads, which in turn would make Google’s stockholders miserable.
You may now say that Google has to make tradeoffs or people will exploit the system. You are right but the current system is also exploitable. So, pick your poison.
I would even go so far as to say that their whole algorithm was flawed from the start. All these updates over the past years are more or less patches which try to fix exploitable issues in the original algorithm. In my eyes, this is an opportunity for innovative startups to tackle it from scratch. But please, if you succeed, don’t give in to big checks from Apple, Google, Microsoft, and the like.
I call this a gap in the market!
Why I won’t create a better search engine than Google
Given the enthusiasm that I have for spinning up a real competitor to Google, why is it that I won’t do it? Well, it’s god damn difficult. Even Google struggles with figuring out what a page is about. Their initiative for pushing schema.org’s structured data markup on website owners speaks for itself.
As you can imagine, natural language processing is hard. Not only does a search engine see a website differently than a human does, it must also filter out irrelevant parts like footers, navigation bars, ads, you name it. This is a huge challenge.
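To give a feeling for the filtering step, here is a small sketch that drops text inside typical boilerplate tags using Python’s built-in `html.parser`. The set of skipped tags and the sample page are my own assumptions. Real pages rarely mark up their navigation and ads this cleanly, which is exactly why the problem is hard.

```python
from html.parser import HTMLParser

# Assumption: these tags usually hold boilerplate rather than the main content.
SKIPPED_TAGS = {"nav", "footer", "header", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the page text with boilerplate sections removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page used only for illustration.
sample = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<main><h1>Hiragana to romaji</h1><p>Type kana and get romaji.</p></main>
<footer>Copyright notice</footer>
</body></html>
"""
print(extract_main_text(sample))
```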
The technical challenges aside, going after the best search engine will produce other hurdles as well. One of them is: How will you market your new search engine to consumers? The spots for “We plant trees” and “We care for your privacy” are already taken.
My idea was to write the search engine in a new and popular programming language, namely Rust. The community in Rust is quite energetic and likes to promote cool projects which make an effort to spread Rust’s usage in various fields.
Take, for example, Mozilla’s attempt at a highly parallelized browser engine called Servo, or another developer’s effort to write an operating system from scratch called Redox-OS. Members of the Rust community know them because of their complexity and the boldness it takes to enter these spaces.
However, as was pointed out during a discussion with Clifford, this tactic can be futile. For instance, there is a project which aims to create a better Slack variant by developing it in Elm, amongst other improvements. Slack is a tool for real-time communication between company employees through channels. Given that I, and maybe you too, have never written a single line of Elm, why should we care about that project at all? The same would hold if I were to write the search engine in Rust.
Given that Rust is a fairly new language, I would run into hiring problems in the future because there aren’t enough proficient Rust developers around the world to help me expand my company. If you are a college grad or attending college right now, are there any courses for Rust or, let’s say, even Elm? Probably not.
I would also argue that given the context in which I came up with the idea of writing a search engine, I would be wise not to pursue it. Looking at the graph above, I should be somewhere around the peak of thinking: I know it all. But in fact, that isn’t the case.
One of the biggest counterarguments against going after that business is time. Clifford also pointed that out to me. Due to the complexity of writing a search engine, you can expect that you won’t be done in 1, 2, or 3 months, and probably not even in a single year. This would have been the case if I had chosen Rust as the primary language. Additionally, finding other avid Rust developers will be hard because there aren’t any in my social circles.
To make Clifford’s objection of time clearer, I made an illustration of a person’s productive time during their lifetime.
As you can see, people have a very limited amount of productive time at their disposal. This makes it even more important to choose the projects in which you are most likely to succeed. Lengthy projects should also yield a higher return on investment. And the most valuable thing to invest is your time.
If we assume, as in the graph above, that the average lifespan of a person is 100 years (I know that’s wrong) and you were to invest 1 year to build a search engine, you would invest not just 1% of your productive time, but in reality 3.3%. How did I come up with that? If you deduct everything else, you are left with a total of 30 years of productive time, and 1 year out of 30 is about 3.3%. This realization will from now on shape whether I pursue future projects, and hopefully yours too.
To conclude, do I think that it’s possible to compete with Google and the rest of the existing search engines? Totally! Do I think that this endeavor can be tackled on your own? Probably not. Do I think that it’s doable if you were a team of 10 people with 3 linguists, 4 marketers and 3 developers? More or less.
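The arithmetic behind these percentages fits in a few lines. The function name is made up for this sketch, and the 100 and 30 years are the simplified figures from this article, not real statistics.

```python
def share_of_productive_time(project_years, available_years):
    """Return the project duration as a percentage of the available years."""
    return 100 * project_years / available_years


LIFESPAN_YEARS = 100    # the article's simplified lifespan
PRODUCTIVE_YEARS = 30   # the article's estimate after all deductions

naive_share = share_of_productive_time(1, LIFESPAN_YEARS)
real_share = share_of_productive_time(1, PRODUCTIVE_YEARS)
print(f"{naive_share:.1f}% of a lifetime, {real_share:.1f}% of productive time")
```

Plugging in a 3-year project instead gives 10% of your productive time, which is why the duration of a project matters so much.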
If you developed a strong opinion while reading this article, I hereby invite you to be my guest. And yes, I know. Patents.