Secrets from the Algorithm: Google Search’s Internal Engineering Documentation Has Leaked

Learn what you always wish you knew about Google's algorithms.

May 30, 2024

Google, if you’re reading this, it’s too late. 😉

Ok. Cracks knuckles. Let’s get right to it. Internal documentation for Google Search’s Content Warehouse API has leaked. Google’s internal microservices appear to mirror what Google Cloud Platform offers, and the internal version of the documentation for the deprecated Document AI Warehouse was accidentally published to a public code repository for the client library. The documentation for this code was also captured by an external automated documentation service.

Based on the change history, this code repository mistake was fixed on May 7th, but the automated documentation is still live. In an effort to limit potential liability, I won’t link to it here, but because all the code in that repository was published under the Apache 2.0 license, anyone who came across it was granted a broad set of rights, including the ability to use, modify, and distribute it anyway.

A screenshot of the internal version of documentation for the deprecated Document AI Warehouse where Google had accidentally exposed the content warehouse

I have reviewed the API reference docs and contextualized them with some other previous Google leaks and the DOJ antitrust testimony. I’m combining that with the extensive patent and whitepaper research done for my upcoming book, The Science of SEO. While there is no detail about Google’s scoring functions in the documentation I’ve reviewed, there is a wealth of information about data stored for content, links, and user interactions. There are also varying degrees of descriptions (ranging from disappointingly sparse to surprisingly revealing) of the features being manipulated and stored.

You’d be tempted to broadly call these “ranking factors,” but that would be imprecise. Many, even most, of them are ranking factors, but many are not. What I’ll do here is contextualize some of the most interesting ranking systems and features (at least, those I was able to find in the first few hours of reviewing this massive leak) based on my extensive research and things that Google has told/lied to us about over the years.

“Lied” is harsh, but it’s the only accurate word to use here. While I don’t necessarily fault Google’s public representatives for protecting their proprietary information, I do take issue with their efforts to actively discredit people in the marketing, tech, and journalism worlds who have presented reproducible discoveries. My advice to future Googlers speaking on these topics: Sometimes it’s better to simply say “we can’t talk about that.” Your credibility matters, and when leaks like this and testimony like the DOJ trial come out, it becomes impossible to trust your future statements.

THE CAVEATS

I think we all know people will work to discredit my findings and analysis from this leak. Some will question why it matters and say “but we already knew that.” So, let’s get the caveats out of the way before we get to the good stuff.

Limited Time and Context – With the holiday weekend, I’ve only been able to spend about 12 hours or so in deep concentration on all this. I’m incredibly thankful to some anonymous parties that were super helpful in sharing their insights with me to help me get up to speed quickly. Also, similar to the Yandex leak I covered last year, I do not have a complete picture. Where we had source code to parse through and none of the thinking behind it for Yandex, in this case we have some of the thinking behind thousands of features and modules, but no source code. You’ll have to forgive me for sharing this in a less structured way than I will in a few weeks after I’ve sat with the material longer.
No Scoring Functions – We do not know how features are weighted in the various downstream scoring functions. We don’t know if everything available is being used. We do know some features are deprecated. Unless explicitly indicated, we don’t know how things are being used. We don’t know where everything happens in the pipeline. We have a series of named ranking systems that loosely align with how Google has explained them, how SEOs have observed rankings in the wild, and how patent applications and IR literature explain them. Ultimately, thanks to this leak, we now have a clearer picture of what is being considered that can inform what we focus on vs. ignore in SEO moving forward.
Likely the First of Several Posts – This post will be my initial stab at what I’ve reviewed. I may publish subsequent posts as I continue to dig into the details. I suspect this article will lead to the SEO community racing to parse through these docs and we will, collectively, be discovering and recontextualizing things for months to come.
This Appears to Be Current Information – As best I can tell, this leak represents the current, active architecture of Google Search Content Storage as of March of 2024. (Cue a Google PR person saying I’m wrong. Actually let’s just skip the song and dance, y’all). Based on the commit history, the related code was pushed on Mar 27, 2024 and not removed until May 7, 2024.

Correlation is not causation – Ok, this one doesn’t really apply here, but I just wanted to make sure I covered all the bases.

THERE ARE 14K RANKING FEATURES AND MORE IN THE DOCS

There are 2,596 modules represented in the API documentation with 14,014 attributes (features) that look like this:

Screenshot of API Documentation with the following text:

GoogleApi.ContentWarehouse.V1.Model.CompressedQualitySignals

A message containing per doc signals that are compressed and included in Mustang and TeraGoogle. For TeraGoogle, this message is included in perdocdata which means it can be used in preliminary scoring. CAREFUL: For TeraGoogle, this data resides in very limited serving memory (Flash storage) for a huge number of documents. Next id: 43

Attributes
* ugcDiscussionEffortScore (type: integer(), default: nil) - UGC page quality signals. (Times 1000 and floored)
* productReviewPPromotePage (type: integer(), default: nil) -
* experimentalQstarDeltaSignal (type: number(), default: nil) - This field is not propagated to shards. It is meant to be populated at serving time using one of the versions present in the experimental_nsr_team_wsj_data field above (using the ExperimentalNsrTeamDataOverridesParams option to populate it; see http://source/search? ExperimentalNsrTeamDataOverridesParams%20file:ascorer.proto). The purpose of this field is to be read by an experimental Q* component, in order to quickly run LEs with new delta components. See go/oDayLEs for details.
* productReviewPDemoteSite (type: integer(), default: nil) - Product review demotion/promotion, confidences. (Times 1000 and floored)
* experimentalQstarSiteSignal (type: number(), default: nil) - This field is not propagated to shards. It is meant to be populated at serving time using one of the versions present in the experimental_nsr_team_wsj_data field above (using the ExperimentalNsrTeamDataOverridesParams option to populate it; see http://source/search? ExperimentalNsrTeamDataOverridesParams%20file:ascorer.proto). The purpose of this field is to be read by an experimental Q* component, in order to quickly run LEs with new site components. See go/oDayLEs for details.
* exactMatchDomainDemotion (type: integer(), default: nil) - Page quality signals converted from fields in proto QualityBoost in quality/q2/proto/quality-boost.proto. To save indexing space, we convert (cut off)
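
Several attributes in the screenshot text above are annotated “Times 1000 and floored.” Here is a minimal Python sketch of what that kind of fixed-point compression could look like. The function names and the example score are hypothetical, and the documentation does not show how Google actually performs the conversion.

```python
import math

SCALE = 1000  # "Times 1000", per the attribute descriptions


def compress_signal(score: float) -> int:
    """Hypothetical helper: scale a float score and floor it to an integer."""
    return math.floor(score * SCALE)


def decompress_signal(stored: int) -> float:
    """Hypothetical inverse: recover an approximation of the original score."""
    return stored / SCALE


stored = compress_signal(0.73456)   # made-up score, for illustration only
approx = decompress_signal(stored)
print(stored, approx)
```

The point of the sketch is the trade-off the docs hint at: storing a small integer instead of a float saves space across a huge number of documents, at the cost of throwing away everything past the third decimal place.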

The modules are related to components of YouTube, Assistant, Books, video search, links, web documents, crawl infrastructure, an internal calendar system, and the People API. Just like Yandex, Google’s systems operate on a monolithic repository (or “monorepo”) and the machines operate in a shared environment. This means that all the code is stored in one place and any machine on the network can be a part of any of Google’s systems.

The leaked documentation outlines each module of the API and breaks them down into summaries, types, functions, and attributes. Most of what we’re looking at are the property definitions for various protocol buffers (or protobufs) that get accessed across the ranking systems to generate SERPs (Search Engine Result Pages – what Google shows searchers after they perform a query).
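
To make “property definitions” a bit more concrete, here is a rough Python stand-in for how a module with typed attributes that default to nil might be modeled. This is not Google’s protobuf definition. The class name, the snake_case field names, and the helper function are my own assumptions, loosely based on three attribute names visible in the CompressedQualitySignals screenshot.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CompressedQualitySignalsSketch:
    """Hypothetical stand-in for a documented module; not the real schema."""
    # Each attribute is documented as "type: integer(), default: nil",
    # which is mirrored here as Optional[int] defaulting to None.
    ugc_discussion_effort_score: Optional[int] = None
    product_review_p_demote_site: Optional[int] = None
    exact_match_domain_demotion: Optional[int] = None


def populated_fields(message) -> dict:
    """Return only the attributes that were actually set on the message."""
    return {
        f.name: getattr(message, f.name)
        for f in fields(message)
        if getattr(message, f.name) is not None
    }


# A made-up message with a single attribute populated.
msg = CompressedQualitySignalsSketch(ugc_discussion_effort_score=734)
print(populated_fields(msg))
```

The takeaway matches the caveats above: the docs tell us which slots exist and what type they hold, not which ones are filled in for a given document or how a scoring function reads them.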
