A massive Google Search internal ranking documentation leak has sent shockwaves through the SEO community. The leak, which exposed over 14,000 potential ranking features, provides an unprecedented look under the hood of Google’s closely guarded search rankings system.
A man named Erfan Azimi shared a Google API doc leak with SparkToro’s Rand Fishkin, who in turn brought in Michael King of iPullRank to help distribute the story.
The leaked files originated from a Google API document commit titled “yoshi-code-bot/elixir-google-api,” which means this was not a hack or a whistleblower.
SEOs typically fall into three camps, and I suspect many people will be changing camps after this leak.
You can find all the files here, but you should know that over 14,000 possible ranking signals/features exist, and it’ll take you an entire day (or, in my case, night) to dig through everything.
I’ve read through the entire thing and distilled it into a 40-page PDF that I’m now converting into a summary for Search Engine Land.
While I provide my thoughts and opinions, I’m also sharing the names of the specific ranking features so you can search the database on your own. I encourage everyone to draw their own conclusions.
Key points from Google Search document leak
Why is Google specifically filtering for personal blogs / small sites?
Why did Google publicly say on many occasions that they don’t have a domain or site authority measurement?
Why did Google lie about their use of click data?
Why does Google have seven types of PageRank?
I don’t have the answers to these questions, but they are mysteries the SEO community would love to understand.
Things that stand out: Favorite discoveries
Google has something called pageQuality (PQ). One of the most interesting parts of this measurement is that Google is using an LLM to estimate “effort” for article pages. This value sounds helpful for Google in determining whether a page can be replicated easily.
Takeaway: Tools, images, videos, unique information and depth of information stand out as ways to score high on “effort” calculations. Coincidentally, these things have also been proven to satisfy users.
Topic borders and topic authority appear to be real
Topical authority is a concept based on Google’s patent research. If you’ve read the patents, you’ll see that many of the insights SEOs have gleaned from patents are supported by this leak.
In the algo leak, we see that siteFocusScore, siteRadius, siteEmbeddings and pageEmbeddings are used for ranking.
What are they?
Why is this interesting?
Remember when I said PageRank is deprecated? I believe nearest seed (NS) can apply in the realm of topical authority.
NS focuses on a localized subset of the network around the seed nodes. Proximity and relevance are key focus areas. It can be personalized based on user interest, ensuring pages within a topic cluster are considered more relevant without using the broad web-wide PageRank formula.
Another way of approaching this is to apply NS and PQ (page quality) together.
By using PQ scores as a mechanism for assisting the seed determination, you could improve the original PageRank algorithm further.
On the opposite end, we could apply this to lowQuality (another score from the document). If a low-quality page links to other pages, then the low quality could taint the other pages by seed association.
A seed isn’t necessarily a quality node. It could be a poor-quality node.
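The nearest-seed mechanics described above can be sketched as a personalized PageRank whose teleport vector is concentrated on the seed nodes. Everything here (the graph, the seed set, the damping factor) is illustrative, not taken from the leak; the same propagation would spread low-quality scores if the seed were a poor-quality node.

```python
# Toy sketch: "nearest seed" as personalized PageRank. Restart
# probability is concentrated on the seed node(s), so scores decay
# with link distance from the seed cluster.

def personalized_pagerank(links, seeds, alpha=0.85, iters=50):
    """links: {node: [outlinks]}; seeds: nodes the random surfer restarts at."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - alpha) * teleport[n] for n in nodes}
        for n, outs in links.items():
            if outs:
                share = alpha * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
            else:  # dangling node: return its mass to the seeds
                for m in nodes:
                    new[m] += alpha * rank[n] * teleport[m]
        rank = new
    return rank

graph = {
    "seed": ["a"], "a": ["b"], "b": ["a"],
    "c": ["d"], "d": ["c"],  # cluster with no path from the seed
}
scores = personalized_pagerank(graph, seeds={"seed"})
assert scores["a"] > scores["c"]  # proximity to the seed wins
```

If the seed were instead flagged lowQuality, the identical computation would concentrate the taint on its neighbors, which is the demotion-by-association idea above.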
When we apply site2Vec and the knowledge of siteEmbeddings, I think the theory holds water.
If we extend this beyond a single website, I imagine variants of Panda could work in this way. All that Google needs to do is begin with a low-quality cluster and extrapolate pattern insights.
What if NS could work together with OnsiteProminence (score value from the leak)?
In this scenario, nearest seed could identify how closely certain pages relate to high-traffic pages.
Image quality
ImageQualityClickSignals indicates that image quality is measured by clicks (usefulness, presentation, appealingness, engagingness). These signals are considered Search CPS Personal data.
No idea whether appealingness or engagingness are words – but it’s super interesting!
I believe NSR is an acronym for Normalized Site Rank.
Host NSR is site rank computed for host-level (website) sitechunks. This value encodes nsr, site_pr and new_nsr. It’s important to note that nsr_data_proto seems to be the newest version of this, but not much information can be found about it.
In essence, a sitechunk is a chunk of your domain, and you get site rank by measuring these chunks. This makes sense because we already know Google does this on a page-by-page, paragraph and topical basis.
It almost seems like a chunking system designed to poll random quality metric scores rooted in aggregates. It’s kinda like a pop quiz (rough analogy).
NavBoost
I’ll discuss this more, but it is one of the ranking pieces most mentioned in the leak. NavBoost is a re-ranking based on click logs of user behavior. Google has denied this many times, but a recent court case forced them to reveal that they rely quite heavily on click data.
The most interesting part (which should not come as a surprise) is that Chrome data is specifically used. I imagine this extends to Android devices as well.
This would be more interesting if we brought in the patent for the site quality score. Links have a ratio with clicks, and we see quite clearly in the leak docs that topics, links and clicks have a relationship.
While I can’t make conclusions here, I know what Google has shared about the Panda algorithm and what the patents say. I also know that Panda, Baby Panda and Baby Panda V2 are mentioned in the leak.
If I had to guess, I’d say that Google uses the referring domain and click ratio to determine score demotions.
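Since this is only my guess, here is the referring-domain-to-click ratio idea as a toy check. The ratio, the cutoff and the function name are all invented for illustration; the leak only suggests links, clicks and topics are related.

```python
# Conjectural sketch: a site whose referring domains far outpace the
# clicks it earns looks unnatural. Threshold and formula are made up.

def looks_unnatural(referring_domains, clicks, max_ratio=10.0):
    """Flag profiles where links vastly outnumber earned clicks."""
    if clicks == 0:
        return referring_domains > 0
    return referring_domains / clicks > max_ratio

assert looks_unnatural(referring_domains=5000, clicks=100)       # 50:1 ratio
assert not looks_unnatural(referring_domains=200, clicks=400)    # 0.5:1 ratio
```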
HostAge
Nothing about a website’s age is considered in ranking scores, but the hostAge is mentioned regarding a sandbox. The data is used in Twiddler to sandbox fresh spam during serving time.
I consider this an interesting finding because many SEOs argue about the sandbox and many argue about the importance of domain age.
As far as the leak is concerned, the sandbox is for spam and domain age doesn’t matter.
ScaledIndyRank appears to be an “independence rank.” Nothing else is mentioned, and ExptIndyRank3 is considered experimental. If I had to guess, this has something to do with information gain on a sitewide level (original content).
Note: It is important to remember that we don’t know to what extent Google uses these scoring factors. The majority of the algorithm is a secret. My thoughts are based on what I’m seeing in this leak and what I’ve read by studying three years of Google patents.
How to remove Google’s memory of an old version of a document
This is perhaps a bit of conjecture, but the logic is sound. According to the leak, Google keeps a record of every version of a webpage. This means Google has an internal web archive of sorts (Google’s own version of the Wayback Machine).
The nuance is that Google only uses the last 20 versions of a document. If you update a page, wait for a crawl and repeat the process 20 times, you will effectively push out certain versions of the page.
This might be useful information, considering that the historical versions are associated with various weights and scores.
Remember that the documentation has two forms of update history: significant update and update. It is unclear whether significant updates are required for this sort of version-memory tomfoolery.
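The 20-version window described above behaves like a fixed-length history buffer. The limit of 20 comes from the leak; everything else in this sketch is illustrative.

```python
from collections import deque

# Model "only the last 20 versions are retained" as a bounded deque:
# appending past the maxlen silently evicts the oldest entry.
MAX_VERSIONS = 20

history = deque(maxlen=MAX_VERSIONS)
history.append("old version you want forgotten")

# Update the page and let it be recrawled 20 times...
for i in range(MAX_VERSIONS):
    history.append(f"updated version {i + 1}")

# ...and the original version has been pushed out of the window.
assert "old version you want forgotten" not in history
assert len(history) == MAX_VERSIONS
```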
Google Search ranking system
While it’s conjecture, one of the most interesting things I found was the term weight (literal size).
This would indicate that bolding your words or the size of the words, in general, has some sort of impact on document scores.
Interestingly, the standard hard drive is used for irregularly updated content.
Google’s indexer now has a name: Alexandria
Go figure. Google would name the largest index of information after the most famous library. Let’s hope the same fate does not befall Google.
Two other indexers are prevalent in the documentation: SegIndexer and TeraGoogle.
The section titled “GoogleApi.ContentWarehouse.V1.Model.QualityNsrNsrData” mentions a factor named isElectionAuthority. The leak says, “Bit to determine whether the site has the election authority signal.”
This is interesting because it might be what people refer to as “seed sites.” It could also be topical authorities or websites with a PageRank of 9/10 (Note: toolbarPageRank is referenced in the leak).
It’s important to note that nsrIsElectionAuthority (a slightly different factor) is considered deprecated, so who knows how we should interpret this.
This specific section is one of the most densely packed sections in the entire leak.
Surprise, surprise! Short content does not equal thin content. I’ve been trying to prove this with my cocktail recipe pages, and this leak confirms my suspicion.
Interestingly enough, short content has a different scoring system applied to it (not entirely unique but different to an extent).
Fresh links seem to trump existing links
This one was a bit of a surprise, and I could be misunderstanding things here. According to freshdocs, a link value multiplier, links from newer webpages are better than links inserted into older content.
Obviously, we must still incorporate our knowledge of a high-value page (mentioned throughout this presentation).
Still, I had this one wrong in my mind. I figured age would be a good thing, but in reality, it isn’t really the age that gives a niche edit value; it’s the traffic or internal links to the page (if you go the niche edit route).
This doesn’t mean niche edits are ineffective. It simply means that links from newer pages appear to get an unknown value multiplier.
Quality NsrNsrData
Here is a list of some scoring factors that stood out most from the NsrNsrData document.
It seems like site authority and a host of NSR-related scores are all applied in Qstar. My best guess is that Qstar is the aggregate measurement of a website’s scores. It likely includes authority as just one of those aggregate values.
Scoring in the absence of measurement
nsrdataFromFallbackPatternKey. If NSR data has not been computed for a chunk, then data comes from an average of other chunks from the website. Basically, you have chunks of your site that have values associated with them, and these values are averaged and applied to the unknown document.
Google is making scores based on topics, internal links, referring domains, ratios, clicks and all sorts of other things. If normalized site rank hasn’t been computed for a chunk (Google used chunks of your website and pages for scoring purposes), the existing scores associated with other chunks will be averaged and applied to the unscored chunk.
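That fallback-averaging behavior is simple enough to sketch directly. The chunk names and score values below are invented; only the "average the known chunks" mechanism comes from the leak's description of nsrdataFromFallbackPatternKey.

```python
# Sketch of the fallback: an unscored chunk inherits the mean of the
# chunks that do have a computed NSR value.

def chunk_score(scores, chunk):
    """Return the chunk's score, or the site-wide average as a fallback."""
    if chunk in scores:
        return scores[chunk]
    known = list(scores.values())
    return sum(known) / len(known)

site_chunks = {"/recipes": 0.8, "/reviews": 0.6, "/news": 0.7}

assert chunk_score(site_chunks, "/recipes") == 0.8
# New, unscored section gets the average: (0.8 + 0.6 + 0.7) / 3 = 0.7
assert abs(chunk_score(site_chunks, "/brand-new-section") - 0.7) < 1e-9
```

This is also why inconsistent quality is risky: every weak chunk drags down the average that your unscored pages inherit.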
I don’t think you can optimize for this, but one thing has been made abundantly clear:
You need to really focus on consistent quality, or you’ll end up hurting your SEO scores across the board by lowering your score average or topicality.
Demotions to watch out for
Much of the content from the leak focused on demotions that Google uses. I find this as helpful as (maybe even more helpful than) the positive scoring factors.
Key points:
It’s important to note that click satisfaction scores aren’t based on dwell time. If you continue searching for information NavBoost deems to be the same, you’ll get the scoring demotion.
A unique part of NavBoost is its role in bundling queries based on interpreted meaning.
How is no one talking about this one? An entire page dedicated to anchor text observation, measurement, calculation and assessment.
At the end of it all, you get spam probability and a spam penalty.
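To make the anchor-text idea concrete, here is a purely illustrative spam-probability sketch: treat a high share of identical, exact-match anchors as the spam signal. The formula and threshold are invented for the example; the leak only says that a spam probability and a penalty exist.

```python
from collections import Counter

# Toy heuristic: what fraction of inbound links reuse the single most
# common anchor text? Natural profiles are diverse; spammed ones are not.

def anchor_spam_probability(anchors):
    """Share of links using the most repeated anchor text."""
    if not anchors:
        return 0.0
    top_count = Counter(anchors).most_common(1)[0][1]
    return top_count / len(anchors)

natural = ["here", "Acme Co", "this guide", "acme.com", "source"]
spammy = ["buy cheap widgets"] * 8 + ["here", "source"]

assert anchor_spam_probability(natural) < 0.5   # diverse anchors
assert anchor_spam_probability(spammy) == 0.8   # 8 of 10 identical
```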
Here’s a big spoonful of unfairness, and it doesn’t surprise any SEO veterans.
trustedTarget is a metric associated with spam anchors, and it says “True if this URL is on trusted source.”
When you become “trusted” you can get away with more, and if you’ve investigated these “trusted sources,” you’ll see that they get away with quite a bit.
On a positive note, Google has a Trawler policy that essentially appends “spam” to known spammers, and most crawls auto-reject spammers’ IPs.
9 pieces of actionable advice to consider
This is not a perfect depiction of Google’s algorithm, but it’s a fun attempt to consolidate the factors and express the leak as a mathematical formula (minus the precise weights).
R: Overall ranking score
UIS (User Interaction Scores)
CQS (Content Quality Scores)
LS (Link Scores)
RB (Relevance Boost): Relevance boost based on query and content match
QB (Quality Boost): Boost based on overall content and site quality
CSA (Content-Specific Adjustments): Adjustments based on specific content features on SERP and on page
R = [ (w1⋅UgcScore + w2⋅TitleMatchScore + w3⋅ChromeInTotal + w4⋅SiteImpressions + w5⋅TopicImpressions + w6⋅SiteClicks + w7⋅TopicClicks)
+ (v1⋅ImageQualityClickSignals + v2⋅VideoScore + v3⋅ShoppingScore + v4⋅PageEmbedding + v5⋅SiteEmbedding + v6⋅SiteRadius + v7⋅SiteFocus + v8⋅TextConfidence + v9⋅EffortScore)
+ (x1⋅TrustedAnchors + x2⋅SiteLinkIn + x3⋅PageRank) ]
× (TopicEmbedding + QnA + STS + SAS + EFTS + FS)
+ (y1⋅CDS + y2⋅SDS + y3⋅EQSS)
Generalized scoring overview
Generalized Formula: [(User Interaction Scores + Content Quality Scores + Link Scores) × (Relevance Boost + Quality Boost) + Content-Specific Adjustments] − (Demotion Score Aggregate)
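The generalized formula can be run as a toy calculation. Every weight and input below is a placeholder I made up; only the shape of the formula comes from the consolidation above.

```python
# Numeric sketch of:
# R = (UIS + CQS + LS) x (Relevance Boost + Quality Boost)
#     + Content-Specific Adjustments - Demotion Score Aggregate

def ranking_score(uis, cqs, ls, relevance_boost, quality_boost,
                  adjustments, demotions):
    """Combine the aggregate scores exactly as the generalized formula does."""
    return (uis + cqs + ls) * (relevance_boost + quality_boost) \
        + adjustments - demotions

R = ranking_score(uis=0.6, cqs=0.7, ls=0.5,
                  relevance_boost=1.2, quality_boost=1.1,
                  adjustments=0.3, demotions=0.4)

# (0.6 + 0.7 + 0.5) * (1.2 + 1.1) + 0.3 - 0.4 = 4.04
assert abs(R - 4.04) < 1e-9
```

The structure makes the leak’s emphasis visible: demotions subtract after everything else, so a penalty can erase gains from otherwise strong content and link scores.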