Google Data Leak Clarification?


Over the US holidays, a few posts surfaced about an alleged leak of Google ranking-related data. The first posts about the leak concentrated on “confirming” long-held beliefs of Rand Fishkin; far less thought was given to the context of the material and what it really means.

Document AI Warehouse: Context Is Important

The leaked document contains information about Document AI Warehouse, a public Google Cloud platform for analysing, organising, searching, and storing data. The public documentation is titled Document AI Warehouse overview. According to a Facebook post, the “leaked” material is the “internal version” of that publicly available Document AI Warehouse documentation. That is the context of this data.

Screenshot: Document AI Warehouse

That appears to contradict the theory that the “leaked” data is internal Google Search information.

What is known right now is that the “leaked data” appears similar to what is on the public Document AI Warehouse page.

Internal Search Data Leak?
The original SparkToro post does not itself say that the data originated from Google Search. It says that the claim was made by the individual who forwarded the data to Rand Fishkin.

One of the qualities I admire most in Rand Fishkin is his painstaking precision as a writer, especially when it comes to caveats. Rand correctly points out that it is the person who supplied the data who claims it came from Google Search. There is only a claim, no evidence.

He writes:

“I received an email from someone claiming to be inside Google’s Search division and to have access to a massive leak of API documentation.”

Fishkin does not claim that former Google employees verified that the data came from Google Search. He states that the individual who emailed the data made that assertion.

The email further claimed that ex-Google staff members had attested to the authenticity of the leaked documents and had provided more private information about Google’s search operations to third parties.

Fishkin writes about a subsequent video meeting in which the leaker disclosed that his contact with former Google employees took place when he met them at a search industry event. Once more, we have to take the leaker’s word for it that the former Google employees said what they did after closely examining the data, rather than as an off-the-cuff remark.

Fishkin writes that he discussed it with three former Google employees. Notably, those former employees did not specifically state that the information is internal to Google Search. They confirmed only that it appeared to resemble internal Google information, not that it came from Google Search.

According to Fishkin, the former Google employees told him:

“When I worked there, I was not able to access this code. However, this appears to be authentic.”


“Everything about it is consistent with an internal Google API.”


“The API is Java-based. And a lot of effort was put into following Google’s internal documentation and naming guidelines.”

“I am aware of internal documentation that corresponds with this, but it would take more time to confirm.”

“This seems to be genuine based on my cursory examination.”

Saying that something comes from Google is not the same as saying that it comes from Google Search.

Maintain An Open Mind
Because so much about the data is unconfirmed, it is crucial to maintain an open mind. For instance, it is not known whether this is an internal Search Team document. As a result, treating anything in this data as actionable SEO advice is generally not a smart idea.

Furthermore, it is not a good idea to analyse the data specifically to confirm long-held opinions. That’s how confirmation bias ensnares a person.

Confirmation bias is defined as follows:

The propensity to look for, evaluate, favour, and remember information that supports or validates one’s preexisting views or values is known as confirmation bias.

Confirmation bias can lead a person to deny things that are empirically true. One example involves the decades-old “Sandbox” theory, which holds that Google automatically prevents a newly launched website from ranking. Every day, people report that their newly launched websites and pages rank in the top 10 of Google search results almost right away.

However, ardent Sandbox believers will dismiss real, observable experiences like that, regardless of how many people report them.

Brenda Malone, a freelance senior SEO technical strategist and web developer, messaged me about claims regarding the Sandbox:

“I have firsthand knowledge of the errors in the Sandbox theory. I recently indexed a personal blog with two posts in under two days. Under the Sandbox principle, a small two-post website had no business being indexed.”

The lesson here is that if the documentation does turn out to come from Google Search, hunting for evidence to support deeply held ideas is the wrong way to analyse the data.

What Is the Google Data Leak About?
There are several aspects of the leaked data to take into account:

The context of the leaked information is unknown. Is it connected to Google Search? Is it meant for some other purpose?

The data’s intended use is unknown. Was the data used to produce real search results? Or was it used internally for data management or manipulation?

Ex-Google employees did not confirm that the information is specific to Google Search. All they could affirm was that it seems to originate from Google.

Keep an open mind. Hunting through the data for vindication of long-held beliefs is known as confirmation bias.

Other Views Regarding “Leaked” Documents


Ryan Jones, who has extensive SEO experience as well as a strong grasp of computer science, offered some insightful remarks regarding the purported data leak.

Ryan posted a tweet:

“We’re not sure if this is for testing or production. It’s primarily for testing possible adjustments, I suppose.

“What’s utilised for the web or other sectors is unknown to us. Certain items might only be utilised for news, Google Home, etc.

“What a machine learning algorithm accepts as input and what it trains against are unknown. In my opinion, clicks are utilised to train a model to predict clickability rather than being a direct input. (Apart from popular boosters)

“Furthermore, I assume that some of these attributes are specific to training data sets rather than all sites.

“Is it true that Google didn’t lie? Not at all. However, let us analyse this leak objectively and without bias.”

Tweeted by @DavidGQuaid:

“Additionally, we’re not sure if this is for Google Cloud Document Retrieval or Google Search.

“APIs appear to be selective; that isn’t how I would expect the algorithm to function. What if a developer decides to forego all of the quality checks? It appears that they wish to create a content warehouse application for their company’s knowledge base.”

Is Google Search Linked To The “Leaked” Data?
There is currently no concrete proof that the “leaked” data originates from Google Search. There is a great deal of uncertainty over the data’s intended use. Notably, there are indications that this data has nothing to do with Google Search ranking and is merely “an external facing API for building a document warehouse as the name suggests.”

Although it is not yet conclusive that this data did not come from Google Search, that seems to be the direction in which the evidence is pointing.
