A Data Science Approach To Optimizing Internal Link Structure

By Chris Barnhart On Nov 24, 2021

A Data Science Approach To Optimizing Internal Link Structure

Getting the internal linking optimized is important if you care about your site pages having enough authority to rank for their target keywords. By internal linking what we mean are pages on your website receiving links from other pages.

This is important because this is the basis by which Google and other searches compute the importance of the page relative to other pages on your website.

It also affects how likely a user would discover content on your site. Content discovery is the basis of the Google PageRank algorithm.

Today, we’re exploring a data-driven approach to improving the internal linking of a website for the purposes of more effective technical site SEO. That is to ensure the distribution of internal domain authority is optimized according to the site structure.

Improving Internal Link Structures With Data Science

Our data-driven approach will focus on just one aspect of optimizing the internal link architecture, which is to model the distribution of internal links by site depth and then target the pages that are lacking links for their particular site depth.

Continue Reading Below

We start by importing the libraries and data, cleaning up the column names before previewing them:

import pandas as pd
import numpy as np
site_name="ON24"
site_filename="on24"
website="www.on24.com"

# import Crawl Data
crawl_data = pd.read_csv('data/'+ site_filename + '_crawl.csv')
crawl_data.columns = crawl_data.columns.str.replace(' ','_')
crawl_data.columns = crawl_data.columns.str.replace('.','')
crawl_data.columns = crawl_data.columns.str.replace('(','')
crawl_data.columns = crawl_data.columns.str.replace(')','')
crawl_data.columns = map(str.lower, crawl_data.columns)
print(crawl_data.shape)
print(crawl_data.dtypes)
Crawl_data

(8611, 104)

url                          object
base_url                     object
crawl_depth                  object
crawl_status                 object
host                         object
                             ...   
redirect_type                object
redirect_url                 object
redirect_url_status          object
redirect_url_status_code     object
unnamed:_103                float64
Length: 104, dtype: object

Andreas Voniatis, November 2021

The above shows a preview of the data imported from the Sitebulb desktop crawler application. There are over 8,000 rows and not all of them will be exclusive to the domain, as it will also include resource URLs and external outbound link URLs.