This content originally appeared on Modern Web Development with Chrome and was authored by Paul Kinlan
<p>I'm about to go on a short trip to India, and I've been thinking about
longer-term developer relations work for Chrome and Web in the region. As with
most trips I like to do a bit of research ahead of time so I can get a better
understanding of what the web looks like from the perspective of the country I
am visiting.</p>
<p>I've been following a bunch of the updates to
<a href="https://httparchive.org/">HTTPArchive</a> over the past couple of months and it's
been amazing to see the improvements to the types of data it collects and stores
in its
<a href="https://github.com/HTTPArchive/legacy.httparchive.org/blob/master/docs/bigquery-gettingstarted.md">BigQuery</a>
tables. One specific piece of information that is of massive interest to me is
the <a href="https://developers.google.com/web/tools/lighthouse/">Lighthouse</a> data
generated on each run of HTTPArchive. I was keen to see if I could use this data
to get a snapshot and a high-level understanding of how people might experience
the web in the country.</p>
<p>The good news is that it's not too hard to analyse the Lighthouse data in
HTTPArchive.</p>
<p>For my needs though, the harder part is to get a lock on what a 'top site' in
any given country is, especially when I am thinking about developer relations
work that we could and should be doing.</p>
<p>Here is how I broke the problem down. In each country there are many types of
developers who build for the web, and personally I tend to bucket them into three
groups: those whose current project targets the local market; those who target a
foreign market (i.e. building for export); and those who target a global audience.</p>
<p>When I think about the above three groups, it's nearly impossible to work out
the intent of the site and the people behind it. But there are some heuristics
that you can use to at least help you reason and understand the data.</p>
<p>For my analysis I didn't think I could get a list of the top sites visited by
users in India, so I made a simple assumption that '.in' domains are <em>likely</em> to
be built for people in India. Focusing on '.in' domains does not give 100%
sensitivity or specificity for the question of 'Indian sites' (users all over the
world like to use experiences that aren't locked to their country's TLD), but it
seems like a decent measure of the state of Indian sites as a first pass.</p>
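<p>As a minimal sketch of that heuristic outside of BigQuery, the check below
looks at the hostname of each URL rather than the whole string. The helper name
<code>is_in_domain</code> and the sample URLs are made up for illustration; they are
not part of the original analysis.</p>

```python
from urllib.parse import urlsplit


def is_in_domain(url: str, tld: str = "in") -> bool:
    """Return True when the URL's hostname ends in the given country-code TLD."""
    host = urlsplit(url).hostname or ""
    # Strip a trailing dot (fully qualified form) before comparing suffixes.
    return host.lower().rstrip(".").endswith("." + tld)


# Hypothetical sample URLs, not taken from HTTPArchive.
urls = [
    "https://www.example.in/",
    "http://shop.example.co.in/deals",
    "https://example.com/in/",   # '.in' only appears in the path
    "https://linkedin.com/",     # 'in' is part of the name, not the TLD
]

in_sites = [u for u in urls if is_in_domain(u)]
print(in_sites)
```

<p>Checking the hostname avoids false matches on paths or names that merely
contain 'in', although it still says nothing about who the site is built for.</p>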
<p>This type of analysis turns out to be pretty easy. Open up <a href="https://github.com/HTTPArchive/legacy.httparchive.org/blob/master/docs/bigquery-gettingstarted.md">BigQuery</a>,
find the latest table that contains the Lighthouse run
([httparchive:lighthouse.2018_08_01_mobile] in this case) and run the following
query.</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-sql" data-lang="sql"><span style="color:#66d9ef">SELECT</span>
url,
JSON_EXTRACT(report, <span style="color:#e6db74">'$.categories.seo.score'</span>) <span style="color:#66d9ef">AS</span> [seo_score],
JSON_EXTRACT(report, <span style="color:#e6db74">'$.categories.pwa.score'</span>) <span style="color:#66d9ef">AS</span> [pwa_score],
JSON_EXTRACT(report, <span style="color:#e6db74">'$.categories.performance.score'</span>) <span style="color:#66d9ef">AS</span> [speed_score],
JSON_EXTRACT(report, <span style="color:#e6db74">'$.categories.accessibility.score'</span>) <span style="color:#66d9ef">AS</span> [accessibility_score]
<span style="color:#66d9ef">FROM</span>
[httparchive:lighthouse.<span style="color:#ae81ff">2018</span>_08_01_mobile]
<span style="color:#66d9ef">WHERE</span>
url <span style="color:#66d9ef">LIKE</span> <span style="color:#e6db74">'%.in/'</span>
</code></pre></div><p>The above query is filtered on URLs ending in '.in/', and it returns the
Lighthouse score for each of the Lighthouse test categories. The Lighthouse data
is stored as a JSON object, so you have to extract the required components
with a JSONPath-like syntax (think XPath for JSON).</p>
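<p>To make the path syntax concrete, here is a rough Python sketch of what a
lookup such as <code>$.categories.seo.score</code> does. The
<code>json_extract</code> helper is a hypothetical stand-in that only handles plain
dotted keys (it is not BigQuery's implementation), and the report fragment is
invented to match the paths used in the query.</p>

```python
import json
from typing import Any, Optional


def json_extract(document: Any, path: str) -> Optional[Any]:
    """Follow a simple '$.a.b.c' path through nested dicts; None if a key is missing."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    node = document
    for key in path[1:].split("."):
        if not key:  # skip the empty piece before the first dot
            continue
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Made-up report fragment shaped like the paths used in the query.
report = json.loads("""
{"categories": {"seo": {"score": 0.9},
                "pwa": {"score": 0.45},
                "performance": {"score": 0.62},
                "accessibility": {"score": 0.78}}}
""")

scores = {
    name: json_extract(report, f"$.categories.{name}.score")
    for name in ("seo", "pwa", "performance", "accessibility")
}
print(scores)
```

<p>A category that is absent from a report simply comes back as
<code>None</code> here, which is worth keeping in mind when counting results.</p>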
<p>The number of results is actually pretty large and not of much use to present
here, so I pivoted them into a histogram of the number of sites in each score
range.</p>
<table>
<thead>
<tr>
<th>Score Range</th>
<th>SEO Score</th>
<th>PWA Score</th>
<th>Speed Score</th>
<th>A11Y Score</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>46</td>
<td>279</td>
<td>25</td>
</tr>
<tr>
<td>0.5</td>
<td>84</td>
<td>13992</td>
<td>6502</td>
<td>3973</td>
</tr>
<tr>
<td>0.7</td>
<td>3391</td>
<td>1400</td>
<td>2222</td>
<td>7585</td>
</tr>
<tr>
<td>0.8</td>
<td>1438</td>
<td>19</td>
<td>1147</td>
<td>2374</td>
</tr>
<tr>
<td>0.9</td>
<td>2762</td>
<td>9</td>
<td>1545</td>
<td>1069</td>
</tr>
<tr>
<td>1</td>
<td>7752</td>
<td>13</td>
<td>3189</td>
<td>434</td>
</tr>
</tbody>
</table>
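<p>The pivot into a histogram like the one above can be sketched in a few lines
of Python. This is an illustration under a stated assumption: each 'Score Range'
label is treated as the inclusive upper bound of its bucket, which the post does
not confirm. The sample scores are invented, and they are written as strings
because extracted JSON values may arrive that way.</p>

```python
from bisect import bisect_left
from collections import Counter

# ASSUMPTION: the table's "Score Range" labels are inclusive upper bounds.
UPPER_BOUNDS = [0, 0.5, 0.7, 0.8, 0.9, 1]


def histogram(scores):
    """Count scores per bucket; a score lands in the first bound that is >= it."""
    counts = Counter()
    for s in scores:
        if s is None:  # assumed: a missing category shows up as a null score
            continue
        idx = bisect_left(UPPER_BOUNDS, float(s))
        if idx < len(UPPER_BOUNDS):  # ignore anything above the last bound
            counts[UPPER_BOUNDS[idx]] += 1
    return [counts[bound] for bound in UPPER_BOUNDS]


# Invented sample scores for a single category.
sample = ["0.45", "0.3", "0.62", None, "0.9", "1", "0", "0.71"]
result = histogram(sample)

for bound, count in zip(UPPER_BOUNDS, result):
    print(bound, count)
```

<p>Running the same function once per category gives one column of the table
each; null scores are skipped, which could explain columns that do not sum to
the same total.</p>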
<p>Further drill-down and analysis of the data is needed to understand exactly
which specific issues are affecting the scores. However, in some cases, as with
the 'PWA Score', I've seen enough site scores in the past to know what issues
affect the overall score, and I can already see some of the challenges ahead of
us.</p>
<p>Next up: try to find a way to get the sites that Indian users frequent...
Hint, it's <a href="https://paul.kinlan.me/crux-topsites-and-lighthouse-scores-for-india/">here</a>.</p>