by Jon Udell
July 1999
A handy feature of AltaVista's advanced query syntax, not widely known, is that you can ask AltaVista for the number of pages in its index that link to a specified site. For example, the syntax:
link:www.byte.com -url:byte.com
returns the answer:
"AltaVista found about 11093 Web pages for you."
The link: bit asks for pages pointing to www.byte.com. The -url: bit excludes pages whose own URLs contain byte.com, so that the site's links to itself are not counted.
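For scripting purposes, the same query can be written as a URL. Here is a minimal sketch in Perl, assuming the query-string layout used by the script listed at the end of this article (the pg=q and kl=XX parameters come from that listing; the helper name mindshare_url is made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the AltaVista URL for a mindshare query.
# $site is the address to count links to; $dom is the domain whose
# own pages should be left out of the count.
sub mindshare_url {
    my ($site, $dom) = @_;
    # %3A is a URL-encoded colon; + stands for the space between the two terms.
    return "http://www.altavista.com/cgi-bin/query?pg=q&kl=XX"
         . "&q=link%3A$site+-url%3A$dom";
}

print mindshare_url("www.byte.com", "byte.com"), "\n";
```

Fetching that URL and reading the "found about" line from the response is all that the interactive query does.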
On the Web, mindshare is a function of density of interconnection. A site that attracts lots of links is clearly enjoying a sizable mindshare. The reference count that AltaVista returns can be construed as a direct measure of that mindshare.
That's well and good, but what does a single number convey? Lacking context, not much. 11093 pages pointing to www.byte.com sounds like a lot, but is it? How popular is that site, really, in the scheme of things?
To supply that context, I turned to Yahoo's directory. Among its many categories is a cluster related to computer-magazine sites. As I started plugging site addresses from these lists into AltaVista, the picture began to clarify.
The procedure was cumbersome, though. Pick a site in Yahoo, capture its URL, feed it into an AltaVista advanced query, write down the result. This gets old in a hurry.
What to do? Automate. The Web is exquisitely automatable. Every site's static content is almost as easily accessible to a robotic script as to a human using a browser. The same holds true for dynamic content -- pages generated by script-driven engines, such as AltaVista's search results.
In effect, every Web site is a scriptable component, and the Web as a whole is a vast library of such components. You can invoke these individually from any scripting language that can issue HTTP requests and interpret the responses. What's more, you can join components to achieve novel effects. That's what I did to create my computer-magazine-site mindshare report, at http://udell.roninhouse.com/mindshare-report.html.
I started by unrolling Yahoo's /Computers_and_Internet/News_and_Media/Magazines category to create a long list of URLs. How? A Perl script starts at a node of the Yahoo directory, and fetches that node's page. On every directory page are three things of interest: absolute nodes, relative nodes, and leaves. An absolute node is a reference to a different category. For example, /Computers_and_Internet/News_and_Media/Magazines is mostly a collection of pointers to other parts of the Yahoo tree, such as /Computers_and_Internet/Hardware/Macintosh/Magazines. On Yahoo's directory pages these absolute references are marked with a trailing @, to distinguish them from relative references. An example of a relative reference is Magazines which, on the /Computers_and_Internet/News_and_Media page, refers to the subcategory /Computers_and_Internet/News_and_Media/Magazines. Finally there are leaves -- that is, actual site addresses such as http://www.byte.com.
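The three kinds of link can be told apart from the href alone. The sketch below assumes the shapes implied by the patterns in the script at the end of this article: leaves begin with http://, absolute nodes begin with a slash, and anything else is a relative node to be resolved against the current category. The function name classify_link is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical classifier for one href found on the directory page
# whose path is $current. Returns a (kind, target) pair.
sub classify_link {
    my ($current, $href) = @_;
    return ('leaf', $href)          if $href =~ m#^http://#;  # an actual site
    return ('absolute node', $href) if $href =~ m#^/#;        # elsewhere in the tree
    return ('relative node', "$current$href");                # a subcategory
}

my $here = '/Computers_and_Internet/News_and_Media/';
for my $href ('/Computers_and_Internet/Hardware/Macintosh/Magazines/',
              'Magazines/',
              'http://www.byte.com/') {
    my ($kind, $target) = classify_link($here, $href);
    print "$kind: $target\n";
}
```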
My script's recursive routine uses regular expressions to match absolute nodes, relative nodes, and leaves in the succession of pages that it fetches. When it sees an absolute or relative node, it explores that node. When it sees a leaf, it prints out the site's URL and title. Because Yahoo's cross-referencing creates loops, the script remembers where it's been and won't revisit a node.
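The loop-avoidance idea is easy to see without any network traffic. This sketch walks a toy in-memory directory instead of fetching pages; the node names, the example.* addresses, and the deliberate cross-reference from /B/ back to /A/ are all invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stand-in for the directory: each node maps to the hrefs on its page.
my %pages = (
    '/A/' => [ '/B/', 'http://one.example/' ],
    '/B/' => [ '/A/', 'http://two.example/', 'http://one.example/' ],
);

my %seen;      # nodes and leaves already visited
my @leaves;    # sites collected, in the order they were found

sub traverse {
    my ($node) = @_;
    for my $href (@{ $pages{$node} || [] }) {
        next if $seen{$href}++;       # visited before: skip it
        if ($href =~ m#^http://#) {
            push @leaves, $href;      # leaf: record the site
        } else {
            traverse($href);          # node: descend into it
        }
    }
}

$seen{'/A/'}++;    # mark the starting node so a link back to it is ignored
traverse('/A/');
print "$_\n" for @leaves;
```

Without the %seen check, the /A/ and /B/ nodes would call each other forever; with it, each site is also reported only once even when two categories list it.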
Note that all this gets even easier in the brave new world of XML. At http://directory.mozilla.org, the home of the Yahoo-like Open Directory project (aka NewHoo), the directory is backed by an RDF (Resource Description Framework) data structure -- which is downloadable. Since RDF is expressed in XML syntax, you can run an XML parser on the dump and pick out leaves and nodes without any gnarly regular-expression stuff. (I haven't actually tried this, and I've heard some griping from those who have, but if it doesn't work perfectly yet, it will soon; things are moving along rapidly in the XML sphere.)
When I unrolled Yahoo's computer-magazine category, the raw list -- about 585 items -- was itself an interesting result. There's a downside to Yahoo's compartmentalization. You don't get to see long lists of items related under a super-category. These lists alone have a certain fascination.
Next I fed the URLs to AltaVista, captured the reference count for each, and ranked the sites by that count. How? A Perl script issues one HTTP request per URL, using the same query string that appears in your browser's location field when you run the mindshare query interactively. It captures each result, matches on the pattern "AltaVista found about NNNN Web pages...", sorts the results, matches up URLs with titles, and produces a mindshare-ordered list of links.
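The capture-and-rank step can be sketched as follows. The counts are the ones quoted in this article; wrapping them in the result-page sentence, keying them by site name rather than URL, and the function name parse_count are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull the reference count out of a result page.
# Returns 0 if the page is missing or the phrase is absent.
sub parse_count {
    my ($html) = @_;
    return 0 unless defined $html;
    my ($n) = $html =~ m#found about (\d+) Web pages#;
    return defined $n ? $n : 0;
}

# Stand-ins for fetched result pages, one per site.
my %result_page = (
    'BYTE'       => 'AltaVista found about 11093 Web pages for you.',
    'CNET'       => 'AltaVista found about 95619 Web pages for you.',
    'LinuxWorld' => 'AltaVista found about 53294 Web pages for you.',
    'ZDNet'      => 'AltaVista found about 83222 Web pages for you.',
);

my %shares = map { ($_, parse_count($result_page{$_})) } keys %result_page;

# Rank by descending reference count.
my @ranked = sort { $shares{$b} <=> $shares{$a} } keys %shares;
printf "%-12s %6d\n", $_, $shares{$_} for @ranked;
```

Capturing the match in list context means a page without the expected phrase yields 0 rather than a leftover value from some earlier match.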
What did I learn? I'll admit I was surprised (and pleased) to see that BYTE, which ceased publication last May and has been stagnant online since then, remains 12th on that list of 585 sites. That LinuxWorld ranks third (53294) behind CNET (95619) and ZDNet (83222) is certainly an eye-opener. On balance, the picture that emerged seems a credible representation of Web mindshare for computer-magazine-related sites. Caveat: this is only the set of sites that were in the selected Yahoo category subtree, and it is only AltaVista's view of their mindshare. It's not a complete or perfect view -- nothing on the Web is -- but it seems more than good enough to be useful.
Of course mindshare isn't everything. Having all these links in one place led me to explore a lot of interesting but lesser-known sites that I'd never known about, because I'd never fully plumbed that region of Yahoo.
Does this technique generalize? Yes and no. If you start with an overly broad category, you'll cut an overly wide swath through the directory. For example, /Computers_and_Internet/News_and_Media was too broad. I gathered a lot more computer-related sites that way, but also veered into foreign territory -- for example, chemical and biological journals.
I had better luck when I focused on /Science/Nanotechnology. Here's an area that I know little about. Yahoo told me which sites it thinks belong under that category. AltaVista told me the mindshare of each of those sites. Working together, Yahoo and AltaVista gave me a quick read on the "important" sites in that category. Here were the top 10:
www.di.com 1631
www.foresight.org 937
nano.xerox.com/nano 628
www.zeiss.de 385
www.lucifer.com/~sean/Nano.html 223
nanozine.com 159
nanocomputer.org 134
www.molec.com 130
www.physikinstrumente.com 114
www.polytecpi.com 106
Is this the "right" top-ten list, based on an unrolling of Yahoo's /Science/Nanotechnology? It looks reasonable to my untutored eye, but only nanotech buffs can say for sure.
The notion of sites as networked software components isn't new, but I'm continually surprised by the unexpected and powerful new views of the Web that you can create when you use sites in this way, and especially when you use them in combination.
Clearly XML is the way of the (very near) future. Companies like webMethods are busily redefining the EDI landscape as a set of Web-enabled components (literally: Web sites that export business functions) whose services and APIs are defined in terms of XML. But the fact is, you don't need to wait until this sea-change is complete. Humble HTTP-aware scripts can already tap into the Web's vast library of components, and do amazing and useful things with them.
Author and Web/Internet consultant Jon Udell (http://udell.roninhouse.com/) was BYTE Magazine's executive editor for new media, architect of www.byte.com, and author of BYTE's monthly Web Project column. He is also the author of Practical Internet Groupware, forthcoming from O'Reilly and Associates.
#!/usr/bin/perl -w
#
# mindshare, Jon Udell, udell@monad.net, http://udell.roninhouse.com/
#
# This script unrolls a Yahoo category to create a list of sites,
# then asks AltaVista how many pages point to each site in the list.
# In effect, it measures the Web mindshare of the sites in this category.
#
# If you use this script, please do so judiciously,
# with respect for Yahoo and AltaVista -- two of the Net's
# most valuable resources.
#
# usage: perl mindshare > report.html
#
use strict;
use LWP::Simple;

my $host = "http://dir.yahoo.com";
my $root = "/Arts/Education/Art_Schools/";
#my $root = "/Science/Nanotechnology/";
#my $root = "/Computers_and_Internet/News_and_Media/Magazines/";
my $node_pat = "<li><a href=\"[^\"]+/\">";
my $leaf_pat = "<li><a href=\"http://[^>]+>";
my %seen = ();     # addresses already visited
my %sites = ();    # site address (without http://) => title
my %shares = ();   # site address => reference count
my $domchars = "[a-z0-9-]";

# build a hashtable of sites and titles
traverse($root);

# build a hashtable of mindshare numbers for each site
foreach my $site (sort keys %sites)
{
    # not perfect: works for .com, not .co.uk, .edu.au, etc.
    # list-context capture leaves $dom undefined if nothing matches
    my ($dom) = $site =~ m#($domchars+\.$domchars+)(/|$)#i;
    unless ( defined $dom )
    {
        print STDERR "no domain found in: $site\n";
        next;
    }
    $shares{$site} = mindshare($site,$dom,$sites{$site});
}

# print results ordered by mindshare
print "<table>\n";
foreach my $site (sort bynum keys %shares)
{
    # plain print, not sprintf: a % in a title must not be read as a format.
    # the http:// stripped in traverse() is restored so the links resolve.
    print "<tr><td align=right><a href=\"http://$site\">$sites{$site}</a></td><td>$shares{$site}</td></tr>\n";
}
print "</table>\n";

sub traverse
{
    my ($root) = @_;
    my $raw = get "$host$root";
    return unless defined $raw;    # fetch failed; nothing to explore
    # non-greedy title match, so two links on one line stay separate
    while ( $raw =~ m#($node_pat|$leaf_pat)(.+?)</a>#g )
    {
        my ($leaf_or_node, $title) = ($1, $2);
        my ($leaf_or_node_addr) = $leaf_or_node =~ m#"([^"]+)"#;
        next unless defined $leaf_or_node_addr;
        next if ( $leaf_or_node_addr =~ m#yahoo\.com# );
        my $is_leaf = ( $leaf_or_node_addr =~ m#^http# );
        # resolve a relative node against the current category, so that
        # same-named subcategories elsewhere don't collide in %seen
        if ( !$is_leaf && substr($leaf_or_node_addr,0,1) ne '/' )
        {
            $leaf_or_node_addr = "$root$leaf_or_node_addr";
        }
        if ( defined $seen{$leaf_or_node_addr} )
        {
            print STDERR "seen: $leaf_or_node_addr, $seen{$leaf_or_node_addr}\n";
            next;
        }
        $seen{$leaf_or_node_addr}++;
        if ( !$is_leaf )
        {
            traverse($leaf_or_node_addr);
        }
        else
        {
            print STDERR "\"$leaf_or_node_addr\" => \"$title\"\n";
            my $site = $leaf_or_node_addr;
            $site =~ s#^http://##;
            $sites{$site} = $title;
        }
    }
}

sub mindshare
{
    my ($site,$dom,$title) = @_;
    my $result = get "http://www.altavista.com/cgi-bin/query?pg=q&kl=XX&q=link%3A$site+-url%3A$dom";
    # list-context capture: a failed fetch or match gives undef, not a stale $1
    my ($number) = defined($result) ? $result =~ m#found about (\d+) Web pages# : ();
    my $count = defined($number) ? $number : 0;
    print STDERR "$dom\t$site\t$title\t$count\n";
    return $count;
}

sub bynum
{ return $shares{$b} <=> $shares{$a}; }

This work is licensed under a
Creative Commons License.