I’m still thinking about that curve showing pages per website. Since the version I posted, we’ve added more sites to the scan and found some even bigger websites. It doesn’t really matter how big they are or who runs them; what matters is that they make the curve even more extreme and reduce the number of sites needed to reach 80% of government content. Out of the 780 or so, the biggest 155 account for exactly 80% of the content (that guy Pareto really nailed it, huh?).
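If you want to reproduce that cut-off for a list of your own, the calculation is simple enough. Here’s a minimal sketch in Python; the page counts are invented for illustration, standing in for the real scan data of roughly 780 sites.

```python
# Minimal sketch of the cumulative-share calculation behind the curve.
# The page counts below are made up; they just mimic a "few huge sites,
# long tail of small ones" shape.

def sites_for_share(page_counts, target=0.80):
    """Return how many of the largest sites are needed to reach `target`
    share of all pages."""
    counts = sorted(page_counts, reverse=True)
    total = sum(counts)
    running = 0
    for i, count in enumerate(counts, start=1):
        running += count
        if running / total >= target:
            return i
    return len(counts)

# Example with invented numbers: three huge sites plus 200 small ones.
example = [100_000, 60_000, 40_000] + [500] * 200
print(sites_for_share(example))  # how many sites cover 80% of the pages
```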
So, after my questions in part two, comes another question … does anyone ascribe a “value” to any of that content? The storage industry has talked about the value of data for years … you put your most accessed data in RAM, the next most accessed nearby on a hard disk, the next on a network disk, the next “nearline” (i.e. reachable quickly) and the least valuable offline (in some kind of tape archive). Does the web change that? Shouldn’t we be thinking about a hierarchy of websites and web content that ensures the most valuable content is the easiest to find? Somewhere in those sites with 100,000 pages there is doubtless some incredibly important piece of information that we need to know … but could you find it if you needed it? And doesn’t the value change according to who you are and what you need? Is anyone out there modelling the value of content, how you measure it and what you do once you know it?
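Just to make the analogy concrete, here’s a toy sketch of what that storage-style hierarchy might look like applied to web content. The tier names echo the RAM / disk / network / nearline / tape ladder above; the access thresholds are entirely made up, only there to show the shape of the idea.

```python
# Toy sketch: map a piece of content to a "visibility tier" based on how
# often it is accessed, by analogy with tiered storage. Thresholds are
# arbitrary placeholders, not real policy.

TIERS = [
    (10_000, "front page / top navigation"),    # the RAM of the web estate
    (1_000,  "prominently linked section"),     # local disk
    (100,    "searchable, a few clicks deep"),  # network disk
    (10,     "archive section, still online"),  # nearline
    (0,      "offline / available on request"), # tape
]

def tier_for(monthly_views: int) -> str:
    """Return the tier whose threshold the view count meets or exceeds."""
    for threshold, tier in TIERS:
        if monthly_views >= threshold:
            return tier
    return TIERS[-1][1]

print(tier_for(25_000))  # -> front page / top navigation
print(tier_for(3))       # -> offline / available on request
```

Of course, raw access counts are exactly the kind of crude proxy the questions above are poking at: the value of a page depends on who needs it and when, not just how often it gets hit.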
There’s still a PhD thesis in all that, I’m sure.