Book for Berlin
How big is the Web?
Crappabytes galore
Multiples of bytes
Decimal prefixes
Name Symbol Multiple
kilobyte kB 10^3
megabyte MB 10^6
gigabyte GB 10^9
terabyte TB 10^12
petabyte PB 10^15
exabyte EB 10^18
zettabyte ZB 10^21
yottabyte YB 10^24
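As a quick illustration of the table above, here is a minimal Python sketch that converts between plain byte counts and the decimal prefixes. The helper names to_bytes and humanize are made up for this example; nothing here comes from a library.

```python
# Decimal (SI) prefixes from the table above: symbol and power of ten.
PREFIXES = [
    ("kB", 3), ("MB", 6), ("GB", 9), ("TB", 12),
    ("PB", 15), ("EB", 18), ("ZB", 21), ("YB", 24),
]


def to_bytes(value, symbol):
    """Convert e.g. (19, "TB") into a plain number of bytes."""
    return value * 10 ** dict(PREFIXES)[symbol]


def humanize(n_bytes):
    """Express a byte count with the largest prefix that keeps the value >= 1."""
    best_value, best_symbol = n_bytes, "B"
    for symbol, power in PREFIXES:
        if n_bytes >= 10 ** power:
            best_value, best_symbol = n_bytes / 10 ** power, symbol
    return f"{best_value:g} {best_symbol}"


print(to_bytes(19, "TB"))              # 19 terabytes as bytes
print(humanize(to_bytes(7500, "TB")))  # 7500 terabytes with a larger prefix
```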
In 2000, according to a study by BrightPlanet, the web held 19 terabytes of data, and the hidden, non-indexed web 7,500 terabytes.
Good old Google brings up some useful stuff though, including this fascinating nugget:
"Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.
The deep Web contains 7,500 terabytes of information, compared to 19 terabytes of information in the surface Web.
The deep Web contains nearly 550 billion individual documents compared to the 1 billion of the surface Web.
More than an estimated 100,000 deep Web sites presently exist. Sixty of the largest deep Web sites collectively contain about
750 terabytes of information, sufficient by themselves to exceed the size of the surface Web by 40 times."
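The figures in the quote can be cross-checked with a few divisions. The sketch below only uses the numbers quoted above; the variable names are invented for the example.

```python
# Figures taken from the quoted study (terabytes and document counts).
deep_tb, surface_tb = 7_500, 19
top60_tb = 750
deep_docs, surface_docs = 550e9, 1e9

size_ratio = deep_tb / surface_tb      # deep web vs surface web, by volume
top60_ratio = top60_tb / surface_tb    # sixty largest deep sites vs surface web
doc_ratio = deep_docs / surface_docs   # deep web vs surface web, by documents

print(f"by volume:    {size_ratio:.0f} times")
print(f"top 60 sites: {top60_ratio:.1f} times")
print(f"by documents: {doc_ratio:.0f} times")
```

The volume ratio and the document ratio land at the two ends of the "400 to 550 times" range, and the sixty largest sites do come out at roughly 40 times the surface web.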
In mid-August 2005, Yahoo announced that it had indexed more than 19 billion sites (versus the 8 billion Google claimed at the time).
This provoked a series of discussions about the limits of the main search engines and the real dimensions of the web.
Shortly afterwards Google suddenly raised its own index from 8 billion to 20 billion sites. One wonders about the quality of
such sites: where had they been before?
At the end of 2005, according to Eric Schmidt (Google's CEO), Google had indexed 170 terabytes of data, while the "global" web had already reached 150 million terabytes.
Nobody knows how big the web is, but what is sure is that in a few years the capacity of flash cards will be
big enough to contain the whole of it, whatever its dimensions.
As in the old "rice and chessboard" story, where doubling the rice on each of the 64 squares of the chessboard produced
an amount of rice so big that the whole earth could not produce it (9,223,372,036,854,775,808 grains on the last square alone, more than nine billion billion),
the capacity of flash cards roughly doubles every year, and in 2006, with a capacity of 64 gigabytes, we are only at chessboard square 7 (out of 64).
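The chessboard arithmetic is easy to check. The sketch below assumes, purely for illustration, that square 1 stands for a 1 gigabyte card, which is one way to read the "square 7" figure; the function names are made up.

```python
def grains_on_square(n):
    """Grains on square n (1-based): one grain on square 1, doubling each square."""
    return 2 ** (n - 1)


def total_grains(squares=64):
    """Sum of all grains over the whole board."""
    return 2 ** squares - 1


def square_for_capacity(capacity_gb):
    """Highest square already reached by a card of capacity_gb gigabytes.

    Assumption (not from the text): square 1 corresponds to 1 GB.
    capacity_gb must be a positive integer.
    """
    return capacity_gb.bit_length()


print(grains_on_square(64))     # grains on the last square alone
print(total_grains())           # grains on the whole board
print(square_for_capacity(64))  # where a 64 GB card sits on the board
```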
This means, among other things, that even assuming a web that keeps growing exponentially (which is far from granted), we will
all be able to carry a whole copy of it inside our own PDAs in less than 10 years. The whole web, with all its jewels and
all its commercial crap. All the data and all the music, all the films and all the books. But buried underneath the commercial
filth.
Hence the vital importance of knowing both how to SEARCH and how to EVALUATE the results of our searches.
Searching the web is still more an art than a taxonomically sound discipline, but the operation can, roughly, be subdivided into several
subsectors.
The following ones seem to me the most important:
SHORTTERM and LONGTERM searching strategies
REGIONAL/LOCAL and GLOBAL searching angles
COMBING
EVALUATING the results
The importance of NAMES.
The importance of TEXT on the web, the importance of words.
Words are what we type into a search engine's query box, and they represent the sole link between a search engine and the targets we expect.
Everything on the web is just a name: a page, a file, a server address, a subdirectory containing it.
Even the possibility of finding your target depends on its name.
If, for instance, we search for images of "snowflakes", we will always find some noise among the signal as well:
a white cat or a white horse or, say, a white sailing boat named "Snowflake" will appear among the results.
Since the Web is quicksand, and links disappear quickly (even if they always emerge again somewhere else), we will perform this
exercise using mostly "pseudo-links", i.e. searchstrings (not links) that should always work,
regardless of where the targets are located or relocated during the coming years.
As a working example we have chosen a search for "tops", those small spinning devices whose English name is extremely
ambiguous ("top" as opposed to bottom is much more frequent than "top" meaning a spinning toy).
This should allow us to show the differences between shortterm
and longterm searches, the importance of a sound regional and multilingual approach,
the severe limits of all main search engines (à la Google),
the fundamental importance of combing as opposed to searching on our own, and the vital necessity of knowing how to evaluate
results and how to ditch the commercial crap when pulling our nets back from the web.
But before embarking on our searching venture we need some sharp weapons.
The Web-forest is thick and we will need to cut through a lot of commercial morass.
WEAPONS
Opera or Firefox
Anatomy of some webbits
+("/ebooks"|"/book") +"python" intitle:"index of"
intitle:"index of/" "Apr-2004" "jpg" playboy
(note the "Apr-2004" snippet, which you can change at leisure :-)
(m4a OR mp3) "index of" +garfunkel
Nov-2005 intitle:"Index of /" {frsh=9999} "digital photography"
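Since the month snippet is the part of these webbits that changes, the strings can be generated rather than typed. The following is only a sketch: index_of_query is a made-up helper that assembles a searchstring and sends nothing to any search engine. The date stamp presumably works because open directory listings show modification dates in that form, which is an assumption on my part.

```python
# English month abbreviations as they appear in the snippets above.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def index_of_query(keywords, year, month, extensions=()):
    """Build an "index of" searchstring for a given month (1-12) and year."""
    parts = ['intitle:"index of"', f'"{MONTHS[month - 1]}-{year}"']
    if extensions:
        # Any of the listed file extensions may match.
        parts.append("(" + " OR ".join(extensions) + ")")
    # Multi-word keywords are quoted so they are searched as a phrase.
    parts.extend(f'"{k}"' if " " in k else k for k in keywords)
    return " ".join(parts)


print(index_of_query(["digital photography"], 2005, 11))
```

Changing the year and month arguments gives the same searchstring for any other period.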
LORE
Nomen est omen
The tippiti-top tuppit
a reverser's toy
regional searching and the importance of English on the web
top English
trottola Italian subset
toupie French subset (tupie)
Kreisel German subset
Danzknöpfli (Schwarzwald), Dilldop (Köln), Dreidel, Havergeis (Schwarzwald), Kreisel, m., Schnurre, f. (old form), Wendekreisel (tippe top)
trompo Spanish
Volchok Russian subset
Tuoluo Chinese subset
Too loo (Hmong) Chinese subset
Hyrrä Finnish subset
Pião Portuguese/Brazilian
Tol Dutch
http://www.ifi.uio.no/~knuthe/top/country_names.html
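The word list above can be turned into one pseudo-link (searchstring) per language. This is a minimal sketch: the dictionary holds only a selection of the words listed, and regional_queries and the "spinning" disambiguator for English are hypothetical choices made for the example.

```python
# A selection of the words for "top" listed above, keyed by language.
TOP_WORDS = {
    "English": ["top"],
    "Italian": ["trottola"],
    "French": ["toupie"],
    "German": ["Kreisel", "Wendekreisel"],
    "Spanish": ["trompo"],
    "Russian": ["Volchok"],
    "Chinese": ["Tuoluo"],
    "Hmong": ["Too loo"],
    "Finnish": ["Hyrrä"],
    "Portuguese": ["Pião"],
    "Dutch": ["Tol"],
}


def regional_queries(words_by_language, context=None):
    """Return one searchstring per language.

    context optionally maps a language to an extra term, useful where
    the word itself is ambiguous (as "top" is in English).
    """
    context = context or {}
    queries = {}
    for language, words in words_by_language.items():
        # Quote multi-word names so they are searched as a phrase.
        query = " OR ".join(f'"{w}"' if " " in w else w for w in words)
        if len(words) > 1:
            query = f"({query})"
        if language in context:
            query += " " + context[language]
        queries[language] = query
    return queries


for language, query in regional_queries(TOP_WORDS, {"English": "spinning"}).items():
    print(f"{language}: {query}")
```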