Book for Berlin

How big is the Web? Crappabytes galore

Multiples of bytes (decimal prefixes)

Name        Symbol   Multiple
kilobyte    kB       10^3
megabyte    MB       10^6
gigabyte    GB       10^9
terabyte    TB       10^12
petabyte    PB       10^15
exabyte     EB       10^18
zettabyte   ZB       10^21
yottabyte   YB       10^24

In 2000, according to a study by BrightPlanet, the web held 19 terabytes of data, and the hidden, non-indexed web 7,500 terabytes. Good old Google still brings up some useful stuff, including this fascinating nugget:

"Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web. The deep Web contains 7,500 terabytes of information, compared to 19 terabytes of information in the surface Web. The deep Web contains nearly 550 billion individual documents compared to the 1 billion of the surface Web. More than an estimated 100,000 deep Web sites presently exist. Sixty of the largest deep Web sites collectively contain about 750 terabytes of information - sufficient by themselves to exceed the size of the surface Web by 40 times."

In mid-August 2005 Yahoo announced that it had indexed more than 19 billion sites (versus the 8 billion Google claimed at the time). This provoked a series of discussions about the limits of the main search engines and the real dimensions of the web. Shortly afterwards Google suddenly raised its own index from 8 billion to 20 billion sites. One wonders about the quality of such sites: where had they been before?

At the end of 2005, according to Eric Schmidt (Google CEO), Google had indexed 170 terabytes of data, while the "global" web had already reached 150 million terabytes.

Nobody knows how big the web is, but what is sure is that in a few years the capacity of flash cards will be big enough to contain the whole of it, no matter its dimensions.
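The figures above are easier to compare once they are all expressed in bytes. What follows is only a minimal Python sketch for checking them: the names `to_bytes` and `doublings_needed` are made up for this illustration, and the flash-card part rests on two assumptions, namely that capacity doubles once per step and that we start from a 64 GB card.

```python
# Decimal (SI) byte prefixes, as in the table above: symbol -> power of ten.
PREFIX_EXPONENT = {
    "kB": 3, "MB": 6, "GB": 9, "TB": 12,
    "PB": 15, "EB": 18, "ZB": 21, "YB": 24,
}


def to_bytes(value, unit):
    """Convert an integer amount with a decimal prefix into bytes."""
    return value * 10 ** PREFIX_EXPONENT[unit]


def doublings_needed(start_bytes, target_bytes):
    """Count how many doublings take start_bytes to at least target_bytes."""
    count = 0
    capacity = start_bytes
    while capacity < target_bytes:
        capacity *= 2
        count += 1
    return count


# Figures quoted in the text (year 2000 study, 2005 estimate).
surface_2000 = to_bytes(19, "TB")
deep_2000 = to_bytes(7500, "TB")
global_2005 = to_bytes(150_000_000, "TB")

# Assumption of this sketch: a 64 GB flash card as the starting point.
flash_card = to_bytes(64, "GB")

print("deep / surface ratio:", deep_2000 / surface_2000)
print("150 million TB equals 150 EB:", global_2005 == to_bytes(150, "EB"))
print("doublings to hold the 2000 surface web:",
      doublings_needed(flash_card, surface_2000))
```

With these inputs the deep web comes out just under 400 times the surface web, 150 million terabytes turns out to be the same thing as 150 exabytes, and a 64 GB card needs 9 doublings to pass the 19 terabytes of the surface web of 2000. Change the target to any other estimate to see how sensitive the result is.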
Recall the old "rice and chessboard" story, where doubling the rice across the 64 squares of the chessboard produces an amount so big that the whole earth could not grow it: the last square alone holds 2^63 = 9,223,372,036,854,775,808 grains, about nine billion billion. In the same way the capacity of flash cards roughly doubles every year, and in 2006, with 64 gigabytes of capacity, we are only at chessboard square 7 out of 64 (counting 1 gigabyte on square 1). This means, among other things, that even assuming a still exponentially growing web (which is far from granted), we will all be able to carry a whole copy of it inside our own PDAs in less than 10 years. The whole web, with all its jewels and all its commercial crap. All the data and all the music, all the films and all the books. But buried underneath the commercial filth. Hence the vital importance of knowing both how to SEARCH and how to EVALUATE the results of our searches.

Searching the web is still more an art than a taxonomically sound discipline, but the operation can -roughly- be subdivided into several subsectors. The following ones seem to me the most important:

SHORTTERM and LONGTERM searching strategies
REGIONAL/LOCAL and GLOBAL searching angles
COMBING
EVALUATING the results

The importance of NAMES. The importance of TEXT on the web, the importance of words.

Words are what we type into a search engine mask, and they represent the sole link between a search engine and the targets we expect. Everything on the web is just a name: a page, a file, a server address, a subdirectory containing it. Even the possibility of finding your target depends on its name. If, for instance, we search for images of "snowflakes", we will always find some noise among the signal as well: some white cat or white horse, or -say- a white sailing boat named "snowflake" will appear among the results.

Since the Web is a quicksand, and links disappear quickly (even if they always emerge again somewhere else), we will perform this exercise using mostly "pseudo-links", i.e.
search strings (not links) that should keep working in the coming years, regardless of where the targets actually sit or are moved to. As a working example we have chosen a search for "tops", those small spinning toys whose English name is extremely ambiguous ("top" as opposed to bottom is much more frequent than "top" meaning the spinning toy). This should allow us to show the differences between shortterm and longterm searches, the importance of a sound regional and multilingual approach, the severe limits of all the main search engines (à la Google), the fundamental importance of combing as opposed to searching on our own, and the vital necessity of knowing how to evaluate results and how to ditch the commercial crap when pulling our nets back from the web.

But before embarking on our searching venture we need some sharp weapons. The Web-forest is thick and we will need to cut through a lot of commercial morasses.

WEAPONS

Opera or Firefox

Anatomy of some webbits

+("/ebooks"|"/book") +"python" intitle:"index of"
intitle:"index of/" "Apr-2004" "jpg" playboy
    note the "Apr-2004" snippet, which you can change at leisure :-)
(ma4 OR mp3) "index of" +garfunkel Nov-2005
intitle:"Index of /" {frsh=9999} "digital photography"

LORE

Nomen est omen
The tippiti-top tuppit a reverser's toy
Regional searching and the importance of English on the web

top               English
trottola          Italian subset
toupie            French subset (tupie)
kreisel           German subset: Danzknöpfli (Schwarzwald), Dilldop (Köln), Dreidel, Havergeis (Schwarzwald), Kreisel, m., Schnurre, f. (old form), Wendekreisel (tippe top)
trompo            Spanish
Volchok           Russian subset
Tuoluo            Chinese subset
Too loo (Hmong)   Chinese subset
Hyrrä             Finnish subset
Pião              Portuguese/Brazilian
Tol               Dutch

http://www.ifi.uio.no/~knuthe/top/country_names.html
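The regional names listed above can be turned into search strings mechanically. Here is a minimal Python sketch that does so, using the `OR` and `+term` notation of the webbits shown earlier. The helper names `quote` and `regional_query` are invented for this example, the dictionary holds only a few of the names from the list, and whether a given engine honours `OR`, `+` and parentheses is an assumption to verify engine by engine.

```python
# A few regional names for the spinning top, taken from the list above.
TOP_NAMES = {
    "italian": ["trottola"],
    "french": ["toupie", "tupie"],
    "german": ["Kreisel", "Wendekreisel"],
    "spanish": ["trompo"],
    "dutch": ["tol"],
}


def quote(term):
    """Wrap multi-word terms in double quotes so they are searched as a phrase."""
    return f'"{term}"' if " " in term else term


def regional_query(names, required=()):
    """Join alternative names with OR and append +required terms."""
    alternatives = " OR ".join(quote(name) for name in names)
    if len(names) > 1:
        alternatives = f"({alternatives})"
    parts = [alternatives] + [f"+{quote(term)}" for term in required]
    return " ".join(parts)


# One search string per language, ready to be pasted into a search mask.
for language, names in TOP_NAMES.items():
    print(language, "->", regional_query(names))
```

For the German entry this gives `(Kreisel OR Wendekreisel)`. Adding a required term in the target language (the second argument) is one way to push the English "top as opposed to bottom" noise out of the results.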


Tippetop or wendekreisel
tippetop or tippe top
for instance: http://www.fysikbasen.dk/TippetopENGLISH.php

IN THE FOREST

Antispammer tricks
Eliminate commercial noise:
-".com" -hot -tits -money

Check for the existence of the target
Build your queries in a way that also probes *if* their claims are substantiated. Example:

HotBot:
+"Traktor DJ Studio" +linkextension:zip
+"Traktor DJ Studio" +linkextension:rar
+"Traktor DJ Studio" +linkextension:ace

AllTheWeb:
+"Traktor DJ Studio" +link.extension:zip
+"Traktor DJ Studio" +link.extension:rar
+"Traktor DJ Studio" +link.extension:ace

Altavista:
"Traktor DJ Studio" AND link:zip
"Traktor DJ Studio" AND link:rar
"Traktor DJ Studio" AND link:ace

EVALUATION

"Time will tell"
Be wary of assertions like 'a recent study': on the web there's always a terrible vagueness about exact dates, which means that if you, for instance, search for a simple answer to a question like 'how big is the web', you are likely to land inside clusters of old noise (1996 studies, 1998 papers) and a plethora of 'most recent studies' cited in -say- 2001...
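The wariness described above can be partly mechanised. The following is a rough Python sketch, not a real evaluation method: it looks for explicit four-digit years in a snippet of text and for vague phrases of the 'recent study' kind. The function name `assess_snippet`, the phrase list and the 1980-2099 year range are all assumptions chosen for this illustration.

```python
import re

# Assumption: only years from 1980 to 2099 count as plausible dates.
YEAR_PATTERN = re.compile(r"\b(19[89]\d|20\d{2})\b")

# Hypothetical list of vague time claims; extend it as needed.
VAGUE_PHRASES = ("recent study", "most recent", "latest study", "new study")


def assess_snippet(text):
    """Report explicit years and vague time claims found in a text snippet."""
    years = sorted({int(year) for year in YEAR_PATTERN.findall(text)})
    lowered = text.lower()
    vague = [phrase for phrase in VAGUE_PHRASES if phrase in lowered]
    return {
        "years": years,
        "vague_claims": vague,
        # A vague claim with no year anywhere nearby deserves suspicion.
        "undated_claim": bool(vague) and not years,
    }


print(assess_snippet("A recent study shows that the web holds 19 terabytes."))
print(assess_snippet("According to a 1998 paper and a 2001 survey, it is bigger."))
```

The first snippet is flagged as an undated claim, while the second one at least tells you that its newest source dates from 2001. Neither result says anything about whether the claim is true; it only tells you how old the noise might be.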