Fravia's talk at the T2'05 Conference by Fravia+, August 2005, DRAFT version 1.1
ABRIDGED PRESENTATION The web: Cornucopia of garbage
The title of my own contribution is "The web: bottomless cornucopia & immense garbage dump",
and in fact, as we will see, the web is both:
a shallow cornucopia of emptiness and a deep mine of jewels, hidden underneath tons of commercial garbage.
This will be a talk about contradictions. I will try to present some effective web-searching techniques that
will (should) allow anyone interested to take advantage of some of these very contradictions.
The organizers of this event have chosen to give me two slots, therefore there will be
a presentation with a more "general" tone, followed the next day by a more "concrete" searching-oriented workshop.
Most talks at this kind of conference are (correctly) concentrated on some specific aspect.
This one will -instead- touch on many different areas of searching lore. I wish
to present a BROAD palette of searching techniques.
Let's examine some of the most
startling contradictions of the web. It isn't just a matter of curiosity:
such findings may give us some clues about future developments, and -as we will see- may even
help us to improve our searching skills a little:
knowing WHERE TO FIND an answer is tantamount to knowing the answer itself.
And today's Internet is a truly huge library without indexes:
everything that can be digitized is there, from music to images, from documents to books, from software to confidential memos, it is there
indeed, but where?
The stake is very high:
if we learn to search effectively (and evaluate correctly our findings)
the entire human knowledge will become available, at our command and disposal, no matter where, or how,
somebody may have "hidden" our targets.
As you will see, most searching techniques are -a posteriori- very simple.
Note for instance how interesting, for searchers, a simple "softwarez" querystring can be:
The web was made for SHARING, not for selling and not for hoarding, so -as we will see-
its very "building bricks" deny
to the commercial vultures the possibility of enslaving parts of it. This is but one of many www-contradictions.
Private and commercial databases are for instance mostly open to seekers:
here is an interesting list of oracle_default_passwords.
But we do not always have to resort to 'tricks': the 'real' web of knowledge
is still alive and kicking, albeit uncomfortably buried underneath the sterile sands of the commercial desert. This is
very important for seekers, it means that we have a
'double' edge: we can exploit more or less freely all commercial repositories and
we are able to quickly find the relevant scientific public ones.
Nobody knows how big the web is. Moreover there's an "invisible" (or "deep") "hidden databases" web and a "visible" (or "surface") web.
The "hidden databases"
web is
made out of
dynamic, not persistent, pages.
The content of these searchable databases can only be found by a direct query.
Such pages often possess a unique URL address that allows them to be retrieved again later, yet
without a direct query, the database does not publish a specific page.
The "hidden databases" invisible/deep web is supposed to be (potentially) at least 500 TIMES bigger than the visible/surface web,
and most researchers believe the visible "surface" bulk to be around 32 Billion (milliards) pages, only less than one half
of it covered by the main search engines. It is still growing, albeit at
a slower pace than some years ago.
All the main search engines TOGETHER cover just LESS THAN ONE HALF (and probably less than a quarter) of the "visible" web, and only scattered
pages of the "hidden databases" (depending from link encountered on the static pages).
This limit is VERY IMPORTANT for searchers: it means
that you should use OTHER searching techniques instead of relying on the main search engines (or, even worse, on
google alone: the main search engines do not overlap that much after all).
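How little the engines overlap can be made concrete with a simple set comparison. The sketch below (the result lists are hypothetical, just for illustration) computes the Jaccard overlap between the top results of two engines; a low value means the engines complement each other, which is exactly why you should query more than one:

```python
def overlap(results_a, results_b):
    """Jaccard overlap between two engines' result URL sets:
    |A intersect B| / |A union B|. Low values mean little duplication."""
    a, b = set(results_a), set(results_b)
    return len(a & b) / len(a | b)

# Hypothetical top-results lists from two different search engines:
engine1 = ["site1.org", "site2.net", "site3.com", "site4.edu"]
engine2 = ["site3.com", "site5.org", "site6.net", "site7.com"]
print(f"overlap: {overlap(engine1, engine2):.0%}")  # → overlap: 14%
```

With only one URL shared out of seven distinct results, two engines that each look "complete" actually agree on almost nothing.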
However getting rid of the "commercial noise" is not an option, when
searching effectively the web: it is a PRIORITY.
To give an example, you may concentrate your search on non-commercial sites: you may
for instance find more relevant signal using specific SCHOLAR search engines (and limiting the query to the most
recent months): ddos "june | july | august | september" +2005
This is a pretty useful "ddos" query
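The month-and-year trick generalizes to any topic. A minimal sketch (the function name is mine, not a standard tool) that assembles such a "recent results only" querystring:

```python
def build_recent_query(term, months, year):
    """Combine a search term with OR-ed month names and a year,
    mirroring the 'ddos "june | july | august | september" +2005' trick."""
    month_part = " | ".join(m.lower() for m in months)
    return f'{term} "{month_part}" +{year}'

q = build_recent_query("ddos", ["June", "July", "August", "September"], 2005)
print(q)  # → ddos "june | july | august | september" +2005
```

Swap in your own term and the last few months to keep any query pinned to fresh pages.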
However this is all simple "googling": seeking, once more, is NOT (or only in part) done using the main search engines.
In order to understand searching strategies, a lore which you'll find relevant for your security hobbies and for your real life as well,
you have to grasp not only what the web looks like, but also how the web-tides move.
First of all the web is at the same time extremely static AND a quicksand, an oxymoron?
No, just another of the many contradictions we will see today.
See: less than one half of the pages available today will still be available next year.
Hence, after
a year, about 50% of the content on the Web will be new. The Quicksand.
Yet, out of all pages that are still available after one
year (one half of the web), half of them (one quarter of the web) have not changed at all during the year. The static aspect.
Those are the "STICKY" pages.
Given the low rate of web pages' "survival", historical
archiving, as performed by the Internet Archive, is of
critical importance for enabling long-term access to historical
Web content. In fact a significant fraction of pages accessible today
will be QUITE difficult to access next year.
Another contradiction is the fact that VERY OLD exploits REMAIN ALWAYS VALID and can
be used ("The web is a sticky quicksand").
Some simple rules when searching:
1. always use more than one search engine! "Google alone and you'll never be done!"
2. Always use lowercase queries! "Lowercase just in case"
3. Go regional!
4. Always use MORE searchterms, not only one "one-two-three-four, and if possible even more!" (5 words searching);
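Rules 2 and 4 can be folded into one tiny helper (a sketch of my own, not any engine's API) that lowercases everything and joins several search terms into a single query:

```python
def normalize_query(*terms):
    """Apply the rules 'lowercase just in case' and 'one-two-three-four':
    lowercase every term and combine several of them (aim for 4-5, not 1)."""
    return " ".join(t.strip().lower() for t in terms)

q = normalize_query("WEP", "Kismet", "Wardriving", "access point")
print(q)  # → wep kismet wardriving access point
```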
Anonymity (and lack of anonymity) represents
one of the most startling contradictions of today's web.
ISPs are now bound to keep track of ALL logins and emails of
all their users, burning them on DVDs and delivering them at once, for any whimsical reason, to the powers that be.
As you all know, a typical ISP-logging logs EVERYTHING and then some.
Is it the end of anonymity?
Nope. In fact the other side of this very interesting contradiction
is to be seen with any wardriving laptop.
WEP encryption is a joke, and anyone using Kismet for GNU-Linux (source code here)
or Retina Wi-Fi scanner for Windoze can bypass it
pretty quickly.
But there's not even the need to bypass weak WEP encryption: you'll find a plethora of completely open
access points everywhere.
Provided you are a tad careful with your personal data -especially when uploading- and provided
you remember that THERE ARE MANY OTHER IDENTIFIERS
inside your box -and not only your wifi MAC_address- you may browse the web with some amount of relative anonymity.
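A quick way to see a couple of those identifiers for yourself, using only the Python standard library (note that `uuid.getnode()` usually returns the hardware MAC address, but may fall back to a random number on some systems, and a real box leaks many more identifiers than these):

```python
import platform
import uuid

# uuid.getnode() returns a 48-bit integer, normally the hardware MAC address.
mac = uuid.getnode()
mac_str = ":".join(f"{(mac >> s) & 0xff:02x}" for s in range(40, -8, -8))

# Two of the MANY other identifiers your box carries besides the wifi MAC:
print("MAC     :", mac_str)
print("hostname:", platform.node())
print("OS      :", platform.system(), platform.release())
```

Anyone serious about relative anonymity should know what each of these looks like before worrying about open access points.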
The main reason you should use more than one main search engine is that
search engines overlap FAR less than you would think.
Hence the importance of using OTHER METHODS to search the web, and not only the main search engines.
Here some hints (more about these techniques tomorrow):
1) go regional, then go regional again
2) go FTP
3) go IRC
4) go USENET/MESSAGEBOARDS/BLOGS (yet remember that blogs are nothing more than messageboards where only the owner can start a thread,
this being the main reason -with few exceptions- of their quick obsolescence, short
duration and scant utility)
5) use homepages/rings/webarchives, cached repositories
6) use luring & social engineering
7) use stalking & trolling
Book searching is quite important for seekers, and in this context
"Rapidshare" searches are worth a digression per se.
We will examine various book searching approaches,
anyway at the moment -with books- even banal arrows
will deliver whatever you want.
The depth and quantity of information available on the web, once you peel off the stale and useless commercial crusts,
is truly staggering. I will present
some examples intended to give "a taste" of the
deep depths and currents of the web of knowledge...
The possibility to find whatever the human race can digitize on the web basically means that
the lighthouse keeper, the young kid in central Africa and the yuppie in New York all have access to the
same resources: location is now irrelevant for mankind.
Yet remember that
there are not only files on the web, but also solutions. These may prove to be more and more important in the near future.
Of course, as a consequence (and last contradiction) on such an open web
learning to discern CRAP (and
learning to reverse advertisers' tricks)
will be MORE and MORE important in the future.
Your capacity not to be fooled, to understand the rhetorical
tricks, will be of PARAMOUNT importance. So evaluation is the other face of the searching medal.