05 November 2011

It's the DATA, Stupid! - The Fourth Paradigm

This is a shorthand way of describing the life-work of a visionary Microsoft research scientist named Jim Gray. A few weeks after he gave a talk on the subject in 2007, he was lost at sea off the coast of California.

Gray was proposing the Fourth Paradigm: a quasi-new scientific approach that says insight can be gathered from manipulating large amounts of data. Manipulate, sort, and graphically express the relationships in very large data sets, and new stuff pops out. You can also apply statistics to very large data sets and have far greater confidence in the results.
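
To put a rough number on that "greater confidence", here is a minimal Python sketch. It assumes made-up measurements drawn from a bell curve (the mean of 50 and spread of 10 are arbitrary choices for illustration, not real data) and compares the standard error of the average for a small and a large sample. The helper name standard_error is my own.

```python
import math
import random

def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return math.sqrt(variance / n)

# Hypothetical measurements: the mean and spread are arbitrary assumptions.
random.seed(0)
small = [random.gauss(50.0, 10.0) for _ in range(100)]
large = [random.gauss(50.0, 10.0) for _ in range(10_000)]

se_small = standard_error(small)
se_large = standard_error(large)

# With 100 times more data, the uncertainty in the average
# shrinks by roughly sqrt(100) = 10.
print(f"standard error, n=100:    {se_small:.3f}")
print(f"standard error, n=10000:  {se_large:.3f}")
print(f"ratio: {se_small / se_large:.1f}")
```

The catch is the square root: to halve the uncertainty you need four times the data, which is part of why the very large data sets matter.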

The First Paradigm is sometimes called empirical science - observational or descriptive science. This is the science carried out by interested folk like you and me over the past several thousand years. That, for instance, is Drosophila melanogaster... what you've been calling the Fruit Fly. You named it, and classified it as a fly of the fruit-eating variety.

The Second Paradigm is analytic science: analysis of scientific observations that leads to an understanding of electricity and magnetism, for example. By careful experiment and observation, Michael Faraday was able to connect electricity with magnetism. From this work, James Clerk Maxwell developed... yep, the famous Maxwell's Equations. I'm not making this up: Maxwell built on the scientific experiments and papers of Faraday to develop a working theory of electromagnetism, complete with an elegant mathematical formalism that haunts undergrad physics students to this day.  Actually, these guys are heroes to physicists as much as Fermi, Bohr, and Einstein are.

An Example: In 1994 I was working in northern Saudi Arabia on a phosphate project.  A monster sandstorm beginning in the Sahara far to the west engulfed us, and for a day it was very hard to work. For the next several days the dust haze hung in the air and I realized that each afternoon I could look directly at the Sun without a filter - with my naked eyes and without injury. I noticed a huge Sunspot cluster in the upper left quadrant, and was so impressed that I could actually see this without instrumentation that I sketched it into my field notebook. The next day I could see it again... and it had migrated downward and right. By the fourth day I had a complete sketch of the movement of this Sunspot cluster.  That is an example of First Paradigm science: observation. FROM those observations, I could deduce (a) that the Sun rotated, (b) where the axis of that rotation was (upper right of the observed disk), and (c) how FAST it rotated (I figured roughly 10 days would bring that cluster if it still existed to the same initial point). That part is the Second Paradigm: I analyzed the data and drew some conclusions from them. (PS: Data are always plural - there is always more than one number).

The Third Paradigm is sometimes called computational science; sometimes it's called simulation science. Think ever larger computers, calculating results from ever finer grids of models of the galaxy, models of a complex earth being deformed by stress leading to an earthquake, giant models used to predict weather.  More or less.

An Example of this is my use of a powerful software package called Geosoft Oasis Montaj: this software allows me to bring in vast amounts of data from any source and process the entire mess. It's generally known among geophysicists that you can only "see" about 15% of the content of magnetic data by hand-contouring many measurements on paper. If I pass frequency filters through the data, I can separate the deep sources from the shallow sources. If I pass derivative filters through it I can find the edges of those sources of magnetic anomalies. If I then do two-dimensional (or higher dimensional) modeling, I can obtain a probable shape of the source(s) of the anomaly(s). Say, an electric pig in a magnetic bathtub. This is computational or simulation science.
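
For the curious, the two filtering ideas above can be sketched in a few lines of Python with NumPy. This is not how Geosoft Oasis Montaj does it internally, and it does not use Geosoft's API; it is a toy one-dimensional profile under loud assumptions. The station spacing, wavelengths, amplitudes, and the helper name lowpass are all invented for illustration. The "deep" source is represented by a long-wavelength wave and the "shallow" source by a short-wavelength one, and a smooth-edged block stands in for an anomaly whose edges we want to find.

```python
import numpy as np

def lowpass(signal, spacing, cutoff_wavelength):
    """Keep only wavelengths longer than cutoff_wavelength (same units as spacing)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=spacing)  # cycles per metre
    spectrum[freqs > 1.0 / cutoff_wavelength] = 0.0  # zero out the short wavelengths
    return np.fft.irfft(spectrum, n=signal.size)

# Hypothetical survey line: 1000 stations, 10 m apart (assumed values).
spacing = 10.0
x = np.arange(0.0, 10000.0, spacing)

# Synthetic stand-ins: long wavelength for a deep source, short for a shallow one.
deep = 50.0 * np.sin(2 * np.pi * x / 5000.0)
shallow = 5.0 * np.sin(2 * np.pi * x / 200.0)
total = deep + shallow

# Frequency filter: split the profile into regional (deep) and residual (shallow).
regional = lowpass(total, spacing, cutoff_wavelength=1000.0)
residual = total - regional

# Derivative filter: a smooth-edged block between 4000 m and 6000 m.
block = 0.5 * (np.tanh((x - 4000.0) / 100.0) - np.tanh((x - 6000.0) / 100.0))
gradient = np.gradient(block, spacing)
left_edge = x[np.argmax(gradient)]   # steepest rise
right_edge = x[np.argmin(gradient)]  # steepest fall

print("edges found at", left_edge, "and", right_edge, "metres")
```

Real data are never this tidy: the wavelengths of deep and shallow sources overlap, and noise makes derivatives jumpy, so the cutoff is a judgment call rather than a clean switch.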

The Fourth Paradigm is a step beyond this. Gray's point was that hey!*  We are collecting vast amounts of data - more data in seconds now than in all previous history before 1950. There MUST be some relationships, connections, new things in all that mess. If we don't DO something with all these numbers, then what is the point in COLLECTING them?

Data mining is an obvious outcome of this sort of work. Clever digital types can use many different sources of data, search for links - relationships or connections - and from all this can pretty much tell some company what you are going to buy this Christmas, where, and how much money you will spend. That is valuable to a company - it allows the company to save money on inventory and helps them set up displays that will get even MORE money out of you. That's a good thing, right? Maybe.

It's already well-established that corporate recruiters need little training in data mining to find out how you party, what you really do, who your friends are, and how honest you are... no matter what your resume may say. A good thing for the HR people, a bad thing for the careless and dishonest job-hunter.

This same data mining can have unequivocally terrible consequences: people supporting the revolutions in Iran and Syria through Twitter, Facebook, and Anonymizer have been found, arrested, and have died because regime agents connected different sources of data and figured out who was trashing their regimes.

For better or worse, we have all reached - and fallen into - the ocean of data lying at the end of our continent of former human interaction. Our lives will never be the same again. The Internet is self-healing and in effect self-replicating.

Big Brother is Skynet, and it has found us. 
You may run, but you cannot hide.


* No pun intended, but a book has been published online, edited by Tony Hey and others (2009), that assembles all the ideas Jim Gray was promoting: "The Fourth Paradigm: Data-Intensive Scientific Discovery".
