The biggest science story to hit the mainstream media in the last year was of course the big switch-on at CERN. What made it such a great story for me was not just the sheer, audacious enormity of the enterprise or the humbling nobility of the colossal experiment but the story behind the story. That story was the absolutely central role of free software philosophy at the heart of everything CERN was (and is) doing. Despite the false start, CERN's search for the Higgs Boson has got into its stride. The same cannot be said for the car crash that is climate science, which may have inflicted terminal damage on the reputation of science. I believe the rigorous application of free software methodology in conjunction with the Fourth Paradigm may save it.
This is not the first time I have written about applying the practices and principles of free software to freeing science from the shackles of proprietary software and organisational structures, whether it is DIY biology or doing science the CERN way. In this instance the stakes are much, much higher. I described the problems in an article on Climategate.
Having re-read that last article, I realised that something was missing, or rather that it is necessary to take free software to a new level of application. In my article on Climategate I made the rather obvious point that the IPCC and CRU needed to do their science the free software way in order to restore the integrity of the peer review process. Since that piece Climategate seems to have precipitated a domino effect. It has been followed rapidly by Glaciergate, Amazongate and Pachaurigate, which have simply reinforced the conviction that the IPCC and the CRU have gutted that process like a fish on the Grimsby dockside. There are many problems here but one of the main ones is the "deluge of data" produced by all branches of science, which threatens to overwhelm the most fastidious peer review process, even the ones free of political ideology and financial self-interest.
Science has come a long way since the era of the gentleman amateur scientist doing small scale experiments alone and self financed. It's big business now and is subject to all kinds of commercial and political pressure and external formal controls. The deluge of data is, like the social web, threatening to suffocate us. The combination of data explosion and external control makes it utterly imperative that information is utterly free--in terms of access. Access to see the raw data, the computer models and the control and flow of information.
It wasn't until I read about the concept of the Fourth Paradigm that I realised that some very enterprising people have seen and defined the problem, a problem afflicting all areas of science which produce output in the Petabyte range--both in terms of the data and the literature. Bodies like CERN and astronomers spew out enormous amounts of data. Science has gone through the experimental, the theoretical and the computational. It is now in the midst of the fourth stage: the Fourth Paradigm, characterised by the analysis of massive data sets (in which, increasingly, new discoveries are made, not by hypothesis, but by data-driven computing). It's the only way to even begin managing the tsunami of data. More importantly, if it's free, open and utterly transparent it will be less likely to be burdened or corrupted even by big corporations, laboratories or governments. Equally importantly, it can speed the pace of research by facilitating the faster exchange of data between researchers, both inside and outside formal scientific structures.
I can't believe that I just wrote that, but I did -- and here's the reason why: the concept of the Fourth Paradigm is largely associated with Jim Gray, an ex-employee of Microsoft. He disappeared in a boating accident in 2007 and had been working on this concept. (The story of the search for his boat was a tour de force in itself of how people do things better when they're done outside closed hierarchies.)
"Democratize access to rich data.....". Are they having a laugh?
Microsoft have published a book called "The Fourth Paradigm: Data Intensive Scientific Discovery" (2009). You can purchase a paperback or Kindle version but they've also made it freely available as a downloadable PDF. There are two versions: a bog standard version (6MB) and a high resolution version weighing in at a hefty 93MB. Not only have Microsoft made this 287 page book freely available, they have -- wait for it -- made it available under the Creative Commons licence Attribution Share Alike 3.0. It's a ripping good read and would be worth the hard copy price ($46 US) in its own right. However, there's always a but, and it was a comment by Glyn Moody in a Google Buzz that alerted me to the real likely purpose behind the book freebie. Microsoft have been pushing to break into academic computing via deals on HPC systems and Azure, their proprietary cloud platform. Moody is right. It is a lure for the unwary. I defy you not to laugh at the following excerpt from their latest pitch on cloud computing:
Equally importantly, to address 21st century challenges, we must democratize access to rich data and complex computational models, empowering a broader range of researchers via easy-to-use cloud tools
"Democratize access to rich data.....". Are they having a laugh?
The essential argument of the book, expressed through various branches of science, is that what is universally made available via an internet-based e-repository is shared and what is shared is free, what is free must be open and what is open accelerates the progress of human knowledge (shades of Vernor Vinge's Singularity?) and makes for better and more efficient management of knowledge and data. Although the book never explicitly says so, the implication is that the peer review process should not be corrupted or hijacked, and never used to bully or exclude, or to intimidate dissenters by appeals to consensus or (arbitrary) authority.
However, the sheer volume of data being produced in the Petabyte range in areas as diverse as particle physics, astronomy and genetics precludes the possibility of moving stuff around the web, even over grids and clouds--if only for reasons of bandwidth. Just to take a few examples to give some idea of the eye-watering volumes being produced: the Trace database for DNA sequencing weighs in at 65 Terabytes, and our good friends at CERN produce 16 Petabytes per year from the Large Hadron Collider. Those figures could be replicated right across the technical and scientific spectrum.
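To put the bandwidth point in rough numbers, here is a back-of-envelope sketch in Python. The volumes are the ones quoted above; the link speeds are illustrative assumptions of mine, not measured figures, and the sums assume decimal units and a perfectly saturated link, which no real network will give you.

```python
# Back-of-envelope transfer times for the data volumes quoted above.
# Link speeds are illustrative assumptions, not measurements.

TERABYTE = 10**12  # bytes (decimal units assumed)
PETABYTE = 10**15

DATASETS = {
    "Trace database (65 TB)": 65 * TERABYTE,
    "LHC, one year (16 PB)": 16 * PETABYTE,
}

# Hypothetical sustained link speeds in bits per second.
LINKS = {
    "100 Mbit/s": 100 * 10**6,
    "1 Gbit/s": 10**9,
    "10 Gbit/s": 10 * 10**9,
}


def transfer_days(size_bytes, bits_per_second):
    """Days needed to move size_bytes over a fully saturated link."""
    seconds = size_bytes * 8 / bits_per_second
    return seconds / 86400


if __name__ == "__main__":
    for name, size in DATASETS.items():
        for link, speed in LINKS.items():
            print(f"{name} over {link}: {transfer_days(size, speed):,.1f} days")
```

On these assumptions the Trace database would take about six days to move over a 1 Gbit/s link, while a single year of LHC output would take roughly four years. In other words, the data would arrive more slowly than it is produced.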
One idea is an extension of cluster computing (something GNU/Linux does well with Beowulf). This has been extended with Graywulf (named after Jim Gray), described by Microsoft as "scalable software architecture for data intensive computing". It utilizes commodity servers and, as this is a Microsoft Research project, you can bet that it is not free and open. Proprietary ownership of the software and the cost of servers would effectively exclude many from making any contribution outside the rigid, closed hierarchical Cathedral of formal science, government funding and peer review cliques, where exclusion and abuse replace openness and the spirit of collegial co-operation. I suspect that is why a site like the Extreme Linux Page was set up (though it looks a little dated). Of course the free software community has had its fair share of cliques, disagreements and forking but as there is no central ownership it cannot have the same impact as proprietary monopolies.
Jim Gray's vision (and the vision of others) was to have all the data and all the scientific literature distributed online. That requires software tools and, as you've probably guessed, free software has an app for that (Apple is not the only fruit!).
Alright, R is not the catchiest name. It's more important to know that the software is licensed under the GNU GPLv2. It has been developed by the free software community. The software is cross platform and available as precompiled binaries for Debian, Red Hat/Fedora, SUSE and Ubuntu (Hardy, Intrepid, Jaunty and Karmic). R is a programming language and a software environment for statistical computing and data analysis. It runs on the command line but it also has numerous graphical user interfaces as well as support for Bluefish, Emacs, Kate and Vim. R even has support for scripting languages like Python.
The phenomenon known as flash mobbing has a trivial reputation that occasionally makes the news headlines but, like many other things, it can be turned to a serious purpose. Flashmob computing went public in April 2004 at the University of San Francisco (after sending out a call to action via Slashdot) when an ad hoc computer cluster was assembled, utilizing specific software (Morphix Linux off a live CD) to co-ordinate desktops and laptops into a single supercomputer. The initial aim was to enter the top 500 list of supercomputers (of which the GNU/Linux OS has 78%). That failed but the organisers did manage to get 150 commodity computers to produce 77 Gflops.
That was back in the mists of 2004. A visit to FlashMobComputing.org seems to indicate little activity since then. That's a pity, for although the initial purpose was to get into the top 500 it's also obvious that the concept could be utilized to bring that kind of public computing power to bear on the Fourth Paradigm. In theory at least it is entirely possible for scientists to put out, via the web and especially the social networking sites, a call for some ad hoc supercomputer clustering at specific locations on a specific day for the purpose of solving specific problems, or just to do some big number crunching or analysis of huge data sets to discover new stuff.
There are drawbacks of course. Distributed and grid computing can be done in virtual space and are easier to "organise". No need to hire a physical location and get masses of bodies to one central location. The other very obvious downside is that it fails to leverage numbers. Flashmob computing is confined to one place. The internet is not. You can call upon a global community. However, as I indicated earlier, given the colossal amounts of data being continually spewed out by scientific experiments, the current solutions like cloud, grid and distributed computing are increasingly not sufficient on their own.
There is nothing to prevent splicing the two methodologies together. It would be a bit of an effort but it would be quite possible to initiate flashmob computing instances across many global locations at a given time, co-ordinated via the internet and use some of the techniques of cloud, grid and distributed computing to glue it all together (bandwidth permitting). Perhaps that is what "swarmcreativity" would actually look like. However, the flashmob computing site looks pretty inactive as the last reported activity seems to be in 2004. Perhaps with the Fourth Paradigm it may get a new lease of life, though I doubt it.
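One way to picture the gluing is the split-and-merge pattern common to grid and distributed computing: each location crunches only the data it holds and sends back a tiny summary, so the bulk data never has to cross the network. The Python sketch below is my own illustration under that assumption. The site names, the readings and the functions `summarise` and `merge` are all hypothetical, and real sites would of course be separate machines rather than entries in a dictionary.

```python
# Minimal split-and-merge sketch: each "site" summarises its local data,
# and a coordinator combines the summaries. All names are hypothetical.
import math


def summarise(values):
    """Run at each site: reduce local data to (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def merge(summaries):
    """Run by the coordinator: combine summaries into a global mean and std dev."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return mean, math.sqrt(max(variance, 0.0))


if __name__ == "__main__":
    # Made-up readings held at three flashmob locations.
    sites = {
        "site_a": [2.0, 4.0, 4.0],
        "site_b": [4.0, 5.0],
        "site_c": [5.0, 7.0, 9.0],
    }
    mean, std = merge([summarise(v) for v in sites.values()])
    print(f"global mean={mean:.2f}, std dev={std:.2f}")
```

Each site ships three numbers however much data it holds, which is the whole point when bandwidth is the bottleneck. The sum-of-squares formula is used here because it is the simplest to show; it can lose precision on very large or badly scaled data, so a serious effort would want a more careful method.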
The average home computer user is overwhelmingly Windows-centric. Science is heavily GNU/Linux-centric. The scientific community, being more technically literate, know, use and understand free and open software. The Institute for Theoretical Physics at the University of Magdeburg use TINA (Tina is not an acronym!), a Beowulf cluster using GNU/Linux for massively parallel supercomputing. Serious stuff, but I like their final page entry explaining their motivation: "Hey, it's fun, really. Get some old PCs, get a cheap hub, a few cables, a penguin and enjoy your private supercomputer."
For once, just once, GNU/Linux has got in on the ground floor, got to the high ground first and occupied it. Microsoft will, with their usual combination of FUD and bloated coffers swollen by their patents and licences, attempt to dethrone free and open software in an area crucial to the future of science. It must not happen.