Dealing with the data flood

In nearly every area of business and science, and even in our private lives, we are confronted with an increase in data flows. This data holds valuable information and may provide us with knowledge, but it is often inaccessible because of its form and volume. Fortunately, new methods are emerging and evolving that enable us to create knowledge from data: specifically, methods and tools that extract previously unknown information from aggregations of data.

Date August 18th 2002
Author Jeroen Meij

Prologue

Many companies are already using data mining techniques to approach their target group of potential customers. The same techniques will support the design of the optimal ripening and storage strategy for fruits.
Some of these techniques will also provide a bird’s eye view of document collections, including relations between documents. In the life sciences, data mining will assist in assigning functions to genes and in linking chemical structures (drugs) to biological effects.

Envisat, a new environmental satellite, has just been put into orbit. Data mining will help in interpreting and understanding the high-resolution data this satellite will provide.
This understanding may be the most important effect of the use of data mining tools: it will enable us to learn and to increase our knowledge base.

The exact value of knowledge has been the topic of much debate over the last decades. Early in the 21st century, both the economic and the societal value of knowledge are widely recognized. This is true not only in academic fields, but also in industrial research and in any other ‘learning’ environment.

This comes as no surprise: by definition, obtaining knowledge requires personal effort, making knowledge an enduring scarcity and therefore a valuable asset.
This book describes new tools and directions that will help us convert something cheap and abundantly available, data, into something scarce and valuable: knowledge.

The Hague, April 2002

Ir R.M.J. van der Meer
Chairman STT/Beweton

Preface

by Prof D.J. Hand, Department of Mathematics, Imperial College, London, United Kingdom

With their use of the telescope, Galileo Galilei and others opened the doors to the macroscopic universe. They enabled us to see objects which were so far away that they were invisible to the unaided eye. With their use of the microscope, Antoni van Leeuwenhoek and others opened the doors to the microscopic universe. They enabled us to see objects which were so small that they were invisible to the naked eye. These instruments, the telescope and the microscope, amplified natural human abilities many millionfold, permitting humanity to study and understand things whose existence we could previously never even have dreamed of. This book describes, and illustrates with real case studies, another set of instruments, which enable us to see things we could never perceive with the unaided eye and brain. Telescopes explore gigantic objects, and microscopes explore minuscule objects. The instruments described in this book explore aggregate objects. Aggregate objects are collections of data describing many individual objects. The constituent objects have properties, and one can study such objects singly, but the unassisted mind cannot study an aggregate object as a whole. This would not matter if the properties possessed by the aggregate object were the same as those possessed by the individual objects. But they are not. Aggregate objects have other properties, often quite different from those of their constituents. And aggregate objects often have properties which their constituents cannot possess.

What sort of things are aggregate objects? A human population is an aggregate object. The collection of purchases by shoppers in a supermarket is an aggregate object. The set of descriptions of all the visible stars is an aggregate object. Descriptions of segments of the human genome form an aggregate object. A collection of paths taken when surfing the web is an aggregate object. A company’s database of credit card transaction records is an aggregate object. A library of extracts from a newspaper is an aggregate object.

And what sort of properties do aggregate objects possess? In particular, what sort of properties do such objects possess that their constituents cannot? A human population can have several different kinds of individuals within it, but a single individual is of only one type. The collection of purchases by shoppers in a supermarket may enable one to predict how new customers will behave, but such a prediction cannot be made merely by observing one shopper. By studying a collection of stars, we can develop a theory about the natural life stages of a star, but, short of watching for billions of years, this cannot be done by observing a single star. By studying similarities and differences between genome sequences, we can determine the causes of and possible treatments for disease, but this cannot be done by studying a single gene sequence in isolation. And by studying patterns of credit card transactions, we can detect an account which might be fraudulent; again, this cannot be done by studying a single transaction.
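The credit card case can be made concrete with a small sketch. It assumes a hypothetical list of transaction amounts and uses a simple robust rule (distance from the median, measured in units of the median absolute deviation); the function name, the threshold, and the figures are all invented for illustration and are not taken from the case studies in this book. The point is only that the amount 950.0 says nothing on its own; it stands out solely in relation to the collection.

```python
from statistics import median

def flag_unusual(amounts, threshold=5.0):
    """Return the positions of amounts that lie far from the bulk of the collection.

    Distance is measured from the median, in units of the median absolute
    deviation (MAD). Both the rule and the threshold are illustrative choices.
    """
    centre = median(amounts)
    mad = median(abs(a - centre) for a in amounts)
    if mad == 0:
        # All values (or most of them) are identical; this rule cannot rank them.
        return []
    return [i for i, a in enumerate(amounts) if abs(a - centre) / mad > threshold]

# Invented transaction amounts for a single hypothetical account.
amounts = [12.5, 9.9, 14.2, 11.0, 13.3, 10.7, 950.0, 12.1]
flagged = flag_unusual(amounts)
print(flagged)  # positions of the transactions that stand out from the rest
```

Applied to a list containing a single transaction, the same function returns nothing, which mirrors the argument above: the property "unusual" belongs to the aggregate, not to the individual record.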

This book, then, describes and illustrates instruments for seeing beyond ourselves, for exploring the properties of objects which we cannot grasp with the unaided brain. 

The instruments (tools, methods, techniques) described in this book are very much children of the computer age. To study an aggregate object, described in terms of its individual constituents, requires an ability to sort, extract, combine, and otherwise manipulate the symbols describing the various attributes of the individuals. Computers provide us with this ability. Computers process the data describing the individuals, converting it into information about them, and then transforming that information into knowledge. This distinction between data, information, and knowledge is an important one. Data are simply symbolic descriptions of the individuals. By themselves they mean nothing. Data with semantics, however, are information. Give me the raw datum that the height of a man is five, and it means nothing. Tell me that the man is five feet tall, and it is useful information. Put it in the context of the general height of men, and it is knowledge which I can use. Give me a huge body of numerical data and I can do nothing with it. But give me, in addition, the tools illustrated in this book, and I can find relationships, I can recognize structures, and I can detect patterns and anomalies. I can discover knowledge.
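The step from datum to information to knowledge can be sketched in a few lines of code. Everything here is an assumption made for illustration: the `Height` type, the `fraction_shorter` helper, and above all the reference heights, which are invented numbers rather than measurements.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Height:
    value: float
    unit: str  # the unit supplies the semantics that a bare number lacks

def fraction_shorter(height, reference_heights):
    """Share of a reference sample (same unit as `height`) that is shorter."""
    shorter = sum(1 for h in reference_heights if h < height.value)
    return shorter / len(reference_heights)

datum = 5                       # data: a number with no meaning attached
information = Height(5, "ft")   # information: the same number with semantics

# Invented reference sample of men's heights in feet, for illustration only.
reference = [4.9, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6.0, 6.1, 6.2]

# Knowledge: where this man stands relative to the sample.
share = fraction_shorter(information, reference)
print(f"{share:.0%} of the reference sample is shorter than {information.value} {information.unit}")
```

The last step is the only one that needs the aggregate: without the reference sample there is nothing to compare the five feet against.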

The tools illustrated here have a long history. The earliest discipline to concern itself with data analysis was statistics. Since the origins of statistics predate the computer, the aggregate objects with which early statistics dealt necessarily involved relatively few constituent objects. With the advent of the computer, however, the breadth of application of statistics increased. In parallel, other disciplines began to develop tools for data analysis, typically with slightly different aims and objectives from statistics. Database technology, naturally, was concerned with such problems, not from the perspective of inference, which was always at the base of statistics, but from the perspective of describing and manipulating an existing database. Machine learning appeared on the scene, again not originally with the aim of analyzing data per se, but rather with the aim of emulating or simulating the way natural systems learnt, and then with the simple aim of building systems which could learn. And, most recently, data mining has appeared in response to the advent of the gigantic data sets which are now accumulating: data sets of billions of data points, that is, of billions of constituent objects, are now commonplace. All of these disciplines overlap. Each has valuable lessons to teach the others. A knowledge of one is insufficient without some knowledge of the others. This book demonstrates the application of such tools.

The scope of application of such methods is unlimited. There is no aspect of human life which is not affected by the need to analyze raw data, by the need to convert data into knowledge. The breadth of different areas discussed in this book demonstrates this beyond question. Furthermore, the exponential increase in the amount of data accumulating, the progress in data acquisition technologies, the dramatic increase in the size of data storage facilities, and the increase in computer power, all of which are discussed in this book, mean that the need for these new tools is becoming ever more important.

I imagine that Galileo and Van Leeuwenhoek must have felt that they were living at the most exciting time in human history, when their tools began to open up the universe to permit the most extraordinary voyages of discovery. The same is true now. The tools described in this book represent a revolution in our ability to see and understand the universe around us. They present us with the means by which to take part in an unprecedented adventure.
