Antoni van Leeuwenhoek (1632-1723), a Dutch merchant, revealed the hidden worlds of bacteria, protozoa and blood cells to the scientific community through his own technology of handcrafted lenses. Using detailed observations recorded in laboratory journals, he laid the foundations of what has become the discipline of microbiology. Handwritten laboratory journals have since been replaced by digital archives, too complex to grasp by a single researcher's eye. Such massive amounts of data call for new "lenses", based on innovative database and data mining technology, to enable future scientific discovery.
The computer science research topics in the Time-Trails project can be captured in a single phrase: how to handle huge amounts of data for information gathering and efficient decision-making. The research focuses on trajectories: the "footsteps" left behind by mobile users, earth rumbles and social portal interactions. These share a common base of "who, when, where, what" messages that are highly correlated in time and space. They arrive in very large quantities, often without human interaction, and are retained for long periods.
The core ICT science topics are: How do we store and index these data? How can we find useful patterns of behavior in such volumes? How can we simplify discovery through better exploration? How do we categorize, classify and learn user profiles? And, last but not least, how do we ensure clean data? All this in the context of Big Data, where efficient use of the underlying computer infrastructure is essential to keep up with the ever-increasing drive to collect and analyse data.
Biggest results so far
Geographically exploring Twitter hot-spots
Geographical data are typically visualized using various information layers that are displayed over a map. Interactive exploration by zooming and panning requires real-time recalculation. For layers containing aggregated information (such as counts and sums) derived from voluminous data sets, such real-time exploration is impossible using standard database technology: the calculations simply take too long.
ICT science question: a common operation in calculating with multidimensional data is the computation of aggregates. To obtain exact results with high performance from high data volumes, the challenge is to find clever ways of pre-calculating as much as possible. An additional technical challenge is to develop technology that fits into standard open-source database and GIS software.
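To illustrate the pre-calculation idea (a minimal, illustrative sketch, not the project's actual implementation), the example below bins points into a grid once and builds a 2D prefix-sum table over it. Any rectangular map window, at any zoom or pan position, can then be counted in constant time instead of rescanning the raw points.

```python
# Pre-aggregation sketch: a 2D prefix-sum ("summed-area table") over a
# grid of point counts answers any rectangular count query in O(1).

def build_grid(points, width, height):
    """Bin (x, y) points into a width x height grid of counts."""
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        grid[int(y)][int(x)] += 1
    return grid

def build_prefix(grid):
    """Pre-calculate cumulative sums once, at data-load time."""
    h, w = len(grid), len(grid[0])
    pre = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            pre[r + 1][c + 1] = (grid[r][c] + pre[r][c + 1]
                                 + pre[r + 1][c] - pre[r][c])
    return pre

def count_in_rect(pre, x0, y0, x1, y1):
    """Count of points in [x0, x1) x [y0, y1) -- the visible map window."""
    return pre[y1][x1] - pre[y0][x1] - pre[y1][x0] + pre[y0][x0]
```

A production system would maintain such pre-aggregates per zoom level (a tile pyramid) inside the database, but the principle is the same: invest in pre-calculation once so that interactive exploration never touches the raw data.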
Involved COMMIT/partners: NSpyre, Arcadis, UTwente.
Scavenger Hunt game helps in customer recommendations
There are many web portals that aim to recommend businesses such as shops, restaurants and cafes. These businesses are interested in knowing how many people visit them. We help them by developing algorithms that determine actual visits using only GPS traces from mobile phones and public data from the web.
ICT science question: what is the best way to detect a visit to a point of interest from GPS traces recorded by mobile phones? Such a visit can be computed geometrically as an intersection of the GPS trajectory with a polygon describing the circumference of the point of interest. Such polygon data are, however, not available. We have developed algorithms that estimate circumference polygons of points of interest by analyzing coarser-grained map data and data on other objects. Our algorithms produce quite accurate results even when data of substandard quality is used. The latter is important, because it allows the application to rely on publicly available data only.
Involved COMMIT/partners: Eurocottage, My Datafactory, UTwente.
Rapidly visualizing Wikipedia page views
Many sectors in our modern society are producing more and more data: science, medicine, finance, business, transportation, retail and telecommunication, to name a few. Visualization is an effective way to interpret the meaning of these data. We develop techniques that greatly speed up the statistical processing of large amounts of data. We use these techniques to rapidly visualize the statistical results.
ICT science question: how can we speed up the statistical processing of large amounts of data? What are the best visualization techniques for the statistical analysis of large data sets? Complex statistics are usually limited by data volume, since statistical tools are not built to handle massive amounts of data. We embed a statistical processor into a high-performance relational database (MonetDB). This combination is unique: the translation between the two systems is minimal, making it up to one hundred times faster than comparable systems. It has the potential to deliver new insight into massive amounts of data.
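The principle behind this embedding is to run the statistics where the data live, avoiding a costly transfer of rows into a separate statistics tool. As a minimal stand-in (using the stdlib sqlite3 engine rather than MonetDB, so the sketch stays self-contained), the example below registers a user-defined aggregate so that a streaming standard deviation executes inside the query engine itself:

```python
# In-database statistics sketch: a Welford streaming standard deviation
# registered as a SQL aggregate, so the statistic runs inside the engine.
import math
import sqlite3

class StdDev:
    """Streaming (Welford) population standard deviation."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def step(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
    def finalize(self):
        return math.sqrt(self.m2 / self.n) if self.n else None

con = sqlite3.connect(":memory:")
con.create_aggregate("stddev", 1, StdDev)
con.execute("CREATE TABLE pageviews (page TEXT, views INTEGER)")
con.executemany("INSERT INTO pageviews VALUES (?, ?)",
                [("a", 2), ("a", 4), ("a", 4), ("a", 4), ("a", 5),
                 ("a", 5), ("a", 7), ("a", 9)])
(sd,) = con.execute("SELECT stddev(views) FROM pageviews").fetchone()
print(round(sd, 2))  # prints 2.0
```

MonetDB's actual integration goes further (the statistical code operates directly on the database's columnar data, with no per-row conversion), which is where the large speedups over connector-based setups come from.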
Involved COMMIT/partners: CWI, MonetDB.
Lost in your data? Let Blaeu give you a few tips
Companies, governments, organizations and scientists have access to more and more data. But only a few of them have access to enough statisticians, visualization experts and processing power to explore the full richness of these data. To solve this problem, we have developed a data exploration tool called Blaeu. Our system is named after the famous 17th century Dutch cartographer Willem Blaeu. Our 21st century Blaeu is a digital 'data cartographer'.
ICT science question: how can users who know close to nothing about databases become data scientists?
Involved COMMIT/partners: CWI, MonetDB.