Design and develop multi-scale query processing technologies for efficient and timely processing of ad-hoc queries over huge data volumes.
A key assumption underlying contemporary database architectures is to prepare the database storage and processing scheme for instant delivery of precisely articulated queries (e.g., fetch based on key or simple (join) predicates). However, ad-hoc querying hardly ever is precise, i.e., based on a formulation that can be answered quickly. Instead, aggregate queries and summaries over large subspaces are the scheme for the user to zoom in onto the answer set. Consequently, the query interaction paradigm has reached its end of life in the context of extreme large databases constructed from scientific experiments (e.g., astronomy) and large scale distributed sensory systems (e.g., health, surveillance, logistics).
Description of work
Current systems cannot cope with such ad hoc queries over large datasets. We plan to investigate a completely new query processing paradigm that removes many of the current restrictions. For example, modern systems need to load and analyze all data before being able to process it. Our architecture aims at an incremental data consumption driven by processing needs, i.e., no need to load, analyze and store data not used by queries. Another pitfall of modern systems is the need to build a priori and complete query accelerators, i.e., indexes. This is prohibitive with huge datasets as it would cost an enormous amount of time unless we know exactly which portions of the data are relevant. However, this is not the case for ad hoc query processing.
We aim for a scheme that summarizes data portions succinctly, replace them by analytical models, and adaptively partition the database storage using the “database cracking schemes” pioneered at CWI [1,2]. The idea is that all necessary accelerators are built incrementally, automatically and in a transparent way to the user and the challenge is to achieve this with ad hoc queries in huge datasets. To achieve the goals described above, the query execution engine is replaced by a new one that can detect answer convergence and also can exploit randomized sub-plan evaluation to tackle the query load.
The approach taken is evaluated against the databases provided by KNMI, TomTom and Hyves in particular, but extended to cover other science databases as well. These offer good examples of our motivation scenarios. For example, TomTom contains a massive database that is queried continuously by numerous users. Given the vast amount of users, queries are ad hoc, continuously changing focus and given the nature of the application, answers need to be formulated fast otherwise they are not useful. Our approach aims to tackle exactly these kinds of problems with fast answers of ad hoc queries over huge data.