
Statistical analysis of association and dependence in complex data

Research project P6/03 (Research action P6)

Description:

One key aim of statistics is to analyse in an appropriate way the dependence and association present in a dataset. The data collected nowadays to study such dependence structures are often of a complex nature, and the research questions are of ever-increasing complexity. This calls for the construction of new models, or the adaptation of existing ones, which is a challenging task. Coping with these complex data will also require the development of new methods and intensive interaction between experts.

The global objective of the network is to develop new models and methodological tools for inference on these complex data structures. To achieve this goal, the network will be structured in five interlocking workpackages, each devoted to a different type of complex data structure studied within the network.

1. Workpackage 1: Multivariate data with qualitative constraints

In statistical analysis, the quantity about which one wants to estimate or test a hypothesis often satisfies certain natural qualitative constraints, which have to be taken into account if one wants to make full use of the nature of the data. Examples of such constraints include boundaries (e.g. in frontier analysis), monotonicity, convexity, ellipticity or independence of unobserved components, unimodality, and sparsity. Qualitative constraints also arise when using dimension reduction techniques, analysing functional data, or dealing with inverse problems. A wide spectrum of statistical techniques is required to analyse this type of complex data. Building on the results obtained in the previous phase (Phase V) of the network, new challenging research questions will be studied in this area, such as the estimation of stochastic boundaries or the use of dimension reduction techniques with incomplete data.
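
As a small illustration of estimation under such a qualitative constraint, the sketch below fits a monotone (isotonic) regression to simulated data. It is a generic textbook example, not a method of the network; the data and variable names are invented for illustration.

    # Minimal sketch: least-squares regression under a monotonicity
    # constraint, one of the qualitative constraints mentioned above.
    # All data here are synthetic and purely illustrative.
    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    y = np.sqrt(x) + rng.normal(scale=0.1, size=x.size)  # increasing trend + noise

    # The isotonic fit is the least-squares solution constrained to be
    # non-decreasing in x (pool-adjacent-violators algorithm).
    iso = IsotonicRegression(increasing=True)
    y_hat = iso.fit_transform(x, y)
    print(np.all(np.diff(y_hat) >= 0))  # True: the constraint holds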

2. Workpackage 2: Temporally and spatially related data

Methods based on principal component analysis to forecast individual variables from a large panel of time series are widely studied in the economics literature, and will be further explored and compared. The work on dynamic factor models, already initiated during Phase V of the network, will also be pursued further. Another topic of great interest to the network is the study of non-stationary time series. The achievements in the domain of locally stationary time series, which were extensively studied during Phase V, will be further developed with a new emphasis on goodness-of-fit tests, adaptive inference, and time-space modelling. A unified approach will be taken in order to jointly address several types of non-stationarity (such as time-varying coefficient models, unit root models, ...).
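
A minimal sketch of the principal-component (diffusion-index) forecasting idea on a simulated panel: common factors are extracted from the panel and the target variable is regressed on the lagged estimated factors. The panel, factor count, and all names are illustrative assumptions.

    # Sketch of principal-component forecasting: extract common factors
    # from a large panel, regress the target on the lagged factors.
    import numpy as np

    rng = np.random.default_rng(1)
    T, N, r = 200, 100, 3                     # periods, series, factors kept
    F = rng.normal(size=(T, r))               # latent common factors
    X = F @ rng.normal(size=(r, N)) + rng.normal(scale=0.5, size=(T, N))
    y = F[:, 0] + rng.normal(scale=0.3, size=T)  # target driven by factor 1

    Xc = X - X.mean(axis=0)
    # Principal components = leading left singular vectors of the panel.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    F_hat = U[:, :r] * s[:r]                  # estimated factors

    # One-step-ahead forecast: regress y_{t+1} on estimated factors at t.
    A, *_ = np.linalg.lstsq(F_hat[:-1], y[1:], rcond=None)
    y_forecast = F_hat[-1] @ A
    print(round(float(y_forecast), 3))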

3. Workpackage 3: Incomplete data

Several types of incompleteness arise in practice: missing data, censored data (right censoring, interval censoring, ...), truncated data, misclassified data, coarse data, and so on. The main focus will be on censored and missing data. In particular, the work on nonparametric estimation with censored data and on frailty models, already studied in detail during Phase V, will be pursued further, and it will be studied how repeated measurements and survival data can be modelled jointly. The analysis of missing data will focus on sensitivity analysis and on combining latent structures with mixed and mixture modelling ideas. Another research topic of great interest to the network is the estimation of causal effects of observed exposures, measured with error, in randomised studies.
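
As an illustration of nonparametric estimation with right-censored data, the sketch below computes the classical Kaplan-Meier product-limit estimator on a small invented sample; this is a standard textbook construction, not the network's own method.

    # Sketch: nonparametric survival estimation with right-censored data
    # (Kaplan-Meier product-limit estimator). Data are made up.
    import numpy as np

    time = np.array([2.0, 3.0, 3.0, 5.0, 7.0, 8.0, 11.0])
    event = np.array([1, 1, 0, 1, 0, 1, 1])   # 1 = observed event, 0 = censored

    S = 1.0
    for t in np.unique(time[event == 1]):     # distinct event times, in order
        at_risk = np.sum(time >= t)           # subjects still under observation
        deaths = np.sum((time == t) & (event == 1))
        S *= 1.0 - deaths / at_risk           # product-limit update
        print(f"t = {t:4.1f}  S(t) = {S:.3f}")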

4. Workpackage 4: Data with latent heterogeneity

Unobserved heterogeneity can be modelled in different ways. A natural and common approach is to use mixed models, which were extensively studied during Phase V. The expertise gained opens the door to studying, in particular, mixed models with partially specified residual dependencies conditional on the values of the random effects, as well as generalised linear mixed models. For the latter, flexible models for the random effects distribution will be investigated, such as mixtures of normals approximating a B-spline basis.
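
The sketch below fits a simple random-intercept linear mixed model to simulated grouped data, illustrating how mixed models capture unobserved heterogeneity across groups. The use of statsmodels, and every name in the snippet, is an illustrative choice, not prescribed by the project.

    # Sketch: random-intercept linear mixed model on simulated data,
    # the simplest way to model latent group-level heterogeneity.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    groups = np.repeat(np.arange(20), 10)          # 20 subjects, 10 obs each
    u = rng.normal(scale=1.0, size=20)[groups]     # subject-level random effect
    x = rng.normal(size=groups.size)
    y = 1.0 + 0.5 * x + u + rng.normal(scale=0.5, size=groups.size)
    df = pd.DataFrame({"y": y, "x": x, "g": groups})

    # Random intercept per group; variance components estimated by REML.
    fit = smf.mixedlm("y ~ x", df, groups=df["g"]).fit()
    print(fit.params)  # fixed effects plus the random-intercept variance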

5. Workpackage 5: High-dimensional and compound data

Many applications in genomics, proteomics, metabolomics, and related fields produce data with a very large number of variables. Such datasets often contain a lot of noise in the form of irrelevant information and masking variables, and the information may come from several different types of sources. Major challenges for these datasets are the detection of structure in high-dimensional problems, the filtering of noise and irrelevant pieces of information, multiple testing in the presence of a large number of variables, and drawing much stronger inferences by suitably combining the different pieces of data at hand. To deal with these challenges, suitable non- and semiparametric techniques (including smoothing methods) will be developed for noise reduction; appropriate dimension reduction and clustering techniques (including mixture modelling methods) for high-dimensional two- and multi-way data; and data fusion methods in which several pieces of multiblock multiset data are modelled jointly.
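
A minimal sketch of the "filter noise and reduce dimension, then look for structure" strategy: principal components are extracted from a simulated high-dimensional dataset with a few informative variables, and the reduced scores are clustered. The data and all parameter choices are illustrative assumptions.

    # Sketch: dimension reduction before clustering in a noisy,
    # high-dimensional setting (many more variables than samples).
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    n, p = 150, 1000
    labels = rng.integers(0, 3, size=n)       # three hidden clusters
    signal = np.zeros((n, p))
    signal[:, :10] = labels[:, None] * 2.0    # only 10 informative variables
    X = signal + rng.normal(size=(n, p))      # the rest is noise

    Z = PCA(n_components=5).fit_transform(X)  # noise/dimension reduction
    found = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
    print(len(set(zip(labels, found))))       # 3 => clusters fully recovered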

Cross-links between the five workpackages will be established on at least three different levels:

I. Interlocking complexities in the data
In practical situations, one often encounters data that exhibit interlocking complexities of the kinds studied in several workpackages. Methods will be developed for dealing appropriately with such compound complexities. This will require more than a mere concatenation of results obtained in different workpackages, because considering, for example, missing data in a multivariate context raises new kinds of complexity.

II. Common modelling approaches
The study of dependence is a recurrent topic across all workpackages. Different approaches to dependence modelling will be pursued, including copula models, regression models based on various kinds of dimension reduction, and random effects models such as generalised linear mixed models. Within the network, these approaches will be compared both at the theoretical level and through analyses of several benchmark datasets.
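
As an illustration of the copula approach to dependence modelling, the sketch below samples from a bivariate Gaussian copula and attaches two arbitrary margins; the rank correlation survives the marginal transforms because the copula carries the dependence. The example is generic and all choices in it are assumptions for illustration.

    # Sketch: a bivariate Gaussian copula with arbitrary margins,
    # separating the dependence structure from marginal behaviour.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    rho = 0.7
    cov = [[1.0, rho], [rho, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=5000)
    u = stats.norm.cdf(z)            # uniform margins, Gaussian dependence

    x = stats.expon.ppf(u[:, 0])     # exponential margin
    y = u[:, 1]                      # uniform margin
    # Rank (Spearman) correlation is invariant to the marginal transforms:
    print(round(stats.spearmanr(x, y).correlation, 2))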

III. Common methods and tools
The different workpackages will rely on a common set of tools, including kernel smoothing, semiparametric inference, Bayesian inference, optimisation, randomisation, and the bootstrap. Findings on these tools will be exchanged among workpackages, and the aim is to obtain generic results that allow their use in a broad range of contexts.
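
A small illustration of one of these shared tools, the bootstrap: a nonparametric percentile confidence interval for a median, computed on simulated data.

    # Sketch: nonparametric bootstrap confidence interval for a median.
    import numpy as np

    rng = np.random.default_rng(5)
    data = rng.exponential(scale=2.0, size=100)

    boot = np.array([
        np.median(rng.choice(data, size=data.size, replace=True))
        for _ in range(2000)                   # resample with replacement
    ])
    lo, hi = np.percentile(boot, [2.5, 97.5])  # percentile interval
    print(f"median ~ {np.median(data):.2f}, 95% CI ({lo:.2f}, {hi:.2f})")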
