How to impute a fictitious raw dataset from aggregate results? (Part 1)

Abstract

We consider how to deconstruct results that are published at an aggregate level but based on an original, unpublished, raw dataset of individuals/cases, each having values for a common set of categorical variables. Typical examples are confidential data from survey questionnaires or administrative systems, from which only aggregate counts, frequencies and cross-tabulations may be published. The main idea is to impute a simulated dataset of fictitious individuals and variable values, from which the published information can still be reproduced.

 

Motivation for the problem

The problem dealt with here arose in the following context. Every year, the UK Department for Education publishes the "School and college performance tables" online. The tables that are publicly accessible are actually summaries (at the school level) of more refined data (at the pupil level), since these refined data need to remain confidential. Would it be possible to publish more detailed data than presently done, while respecting this goal of confidentiality? That is the kind of issue that we address in this paper.
More generally, we shall deal here with a questionnaire dataset, i.e. a set of n cases (or individuals, or respondents) for which the values of several categorical variables are available. We will call such questionnaire-type data a dataset in the rest of this paper. Let us denote by DS0 the true, observed dataset. A partial analysis of DS0 may yield several observed tables of counts and/or frequencies 1 for single variables or for cross-tabulations of several variables. For our present purpose, these unidimensional or multidimensional margins constitute a set of constraints. Now, the situation we consider here is the following: the true dataset DS0 is unknown to us, but some of its margins are available.
Then, a possible solution to the above issue is to build a fictitious (or simulated, or virtual) dataset, DSF, i.e. a set of n′ fictitious individuals and their fictitious values for all variables, constructed in such a way as to respect the constraints of the true dataset 2.

1 Formalizing the problem

Let us consider a dataset composed of n individuals and their values for each variable in a set of V variables,

{A,B,C,D,···}.

We shall only consider the case in which each variable is categorical, i.e. a variable associated with a list of modalities which are its possible values, e.g.

A = {a1, a2, a3, · · ·}.

     The true dataset is then defined as the mapping which associates a response pattern or profile on all variables {A, B, C, D, · · ·} to each of the n individuals:

i1 → a1b2c1d4 · · ·
i2 → a2b3c1d1 · · ·

· · ·

For confidentiality or practical reasons, the true dataset is not available. Instead of knowing the individual values, only summaries are available, such as marginal counts for some or all variables, and joint counts for some crossings, e.g. A × B, C × D, etc. In addition, some a priori structural knowledge might be available: e.g. the combinations a1b2 and a2b3 are both impossible. We shall call all these available summaries and marginal tables, possibly completed by structural knowledge, partial information. Globally, this partial information will be considered as a set of constraints C0 that our target fictitious dataset must respect.
Now our problem can be stated as follows. Given a true dataset DS0, i.e. the n values for the V variables obeying the constraints C0, can we construct a fictitious dataset DSF, i.e. generate values for the same V variables, for n′ fictitious individuals, obeying the same constraints C0? It is important to emphasize that the constraints C0 are derived from a unique observed dataset DS0. For that reason, DS0 itself necessarily conforms strictly to the constraints C0. [Hence the common 0 subscript to stress this fact.]
The constraints C0 can be expressed in terms of observed counts or frequencies. In the first case, there is at least one solution to the problem: the dataset DS0 itself with n′ = n individuals. In the second case, or when n′ ≠ n, one can search for solutions which will typically give an approximation of constraints C0 (e.g. it is impossible to generate a proportion of exactly p = 1/3 with n′ = 100). This text will not consider further the issue of exact vs. approximate solutions, and will focus on constraints expressed through frequencies.

Remark

Another case emerges when the constraints C0 are not derived from a unique true dataset. This case has been dealt with under the heading of "data fusion"; it requires a treatment of its own and is not discussed in the present paper.

2 Several types of constraints

It seems useful to distinguish several types of sources of information or constraints which — a priori — should not be handled identically:

  • Constraints on margins, unidimensional (of type A), bidimensional (of type A × B), or multidimensional (of type A × B × C, etc.),
  • Structural constraints, i.e. prohibition of some combinations of modalities,
  • Softer constraints, i.e. the degree and direction of the association between A and B, fully or partially quantified.

It may well be the case that other types need to be distinguished in the future. At this stage, we have rather clear views on how to handle the first type of constraints, and the present text will focus on the issue of constraints on margins.

3 Margins in the form of proportions/frequencies tables

Let us first consider the case where some margins, unidimensional, bidimensional or multidimensional, are given in the form of (relative) frequency or proportion tables.
For instance, let us consider a true dataset DS0 with V = 5 variables {A, B, C, D, E}. The full knowledge of DS0 is the full joint distribution,

p(ABCDE),

that is, the proportions p(abcde) for any a ∈ A, b ∈ B, etc. (whose sum equals 1 = 100%). Let us assume that the available constraints C0 are the proportions of modalities on variable A, and the proportions of combined modalities on the cross-tables A × B, A × E, and C × D, that is the partial distributions,

p(A)
p(AB)
p(AE)
p(CD).

It is clear that the knowledge of p(AB), i.e. of the joint frequencies of A and B modalities, contains p(A) and p(B) among others. Since all these proportions are derived from the same, unique dataset DS0, the p(A) deduced from p(AB) is identical to the p(A) given as a first constraint, as well as to the p(A) deduced from p(AE) 3. There cannot be any contradiction between the various sources of information. Hence, in this example, the available knowledge can be simplified to three constraints only: p(AB), p(AE) and p(CD).
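For instance, the p(A) deduced from p(AB) is obtained by summing the joint proportions over the modalities of B,

$latex p(a) = \sum_{b} p(ab). $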
Next, we shall see how to construct a dataset DSF satisfying approximately the constraints C0, for any fictitious sample size n′. For that purpose, the actual sample size n of the true dataset DS0 is not required.

3.1 Some proportion / probability calculus

Let us first recall a classical formula in the field of proportion calculus (or probability calculus),

$latex p(uv) = p(u) p(v|u) $
           $latex = p(v) p(u|v) $

(1)

where "v|u" reads "v conditionally on u" and uv reads "the conjunction of u and v". The two equations above are equivalent and merely indicate that the roles of u and v can be permuted. If v is itself a composite statement v = xy, the formulae (1) can be used recursively, e.g.

$latex p(uxy) = p(u) p(xy|u) $
                         $latex = p(u) p(x|u) p(y|ux) $

(2)

Actually, several other equivalent representations of p(uxy) can be used, depending on the order in which the conditionings occur, i.e. by permuting the roles of u, x and y.
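For instance, conditioning on y first gives another equivalent representation,

$latex p(uxy) = p(y) p(x|y) p(u|xy). $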
As indicated before, if we had DS0, in terms of proportions, the most precise knowledge would be the full joint distribution, p(ABCDE). Using formula (1) recursively, it is clear that p(abcde) can be written in various ways as a product of (conditional) proportions; for instance two possible such representations are:

                     $latex p(abcde) = p(a) p(b|a) p(c|ab) p(d|abc) p(e|abcd) $

(3)

    $latex = p(abe) p(cd|abe). $

(4)

3.2 From constraints to a dependency network

Connex subsets

The first step consists in identifying the connex (connected) subsets of variables within the set of constraints. Variable A is "connected" to B, because of the constraint p(AB), and also connected to E due to p(AE). Since no other constraint contains A, B or E, these three variables constitute a connex subset. In our example, there are two connex subsets:

A,AB,AE → A,B,E
CD → C,D
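As a concrete illustration of this first step, here is a minimal sketch in Python (our own code, not taken from the paper; the function name and data layout are assumptions) which groups variables into connex subsets, each constraint being represented by the set of variables it involves:

# Minimal sketch: compute the connex subsets of variables induced by the
# constraints, using a simple union-find structure.

def connex_subsets(constraints):
    """constraints: iterable of sets of variable names, e.g. [{"A", "B"}, {"A", "E"}, {"C", "D"}]."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for c in constraints:
        c = sorted(c)
        for v in c:
            find(v)            # register every variable, even isolated ones
        for v in c[1:]:
            union(c[0], v)     # variables sharing a constraint end up in the same subset

    groups = {}
    for v in parent:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())

# With the constraints of the example, p(A), p(AB), p(AE) and p(CD):
print(connex_subsets([{"A"}, {"A", "B"}, {"A", "E"}, {"C", "D"}]))
# -> [{'A', 'B', 'E'}, {'C', 'D'}]  (order of the subsets may vary)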

Within each connex subset, the knowledge of some marginal tables, e.g. p(AB), gives the indication of some dependencies, here between A and B: knowing a, some modalities b are more frequent than some other modalities b′.
Between two such subsets, on the contrary, since there is no knowledge of any dependency, one typical assumption is that of independence between subsets: knowing A, B and/or E does not give any information about C and/or D. This is a very strong assumption, which amounts to treating the absence of any knowledge of dependence as knowledge of independence. Some ideas about how to relax this strong assumption are sketched in Section 4.2.
Formally, this assumption of independence amounts to saying that p(CD) does not depend on ABE, and vice-versa. In terms of the overall distribution on ABCDE, this is equivalent to saying:

$latex p(abcde) = p(abe) p(cd|abe) $
         $latex = p(abe) p(cd) $

(5)

The first equation above is always true and was already indicated in Eq. (4). The second equation in (5) uses the independence assumption, i.e. that p(cd|abe) = p(cd).
After this first step, it is now possible to work separately on each connex subset, i.e. compute p(ABE) and generate partial profiles abe, compute p(CD) and generate partial profiles cd, and then combine these partial profiles into a global one, abcde.
The second step deals with each connex subset separately. For a given subset, it is first necessary to identify all the unconditional or conditional proportions that can be derived from the marginal tables specific to each subset:

A, B, E → p(A), p(B), p(E) and p(A|B), p(B|A), p(A|E), p(E|A) (6)
C, D → p(C), p(D) and p(C|D), p(D|C) (7)
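As an illustration of this derivation step, here is a minimal sketch in Python (our own code; the representation of a margin as a dict keyed by pairs of modalities is an assumption) which computes p(A) and p(B|A) from a given joint margin p(AB):

# Minimal sketch: derive the unconditional p(A) and the conditional p(B|A)
# from a joint margin p(AB) given as {(a, b): proportion}.

from collections import defaultdict

def marginal_and_conditional(p_ab):
    p_a = defaultdict(float)
    for (a, b), p in p_ab.items():
        p_a[a] += p                                   # p(a) = sum over b of p(ab)
    p_b_given_a = {(a, b): p / p_a[a] for (a, b), p in p_ab.items()}  # p(b|a) = p(ab) / p(a)
    return dict(p_a), p_b_given_a

# Toy numbers, invented for the illustration only
p_ab = {("a1", "b1"): 0.30, ("a1", "b2"): 0.10,
        ("a2", "b1"): 0.20, ("a2", "b2"): 0.40}
p_a, p_b_given_a = marginal_and_conditional(p_ab)
print(p_a)          # roughly {'a1': 0.4, 'a2': 0.6}
print(p_b_given_a)  # e.g. p(b1|a1) ≈ 0.75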

For the {C, D} connex subset, things are simple since all possible knowledge is given, either directly as p(CD) or by developing these joint proportions in one of the following ways:

$latex p(cd) = p(c) p(d|c) $ (8)
           $latex = p(d) p(c|d) $ (9)

For the connex subset {A, B, E}, there is no full knowledge of p(ABE) but only partial knowledge, which requires us to assume some conditional independence,

$latex p(abe) = p(a) p(b|a) p(e|ab) $ (10)
           $latex = p(a) p(b|a) p(e|a), $ (11)

the assumption being that p(e|ab) = p(e|a). The formula above is based on some path or order between the variables, here A then B then E. Another possible path would be A then E then B.
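With that other path, A then E then B, the corresponding assumption would be p(b|ae) = p(b|a), which expresses the same conditional independence of B and E given A, and the factorization becomes

$latex p(abe) = p(a) p(e|a) p(b|ae) $
           $latex = p(a) p(e|a) p(b|a). $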

Acyclic vs. Cyclic graph

Using all the available information amounts to using some subset of the distributions in Eq. (6). For ABE, the three distributions p(A), p(B|A) and p(E|A) are sufficient to reproduce all the available information, since,

p(AB) = p(A) × p(B|A)
p(AE) = p(A) × p(E|A)

Each such solution defines a graph of dependencies, here

A, A → B and A → E.

These elements can respectively be interpreted as: knowledge about A, knowledge about B when A = a is given, and knowledge about E when A = a is given (probabilistic knowledge, of course).
The theory distinguishes two types of such graphs, according to whether they contain cycles or not: cyclic or acyclic graphs. When, as in the present case, the graph associated with the constraints C0 is acyclic, any path/order will provide the same joint distribution 4.
Although they were proposed in a different context, the models developed under the heading of Bayesian networks have strong connections with what we propose in the present paper. They also typically assume knowledge expressed by acyclic graphs.

3.2.1 Summary of proposed algorithm

To summarize, in each subset, some path is determined to build a partial response profile,

A, B, E → p(A), p(B|A), p(E|A)
C, D → p(C), p(D|C)

An algorithm for building a complete response profile, i.e. on ABCDE, simply consists in "going through" these unconditional or conditional proportion tables:

  • Draw a according to p(A) → a
  • Draw b according to p(B|a) → b
  • Draw e according to p(E|a) → e
  • Draw c according to p(C) → c
  • Draw d according to p(D|c) → d

This can be simply implemented by a random draw from each univariate distribution considered. Each iteration of the random draws above provides a, b, e (from the first connex subset) and c, d (from the second one), i.e. a complete response profile,

abcde

This will constitute a fictitious individual with responses abcde. The above procedure is then repeated so as to generate n′ fictitious individuals, composing a fictitious dataset DSF with n′ individuals in the end.
On this simulated dataset DSF, one can compute all possible margins, unidimensional, bidimensional or multidimensional, and in particular the margins analogous to those which provided the initial constraints (margins obtained from DS0). The algorithm above ensures that the margins from DSF will be approximations of the corresponding margins from DS0. And the larger n′ is, the more precise these approximations will be.
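As a concrete illustration of the whole procedure, here is a minimal sketch in Python (our own code; the variable names, dictionary layout and toy proportions are invented for the illustration, not taken from the paper). It generates n′ fictitious profiles abcde by going through the tables p(A), p(B|A), p(E|A), p(C), p(D|C), and then recomputes the A × B margin of DSF to check that it approximates the corresponding constraint:

# Minimal sketch of the generation algorithm: each fictitious individual is
# obtained by successive random draws from the (conditional) proportion tables.

import random
from collections import Counter

def draw(dist):
    """One random draw from a dict {modality: proportion} (multinomial, with replacement)."""
    modalities = list(dist)
    return random.choices(modalities, weights=[dist[m] for m in modalities])[0]

def generate_profile(p_A, p_B_given_A, p_E_given_A, p_C, p_D_given_C):
    a = draw(p_A)               # draw a according to p(A)
    b = draw(p_B_given_A[a])    # draw b according to p(B|a)
    e = draw(p_E_given_A[a])    # draw e according to p(E|a)
    c = draw(p_C)               # draw c according to p(C)
    d = draw(p_D_given_C[c])    # draw d according to p(D|c)
    return (a, b, c, d, e)

# Toy constraint tables (invented for the illustration only)
p_A = {"a1": 0.4, "a2": 0.6}
p_B_given_A = {"a1": {"b1": 0.75, "b2": 0.25}, "a2": {"b1": 1/3, "b2": 2/3}}
p_E_given_A = {"a1": {"e1": 0.5, "e2": 0.5}, "a2": {"e1": 0.8, "e2": 0.2}}
p_C = {"c1": 0.7, "c2": 0.3}
p_D_given_C = {"c1": {"d1": 0.9, "d2": 0.1}, "c2": {"d1": 0.2, "d2": 0.8}}

n_prime = 10_000
DSF = [generate_profile(p_A, p_B_given_A, p_E_given_A, p_C, p_D_given_C)
       for _ in range(n_prime)]

# Check: the A x B margin recomputed on DSF should approximate p(A) * p(B|A)
ab_counts = Counter((a, b) for (a, b, c, d, e) in DSF)
for (a, b), count in sorted(ab_counts.items()):
    print(a, b, round(count / n_prime, 3), "target:", round(p_A[a] * p_B_given_A[a][b], 3))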

Remark

We have implicitly assumed that the random draws were carried out using multinomial sampling (sampling with replacement) from each distribution. The algorithm outlined above could actually be modified by considering constraints given by marginal counts instead of frequencies, and by drawing from a multi-hypergeometric distribution (sampling without replacement). We have not pursued that idea further yet, but it is the key to reproducing the desired margins exactly, and not only approximately as proposed above.
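As a purely illustrative sketch of this idea (which is not developed further in the paper), one could, for a single cross-table with counts n(AB) and with n′ = n, build a finite pool containing each profile ab exactly n(ab) times and exhaust it by drawing without replacement; the resulting fictitious dataset then reproduces the A × B margin exactly:

# Purely illustrative sketch (assuming n' = n and a single margin given as counts):
# drawing without replacement from a finite pool reproduces the counts exactly.

import random

def generate_exact(n_ab):
    """n_ab: dict {(a, b): count}. Returns sum(counts) profiles whose A x B
    cross-tabulation matches n_ab exactly."""
    pool = [ab for ab, count in n_ab.items() for _ in range(count)]
    random.shuffle(pool)     # equivalent to drawing the profiles one by one without replacement
    return pool

n_ab = {("a1", "b1"): 3, ("a1", "b2"): 1, ("a2", "b1"): 2, ("a2", "b2"): 4}
print(generate_exact(n_ab))  # 10 profiles; the count of each pair is reproduced exactly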

4 Shortcomings of a fictitious dataset

4.1 Dependence vs. independence

In the case considered here — margins obtained from a single true dataset DS0 —, the construction of a fictitious dataset requires the assumption of some (unconditional or conditional) independences. In our example, two independence assumptions were necessary,

$latex p(cd|abe) = p(cd) $
$latex p(e|ab) = p(e|a) $

The smaller the number of available margins/constraints is, the more numerous these assumptions will be. For instance, if only p(A), p(B) and p(E) were available, building some p(ABE) would require assuming that A, B and E are mutually independent. The resulting fictitious dataset DSF may then be very unrealistic, i.e. far from the true one DS0.
Conversely, the more numerous the given margins or cross-tables, the more realistic the fictitious dataset will be. And if the number of available margins must be limited for one reason or another, then it will be important to select crossings for which the dependencies are high. Since an absence of information leads to independence assumptions, what is needed is knowledge of (possibly strong) dependencies.

4.2 An alternative: extreme points of the DSF space

An alternative idea, which at this stage is still to be explored, would consist in building not a single fictitious dataset, but a set of fictitious datasets, each one characterized by different dependence/independence assumptions concerning the missing knowledge, while keeping the same known margins.

However, although there is a single way for A and B to be independent,

$latex p(ab) = p(a)p(b), $ (12)

there is a wide variety of directions (and, in each direction, of intensities) in which A and B can be dependent. That is to say that the space of possible compatible fictitious datasets can be very large (of high dimensionality).
If the dimensionality is not too high, it is possible in theory to exhibit the set of extreme points of the DSF space. These extreme points are such that the set of their convex linear combinations is actually the DSF space itself.
However, in general, it will not be possible to generate all acceptable fictitious datasets, directly or by means of the extreme points. An important issue then is to select a few elements from that huge space which, by their dispersion, will cover a sufficiently wide part of the DSF space.

4.3 Conclusions

We have restricted our attention to reconstructing a fictitious dataset from aggregate results, given in the form of marginal (joint) frequencies, which can be expressed by an acyclic graph, and which are all derived from a single true dataset. Several extensions need to be considered: e.g. knowledge derived from several (possibly not coherent) sources, constraints on counts, cyclic constraints.


1In the following, we use the terms "frequencies" and "proportions" interchangeably for relative frequencies.
2An interesting example of this approach is the tax simulator proposed by Landais, Piketty and Saez (2011), which is based upon a set of several thousand virtual individuals, simulated so as to reproduce known sources of information. See www.revolution-fiscale.fr
3Another way of saying this is to say that p(A) can be either computed directly from p(ABCDE) by summation, or derived in two steps, first compute p(AB) from p(ABCDE) and then p(A) from p(AB), both computations being done again by summation. Both paths — and any other — will yield the same p(A).
4If the available knowledge were p(AB), p(AE) and p(BE), it would not be possible to express it fully as an acyclic graph. A cycle involving A, B and E would be necessary.

