Functional dependencies distorted by errors
A relational database $D$ is given with $\Omega$ as the set of attributes. We assume that the rows (tuples, the data of one individual) are transmitted through a noisy channel (or, as often happens in data mining applications, the observed data are distorted from the real values in a manner we cannot know). If the probability of error is low, it may be assumed that at most one entry in a row is changed by the transmission or observation. We say that $A \to b$ ($A \subseteq \Omega$, $b \in \Omega$) is an error-correcting functional dependency if the data in $A$ uniquely determine the data in $b$ in spite of such an error. We investigate how much larger a minimal error-correcting functional dependency can be than the original one. We give upper and lower bounds showing that it can be considerably larger than the original size, but the growth is only polynomial.
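To make the definition concrete, here is a minimal Python sketch, not from the paper: under the model of at most one wrong entry per observed row, the natural Hamming-distance reading is that $A \to b$ is error correcting exactly when any two rows disagreeing on $b$ have $A$-projections at distance at least 3 (distance 2 or less would let one error in each row make the two $A$-patterns indistinguishable). The criterion, names, and toy data are illustrative assumptions.

```python
from itertools import combinations

def hamming(u, v):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(u, v))

def is_error_correcting_fd(rows, A, b):
    """Check whether A -> b is error correcting for the relation `rows`,
    assuming at most one corrupted entry per observed row: any two rows
    that disagree on b must have A-projections at Hamming distance >= 3."""
    for t, s in combinations(rows, 2):
        if t[b] != s[b] and hamming([t[i] for i in A], [s[i] for i in A]) < 3:
            return False
    return True

# Toy relation: attributes 0..3, target attribute b = 3.
rows = [(0, 0, 0, 'x'), (1, 1, 1, 'y')]
print(is_error_correcting_fd(rows, A=[0, 1, 2], b=3))  # True: distance 3
print(is_error_correcting_fd(rows, A=[0, 1], b=3))     # False: distance 2
```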
On the security of individual data
We consider the following problem: assume that there are $n$ numerical data $\{x_1, x_2, \ldots, x_n\}$ (such as the salaries of $n$ individuals) stored in a database, and some subsums of these numbers are made public or are available to persons not eligible to learn the original data. Our motivating question is: at most how many of these subsums may be disclosed so that none of the numbers $x_1, x_2, \ldots, x_n$ can be uniquely determined from these sums? Such problems arise when certain tasks concerning a database are carried out by subcontractors who are not eligible to learn the elements of the database but naturally must be given some data to fulfill their task. In database theory such examples are called statistical databases, as they are used for statistical purposes and no individual data are supposed to be obtainable through a restricted list of SUM queries. This problem was originally introduced in [1], first solved by Miller et al. [7], and revisited by Griggs [4,5]. It was shown in [7] that no more than $\binom{n}{\lfloor n/2 \rfloor}$ subsums of a given set of secure data may be disclosed without disclosing at least one of the data, and this upper bound is sharp. Since computing a subsum requires a number of operations, and this number is limited in practice, it is natural to assume that the disclosed subsums of the original elements of the database contain only a limited number of elements, say at most $k$. The goal of the present paper is to determine the maximum number of subsums of size at most $k$ that can be disclosed without making it possible to calculate any of the individual data $x_i$. The maximum is determined exactly in the case when the number of data is much larger than the size restriction $k$.
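One way to make "uniquely determined" concrete is linear algebra over the rationals: $x_i$ is forced by the disclosed sums exactly when the unit vector $e_i$ lies in the row space of the 0/1 incidence matrix of the disclosed subsets. The following Python sketch uses this reading; the helper names and the small $n = 4$ example are illustrative assumptions, not from the paper.

```python
from fractions import Fraction

def row_space_basis(rows):
    """Gaussian elimination over the rationals; returns a row basis."""
    basis = []
    for row in rows:
        row = [Fraction(x) for x in row]
        for b in basis:
            pivot = next(i for i, x in enumerate(b) if x != 0)
            if row[pivot] != 0:
                factor = row[pivot] / b[pivot]
                row = [r - factor * c for r, c in zip(row, b)]
        if any(x != 0 for x in row):
            basis.append(row)
    return basis

def in_span(vec, basis):
    return len(row_space_basis(basis + [vec])) == len(basis)

def discloses_some_datum(n, subsets):
    """True iff some x_i is forced by the disclosed subsums, i.e. some
    unit vector e_i lies in the row space of the 0/1 incidence matrix."""
    basis = row_space_basis([[int(i in S) for i in range(n)] for S in subsets])
    return any(in_span([int(j == i) for j in range(n)], basis)
               for i in range(n))

# A safe family for n = 4: all subsets balanced w.r.t. the signs (+,+,-,-),
# i.e. the zero-subsums of (1, 1, -1, -1).
safe = [{0, 2}, {0, 3}, {1, 2}, {1, 3}, {0, 1, 2, 3}]
print(discloses_some_datum(4, safe))              # False: nothing is forced
print(discloses_some_datum(4, safe + [{0, 1}]))   # True: now 2*x_0 is a
# rational combination of the disclosed sums, so x_0 is revealed
```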
On the number of independent functional dependencies
We investigate the following question: what is the maximum number of independent functional dependencies in a database of $n$ attributes, that is, the maximum cardinality of a system of dependencies none of which follows from the Armstrong axioms alone and none of which can be derived from the remaining ones using the Armstrong axioms? An easy construction, long believed to be the best, is the following: take the maximum possible number of subsets of the attributes such that none of them contains another (by the well-known theorem of Sperner [8] their number is $\binom{n}{\lfloor n/2 \rfloor}$) and let each of them determine all the further attributes. However, we show by an explicit construction that it is possible to give more than $\binom{n}{\lfloor n/2 \rfloor}$ independent dependencies (the construction gives $(1 + \frac{1}{n^2})\binom{n}{\lfloor n/2 \rfloor}$ of them), while on the other hand the upper bound is $2^n - 1$, which is roughly $\sqrt{n}\,\binom{n}{\lfloor n/2 \rfloor}$.
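The classical construction mentioned above is easy to write down explicitly. A small Python sketch (the function name and the check are ours, for illustration) builds the middle layer of the subset lattice and verifies the antichain property behind its independence:

```python
from itertools import combinations
from math import comb

def sperner_family(n):
    """The middle layer of the Boolean lattice on n attributes: the largest
    antichain by Sperner's theorem, of size C(n, floor(n/2))."""
    return [frozenset(c) for c in combinations(range(n), n // 2)]

n = 6
family = sperner_family(n)
# In the classical construction each member A of the family is the left-hand
# side of a dependency A -> (all remaining attributes); since no member
# contains another, none of these dependencies Armstrong-derives another.
assert len(family) == comb(n, n // 2)
assert all(not a < b for a in family for b in family)   # antichain
print(len(family))   # 20 = C(6, 3)
```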
On the security of individual data
We consider the following problem: assume there are $n$ numerical data $\{x_1, x_2, \ldots, x_n\}$ (such as the salaries of $n$ individuals) stored in a database, and some subsums of these numbers are disclosed by being made public or available to persons not eligible to learn the original data. Our motivating question is: at most how many of these subsums may be disclosed so that none of the numbers $x_1, x_2, \ldots, x_n$ can be uniquely determined from these sums? Such problems arise when certain tasks concerning a database are carried out by subcontractors who are not eligible to learn the elements of the database but naturally must be given some data to fulfill their task. In database theory such examples are called statistical databases, as they are used for statistical purposes and no individual data are supposed to be obtainable through a restricted list of SUM queries. This problem was originally introduced by Chin and Ozsoyoglu [1], first solved by Miller et al. [5], and revisited by Griggs [4]. It turned out [5] that the problem is equivalent to the following question: given $n$ non-zero real numbers $X = \{x_1, x_2, \ldots, x_n\}$, what is the maximum number of 0-subsums, that is, the maximum number of subsets of $X$ whose elements sum to 0? This approach, together with the Sperner theorem, shows that no more than $\binom{n}{\lfloor n/2 \rfloor}$ subsums of a given set of secure data may be disclosed without disclosing at least one of the data, and this upper bound is sharp. However, it is natural to assume that the disclosed subsums of the original elements of the database contain only a limited number of elements, say at most $k$ (in applications, databases are usually huge, while the number of operations is in most cases limited). We ask the same question: at most how many of these subsums of at most $k$ members may be given so that none of the numbers $x_1, x_2, \ldots, x_n$ can be uniquely determined from these sums? The main result of this paper gives an upper bound on this number, which turns out to be sharp if we allow subsums of exactly $k$ or $k - 1$ members, and asymptotically sharp in the case of subsums of at most $k$ members.
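The 0-subsum reformulation can be checked by brute force on small instances. The Python sketch below (illustrative, not from the paper) uses the standard extremal example, half the data equal to $+1$ and half to $-1$; we assume the convention that the empty set is counted among the zero-sum subsets, so that the total comes out to exactly $\binom{n}{n/2}$ by the Vandermonde identity:

```python
from itertools import chain, combinations
from math import comb

def zero_subsums(xs):
    """All subsets of xs (as index tuples) whose elements sum to zero,
    including the empty set."""
    idx = range(len(xs))
    subsets = chain.from_iterable(combinations(idx, r)
                                  for r in range(len(xs) + 1))
    return [s for s in subsets if sum(xs[i] for i in s) == 0]

# Extremal example behind the C(n, n/2) bound: half +1, half -1.
n = 6
xs = [1] * (n // 2) + [-1] * (n // 2)
print(len(zero_subsums(xs)), comb(n, n // 2))   # 20 20
```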
Recent combinatorial results in the theory of relational databases
Recent results of the authors, some of which are joint with Thalheim and Seleznjev, in the area of combinatorial investigations of the relational database model are presented here. In relational databases, keys (combinations of attributes uniquely identifying the records) play an important role. The structure and size of keys have been widely investigated. Here, after a short review of the earlier results, we discuss two generalizations: the (average) structure and size of keys in a random database, and the concept of error-correcting keys in the case of unreliable data collection.
Error-correcting keys in relational databases
Suppose that the entries of a relational database are collected in an unreliable way, that is, the actual database may differ from the true database in at most one entry of each individual. An error-correcting key is a set of attributes such that the knowledge of the actual data of an individual in this set of attributes uniquely determines the individual. It is shown that if the minimal keys have size at most $k$, then the smallest minimal error-correcting keys can have size $ck^3$, and this is best possible in order of magnitude: all minimal error-correcting keys have size at most $3k^3$.
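Under this model, a set of attributes $K$ is an error-correcting key exactly when the $K$-projections of distinct individuals are pairwise at Hamming distance at least 3: at distance 2 or less, a single error in each of two rows could merge them. A minimal Python sketch of this reading (the criterion and the toy data are our illustrative assumptions, not the paper's presentation):

```python
from itertools import combinations

def is_error_correcting_key(rows, K):
    """K identifies each individual even if at most one entry per observed
    row is wrong iff the K-projections of distinct rows have pairwise
    Hamming distance >= 3."""
    proj = [tuple(r[i] for i in K) for r in rows]
    return all(sum(a != b for a, b in zip(p, q)) >= 3
               for p, q in combinations(proj, 2))

rows = [(0, 0, 0, 0), (1, 1, 1, 0)]
print(is_error_correcting_key(rows, K=[0, 1, 2]))   # True: distance 3
print(is_error_correcting_key(rows, K=[0, 1, 3]))   # False: distance 2
```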
Asymptotic properties of keys and functional dependencies in random databases
Practical database applications give the impression that sets of constraints are rather small and that large sets are unusual and caused by bad design decisions. Theoretical investigations, however, show that minimal constraint sets are potentially very large: their size can be estimated to be exponential in the number of attributes. This gap between practical observation and theory leads to the rejection of theoretical results. However, practice is related to average cases, not to worst cases, while the theory used until now considered worst-case complexity. This paper aims to develop a theory of average-case complexity. Several probabilistic models and asymptotics of the corresponding probabilities are investigated for random databases formed by independent random tuples with a common discrete distribution. Poisson approximations are studied for the distributions of some characteristics of such databases when the number of tuples is sufficiently large. We prove that the exponential complexity of key sets and sets of functional dependencies is rather unusual, and that almost all minimal keys in a relation have a length which depends mainly on the size of the relation.
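The concentration phenomenon is easy to observe empirically. The following Monte Carlo sketch (parameters, names, and the brute-force search are our illustrative assumptions, not taken from the paper) draws random databases of i.i.d. uniform tuples and records the size of the smallest key:

```python
import random
from itertools import combinations

def min_key_size(rows, n):
    """Smallest k such that some k-subset of the n attributes is a key,
    i.e. its projection has no duplicate rows; brute force for small n."""
    for k in range(1, n + 1):
        for K in combinations(range(n), k):
            proj = [tuple(r[i] for i in K) for r in rows]
            if len(set(proj)) == len(proj):
                return k
    return None  # duplicate tuples: no key at all

random.seed(0)
n, m, d, trials = 10, 100, 4, 30   # attributes, tuples, domain size, runs
sizes = []
for _ in range(trials):
    rows = [tuple(random.randrange(d) for _ in range(n)) for _ in range(m)]
    k = min_key_size(rows, n)
    if k is not None:
        sizes.append(k)
print(sorted(set(sizes)))   # typically one or two adjacent values, e.g. [5, 6]
```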
The average length of keys and functional dependencies in (random) databases
Practical database applications engender the impression that sets of constraints are rather small and that large sets are unusual and caused by bad design decisions. Theoretical investigations show, however, that minimal constraint sets are potentially very large: their size can be estimated to be exponential in the number of attributes. This gap between belief and theory causes non-acceptance of theoretical results. However, beliefs are related to average cases, while the theory known so far considered worst-case complexity. This paper aims at developing a theory of average-case complexity. Several statistical models and asymptotics of the corresponding probabilities are investigated for random databases. We show that exponential complexity of independent key sets and independent sets of functional dependencies is rather unusual. Almost all minimal keys in a relation have a length which depends mainly on the size of the relation; the number of minimal keys of any other length is exponentially small compared with the number of minimal keys of the derived length. Further, if a key is valid in a relation, then it is probably a minimal key. The same results hold for functional dependencies.
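The claim that a valid key is probably already minimal can likewise be illustrated by simulation. In this sketch (illustrative parameters of our choosing; $k = 6$ is roughly the typical minimal key length at these sizes) we test how many valid keys of the typical length survive the minimality check:

```python
import random
from itertools import combinations

def is_key(rows, K):
    proj = [tuple(r[i] for i in K) for r in rows]
    return len(set(proj)) == len(proj)

def is_minimal_key(rows, K):
    """A key is minimal iff dropping any single attribute destroys it
    (the key property is monotone under taking supersets)."""
    return is_key(rows, K) and all(
        not is_key(rows, [a for a in K if a != b]) for b in K)

random.seed(1)
n, m, d = 10, 100, 4
rows = [tuple(random.randrange(d) for _ in range(n)) for _ in range(m)]
k = 6   # around the typical minimal key length for these parameters
keys = [K for K in combinations(range(n), k) if is_key(rows, K)]
minimal = [K for K in keys if is_minimal_key(rows, K)]
print(f"{len(minimal)} of {len(keys)} valid {k}-subsets are minimal keys")
```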
Partial dependencies in relational databases and their realization
Weakening the functional dependencies introduced by Armstrong, we obtain the notion of partial dependencies defined on relational databases. We show that partial dependencies can be characterized by closure operations on the poset formed by the partial functions on the attributes of the database. On the other hand, we give necessary and sufficient conditions under which, for such a closure operation, one can find a database on the given set of attributes whose partial dependencies generate the given closure operation. We also investigate questions about how to realize certain structures related to databases by a database with the minimal number of rows, columns, or elements.
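For orientation, the classical closure operation generated by total functional dependencies can be computed by the standard fixed-point iteration sketched below in Python; the paper studies the analogous operation for partial dependencies, which this simple sketch does not attempt to capture:

```python
def closure(attrs, fds):
    """Closure of an attribute set under functional dependencies.

    `fds` is a list of pairs (A, b): the attribute set A determines b.
    The fixed point computed here is the classical closure operation
    induced by a set of (total) functional dependencies."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for A, b in fds:
            if set(A) <= closed and b not in closed:
                closed.add(b)
                changed = True
    return closed

fds = [({'a'}, 'b'), ({'b', 'c'}, 'd')]
print(closure({'a'}, fds))        # {'a', 'b'}
print(closure({'a', 'c'}, fds))   # {'a', 'b', 'c', 'd'}
```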