Publications of Demetrovics, J.

Functional dependencies distorted by errors

A relational database D is given with Omega as the set of attributes. We assume that the rows (tuples, the data of one individual) are transmitted through a noisy channel (or, as often happens in data mining applications, the observed data are distorted from the real values in a way we cannot know). If the probability of error is low, it may be supposed that at most one entry in a row is changed by the transmission or observation. We say that A -> b (A a subset of Omega, b in Omega) is an error-correcting functional dependency if the data in A uniquely determine the data in b in spite of such an error. We investigate how much larger a minimal error-correcting functional dependency can be than the original one. We give upper and lower bounds showing that it can be considerably larger than the original size, but the growth is only polynomial.

Demetrovics J, Katona GO, Miklós D. On the security of individual data. Annals of Mathematics and Artificial Intelligence. 2006;46:98-113. Abstract

On the security of individual data

We consider the following problem in this paper: Assume that there are n numerical data {x(1), x(2), ..., x(n)} (like salaries of n individuals) stored in a database, and some subsums of these numbers are made public or just available for persons not eligible to learn the original data. Our motivating question is: at most how many of these subsums may be disclosed such that none of the numbers x(1), x(2), ..., x(n) can be uniquely determined from these sums? These types of problems arise when certain tasks concerning a database are done by subcontractors who are not eligible to learn the elements of the database, but naturally must be given some data to fulfill their task. In database theory such examples are called statistical databases, as they are used for statistical purposes and no individual data are supposed to be obtainable using a restricted list of SUM queries. This problem was originally introduced in [1], first solved by Miller et al. [7] and revisited by Griggs [4, 5]. It was shown in [7] that no more than (n choose n/2) subsums of a given set of secure data may be disclosed without disclosing at least one of the data, and this upper bound is sharp. Since computing a subsum requires a number of operations, which may be limited, it is natural to assume that the disclosed subsums of the original elements of the database contain only a limited number of elements, say at most k. The goal of the present paper is to determine the maximum number of subsums of size at most k which can be disclosed without making it possible to calculate any of the individual data x(i). The maximum is determined exactly for the case when the number of data is much larger than the size restriction k.
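The disclosure question rests on a linear-algebra criterion: a value x(i) is computable from the disclosed subsums exactly when the corresponding unit vector lies in the row space of the 0/1 incidence matrix of the disclosed subsets. A minimal sketch of that criterion (the function name and encoding are ours, not from the paper):

```python
from fractions import Fraction

def discloses_individual(n, subsets):
    """Return True if some x_i is uniquely determined by the subsums
    over `subsets`, i.e. some unit vector e_i lies in the row space
    of the 0/1 incidence matrix of the subsets."""
    # Build the incidence matrix over the rationals (exact arithmetic).
    rows = [[Fraction(1) if j in s else Fraction(0) for j in range(n)]
            for s in subsets]
    # Gaussian elimination: pivots maps pivot column -> normalized row.
    pivots = {}
    for row in rows:
        row = row[:]
        for col, prow in pivots.items():
            if row[col]:
                f = row[col]
                row = [a - f * b for a, b in zip(row, prow)]
        for col, a in enumerate(row):
            if a:
                pivots[col] = [x / a for x in row]
                break
    # e_i is in the row space iff reducing e_i by the pivots gives 0.
    for i in range(n):
        e = [Fraction(int(j == i)) for j in range(n)]
        for col, prow in pivots.items():
            if e[col]:
                f = e[col]
                e = [a - f * b for a, b in zip(e, prow)]
        if not any(e):
            return True
    return False
```

For instance, disclosing the single sum x(0) + x(1) reveals neither value, while disclosing all three pairwise sums of three data determines every value.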

On the number of independent functional dependencies

We investigate the following question: what is the maximum number of independent functional dependencies in a database of n attributes, that is, the maximum cardinality of a system of dependencies none of which can be derived from the remaining ones using the Armstrong axioms? An easy construction, long believed to be best possible, is the following: take the maximum possible number of subsets of the attributes such that none of them contains another (by the well-known theorem of Sperner [8] their number is (n choose n/2)) and let each of them determine all the remaining attributes. However, we show by a specific construction that it is possible to give more than (n choose n/2) independent dependencies (the construction gives (1 + 1/n^2)(n choose n/2) of them) and, on the other hand, the upper bound is 2^n - 1, which is roughly sqrt(n) * (n choose n/2).
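Sperner's middle-layer construction mentioned above is easy to reproduce; the following sketch (names are ours) enumerates the floor(n/2)-element subsets of the attribute set and checks that they form an antichain:

```python
from itertools import combinations
from math import comb

def middle_layer(n):
    """All floor(n/2)-element subsets of {0, ..., n-1}. By Sperner's
    theorem this is a largest antichain: no member contains another,
    and there are comb(n, n // 2) of them."""
    return [frozenset(c) for c in combinations(range(n), n // 2)]

# Classical construction: each middle-layer set A yields the dependency
# "A determines all remaining attributes"; these are pairwise independent.
```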

Demetrovics J, Katona GO, Miklós D. On the security of individual data. In: Seipel D, Turull-Torres JM, editors. Lecture Notes in Computer Science. Vol 2942. 2004. p. 49-58. Abstract

On the security of individual data

We consider the following problem in this paper: Assume there are n numerical data {x(1), x(2), ..., x(n)} (like salaries of n individuals) stored in a database, and some subsums of these numbers are disclosed by being made public or just available for persons not eligible to learn the original data. Our motivating question is: at most how many of these subsums may be disclosed such that none of the numbers x(1), x(2), ..., x(n) can be uniquely determined from these sums? These types of problems arise when certain tasks concerning a database are done by subcontractors who are not eligible to learn the elements of the database, but naturally must be given some data to fulfill their task. In database theory such examples are called statistical databases, as they are used for statistical purposes and no individual data are supposed to be obtainable using a restricted list of SUM queries. This problem was originally introduced by Chin and Ozsoyoglu [1], first solved by Miller et al. [5] and revisited by Griggs [4]. It turned out [5] that the problem is equivalent to the following question: if n real, non-zero numbers X = {x(1), x(2), ..., x(n)} are given, what is the maximum number of zero subsums, that is, the maximum number of subsets of X whose elements sum up to 0? This approach, together with the Sperner theorem, shows that no more than (n choose n/2) subsums of a given set of secure data may be disclosed without disclosing at least one of the data, and this upper bound is sharp. However, it is natural to assume that the disclosed subsums of the original elements of the database contain only a limited number of elements, say at most k (in applications databases are usually huge, while the number of operations is in most cases limited).
We now ask the same question: at most how many of these subsums of at most k members may be given such that none of the numbers x(1), x(2), ..., x(n) can be uniquely determined from these sums? The main result of this paper gives an upper bound on this number, which turns out to be sharp if we allow subsums of only k or k - 1 members, and asymptotically sharp in the case of subsums of at most k members.

Recent combinatorial results in the theory of relational databases

Recent results of the authors, some of which are joint with Thalheim and Seleznjev, in the area of combinatorial investigations of the relational database model are presented here. In relational databases, keys (combinations of attributes uniquely identifying the records) play an important role. The structure and size of keys have been widely investigated. Here, after a short review of the earlier results, we discuss two generalizations: the (average) structure and size of keys in a random database, and the concept of error-correcting keys in the case of unreliable data collection.

Demetrovics J, Katona GO, Miklós D. Functional dependencies in presence of errors. In: Eiter T, Schewe KD, editors. Foundations of Information and Knowledge Systems. Vol 2284. Berlin: Springer; 2002. (Lecture Notes in Computer Science; vol 2284).
Demetrovics J, Katona GO, Miklós D. Error-correcting keys in relational databases. In: Schewe KD, Thalheim B, editors. Foundations of Information and Knowledge Systems. Vol 1762. Berlin: Springer; 2000. p. 88-93. (Lecture Notes in Computer Science; vol 1762). Abstract

Error-correcting keys in relational databases

Suppose that the entries of a relational database are collected in an unreliable way, that is, the actual database may differ from the true database in at most one entry for each individual. An error-correcting key is a set of attributes such that knowledge of the actual data of an individual on this set of attributes uniquely determines the individual. It is shown that if the minimal keys are of size at most k, then the smallest size of a minimal error-correcting key can be ck^3, and this is best possible: all minimal error-correcting keys have size at most 3k^3.
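One way to read the definition: a set of attributes is an error-correcting key precisely when every two rows differ in at least 3 of its attributes, so the radius-1 error balls around their projections are disjoint and a single corrupted entry cannot cause a confusion. A small checker along these lines (our formalization in code, not from the paper):

```python
def is_error_correcting_key(rows, attrs):
    """Check whether the attribute set `attrs` (a list of column
    indices) is an error-correcting key for `rows`: every two distinct
    rows must differ in at least 3 of the attributes in `attrs`."""
    projs = [tuple(row[a] for a in attrs) for row in rows]
    for i in range(len(projs)):
        for j in range(i + 1, len(projs)):
            # Hamming distance of the two projections.
            dist = sum(x != y for x, y in zip(projs[i], projs[j]))
            if dist < 3:
                return False
    return True
```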

Anstee R, Demetrovics J, Katona GO, Sali A. Low discrepancy allocation of two-dimensional data. In: Schewe KD, Thalheim B, editors. Foundations of Information and Knowledge Systems. Vol 1762. Berlin: Springer; 2000. p. 1-12. Abstract

Low discrepancy allocation of two-dimensional data

Fast browsing and retrieval of geographically referenced information requires the allocation of data on different storage devices for concurrent retrieval. By dividing the two-dimensional space into tiles, a system can allow users to specify regions of interest using a query rectangle and then retrieve the information related to the tiles covered by the query. Suppose that there are m I/O devices. A tile is labeled by i if the data corresponding to this area are stored on the ith I/O device. A labeling is efficient if the difference between the numbers of occurrences of distinct labels in any given rectangle is small. Except for some simple cases this discrepancy must exceed 1. In the present paper constructions are given to keep this discrepancy small relative to m. The constructions use Latin squares, and a lower bound is given which shows that the constructions are best possible under certain conditions.
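A toy version of the Latin-square idea, using the simple cyclic square label(i, j) = (i + j) mod m rather than the paper's optimized constructions (function names are ours):

```python
def label(i, j, m):
    """Assign tile (i, j) to one of m I/O devices via the cyclic
    Latin square: each row and each column uses every label once
    per period of m."""
    return (i + j) % m

def discrepancy(m, r1, r2, c1, c2):
    """Difference between the most and least frequent label inside
    the tile rectangle [r1, r2] x [c1, c2] (inclusive)."""
    counts = [0] * m
    for i in range(r1, r2 + 1):
        for j in range(c1, c2 + 1):
            counts[label(i, j, m)] += 1
    return max(counts) - min(counts)
```

For a full m x m rectangle every label occurs equally often, so the discrepancy is 0; small off-period rectangles already show a discrepancy of 1.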

Asymptotic properties of keys and functional dependencies in random databases

Practical database applications give the impression that sets of constraints are rather small and that large sets are unusual and are caused by bad design decisions. Theoretical investigations, however, show that minimal constraint sets are potentially very large. Their size can be estimated to be exponential in terms of the number of attributes. The gap between observation in practice and theory results in the rejection of theoretical results. However, practice is related to average cases and is not related to worst cases. The theory used until now considered the worst-case complexity. This paper aims to develop a theory for the average-case complexity. Several probabilistic models and asymptotics of corresponding probabilities are investigated for random databases formed by independent random tuples with a common discrete distribution. Poisson approximations are studied for the distributions of some characteristics for such databases where the number of tuples is sufficiently large. We intend to prove that the exponential complexity of key sets and sets of functional dependencies is rather unusual and almost all minimal keys in a relation have a length which depends mainly on the size of the relation.

Demetrovics J, Katona GO, Sali A. Design type problems motivated by database theory. Journal of Statistical Planning and Inference. 1998;72(1-2):149-64. Abstract

Design type problems motivated by database theory

Let k ≤ n and p ≤ q < m be positive integers. Suppose that the m x n matrix M satisfies the following two properties: for any choice of k distinct columns c(1), c(2), ..., c(k), there are q + 1 rows such that the number of different entries in c(i) (1 ≤ i ≤ k - 1) in these rows is at most p, while all q + 1 entries of c(k) in these rows are different; and this is true for no choice of k + 1 distinct columns. We review results minimizing m, given n, p, q, k. Two of the results are new. The optimal or nearly optimal constructions can be considered as n partitions of the m-element set satisfying certain conditions. This version leads to the orthogonal double covers, also surveyed here.

The average length of keys and functional dependencies in (random) databases

Practical database applications engender the impression that sets of constraints are rather small and that large sets are unusual and caused by bad design decisions. Theoretical investigations show, however, that minimal constraint sets are potentially very large. Their size can be estimated to be exponential in terms of the number of attributes. The gap between belief and theory causes non-acceptance of theoretical results. However, beliefs are related to average cases, while the theory known so far considered worst-case complexity. This paper aims at developing a theory of average-case complexity. Several statistical models and asymptotics of corresponding probabilities are investigated for random databases. We show that exponential complexity of independent key sets and independent sets of functional dependencies is rather unusual. Depending on the size of the relation, almost all minimal keys have a length which mainly depends on that size. The number of minimal keys of other lengths is exponentially small compared with the number of minimal keys of the derived length. Further, if a key is valid in a relation then it is probably a minimal key. The same results hold for functional dependencies.

A survey of some combinatorial results concerning functional dependencies in database relations

A database relation r has some obvious and less obvious parameters, such as the number of attributes, the size |r|, the maximum size of a domain, the number of some special functional dependencies (e.g. the minimal keys), and so on. The main aim of this paper is to survey some of the results giving connections and inequalities among these parameters. The methods are of a combinatorial nature. A generalization of the numerical dependency is also considered.

Partial Dependencies In Relational Databases And Their Realization

Weakening the functional dependencies introduced by Armstrong, we get the notion of partial dependencies defined on relational databases. We show that the partial dependencies can be characterized by closure operations on the poset formed by the partial functions on the attributes of the database. On the other hand, we give necessary and sufficient conditions under which, for such a closure operation, one can find on the given set of attributes a database whose partial dependencies generate the given closure operation. We also investigate some questions about how to realize certain structures related to databases by a database with a minimal number of rows, columns or elements.
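For orientation, the classical total-function case that the partial dependencies generalize works as follows: the closure of an attribute set under a family of functional dependencies is computed by a standard fixpoint iteration. A minimal sketch (our illustration, not the paper's partial-dependency machinery):

```python
def closure(attrs, fds):
    """Closure of the attribute set `attrs` under functional
    dependencies `fds`, given as (A, b) pairs meaning A -> b:
    repeatedly add b whenever A is already inside the closure."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for A, b in fds:
            if set(A) <= result and b not in result:
                result.add(b)
                changed = True
    return result
```

The map attrs -> closure(attrs, fds) is exactly a closure operation in the lattice of attribute sets: it is extensive, monotone, and idempotent.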

The Characterization Of Branching Dependencies

A new type of dependency in the relational database model is introduced. If b is an attribute and A is a set of attributes, then b is said to (p,q)-depend on A, in notation A -(p,q)-> b, in a database r if there are no q + 1 rows in r that have at most p different values in A but q + 1 different values in b. The (1,1)-dependency is the classical functional dependency. Let J(A) denote the set {b : A -(p,q)-> b}. The set function J(A) is characterized for p = 1, 1 < q; p = 2, 3 < q; and 2 < p, p^2 - p - 1 < q. Implications among (p,q)-dependencies are also determined.
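The definition translates directly into a brute-force checker; the following sketch (names are ours) tests whether b (p,q)-depends on A by examining every set of q + 1 rows:

```python
from itertools import combinations

def pq_depends(rows, A, b, p, q):
    """b (p,q)-depends on A in `rows` (A: list of column indices,
    b: a column index) if there is NO set of q + 1 rows having at
    most p distinct values on A but q + 1 distinct values on b.
    With p = q = 1 this is the ordinary functional dependency A -> b."""
    for subset in combinations(rows, q + 1):
        vals_A = {tuple(r[a] for a in A) for r in subset}
        vals_b = {r[b] for r in subset}
        if len(vals_A) <= p and len(vals_b) == q + 1:
            return False
    return True
```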

On The Number Of Databases And Closure Operations

Closure operations are considered as models of databases. Estimates on the number of closure operations on n elements (or equivalently, on the number of databases with n attributes) are given.

Burosch G, Demetrovics J, Katona GO. The Poset Of Closures As A Model Of Changing Databases. Order-a Journal on the Theory of Ordered Sets and Its Applications. 1987;4(2):127-42.
Győri E. Oracle technique in lower estimation of complexity. In: Demetrovics J, Katona G, Salomaa A, editors. Algebra, combinatorics and logics in computer science. Amsterdam: North-Holland; 1986. p. 433-41.
Algebra, combinatorics and logics in computer science. In: Demetrovics J, Katona GOH, Salomaa A, editors. Colloquia Mathematica Societatis János Bolyai. Vol 42. Amsterdam: Elsevier; 1986.
Demetrovics J, Katona GO. Combinatorial problems of database models. Colloquia Mathematica Societatis János Bolyai. 1986;42:331-53.
Demetrovics J, Katona GO, Pálfy PP. On the number of unions in a family of sets. Combinatorial mathematics: proceedings of the Third International Conference. 1983;555:150-8.
Demetrovics J, Katona GO. Extremal combinatorial problems in relational data base. Fundamentals of computation theory: proceedings of the 1981 International FCT-Conference, Szeged, August 24-28, 1981. 1981;117:110-9.