Measuring approximate functional dependencies: a comparative study
- Rafal Tekreeti
- Aug 25
- 4 min read
By: Marcel Parciak, Sebastiaan Weytjens, Niel Hens, Frank Neven, Liesbet M. Peeters, Stijn Vansummeren

Functional dependencies (abbreviated: FDs) describe a strong relationship in relational databases, where information is organised into relations (tables), attributes (columns), and tuples (rows). An FD is written X → Y (read: X implies Y), where both X and Y are sets of attributes in a relation. We call X the left-hand side and Y the right-hand side. An FD X → Y in some relation tells us that for each tuple, the value(s) of the left-hand side determine the value(s) of the right-hand side. That is, whenever we see the same value for X, we must see the same value for Y. We show an example in Table 1, which contains information about patients and the diagnoses they received from a doctor. Each diagnosis has a name and a code, both of which should match up in our data. Hence, we expect the FD {diagnosis} → {code} to hold. However, t4 has a code that differs from t0 and t2 even though the diagnosis name is the same, indicating that we have an error in our data.
| # | patient_id | doctor | diagnosis | code |
|----|------------|------------|------------------------------|-----------|
| t0 | 39002 | dr. Smith | Type 2 Diabetes mellitus | E11.9 |
| t1 | 39012 | dr. Smith | Cystic fibrosis, unspecified | E84.9 |
| t2 | 39062 | dr. Taylor | Type 2 Diabetes mellitus | E11.9 |
| t3 | 39093 | dr. Taylor | Cystic fibrosis, unspecified | E84.9 |
| t4 | 39041 | dr. Taylor | Type 2 Diabetes mellitus | **E84.9** |
| … | … | … | … | … |

Table 1: Example relation containing information about diagnoses, showing the patient, the doctor, the diagnosis name, and its code. The erroneous code of tuple t4 is shown in bold.
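To make this concrete, here is a minimal sketch (not the authors' released implementation; the helper name `fd_violations` is ours) of how one might detect violations of an FD such as {diagnosis} → {code} in Python, using rows that mirror Table 1:

```python
from collections import defaultdict

# Rows mirroring Table 1: (patient_id, doctor, diagnosis, code).
rows = [
    ("39002", "dr. Smith",  "Type 2 Diabetes mellitus",     "E11.9"),
    ("39012", "dr. Smith",  "Cystic fibrosis, unspecified", "E84.9"),
    ("39062", "dr. Taylor", "Type 2 Diabetes mellitus",     "E11.9"),
    ("39093", "dr. Taylor", "Cystic fibrosis, unspecified", "E84.9"),
    ("39041", "dr. Taylor", "Type 2 Diabetes mellitus",     "E84.9"),  # erroneous code
]

def fd_violations(rows, lhs, rhs):
    """Return pairs of row indices that witness a violation of lhs -> rhs.

    lhs and rhs are tuples of column indices. Two rows violate the FD if
    they agree on all lhs columns but disagree on some rhs column.
    """
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[c] for c in lhs)].append(i)
    violations = []
    for indices in groups.values():
        for i in indices:
            for j in indices:
                if i < j and any(rows[i][c] != rows[j][c] for c in rhs):
                    violations.append((i, j))
    return violations

# Check {diagnosis} -> {code}: column 2 should determine column 3.
print(fd_violations(rows, lhs=(2,), rhs=(3,)))  # [(0, 4), (2, 4)]
```

As expected, the erroneous tuple t4 (index 4) conflicts with t0 and t2.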
Knowing such relationships helps in keeping data consistent and clean, in making database query processing faster, and in facilitating the transformation of data. In many data science scenarios, however, the FDs remain unknown or incomplete. Therefore, a variety of techniques have been developed to reverse engineer FDs from relational data. In practical settings, relational data may get corrupted, for example through errors during manual data entry, such that the FDs no longer hold. Reverse engineering FDs from such erroneous data is particularly challenging. Instead, one must consider approximate functional dependencies (AFDs): FDs that “almost hold” in a tabular dataset.
In this respect, a key decision is what it means for an FD to “almost hold”. This decision is formalised by the adoption of an AFD measure, which quantifies the extent to which an FD holds in a dataset by giving a score between 0 and 1. Higher values indicate a higher degree of FD satisfaction. In our example Table 1, where we expect the FD {diagnosis} → {code} to hold, we see an error in t4 such that the FD in fact does not hold. It is up to the AFD measure to assign a score in this case, indicating to what degree the FD is satisfied.
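As an illustration, here is a minimal sketch of one classical violation-counting measure, usually called g3 in the literature: g3 is the minimum fraction of tuples that must be deleted for the FD to hold exactly, so 1 − g3 serves as a satisfaction score between 0 and 1. This is an illustrative reading, not the paper's exact formulation; it reuses the `rows` list from the sketch above.

```python
from collections import Counter, defaultdict

def afd_score_g3(rows, lhs, rhs):
    """Fraction of tuples we can keep so that lhs -> rhs holds exactly.

    This equals 1 - g3, where g3 is the classical "minimum fraction of
    tuples to delete" measure: 1.0 means the FD holds, values near 0
    mean it is far from holding.
    """
    counts = defaultdict(Counter)
    for row in rows:
        x = tuple(row[c] for c in lhs)
        y = tuple(row[c] for c in rhs)
        counts[x][y] += 1
    # Within each lhs-group, keep the tuples with the most frequent rhs value.
    kept = sum(max(counter.values()) for counter in counts.values())
    return kept / len(rows)

# On Table 1, {diagnosis} -> {code}: only t4 must be dropped, so 4/5 remain.
print(afd_score_g3(rows, lhs=(2,), rhs=(3,)))  # 0.8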
Many AFD measures have been proposed in the scientific literature over the past decades. Unfortunately, these measures vary widely in nature, and there has been little work so far comparing them. Our study fills this gap, assessing how the different AFD measures compare and which measure(s) should be used in practice to discover AFDs in relational data. Specifically, our study compares all measures conceptually and experimentally. In our conceptual comparison, we considered how interpretable the measures are, identified common design principles of AFD measures, and assessed the measures’ ability to separate AFDs from non-AFDs using synthetically generated data. In our practical comparison, we collected real-world datasets and manually identified and annotated semantically meaningful FDs. We then assessed whether the AFD measures can identify these FDs, i.e., whether they give annotated FDs a higher score than non-annotated ones.
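The publication describes the exact evaluation protocol; as a rough illustration of the idea only, the following hypothetical helper quantifies how well a measure separates annotated from non-annotated candidates as the probability that a random annotated candidate outranks a random non-annotated one (a pairwise statistic equivalent to ROC-AUC):

```python
def separation(annotated_scores, other_scores):
    """Probability that a randomly chosen annotated FD candidate scores
    higher than a randomly chosen non-annotated one (ties count half).

    Illustrative only; not necessarily the metric used in the paper.
    """
    wins = sum(
        1.0 if a > o else 0.5 if a == o else 0.0
        for a in annotated_scores
        for o in other_scores
    )
    return wins / (len(annotated_scores) * len(other_scores))

# A measure that perfectly separates the two groups would give 1.0.
print(separation([0.9, 0.8], [0.8, 0.5]))  # 0.875
```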
Key contributions
- We have released both our manually annotated benchmark for AFD discovery and our Python implementations of the 12 AFD measures we identified in the academic literature.
- We have identified three classes of AFD measures: measures that count violations, measures based on Shannon entropy, and measures based on logical entropy (a sketch of one entropy-based measure appears below, after this list).
- Based on these classes, we have created a comprehensive comparison of all AFD measures, showing similarities between them. In this comparison, we further identified gaps that allowed us to define two new AFD measures.
- Using synthetic data, we have identified specific properties of relational data that AFD measures may be sensitive to. Using our newly created, manually annotated benchmark, we then verified that this sensitivity occurs in practice when discovering AFDs in real-world data, and compared all AFD measures with respect to their ability to separate (A)FDs from non-(A)FDs.
- Based on the obtained insights, we recommend the use of the relatively unknown measure µ, as it is fast and effective. The measure RFI′ is the runner-up, as it is effective but inefficient to compute. The well-known measure g3 takes third place, as it is fast but less effective than the other two.
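As a flavour of the Shannon-entropy class, here is a minimal sketch of one commonly cited member, the “fraction of information”, which scores an FD by how much of the uncertainty in Y is explained by X. This is a standard textbook formulation, not necessarily the exact definition of any specific measure in the paper:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def fraction_of_information(rows, lhs, rhs):
    """I(X;Y) / H(Y): equals 1.0 iff lhs -> rhs holds exactly.

    Assumes H(Y) > 0, i.e. the right-hand side is not constant.
    """
    xs = [tuple(row[c] for c in lhs) for row in rows]
    ys = [tuple(row[c] for c in rhs) for row in rows]
    h_y = entropy(ys)
    h_y_given_x = entropy(list(zip(xs, ys))) - entropy(xs)  # H(Y|X) = H(X,Y) - H(X)
    return (h_y - h_y_given_x) / h_y

# On Table 1, {diagnosis} -> {code} scores about 0.43 under this measure:
# a single error can cost entropy-based measures noticeably more than
# violation-counting ones, which is one way the classes differ.
```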
The use of the VSC
Our benchmark consists of ten relations with 3 to 38 attributes and ten thousand to one million tuples, resulting in 1634 candidate attribute pairs that we want to check with each of our 14 AFD measures (the 12 from the literature plus our two new ones). While most AFD measures are relatively quick to compute on commodity hardware (less than a tenth of a second per candidate), three of our measures took up to 345 seconds per candidate on average. The VSC’s computer cluster allowed us to run jobs for up to 7 days, which made our comprehensive comparison of AFD measures on real-world data possible.
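A back-of-the-envelope calculation with these numbers shows why such long-running jobs were essential: scoring all 1634 candidates sequentially with a measure that needs 345 seconds per candidate takes about 1634 × 345 ≈ 563,730 seconds, or roughly 6.5 days of compute for a single slow measure, right at the scale of the 7-day job limit.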
Read the full publication in Springer Nature.
🔍 Your Research Matters — Let’s Share It!
Have you used VSC’s computing power in your research? Did our infrastructure support your simulations, data analysis, or workflow?
We’d love to hear about it!
Take part in our #ShareYourSuccess campaign and show how VSC helped move your research forward. Whether it’s a publication, a project highlight, or a visual from your work, your story can inspire others.
🖥️ Be featured on our website and social media, show the impact of your work, and help grow our research community.
📬 Submit your story: https://www.vscentrum.be/sys