Data integrity checks
The quality of data is very important to us. Although ultimately the data responsibility lies with editors and project managers we take great care to avoid entering of inconsistent data on a technical level, e.g. with the user interface it is not possible to enter begin dates which are later than end dates.
Nevertheless mistakes can happen, not only on the application level but also e.g. when importing data from other projects or deleting files outside of the application. Because data integrity is important for the quality of research we implemented functions to check possible inconsistencies which are described in detail below.
Orphans
Orphans
In this tab entries like dates which are not linked are shown. They could be artifacts from imports or bugs and can be deleted. If they seem to appear regularly again (without imports or known bugs) please report that issue.
Entities without links
Entries shown have no relation to other entities. That could be ok but maybe they are artifacts or were forgotten to link.
Type
These types were created but have no sub types or associated data. Maybe they originate from the first install or were never used.
Missing files
Here are listed file entities which have no corresponding file, most likely because the file itself doesn’t exist anymore.
Orphaned files
Files that have no corresponding entity are listed here.
Orphaned IIIF files
IIIF files that have no corresponding entity are listed here.
Orphaned annotations
Annotations that are linked to an entity, but where the file and the entity doesn’t have a connection are listed here. There are three options to proceed:
Relink entity: Add a link between file and entity
Remove entity: Removes the entity from the annotation
Delete annotation: Deletes the whole annotation
Orphaned subunits
Subunits that are missing a connection to the level above. E.g. a feature without a connection to a place.
Circular dependencies
A check if an entity is linked to itself. This could happen e.g. if a person is married to herself or a type has itself as super. It shouldn’t be possible to create circular dependencies within the application. Nevertheless it’s a useful check for e.g. if data is imported from other systems.
Dates
In this view various results of invalid or inconsistent dates are shown.
Invalid dates
Invalid date combinations are shown, e.g. begin dates which are later than end dates. These issues should be fixed because otherwise the user interface won’t allow to update these entities.
Invalid link dates
Like with “invalid dates” issues with these should be solved as soon as possible.
Invalid involvement dates
Incompatible dates at involvement are shown. E.g. a person participated at an event longer than the event took place.
Invalid preceding dates
Incompatible dates for chained events are shown. E.g. a preceding event started after the succeeding event.
Invalid sub dates
Incompatible dates for hierarchical events are shown. E.g. a sub event began before the super event began.
Check links
With this function every link will be checked for CIDOC validity. Depending on the amount of data this could take some time. Data entered with the OpenAtlas user interface should always be CIDOC valid but in case of e.g. imported data this check should be used afterwards. If invalid links are found they should be dealt with outside the application.
Check link duplicates
There are actually two checks:
The first one checks for duplicate links which are identically and can be safely deleted when clicking the Delete link duplicates button.
In case the first test found no duplicate links it will be checked for entities connected multiple times two a type which is defined for single use. E.g. a place has the type castle and city. In this case you would only see one in the user interface and the other one would get deleted in case anybody updates the entry. Here you have the option to look at these and remove the wrong ones clicking on the Remove link beside the entries in the last column.
Both checks shouldn’t find anything wrong with data entered with the application but nevertheless it could happen because of imports or unknown bugs.
Check similar names
Here you can search for similar names. Depending on selection and data volume this might take some time.
Classes - select the class which you want to search for similar names
Ratio - select how similar the names should be, 100 is the default and means absolute identical
The function uses the fuzzywuzzy package is used which in turn uses the Levenshtein Distance.