NAME

Stat.pm (v0.1) -- Computing and using statistics for large RDF datasets.

DESCRIPTION

Module Stat.pm computes statistics of RDF triple-store. It implements methods that facilitate the use of statistics in epsilon.

Statistics of triple store is represented by means of an associative array indexed by key patterns of the form (S,P,O) where S,P,O are either class identifiers, or, empty holes. The algorithm for the computation of statistics is defined as follows.

Computation of statistics for one triple

Here we sketch the algorithm for the computation of statistics for RDF triple-store.

1) For each ground triple (s,p,o) entered into Stat.pm first compute types of each particular component s,p,o yielding sets of class identifiers tS,tP,tO.

2) Transitive closure of sets tS,tP,tO is computed with respect to relationship rdfs:subClass to obtain all classes of componentes s,p,o. Transitive closures of sets tS,tP,tO are denoted ctS,ctP,ctO.

3) Let -" denote hole. For each key pattern (ts,tp,to) from (ctS|-) x (ctP|-) x (ctO|-) add one to $stat{(ts,tp,to)}.

Internal representation of `%stat` key pattern

Pattern ([s|-],[p|-],[o|-]), where s,p,o are class identifiers, is internally represented as key $ix-$id-$id1-$id2. Index $ix is used to select particular type of key: S,P,O,SP,SO,PO,SPO. Values $id, $id1 represent values of s,p,o with respect to the key.

Implemenetation in big3store shall include clear definition of key pattern data structure. Note that we do not use key pattern SPO in this implementation.

Data structures

%stat: Hash %stat is indexed by the key types. It stores the number of keys (including duplicates) for a given key type.
%stat1: Hash %stat1 is indexed by the key types. It stores the number of distinct keys for a given key type.
%stad: Hash %stad is indexed by the key types. It includes the memory for the computation of the distict keys of a given key type.
$stad_wind: The size of the window used to estimate the number of distinct instances for a given schema triple.
$stad_prob: The probability (in %) that an identifier that pops out of the window is a new identifier.
%cach: Hash table %cach is a cache for the transitive closures of the classes.
$bound: The mode of statistics is either bound or unbound.

Functions

ok = inc($ix, $id [, $id1, [, $id2]]): Increment counter of a key type $ix-$id[-$id1[-$id2]].
\@bts = mem_create($sz): Create a sequence of bits of the size $sz. (Adhoc implementation of bitstrings. To be improved.)
ok = mem_set(\%bp, $nt): Set the value of the $nt-th bit in the memory \%bp to 1.
$bt = mem_get(\%bp, $nt): Return the value of the $nt-th bit from the memory \%bp.
$cn = mem_count(\@bp): Count the number of 1's in the memory \@bp.
ok = hash($ix,$tri): Compute a hash value for a given index $ix from a given triple $tri.
ok = inc_dist1($tri, $ix, $id [, $id1, [, $id2]]): Increment counters of distinct values from $tri of key type $ix-$id[-$id1[-$id2]]. (*Obsolete*)
ok = inc_dist($tri, $ix, $id [, $id1, [, $id2]]): Increment counteris of distinct keys from a triple $tri. The triple $tri has the key type $ix-$id[-$id1[-$id2]].
ok = collect_stat_distinct(): Collect statistics records for each instances of schema triples and store counters of distinct instances back in %stad.
ok = insert_triple_prop($tri): Update statistics of the stored schema graph for one triple $tri. Compute class identifiers for each component of $tri using properties rdfs:domain, rdfs:range and rdfs:subPropertyOf. Generate all stored schema triples that are the types of $tri. Increment the counters for each of the stored schema triples.
ok = insert_triple_all($tri): Update statistics of the complete schema graph for one triple $tri. Compute class identifiers for each component of $tri using properties rdf:type, rdfs:subClassOf and rdfs:subPropertyOf. Generate all possible schema triples that are the types of $tri. Increment the counters for each of the generated schema triples.
ok = insert_triple_top($tri,$ul,$ll): Update statistics of a strip around the stored schema graph for one triple $tri. Compute class identifiers for each component of $tri and generate the schema triples $ul levels above and $ll levels below the stored schema graph. Increment the counters of the selected schema triples.
ok = insert_triple_top_bckp_1($tri): Update statistics for one triple $tri. Compute class identifiers for each component of $tri by selecting the top level of classes around domain and range classes. Increment by one each of the possible key classes of triple $tri.
ok = insert_triple_top_bckp($tri): Update statistics for one triple $tri. Compute class identifiers for each component of $tri by selecting the top level of classes around domain and range classes. Increment by one each of the possible key classes of triple $tri.
ok = print_stat(): Prints statistics by starting with the most populated key pattern towards least populated classes.
ok = print_stat_count(): Print counters of instances of schema triples.
ok = print_stat_distinct(): Print counters of distinct instances ofschema triples.

AUTHORS

Iztok Savnik <iztok.savnik@upr.si>; Kiyoshi Nitta <knitta@yahoo-corp.jp>

DATES

Created 26/01/2015; Last update 10/6/2021.