NAME

*qios* -- query processing system for Internet data

DESCRIPTION

*qios* (v0.9) is a system for manipulating data from Internet data sources. It is intended to serve as the lightweight kernel of a data manipulation server. The main design aims of *qios* are: fast manipulation of large collections of data; a range of data manipulation functions, from classical querying to data restructuring; and the ability to organize, store and browse, on the local host, the data collections obtained from Internet data sources.

The system currently provides an interface for XML. Handling other data formats requires adding interface routines that convert the data into the internal database format.

Data model

Data from Internet data sources is expected to be organized using some form of the *collection* data structure. *qios* uses a single internal data model. The model is based on the structural part of F-Logic, which turned out to be appropriate for representing structured objects stored in formats such as XML. Objects are represented as tuples that can include functional or multi-valued attributes ranging over scalars or objects. The *qios* data model is presented in detail in *Que.pm*.

Query processor

The architecture of the *qios* query processor is based on the architecture of relational query processors. Queries are represented internally as query trees, which serve as the main data structure in nearly all phases of query processing. The query trees are built by parsing the query expression. The type-checking procedure applies standard type-checking rules, which are coded into the procedure.
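The idea of a query tree that is walked by a type-checking procedure can be illustrated with a minimal sketch. It is written in Python for brevity and is not the code of *Que.pm*: the class "Node", the function "check", the shape of "schema" and the sample class "supplier" with its attributes are all assumptions made for this illustration. The result type of *join* is simplified to the union of the attributes of both inputs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a query tree (hypothetical layout, not the Que.pm one)."""
    op: str                                        # "table", "select", "project" or "join"
    children: list = field(default_factory=list)   # input sub-trees
    arg: object = None                             # class name, predicate text or attribute list


def check(node, schema):
    """Return the set of attribute names produced by `node`.

    `schema` maps a class name to the set of its attribute names.
    Raises TypeError when the tree refers to something unknown.
    """
    if node.op == "table":
        if node.arg not in schema:
            raise TypeError(f"unknown class {node.arg!r}")
        return set(schema[node.arg])
    if node.op == "select":
        (child,) = node.children
        return check(child, schema)        # filtering keeps all attributes
    if node.op == "project":
        (child,) = node.children
        attrs = check(child, schema)
        missing = set(node.arg) - attrs
        if missing:
            raise TypeError(f"unknown attributes {sorted(missing)}")
        return set(node.arg)
    if node.op == "join":
        left, right = node.children
        # Simplification: the joined tuple carries the attributes of both sides.
        return check(left, schema) | check(right, schema)
    raise TypeError(f"unknown operation {node.op!r}")


# Hypothetical schema and query tree for project(select(supplier.ext, ...), [sname]).
schema = {"supplier": {"sno", "sname", "city"}}
tree = Node(
    "project",
    [Node("select", [Node("table", arg="supplier")], arg="x.city = 'Paris'")],
    arg=["sname"],
)
print(sorted(check(tree, schema)))   # → ['sname']
```

A single recursive pass over the tree is enough here because every operation derives its result type from the types of its inputs; a tree that projects an attribute missing from its input is rejected with a TypeError.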
Further, the query optimization rules are entered into the system in the form of augmented query expressions (currently read at system startup), which are transformed into an internal representation based on query trees, with additional information detailing the query transformation. New query optimization rules can be easily added to the system.

The query optimization module is based on the work of G.Graefe (with the Wisconsin DB group) on the query optimizers Exodus and Volcano. The basic data structure for storing the alternative query trees considered during optimization is the *mesh*, which is presented in more detail in the module *Optor.pm*. The optimization algorithm is based on dynamic programming and produces sub-optimal bushy query trees. The query optimization environment can serve as a platform for studying query optimization algorithms appropriate for Internet data.

Physical query optimization is treated separately. A simple sub-optimal algorithm chooses among the physical operations available for implementing each logical operation. Decisions about creating indexes for query evaluation are made by the query compilation procedure. The final query evaluation plan is implemented in the standard manner, using iterator trees.

Query language

The query language of *qios* is a functional language. Queries are expressions composed of the query operations *select*, *project* and *join*. The query operations are functions whose parameters are collections and predicates and whose results are collections. Examples of *qios* queries are given together with the two sample domains *SPJ* and *mondial*, which are stored in the directories "src/spj" and "src/mondial".

Operations

*select(t, x:p(x))*
    The operation *select* filters the collection *t* by using the predicate expression *p(x)*.
*project(t, [a1,...,an])*
    The operation *project* selects the attributes *a1,...,an* from the tuples of the collection *t*.

*join(t1,t2, x,y:p(x,y))*
    The operation *join* computes a new collection of tuples composed of the pairs of tuples from the collections *t1* and *t2* that satisfy the predicate *p(x,y)*.

Table expression

A table expression has the form *c.E*, where *c* is a class identifier and *E* is one of the symbols *ext* or *ins*. The symbol *ext* denotes the class extension that includes only the members of class *c*; the symbol *ins* denotes the instances of *c*, which also include the members of its sub-classes. The result of evaluating a table expression is a collection of objects.

Predicates

The predicates of the operations *select* and *join* can include the following schema operations on attributes and scalars.

*==,=*
    Identity equality '==' compares objects by their ids and scalars by their values. Deep equality '=' compares objects by their values.

*<,>,<=,>=*
    Arithmetical comparison operations.

*in,subset*
    The set operations. The operation 'in' is the standard membership operation and the operation 'subset' tests the subset relationship.

*and,or,!*
    The logical operations conjunction, disjunction and unary negation.

*=~*
    String match operation, which works in the same manner as the standard Unix matching of 'sed' and 'grep' and the Perl operation '=~'. The matching expression has to be the right-hand operand; it is given in quotes >>"<< or >>'<<.

*Path expression*
    Functional path expressions are currently supported.

IMPLEMENTATION

*qios* is implemented in Perl v5.6.0, which is one of the most widely used Web programming languages. The Perl programming environment offers a rich and well-organized collection of Perl modules (CPAN) covering most modern data formats as well as database interfaces. Recent studies (e.g.
by Lutz Prechelt, presented in IEEE Computer 33(10), 23-29, 2000) suggest that, especially for text and data manipulation tasks, the performance of *Perl* is comparable to that of the programming languages *Java* and *C++*.

The code for manipulating Internet data files, including XML and HTML documents, is based on publicly available *Perl* modules: the *libwww-perl* library written by G.Aas and M.Koster, the HTML-Parser library by G.Aas and A.Chase, and XML::Parser by L.Wall and C.Cooper. The storage manager of *qios* is based on Berkeley DB Version 1.x and the *Perl* interface DB_File for Berkeley DB written by P.Marquess. Additional publicly available *Perl* modules used by *qios* are the *libnet* modules, Term::ReadLine, URI, MIME::Base64 and Digest::MD5. All modules listed above are available at the nearest CPAN archive (see e.g. http://www.perl.com/CPAN/).

Modules

    Glb.pm    - global common structures and procedures
    Osm.pm    - object storage manager
    Rsm.pm    - record storage manager
    Load.pm   - xml file/net reader
    Que.pm    - query representation, type-checking
    Eval.pm   - selection of physical operations, query evaluation
    Parser.pm - query parsing, rule parsing
    Rule.pm   - rule manipulation
    Cost.pm   - cost function
    Optor.pm  - optimizer algorithms, mesh query store

The full names of the Qios modules are *Qios::Mod*, where *Mod* is one of the above modules. The man pages of the modules can therefore be obtained by invoking "man Qios::Mod".

Qios environment

The configuration file, the log files and the database files of a user are stored in the *qios* base directory. By default, the base directory is "$HOME/.qios_base", but it can also be set using the shell variable "QIOS_BASE". If the base directory does not exist, it is created on the first activation of *qios*.

The *qios* environment consists of database files that reside in the *qios* base directory. A database file has the extension ".db".
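The lookup of the base directory described above can be sketched as follows. This is an illustration in Python under stated assumptions, not the code of *qios* itself: the function names "qios_base" and "database_files" are hypothetical, and the exact order in which *qios* consults its environment is assumed from the description (the variable "QIOS_BASE" first, then "$HOME/.qios_base").

```python
import os
from pathlib import Path


def qios_base(env=None):
    """Return the base directory, creating it if it does not exist yet.

    `env` is a mapping of environment variables; it defaults to os.environ
    and exists only to make the function easy to try out (an assumption of
    this sketch, not a qios feature).
    """
    env = os.environ if env is None else env
    override = env.get("QIOS_BASE")
    if override:
        base = Path(override)
    else:
        home = env.get("HOME") or str(Path.home())
        base = Path(home) / ".qios_base"
    # Mirrors "created on the first activation": a no-op if it already exists.
    base.mkdir(parents=True, exist_ok=True)
    return base


def database_files(base):
    """List the names of the database files (extension ".db") in `base`."""
    return sorted(p.name for p in Path(base).glob("*.db"))
```

Calling "qios_base()" without arguments resolves the directory from the real environment, and "database_files(qios_base())" lists the databases found there; other files in the base directory, such as the configuration file and the log files, are skipped by the ".db" filter.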
Each database file is a *qios database* storing the classes and instance objects that describe a particular domain. The static structure of a database and the dynamic properties of the storage manager can be tuned for the specific domain by means of *database parameters*, which are listed in the next section.

The default values of the database parameters can be set in the configuration file "config", which is stored in the *qios* base directory. Each parameter assignment statement has the syntax *param = value*, where *param* is one of the database parameters listed in the next section. The default values of the database parameters are written to the file "config" after the first activation of *qios*.

After *qios* is started, the database parameters can be tuned for a particular database using the *set* command (see *qios* help). The initial database parameters assigned when a database is created are stored in the root record of the database file. The dynamic database parameters (e.g. the buffer size) can be changed after the creation of the database; the root object of the database is treated like any other database object.

Database parameters

*dbdir*
    The root of the directory that stores the configuration file and the data files.

*dbfile*
    The name of the default database file. The program default is "base.db".

*dbsize*
    The default size of the database file (in objects). The program default is 100000.

*bufsize*
    The default size of the object buffer. The program default is 10000.

*quefn*
    The default name of the query file. The program default is the empty string.

*logfile*
    The default name of the log file. The program default is "log".

*loglvl*
    The default log level of the system (0..2). The program default is 2.

*dbglvl*
    The default debug log level (0..2). The program default is 0.

*debug*
    Turn debug on or off (1,0). The program default is "off".
*collsvd*
    The option turns on the collapse of nested relations that include single functional attributes during the schema reengineering and DTD load procedures.

*collmvd*
    The option turns on the collapse of nested relations that include single multivalued attributes during the schema reengineering and DTD load procedures.

*matres*
    Materialize the result of the query (0,1). The program default is 0.

*outtype*
    The type of output data. The value `0' denotes the internal format. The value `1' denotes the XML format.

Qios system commands

This section presents the basic *qios* system commands. The upper-case words in the command descriptions stand for command parameters. Square brackets '[]' denote optional parameters. The symbol '|' denotes alternatives. See also *qios* help.

Query commands

*active [F]* (alias 'a')
    Set the active query file to F. Prints the active query if F is not specified.

*edit [F]* (alias 'e')
    Edit the active query or the file F.

*eval [Q]* (alias '!')
    Execute the active query file or the query file Q.

XML commands

*extract U|F*
    Extract internal database schemata from XML data stored as file F or at URL U. The rules used for the extraction of the internal schemata are presented with the module *Load*.

*load U|F*
    Load DTD schemata or XML data from file F or from URL U. The extensions ".dtd" and ".xml" have to be used to differentiate between schemata and XML data.

DB commands

*commit*
    Commit current changes.

*close [B]*
    Close the active database or database B.

*create D*
    Create database D using the database parameters stored as the 'root' database descriptor.

*delete O [ext|sub]*
    Delete object O, the extension of O [ext], or all sub-objects of O [sub].

*list [C] [val]* (alias 'l')
    List the sub-classes of 'object' or the sub-classes of C. The parameter "val" causes the values of the sub-classes to be printed.

*open [B]*
    Open database B. The root object of the database stores the database parameters.

*print O [ext|ins|sub|sbc] [ids|val]* (alias 'p')
    Print object O (default).
    The parameters "ext|ins|sub|sbc" choose the sub-objects of O to be printed. The parameter "ext" denotes the class extension. The parameter "ins" denotes all instances of class O. The parameter "sub" stands for all sub-objects of O and the parameter "sbc" denotes the sub-classes of class O. The parameters "ids|val" select the printout of either the object identifiers or the object values ("val" is the default).

*print root*
    Print the database parameters of the current database.

*set C = V*
    Set database parameter C to value V.

*use D*
    Make database D active.

*verify pspace [list|check|report|raw]*
    Check all objects stored in the database. The parameter "list" prints the oids from the active database. The parameter "check" verifies the database and makes corrections if needed (not implemented!). The parameter "report" prints more details about the db contents (not implemented!). The parameter "raw" prints the db-file as it is.

*verify vspace [ids|raw]*
    Print all objects that are currently in the buffer manager. The parameter "ids" prints the object identifiers. The parameter "raw" prints the object buffer through the volatile address space.

Miscellaneous

*help*
    Print the help message.

*quit*
    Close the active databases and quit *qios*.

Unix commands

The Unix commands that can be invoked from *qios* are *ls*, *cp*, *more*, *mv* and *vi*.

SEE ALSO

*Qios::Glb*(3), *Qios::Rsm*(3), *Qios::Osm*(3), *Qios::Load*(3), *Qios::Que*(3), *Qios::Parser*(3), *Qios::Eval*(3), *Qios::Cost*(3), *Qios::Rule*(3), *Qios::Optor*(3)

AUTHORS

Iztok Savnik, Zahir Tari

Acknowledgments

Parts of *qios* were developed while I.Savnik was with the Database and Information Systems group at Freiburg University.