Doctoral School of Exact and Natural Sciences - Jagiellonian University

Implementation Doctorate Programme II - artificial intelligence of the Doctoral School of Exact and Natural Sciences of the Jagiellonian University for 2020, part 2

Information on endowment from the state budget or a state special purpose fund
The project is financed by the state budget

Name of the programme or fund
IMPLEMENTATION DOCTORATE PROGRAMME II - ARTIFICIAL INTELLIGENCE

Name of the project
Implementation Doctorate Programme II - artificial intelligence of the Doctoral School of Exact and Natural Sciences of the Jagiellonian University for 2020

Project manager
dr hab. Adam Roman, prof. UJ

Value of the endowment
PLN 309 522.34

Total investment cost
PLN 309 522.34

Brief description of the project

The goal of the doctoral dissertation is to develop methods that enable effective data management, support code maintenance (e.g. by lowering costs) and code development, and give a better understanding of the entire ecosystem of procedures.

Planned subject matter and detailed plan of the dissertation.

The subject of the doctoral dissertation fits well in the areas of theoretical computer science (using complexity theory and optimization methods to design new optimization and control methods for ETL processes, and studying their computational complexity) and technical computer science (using tools from graph theory and artificial intelligence to provide a functional solution).

The detailed plan of the dissertation is presented below. Each task will be preceded by a literature review, conducted by the doctoral student, covering the latest results in the optimization of ETL processes and machine learning, as well as classical methods in graph theory and computational complexity theory.

The detailed plan of the dissertation:

1. Development of a mathematical model of a computational graph representing the type of business processes used in the banking sector, and examination of the complexity class of the isomorphism problem for this class of graphs.
2. Development of an effective method for finding the differences between two procedures using optimization methods and graph neural networks.
3. Development of methods that indicate where repetitive code can be reduced, by analyzing a computational graph (representing the ETL process) with deep machine learning techniques, graph embedding and appropriate similarity measures.
4. Development of a method that finds potential errors and inefficient code and suggests solutions, using machine learning, anomaly detection and pattern matching techniques.
5. Examination of the naming ontology in a database for a given business area using machine learning and natural language processing techniques.

Research methodology.

The problem of checking the equivalence of two ETL processes will be reduced to the graph isomorphism problem for a specific class of graphs. Tasks 1 and 2 of the detailed plan above are intended to determine whether a process model that faithfully reflects the ETL processes companies use in their operations has any features that would make the isomorphism problem easier to solve, or to determine which additional assumptions on the model would narrow the class of graphs considered to one for which the isomorphism problem can be solved, for example, in polynomial time.
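To make the reduction concrete, the sketch below treats an ETL process as a directed graph whose nodes carry operation labels and applies Weisfeiler-Lehman colour refinement, a standard heuristic that gives a necessary (not sufficient) condition for isomorphism. It is only an illustration under assumptions: the graph encoding, the operation labels and the function names `wl_signature` and `maybe_isomorphic` are hypothetical and are not taken from the project.

```python
from collections import Counter


def wl_signature(nodes, edges, rounds=3):
    """Multiset of refined node colours for a labelled directed graph.

    nodes: dict mapping node id -> operation label (hypothetical encoding)
    edges: list of (source id, target id) pairs
    """
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for src, dst in edges:
        succs[src].append(dst)
        preds[dst].append(src)

    colour = dict(nodes)  # initial colour = operation label
    for _ in range(rounds):
        # New colour = own colour plus sorted colours of predecessors
        # and successors, so node ids never influence the result.
        colour = {
            n: repr((colour[n],
                     sorted(colour[p] for p in preds[n]),
                     sorted(colour[s] for s in succs[n])))
            for n in nodes
        }
    return Counter(colour.values())


def maybe_isomorphic(g1, g2, rounds=3):
    """False means certainly not isomorphic; True means 'not ruled out'."""
    return wl_signature(*g1, rounds=rounds) == wl_signature(*g2, rounds=rounds)


# Toy processes: a and b differ only in node ids, c reorders the steps.
a = ({"x": "extract", "y": "transform", "z": "load"}, [("x", "y"), ("y", "z")])
b = ({1: "extract", 2: "transform", 3: "load"}, [(1, 2), (2, 3)])
c = ({1: "extract", 2: "transform", 3: "load"}, [(1, 3), (3, 2)])

print(maybe_isomorphic(a, b))
print(maybe_isomorphic(a, c))
```

A positive answer from this check does not prove isomorphism in general; whether it becomes exact for the class of graphs studied in the dissertation is the kind of question tasks 1 and 2 address.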

To solve task 2, distance metrics will be proposed that express the structural and semantic differences between the compared processes. The metrics will have to reflect the difference between two ETL processes not only from a purely structural point of view but, above all, from a business one.
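One possible shape for such a measure is a weighted sum of a structural term and a semantic term. The sketch below is an assumption made for illustration only: the node attributes `op` and `tables`, the weights, and the use of Jaccard distance are hypothetical choices, not the metrics the dissertation will propose. The semantic term is weighted more heavily to mirror the emphasis on business meaning.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def process_distance(g1, g2, w_struct=0.3, w_sem=0.7):
    """Weighted structural + semantic dissimilarity of two process graphs.

    Each graph is (nodes, edges); nodes maps id -> {"op": str, "tables": set}.
    The weights are illustrative placeholders.
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2

    # Structural view: which operation feeds which operation.
    struct1 = {(nodes1[u]["op"], nodes1[v]["op"]) for u, v in edges1}
    struct2 = {(nodes2[u]["op"], nodes2[v]["op"]) for u, v in edges2}

    # Semantic view: which business tables the process touches.
    sem1 = set().union(*(n["tables"] for n in nodes1.values()))
    sem2 = set().union(*(n["tables"] for n in nodes2.values()))

    return (w_struct * jaccard_distance(struct1, struct2)
            + w_sem * jaccard_distance(sem1, sem2))


# Toy processes touching the same tables but using a different middle step.
p = ({1: {"op": "extract", "tables": {"accounts"}},
      2: {"op": "aggregate", "tables": {"accounts"}},
      3: {"op": "load", "tables": {"report"}}},
     [(1, 2), (2, 3)])
q = ({1: {"op": "extract", "tables": {"accounts"}},
      2: {"op": "filter", "tables": {"accounts"}},
      3: {"op": "load", "tables": {"report"}}},
     [(1, 2), (2, 3)])

print(process_distance(p, q))
```

For these two toy processes only the structural term contributes, because both touch the same tables. Note that a score of zero does not imply the processes are identical, so this is a dissimilarity score rather than a metric in the strict sense.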

Tasks 3 and 4 will use machine learning tools, in particular neural networks that take graphs as input. There are currently many works in the literature devoted to so-called graph embedding, which represents graph structure as input for a neural network. Research in this area will concern methods of embedding graphs in other "convenient" representational structures (e.g. vector spaces), taking into account the specificity of ETL processes and, in particular, semantic information.
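As a rough illustration of what mapping a graph into a vector space means, the sketch below builds a hand-crafted embedding from counts of operation labels and labelled edges, and compares two graphs by cosine similarity. This is a simple stand-in, not a learned graph embedding or a graph neural network; the function names and the toy graphs are assumptions made for the example.

```python
import math
from collections import Counter


def features(nodes, edges):
    """Count operation labels and (source label, target label) edge pairs."""
    counts = Counter(nodes.values())
    counts.update((nodes[u], nodes[v]) for u, v in edges)
    return counts


def build_vocabulary(graphs):
    """Fixed feature order shared by all graphs in a corpus."""
    seen = set()
    for nodes, edges in graphs:
        seen.update(features(nodes, edges))
    # Features mix strings and tuples, so order them by their repr.
    return sorted(seen, key=repr)


def embed(graph, vocabulary):
    """Map a graph to a vector of feature counts."""
    counts = features(*graph)
    return [float(counts[f]) for f in vocabulary]


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


# Toy graphs: g1 and g2 are the same process with different node ids.
g1 = ({1: "extract", 2: "join", 3: "load"}, [(1, 2), (2, 3)])
g2 = ({"a": "extract", "b": "join", "c": "load"}, [("a", "b"), ("b", "c")])
g3 = ({1: "extract", 2: "load"}, [(1, 2)])

vocab = build_vocabulary([g1, g2, g3])
v1, v2, v3 = (embed(g, vocab) for g in (g1, g2, g3))
print(cosine_similarity(v1, v2), cosine_similarity(v1, v3))
```

A learned embedding would replace the fixed counts with trained parameters, but the downstream use is the same: procedures whose vectors are close are candidates for shared or repeated code.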

In task 5, methods and techniques of NLP (natural language processing) will be used, along with machine learning tools (e.g. bag of words approaches).
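A bag-of-words view of database naming can be sketched as follows: identifiers are split into word-like tokens and the tokens are counted across the schema. The column names below are invented for the example, and the tokenisation rules (splitting on separators and on camelCase boundaries) are an assumption about what a first pass might look like.

```python
import re
from collections import Counter

# Matches an acronym run, a capitalised or lowercase word, or a digit run.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def tokenize(identifier):
    """Split a column or table name into lowercase word tokens."""
    tokens = []
    # First split on separators such as "_" or ".", then on case boundaries.
    for part in re.split(r"[^A-Za-z0-9]+", identifier):
        tokens.extend(_TOKEN.findall(part))
    return [t.lower() for t in tokens]


def bag_of_words(identifiers):
    """Token counts over a collection of identifiers."""
    bag = Counter()
    for name in identifiers:
        bag.update(tokenize(name))
    return bag


# Hypothetical column names from a banking schema.
names = ["CUST_ACCT_ID", "cust_acct_bal", "customerAccountId"]
bag = bag_of_words(names)
print(bag.most_common())
```

In this toy input `cust` and `customer` are counted as different tokens even though they likely name the same concept; recognising such variants is the kind of question a study of the naming ontology would have to handle.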

Schedule for the implementation of scientific work