Query ===== A :doc:`Callable ` for querying MAGMA's database and retrieving returned data. Definition ---------- .. py:class:: controller.callable.extractor.Query(self, fields: list[str], conditions: list[list[QueryCondition]]) :callable name: ``query`` :callable type: :doc:`Extractor ` :param fields: A list of fields to return from the query. :type fields: list[str] :param conditions: Conditions to apply to the query. :type conditions: list[list[QueryCondition]] Parameters ---------- .. py:attribute:: fields A list of fields to return from the query. :type: list[str] :required: True :choices: (None, as a Query is constructed using the Data Extraction UI) - A field string **must** follow the format of ``tableName.columnName``. Both table names and column names are ``lower_snake_case``. .. py:attribute:: conditions Conditions to apply to the query. :type: list[list[QueryCondition]] :required: True :choices: (None, as a Query is constructed using the Data Extraction UI) - This parameter consists of two lists. - The first list indicates **OR** logic, while the second list indicates **AND** logic. In the second list, the actual condition objects are placed. - As a result, conditions must be described in the format of ``[ [ cond AND ... ] OR ... ]``. - Other formats, such as further condition nesting is not supported. - If no conditions are to be specified, pass an empty list ``[]``. As an example\: .. code-block:: javascript [ [ { "field": "course.code", "operator": "=", "value": "COMP1000" }, { "field": "course_offering.semester", "operator": "=", "value": "fall" } ], [ { "field": "course.code", "operator": "=", "value": "COMP2000" } ] ] The above condition would be interpreted as: .. code-block:: SELECT * FROM course_offering WHERE (course.code = 'COMP1000' AND course_offering.semester = 'fall') OR (course.code = 'COMP2000') Implementation details ---------------------- **Table joining** The Query Callable automatically finds an optimal solution for joining all tables required by the user. #. Query reads the ``fields`` parameter as the set of tables and fields to be retrieved from the database. #. From the list of tables and fields to be retrieved, Query automatically searches for an efficient path to retrieve all tables that are required to meaningfully join all provided fields into one single table. The algorithm works according to the following: #. Let the set of tables and fields to be retrieved from the database (obtained in Step 1) be :math:`T` and :math:`F` respectively. #. Query retrieves schema information from the database, which is essentially metadata containing information on how each table is connected to other tables via foreign keys. #. Query constructs a table-relationship graph :math:`G`. Each node represents a table in the database, while an undirected edge denotes that there is a foreign key that connects entries of the two tables. #. Query obtains a set :math:`S` of 2-combinations from :math:`T`. That is, :math:`S = \binom{T}{2}`. #. From :math:`G`, Query obtains a set of shortest paths :math:`P` between all pairs by applying the Floyd-Warshall algorithm. This step has a time complexity of :math:`O(n^3)`, but since the number of tables is small, the performance impact is low. #. Denote each path :math:`p \in P` of length :math:`m` as :math:`p(e_0, e_1, ..., e_m)`. #. From Step 2.5, Query obtains a subset of shortest paths :math:`P_T` where the source and destination tables are in :math:`T`. That is, :math:`P_T = \{ p \in P | p(e_0, e_m) \in T \}`. #. Query constructs a shortest-path graph :math:`G^*`, where each node represents a table in :math:`T`. Edges from all :math:`p \in P_T` are added to :math:`G^*`. #. Obtaining a minimum spanning tree (MST) :math:`G^{**}` from :math:`G^*` using Kruskal's algorithm. This step has a time complexity of :math:`O(E \log V)`. #. From :math:`G^{**}`, repeatedly remove all leaves :math:`\notin T` until no such leaf exists. #. Run breadth-first search (BFS) on :math:`G^{**}`. For each node :math:`n_1` reached from :math:`n_0`, add the pair :math:`(n_0, n_1)` into a list. This list contains **table pairs**. #. The resulting list is the sequence of table pairs in order to merge all the tables. This sequence is returned without modification on ordering as the joining strategy. #. According to the joining strategy provided by the algorithm, Query constructs a valid SQL query with the corresponding JOIN clauses. The template is as follows. .. code-block:: SELECT {schema}.{field}, ... FROM {schema}.{table}, ... FULL OUTER JOIN {schema}.{table} ON {schema}.{table}.{foreignKey} = {schema}.{table}.{foreignKey} WHERE ... #. The SQL query is sent as an API call to MAGMA core API's Hasura raw SQL endpoint, and a response is received. The data in the response is in the form of a JSON object, which is then parsed into a DataFrame. **Conditions** Also see the :doc:`QueryCondition ` documentation. // TODO Output ------ A DataFrame is returned with the following columns: ================================== ======== ``..`` etc. ================================== ======== value value value value etc. etc. ================================== ======== - Each column corresponds to one field specified in the ``fields`` parameter. - Each row contains an entry from the database such that its values match the specified ``conditions``. - The ordering of columns is the same as the ordering of the ``fields`` parameter. - The ordering of values is **not** guaranteed to follow any specific order. Example configuration --------------------- .. code-block:: javascript { "name": "foo", "callable": "query", "params": { "fields": [ "student.student_number", "course_offering_student.course_offering_id" ], "conditions": [ [ { "field": "student.student_number", "operator": "LIKE", "value": "20%" }, { "field": "course_offering_student.course_offering_id", "operator": "=", "value": 3 } ] ] } }