Query

A Callable for querying MAGMA’s database and retrieving returned data.

Definition

class controller.callable.extractor.Query(self, fields: list[str], conditions: list[list[QueryCondition]])
Callable name:

query

Callable type:

Extractor

Parameters:
  • fields (list[str]) – A list of fields to return from the query.

  • conditions (list[list[QueryCondition]]) – Conditions to apply to the query.

Parameters

fields

A list of fields to return from the query.

Type:

list[str]

Required:

True

Choices:

(None, as a Query is constructed using the Data Extraction UI)

  • A field string must follow the format of tableName.columnName. Both table names and column names are lower_snake_case.

conditions

Conditions to apply to the query.

Type:

list[list[QueryCondition]]

Required:

True

Choices:

(None, as a Query is constructed using the Data Extraction UI)

  • This parameter consists of two lists.

    • The first list indicates OR logic, while the second list indicates AND logic. In the second list, the actual condition objects are placed.

    • As a result, conditions must be described in the format of [ [ cond AND ... ] OR ... ].

    • Other formats, such as further condition nesting is not supported.

  • If no conditions are to be specified, pass an empty list [].

As an example:

[
    [
        {
            "field": "course.code",
            "operator": "=",
            "value": "COMP1000"
        },
        {
            "field": "course_offering.semester",
            "operator": "=",
            "value": "fall"
        }
    ],
    [
        {
            "field": "course.code",
            "operator": "=",
            "value": "COMP2000"
        }
    ]
]

The above condition would be interpreted as:

SELECT * FROM course_offering
WHERE (course.code = 'COMP1000' AND course_offering.semester = 'fall') OR (course.code = 'COMP2000')

Implementation details

Table joining

The Query Callable automatically finds an optimal solution for joining all tables required by the user.

  1. Query reads the fields parameter as the set of tables and fields to be retrieved from the database.

  2. From the list of tables and fields to be retrieved, Query automatically searches for an efficient path to retrieve all tables that are required to meaningfully join all provided fields into one single table. The algorithm works according to the following:

    1. Let the set of tables and fields to be retrieved from the database (obtained in Step 1) be \(T\) and \(F\) respectively.

    2. Query retrieves schema information from the database, which is essentially metadata containing information on how each table is connected to other tables via foreign keys.

    3. Query constructs a table-relationship graph \(G\). Each node represents a table in the database, while an undirected edge denotes that there is a foreign key that connects entries of the two tables.

    4. Query obtains a set \(S\) of 2-combinations from \(T\). That is, \(S = \binom{T}{2}\).

    5. From \(G\), Query obtains a set of shortest paths \(P\) between all pairs by applying the Floyd-Warshall algorithm. This step has a time complexity of \(O(n^3)\), but since the number of tables is small, the performance impact is low.

    6. Denote each path \(p \in P\) of length \(m\) as \(p(e_0, e_1, ..., e_m)\).

    7. From Step 2.5, Query obtains a subset of shortest paths \(P_T\) where the source and destination tables are in \(T\). That is, \(P_T = \{ p \in P | p(e_0, e_m) \in T \}\).

    8. Query constructs a shortest-path graph \(G^*\), where each node represents a table in \(T\). Edges from all \(p \in P_T\) are added to \(G^*\).

    9. Obtaining a minimum spanning tree (MST) \(G^{**}\) from \(G^*\) using Kruskal’s algorithm. This step has a time complexity of \(O(E \log V)\).

    10. From \(G^{**}\), repeatedly remove all leaves \(\notin T\) until no such leaf exists.

    11. Run breadth-first search (BFS) on \(G^{**}\). For each node \(n_1\) reached from \(n_0\), add the pair \((n_0, n_1)\) into a list. This list contains table pairs.

    12. The resulting list is the sequence of table pairs in order to merge all the tables. This sequence is returned without modification on ordering as the joining strategy.

  3. According to the joining strategy provided by the algorithm, Query constructs a valid SQL query with the corresponding JOIN clauses. The template is as follows.

    SELECT {schema}.{field}, ... FROM {schema}.{table}, ... FULL OUTER JOIN {schema}.{table} ON {schema}.{table}.{foreignKey} = {schema}.{table}.{foreignKey} WHERE ...
    
  4. The SQL query is sent as an API call to MAGMA core API’s Hasura raw SQL endpoint, and a response is received. The data in the response is in the form of a JSON object, which is then parsed into a DataFrame.

Conditions

Also see the QueryCondition documentation.

// TODO

Output

A DataFrame is returned with the following columns:

<callable_id>.<table>.<column>

etc.

value

value

value

value

etc.

etc.

  • Each column corresponds to one field specified in the fields parameter.

  • Each row contains an entry from the database such that its values match the specified conditions.

  • The ordering of columns is the same as the ordering of the fields parameter.

  • The ordering of values is not guaranteed to follow any specific order.

Example configuration

{
    "name": "foo",
    "callable": "query",
    "params": {
        "fields": [
            "student.student_number",
            "course_offering_student.course_offering_id"
        ],
        "conditions": [
            [
                {
                    "field": "student.student_number",
                    "operator": "LIKE",
                    "value": "20%"
                },
                {
                    "field": "course_offering_student.course_offering_id",
                    "operator": "=",
                    "value": 3
                }
            ]
        ]
    }
}