My goal is to take a series of modifications to relations in an SQL database, and check if I need to re-run other SQL queries because their results would change. Allowing people to listen to changes to an SQL query.
I found a similar problem is "how to incrementally refresh a materialized view".
Here is an unimplemented proposal for Postgres that uses relational algebra to optimize this process.
I am looking for something slightly different though. The above proposal calculates the delta for these queries, but I want to know whether I need to calculate the delta in the first place based on the actual query selections.
For a series of changes (update, insert, delete), taking into account the values of the tuple, how can I work out if the query results need to be updated at all.
In practice it would be:
function doesQueryNeedRefresh([changes], query): Boolean { return ... }
For example, if we look at different mutations to the list relation, here are the conditions that invalidate the query.
select name from list where list_id = 1
- update:
- if old
list_id = 1andlist_idchanges nameattr changes
- if old
- insert:
- if
list_id = 1
- if
- delete:
- if
list_id = 1
- if
But then it gets more complex with joins:
select l.name, i.name from list as l
inner join list_items as li on li.list_id = l.id
inner join items as i on i.id = li.item_id
where items.completed = true
For this query, if we look at a single mutation to each relation in isolation:
list- create: if
list_items.list_idcontainslist.id - update: if
namechanges orlist.id - delete: if
list_items.list_idcontainslist.id
- create: if
list_items- create: if
list_idis inlist, oritem_idis initems - update: if foreign keys change (because of no projection)
- delete: if
list_idis inlist, oritem_idis initems
- create: if
items- create: if
completed = trueandidinlist_items.item_id - update: if
idinlist_items.item_id - delete: if
idinlist_items.item_id
- create: if
We could do less checks by assuming that some values cannot change such as primary and foreign keys.
Also, instead of deleting rows, what if we marked things as deleted = true, maybe this would make it simpler.
Is there a way to formalize this in relational algebra? And also, does this help me efficiently perform these calculations?
I am also interested in the smallest subset of these relations that are necessary for these checks to run. Use case: I maintain a local database that syncs a portion of the data from the cloud database, and I want to run queries locally,