Considering sets as elements and elements as sets

89 Views Asked by At

In a data science project, I have $n$ sets, and $k$ elements. Many elements belong to multiple sets. As $k \approx n$, it is convenient to look at these data in both the "straightforward way" (studying all the elements belonging to some set), and what I would call the "dual way": studying all the sets that include some element. In some sense, it amounts to considering sets as elements and elements as sets. Both approaches are interesting in this project, and I would like to somehow formalize their use in the paper I'm writing.

I am sure this "dual" way to consider elements and sets has been studied somewhere, but I can't find any paper doing this. I guess I do not use the proper search terms. Would you have references?

1

There are 1 best solutions below

1
On BEST ANSWER

This problem is fairly open-ended. I'll list a couple of things you can do in no particular order.

Other folks have already mentioned graph theory, especially a bipartite graph, which seems like a very natural choice for representing your relation.

  • Matrices and Linear Algebra
  • Set theory as a notational tool

Since $n$ and $k$ are both finite, one easy thing you can do is organize them into sequences $\vec{d}$ for data and $\vec{s}$ for sets.

After organizing them into sequences, you can represent the elementhood relation as an incidence relation organized into an incidence matrix.

Let's call the matrix $M$. Every element of $M$ is 0 or 1.

$$ M_{ij} = 1 \;\;\text{if and only if}\;\; \vec{d}_i \; \text{is a member of}\; \vec{s}_j $$

Using the incidence relation $M$, you can construct further matrices that represent distances between elements of $\vec{s}$ or between elements of $\vec{d}$. And you can apply all the familiar tools of linear algebra, matrix multiplication, eigenvalues, characteristic polynomials, rank. The singular value decomposition is useful for constructing low-rank approximations to existing matrices, which might also be useful.


Set theory gives you some useful notation that might be helpful for defining your own concepts.

I would first define $D$ as your data/elements and $S$ as your sets. $D$ and $S$ are both sets.

You can choose to make $S$ a subset of $2^D$, the powerset of $D$, but you don't have to. You can make both $S$ and $D$ completely abstract. You can define $R(x, y)$ as a relation, using the following definition.

$$ R(x, y) \; \text{if and only if} \; \text{$x$ is in $\vec{d}$ and $y$ is in $\vec{s}$ and $x$ is in the data set $y$} $$

You can also just use $x \in y$ for that purpose if you want the elements of $S$ to be sets of elements of $D$ rather than abstract "arbitrary" entities.

You can now use set-builder notation, which can describe interesting things, especially when combined with the notation $|\cdots|$ which takes the cardinality (size) of a set.

For example, given two elements of $D$ named $x$ and $y$, here is the probability $P(x, y)$ that they will both be inside or both be outside an element of $S$ chosen uniformly at random.

$$ P(x, y) \stackrel{\text{def}}{=\!=} \frac{|\{ R(x, z) \leftrightarrow R(y, z) : z \in S \}| }{|S|} $$