Notation for collection of elements that may contain duplicates


I want to represent document D as a collection of words. I was inclined to do this with: D = {w_1, w_2, ... }. However, the curly brackets are often interpreted as sets that cannot have duplicates, while documents may contain duplicates.

(1) Are there other brackets or notations that I can use to represent such an ordered list?

(2) What alternatives can I use when the list is unordered?


There are 2 best solutions below

3
On BEST ANSWER

While a "multi-set" best answers the question in your title, people usually care about the order of the words in a document, in which case it's probably best to go with a sequence, and the notation for that is "$(w_i)$", or "$(w_i)_{i=1}^n$" if you want to fully decorate it.

To check if you want to go with multi-sets or sequences, ask yourself if you would consider "eat, cat, eat" and "eat, eat, cat" to be the same document or not. If the answer is "no", then go with sequences and the above notation. If the answer is "yes", then you're thinking of them as multi-sets, which sadly I don't believe have a standard notation.
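The test above can be sketched in Python, purely as an illustration: a tuple stands in for a sequence and `collections.Counter` stands in for a multiset. The variable names are made up for this example and are not part of any mathematical convention.

```python
from collections import Counter

doc_a = ("eat", "cat", "eat")
doc_b = ("eat", "eat", "cat")

# As sequences (tuples), order matters, so the two documents differ.
same_as_sequences = doc_a == doc_b

# As multisets (Counters), only the multiplicity of each word matters.
same_as_multisets = Counter(doc_a) == Counter(doc_b)

# As plain sets, the duplicate "eat" is lost entirely.
distinct_words = set(doc_a)

print(same_as_sequences)    # False
print(same_as_multisets)    # True
print(len(distinct_words))  # 2
```

If the first comparison matches your intuition about documents, a sequence is the right model; if the second does, a multiset is.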

1
On

The concept of an unordered list with possible duplicates is usually called a multiset. The notation conventions on this vary wildly. Some people use the same curly braces used for sets and make it clear that they are meant as multisets. Another option is to use square brackets. There is, unfortunately, not any universal convention for this kind of object.

Another possibility is to write multisets as linear combinations, so $\{a,b,b\}=[a,b,b]=a+2b$ for instance. This can be confusing, though, if the elements are things that can themselves be added.
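A minimal Python sketch of the linear-combination view, assuming `collections.Counter` as the multiset model: each element's multiplicity plays the role of its coefficient. The helper `as_linear_combination` is a hypothetical name introduced only for this illustration.

```python
from collections import Counter

def as_linear_combination(multiset):
    """Render a Counter as a formal sum, e.g. 'a + 2b' (hypothetical helper)."""
    terms = []
    for element in sorted(multiset):
        count = multiset[element]
        # Omit the coefficient when the element occurs exactly once.
        terms.append(element if count == 1 else f"{count}{element}")
    return " + ".join(terms)

m = Counter(["a", "b", "b"])  # the multiset {a, b, b}
print(as_linear_combination(m))  # a + 2b

# Adding two multisets adds multiplicities, just as adding
# two linear combinations adds coefficients.
total = m + Counter(["a"])
print(as_linear_combination(total))  # 2a + 2b
```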