By Aleksander Demko (November 2017).
This document tries to look at the layers of data structures in every day applications. This is more of an overview to try to understand the interplay of data types in user applications. Such concepts are useful to keep in mind when designing new applications and their interfaces.
This is not a comprehensive computer science focused analysis of data structures and their algorithms – see your text book for that. Nor is this a implementation guide or an exact analysis of performance (e.g. O() notation). Finally, I’m deliberately loose with my definitions – this would not qualify as a mathematical reference by any means.
Core data types are types that a programmer would typically work with.
At the lowest level is a lone bit, a 0 or 1. From this, all types and values are built. The bit could, if we wanted to focus on implementations stay on its own layer, the foundation on which all others are built. I’ll instead group it with all other scalars. This family includes:
Although they can be implemented as integers, this list does not include bit fields. If they’re simple enough, they can be treated as an enumeration but often they’d be considered sets (see maps below).
The first compound data structure we’ll introduced is the list. A list contains 0 or more elements stored in some order. The elements don’t have to be the same type, and can also be other data structures themselves, even lists themselves (i.e. be recursive).
All other structures can be built from lists, possibly inefficiently (See the Lisp programming language). In math, lists these are called tuples. Lists that you can only push and pop elements to one end are stacks. Lists the you can only push and pop elements from opposite ends are queues.
An array is a type of list that is more restrictive in that all elements must be the same type. Arrays are usually tuned for random access by index, and less so for element appending and discarding.
Strings of characters should be modeled as arrays of characters.
A map is list of keys to values pairings, where access by key is fast. Duplicate keys may or may not be allowed. A set is a sub type in which there only keys and no values. The structure may or may not have an ordering to it, but this is often an implementation detail.
A programming language "structure" is a class or group of maps in which all instances must share the key definitions.
A graph is a general structure that simply contains vertices (or nodes) and edges. Nodes can contain data, and edges connect nodes. If all the connection edges have must have a direction, then the graph is "directed graph" or digraph. If they all lack a direction, then it is a undirected graph. A graph without loops or cycles is acyclic while one that has them is cyclic. I would also classify data types that can have loops as cyclic, even if the current instance doesn't have them.
Most data structures can be modeled as graphs. However, this generalization is not often helpful as the implementation will be far from optimal. For example:
Graphs can be implemented as lists, or with adjacency matrices.
User data types are the larger data types that a user would work with in an application.
1 dimensional arrays are probably the most common data types in applications are they are basic lists.
2D arrays, or matrices are also popular especially in math, geometry and graphics.
Arrays can be generalized into higher order dimensions. 3D arrays are common, but 4D and 5D are not.
A node in a tree can contain some data and links to (depending on their configuration) 0-2, 0-n or 0-any number of children. Trees can be implemented as recursive lists, or as restricted graphs.
Mathematical formula or computer programming language "expressions" naturally map themselves to trees, with each node being an operation or scalar value.
In computer science, trees have a root node which gives a direction or parent-child relationship to the rest of the structure. Mathematical trees usually do not do this – there are no parents or children.
A table is a list (or array) of (structured) tuples or rows. However, an ordering is not often required unless some of the fields in the tuples are "indexed". When indexed, a table can be treated as a map like structure with the quality that there could be multiple maps pointing at the same row values.
This section lists various types of applications and how they use the user data types in the real world.
The files in a computer, it's file system, is a tree, as is some mail software’s mail folder structure.
Spreadsheets are 2D arrays of formulas (abstract syntax trees). The "kill app" feature of spreadsheets is the ability of formulas in one matrix cell to use the output of other cells. The tabs could be considered a third dimension. For this discussion, I’m ignoring the formatting properties.
Mind maps are trees and those with more features and linkages grow to become full graphs.
To-do applications are basic lists, and could be considered a sub set of mind maps, especially if they allow sub lists.
Calendars (and perhaps task lists) are examples of time series-based applications, which uses a table of data with a date-time primary key. More specialized examples include event logging applications
Calculators and interactive program interpreters are abstracted syntax tree builders.
Sounds are 1D arrays of sound amplitude samples over time. Images are 2D arrays of pixel elements. Videos could be 3D arrays of pixels, with the 3D dimension over time. However, a better description would be to consider video a list of Images, as the time dimension is fundamentally different that the pixel dimension and is also variable and often longer.
A double-entry accounting system uses a tablet of transactions (or journal entries) at its core. These rows usually contain time stamp and monetary adjustments to 2 or more accounts. Each transaction must sum to 0 (i.e. balance). Each account has a virtual register to show their balance at any point in time. This register might be stored on disk for performance or can be recomputed from the journal. The accounts themselves might form a tree-like hierarchy, grouping the accounts for reporting purposes.
Relational databases are a map of tables which can be queried using a specialized query language (abstract syntax tree builder) for data extraction. Often, in many database applications tree-like data structures are usually mapped into these relations and are captured as 1-to-many foreign key relations. Databases are general tools can be used to model many types of data structures and applications.
The Git source code revision control system uses an append-only tree like data structure to track a history of changes (or commits) of software developer changes to a source code system. Within each change is a tree of files that represent the source code at that time. Append-only data structures (where the nodes, once created, are immutable) have an interesting property that enables data de-duplication as well as easy to synchronize among multiple users.
Word processor documents and web pages can be thought of as trees of page structures, paragraphs, formatting elements and finally characters. However layout and rendering greatly complicates how they’re processed.
Project management software uses directed acyclic graphs with durations to track task dependencies and project critical paths.
Mapping and navigation software uses large graphs to model the road-connected world. They also use graphs to model all other elements (buildings, borders, areas, etc).