Data Structures in R
Data structures are the objects R uses to store and organize data. Everything that exists in R, including variables and functions, is an object. The simplest data structure is the atomic vector, an ordered collection of elements that all share the same data type (mode). R defines six atomic vector types: double (numeric), integer, character, logical, complex, and raw. When you mix different data types in an atomic vector, R coerces the elements to a common type.
R builds on this foundation by adding attributes to atomic vectors to create more complex data structures. A matrix is an atomic vector with a dim attribute that defines its rows and columns. An array extends this idea to any number of dimensions. Lists, by contrast, can group any mix of R objects, such as vectors, matrices, or other lists, which makes them suitable for heterogeneous data.
Data frames are the most widely used data structure for data analysis. Each component is a vector of the same length, which yields a two-dimensional, spreadsheet-like table. This layout works well because different columns can hold different data types, although all values within one column must share a type. Finally, factors store categorical variables as integer vectors whose character labels are kept in the “levels” attribute.
Atomic Vectors: The Basic Building Blocks
The most fundamental and most frequently used data structure in R is the atomic vector, an ordered collection of items of the same mode. R treats even a single number or character string as a vector of length one. R recognizes six atomic vector types, although some are far more common in data analysis than others.
Numeric (or Double) Vectors: These vectors store real numbers, with or without a decimal part. By default R stores every number you type as a double, that is, a double-precision floating-point number.
Integer Vectors: These vectors store whole numbers. To create an integer, follow the number with an uppercase L (e.g., 27L). Integers are useful when memory use or exact whole-number arithmetic is essential, although most data science tasks can rely on the default numeric (double) type.
Character Vectors: These vectors store text data. Each element is a character string, which may contain letters, numerals, or symbols.
Logical Vectors: These vectors hold the values TRUE and FALSE (plus NA for missing values). Comparison operations (e.g., x > 5) produce logical vectors, which are powerful tools for data manipulation and filtering.
Complex and Raw Vectors: These are rare in everyday data analysis. Raw vectors store bytes, while complex vectors store complex numbers with an imaginary part, such as 2+3i.
An atomic vector can contain only one type of data. When you use the c() function to mix types in a vector, R coerces every element to the most flexible type present; for the common types the order is logical, then integer, then double (numeric), then character. This preserves the vector’s single-mode integrity.
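The vector types and the coercion rule described above can be checked directly at the console. The following is a minimal sketch; the variable names (dbl, int, mixed_num, and so on) are made up for illustration.

```r
# One vector of each common atomic type
dbl <- c(1.5, 2, 3.25)   # double, the default for numbers
int <- c(1L, 27L)        # integer, marked by the L suffix
chr <- c("a", "b")       # character
lgl <- c(TRUE, FALSE)    # logical

typeof(dbl)   # "double"
typeof(int)   # "integer"
typeof(2)     # "double": a bare number is a double vector
length(2)     # 1: a single value is a vector of length one

# Comparisons return logical vectors, which can be used to filter
x <- c(3, 8, 5, 10)
x > 5         # FALSE TRUE FALSE TRUE
x[x > 5]      # 8 10

# Mixing types in c() coerces to the most flexible type present
mixed_num <- c(TRUE, 2L, 3.5)    # logical and integer become double
mixed_chr <- c(TRUE, 1.5, "a")   # everything becomes character
typeof(mixed_num)   # "double"
typeof(mixed_chr)   # "character"
```

Note that coercion is silent: R gives no warning when c() converts numbers to strings, so typeof() is a useful check when a vector behaves unexpectedly.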
Matrices and Arrays: Multidimensional Atomic Vectors
Matrices and arrays extend vectors to two or more dimensions. They are atomic vectors with a dim attribute. As with vectors, all elements of a matrix or array must be of the same mode.
Matrices: A matrix is two-dimensional, with rows and columns. To create one, pass the data and the number of rows and columns to matrix(). R fills matrices column-wise by default; setting byrow = TRUE fills them row-wise instead. You can also create matrices by combining vectors with rbind() or cbind().
Arrays: An array generalizes the matrix to any number of dimensions. You create one with array(), passing a dim vector that sets the size of each dimension. A three-dimensional array can be pictured as layers of equal-sized matrices.
Matrices and arrays support many vectorized operations, which is one of R’s strengths.
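As a short illustration of the functions mentioned above, the sketch below builds the same data column-wise and row-wise, binds vectors into matrices, and creates a three-dimensional array. The object names are arbitrary.

```r
# Same data, two fill orders
m_col <- matrix(1:6, nrow = 2, ncol = 3)                 # column-wise (default)
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # row-wise
m_col[1, ]   # 1 3 5
m_row[1, ]   # 1 2 3

# Building matrices from vectors
bound   <- cbind(a = 1:3, b = 4:6)   # vectors become columns: 3 x 2
stacked <- rbind(1:3, 4:6)           # vectors become rows: 2 x 3
dim(bound)   # 3 2

# A 2 x 3 x 4 array: four layers of 2 x 3 matrices
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)     # 2 3 4
arr[, , 1]   # the first layer, a 2 x 3 matrix

# Vectorized arithmetic works element by element
doubled <- m_col * 2
summed  <- m_col + m_row
```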
Lists: Flexible, Heterogeneous Containers
Lists are flexible, essential data structures in R for storing heterogeneous collections of objects. Unlike atomic vectors, whose elements must all have the same data type, lists can group any mix of R objects, such as numeric vectors, character strings, matrices, functions, or other lists. Lists are well suited to holding related but dissimilar data, such as a playing card’s face, suit, and point value (list("ace", "hearts", 1)).
The list() function creates a list, optionally with a name (tag) for each component. Single square brackets [ ] perform “list slicing” and return a smaller list, while double square brackets [[ ]] or the dollar sign $ (for named lists) extract a single component in its original data type. This distinction is important because many R functions cannot work on a list directly and need the component extracted first. You can add a new component to a list by assigning to a new name, and delete an existing one by assigning NULL. Lists are sometimes called “recursive vectors” because they can contain other lists, and they underpin more complex R objects such as data frames.
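The difference between slicing and extracting is easiest to see in a small example. This sketch reuses the playing-card idea from the text; the component names face, suit, value, and color are illustrative choices.

```r
# A named list mixing character and numeric components
card <- list(face = "ace", suit = "hearts", value = 1)

# Single brackets slice: the result is still a list
sub <- card["face"]
class(sub)        # "list"

# Double brackets and $ extract the component itself
card[["face"]]    # "ace"
card$value        # 1
card$value + 10   # 11; card["value"] + 10 would be an error

# Add a component by assigning to a new name, remove one with NULL
card$color <- "red"
card$suit  <- NULL
names(card)       # "face" "value" "color"
```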
Data Frames: The Standard for Tabular Data
For data analysis in R, the data frame is the most common storage structure, similar to an Excel spreadsheet. It is a special kind of list that groups vectors of the same length into a two-dimensional table with rows and columns. A data frame’s main benefit is its capacity to store heterogeneous data: different columns can contain numeric, character, or logical data, but all items within a column must have the same type. This structure makes data frames well suited to tabular data.
You can build a data frame by hand with data.frame(), passing named vectors that become the table’s columns. Because entering large data sets manually is inefficient and error-prone, data frames are usually created by loading data from an external file. Data from plain-text files is often loaded into a data frame using read.table(), read.csv(), or the “Import Dataset” option in RStudio.
Data frames are lists with the class attribute “data.frame”. The str() function reveals this internal structure as a list of variables. In versions of R before 4.0.0, data.frame() and read.table() converted character strings into factors by default, and setting stringsAsFactors = FALSE stopped this; since R 4.0.0, FALSE is the default. Data frames can be accessed like lists, using the $ operator to pick columns by name, or like matrices, using [row, column] notation to subset values.
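The sketch below creates a small data frame by hand and accesses it both as a list and as a matrix. The deck and from_text objects and their contents are invented for illustration; read.csv() is given inline text here only so the example runs without a file on disk.

```r
# Build a data frame from equal-length vectors
deck <- data.frame(
  face  = c("ace", "king", "queen"),
  suit  = c("hearts", "spades", "clubs"),
  value = c(1, 13, 12),
  stringsAsFactors = FALSE   # explicit, so the result is the same on older R
)

str(deck)        # prints one line per column, showing its type
is.list(deck)    # TRUE: a data frame is a list underneath
class(deck)      # "data.frame"
dim(deck)        # 3 3

# List-style and matrix-style access
deck$value                   # 1 13 12
deck[2, "face"]              # "king"
high <- deck[deck$value > 10, ]   # rows where value exceeds 10

# Reading tabular text; normally the first argument is a file path
from_text <- read.csv(text = "face,value\nace,1\nking,13",
                      stringsAsFactors = FALSE)
```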
Factors: Handling Categorical Data
Factors are used in R to store and handle categorical data such as gender and eye color. Although factors resemble character vectors, they are internally distinct: a factor is an integer vector in which each integer represents a category. This integer vector has two crucial attributes: a class attribute of “factor” and a levels attribute, a character vector holding each category’s unique label. R uses the levels attribute to display character labels for a factor even though the underlying storage is an integer vector, which can be confusing.
A factor can be created by passing an atomic vector to factor(). By default R takes the levels from the vector’s unique values in sorted (alphabetical) order, but you can specify the levels and their order yourself with the levels argument. In a gender factor, for example, the codes 1 and 2 may be displayed as “female” and “male”. This internal integer structure can be seen with unclass(). Remember that in versions of R before 4.0.0, several functions, such as data.frame() and read.table(), converted character strings to factors by default; setting stringsAsFactors = FALSE suppressed this behavior, and it has been the default since R 4.0.0. If needed, use as.character() to convert a factor to a character vector. Factors are important for statistical modeling and graphing because they give categorical variables a numeric coding.
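A brief sketch of these points follows. The gender example comes from the text; the sizes factor is a hypothetical addition that shows how to set the level order explicitly.

```r
# Levels are inferred from the unique values, in sorted order
gender <- factor(c("male", "female", "female", "male"))
levels(gender)       # "female" "male"
typeof(gender)       # "integer"
unclass(gender)      # the integer codes, with the levels attribute attached
as.integer(gender)   # 2 1 1 2

# Supplying the levels argument controls the order of the categories
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"))
levels(sizes)        # "small" "medium" "large"
as.integer(sizes)    # 1 3 2

# Convert back to plain text when the labels themselves are needed
as.character(gender) # "male" "female" "female" "male"
```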
The Role of Attributes and Classes
In R, attributes and classes turn simple data structures into more complex objects. Any R object, such as an atomic vector, can have attributes. An attribute is metadata attached to the object: it does not change the object’s values, but R functions can use it to handle the object differently. For example, the names attribute labels a vector’s elements, and the dim (dimensions) attribute turns a vector into a matrix or multi-dimensional array without changing its data type. Matrices are simply vectors with a dim attribute.
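To see attributes at work, the following sketch attaches names and then dim to plain integer vectors. The variables v and m are arbitrary.

```r
v <- 1:6
attributes(v)    # NULL: a plain vector has no attributes

# names labels the elements without changing the values
names(v) <- c("a", "b", "c", "d", "e", "f")
v["c"]           # the element labeled "c", which is 3

# dim turns a vector into a matrix; the data type stays the same
m <- 1:6
dim(m) <- c(2, 3)
is.matrix(m)     # TRUE
typeof(m)        # "integer"
attributes(m)    # a list containing only dim
```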
R’s S3 object-oriented system relies on class, arguably the most powerful attribute. An object’s class tells generic functions like print(), summary(), and plot() how to behave when they encounter it. A generic function checks an object’s class attribute and calls the method written for that class. This is called method dispatch. For example, a numeric vector with the class POSIXct is still a vector of numbers (seconds since an epoch), but the class attribute tells print() to present it as a date and time.
Likewise, a factor is an integer vector with a class attribute of “factor” and a levels attribute holding character labels; functions interpret it as a categorical variable. R uses attributes, especially the class attribute, to build richer data structures such as matrices, dates, and factors from core atomic vectors. The result is a consistent yet flexible system in which functions adapt automatically to different kinds of data.
In conclusion, R offers a range of data structures suited to statistical work. From atomic vectors to spreadsheet-like data frames, they are the means by which R stores, organizes, and manipulates data for analysis.
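Method dispatch can be sketched in a few lines. The POSIXct part mirrors the example in the text; the “celsius” class and its print method are hypothetical, invented here only to show how a generic finds a class-specific method.

```r
# A date-time is a number plus a class attribute
t0 <- as.POSIXct(0, origin = "1970-01-01", tz = "UTC")
class(t0)        # "POSIXct" "POSIXt"
as.numeric(t0)   # 0: seconds since the epoch
print(t0)        # shown as a date and time, not as 0

# Hypothetical class "celsius": print() dispatches to print.celsius()
temp <- structure(c(20.5, 23.1), class = "celsius")

print.celsius <- function(x, ...) {
  # One line per value, with a unit label appended
  cat(paste0(format(unclass(x)), " deg C\n"), sep = "")
  invisible(x)
}

print(temp)      # calls print.celsius() because of the class attribute
unclass(temp)    # the underlying numeric vector: 20.5 23.1
```

Naming the function print.celsius is what connects it to the generic: S3 looks for a function called generic.class, and falls back to the default method if none exists.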