IOLink  IOL_v1.6.1_release
DataFrames

What is a DataFrame?

A dataframe is a model for tabular data, organised in columns and rows. More specifically, a dataframe is column-oriented: data is stored column by column, and a row is just the aggregate of the values found at the same index in every column. Because of this construction, the data type is homogeneous within a column, but can be heterogeneous across a row.
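
To make the column-oriented layout concrete, here is a minimal sketch in standard C++. It is not IOLink code: `ColumnTable`, `Column`, `Cell` and `makeExampleTable` are hypothetical names introduced only for illustration, and the sketch supports just two element types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Hypothetical illustration types, not part of IOLink.
// Each column owns one vector holding a single element type.
using Column = std::variant<std::vector<std::int32_t>, std::vector<std::string>>;
using Cell = std::variant<std::int32_t, std::string>;

struct ColumnTable
{
    std::vector<std::string> names;
    std::vector<Column> columns;

    // Number of rows, taken from the first column (all columns share it).
    std::size_t rowCount() const
    {
        if (columns.empty())
            return 0;
        return std::visit([](const auto& values) { return values.size(); }, columns.front());
    }

    // A row is not stored anywhere: it is assembled on demand by picking
    // the element at the same index in every column.
    std::vector<Cell> row(std::size_t index) const
    {
        assert(index < rowCount());
        std::vector<Cell> cells;
        for (const Column& column : columns)
        {
            cells.push_back(
                std::visit([index](const auto& values) -> Cell { return values[index]; }, column));
        }
        return cells;
    }
};

// Builds a two-column, three-row table: one integer column, one string column.
inline ColumnTable makeExampleTable()
{
    ColumnTable table;
    table.names = {"order", "name"};
    table.columns.emplace_back(std::vector<std::int32_t>{1, 2, 3});
    table.columns.emplace_back(std::vector<std::string>{"monday", "tuesday", "wednesday"});
    return table;
}
```

Every element of the "order" column is an integer and every element of the "name" column is a string, while `row(1)` mixes both types. This is the homogeneous-column, heterogeneous-row property described above.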

DataFrameView API

The dataframe API defines a view, DataFrameView, that exposes a set of capabilities and methods. DataFrameView has three capabilities: READ for reading columns' data, WRITE for editing it, and RESHAPE for adding and removing columns or rows.
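
One common way to model such a set of capabilities is a bit mask that can be queried before an operation is attempted. The sketch below only illustrates that idea: `Capability` and `CapabilitySet` are hypothetical types written for this example, and the way IOLink actually exposes capabilities may differ.

```cpp
#include <cstdint>

// Hypothetical flag type for illustration; IOLink's own capability type may differ.
enum class Capability : std::uint8_t
{
    READ = 1u << 0,
    WRITE = 1u << 1,
    RESHAPE = 1u << 2
};

constexpr std::uint8_t toMask(Capability c)
{
    return static_cast<std::uint8_t>(c);
}

struct CapabilitySet
{
    std::uint8_t mask = 0;

    // Returns a copy of this set with one more capability enabled.
    constexpr CapabilitySet with(Capability c) const
    {
        return CapabilitySet{static_cast<std::uint8_t>(mask | toMask(c))};
    }

    // True when the given capability is part of the set.
    constexpr bool supports(Capability c) const
    {
        return (mask & toMask(c)) != 0;
    }
};
```

With this model, a read-only view would carry only READ, and code that needs to add rows would first check `supports(Capability::RESHAPE)`.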

Structure methods are also available on all instances. They give information about the shape of the dataset and about its columns. Columns are usually accessed by index in this interface, with indices ranging from 0 to the shape's width minus one. More information about each column can be retrieved using the following methods:

Vector2u64 shape = frame->shape();
// shape[0] is the width, i.e. the number of columns.
// Here we display the name and data type of each column
for (size_t i = 0; i < shape[0]; ++i)
{
    std::string name = frame->columnName(i);
    DataType dtype = frame->columnDataType(i);
    std::cout << name << ", " << dtype.toString() << std::endl;
}

Reading and writing data

The reading interface of DataFrameView is quite simple and low level. You can only read data from one column at a time, and you must pass the reading method a buffer sized for the number of elements you want to read.

In this example, we read the first five rows of a dataframe that has one column of integer type and another storing strings.

std::vector<int> buffer0(5);
std::vector<std::string> buffer1(5);
// Read first column data
frame->read(0, 0, 5, buffer0.data());
// Read second column data
frame->read(1, 0, 5, buffer1.data());
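
Because the buffer size and the requested element count must agree, it can help to wrap the low-level call in a small helper that derives one from the other. The following is a sketch under assumptions: `MockIntFrame` is a hypothetical stand-in that only mimics the four-argument `read` call used above for integer columns, and `readRange` is a helper name introduced here, not an IOLink function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a frame exposing a low-level read; not an IOLink class.
// It holds integer columns only, to keep the raw copy well defined.
struct MockIntFrame
{
    std::vector<std::vector<int>> columns;

    // Copies `count` elements of `column`, starting at row `offset`, into `dst`.
    // The caller must provide room for at least `count` elements.
    void read(std::size_t column, std::size_t offset, std::size_t count, void* dst) const
    {
        const std::vector<int>& source = columns.at(column);
        assert(offset <= source.size() && count <= source.size() - offset);
        if (count == 0)
            return;
        std::memcpy(dst, source.data() + offset, count * sizeof(int));
    }
};

// Sizes the buffer from the requested count so the two cannot drift apart.
template <typename T, typename Frame>
std::vector<T> readRange(const Frame& frame, std::size_t column, std::size_t offset, std::size_t count)
{
    std::vector<T> buffer(count);
    frame.read(column, offset, count, buffer.data());
    return buffer;
}
```

A call such as `readRange<int>(frame, 0, 0, 5)` then returns the first five values of the first column without the caller having to size a buffer by hand.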

The interface to write data is quite similar, with the difference that you must fill the given buffer with the data to write. Similarly to the previous example, we can write the top five rows of our dataframe like this:

std::vector<int> buffer0{1, 2, 3, 4, 5};
std::vector<std::string> buffer1{"monday", "tuesday", "wednesday", "thursday", "friday"};
// Write first column data
frame->write(0, 0, 5, buffer0.data());
// Write second column data
frame->write(1, 0, 5, buffer1.data());

To get a single element of a column, more user-friendly methods are available:

// get the second element of the first column
int value1 = frame->at<int>(0, 1);
// get the third element of the second column
std::string value2 = frame->at<std::string>(1, 2);
// get the second element of the "order" column
int value3 = frame->at<int>("order", 1);
// get the third element of the "name" column
std::string value4 = frame->at<std::string>("name", 2);

Changing the dataframe shape

The RESHAPE capability offers methods to change the shape of a DataFrameView instance by adding and removing columns and rows. Because of the column-oriented structure, columns can only be added or removed one at a time, but rows can be processed in contiguous chunks.

Column operations example:

// Append a new column, its index being the previous width of the shape
frame->addColumn("age", DataTypeId::UINT32);
// Remove the first column
frame->removeColumn(0);

Row operations example:

// Add two rows after the first one
frame->addRows(1, 2);
// Remove the first two rows
frame->removeRows(0, 2);
// Query the updated shape; shape[1] is the number of rows
Vector2u64 shape = frame->shape();
// Add five rows at the end
frame->addRows(shape[1], 5);
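
To show why rows can be handled in contiguous chunks on a column-oriented store, here is a minimal sketch in standard C++. `IntColumns` is a hypothetical integer-only container written for this example. Its `addRows` and `removeRows` take an index and a count like the calls above, but the zero-filling of new rows is an assumption of the sketch, not documented IOLink behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical integer-only column store used for illustration; not an IOLink class.
struct IntColumns
{
    std::vector<std::vector<int>> columns;

    // Inserts `count` rows starting at row `index`.
    // Each column receives one contiguous block of zero-initialized values.
    void addRows(std::size_t index, std::size_t count)
    {
        for (std::vector<int>& column : columns)
        {
            assert(index <= column.size());
            column.insert(column.begin() + static_cast<std::ptrdiff_t>(index), count, 0);
        }
    }

    // Removes `count` rows starting at row `index`.
    // Each column loses one contiguous block of values.
    void removeRows(std::size_t index, std::size_t count)
    {
        for (std::vector<int>& column : columns)
        {
            assert(index <= column.size() && count <= column.size() - index);
            auto first = column.begin() + static_cast<std::ptrdiff_t>(index);
            column.erase(first, first + static_cast<std::ptrdiff_t>(count));
        }
    }
};
```

A chunk of rows costs one insertion or erasure per column, whatever the chunk length, which is why row operations are naturally expressed as an index plus a count.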

Creating an in-memory dataframe

The factory DataFrameViewFactory has a method allocate that can be used to create a DataFrameView with its data stored in memory:

std::vector<std::string> columnNames{"age", "name"};
std::vector<DataType> columnDataTypes{DataTypeId::UINT16, DataTypeId::UTF8_STRING};
// Shape is {width, height}: two columns and five rows
std::shared_ptr<DataFrameView> dframe =
    DataFrameViewFactory::allocate(Vector2u64{2, 5}, columnNames, columnDataTypes);