How to leverage the Hadoop Distributed File System using the SAS Scalable Performance Data Engine


As part of the SAS Data Management for Hadoop article series, I’d like to explore the ins and outs of the SAS Scalable Performance Data (SPD) Engine in this post. The SPD Engine is delivered to SAS customers as part of Base SAS. It’s designed to read data very rapidly and in parallel.

The third maintenance release for SAS 9.4 expands the list of Hadoop distributions that the SPD Engine supports, with or without Kerberos:

Cloudera CDH 4.x
Cloudera CDH 5.x
Hortonworks HDP 2.x
IBM InfoSphere BigInsights 3.x
MapR 4.x (for Microsoft Windows and Linux operating environments only)
Pivotal HD 2.x

The SPD Engine organizes data into a file format that has advantages for a distributed file system like the Hadoop Distributed File System (HDFS). Advantages of the SPD Engine file format include the following:

Data is separate from the metadata. The file format consists of separate file types: one for data, one for metadata, and two for indexes. Each type of file has an identifying file extension. The extensions are .dpf for data, .mdf for metadata, and .hbx and .idx for indexes.


The SPD Engine file format partitions the data by spreading it across multiple files based on a partition size. Each partition is stored as a separate physical file with the extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but is referenced as one logical file.

The default partition size is 128 megabytes. You can specify a different partition size with the PARTSIZE= LIBNAME statement option or the PARTSIZE= data set option.
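To make the partitioning arithmetic concrete, here is a minimal sketch in Python (not SAS code) that estimates how many .dpf files a table might occupy for a given partition size. The function name estimate_dpf_files is hypothetical, and the sketch assumes that a megabyte is 2**20 bytes and that partitions fill completely before a new one starts.

```python
MIB = 1024 * 1024          # assumption: one "megabyte" is 2**20 bytes
DEFAULT_PARTSIZE_MB = 128  # default partition size stated in this post


def estimate_dpf_files(data_bytes, partsize_mb=DEFAULT_PARTSIZE_MB):
    """Rough count of .dpf partition files for a table of data_bytes bytes.

    Hypothetical helper for illustration only; it is not part of SAS.
    """
    if data_bytes <= 0:
        raise ValueError("data_bytes must be positive")
    if partsize_mb <= 0:
        raise ValueError("partsize_mb must be positive")
    partsize_bytes = partsize_mb * MIB
    # Integer ceiling division: a partially filled last partition
    # still needs its own physical file.
    return -(-data_bytes // partsize_bytes)


if __name__ == "__main__":
    ten_gib = 10 * 1024 * MIB
    print(estimate_dpf_files(ten_gib))                   # 80 files at the 128 MB default
    print(estimate_dpf_files(ten_gib, partsize_mb=256))  # 40 files at 256 MB
    print(estimate_dpf_files(50 * MIB))                  # 1 file: smaller than one partition
```

Treat the result as an estimate only. The engine may adjust the effective partition size, so the actual number of files on HDFS can differ.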

The SPD Engine reads, writes, and updates data in HDFS. You can use the SPD Engine with standard SAS applications to retrieve data for analysis, perform administrative functions, and update the data. Note that SAS/CONNECT and SAS/SHARE are not supported by the SPD Engine.

Like SAS data sets, SPD Engine tables support analytical base tables containing hundreds of thousands of columns. These analytical base tables become source tables for predictive analytics routines.


Stay tuned: in my next post, we will explore how to create SPD Engine tables on HDFS.

Follow the Data Management section of the SAS Communities Library (Click Subscribe in the pink-shaded bar of the section) for more articles on how SAS Data Management works with Hadoop. Here are links to other posts in the series for reference:

How to leverage the Hadoop Distributed File System using the SAS Scalable Performance Data Engine
