
How to leverage the Hadoop Distributed File System using the SAS Scalable Performance Data Engine

Started 11-13-2015
Modified 01-19-2016

As part of the SAS Data Management for Hadoop article series, I’d like to explore the ins and outs of the SAS Scalable Performance Data (SPD) Engine in this post. The SPD Engine is delivered to SAS customers as part of Base SAS. It’s designed to read data rapidly and in parallel.

As of the third maintenance release for SAS 9.4, the SPD Engine supports the following Hadoop distributions, with or without Kerberos:

  • Cloudera CDH 4.x
  • Cloudera CDH 5.x
  • Hortonworks HDP 2.x
  • IBM InfoSphere BigInsights 3.x
  • MapR 4.x (for Microsoft Windows and Linux operating environments only)
  • Pivotal HD 2.x

The SPD Engine organizes data into a file format that has advantages for a distributed file system like the Hadoop Distributed File System (HDFS). Advantages of the SPD Engine file format include the following:

  • Data is separate from the metadata. The file format consists of separate files: one for data, one for metadata, and two for indexes. Each type of file has an identifying file extension. The extensions are .dpf for data, .mdf for metadata, and .hbx and .idx for indexes. 


  • The SPD Engine file format partitions the data by spreading it across multiple files based on a partition size. Each partition is stored as a separate physical file with the extension .dpf. Depending on the amount of data and the partition size, the data can consist of one or more physical files, but is referenced as one logical file.

 

The default partition size is 128 megabytes. You can specify a different partition size with the PARTSIZE= LIBNAME statement option or the PARTSIZE= data set option.
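
To make the partition arithmetic concrete, here is a minimal Python sketch (not SAS code, and not part of the SPD Engine) that estimates how many .dpf partition files a table of a given size would occupy. The function name `estimate_partition_files` and the sample sizes are hypothetical. The sketch assumes that one megabyte is 1,048,576 bytes and that every partition except the last is filled completely; the engine's actual file layout may differ.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes


def estimate_partition_files(data_bytes: int, partsize_mb: int = 128) -> int:
    """Estimate the number of .dpf partition files for a table.

    Hypothetical helper for illustration only. Assumes every partition
    except the last is filled to exactly partsize_mb megabytes.
    """
    if data_bytes <= 0:
        raise ValueError("data_bytes must be positive")
    if partsize_mb <= 0:
        raise ValueError("partsize_mb must be positive")
    partsize_bytes = partsize_mb * BYTES_PER_MB
    # Integer ceiling division: a partially filled last partition
    # still counts as one physical file.
    return (data_bytes + partsize_bytes - 1) // partsize_bytes


if __name__ == "__main__":
    one_gb = 1024 * BYTES_PER_MB
    print(estimate_partition_files(one_gb))       # 8 files at the default 128 MB
    print(estimate_partition_files(one_gb, 256))  # 4 files at 256 MB
```

Under these assumptions, a 1 GB table occupies eight partition files at the default size and four at 256 MB, while still being referenced as one logical file.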

 

The SPD Engine reads, writes, and updates data in HDFS. You can use the SPD Engine with standard SAS applications to retrieve data for analysis, perform administrative functions, and update the data. Note that SAS/CONNECT and SAS/SHARE are not supported by the SPD Engine.

Like SAS data sets, SPD Engine tables can serve as analytical base tables containing hundreds of thousands of columns. These analytical base tables then become source tables for predictive analytics routines.


Stay tuned: in my next post, we will explore how to create SPD Engine tables on HDFS.


Follow the Data Management section of the SAS Communities Library (click Subscribe in the pink-shaded bar of the section) for more articles on how SAS Data Management works with Hadoop.

