Five sets of flat files are built from the CRSP Stock and Index databases. A number of our subscribers have long requested alternate formats of our databases in order to automate or streamline their processes and easily ingest CRSP data into them. These flat files are intended for these purposes.
- SIZ – 1925 US Stock and Indexes
- SAZ – 1925 US Stock
- SXZ – 1962 US Stock and Indexes
- S6Z – 1962 US Stock
- SFZ - 1925 US Indexes
- SAS: *.sas7bdat
- ASCII: *.dat
- R: *.rds
|Stock Files||File Names|
|Delist – Daily Return Information||sfz_del|
|Delist – Monthly Return Information||sfz_mdel|
|Security Header Information||sfz_hdr|
|Time Series – Daily Primary||sfz_dp_dly|
|Time Series – Daily Secondary||sfz_ds_dly|
|Time Series - Monthly||sfz_mind|
|Stock and Index Files*||File Names|
|Portfolio Membership - Daily||sfz_portd|
|Portfolio Membership - Monthly||sfz_portm|
* Available in 1925 and 1962 US Stock & Indexes, and 1925 US Indexes, but are NOT available in 1925 and 1962 US Stock.
|Index Files||File Names|
|Time Series – Daily Index||sfz_dind|
|Time Series – Monthly Index||sfz_mind|
Comparisons between the legacy CRSPAccess Stock and Index databases and the CRSP flat files will reveal differences, outlined here:
The flat files provide all of the underlying data that can be used to generate derived data items, but do not include the derived items themselves. Examples of derived data that TSQuery and ts_print can generate that are not found in the flat include excess and cumulative returns, last non-missing, recent, previous period versions of time series data items.
CRSPAccess stored most items, including commonly used Price (PRC), Ask or High Price (ASKHI), and Bid or Low Price (BIDLO), as 4-byte floating point numbers. The SAS format defaults to storing those items as 8-byte (double precision) floating point numbers. For the majority of historical values, this does not result in any difference. However, the conversion to SAS’s double precision, makes more apparent a known issue of phantom precision related to decimal pricing. For example, IBM’s price on 11/27/2015 of $138.46 will, if displayed with five decimal places, show $138.46001. The ASCII version, by outputting only seven significant digits in scientific notation, will have just 1.384600E+02, and no phantom precision.
CRSP is re-engineering our back office to correct the phantom precision and increase the real precision throughout the Stock and Indexes flat files in the coming year.
Subscribers have long-requested the addition of 9-character cusips to our databases. The flat files include them in the name- and header history files.
CRSP Total Market
The investable CRSP Total Market Index is included in the index files associated with the stock and index database. Daily price-only and total returns, levels, and counts are included.
The introduction of double precision allows for the monthly volumes to be stored in same unit (1 share) as daily volumes rather than as a unit of 100 shares. So monthly volumes in the flat files are now stored as one hundred times the value in legacy CRSPAccess. The change also allowed for four missing daily volumes for Citigroup in 2009 and 2010 to be replaced with actual values that were in excess of two billion, and for those monthly volumes to be recalculated.
|KYPERMNO||CALDT||New VOL||MCALDT||New MVOL|
In CRSPAccess, missing values were indicated by the use of defined non-null values (e.g -99 for returns and 0 for prices). In the flat files, missing values are now represented by null values.
In the SAS files, CRSP uses the default SAS missing value, displayed as a “.” In the coming year, CRSP will be further examining all missing value conventions.
Each data item in each table is associated with a Data Category. The Data Category identifies the type of item and the associated formats for use with SAS, ASCII, R, and SQL.
|Data Category||Description||SAS Type||SAS Format||ASCII||SQL Type||R Type|
|FLAG||One-Character Field - Often just ‘Y’ and ‘N’||CHARACTER||$1.||CHAR(1)||TEXT|
|DESCRIPTION||Wide Character Field Containing Text Iinformation||CHARACTER||$W.||VARCHAR(W)||TEXT|
|ID||Field Containing an Alphanumeric Identifier||CHARACTER||$W.||VARCHAR(W)||TEXT|
|MNEMONIC||Alphanumeric Field Containing a Code Value||CHARACTER||$W.||VARCHAR(W)||TEXT|
|NAME||Alphanumeric Field for Names||CHARACTER||$W.||VARCHAR(W)||TEXT|
|START DATE||Start of a Date Range - Paired with an End Date||NUMERIC||YYYYMMDD10.||YYYY-MM-DD||DATE||DATE|
|END DATE||End of a Date Range - Paired with a Start Date||NUMERIC||YYYYMMDD10.||YYYY-MM-DD||DATE||DATE|
|TIMESTAMP||Date and Time, Including Seconds||NUMERIC||DATETIME||YYYY-MM-DD HH:MM:SS||DATETIME||DATETIME|
|AMOUNT||Fixed Point Number||NUMERIC||W.N||WW.NNNNN||DECIMAL (P,N)||FLOAT|
|RATIO||Calculated Floating Point Number||NUMERIC||PERCENTW.N||1.234567890123E+12||FLOAT||FLOAT|
|VALUE||Field with a Wide Range of Values||NUMERIC||E20.||1.234567890123E+12||FLOAT (OR DECIMAL)||FLOAT|
|CODE||Integer Field that Represents One or More Characteristics||NUMERIC||W.||NNNN||INT||INT|
|KEY||Integer Field that is Used as a Key||NUMERIC||W.||NNNNNNN||INT||INT|
|NUMBER||Integer Value < 2,000,000,000||NUMERIC||W.||NNNNNNNNNN||INT||INT|
|QUANTITY||Integer Field with Some Values in Excess of 2,000,000,000||NUMERIC||COMMAW.||NNNNNNNNNNNNN||BIGINT||FLOAT|
Missing values are displayed as follows:
- ASCII missing values, regardless of type, is an empty string
- SAS missing values are an empty string for character fields and SAS missing (displayed as a.) for numberic fields
- R missing values are an empty string for character fields, and ‘R’ missing (displayed as N/A) for float, int, date and datatime.