Dataset Scrubbing Utilities

Scrub a dataset and return the result as a ready-to-use data feed. Scrubbing is the approach used here for normalizing an incoming dataset into an internal data feed.

Supported environment variables:

# verbose logging in this module
# note this can take longer to transform
# DataFrames and is not recommended for
# production:
export DEBUG_FETCH=1
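
A minimal sketch of how a module might read this flag, using only the standard library. The helper name `debug_fetch_enabled` is hypothetical, and treating only the value `1` as "on" is an assumption, not something this page confirms:

```python
import os


def debug_fetch_enabled(environ=None):
    """Return True when DEBUG_FETCH is set to '1'.

    Assumption: only the exact value '1' turns verbose logging on.
    `environ` lets callers pass a plain dict instead of os.environ.
    """
    env = os.environ if environ is None else environ
    return env.get("DEBUG_FETCH", "0") == "1"


if __name__ == "__main__":
    # Uses the real process environment
    print(debug_fetch_enabled())
```

Passing a dict, for example `debug_fetch_enabled({"DEBUG_FETCH": "1"})`, makes the check easy to exercise without changing the process environment.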

Ingress scrubbing converts an incoming dataset (from IEX) into one of the following data feeds and returns it as a pandas DataFrame:

DATAFEED_DAILY = 900
DATAFEED_MINUTE = 901
DATAFEED_QUOTE = 902
DATAFEED_STATS = 903
DATAFEED_PEERS = 904
DATAFEED_NEWS = 905
DATAFEED_FINANCIALS = 906
DATAFEED_EARNINGS = 907
DATAFEED_DIVIDENDS = 908
DATAFEED_COMPANY = 909
DATAFEED_PRICING_YAHOO = 1100
DATAFEED_OPTIONS_YAHOO = 1101
DATAFEED_NEWS_YAHOO = 1102
analysis_engine.dataset_scrub_utils.debug_msg(label, datafeed_type, msg_format, date_str, df)

Debug helper for logging from the scrubbing handlers

Parameters:
  • label – log label
  • datafeed_type – fetch type
  • msg_format – message to include
  • date_str – date string
  • df – pandas DataFrame or None
analysis_engine.dataset_scrub_utils.ingress_scrub_dataset(label, datafeed_type, df, date_str=None, msg_format=None, scrub_mode='sort-by-date', ds_id='no-id')

Scrub a pandas.DataFrame from an Ingress pricing service and return the resulting pandas.DataFrame

Parameters:
  • label – log label
  • datafeed_type – analysis_engine.iex.consts.DATAFEED_* type or analysis_engine.yahoo.consts.DATAFEED_* type:

    DATAFEED_DAILY = 900
    DATAFEED_MINUTE = 901
    DATAFEED_QUOTE = 902
    DATAFEED_STATS = 903
    DATAFEED_PEERS = 904
    DATAFEED_NEWS = 905
    DATAFEED_FINANCIALS = 906
    DATAFEED_EARNINGS = 907
    DATAFEED_DIVIDENDS = 908
    DATAFEED_COMPANY = 909
    DATAFEED_PRICING_YAHOO = 1100
    DATAFEED_OPTIONS_YAHOO = 1101
    DATAFEED_NEWS_YAHOO = 1102
  • df – pandas DataFrame
  • date_str – date string for simulating historical dates; datetime.datetime.now() is used if not set
  • msg_format – message format string passed to string.format()
  • scrub_mode – mode to scrub this dataset
  • ds_id – dataset identifier
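
The default scrub_mode of 'sort-by-date' suggests ordering rows chronologically. The sketch below illustrates that idea with plain pandas. It is not the library's implementation: the function name `sort_by_date_sketch`, the `date` column name, and the sample values are all assumptions made for illustration:

```python
import pandas as pd


def sort_by_date_sketch(df, date_col="date"):
    """Hypothetical stand-in for a 'sort-by-date' scrub.

    Assumption: the feed has a column of date strings named `date_col`.
    """
    if df is None or df.empty:
        # Nothing to scrub, hand the input back unchanged
        return df
    out = df.copy()
    # Parse the strings so sorting is chronological, not lexical
    out[date_col] = pd.to_datetime(out[date_col])
    return out.sort_values(by=date_col).reset_index(drop=True)


# Made-up rows arriving out of order
raw = pd.DataFrame({
    "date": ["2019-01-03", "2019-01-02", "2019-01-04"],
    "close": [101.5, 100.0, 102.25],
})
scrubbed = sort_by_date_sketch(raw)
print(scrubbed)
```

Working on a copy keeps the caller's DataFrame untouched, which makes the scrubbed and raw versions easy to compare while debugging.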
analysis_engine.dataset_scrub_utils.extract_scrub_dataset(label, datafeed_type, df, date_str=None, msg_format=None, scrub_mode='sort-by-date', ds_id='no-id')

Scrub a cached pandas.DataFrame that was stored in Redis and return the resulting pandas.DataFrame

Parameters:
  • label – log label
  • datafeed_type – analysis_engine.iex.consts.DATAFEED_* type or analysis_engine.yahoo.consts.DATAFEED_* type:

    DATAFEED_DAILY = 900
    DATAFEED_MINUTE = 901
    DATAFEED_QUOTE = 902
    DATAFEED_STATS = 903
    DATAFEED_PEERS = 904
    DATAFEED_NEWS = 905
    DATAFEED_FINANCIALS = 906
    DATAFEED_EARNINGS = 907
    DATAFEED_DIVIDENDS = 908
    DATAFEED_COMPANY = 909
    DATAFEED_PRICING_YAHOO = 1100
    DATAFEED_OPTIONS_YAHOO = 1101
    DATAFEED_NEWS_YAHOO = 1102
  • df – pandas DataFrame
  • date_str – date string for simulating historical dates; datetime.datetime.now() is used if not set
  • msg_format – message format string passed to string.format()
  • scrub_mode – mode to scrub this dataset
  • ds_id – dataset identifier
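
To make the cached case concrete, here is a sketch of turning a cached payload back into a date-sorted DataFrame. It assumes the cached value is a JSON string holding a list of records with a `date` field. That storage format, the helper name `load_cached_records`, and the sample values are assumptions, and a plain string stands in for the value read from Redis:

```python
import json

import pandas as pd


def load_cached_records(cached_json):
    """Rebuild a date-sorted DataFrame from a JSON list of records.

    Assumption: each record carries a 'date' string field.
    """
    records = json.loads(cached_json)
    df = pd.DataFrame(records)
    if df.empty:
        return df
    # JSON has no date type, so the strings are parsed again here
    df["date"] = pd.to_datetime(df["date"])
    return df.sort_values(by="date").reset_index(drop=True)


# Stand-in for a value fetched from the cache (made-up rows)
cached = json.dumps([
    {"date": "2019-01-03", "close": 101.5},
    {"date": "2019-01-02", "close": 100.0},
])
restored = load_cached_records(cached)
print(restored)
```

Dates do not survive a JSON round trip as datetime values, which is one reason a scrub step after extraction can be useful.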
analysis_engine.dataset_scrub_utils.build_dates_from_df_col(df, use_date_str, src_col='minute', src_date_format='%Y-%m-%d %H:%M:%S', output_date_format='%Y-%m-%d %H:%M:%S')

Convert a column of date strings in a pandas.DataFrame into a list of well-formed date strings.

Parameters:
  • src_col – source column name
  • use_date_str – date string for today
  • src_date_format – format of the strings in the df[src_col] column
  • output_date_format – write the new date strings in this format.
  • df – source pandas.DataFrame
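
The parameters above can be read as a parse-then-format pipeline. The sketch below shows one way that could work. It assumes each value in the source column is an "HH:MM" string with no date or seconds, so the date from use_date_str is prepended and ":00" is appended before parsing. That input shape and the function name `build_dates_sketch` are assumptions, not confirmed behavior of build_dates_from_df_col:

```python
import datetime

import pandas as pd


def build_dates_sketch(df, use_date_str, src_col="minute",
                       src_date_format="%Y-%m-%d %H:%M:%S",
                       output_date_format="%Y-%m-%d %H:%M:%S"):
    """Build a list of formatted date strings from df[src_col]."""
    new_dates = []
    for value in df[src_col]:
        # Assumption: value looks like "09:30", so add the date and seconds
        full_str = f"{use_date_str} {value}:00"
        parsed = datetime.datetime.strptime(full_str, src_date_format)
        new_dates.append(parsed.strftime(output_date_format))
    return new_dates


# Made-up minute labels for a single trading day
minutes = pd.DataFrame({"minute": ["09:30", "09:31", "15:59"]})
dates = build_dates_sketch(minutes, "2019-01-02")
print(dates)
```

Parsing with strptime before formatting means a malformed value raises a ValueError early instead of producing a bad date string in the output list.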