Transform Documentation

ecopipeline.transform.add_local_time(df: DataFrame, site_name: str, config: ConfigManager) -> DataFrame

Adds a column to the dataframe containing the local time.

Parameters:
df : pd.DataFrame

Dataframe

site_name : str

site name

config : ecopipeline.ConfigManager

The ConfigManager object that holds configuration data for the pipeline

Returns:
pd.DataFrame
ecopipeline.transform.add_relative_humidity(df: DataFrame, temp_col: str = 'airTemp_F', dew_point_col: str = 'dewPoint_F', degree_f: bool = True)

Add a 'relative_humidity' column to the dataframe.

Computes relative humidity from air temperature and dew-point temperature using the August-Roche-Magnus approximation. Clips the result to [0, 100].

Parameters:
df : pd.DataFrame

Dataframe containing air temperature and dew-point temperature columns.

temp_col : str, optional

Column name for air temperature. Defaults to 'airTemp_F'.

dew_point_col : str, optional

Column name for dew-point temperature. Defaults to 'dewPoint_F'.

degree_f : bool, optional

If True, temperature columns are assumed to be in °F and are internally converted to °C for the calculation. If False, columns are assumed to already be in °C. Defaults to True.

Returns:
pd.DataFrame

Dataframe with an added 'relative_humidity' column (percent, 0–100).
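The calculation can be sketched with the standard August-Roche-Magnus coefficients; the exact constants the library uses are an assumption here, and the column names are illustrative:

```python
import numpy as np
import pandas as pd

def relative_humidity_magnus(temp_c: pd.Series, dew_point_c: pd.Series) -> pd.Series:
    """Approximate RH (%) from air and dew-point temperature in deg C
    using the August-Roche-Magnus formula (coefficients are an assumption)."""
    a, b = 17.625, 243.04
    e_dew = np.exp(a * dew_point_c / (b + dew_point_c))  # actual vapor pressure term
    e_sat = np.exp(a * temp_c / (b + temp_c))            # saturation vapor pressure term
    return (100.0 * e_dew / e_sat).clip(0, 100)

df = pd.DataFrame({"airTemp_C": [20.0, 25.0], "dewPoint_C": [20.0, 10.0]})
df["relative_humidity"] = relative_humidity_magnus(df["airTemp_C"], df["dewPoint_C"])
```

When air temperature equals the dew point the formula returns exactly 100%, which is why the clip to [0, 100] matters mainly for noisy sensor data.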

ecopipeline.transform.aggregate_df(df: DataFrame, ls_filename: str = '', complete_hour_threshold: float = 0.8, complete_day_threshold: float = 1.0, remove_partial: bool = True) -> Tuple[DataFrame, DataFrame]

Aggregate minute-level data into hourly and daily dataframes.

Energy columns (matching .*Energy.* but not EnergyRate or BTU suffixes) are summed; all other numeric columns are averaged. Optionally appends load-shift schedule data and removes partial hours/days.

Parameters:
df : pd.DataFrame

Pandas dataframe of minute-by-minute sensor data.

ls_filename : str, optional

Path to the load-shift schedule CSV file (e.g. "full/path/to/pipeline/input/loadshift_matrix.csv"). The CSV must have at least four columns: date, startTime, endTime, and event. Defaults to "".

complete_hour_threshold : float, optional

Fraction of minutes in an hour required to count as a complete hour, expressed as a float (e.g. 80% = 0.8). Defaults to 0.8. Only applicable when remove_partial is True.

complete_day_threshold : float, optional

Fraction of hours in a day required to count as a complete day, expressed as a float (e.g. 80% = 0.8). Defaults to 1.0. Only applicable when remove_partial is True.

remove_partial : bool, optional

If True, removes partial hours and days from the aggregated dataframes. Defaults to True.

Returns:
hourly_df : pd.DataFrame

Aggregated hourly dataframe, including a 'system_state' column if a valid load-shift file was provided.

daily_df : pd.DataFrame

Aggregated daily dataframe, including a 'load_shift_day' column if a valid load-shift file was provided.
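The sum-vs-mean split can be sketched with plain pandas resampling. The column names are hypothetical, and the library's actual pattern also excludes EnergyRate and BTU columns from the sum group:

```python
import pandas as pd

# Two hours of synthetic minute-level data.
idx = pd.date_range("2024-01-01", periods=120, freq="min")
df = pd.DataFrame({
    "PowerIn_Total": 5.0,     # non-energy column: averaged (kW)
    "HPWH_Energy_kWh": 1.0,   # energy column: summed (kWh accrued per minute)
}, index=idx)

energy_cols = df.filter(regex="Energy").columns
hourly = pd.concat([
    df[energy_cols].resample("h").sum(),                # sum energy columns
    df.drop(columns=energy_cols).resample("h").mean(),  # average everything else
], axis=1)
```

Summing energy while averaging power keeps both physically meaningful: 60 one-kWh minutes become 60 kWh for the hour, while a constant 5 kW draw stays 5 kW.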

ecopipeline.transform.aggregate_values(df: DataFrame, thermo_slice: str) -> DataFrame

Gets the daily average of data for all relevant variables.

Parameters:
df : pd.DataFrame

Pandas DataFrame of minute-by-minute data

thermo_slice : str

Indicates the time at which slicing begins. If None, no slicing is performed. The format of the thermo_slice string is "HH:MM AM/PM".

Returns:
pd.DataFrame:

Pandas DataFrame which contains the aggregated daily data.

ecopipeline.transform.apply_equipment_cop_derate(df: DataFrame, equip_cop_col: str, r_val: int = 16) -> DataFrame

Derate equipment-method system COP based on building R-value.

Derate percentages applied:

  • R12–R16: 12%

  • R16–R20: 10%

  • R20–R24: 8%

  • R24–R28: 6%

  • R28–R32: 4%

  • > R32: 2%

Parameters:
df : pd.DataFrame

Dataframe containing the equipment COP column to derate.

equip_cop_col : str

Name of the COP column to derate.

r_val : int, optional

Building R-value used to determine the derate factor. Defaults to 16.

Returns:
pd.DataFrame

Dataframe with equip_cop_col multiplied by the appropriate derate factor.

Raises:
Exception

If r_val is less than 12.
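The band lookup can be sketched as follows; how the library assigns the overlapping band boundaries (e.g. exactly R16) is an assumption here:

```python
import pandas as pd

def derate_factor(r_val: int) -> float:
    """Map a building R-value to a multiplicative derate factor,
    mirroring the documented bands (boundary handling is an assumption)."""
    if r_val < 12:
        raise Exception("R-value must be at least 12")
    for upper, derate in [(16, 0.12), (20, 0.10), (24, 0.08), (28, 0.06), (32, 0.04)]:
        if r_val <= upper:
            return 1.0 - derate
    return 1.0 - 0.02  # above R32

df = pd.DataFrame({"EquipCOP": [3.0]})
df["EquipCOP"] = df["EquipCOP"] * derate_factor(16)  # R16 falls in the 12% band here
```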

ecopipeline.transform.aqsuite_filter_new(last_date: str, filenames: List[str], site: str, config: ConfigManager) -> List[str]

Filters the filenames list to only those newer than last_date.

Parameters:
last_date : str

latest date loaded prior to current runtime

filenames : List[str]

List of filenames to be filtered

site : str

site name

config : ecopipeline.ConfigManager

The ConfigManager object that holds configuration data for the pipeline

Returns:
List[str]:

Filtered list of filenames

ecopipeline.transform.aqsuite_prep_time(df: DataFrame) -> DataFrame

Converts the time column of an Aqsuite dataframe into datetime type and sorts the entire dataframe by time.

Prerequisite: the input dataframe MUST be an Aqsuite dataframe whose columns have not yet been renamed.

Parameters:
df : pd.DataFrame

Aqsuite DataFrame

Returns:
pd.DataFrame:

Aqsuite dataframe sorted by its datetime-typed time column

ecopipeline.transform.avg_duplicate_times(df: DataFrame, timezone: str) -> DataFrame

Collapse duplicate timestamps by averaging numeric values and taking the first non-numeric value.

Looks for duplicate timestamps (typically caused by daylight-saving time transitions or timestamp rounding) and reduces each group of duplicates to a single row, averaging numeric columns and keeping the first value for non-numeric columns.

Parameters:
df : pd.DataFrame

Pandas dataframe to be altered.

timezone : str

Timezone string to apply to the output index. Must be a string recognised by pandas.Series.tz_localize. See https://pandas.pydata.org/docs/reference/api/pandas.Series.tz_localize.html.

Returns:
pd.DataFrame

Dataframe with all duplicate timestamps collapsed into one row, averaging numeric data values.
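The collapsing step can be sketched with a groupby over the index; the column names here are illustrative:

```python
import pandas as pd

# Two rows share the 01:00 timestamp (e.g. a DST fall-back artifact).
idx = pd.to_datetime(["2024-11-03 01:00", "2024-11-03 01:00", "2024-11-03 01:01"])
df = pd.DataFrame({"temp": [50.0, 52.0, 51.0], "label": ["a", "b", "c"]}, index=idx)

# Average numeric columns; keep the first value of non-numeric columns.
agg = {c: ("mean" if pd.api.types.is_numeric_dtype(df[c]) else "first")
       for c in df.columns}
deduped = df.groupby(level=0).agg(agg)
```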

ecopipeline.transform.calculate_cop_values(df: DataFrame, heatLoss_fixed: int, thermo_slice: str) -> DataFrame

Performs COP calculations using the daily aggregated data.

Parameters:
df : pd.DataFrame

Pandas DataFrame to add COP columns to

heatLoss_fixed : float

fixed heat loss value

thermo_slice : str

The time at which slicing begins, if thermo slicing is desired.

Returns:
pd.DataFrame:

Pandas DataFrame with the added COP columns.

ecopipeline.transform.central_transform_function(config: ConfigManager, df: DataFrame, weather_df: DataFrame = None, tz_convert_from: str = 'America/Los_Angeles', tz_convert_to: str = 'America/Los_Angeles', oat_column_name: str = 'Temp_OAT', complete_hour_threshold: float = 0.8, complete_day_threshold: float = 1.0, remove_partial: bool = True, pre_aggregation_func=None, post_aggregation_func=None) -> Tuple[DataFrame, DataFrame, DataFrame]

Run the full central transform pipeline on raw minute-level site data.

Renames sensors, rounds timestamps, forward-fills missing values, optionally converts timezones, averages duplicate timestamps, aggregates to hourly and daily dataframes, and optionally merges weather data. Supports optional pre- and post-aggregation hooks for custom processing.

Parameters:
config : ConfigManager

The ConfigManager object that holds configuration data for the pipeline.

df : pd.DataFrame

Dataframe with raw time-indexed (ideally minute-interval) site data. Important column names should be represented in the variable_alias column in the Variable_Names.csv file.

weather_df : pd.DataFrame, optional

Dataframe with time-indexed (preferably hourly) weather data. Will be merged with the hourly dataframe.

tz_convert_from : str, optional

String value of the timezone the data is currently in.

tz_convert_to : str, optional

String value of the timezone the data should be converted to.

oat_column_name : str, optional

Name that the Outdoor Air Temperature column should have. Defaults to 'Temp_OAT'.

complete_hour_threshold : float, optional

Fraction of minutes in an hour needed to count as a complete hour, expressed as a float (e.g. 80% = 0.8). Defaults to 0.8. Only applicable if remove_partial is True.

complete_day_threshold : float, optional

Fraction of hours in a day needed to count as a complete day, expressed as a float (e.g. 80% = 0.8). Defaults to 1.0. Only applicable if remove_partial is True.

remove_partial : bool, optional

If True, removes partial days and hours from aggregated dataframes. Defaults to True.

pre_aggregation_func : callable, optional

A custom function called after minute-level processing and before aggregation. Signature: pre_aggregation_func(df: pd.DataFrame) -> pd.DataFrame.

post_aggregation_func : callable, optional

A custom function called after weather merging and before returning. Signature: post_aggregation_func(df, hourly_df, daily_df) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame].

Returns:
tuple of pd.DataFrame

A three-element tuple (df, hourly_df, daily_df) containing the processed minute-level, hourly, and daily dataframes respectively.

Raises:
TypeError

If pre_aggregation_func or post_aggregation_func is not callable, does not accept the expected parameters, or does not return the expected type.
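A minimal pre-aggregation hook might look like this; the derived column and its name are purely illustrative:

```python
import pandas as pd

def add_total_watts(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pre_aggregation_func: derive an extra column
    from the minute-level data before hourly/daily aggregation."""
    df = df.copy()
    df["PowerIn_Total_W"] = df["PowerIn_Total"] * 1000.0  # kW -> W
    return df

minute_df = pd.DataFrame(
    {"PowerIn_Total": [1.5, 2.0]},
    index=pd.date_range("2024-01-01", periods=2, freq="min"))
out = add_total_watts(minute_df)
# Would be passed as:
# central_transform_function(config, df, pre_aggregation_func=add_total_watts)
```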

ecopipeline.transform.change_ID_to_HVAC(df: DataFrame, site_info: Series) -> DataFrame

Takes in a site dataframe and the site's info series and assigns a unique event_ID value whenever the system changes state.

Parameters:
df : pd.DataFrame

Pandas Dataframe

site_info : pd.Series

site_info.csv as a pd.Series

Returns:
pd.DataFrame:

modified Pandas Dataframe

ecopipeline.transform.column_name_change(df: DataFrame, dt: Timestamp, new_column: str, old_column: str, remove_old_column: bool = True) -> DataFrame

Back-fill new_column with values from old_column for rows before a name-change timestamp.

Overwrites values in new_column with values from old_column for all rows with an index earlier than dt, provided dt is within the index range. Optionally removes old_column afterwards.

Parameters:
df : pd.DataFrame

Pandas dataframe with minute-to-minute data.

dt : pd.Timestamp

Timestamp of the variable name change.

new_column : str

Name of the column to be overwritten for rows before dt.

old_column : str

Name of the column to copy values from.

remove_old_column : bool, optional

If True, drops old_column from the dataframe after the copy. Defaults to True.

Returns:
pd.DataFrame

Dataframe with new_column updated for pre-change rows.

ecopipeline.transform.concat_last_row(df: DataFrame, last_row: DataFrame) -> DataFrame

Concatenate the last database row onto a new-data dataframe to enable forward filling.

Takes a dataframe with new data and a second dataframe representing the last row of the destination database, concatenates them so that subsequent forward filling can use information from the last row.

Parameters:
df : pd.DataFrame

Dataframe with new data that needs to be forward filled from data in the last row of a database.

last_row : pd.DataFrame

Last row of the database to forward fill from.

Returns:
pd.DataFrame

Dataframe with the last row concatenated and sorted by index.

ecopipeline.transform.condensate_calculations(df: DataFrame, site: str, site_info: Series) -> DataFrame

Calculates condensate values for the given dataframe.

Parameters:
df : pd.DataFrame

dataframe to be modified

site : str

name of site

site_info : pd.Series

Series of site info

Returns:
pd.DataFrame:

modified dataframe

ecopipeline.transform.convert_c_to_f(df: DataFrame, column_names: list) -> DataFrame

Convert specified columns from degrees Celsius to Fahrenheit.

Parameters:
df : pd.DataFrame

Pandas dataframe of sensor data.

column_names : list

List of column names whose values are currently in Celsius and need to be converted to Fahrenheit.

Returns:
pd.DataFrame

Dataframe with the specified columns converted from Celsius to Fahrenheit.
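The conversion itself is the standard formula applied column-wise; the column names here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Temp_Supply": [0.0, 100.0], "Temp_Return": [20.0, 37.0]})

# Apply C -> F to each listed column in place.
for col in ["Temp_Supply", "Temp_Return"]:
    df[col] = df[col] * 9.0 / 5.0 + 32.0
```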

ecopipeline.transform.convert_l_to_g(df: DataFrame, column_names: list) -> DataFrame

Convert specified columns from liters to gallons.

Parameters:
df : pd.DataFrame

Pandas dataframe of sensor data.

column_names : list

List of column names whose values are currently in liters and need to be converted to gallons.

Returns:
pd.DataFrame

Dataframe with the specified columns converted from liters to gallons.

ecopipeline.transform.convert_on_off_col_to_bool(df: DataFrame, column_names: list) -> DataFrame

Convert "ON"/"OFF" string values to boolean True/False in specified columns.

Parameters:
df : pd.DataFrame

Pandas dataframe of sensor data.

column_names : list

List of column names containing "ON"/"OFF" (or "On"/"Off") strings to be converted to boolean values.

Returns:
pd.DataFrame

Dataframe with the specified columns converted to boolean values.
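A case-insensitive mapping is one way to sketch this; how the library handles unexpected strings is an assumption:

```python
import pandas as pd

df = pd.DataFrame({"pump_status": ["ON", "OFF", "On", "Off"]})

# Normalize case, then map to booleans; unmapped strings would become NaN.
df["pump_status"] = df["pump_status"].str.upper().map({"ON": True, "OFF": False})
```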

ecopipeline.transform.convert_temp_resistance_type(df: DataFrame, column_name: str, sensor_model='veris') -> DataFrame

Convert temperature resistance readings using a 10k Type 2 thermistor model.

Applies a two-stage pickle-model conversion (temperature-to-resistance, then resistance-to-temperature) to correct sensor readings in the specified column.

Parameters:
df : pd.DataFrame

Timestamp-indexed Pandas dataframe of minute-by-minute values.

column_name : str

Name of the column containing resistance conversion Type 2 data.

sensor_model : str, optional

Sensor model to use. Supported values: 'veris', 'tasseron'. Defaults to 'veris'.

Returns:
pd.DataFrame

Dataframe with the specified column corrected via the thermistor model.

Raises:
Exception

If sensor_model is not a supported value.

ecopipeline.transform.convert_time_zone(df: DataFrame, tz_convert_from: str = 'UTC', tz_convert_to: str = 'America/Los_Angeles') -> DataFrame

Convert a dataframe’s DatetimeIndex from one timezone to another.

Parameters:
df : pd.DataFrame

Pandas dataframe of sensor data whose index should be timezone-converted.

tz_convert_from : str, optional

Timezone string the index is currently in. Defaults to 'UTC'.

tz_convert_to : str, optional

Timezone string the index should be converted to. Defaults to 'America/Los_Angeles'.

Returns:
pd.DataFrame

Dataframe with its index converted to the target timezone (stored without timezone info as a naive datetime index).

ecopipeline.transform.cop_method_1(df: DataFrame, recircLosses, heatout_primary_column: str = 'HeatOut_Primary', total_input_power_column: str = 'PowerIn_Total') -> DataFrame

Perform COP calculation method 1 (original AWS method).

Computes COP_DHWSys_1 = (HeatOut_Primary + recircLosses) / PowerIn_Total and adds the result as a new column to the dataframe.

Parameters:
df : pd.DataFrame

Pandas dataframe of daily averaged values. Must already contain heatout_primary_column and total_input_power_column.

recircLosses : float or pd.Series

Recirculation losses in kW. Pass a float for a fixed spot-measured value, or a pd.Series (aligned with df) if measurements are available in the datastream.

heatout_primary_column : str, optional

Name of the column containing primary system output power in kW. Defaults to 'HeatOut_Primary'.

total_input_power_column : str, optional

Name of the column containing total system input power in kW. Defaults to 'PowerIn_Total'.

Returns:
pd.DataFrame

Dataframe with an added 'COP_DHWSys_1' column.
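The formula is direct to sketch on a daily dataframe; the values here are made up:

```python
import pandas as pd

daily_df = pd.DataFrame({"HeatOut_Primary": [10.0, 12.0],  # kW
                         "PowerIn_Total": [4.0, 5.0]})     # kW
recirc_losses_kw = 2.0  # fixed spot-measured recirculation loss

# COP_DHWSys_1 = (HeatOut_Primary + recircLosses) / PowerIn_Total
daily_df["COP_DHWSys_1"] = (
    (daily_df["HeatOut_Primary"] + recirc_losses_kw) / daily_df["PowerIn_Total"])
```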

ecopipeline.transform.cop_method_2(df: DataFrame, cop_tm, cop_primary_column_name) -> DataFrame

Perform COP calculation method 2.

Formula: COP = COP_primary * (ELEC_primary / ELEC_total) + COP_tm * (ELEC_tm / ELEC_total)

Parameters:
df : pd.DataFrame

Pandas dataframe to add the COP column to. Must contain:

  • cop_primary_column_name: primary system COP values.

  • 'PowerIn_Total': total system power.

  • Columns prefixed with 'PowerIn_HPWH' or equal to 'PowerIn_SecLoopPump' (primary system power).

  • Columns prefixed with 'PowerIn_SwingTank' or 'PowerIn_ERTank' (temperature-maintenance system power).

cop_tm : float

Fixed COP value for the temperature-maintenance system.

cop_primary_column_name : str

Name of the column containing primary-system COP values.

Returns:
pd.DataFrame

Dataframe with an added 'COP_DHWSys_2' column.
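The weighted blend can be sketched as follows; column names follow the conventions above and the values are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "COP_Primary": [3.0],
    "PowerIn_Total": [10.0],       # kW
    "PowerIn_HPWH1": [6.0],        # primary-system power
    "PowerIn_SecLoopPump": [1.0],  # primary-system power
    "PowerIn_SwingTank": [3.0],    # temperature-maintenance power
})
cop_tm = 1.0  # fixed temperature-maintenance COP

# COP = COP_primary * (ELEC_primary / ELEC_total) + COP_tm * (ELEC_tm / ELEC_total)
elec_primary = df["PowerIn_HPWH1"] + df["PowerIn_SecLoopPump"]
elec_tm = df["PowerIn_SwingTank"]
df["COP_DHWSys_2"] = (df["COP_Primary"] * elec_primary / df["PowerIn_Total"]
                      + cop_tm * elec_tm / df["PowerIn_Total"])
```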

ecopipeline.transform.create_data_statistics_df(df: DataFrame) -> DataFrame

Compute per-column data-gap statistics aggregated by day.

Must be called on the raw minute-level dataframe after rename_sensors() and before ffill_missing(). Each original column is expanded into three derived columns:

  • <col>_missing_mins: number of minutes in the day with no reported value.

  • <col>_avg_gap: average consecutive gap length (in minutes) for that day.

  • <col>_max_gap: maximum consecutive gap length (in minutes) for that day.

Parameters:
df : pd.DataFrame

Minute-level dataframe after rename_sensors() and before ffill_missing() has been called.

Returns:
pd.DataFrame

Day-indexed dataframe containing the three gap-statistic columns for each original column.

ecopipeline.transform.create_fan_curves(cfm_info: DataFrame, site_info: Series) -> DataFrame

Create fan curves for each site.

Parameters:
cfm_info : pd.DataFrame

DataFrame of fan curve information.

site_info : pd.Series

Series containing the site information.

Returns:
pd.DataFrame:

Dataframe containing the fan curves for each site.

ecopipeline.transform.create_summary_tables(df: DataFrame)

Create hourly and daily summary tables from minute-by-minute data.

Parameters:
df : pd.DataFrame

Pandas dataframe of minute-by-minute sensor data.

Returns:
hourly_df : pd.DataFrame

Hourly mean aggregation of the input data, with partial hours removed.

daily_df : pd.DataFrame

Daily mean aggregation of the input data, with partial days removed.

ecopipeline.transform.delete_erroneous_from_time_pt(df: DataFrame, time_point: Timestamp, column_names: list, new_value=None) -> DataFrame

Replace erroneous values at a specific timestamp with a given replacement value.

Parameters:
df : pd.DataFrame

Timestamp-indexed Pandas dataframe that contains the erroneous value.

time_point : pd.Timestamp

The index timestamp at which the erroneous values occur.

column_names : list

List of column name strings that contain erroneous values at this timestamp.

new_value : any, optional

Replacement value to write into the erroneous cells. If None, the cells are replaced with NaN. Defaults to None.

Returns:
pd.DataFrame

Dataframe with the erroneous values replaced by new_value.

ecopipeline.transform.elev_correction(site_name: str, config: ConfigManager) -> DataFrame

Creates a dataframe for a given site that contains the site name, elevation, and corrected elevation.

Parameters:
site_name : str

site's name

config : ecopipeline.ConfigManager

The ConfigManager object that holds configuration data for the pipeline

Returns:
pd.DataFrame:

new Pandas dataframe

ecopipeline.transform.estimate_power(df: DataFrame, new_power_column: str, current_a_column: str, current_b_column: str, current_c_column: str, assumed_voltage: float = 208, power_factor: float = 1) -> DataFrame

Estimate three-phase power from per-phase current readings.

Calculates power as the average phase current multiplied by the assumed voltage, power factor, and sqrt(3), then converts from watts to kilowatts.

Parameters:
df : pd.DataFrame

Pandas dataframe with minute-to-minute data.

new_power_column : str

Column name to store the estimated power. Units will be kW.

current_a_column : str

Column name of the Phase A current. Units should be amps.

current_b_column : str

Column name of the Phase B current. Units should be amps.

current_c_column : str

Column name of the Phase C current. Units should be amps.

assumed_voltage : float, optional

Assumed line voltage in volts. Defaults to 208.

power_factor : float, optional

Power factor to apply. Defaults to 1.

Returns:
pd.DataFrame

Dataframe with a new estimated power column of the specified name.
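The described calculation can be sketched as follows; the column names are illustrative:

```python
import math
import pandas as pd

df = pd.DataFrame({"amps_a": [10.0], "amps_b": [12.0], "amps_c": [14.0]})
assumed_voltage, power_factor = 208.0, 1.0

# Average phase current * voltage * power factor * sqrt(3), converted W -> kW.
avg_current = df[["amps_a", "amps_b", "amps_c"]].mean(axis=1)
df["PowerIn_Est"] = avg_current * assumed_voltage * power_factor * math.sqrt(3) / 1000.0
```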

ecopipeline.transform.ffill_missing(original_df: DataFrame, config: ConfigManager, previous_fill: DataFrame = None) -> DataFrame

Forward-fill selected columns of a dataframe according to rules in Variable_Names.csv.

Parameters:
original_df : pd.DataFrame

Pandas dataframe that needs to be forward-filled.

config : ConfigManager

The ConfigManager object that holds configuration data for the pipeline. Points to a file called Variable_Names.csv in the pipeline’s input folder. The CSV must have at least three columns:

  • variable_name: name of each variable to forward-fill.

  • changepoint: 1 to forward-fill unconditionally until the next change point, 0 to forward-fill up to ffill_length rows, or null to skip forward-filling for that variable.

  • ffill_length: number of rows to forward-fill when changepoint is 0.

previous_fill : pd.DataFrame, optional

Dataframe with the same index type and at least some of the same columns as original_df (typically the last row from the destination database). Its values are used to seed forward-filling into the new data.

Returns:
pd.DataFrame

Dataframe that has been forward-filled per the specifications in the Variable_Names.csv file.

ecopipeline.transform.flag_dhw_outage(df: DataFrame, daily_df: DataFrame, dhw_outlet_column: str, supply_temp: int = 110, consecutive_minutes: int = 15) -> DataFrame

Detect DHW outage events and return an alarm event dataframe.

Identifies periods where DHW outlet temperature falls below supply_temp for at least consecutive_minutes consecutive minutes, then records an ALARM event for each affected day.

Parameters:
df : pd.DataFrame

Pandas dataframe of sensor data on minute intervals.

daily_df : pd.DataFrame

Pandas dataframe of sensor data on daily intervals.

dhw_outlet_column : str

Name of the column in df that contains the DHW temperature supplied to building occupants.

supply_temp : int, optional

Minimum acceptable DHW supply temperature in °F. Defaults to 110.

consecutive_minutes : int, optional

Number of consecutive minutes below supply_temp required to qualify as a DHW outage. Defaults to 15.

Returns:
pd.DataFrame

Dataframe indexed by start_time_pt containing 'ALARM' events for each day on which a DHW outage occurred.
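The run-length condition can be sketched with a rolling sum over a boolean series; the column name and a 3-minute threshold are illustrative:

```python
import pandas as pd

idx = pd.date_range("2024-01-01 00:00", periods=6, freq="min")
temps = pd.Series([112, 105, 104, 103, 112, 104], index=idx, name="Temp_DHWSupply")

supply_temp, consecutive_minutes = 110, 3
below = temps < supply_temp
# A window summing to its own length marks a run of consecutive sub-threshold minutes.
outage = below.rolling(consecutive_minutes).sum() == consecutive_minutes
```

Here only the 00:03 window (minutes 00:01-00:03, all below 110 °F) qualifies; the isolated dip at 00:05 does not.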

ecopipeline.transform.gas_valve_diff(df: DataFrame, site: str, config: ConfigManager) -> DataFrame

Takes in the site dataframe and the site name. If the site has gas heating, takes the lagged difference to get per-minute values.

Parameters:
df : pd.DataFrame

Dataframe for site

site : str

site name as string

config : ecopipeline.ConfigManager

The ConfigManager object that holds configuration data for the pipeline

Returns:
pd.DataFrame:

modified Pandas Dataframe

ecopipeline.transform.gather_outdoor_conditions(df: DataFrame, site: str) -> DataFrame

Takes in a site dataframe and site name as a string. Returns a new dataframe that contains time_utc, <site>_ODT, and <site>_ODRH for the site.

Parameters:
df : pd.DataFrame

Pandas Dataframe

site : str

site name as string

Returns:
pd.DataFrame:

new Pandas Dataframe

ecopipeline.transform.generate_event_log_df(config: ConfigManager)

Create an event log dataframe from a user-submitted Event_log.csv file.

Parameters:
config : ConfigManager

The ConfigManager object that holds configuration data for the pipeline. Points to the Event_log.csv file via config.get_event_log_path().

Returns:
pd.DataFrame

Dataframe indexed by start_time_pt and formatted from the events in Event_log.csv. Returns an empty dataframe with the expected columns if the file cannot be read.

ecopipeline.transform.get_cfm_values(df, site_cfm, site_info, site)
ecopipeline.transform.get_cop_values(df: DataFrame, site_info: DataFrame)
ecopipeline.transform.get_energy_by_min(df: DataFrame) -> DataFrame

Energy is recorded cumulatively. Takes the lagged differences to get a per-minute value for each of the energy variables.

Parameters:
df : pd.DataFrame

Pandas dataframe

Returns:
pd.DataFrame:

Pandas dataframe
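The lagged-difference step can be sketched as follows; the meter column name is hypothetical:

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=4, freq="min")
df = pd.DataFrame({"HPWH_Energy_kWh": [100.0, 100.5, 101.5, 101.5]}, index=idx)

# Cumulative meter readings -> per-minute consumption (first row becomes NaN).
df["HPWH_Energy_kWh"] = df["HPWH_Energy_kWh"].diff()
```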

ecopipeline.transform.get_hvac_state(df: DataFrame, site_info: Series) -> DataFrame
ecopipeline.transform.get_refrig_charge(df: DataFrame, site: str, config: ConfigManager) -> DataFrame

Takes in a site dataframe and its site name as a string, locates site_info.csv, superheat.csv, and 410a_pt.csv via config, and calculates the refrigerant charge per minute.

Parameters:
df : pd.DataFrame

Pandas Dataframe

site : str

site name as a string

config : ecopipeline.ConfigManager

The ConfigManager object that holds configuration data for the pipeline

Returns:
pd.DataFrame:

modified Pandas Dataframe

ecopipeline.transform.get_site_cfm_info(site: str, config: ConfigManager) -> DataFrame

Returns a dataframe of the site cfm information for the given site. NOTE: parsing is necessary because the first row of data contains comments that need to be dropped.

Parameters:
site : str

The site name

config : ecopipeline.ConfigManager

The ConfigManager object that holds configuration data for the pipeline

Returns:
df : pd.DataFrame

The DataFrame of the site cfm information

ecopipeline.transform.get_site_info(site: str, config: ConfigManager) -> Series

Returns a series of the site information for the given site

Parameters:
site : str

The site name

config : ecopipeline.ConfigManager

The ConfigManager object that holds configuration data for the pipeline

Returns:
df : pd.Series

The Series of the site information

ecopipeline.transform.get_storage_gals120(df: DataFrame, location: Series, gals: int, total: int, zones: Series) -> DataFrame

Creates and appends the Gals120 data onto the dataframe.

Parameters:
df : pd.DataFrame

A Pandas Dataframe

location : pd.Series
gals : int
total : int
zones : pd.Series
Returns:
pd.DataFrame:

a Pandas Dataframe

ecopipeline.transform.get_temp_zones120(df: DataFrame) -> DataFrame

Keeps track of the average temperature of each zone. For this function to work, naming conventions for each parallel tank must include ‘Temp1’ as the temperature at the top of the tank, ‘Temp5’ as that at the bottom of the tank, and ‘Temp2’-‘Temp4’ as the temperatures in between.

Parameters:
df : pd.DataFrame

A Pandas Dataframe

Returns:
pd.DataFrame:

a Pandas Dataframe

ecopipeline.transform.heat_output_calc(df: DataFrame, flow_var: str, hot_temp: str, cold_temp: str, heat_out_col_name: str, return_as_kw: bool = True) -> DataFrame

Calculate heat output from flow rate and supply/return temperatures.

Uses the formula Heat (BTU/hr) = 500 * flow (gal/min) * delta_T (°F) and clips negative values to zero. Optionally converts the result to kW.

Parameters:
df : pd.DataFrame

Pandas dataframe with minute-to-minute data.

flow_var : str

Column name of the flow variable. Units must be gal/min.

hot_temp : str

Column name of the hot (supply) temperature variable. Units must be °F.

cold_temp : str

Column name of the cold (return) temperature variable. Units must be °F.

heat_out_col_name : str

Name for the new heat output column added to the dataframe.

return_as_kw : bool, optional

If True, the new column will be in kW. If False, it will be in BTU/hr. Defaults to True.

Returns:
pd.DataFrame

Dataframe with the new heat output column of the specified name.
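The formula can be sketched as follows; the column names are illustrative and the BTU/hr-to-kW constant of 3412.14 is an assumption about the library's conversion:

```python
import pandas as pd

df = pd.DataFrame({"Flow_gpm": [10.0], "Temp_Hot_F": [130.0], "Temp_Cold_F": [110.0]})

# Heat (BTU/hr) = 500 * flow (gal/min) * delta_T (F), with negatives clipped to zero.
btu_per_hr = (500.0 * df["Flow_gpm"] * (df["Temp_Hot_F"] - df["Temp_Cold_F"])).clip(lower=0)
df["HeatOut_Primary"] = btu_per_hr / 3412.14  # BTU/hr -> kW
```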

ecopipeline.transform.join_to_daily(daily_data: DataFrame, cop_data: DataFrame) -> DataFrame

Left-join COP data onto the daily dataframe.

Parameters:
daily_data : pd.DataFrame

Daily sensor dataframe.

cop_data : pd.DataFrame

COP values dataframe to join.

Returns:
pd.DataFrame

Daily dataframe left-joined with the COP dataframe.

ecopipeline.transform.join_to_hourly(hourly_data: DataFrame, noaa_data: DataFrame, oat_column_name: str = 'OAT_NOAA') -> DataFrame

Left-join weather data onto the hourly dataframe.

Parameters:
hourly_data : pd.DataFrame

Hourly sensor dataframe.

noaa_data : pd.DataFrame

Weather (e.g. NOAA) dataframe to join.

oat_column_name : str, optional

Name of the outdoor air temperature column in noaa_data. Defaults to 'OAT_NOAA'.

Returns:
pd.DataFrame

Hourly dataframe left-joined with the weather dataframe. Returns hourly_data unchanged if the OAT column in noaa_data contains no non-null values.

ecopipeline.transform.lbnl_pressure_conversions(df: DataFrame) -> DataFrame
ecopipeline.transform.lbnl_sat_calculations(df: DataFrame) -> DataFrame
ecopipeline.transform.lbnl_temperature_conversions(df: DataFrame) -> DataFrame
ecopipeline.transform.merge_indexlike_rows(df: DataFrame) -> DataFrame

Merges index-like rows together ensuring that all relevant information for a certain timestamp is stored in one row - not in multiple rows. It also rounds the timestamps to the nearest minute.

Parameters:
df : pd.DataFrame

The dataframe whose index-like rows should be merged.

Returns:
df : pd.DataFrame

The DataFrame with all index-like rows merged.

ecopipeline.transform.nclarity_csv_to_df(csv_filenames: List[str]) -> DataFrame

Takes a list of csv filenames containing nclarity data and reads all files into a single dataframe.

Parameters:
csv_filenames : List[str]

List of filenames

Returns:
pd.DataFrame:

Pandas Dataframe containing data from all files

ecopipeline.transform.nclarity_filter_new(date: str, filenames: List[str]) -> List[str]

Filters the filenames list to only those from the given date or later.

Parameters:
date : str

target date

filenames : List[str]

List of filenames to be filtered

Returns:
List[str]:

Filtered list of filenames

ecopipeline.transform.nullify_erroneous(original_df: DataFrame, config: ConfigManager) -> DataFrame

Replace known error-sentinel values in a dataframe with NaN.

Parameters:
original_df : pd.DataFrame

Pandas dataframe that needs to be filtered for error values.

config : ConfigManager

The ConfigManager object that holds configuration data for the pipeline. Points to a file called Variable_Names.csv in the pipeline’s input folder. The CSV must have at least two columns:

  • variable_name: names of columns that may contain error values.

  • error_value: the sentinel error value for each variable, or null if no error value applies.

Returns:
pd.DataFrame

Dataframe with known error-sentinel values replaced by NaN.
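The replacement step can be sketched as follows; the column and its -999 sentinel are hypothetical stand-ins for variable_name/error_value entries in Variable_Names.csv:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Temp_OAT": [55.0, -999.0, 60.0]})
error_values = {"Temp_OAT": -999.0}  # variable_name -> error_value

# Replace each column's sentinel error value with NaN.
for col, sentinel in error_values.items():
    df.loc[df[col] == sentinel, col] = np.nan
```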

ecopipeline.transform.process_ls_signal(df: DataFrame, hourly_df: DataFrame, daily_df: DataFrame, load_dict: dict = {1: 'normal', 2: 'loadUp', 3: 'shed'}, ls_column: str = 'ls', drop_ls_from_df: bool = False)

Add load-shift signals to hourly and daily aggregated dataframes.

Parameters:
df : pd.DataFrame

Timestamp-indexed Pandas dataframe of minute-by-minute values.

hourly_df : pd.DataFrame

Timestamp-indexed Pandas dataframe of hourly average values.

daily_df : pd.DataFrame

Timestamp-indexed Pandas dataframe of daily average values.

load_dict : dict, optional

Mapping from integer load-shift signal values to descriptive string labels. Defaults to {1: "normal", 2: "loadUp", 3: "shed"}.

ls_column : str, optional

Name of the load-shift column in df. Defaults to 'ls'.

drop_ls_from_df : bool, optional

If True, drops ls_column from df after processing. Defaults to False.

Returns:
df : pd.DataFrame

Minute-by-minute dataframe with ls_column removed if drop_ls_from_df is True.

hourly_df : pd.DataFrame

Hourly dataframe with an added 'system_state' column containing the load-shift command label from load_dict for each hour. Values are mapped from the rounded mean of ls_column within each hour; hours whose rounded mean is not a key in load_dict will be null.

daily_df : pd.DataFrame

Daily dataframe with an added boolean 'load_shift_day' column that is True on days containing at least one non-normal load-shift command in hourly_df.

ecopipeline.transform.remove_outliers(original_df: DataFrame, config: ConfigManager, site: str = '') -> DataFrame

Remove outliers from a dataframe by replacing out-of-bounds values with NaN.

Reads bound information from Variable_Names.csv via config and sets any values outside the defined lower_bound/upper_bound range to NaN.

Parameters:
original_df : pd.DataFrame

Pandas dataframe for which outliers need to be removed.

config : ConfigManager

The ConfigManager object that holds configuration data for the pipeline. Points to a file called Variable_Names.csv in the pipeline’s input folder. The CSV must have at least three columns: variable_name, lower_bound, and upper_bound.

site : str, optional

Site name to filter bounds data by. Leave as an empty string if not applicable.

Returns:
pd.DataFrame

Dataframe with outliers replaced by NaN.
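The bounds check can be sketched as follows; the column and its bounds are hypothetical stand-ins for Variable_Names.csv rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Temp_OAT": [55.0, 300.0, -80.0]})
bounds = {"Temp_OAT": (-40.0, 130.0)}  # variable_name -> (lower_bound, upper_bound)

# Null out any values outside each column's configured range.
for col, (lo, hi) in bounds.items():
    df.loc[(df[col] < lo) | (df[col] > hi), col] = np.nan
```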

ecopipeline.transform.remove_partial_days(df, hourly_df, daily_df, complete_hour_threshold: float = 0.8, complete_day_threshold: float = 1.0, partial_day_removal_exclusion: list = [])

Remove hourly and daily rows that are derived from insufficient minute-level data.

Parameters:
dfpd.DataFrame

Pandas dataframe of minute-by-minute sensor data.

hourly_dfpd.DataFrame

Aggregated hourly dataframe.

daily_dfpd.DataFrame

Aggregated daily dataframe.

complete_hour_thresholdfloat, optional

Fraction of minutes in an hour required to count as a complete hour, expressed as a float (e.g. 80% = 0.8). Defaults to 0.8.

complete_day_thresholdfloat, optional

Fraction of hours in a day required to count as a complete day, expressed as a float (e.g. 80% = 0.8). Defaults to 1.0.

partial_day_removal_exclusionlist, optional

Column names to skip when evaluating completeness. Defaults to [].

Returns:
hourly_dfpd.DataFrame

Hourly dataframe with incomplete hours removed and sparse columns nullified.

daily_dfpd.DataFrame

Daily dataframe with incomplete days removed and sparse columns nullified.

Raises:
Exception

If complete_hour_threshold or complete_day_threshold is not between 0 and 1.
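The hour-completeness test can be sketched in plain pandas. This is an illustration of the threshold logic only (the function also nullifies sparse columns and applies the analogous day-level check); the column name is hypothetical:

```python
import pandas as pd

complete_hour_threshold = 0.8  # at least 48 of 60 minutes

# 50 minutes of data in hour 0, but only 10 minutes in hour 1
idx = pd.date_range("2024-01-01 00:00", periods=50, freq="min").append(
    pd.date_range("2024-01-01 01:00", periods=10, freq="min")
)
df = pd.DataFrame({"powerW": 1.0}, index=idx)

# Count minutes of real data per hour and keep only complete hours
minutes_per_hour = df["powerW"].resample("h").count()
complete_hours = minutes_per_hour[
    minutes_per_hour >= complete_hour_threshold * 60
].index
hourly_df = df.resample("h").mean().loc[complete_hours]
```

Hour 0 (50 of 60 minutes) survives the 0.8 threshold; hour 1 (10 of 60) is removed.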

ecopipeline.transform.rename_sensors(original_df: DataFrame, config: ConfigManager, site: str = '', system: str = '')

Rename sensor columns from their raw aliases to their true names.

Reads the Variable_Names.csv file via config, renames columns from variable_alias to variable_name, drops columns with no matching true name, and optionally filters by site and/or system.

Parameters:
original_dfpd.DataFrame

A dataframe containing data labeled by raw variable names to be renamed.

configConfigManager

The ConfigManager object that holds configuration data for the pipeline. Points to a file called Variable_Names.csv in the pipeline’s input folder. The CSV must have at least two columns: variable_alias (the raw name to change from) and variable_name (the name to change to). Columns without a corresponding variable_name are dropped.

sitestr, optional

Site name to filter by. If provided, only rows whose site column matches this value are retained. Leave as an empty string if not applicable.

systemstr, optional

System name to filter by. If provided, only rows whose system column contains this string are retained. Leave as an empty string if not applicable.

Returns:
pd.DataFrame

Dataframe filtered by site and system (if applicable) with column names matching those specified in Variable_Names.csv.

Raises:
Exception

If the Variable_Names.csv file is not found at the path provided by config.
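The rename-and-drop behavior is equivalent to this pandas sketch, using a hand-built dict in place of the variable_alias/variable_name columns of Variable_Names.csv (the alias and column names are hypothetical):

```python
import pandas as pd

# Hypothetical alias-to-name mapping, standing in for the
# variable_alias / variable_name columns of Variable_Names.csv
alias_map = {"T1": "airTemp_F", "RH1": "Humidity_ODRH"}

df = pd.DataFrame({"T1": [70.2], "RH1": [41.0], "spare": [0.0]})

# Rename known aliases, then drop columns with no matching true name
df = df.rename(columns=alias_map)
df = df[[c for c in df.columns if c in alias_map.values()]]
```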

ecopipeline.transform.replace_humidity(df: DataFrame, od_conditions: DataFrame, date_forward: datetime, site_name: str) DataFrame

Replace all humidity readings for a given site after a given datetime.

Parameters:
dfpd.DataFrame

Dataframe containing the raw sensor data.

od_conditionspd.DataFrame

DataFrame containing outdoor conditions measured by field sensors.

date_forwarddt.datetime

Datetime containing the time after which all humidity readings should be replaced.

site_namestr

String containing the name of the site for which humidity values are to be replaced.

Returns:
pd.DataFrame

Modified DataFrame where the Humidity_ODRH column contains the field readings after the given datetime.
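The replacement step can be sketched as a masked assignment in pandas. This is an illustration only; the od_conditions column name used here is hypothetical:

```python
import datetime as dt
import pandas as pd

date_forward = dt.datetime(2024, 6, 1)

idx = pd.to_datetime(["2024-05-31 12:00", "2024-06-02 12:00"])
df = pd.DataFrame({"Humidity_ODRH": [40.0, 40.0]}, index=idx)
# Field-sensor outdoor conditions; the column name is a placeholder
od_conditions = pd.DataFrame({"field_humidity": [55.0, 65.0]}, index=idx)

# Overwrite humidity readings at or after the cutoff with field readings
mask = df.index >= date_forward
df.loc[mask, "Humidity_ODRH"] = od_conditions.loc[mask, "field_humidity"]
```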

ecopipeline.transform.round_time(df: DataFrame)

Round a dataframe’s DatetimeIndex down to the nearest minute, in place.

Parameters:
dfpd.DataFrame

A dataframe indexed by datetimes. All timestamps will be floored to the nearest minute.

Returns:
bool

True if the index has been rounded down, False if the operation failed (e.g. if df was empty).
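The flooring operation is equivalent to calling DatetimeIndex.floor on the index (round_time performs it in place and reports success via its boolean return):

```python
import pandas as pd

idx = pd.to_datetime(["2024-01-01 00:00:37", "2024-01-01 00:01:59"])
df = pd.DataFrame({"powerW": [1.0, 2.0]}, index=idx)

# Floor every timestamp to the minute, discarding seconds
df.index = df.index.floor("min")
```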

ecopipeline.transform.sensor_adjustment(df: DataFrame, config: ConfigManager) DataFrame

Apply sensor adjustments from adjustments.csv to the dataframe.

Deprecated: This function is scheduled for removal. Use a more explicit adjustment approach instead.

Parameters:
dfpd.DataFrame

Dataframe to be adjusted.

configConfigManager

The ConfigManager object that holds configuration data for the pipeline. Points to a file called adjustments.csv in the pipeline’s input folder (e.g. "full/path/to/pipeline/input/adjustments.csv").

Returns:
pd.DataFrame

Adjusted dataframe.

ecopipeline.transform.shift_accumulative_columns(df: DataFrame, column_names: list = [])

Convert accumulative columns to period-difference (non-cumulative) values.

Parameters:
dfpd.DataFrame

Pandas dataframe of sensor data.

column_nameslist, optional

Names of columns to convert from cumulative-sum data to non-cumulative difference data. If an empty list is provided, all columns are converted. Defaults to [].

Returns:
pd.DataFrame

Dataframe with the specified columns (or all columns) converted from cumulative to period-difference values.
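The cumulative-to-difference conversion corresponds to a pandas diff, sketched below for a single hypothetical meter column (the actual function may treat the first, undefined period differently):

```python
import pandas as pd

# A cumulative energy meter reading
df = pd.DataFrame({"Energy_kWh": [10.0, 12.5, 12.5, 15.0]})

# Convert the running total into per-period consumption;
# the first period has no predecessor and becomes NaN
df["Energy_kWh"] = df["Energy_kWh"].diff()
```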

ecopipeline.transform.site_specific(df: DataFrame, site: str) DataFrame

Performs site-specific calculations for LBNL sites. The site name is matched using regular expressions.

Parameters:
dfpd.DataFrame

Pandas dataframe of sensor data.

sitestr

Site name as a string.

Returns:
pd.DataFrame

Modified dataframe with site-specific calculations applied.
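The regex-based site matching can be sketched as a dispatch function. The site names and branches below are entirely hypothetical; the real calculations are internal to the LBNL pipeline:

```python
import re

def site_specific_sketch(site: str) -> str:
    """Hypothetical dispatch mirroring regex matching on the site name."""
    if re.search(r"^lab_a", site):
        return "lab_a calculations"
    if re.search(r"^lab_b", site):
        return "lab_b calculations"
    return "no site-specific calculations"
```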

ecopipeline.transform.verify_power_energy(df: DataFrame, config: ConfigManager)

Verifies that, for each timestamp, corresponding power and energy variables are consistent with one another (power ≈ energy × 60 for minute-level data; the margin of error is TBD). Any rows with conflicting power and energy values are written to a CSV file.

Prereq:

The input dataframe MUST have had get_energy_by_min() called on it previously.

Parameters:
dfpd.DataFrame

Pandas dataframe

configecopipeline.ConfigManager

The ConfigManager object that holds configuration data for the pipeline

Returns:
None
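The power-vs-energy consistency test can be sketched in pandas. The column names and the 5% relative margin below are hypothetical placeholders (the documentation notes the real margin of error is TBD):

```python
import pandas as pd

margin = 0.05  # hypothetical relative margin of error

df = pd.DataFrame({
    "PowerIn_W":   [6000.0, 6000.0],
    "EnergyIn_Wh": [100.0, 50.0],  # per-minute energy
})

# For minute-level data, power should be roughly energy * 60;
# rows outside the margin are flagged as conflicts
expected_power = df["EnergyIn_Wh"] * 60
conflicts = df[(df["PowerIn_W"] - expected_power).abs() > margin * expected_power]
```

Here the second row (6000 W against an expected 3000 W) would be written to the conflict CSV.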