Transform Documentation¶
- ecopipeline.transform.add_local_time(df: DataFrame, site_name: str, config: ConfigManager) DataFrame ¶
Function adds a column to the dataframe with the local time.
- Parameters:
- df : pd.DataFrame
Dataframe
- site_name : str
Site name
- config : ecopipeline.ConfigManager
The ConfigManager object that holds configuration data for the pipeline
- Returns:
- pd.DataFrame
- ecopipeline.transform.add_relative_humidity(df: DataFrame, temp_col: str = 'airTemp_F', dew_point_col: str = 'dewPoint_F', degree_f: bool = True)¶
Add a column for relative humidity to the DataFrame.
- Parameters:
- df : pd.DataFrame
DataFrame containing air temperature and dew point temperature.
- temp_col : str
Column name for air temperature.
- dew_point_col : str
Column name for dew point temperature.
- degree_f : bool
True if temperature columns are in °F, False if they are in °C.
- Returns:
- pd.DataFrame:
DataFrame with an added column for relative humidity.
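A minimal usage sketch, assuming a minute-level DataFrame that already carries the documented default columns 'airTemp_F' and 'dewPoint_F' in °F:

```python
import pandas as pd
from ecopipeline.transform import add_relative_humidity

# Example minute data using the documented default column names (values are illustrative)
df = pd.DataFrame(
    {"airTemp_F": [68.0, 70.5, 72.1], "dewPoint_F": [55.0, 56.2, 57.0]},
    index=pd.date_range("2023-01-01 00:00", periods=3, freq="min"),
)

# degree_f=True because the temperature columns are in °F
df = add_relative_humidity(df, temp_col="airTemp_F", dew_point_col="dewPoint_F", degree_f=True)
```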
- ecopipeline.transform.aggregate_df(df: DataFrame, ls_filename: str = '', complete_hour_threshold: float = 0.8, complete_day_threshold: float = 1.0, remove_partial: bool = True) (DataFrame, DataFrame) ¶
Function takes in a pandas dataframe of minute data, aggregates it into hourly and daily dataframes, appends a 'load_shift_day' column to daily_df and a 'system_state' column to hourly_df to keep track of the load shift schedule for the system, and then returns those dataframes. The returned dataframes are trimmed so that only averages from complete hours and complete days are included, rather than aggregated data from partial datasets.
- Parameters:
- df : pd.DataFrame
Single pandas dataframe of minute-by-minute sensor data.
- ls_filename : str
Path to a csv file containing the load shift schedule (e.g. "full/path/to/pipeline/input/loadshift_matrix.csv"). There should be at least four columns in this csv: 'date', 'startTime', 'endTime', and 'event'.
- complete_hour_threshold : float
Defaults to 0.8. Percent of minutes in an hour needed to count as a complete hour, expressed as a float (e.g. 80% = 0.8). Only applicable if remove_partial is set to True.
- complete_day_threshold : float
Defaults to 1.0. Percent of hours in a day needed to count as a complete day, expressed as a float (e.g. 80% = 0.8). Only applicable if remove_partial is set to True.
- remove_partial : bool
Defaults to True. Removes partial days and hours from the aggregated dataframes.
- Returns:
- daily_dfpd.DataFrame
agregated daily dataframe that contains all daily information as well as the ‘load_shift_day’ column if relevant to the data set.
- hourly_dfpd.DataFrame
agregated hourly dataframe that contains all hourly information as well as the ‘system_state’ column if relevant to the data set.
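A hedged call-pattern sketch, assuming df is a timestamp-indexed minute-level DataFrame; the load shift CSV path is hypothetical, and the unpacking order shown follows the Returns listing above (daily, then hourly):

```python
from ecopipeline.transform import aggregate_df

# Aggregate minute data into daily and hourly frames; hours below 80% completeness
# and days below 100% completeness are trimmed because remove_partial is True.
daily_df, hourly_df = aggregate_df(
    df,
    ls_filename="full/path/to/pipeline/input/loadshift_matrix.csv",  # hypothetical path
    complete_hour_threshold=0.8,
    complete_day_threshold=1.0,
    remove_partial=True,
)
```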
- ecopipeline.transform.aggregate_values(df: DataFrame, thermo_slice: str) DataFrame ¶
Gets daily average of data for all relevant variables.
- Parameters:
- df : pd.DataFrame
Pandas DataFrame of minute-by-minute data
- thermo_slice : str
Indicates the time at which slicing begins. If None, no slicing is performed. The format of the thermo_slice string is "HH:MM AM/PM".
- Returns:
- pd.DataFrame:
Pandas DataFrame which contains the aggregated hourly data.
- ecopipeline.transform.apply_equipment_cop_derate(df: DataFrame, equip_cop_col: str, r_val: int = 16) DataFrame ¶
Function derates the equipment-method system COP based on R value:
- R12 - R16 : 12%
- R16 - R20 : 10%
- R20 - R24 : 8%
- R24 - R28 : 6%
- R28 - R32 : 4%
- > R32 : 2%
- Parameters:
- df : pd.DataFrame
Dataframe
- equip_cop_col : str
Name of the COP column to derate
- r_val : int
R value, defaults to 16
- Returns:
- pd.DataFrame
df with equip_cop_col derated
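For illustration, a sketch that derates a hypothetical 'COP_Equipment' column for a system with an R20 envelope (the R20 - R24 band, an 8% derate):

```python
from ecopipeline.transform import apply_equipment_cop_derate

# 'COP_Equipment' is a hypothetical column name; r_val=20 selects the 8% derate band
df = apply_equipment_cop_derate(df, equip_cop_col="COP_Equipment", r_val=20)
```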
- ecopipeline.transform.aqsuite_filter_new(last_date: str, filenames: List[str], site: str, config: ConfigManager) List[str] ¶
Function filters the filenames list to only those newer than the last date.
- Parameters:
- last_date : str
Latest date loaded prior to the current runtime
- filenames : List[str]
List of filenames to be filtered
- site : str
Site name
- config : ecopipeline.ConfigManager
The ConfigManager object that holds configuration data for the pipeline
- Returns:
- List[str]:
Filtered list of filenames
- ecopipeline.transform.aqsuite_prep_time(df: DataFrame) DataFrame ¶
Function takes an Aqsuite dataframe, converts the time column into datetime type, and sorts the entire dataframe by time.
- Prereq:
Input dataframe MUST be an Aqsuite dataframe whose columns have not yet been renamed
- Parameters:
- df : pd.DataFrame
Aqsuite DataFrame
- Returns:
- pd.DataFrame:
Aqsuite dataframe sorted by its time column, converted to datetime type
- ecopipeline.transform.avg_duplicate_times(df: DataFrame, timezone: str) DataFrame ¶
Function will take in a dataframe and look for duplicate timestamps (usually due to daylight savings or rounding). The dataframe will be altered so that each duplicated timestamp has just one row, taking the average of the values between the duplicate timestamps for each column.
- Parameters:
- df: pd.DataFrame
Pandas dataframe to be altered
- timezone: str
The timezone for the indexes in the output dataframe as a string. Must be a string recognized as a time stamp by the pandas tz_localize() function https://pandas.pydata.org/docs/reference/api/pandas.Series.tz_localize.html
- Returns:
- pd.DataFrame:
Pandas dataframe with all duplicate timestamps compressed into one, averaging the data values
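A brief sketch, assuming a timestamp-indexed DataFrame and a timezone string accepted by pandas tz_localize():

```python
from ecopipeline.transform import avg_duplicate_times

# Collapse duplicate timestamps (e.g. from daylight savings) into one row of averaged values
df = avg_duplicate_times(df, timezone="America/Los_Angeles")
```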
- ecopipeline.transform.calculate_cop_values(df: DataFrame, heatLoss_fixed: int, thermo_slice: str) DataFrame ¶
Performs COP calculations using the daily aggregated data.
- Parameters:
- df : pd.DataFrame
Pandas DataFrame to add COP columns to
- heatLoss_fixed : int
Fixed heat loss value
- thermo_slice : str
The time at which slicing begins, if thermo slicing is desired.
- Returns:
- pd.DataFrame:
Pandas DataFrame with the added COP columns.
- ecopipeline.transform.change_ID_to_HVAC(df: DataFrame, site_info: Series) DataFrame ¶
Function takes in a site dataframe along with the site information series and assigns a unique event_ID value whenever the system changes state.
- Parameters:
- df : pd.DataFrame
Pandas Dataframe
- site_info : pd.Series
site_info.csv as a pd.Series
- Returns:
- pd.DataFrame:
modified Pandas Dataframe
- ecopipeline.transform.concat_last_row(df: DataFrame, last_row: DataFrame) DataFrame ¶
This function takes in a dataframe with new data and a second dataframe meant to be the last row from the database the new data is being processed for. The two dataframes are then concatenated such that the new data can later be forward filled from the info in the last row.
- Parameters:
- df : pd.DataFrame
Dataframe with new data that needs to be forward filled from data in the last row of a database
- last_row : pd.DataFrame
Last row of the database to forward fill from, as a pandas dataframe
- Returns:
- pd.DataFrame:
Pandas dataframe with last row concatenated
- ecopipeline.transform.condensate_calculations(df: DataFrame, site: str, site_info: Series) DataFrame ¶
Calculates condensate values for the given dataframe
- Parameters:
- df : pd.DataFrame
Dataframe to be modified
- site : str
Name of site
- site_info : pd.Series
Series of site info
- Returns:
- pd.DataFrame:
modified dataframe
- ecopipeline.transform.convert_c_to_f(df: DataFrame, column_names: list) DataFrame ¶
Function takes in a pandas dataframe of data and a list of column names to convert from degrees Celsius to Fahrenheit.
- Parameters:
- df : pd.DataFrame
Single pandas dataframe of sensor data.
- column_names : list of strings
List of columns with data currently in Celsius that need to be converted to Fahrenheit
- Returns:
- pd.DataFrame: Dataframe with specified columns converted from Celsius to Fahrenheit.
- ecopipeline.transform.convert_l_to_g(df: DataFrame, column_names: list) DataFrame ¶
Function takes in a pandas dataframe of data and a list of column names to convert from Liters to Gallons.
- Parameters:
- df : pd.DataFrame
Single pandas dataframe of sensor data.
- column_names : list of strings
List of columns with data currently in Liters that need to be converted to Gallons
- Returns:
- pd.DataFrame: Dataframe with specified columns converted from Liters to Gallons.
- ecopipeline.transform.convert_on_off_col_to_bool(df: DataFrame, column_names: list) DataFrame ¶
Function takes in a pandas dataframe of data and a list of column names to convert from the strings "ON" and "OFF" to the boolean values True and False respectively.
- Parameters:
- df : pd.DataFrame
Single pandas dataframe of sensor data.
- column_names : list of strings
List of columns with data currently as the strings "ON" and "OFF" that need to be converted to boolean values
- Returns:
- pd.DataFrame: Dataframe with specified columns converted from "ON"/"OFF" strings to boolean values.
- ecopipeline.transform.convert_time_zone(df: DataFrame, tz_convert_from: str = 'UTC', tz_convert_to: str = 'America/Los_Angeles') DataFrame ¶
Converts a dataframe's indexed timezone from tz_convert_from to tz_convert_to.
- Parameters:
- df : pd.DataFrame
Single pandas dataframe of sensor data.
- tz_convert_from : str
String value of the timezone the data is currently in
- tz_convert_to : str
String value of the timezone the data should be converted to
- Returns:
- pd.DataFrame:
The dataframe with its index converted to the appropriate timezone.
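A minimal sketch using the documented defaults (UTC in, US Pacific out):

```python
from ecopipeline.transform import convert_time_zone

# Convert the timestamp index from UTC to America/Los_Angeles
df = convert_time_zone(df, tz_convert_from="UTC", tz_convert_to="America/Los_Angeles")
```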
- ecopipeline.transform.cop_method_1(df: DataFrame, recircLosses, heatout_primary_column: str = 'HeatOut_Primary', total_input_power_column: str = 'PowerIn_Total') DataFrame ¶
Performs COP calculation method 1 (original AWS method).
- Parameters:
- df : pd.DataFrame
Pandas dataframe representing daily averaged values from the datastream to add COP columns to. Adds a column called 'COP_DHWSys_1' to the dataframe in place. The dataframe needs to already have the two columns 'HeatOut_Primary' and 'PowerIn_Total' to calculate COP_DHWSys_1.
- recircLosses : float or pd.Series
If the fixed temperature maintenance recirculation loss value comes from a spot measurement, this should be a float. If recirculation loss measurements are in the datastream, this should be a column of df. Units should be in kW.
- heatout_primary_column : str
Name of the column that contains the output power of the primary system in kW. Defaults to 'HeatOut_Primary'
- total_input_power_column : str
Name of the column that contains the total input power of the system in kW. Defaults to 'PowerIn_Total'
- Returns:
- pd.DataFrame: Dataframe with added column for system COP called COP_DHWSys_1
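A hedged usage sketch, assuming daily_df holds daily-averaged values with the documented 'HeatOut_Primary' and 'PowerIn_Total' columns; the recirculation loss shown is a hypothetical fixed spot measurement in kW:

```python
from ecopipeline.transform import cop_method_1

# Fixed recirculation loss from a spot measurement (hypothetical value, in kW)
daily_df = cop_method_1(
    daily_df,
    recircLosses=0.5,
    heatout_primary_column="HeatOut_Primary",
    total_input_power_column="PowerIn_Total",
)
# daily_df now includes the 'COP_DHWSys_1' column
```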
- ecopipeline.transform.cop_method_2(df: DataFrame, cop_tm, cop_primary_column_name) DataFrame ¶
Performs COP calculation method 2 as defined by Scott's whiteboard image: COP = COP_primary * (ELEC_primary / ELEC_total) + COP_tm * (ELEC_tm / ELEC_total)
- Parameters:
- df : pd.DataFrame
Pandas DataFrame to add COP columns to. The dataframe needs to have a column for the COP of the primary system (see cop_primary_column_name) as well as a column called 'PowerIn_Total' for the total system power. It also needs columns prefixed with 'PowerIn_HPWH' or 'PowerIn_SecLoopPump' for power readings taken for HPWHs/primary systems, and columns prefixed with 'PowerIn_SwingTank' or 'PowerIn_ERTank' for power readings taken for temperature maintenance systems.
- cop_tm : float
Fixed COP value for the temperature maintenance system
- cop_primary_column_name : str
Name of the column used for COP_Primary values
- Returns:
- pd.DataFrame: Dataframe with added column for system COP called COP_DHWSys_2
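A sketch of the call, assuming daily_df already has a primary COP column (the name 'COP_Primary' here is hypothetical), a 'PowerIn_Total' column, and the documented 'PowerIn_HPWH*' / 'PowerIn_SwingTank*' style power columns; cop_tm is a hypothetical fixed value:

```python
from ecopipeline.transform import cop_method_2

# cop_tm is a fixed temperature-maintenance COP (hypothetical value)
daily_df = cop_method_2(daily_df, cop_tm=1.0, cop_primary_column_name="COP_Primary")
# daily_df now includes the 'COP_DHWSys_2' column
```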
- ecopipeline.transform.create_data_statistics_df(df: DataFrame) DataFrame ¶
Function must be called on the raw minute data df after rename_varriables() and before ffill_missing() has been called. The function returns a dataframe indexed by day. Each column is expanded into 3 columns, appended with '_missing_mins', '_avg_gap', and '_max_gap' respectively. The columns carry the following statistics:
- _missing_mins -> the number of minutes in the day that have no reported data value for the column
- _avg_gap -> the average gap (in minutes) between collected data values that day
- _max_gap -> the maximum gap (in minutes) between collected data values that day
- Parameters:
- df : pd.DataFrame
Minute data df after rename_varriables() and before ffill_missing() has been called
- Returns:
- daily_data_stats : pd.DataFrame
New dataframe with the columns described in the function's description
- ecopipeline.transform.create_fan_curves(cfm_info: DataFrame, site_info: Series) DataFrame ¶
Create fan curves for each site.
- Parameters:
- cfm_info : pd.DataFrame
DataFrame of fan curve information.
- site_info : pd.Series
Series containing the site information.
- Returns:
- pd.DataFrame:
Dataframe containing the fan curves for each site.
- ecopipeline.transform.create_summary_tables(df: DataFrame)¶
Revamped version of “aggregate_data” function. Creates hourly and daily summary tables.
- Parameters:
- df : pd.DataFrame
Single pandas dataframe of minute-by-minute sensor data.
- Returns:
- pd.DataFrame:
Two pandas dataframes, one of hourly and one of daily aggregated sensor data.
- ecopipeline.transform.delete_erroneous_from_time_pt(df: DataFrame, time_point: Timestamp, column_names: list, new_value=None) DataFrame ¶
Function will take a pandas dataframe and delete specified erroneous values at a specified time point.
- Parameters:
- df: pd.DataFrame
Timestamp indexed Pandas dataframe that needs to have an erroneous value removed
- time_point : pd.Timestamp
The timestamp index at which the erroneous values occur
- column_names : list
List of column names, as strings, that contain erroneous values at this timestamp
- new_value : any
New value to populate the erroneous columns at this timestamp with. If set to None, the values will be replaced with NaN
- Returns:
- pd.DataFrame:
Pandas dataframe with error values replaced with new value
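For illustration, assuming a timestamp-indexed DataFrame with a hypothetical 'Temp_PrimaryOut' column holding a known bad reading:

```python
import pandas as pd
from ecopipeline.transform import delete_erroneous_from_time_pt

# Replace the bad reading at this timestamp with NaN (new_value=None)
df = delete_erroneous_from_time_pt(
    df,
    time_point=pd.Timestamp("2023-01-01 12:34:00"),
    column_names=["Temp_PrimaryOut"],
    new_value=None,
)
```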
- ecopipeline.transform.elev_correction(site_name: str, config: ConfigManager) DataFrame ¶
Function creates a dataframe for a given site that contains site name, elevation, and the corrected elevation.
- Parameters:
- site_name : str
Site's name
- config : ecopipeline.ConfigManager
The ConfigManager object that holds configuration data for the pipeline
- Returns:
- pd.DataFrame:
new Pandas dataframe
- ecopipeline.transform.ffill_missing(original_df: DataFrame, config: ConfigManager, previous_fill: DataFrame = None) DataFrame ¶
Function will take a pandas dataframe and forward fill select variables with no entry.
- Parameters:
- original_df: pd.DataFrame
Pandas dataframe that needs to be forward filled
- config : ecopipeline.ConfigManager
The ConfigManager object that holds configuration data for the pipeline. Among other things, this object will point to a file called Variable_Names.csv in the input folder of the pipeline (e.g. "full/path/to/pipeline/input/Variable_Names.csv"). There should be at least three columns in this csv: "variable_name", "changepoint", and "ffill_length". The variable_name column should contain the name of each variable in the dataframe that requires forward filling. The changepoint column should contain one of three values:
- "0" if the variable should be forward filled to a certain length (see ffill_length).
- "1" if the variable should be forward filled completely until the next change point.
- null if the variable should not be forward filled.
The ffill_length column contains the number of rows which should be forward filled if the value in the changepoint column is "0".
- previous_fill: pd.DataFrame (default None)
A pandas dataframe with the same index type and at least some of the same columns as original_df (usually taken as the last entry from the pipeline that has been put into the destination database). The values of this will be used to forward fill into the new set of data if applicable.
- Returns:
- pd.DataFrame:
Pandas dataframe that has been forward filled to the specifications detailed in the Variable_Names.csv file
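A hedged sketch of the call pattern, assuming config is an already-constructed ecopipeline.ConfigManager whose input folder contains Variable_Names.csv, and last_row is the most recent row previously written to the destination database:

```python
from ecopipeline.transform import ffill_missing

# Forward fill according to the changepoint / ffill_length rules in Variable_Names.csv,
# seeding the fill from the last row already stored in the database
df = ffill_missing(df, config, previous_fill=last_row)
```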
- ecopipeline.transform.flag_dhw_outage(df: DataFrame, daily_df: DataFrame, dhw_outlet_column: str, supply_temp: int = 110, consecutive_minutes: int = 15) DataFrame ¶
- Parameters:
- df : pd.DataFrame
Single pandas dataframe of sensor data on minute intervals.
- daily_df : pd.DataFrame
Single pandas dataframe of sensor data on daily intervals.
- dhw_outlet_column : str
Name of the column in df and daily_df that contains the temperature of DHW supplied to building occupants
- supply_temp : int
The minimum DHW temperature acceptable to supply to building occupants
- consecutive_minutes : int
The number of minutes in a row that DHW is not delivered to tenants to qualify as a DHW outage
- Returns:
- event_df : pd.DataFrame
Dataframe with ‘ALARM’ events on the days in which there was a DHW Outage.
- ecopipeline.transform.gas_valve_diff(df: DataFrame, site: str, config: ConfigManager) DataFrame ¶
Function takes in the site dataframe and the site name. If the site has gas heating, take the lagged difference to get per minute values.
- Parameters:
- df : pd.DataFrame
Dataframe for site
- site : str
Site name as string
- config : ecopipeline.ConfigManager
The ConfigManager object that holds configuration data for the pipeline
- Returns:
- pd.DataFrame:
modified Pandas Dataframe
- ecopipeline.transform.gather_outdoor_conditions(df: DataFrame, site: str) DataFrame ¶
Function takes in a site dataframe and site name as a string. Returns a new dataframe that contains time_utc, <site>_ODT, and <site>_ODRH for the site.
- Parameters:
- df : pd.DataFrame
Pandas Dataframe
- site : str
Site name as string
- Returns:
- pd.DataFrame:
new Pandas Dataframe
- ecopipeline.transform.generate_event_log_df(config: ConfigManager)¶
Creates an event log df based on user submitted events in an event log csv.
- Parameters:
- config : ecopipeline.ConfigManager
The ConfigManager object that holds configuration data for the pipeline.
- Returns:
- event_df : pd.DataFrame
Dataframe formatted from events in Event_log.csv for pipeline.
- ecopipeline.transform.get_cfm_values(df, site_cfm, site_info, site)¶
- ecopipeline.transform.get_cop_values(df: DataFrame, site_info: DataFrame)¶
- ecopipeline.transform.get_energy_by_min(df: DataFrame) DataFrame ¶
Energy is recorded cumulatively. Function takes the lagged differences in order to get a per-minute value for each of the energy variables.
- Parameters:
- df : pd.DataFrame
Pandas dataframe
- Returns:
- pd.DataFrame:
Pandas dataframe
- ecopipeline.transform.get_hvac_state(df: DataFrame, site_info: Series) DataFrame ¶
- ecopipeline.transform.get_refrig_charge(df: DataFrame, site: str, config: ConfigManager) DataFrame ¶
Function takes in a site dataframe, its site name as a string, and the pipeline ConfigManager (which locates site_info.csv, superheat.csv, and 410a_pt.csv), and calculates the refrigerant charge per minute.
- Parameters:
- df : pd.DataFrame
Pandas Dataframe
- site : str
Site name as a string
- config : ecopipeline.ConfigManager
The ConfigManager object that holds configuration data for the pipeline
- Returns:
- pd.DataFrame:
modified Pandas Dataframe
- ecopipeline.transform.get_site_cfm_info(site: str, config: ConfigManager) DataFrame ¶
Returns a dataframe of the site cfm information for the given site. NOTE: The parsing is necessary because the first row of data contains comments that need to be dropped.
- Parameters:
- site : str
The site name
- config : ecopipeline.ConfigManager
The ConfigManager object that holds configuration data for the pipeline
- Returns:
- df : pd.DataFrame
The DataFrame of the site cfm information
- ecopipeline.transform.get_site_info(site: str, config: ConfigManager) Series ¶
Returns a series of the site information for the given site
- Parameters:
- site : str
The site name
- config : ecopipeline.ConfigManager
The ConfigManager object that holds configuration data for the pipeline
- Returns:
- df : pd.Series
The Series of the site information
- ecopipeline.transform.get_storage_gals120(df: DataFrame, location: Series, gals: int, total: int, zones: Series) DataFrame ¶
Function that creates and appends the Gals120 data onto the Dataframe
- Parameters:
- df : pd.DataFrame
A Pandas Dataframe
- location : pd.Series
- gals : int
- total : int
- zones : pd.Series
- Returns:
- pd.DataFrame:
a Pandas Dataframe
- ecopipeline.transform.get_temp_zones120(df: DataFrame) DataFrame ¶
Function that keeps track of the average temperature of each zone. For this function to work, naming conventions for each parallel tank must include 'Temp1' as the temperature at the top of the tank, 'Temp5' as that at the bottom of the tank, and 'Temp2'-'Temp4' as the temperatures in between.
- Parameters:
- df : pd.DataFrame
A Pandas Dataframe
- Returns:
- pd.DataFrame:
a Pandas Dataframe
- ecopipeline.transform.heat_output_calc(df: DataFrame, flow_var: str, hot_temp: str, cold_temp: str, heat_out_col_name: str, return_as_kw: bool = True) DataFrame ¶
Function will take a flow variable and two temperature inputs to calculate heat output.
- Parameters:
- df: pd.DataFrame
Pandas dataframe with minute-to-minute data
- flow_var : str
The column name of the flow variable for the calculation. Units of the column should be gal/min
- hot_temp : str
The column name of the hot temperature variable for the calculation. Units of the column should be degrees F
- cold_temp : str
The column name of the cold temperature variable for the calculation. Units of the column should be degrees F
- heat_out_col_name : str
The new column name for the heat output calculated from the variables
- return_as_kw : bool
Set to True for the new heat output column to have kW units. Set to False to return the column in BTU/hr
- Returns:
- pd.DataFrame:
Pandas dataframe with new heat output column of specified name.
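A sketch using hypothetical column names for the flow (gal/min) and hot/cold temperature (°F) inputs:

```python
from ecopipeline.transform import heat_output_calc

# Compute heat output in kW from flow and temperature columns (column names are hypothetical)
df = heat_output_calc(
    df,
    flow_var="Flow_Primary",
    hot_temp="Temp_PrimaryHotOut",
    cold_temp="Temp_PrimaryColdIn",
    heat_out_col_name="HeatOut_Primary",
    return_as_kw=True,
)
```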
- ecopipeline.transform.join_to_daily(daily_data: DataFrame, cop_data: DataFrame) DataFrame ¶
Function left-joins the daily data and COP data.
- Parameters:
- daily_data : pd.DataFrame
Daily dataframe
- cop_data : pd.DataFrame
cop_values dataframe
- Returns:
- pd.DataFrame
A single, joined dataframe
- ecopipeline.transform.join_to_hourly(hourly_data: DataFrame, noaa_data: DataFrame) DataFrame ¶
Function left-joins the weather data to the hourly dataframe.
- Parameters:
- hourly_data : pd.DataFrame
Hourly dataframe
- noaa_data : pd.DataFrame
NOAA weather dataframe
- Returns:
- pd.DataFrame:
A single, joined dataframe
- ecopipeline.transform.lbnl_pressure_conversions(df: DataFrame) DataFrame ¶
- ecopipeline.transform.lbnl_sat_calculations(df: DataFrame) DataFrame ¶
- ecopipeline.transform.lbnl_temperature_conversions(df: DataFrame) DataFrame ¶
- ecopipeline.transform.merge_indexlike_rows(df: DataFrame) DataFrame ¶
Merges index-like rows together ensuring that all relevant information for a certain timestamp is stored in one row - not in multiple rows. It also rounds the timestamps to the nearest minute.
- Parameters:
- df : pd.DataFrame
The DataFrame whose index-like rows should be merged.
- Returns:
- df : pd.DataFrame
The DataFrame with all index-like rows merged.
- ecopipeline.transform.nclarity_csv_to_df(csv_filenames: List[str]) DataFrame ¶
Function takes a list of csv filenames containing nclarity data and reads all files into a singular dataframe.
- Parameters:
- csv_filenames : List[str]
List of filenames
- Returns:
- pd.DataFrame:
Pandas Dataframe containing data from all files
- ecopipeline.transform.nclarity_filter_new(date: str, filenames: List[str]) List[str] ¶
Function filters the filenames list to only those from the given date or later.
- Parameters:
- date : str
Target date
- filenames : List[str]
List of filenames to be filtered
- Returns:
- List[str]:
Filtered list of filenames
- ecopipeline.transform.nullify_erroneous(original_df: DataFrame, config: ConfigManager) DataFrame ¶
Function will take a pandas dataframe and make erroneous values NaN.
- Parameters:
- original_df: pd.DataFrame
Pandas dataframe that needs to be filtered for error values
- config : ecopipeline.ConfigManager
The ConfigManager object that holds configuration data for the pipeline. Among other things, this object will point to a file called Variable_Names.csv in the input folder of the pipeline (e.g. "full/path/to/pipeline/input/Variable_Names.csv"). There should be at least two columns in this csv: "variable_name" and "error_value". The variable_name column should contain the names of all columns in the dataframe that need to have their erroneous values removed. The error_value column should contain the error value of each variable_name, or null if there isn't an error value for that variable.
- Returns:
- pd.DataFrame:
Pandas dataframe with error values replaced with NaNs
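A minimal sketch, assuming config is an existing ecopipeline.ConfigManager whose Variable_Names.csv lists error_value entries for the affected columns:

```python
from ecopipeline.transform import nullify_erroneous

# Any reading equal to its column's documented error_value becomes NaN
df = nullify_erroneous(df, config)
```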
- ecopipeline.transform.remove_outliers(original_df: DataFrame, config: ConfigManager, site: str = '') DataFrame ¶
Function will take a pandas dataframe and location of bounds information in a csv, store the bounds data in a dataframe, then remove outliers above or below bounds as designated by the csv. Function then returns the resulting dataframe.
- Parameters:
- original_df: pd.DataFrame
Pandas dataframe for which outliers need to be removed
- config : ecopipeline.ConfigManager
The ConfigManager object that holds configuration data for the pipeline. Among other things, this object will point to a file called Variable_Names.csv in the input folder of the pipeline (e.g. "full/path/to/pipeline/input/Variable_Names.csv"). The file must have at least three columns titled "variable_name", "lower_bound", and "upper_bound", which should contain the name of each variable in the dataframe that requires the removal of outliers, the lower bound for acceptable data, and the upper bound for acceptable data respectively.
- site: str
String of site name if processing a particular site in a Variable_Names.csv file with multiple sites. Leave as an empty string if not applicable.
- Returns:
- pd.DataFrame:
Pandas dataframe with outliers removed and replaced with nans
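A brief sketch, assuming config is an existing ConfigManager and Variable_Names.csv carries lower_bound / upper_bound columns; the site argument is only needed when the csv covers multiple sites:

```python
from ecopipeline.transform import remove_outliers

# Values outside each variable's [lower_bound, upper_bound] range are replaced with NaN
df = remove_outliers(df, config, site="")
```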
- ecopipeline.transform.remove_partial_days(df, hourly_df, daily_df, complete_hour_threshold: float = 0.8, complete_day_threshold: float = 1.0, partial_day_removal_exclusion: list = [])¶
Helper function for removing daily and hourly values that are calculated from incomplete data.
- Parameters:
- df : pd.DataFrame
Single pandas dataframe of minute-by-minute sensor data.
- daily_df : pd.DataFrame
Aggregated daily dataframe that contains all daily information.
- hourly_df : pd.DataFrame
Aggregated hourly dataframe that contains all hourly information.
- complete_hour_threshold : float
Defaults to 0.8. Percent of minutes in an hour needed to count as a complete hour, expressed as a float (e.g. 80% = 0.8)
- complete_day_threshold : float
Defaults to 1.0. Percent of hours in a day needed to count as a complete day, expressed as a float (e.g. 80% = 0.8)
- partial_day_removal_exclusion : list[str]
List of column names to ignore when searching through columns to remove sections without enough data
- ecopipeline.transform.rename_sensors(original_df: DataFrame, config: ConfigManager, site: str = '', system: str = '')¶
Function will take in a dataframe and the pipeline's ConfigManager and rename sensors from their alias to their true name. Also filters the dataframe by site and system if specified.
- Parameters:
- original_df: pd.DataFrame
A dataframe that contains data labeled by the raw variable names to be renamed.
- config : ecopipeline.ConfigManager
The ConfigManager object that holds configuration data for the pipeline. Among other things, this object will point to a file called Variable_Names.csv in the input folder of the pipeline (e.g. "full/path/to/pipeline/input/Variable_Names.csv"). The csv this points to should have at least 2 columns called "variable_alias" (the raw name to be changed from) and "variable_name" (the name to be changed to). All columns without a corresponding variable_name will be dropped from the dataframe.
- site: str
If the pipeline is processing data for a particular site with a dataframe that contains data from multiple sites that need to be processed separately, fill in this optional variable to drop data from all other sites in the returned dataframe. Appropriate variables in your Variable_Names.csv must have a matching substring to this variable in a column called "site".
- system: str
If the pipeline is processing data for a particular system with a dataframe that contains data from multiple systems that need to be processed separately, fill in this optional variable to drop data from all other systems in the returned dataframe. Appropriate variables in your Variable_Names.csv must have a matching string to this variable in a column called "system".
- Returns:
- df: pd.DataFrame
Pandas dataframe that has been filtered by site and system (if either are applicable) with column names that match those specified in Variable_Names.csv.
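A hedged example of a typical call early in a pipeline, assuming config is an existing ConfigManager and that the site and system strings shown (both hypothetical) match the "site" and "system" columns of Variable_Names.csv:

```python
from ecopipeline.transform import rename_sensors

# Rename raw aliases to their true variable names and keep only data for the
# hypothetical site "site_A" and system "primary"
df = rename_sensors(df, config, site="site_A", system="primary")
```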
- ecopipeline.transform.replace_humidity(df: DataFrame, od_conditions: DataFrame, date_forward: datetime, site_name: str) DataFrame ¶
Function replaces all humidity readings for a given site after a given datetime.
- Parameters:
- df : pd.DataFrame
Dataframe containing the raw sensor data.
- od_conditions : pd.DataFrame
DataFrame containing outdoor conditions measured by field sensors.
- date_forward : dt.datetime
Datetime containing the time after which all humidity readings should be replaced.
- site_name : str
String containing the name of the site for which humidity values are to be replaced.
- Returns:
- pd.DataFrame:
Modified DataFrame where the Humidity_ODRH column contains the field readings after the given datetime.
- ecopipeline.transform.round_time(df: DataFrame)¶
Function takes in a dataframe and rounds the datetime index down to the nearest minute. Works in place.
- Parameters:
- df : pd.DataFrame
A dataframe indexed by datetimes. These datetimes will all be rounded down to the nearest minute.
- Returns:
- boolean
Returns True if the indexes have been rounded down. Returns False if the function failed (e.g. if df was empty)
- ecopipeline.transform.sensor_adjustment(df: DataFrame, config: ConfigManager) DataFrame ¶
TO BE DEPRECATED – Reads in input/adjustments.csv and applies necessary adjustments to the dataframe
- Parameters:
- df : pd.DataFrame
DataFrame to be adjusted
- config : ecopipeline.ConfigManager
The ConfigManager object that holds configuration data for the pipeline. Among other things, this object will point to a file called adjustments.csv in the input folder of the pipeline (e.g. "full/path/to/pipeline/input/adjustments.csv")
- Returns:
- pd.DataFrame:
Adjusted Dataframe
- ecopipeline.transform.shift_accumulative_columns(df: DataFrame, column_names: list = [])¶
Converts a dataframe's accumulative columns to non-accumulative difference values.
- Parameters:
- df : pd.DataFrame
Single pandas dataframe of sensor data.
- column_names : list
The names of columns that need to be changed from accumulative sum data to non-accumulative data. Will do this to all columns if set to an empty list
- Returns:
- pd.DataFrame:
The dataframe with appropriate columns changed from accumulative sum data to non-accumulative data.
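For illustration, assuming two hypothetical cumulative energy columns:

```python
from ecopipeline.transform import shift_accumulative_columns

# Convert cumulative meter readings into per-interval differences for the listed columns
df = shift_accumulative_columns(df, column_names=["Energy_HPWH", "Energy_SwingTank"])
```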
- ecopipeline.transform.site_specific(df: DataFrame, site: str) DataFrame ¶
Does Site Specific Calculations for LBNL. The site name is searched using RegEx
- Parameters:
- df : pd.DataFrame
Dataframe of data
- site : str
Site name as a string
- Returns:
- pd.DataFrame:
modified dataframe
- ecopipeline.transform.verify_power_energy(df: DataFrame, config: ConfigManager)¶
Verifies that for each timestamp, corresponding power and energy variables are consistent with one another. Power ~= energy * 60. Margin of error TBD. Outputs to a csv file any rows with conflicting power and energy variables.
- Prereq:
Input dataframe MUST have had get_energy_by_min() called on it previously
- Parameters:
- df : pd.DataFrame
Pandas dataframe
- config : ecopipeline.ConfigManager
The ConfigManager object that holds configuration data for the pipeline
- Returns:
- None