hypehd.data_manipulation

Module Contents

Functions

handle_null(input_data, col, impute_type[, by_vars])

Replace missing values with a descriptive statistic.

change_type(df, col, col_type)

Changes the data type of the specified column of a data frame.

data_selection(input_data[, cond, keep_col, drop_col, ...])

Return a new data frame that meets the given selection criteria.

derive_baseline(input_data, base_visit, by_vars, value)

Return a new data frame of derived baseline and related variables. Function for longitudinal data analysis.

derive_extreme_flag(input_data, by_vars, sort_var, ...)

Add a variable flagging the specified observation within each by_vars group. Function for longitudinal data analysis.

time_to_event(input_data, start_date, end_date, ...)

Add a variable holding the time-to-event duration. Function for survival data analysis.

read(source[, path, sheet_name, sql, con])

Reads data from the specified path into a pandas DataFrame.

check_bias(df[, col, real_dist, n_marg, marg])

Checks data for two types of bias: too many null values and improper distribution.

numeric_to_categorical(df, col, bounds[, add])

Changes numeric data to categories. If the add option is True, the data is added as a separate column.

categorical_to_numeric(df, col, bounds[, add])

Changes categories to numbers. If the add option is True, the data is added as a separate column.

hypehd.data_manipulation.handle_null(input_data, col, impute_type, by_vars=None)

Replace missing values with a descriptive statistic.

input_data : pd.DataFrame

The input dataset.

col : str

Name of the variable to impute.

by_vars : str or list, optional

Grouping variables that identify the set of records over which the descriptive statistic is computed.

impute_type : str, one of (mean, max, min, median, remove)

The imputation method. The remove option drops the rows that contain null values.

input_data : pd.DataFrame

The same dataset after imputation.

> import numpy as np
> import pandas as pd
> df = pd.DataFrame()
> df['C0'] = [0.2601, 0.2358, 0.1429, 0.1259, 0.7526, 0.7341, 0.4546, 0.1426, 0.1490, 0.2500]
> df['C1'] = [0.7154, np.nan, 0.2615, 0.5846, np.nan, 0.8308, 0.4962, np.nan, 0.5340, 0.6731]
> handle_null(input_data=df, col="C1", impute_type="median")
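The imputation described above can be approximated in plain pandas. This is a minimal sketch, not the hypehd implementation: the helper name `impute_with_stat`, the grouping column `G`, and the sample values are all hypothetical, and the `remove` option is not covered.

```python
import numpy as np
import pandas as pd


def impute_with_stat(data, col, stat="median", by_vars=None):
    """Hypothetical helper: fill nulls in `col` with a descriptive statistic."""
    out = data.copy()
    if by_vars is None:
        # One statistic computed over the whole column.
        fill = getattr(out[col], stat)()
    else:
        # One statistic per group, aligned back to the original rows.
        fill = out.groupby(by_vars)[col].transform(stat)
    out[col] = out[col].fillna(fill)
    return out


df = pd.DataFrame({
    "G": ["a", "a", "a", "b", "b"],
    "C1": [1.0, np.nan, 3.0, 10.0, np.nan],
})
overall = impute_with_stat(df, "C1", "median")
grouped = impute_with_stat(df, "C1", "median", by_vars="G")
```

Without grouping, every null receives the column median; with `by_vars`, each null receives the median of its own group.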

hypehd.data_manipulation.change_type(df, col, col_type)

Changes the data type of the specified column of a data frame.

df : pd.DataFrame, mandatory

The dataset that will be changed.

col : str, mandatory

Name of the chosen column.

col_type : data type, mandatory

Type to change to. The available options are int, float and str.

df : pd.DataFrame

The dataset with the changed column type.

> data = change_type(data, 'sbp', int)
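For comparison, the same conversion in plain pandas is a single `astype` call. This is a sketch under the assumption that change_type behaves like `astype`; the source does not say how it is implemented, and the data frame below is made up.

```python
import pandas as pd

# Hypothetical data: blood pressure readings stored as strings.
data = pd.DataFrame({"sbp": ["120", "135", "118"]})

# Series.astype returns a new Series; assign it back to change the column.
data["sbp"] = data["sbp"].astype(int)
```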

hypehd.data_manipulation.data_selection(input_data: pandas.DataFrame, cond=None, keep_col=None, drop_col=None, sort_by=None, merge_data=None, merge_by=None, merge_keep_col=None, sort_asc=True, rename=None)

Return a new data frame that meets the given selection criteria.

input_data : pd.DataFrame

The input dataset.

cond : str, optional

The query string used to filter input_data.

keep_col : list or str, optional

Names of the variables in input_data to keep.

drop_col : list or str, optional

Names of the variables in input_data to drop.

sort_by : list or str, optional

Names of the variables, from input_data or merge_data after keeping and dropping, by which to sort the data frame.

merge_data : pd.DataFrame or Series, optional

DataFrame or named Series to merge with input_data using a database-style join. The default merge type is left.

merge_by : list or str, optional

Column names to join on.

merge_keep_col : list or str, optional

Names of the variables in merge_data to keep.

sort_asc : bool or list of bool, default True, optional

Sort ascending vs. descending. Specify a list for multiple sort orders.

rename : dict, optional

Changes column labels. The dict maps each original column name to its new name.

output_data : pd.DataFrame

Dataset with the selection criteria applied.
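The options above correspond roughly to a chain of standard pandas operations. The sketch below is only an illustration: the order in which data_selection applies its options is an assumption, and the data frames `visits` and `demog` and their columns are hypothetical.

```python
import pandas as pd

visits = pd.DataFrame({
    "patient": [1, 1, 2, 3],
    "visit": [0, 1, 0, 0],
    "sbp": [120, 125, 140, 110],
})
demog = pd.DataFrame({
    "patient": [1, 2, 3],
    "sex": ["F", "M", "F"],
    "site": ["A", "B", "A"],
})

out = (
    visits.query("visit == 0")                     # cond
    .loc[:, ["patient", "sbp"]]                    # keep_col
    .merge(demog[["patient", "sex"]],              # merge_data, merge_keep_col
           on="patient", how="left")               # merge_by, left join
    .sort_values("sbp", ascending=False)           # sort_by, sort_asc
    .rename(columns={"sbp": "systolic_bp"})        # rename
    .reset_index(drop=True)
)
```

The result keeps one baseline row per patient, adds `sex` from the second data frame, and is sorted by blood pressure in descending order.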

hypehd.data_manipulation.derive_baseline(input_data, base_visit, by_vars: list, value, chg=True, pchg=True)

Return a new data frame of derived baseline and related variables. Function for longitudinal data analysis.

input_data : pd.DataFrame

The input dataset.

base_visit : str

The query string that specifies the baseline visit (e.g. 'visit==0').

by_vars : list

Grouping variables uniquely identifying a set of records for baseline and related variables.

value : str

Name of the variable from which to extract the baseline value.

chg : bool, default True

If True, return the change from baseline (chg) variable as value - base.

pchg : bool, default True

If True, return the percent change from baseline (pchg) variable as (value - base)/base.

output_data : pd.DataFrame

Dataset with derived baseline and related variables.

See also: derive_extreme_flag

> derive_baseline(input_data=data, base_visit='visit==0', by_vars=["patient", "lab test"], value="value",
>                 chg=True, pchg=True)
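The derivation can be sketched in plain pandas as a filter, a merge, and two arithmetic columns. This is a minimal illustration rather than the hypehd implementation: the sample data is made up, and the output column names `base`, `chg`, and `pchg` are assumed from the formulas in the parameter descriptions.

```python
import pandas as pd

data = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2],
    "visit": [0, 1, 2, 0, 1],
    "value": [10.0, 12.0, 9.0, 20.0, 25.0],
})
by_vars = ["patient"]

# Pick the baseline rows with a query string, then attach them to every visit.
base = (
    data.query("visit == 0")[by_vars + ["value"]]
    .rename(columns={"value": "base"})
)
out = data.merge(base, on=by_vars, how="left")

out["chg"] = out["value"] - out["base"]    # change from baseline
out["pchg"] = out["chg"] / out["base"]     # (value - base) / base
```

Each row now carries its group's baseline value, so the baseline row itself has a change and a percent change of zero.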

hypehd.data_manipulation.derive_extreme_flag(input_data, by_vars: list, sort_var: list, new_var, mode, value_var=None)

Add a variable flagging the specified observation within each by_vars group. Function for longitudinal data analysis.

input_data : pd.DataFrame

The input dataset.

by_vars : list

Grouping variables uniquely identifying a set of records for flags.

sort_var : list

Variables used to sort the dataset, which determine the first and last observations.

new_var : str

Name of the variable to add. It is set to "Y" for the flagged observation (chosen according to mode) in each by_vars group.

mode : str, one of (last, first, max, min)

Determines whether the first, last, max, or min observation is flagged.

value_var : str, optional

Name of the variable from which to extract the specified value.

output_data : pd.DataFrame

Dataset with the derived extreme flag variable.

> derive_extreme_flag(input_data=data, by_vars=["patient", "lab_test"],
>                     sort_var=["patient", "lab_test", "test_value"], new_var="first_flag",
>                     mode="first", value_var="test_value")
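The flagging logic can be illustrated with a sort followed by a group-wise lookup. The sketch below is an assumption about the behaviour, not the hypehd code: the sample data is made up, and leaving unflagged rows as an empty string is a choice made here, since the source only states that the flagged row is set to "Y".

```python
import pandas as pd

data = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2],
    "lab_test": ["ALT", "ALT", "ALT", "ALT", "ALT"],
    "visit": [2, 0, 1, 1, 0],
    "test_value": [35, 30, 42, 25, 18],
})
by_vars = ["patient", "lab_test"]

# Sort so that "first" and "last" are well defined within each group.
out = data.sort_values(by_vars + ["visit"]).reset_index(drop=True)

# mode="first": flag the first row of each group after sorting.
out["first_flag"] = ""
first_idx = out.groupby(by_vars).head(1).index
out.loc[first_idx, "first_flag"] = "Y"

# mode="max": flag the row holding the largest test_value in each group.
out["max_flag"] = ""
max_idx = out.groupby(by_vars)["test_value"].idxmax().tolist()
out.loc[max_idx, "max_flag"] = "Y"
```

With this data the first visit and the maximum value fall on different rows, so the two flags mark different observations in each group.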

hypehd.data_manipulation.time_to_event(input_data, start_date, end_date, censor_date, new_var, unit)

Add a variable holding the time-to-event duration. Function for survival data analysis.

input_data : pd.DataFrame

The input dataset.

start_date : str

Name of the variable holding the time-to-event origin date.

end_date : str

Name of the variable holding the date on which the event happened.

censor_date : str

Name of the variable holding the time-to-event censoring date.

new_var : str

The name of the variable to add.

unit : str, one of (day, week, month, year)

The unit of the time-to-event duration.

output_data : pd.DataFrame

Dataset with the derived time-to-event variable.

> # Column names and new_var below are illustrative.
> time_to_event(input_data=data, start_date="start_date", end_date="end_date", censor_date="censor_date",
>               new_var="tte", unit="day")
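A time-to-event duration is commonly computed from the event date when the event occurred and from the censoring date otherwise. The sketch below follows that convention in plain pandas; whether time_to_event handles censoring in exactly this way is an assumption, and the column names, the `event` indicator, and the sample dates are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "start_date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-02-01"]),
    "event_date": pd.to_datetime(["2020-01-15", None, "2020-03-02"]),
    "censor_date": pd.to_datetime(["2020-06-30", "2020-01-31", "2020-06-30"]),
})

# Use the event date when the event happened, otherwise the censoring date.
stop = data["event_date"].fillna(data["censor_date"])

data["event"] = data["event_date"].notna().astype(int)   # 1 = event, 0 = censored
data["tte_days"] = (stop - data["start_date"]).dt.days   # unit = day
data["tte_weeks"] = data["tte_days"] / 7                 # unit = week
```

The second record has no event date, so its duration runs to the censoring date and its indicator is 0.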

hypehd.data_manipulation.read(source, path=None, sheet_name=None, sql=None, con=None)

Reads data from the specified path into a pandas DataFrame.

source : str, mandatory

The type of the data source. Available options are csv, tsv, excel, sql, json, html and xml.

path : str, optional

The path to the data.

sheet_name : str, optional

Name of the Excel sheet. The path to the Excel file needs to be specified for this option.

sql : str or SQLAlchemy Selectable, optional

The SQL command for getting the data.

con : SQLAlchemy connectable, str, or sqlite3 connection, optional

Connection to the database.

df : pd.DataFrame

The dataset.

> from sqlite3 import connect
> conn = connect(':memory:')
> data = read(source='sql', sql='SELECT int_column, date_column FROM test_data', con=conn)
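The SQL example above only works if the table `test_data` already exists in the database. The following self-contained sketch creates such a table in an in-memory SQLite database and reads it with `pandas.read_sql`; that read delegates to `pandas.read_sql` for the sql source is an assumption, and the rows inserted here are made up.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (int_column INTEGER, date_column TEXT)")
conn.executemany(
    "INSERT INTO test_data VALUES (?, ?)",
    [(0, "2012-10-11"), (1, "2010-12-11")],
)
conn.commit()

# pandas.read_sql accepts a sqlite3 connection directly.
data = pd.read_sql("SELECT int_column, date_column FROM test_data", conn)
conn.close()
```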

hypehd.data_manipulation.check_bias(df, col=None, real_dist=None, n_marg=10, marg=5)

Checks data for two types of bias: too many null values and improper distribution. If no column is specified, only the number of null values is checked. The function prints the names of the columns with too many null values along with their null counts, and the names of the columns with a skewed distribution along with their distribution.

df : pd.DataFrame, mandatory

The dataset containing the column of interest.

col : str, optional

The name of the column to check.

real_dist : list of lists, mandatory when col is specified

A list of two-item lists, each pairing a value with its proper distribution.

n_marg : int, optional

The percentage of null values that is allowed. The default is 10.

marg : int, optional

The amount of deviation allowed from the proper distribution. The default is 5.

too_nul : list of columns that have too many null values

skew : list of columns with a skewed distribution

> check_bias(df=data, col='Blood_cell_type', real_dist=[['Red', 37], ['White', 53]],
>            n_marg=50)
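The two checks can be sketched with plain pandas as follows. This is an illustration only: it assumes that real_dist and marg are expressed in percentage points and that a column counts as skewed when any category deviates from its expected share by more than marg. The sample data and the variable names `too_null` and `skewed` are hypothetical.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Blood_cell_type": ["Red", "Red", "White", "White", "White",
                        "Red", "White", "Red", "Red", "Red"],
    "sbp": [120, np.nan, np.nan, 118, np.nan, 130, 125, np.nan, 140, 122],
})

n_marg = 10   # allowed percentage of null values per column
marg = 5      # allowed deviation from the expected percentage (assumed unit)
real_dist = [["Red", 50], ["White", 50]]

# Check 1: columns whose share of nulls exceeds n_marg percent.
null_pct = data.isna().mean() * 100
too_null = null_pct[null_pct > n_marg].index.tolist()

# Check 2: observed vs. expected percentage for each category.
observed = data["Blood_cell_type"].value_counts(normalize=True) * 100
skewed = {
    value: observed.get(value, 0.0)
    for value, expected in real_dist
    if abs(observed.get(value, 0.0) - expected) > marg
}
```

Here `sbp` is reported because 40% of its values are null, and both blood cell categories are reported because they sit 10 points away from the expected 50%.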

hypehd.data_manipulation.numeric_to_categorical(df, col: str, bounds, add=False)

Changes numeric data to categories. If the add option is True, the data is added as a separate column; if it is False, the categorical values replace the numerical values of the column.

df : pd.DataFrame, mandatory

The data source.

col : str, mandatory

The name of the column to change.

bounds : list of lists, mandatory

A list of two-item lists, each holding the upper bound of a category (int) and the category name (str).

add : bool, optional

Choice of adding the results as an additional column or replacing the current column. The default is False (replacement).

df : pd.DataFrame

The changed dataset.

> data = numeric_to_categorical(data, 'sbp', [[9, 'low'], [1000000, 'high']], True)
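Binning by upper bounds can be reproduced with `pandas.cut`. The sketch below assumes that each upper bound is inclusive, which the source does not state, and writes the result to a hypothetical new column `sbp_cat`; the sample values are made up.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"sbp": [5, 9, 120, 300]})
bounds = [[9, "low"], [1000000, "high"]]

uppers = [upper for upper, _ in bounds]
labels = [name for _, name in bounds]

# Bin edges: values up to and including 9 are "low", the rest are "high".
data["sbp_cat"] = pd.cut(
    data["sbp"],
    bins=[-np.inf] + uppers,
    labels=labels,
    right=True,
).astype(str)
```

Writing to a new column corresponds to add=True; assigning the result back to `sbp` would correspond to the default replacement behaviour.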

hypehd.data_manipulation.categorical_to_numeric(df, col: str, bounds, add=False)

Changes categories to numbers. If the add option is True, the data is added as a separate column; if it is False, the numeric values replace the categorical values of the column.

df : pd.DataFrame, mandatory

The data source.

col : str, mandatory

The name of the column to change.

bounds : list of lists, mandatory

A list of two-item lists, each holding a category name and the number to replace it with.

add : bool, optional

Choice of adding the results as an additional column or replacing the current column. The default is False (replacement).

df : pd.DataFrame

The changed dataset.

> data = categorical_to_numeric(data, 'sex', [['Female', 0], ['Male', 1]], True)
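In plain pandas the same recoding is a dictionary lookup with `Series.map`. This is a minimal sketch, not the hypehd implementation; the new column name `sex_num` and the sample rows are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({"sex": ["Female", "Male", "Male", "Female"]})
bounds = [["Female", 0], ["Male", 1]]

# Turn the list of [category, number] pairs into a lookup table.
mapping = {name: number for name, number in bounds}
data["sex_num"] = data["sex"].map(mapping)
```

Categories missing from the mapping would become NaN with `Series.map`, so it is worth checking that bounds covers every value in the column.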