`hypehd.data_manipulation`

Module Contents

Functions

`handle_null`(input_data, col, impute_type[, by_vars])	Replace missing values with a descriptive statistic.
`change_type`(df, col, col_type)	Changes the data type of the specified column of a data frame.
`data_selection`(input_data[, cond, keep_col, drop_col, ...])	Return a new data frame matching the given selection criteria.
`derive_baseline`(input_data, base_visit, by_vars, value)	Return a new data frame of derived baseline and related variables. Function for longitudinal data analysis.
`derive_extreme_flag`(input_data, by_vars, sort_var, ...)	Add a variable flagging the specified observation within each by_vars group. Function for longitudinal data analysis.
`time_to_event`(input_data, start_date, end_date, ...)	Add a variable holding the time-to-event duration. Function for survival data analysis.
`read`(source[, path, sheet_name, sql, con])	Reads data from specified path into a pandas dataframe.
`check_bias`(df[, col, real_dist, n_marg, marg])	Checks data for two types of bias: too many null values and improper distribution.
`numeric_to_categorical`(df, col, bounds[, add])	Changes numeric data to categories. If the add option is True, the data will be added into a separate column.
`categorical_to_numeric`(df, col, bounds[, add])	Changes categories to numbers. If the add option is True, the data will be added into a separate column.

hypehd.data_manipulation.handle_null(input_data, col, impute_type, by_vars=None)

Replace missing values with a descriptive statistic.

input_data : pd.DataFrame
Input dataset.

col : str
Name of the variable to impute.

by_vars : str or list
Grouping variables that identify the set of records over which the descriptive statistic is computed.

impute_type : str, one of (mean, max, min, median, remove)
The imputation method. remove drops the rows with null values.

input_data : pd.DataFrame
The same dataset, returned after imputation.

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df["C0"] = [0.2601, 0.2358, 0.1429, 0.1259, 0.7526, 0.7341, 0.4546, 0.1426, 0.1490, 0.2500]
>>> df["C1"] = [0.7154, np.nan, 0.2615, 0.5846, np.nan, 0.8308, 0.4962, np.nan, 0.5340, 0.6731]
>>> handle_null(input_data=df, col="C1", impute_type="median")
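The effect of by_vars on the imputed value is easiest to see in plain pandas. The following is a minimal sketch, not the library's implementation: impute_sketch is a hypothetical stand-in for handle_null, and the grp column and its values are made up for illustration.

```python
import numpy as np
import pandas as pd


def impute_sketch(data, col, impute_type, by_vars=None):
    """Hypothetical stand-in for handle_null; not the library's code."""
    out = data.copy()
    if impute_type == "remove":
        # Drop the rows where the target column is null.
        return out.dropna(subset=[col])
    if by_vars is None:
        # One statistic computed over the whole column.
        fill = out[col].agg(impute_type)
    else:
        # One statistic per group, aligned back to the original rows.
        fill = out.groupby(by_vars)[col].transform(impute_type)
    out[col] = out[col].fillna(fill)
    return out


df = pd.DataFrame({
    "grp": ["a", "a", "a", "b", "b", "b"],
    "C1": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})
overall = impute_sketch(df, "C1", "median")
grouped = impute_sketch(df, "C1", "median", by_vars="grp")
```

Without by_vars both gaps receive the overall median; with by_vars each gap receives the median of its own group.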

hypehd.data_manipulation.change_type(df, col, col_type)

Changes the data type of the specified column of a data frame.

df : pd.DataFrame, mandatory
The dataset that will be changed.

col : str, mandatory
Name of the chosen column.

col_type : data type, mandatory
Type to change to. The available options are int, float and str.

df : pd.DataFrame
The returned dataset with the changed column type.

>>> data = change_type(data, "sbp", int)
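For comparison, the same kind of conversion in plain pandas looks roughly like this. It is a sketch under the assumption that change_type behaves like a column-wise astype; the source does not confirm how the function is implemented, and the sbp values are invented.

```python
import pandas as pd

data = pd.DataFrame({"sbp": [120.0, 135.0, 118.0]})

# Cast one column to int on a copy, leaving the original frame untouched.
converted = data.copy()
converted["sbp"] = converted["sbp"].astype(int)
```

In plain pandas, a column that contains missing values cannot be cast to int this way, so nulls need to be handled first.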

hypehd.data_manipulation.data_selection(input_data: pandas.DataFrame, cond=None, keep_col=None, drop_col=None, sort_by=None, merge_data=None, merge_by=None, merge_keep_col=None, sort_asc=True, rename=None)

Return a new data frame matching the given selection criteria.

input_data : pd.DataFrame
Input dataset.

cond : str, optional
The query string used to filter input_data.

keep_col : list or str, optional
The names of the variables in input_data to keep.

drop_col : list or str, optional
The names of the variables in input_data to drop.

sort_by : list or str, optional
The names of the variables, in input_data or merge_data after keeping and dropping, by which to sort the data frame.

merge_data : pd.DataFrame or Series, optional
DataFrame or named Series to merge with input_data using a database-style join. The default merge type is left.

merge_by : list or str, optional
Column names to join on.

merge_keep_col : list or str, optional
The names of the variables in merge_data to keep.

sort_asc : bool or list of bool, default True, optional
Sort ascending vs. descending. Specify a list for multiple sort orders.

rename : dict
Changes column labels. Specify the original column names and their new names in the dict object.

output_data : pd.DataFrame
Dataset with the selection criteria applied.
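The parameters map onto familiar pandas operations. The chain below is a sketch of one plausible ordering (filter, drop, merge, sort, rename); the order the function actually applies them in is an assumption, and the visits and demog frames are invented for illustration.

```python
import pandas as pd

visits = pd.DataFrame({
    "patient": [1, 1, 2, 2],
    "visit": [0, 1, 0, 1],
    "sbp": [120, 130, 140, 125],
    "note": ["a", "b", "c", "d"],
})
demog = pd.DataFrame({"patient": [1, 2], "sex": ["F", "M"], "site": ["x", "y"]})

selected = (
    visits.query("visit == 1")              # cond
    .drop(columns=["note"])                 # drop_col
    .merge(demog[["patient", "sex"]],       # merge_data, merge_keep_col
           on="patient", how="left")        # merge_by, left join
    .sort_values("sbp", ascending=True)     # sort_by, sort_asc
    .rename(columns={"sbp": "systolic"})    # rename
    .reset_index(drop=True)
)
```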

hypehd.data_manipulation.derive_baseline(input_data, base_visit, by_vars: list, value, chg=True, pchg=True)

Return a new data frame of derived baseline and related variables. Function for longitudinal data analysis.

input_data : pd.DataFrame
Input dataset.

base_visit : str
The query string that specifies the baseline visit (e.g. 'visit==0').

by_vars : list
Grouping variables that identify the set of records for which the baseline and related variables are derived.

value : str
The name of the variable from which to extract the baseline value.

chg : bool, default True
If True, return the change from baseline (chg) variable as value - base.

pchg : bool, default True
If True, return the percent change from baseline (pchg) variable as (value - base)/base.

output_data : pd.DataFrame
Dataset with derived baseline and related variables.

See also: derive_extreme_flag

>>> derive_baseline(input_data=data, base_visit="visit==0", by_vars=["patient", "lab test"], value=value,
...                 chg=True, pchg=True)
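The chg and pchg formulas above can be reproduced by hand in pandas. This is a sketch of the arithmetic only, not the function's implementation; the column names patient, visit, value and base, and the numbers, are assumptions made for the example.

```python
import pandas as pd

data = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2],
    "visit": [0, 1, 2, 0, 1],
    "value": [10.0, 12.0, 9.0, 20.0, 25.0],
})
by_vars = ["patient"]

# Pick the baseline record per group and name its value "base".
base = (
    data.query("visit == 0")[by_vars + ["value"]]
    .rename(columns={"value": "base"})
)

# Attach the baseline to every record of the same group.
out = data.merge(base, on=by_vars, how="left")
out["chg"] = out["value"] - out["base"]                     # value - base
out["pchg"] = (out["value"] - out["base"]) / out["base"]    # (value - base)/base
```

Note that pchg as defined is a fraction (0.2 for a 20% increase), not a value multiplied by 100.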

hypehd.data_manipulation.derive_extreme_flag(input_data, by_vars: list, sort_var: list, new_var, mode, value_var=None)

Add a variable flagging the specified observation within each by_vars group. Function for longitudinal data analysis.

input_data : pd.DataFrame
Input dataset.

by_vars : list
Grouping variables that identify the set of records within which an observation is flagged.

sort_var : list
Variables used to sort the dataset, which determine the first/last observation.

new_var : str
The name of the variable to add. It is set to "Y" for the flagged observation (depending on the mode) of each by group.

mode : str, one of (last, first, max, min)
Determines whether the first, last, max or min observation is flagged.

value_var : str
The name of the variable from which to extract the specified value.

output_data : pd.DataFrame
Dataset with derived extreme variables.

>>> derive_extreme_flag(input_data=data, by_vars=["patient", "lab_test"],
...                     sort_var=["patient", "lab_test", "test_value"], new_var="first_flag",
...                     mode="first", value_var="test_value")
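The difference between the positional modes (first, last) and the value-based modes (max, min) can be illustrated in plain pandas. This is a hedged sketch, not the library's code: the data are invented, and leaving unflagged rows as an empty string is an assumption, since the source only states that flagged rows are set to "Y".

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2],
    "visit": [2, 0, 1, 1, 0],
    "test_value": [6.0, 5.0, 7.0, 4.0, 3.0],
})
by_vars = ["patient"]
sort_var = ["patient", "visit"]

out = data.sort_values(sort_var).reset_index(drop=True)

# mode="first": the first row of each group after sorting.
first_rows = ~out.duplicated(subset=by_vars, keep="first")
out["first_flag"] = np.where(first_rows, "Y", "")

# mode="max": the row holding the largest test_value in each group.
max_rows = out.index.isin(out.groupby(by_vars)["test_value"].idxmax())
out["max_flag"] = np.where(max_rows, "Y", "")
```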

hypehd.data_manipulation.time_to_event(input_data, start_date, end_date, censor_date, new_var, unit)

Add a variable holding the time-to-event duration, computed from the start, end and censoring dates. Function for survival data analysis.

input_data : pd.DataFrame
Input dataset.

start_date : str
Name of the variable holding the time-to-event origin date.

end_date : str
Name of the variable holding the date on which the event happened.

censor_date : str
Name of the variable holding the censoring date.

new_var : str
The name of the variable to add.

unit : str, one of (day, week, month, year)
The unit of the time-to-event duration.

output_data : pd.DataFrame
Dataset with the derived time-to-event variable.

>>> time_to_event(input_data=data, start_date="start_date", end_date="end_date",
...               censor_date="censor_date", new_var="tte", unit="day")

The column names in this call are illustrative.
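A common way to derive a time-to-event duration is to measure from the start date to the event date, falling back to the censoring date when no event was observed. The sketch below assumes that rule; the source does not state how time_to_event combines end_date and censor_date, and the column names and dates are invented.

```python
import pandas as pd

data = pd.DataFrame({
    "start": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "event": pd.to_datetime(["2020-01-15", None]),      # second subject has no event
    "censor": pd.to_datetime(["2020-03-01", "2020-01-29"]),
})

# Assumed rule: use the event date when present, otherwise the censoring date.
stop = data["event"].fillna(data["censor"])

data["tte_day"] = (stop - data["start"]).dt.days   # unit="day"
data["tte_week"] = data["tte_day"] / 7             # unit="week"
```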

hypehd.data_manipulation.read(source, path=None, sheet_name=None, sql=None, con=None)

Reads data from specified path into a pandas dataframe.

source : str, mandatory
The type of the data source. Available options are csv, tsv, excel, sql, json, html and xml.

path : str, optional
The path to the data.

sheet_name : str, optional
Name of the Excel sheet. The path to the Excel file needs to be specified for this option.

sql : str or SQLAlchemy Selectable, optional
The SQL command for getting the data.

con : SQLAlchemy connectable, str, or sqlite3 connection, optional
Connection to the database.

df : pd.DataFrame
The dataset.

>>> from sqlite3 import connect
>>> conn = connect(":memory:")
>>> data = read(source="sql", sql="SELECT int_column, date_column FROM test_data", con=conn)
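The SQL example only returns rows if the test_data table exists in the connected database. The sketch below creates it in an in-memory SQLite database and runs the same query through pandas.read_sql directly; whether read delegates to that pandas function is an assumption, and the inserted rows are made up.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

# Create and fill the table that the query expects.
conn.execute("CREATE TABLE test_data (int_column INTEGER, date_column TEXT)")
conn.executemany(
    "INSERT INTO test_data VALUES (?, ?)",
    [(0, "2012-10-11"), (1, "2010-12-11")],
)
conn.commit()

# Run the same query as in the example above, using pandas directly.
data = pd.read_sql("SELECT int_column, date_column FROM test_data", conn)
conn.close()
```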

hypehd.data_manipulation.check_bias(df, col=None, real_dist=None, n_marg=10, marg=5)

Checks data for two types of bias: too many null values and improper distribution. If no column is specified, only the number of null values is checked. The function prints the names of the columns with too many null values along with their null counts, and the names of the columns with skewed distribution along with their distribution.

df : pd.DataFrame, mandatory
The dataset containing the column of interest.

col : str, optional
The name of the column to check.

real_dist : list of lists, mandatory
A list of two-item lists, each containing a value and its proper distribution.

n_marg : int, optional
The percentage of null values that is allowed. The default is 10.

marg : int, optional
The amount of deviation allowed from the proper distribution. The default is 5.

too_nul : list of columns that have too many null values
skew : list of columns with skewed distribution

>>> check_bias(df=data, col="Blood_cell_type", real_dist=[["Red", 37], ["White", 53]],
...            n_marg=50)
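The two checks can be approximated in a few lines of pandas. This sketch rests on assumptions the source does not confirm: that the real_dist values are percentages, and that marg is the allowed difference in percentage points between the observed and the proper share. The data are invented.

```python
import pandas as pd

data = pd.DataFrame({
    "Blood_cell_type": ["Red"] * 6 + ["White"] * 4,
    "count": [1.0, None, None, None, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})
n_marg, marg = 10, 5
real_dist = [["Red", 40], ["White", 60]]

# Check 1: columns whose percentage of nulls exceeds n_marg.
null_pct = data.isna().mean() * 100
too_null = null_pct[null_pct > n_marg].index.tolist()

# Check 2: observed share of each value versus its proper share.
observed = data["Blood_cell_type"].value_counts(normalize=True) * 100
deviations = {
    value: observed.get(value, 0.0) - expected for value, expected in real_dist
}
skewed = any(abs(dev) > marg for dev in deviations.values())
```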

hypehd.data_manipulation.numeric_to_categorical(df, col: str, bounds, add=False)

Changes numeric data to categories. If the add option is True, the data will be added into a separate column; if it is False, the categorical values will replace the numerical values of the column.

df : pd.DataFrame, mandatory
The data source.

col : str, mandatory
The name of the column to change.

bounds : list of lists, mandatory
A list of two-item lists, each holding the upper bound of a category (int) and the category name (str).

add : bool, optional
Choice of adding the results as an additional column or replacing the current column. The default is False (replacement).

df : pd.DataFrame
The changed dataset.

>>> data = numeric_to_categorical(data, "sbp", [[9, "low"], [1000000, "high"]], True)
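Binning by upper bounds corresponds closely to pandas.cut. The sketch below assumes that each upper bound is inclusive and that the first category has no lower limit; the source does not specify either, and the new column name sbp_cat is hypothetical.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"sbp": [5, 9, 10, 150]})
bounds = [[9, "low"], [1000000, "high"]]

# Bin edges: open below, then each category's upper bound.
edges = [-np.inf] + [upper for upper, _ in bounds]
labels = [name for _, name in bounds]

# right=True makes every upper bound inclusive (an assumption).
data["sbp_cat"] = pd.cut(
    data["sbp"], bins=edges, labels=labels, right=True
).astype(str)
```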

hypehd.data_manipulation.categorical_to_numeric(df, col: str, bounds, add=False)

Changes categories to numbers. If the add option is True, the data will be added into a separate column; if it is False, the numeric values will replace the categorical values of the column.

df : pd.DataFrame, mandatory
The data source.

col : str, mandatory
The name of the column to change.

bounds : list of lists, mandatory
A list of two-item lists, each holding the name of a category and the number to replace it with.

add : bool, optional
Choice of adding the results as an additional column or replacing the current column. The default is False (replacement).

df : pd.DataFrame
The changed dataset.

>>> data = categorical_to_numeric(data, "sex", [["Female", 0], ["Male", 1]], True)
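The bounds argument here is effectively a lookup table. A plain pandas sketch of the same mapping, with the result added as a separate column, might look like this; the column name sex_num is hypothetical and the code is not the library's implementation.

```python
import pandas as pd

data = pd.DataFrame({"sex": ["Female", "Male", "Male", "Female"]})
bounds = [["Female", 0], ["Male", 1]]

# Turn the list of [category, number] pairs into a dict and map the column.
mapping = dict(bounds)
data["sex_num"] = data["sex"].map(mapping)
```

With Series.map, any category missing from the mapping becomes a null value, so every category in the column should appear in bounds.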
