Examples 2

Some additional examples demonstrating the use of the HYPEHD package. The test dataset is open-source data from https://github.com/insightsengineering/scda.2022.

%%capture
%pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple hypehd

Imports

from hypehd import visualization as vis
from hypehd import data_manipulation as da
import hypehd

Reading the data

# read into dataframe dm from package
my_file = hypehd.PACKAGEDIR / 'data' / 'demographic.csv'
dm = da.read("csv", my_file)
dm.head()

	Unnamed: 0	STUDYID	USUBJID	SUBJID	SITEID	AGE	AGEU	SEX	RACE	ETHNIC	...	DCSREAS	DTHDT	DTHCAUS	DTHCAT	LDDTHELD	LDDTHGR1	LSTALVDT	DTHADY	ADTHAUT	study_duration_secs
0	1	AB12345	AB12345-CHN-3-id-128	id-128	CHN-3	32	YEARS	M	ASIAN	HISPANIC OR LATINO	...	DEATH	2022-03-06	ADVERSE EVENT	ADVERSE EVENT	22.0	<=30	2022-03-06	1106.0	Yes	63113904
1	2	AB12345	AB12345-CHN-15-id-262	id-262	CHN-15	35	YEARS	M	BLACK OR AFRICAN AMERICAN	NOT HISPANIC OR LATINO	...	NaN	NaN	NaN	NaN	NaN	NaN	2022-03-17	NaN	NaN	63113904
2	3	AB12345	AB12345-RUS-3-id-378	id-378	RUS-3	30	YEARS	F	ASIAN	NOT HISPANIC OR LATINO	...	NaN	NaN	NaN	NaN	NaN	NaN	2022-03-11	NaN	NaN	63113904
3	4	AB12345	AB12345-CHN-11-id-220	id-220	CHN-11	26	YEARS	F	ASIAN	NOT HISPANIC OR LATINO	...	NaN	NaN	NaN	NaN	NaN	NaN	2022-03-26	NaN	NaN	63113904
4	5	AB12345	AB12345-CHN-7-id-267	id-267	CHN-7	40	YEARS	M	ASIAN	NOT HISPANIC OR LATINO	...	NaN	NaN	NaN	NaN	NaN	NaN	2022-03-15	NaN	NaN	63113904

5 rows × 57 columns

Checking data null bias

Some forms of bias have to be checked manually, but the package offers two automated checks: finding columns with too much missing data, and checking whether the distribution of discrete data is as intended.
Here, we check for columns with too much missing data, in other words, columns that have too many null values. This is done using the check_bias function.

too_null_1 = da.check_bias(dm)

Columns with too many null values: 
 [['TRTEDTM', 73], ['TRT01EDTM', 73], ['TRT02SDTM', 73], ['TRT02EDTM', 73], ['AP01EDTM', 73], ['AP02SDTM', 73], ['AP02EDTM', 73], ['EOSDT', 73], ['EOSDY', 73], ['DCSREAS', 280], ['DTHDT', 330], ['DTHCAUS', 330], ['DTHCAT', 330], ['LDDTHELD', 330], ['LDDTHGR1', 330], ['LSTALVDT', 73], ['DTHADY', 330], ['ADTHAUT', 343]]
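To make the idea of "too many null values" concrete, here is a minimal sketch of such a check in plain pandas. It is not the implementation of check_bias: the helper name columns_with_many_nulls and the 10% threshold are assumptions made for illustration, and the cutoff the package actually uses is not stated on this page.

```python
import pandas as pd


def columns_with_many_nulls(df, max_null_fraction=0.10):
    """Return [column, null_count] pairs for columns above the threshold.

    Hypothetical helper; the default threshold is an assumption.
    """
    null_counts = df.isna().sum()
    limit = max_null_fraction * len(df)
    return [[col, int(n)] for col, n in null_counts.items() if n > limit]


# tiny made-up frame: 'B' is mostly missing, 'A' and 'C' are complete
toy = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [None, None, None, 1.0],
    "C": ["x", "y", "z", "w"],
})
flagged = columns_with_many_nulls(toy)
print(flagged)
```

The result has the same shape as the list printed above: one [column name, number of nulls] pair per flagged column.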

Dealing with too many null values

To deal with this, you can either remove the columns with too many null values, which is done later in this example file, or impute the missing data using the handle_null function, which also gives the option to remove the rows with missing values.
Here, we have removed the rows with null values in the columns that had too many nulls.

# make copy of unhandled version for future use
dm_uncleaned = dm.copy()

# remove rows with null values from columns that have too many nulls
for null in too_null_1[0]:
    dm = da.handle_null(dm, null[0], impute_type="remove")
dm.head()

	Unnamed: 0	STUDYID	USUBJID	SUBJID	SITEID	AGE	AGEU	SEX	RACE	ETHNIC	...	DCSREAS	DTHDT	DTHCAUS	DTHCAT	LDDTHELD	LDDTHGR1	LSTALVDT	DTHADY	ADTHAUT	study_duration_secs
0	1	AB12345	AB12345-CHN-3-id-128	id-128	CHN-3	32	YEARS	M	ASIAN	HISPANIC OR LATINO	...	DEATH	2022-03-06	ADVERSE EVENT	ADVERSE EVENT	22.0	<=30	2022-03-06	1106.0	Yes	63113904
5	6	AB12345	AB12345-CHN-15-id-201	id-201	CHN-15	49	YEARS	M	ASIAN	NOT HISPANIC OR LATINO	...	DEATH	2022-02-22	ADVERSE EVENT	ADVERSE EVENT	3.0	<=30	2022-02-22	1085.0	Yes	63113904
12	13	AB12345	AB12345-RUS-1-id-52	id-52	RUS-1	40	YEARS	F	ASIAN	NOT HISPANIC OR LATINO	...	DEATH	2022-02-20	DISEASE PROGRESSION	PROGRESSIVE DISEASE	7.0	<=30	2022-02-20	1070.0	Yes	63113904
16	17	AB12345	AB12345-BRA-11-id-9	id-9	BRA-11	40	YEARS	M	ASIAN	NOT HISPANIC OR LATINO	...	DEATH	2022-03-20	DISEASE PROGRESSION	PROGRESSIVE DISEASE	36.0	>30	2022-03-20	1091.0	Yes	63113904
18	19	AB12345	AB12345-CHN-15-id-245	id-245	CHN-15	34	YEARS	F	WHITE	NOT HISPANIC OR LATINO	...	DEATH	2022-02-18	DISEASE PROGRESSION	PROGRESSIVE DISEASE	1.0	<=30	2022-02-18	1057.0	Yes	63113904

5 rows × 57 columns
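The only impute_type value shown on this page is "remove", so the other options of handle_null are not illustrated here. As a rough sketch of the trade-off between removing rows and imputing values, the following uses plain pandas on a made-up frame; the column values and the choice of mean and mode imputation are assumptions for illustration.

```python
import pandas as pd

# made-up data with one gap in each column
toy = pd.DataFrame({
    "AGE": [32.0, None, 40.0, 28.0],
    "SEX": ["M", "F", None, "F"],
})

# option 1: drop the rows where a given column is missing
removed = toy.dropna(subset=["AGE"])

# option 2: fill numeric gaps with the column mean
imputed = toy.copy()
imputed["AGE"] = imputed["AGE"].fillna(imputed["AGE"].mean())

# option 3: fill categorical gaps with the most frequent value
imputed["SEX"] = imputed["SEX"].fillna(imputed["SEX"].mode().iloc[0])

print(len(toy), "rows before,", len(removed), "rows after removal")
print(imputed)
```

Removing rows shrinks the dataset, while imputing keeps every row but replaces gaps with estimated values.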

Adding categories for a numerical column

You can change numerical values to categorical ones using the numeric_to_categorical function. Here we have changed AGE into categories. You can either replace the original column or create a new column for the categorical data; we have passed True to create a new column called AGE_group.

dm = da.numeric_to_categorical(dm, 'AGE', [[30, '(,30]'], [35, '(30,35]'], 
                                        [40, '(35,40]'], [45, '(40,45]'],
                                        [50, '(45,50]']], True)
dm.head()

	Unnamed: 0	STUDYID	USUBJID	SUBJID	SITEID	AGE	AGEU	SEX	RACE	ETHNIC	...	DTHDT	DTHCAUS	DTHCAT	LDDTHELD	LDDTHGR1	LSTALVDT	DTHADY	ADTHAUT	study_duration_secs	AGE_group
0	1	AB12345	AB12345-CHN-3-id-128	id-128	CHN-3	32	YEARS	M	ASIAN	HISPANIC OR LATINO	...	2022-03-06	ADVERSE EVENT	ADVERSE EVENT	22.0	<=30	2022-03-06	1106.0	Yes	63113904	(30,35]
5	6	AB12345	AB12345-CHN-15-id-201	id-201	CHN-15	49	YEARS	M	ASIAN	NOT HISPANIC OR LATINO	...	2022-02-22	ADVERSE EVENT	ADVERSE EVENT	3.0	<=30	2022-02-22	1085.0	Yes	63113904	(45,50]
12	13	AB12345	AB12345-RUS-1-id-52	id-52	RUS-1	40	YEARS	F	ASIAN	NOT HISPANIC OR LATINO	...	2022-02-20	DISEASE PROGRESSION	PROGRESSIVE DISEASE	7.0	<=30	2022-02-20	1070.0	Yes	63113904	(35,40]
16	17	AB12345	AB12345-BRA-11-id-9	id-9	BRA-11	40	YEARS	M	ASIAN	NOT HISPANIC OR LATINO	...	2022-03-20	DISEASE PROGRESSION	PROGRESSIVE DISEASE	36.0	>30	2022-03-20	1091.0	Yes	63113904	(35,40]
18	19	AB12345	AB12345-CHN-15-id-245	id-245	CHN-15	34	YEARS	F	WHITE	NOT HISPANIC OR LATINO	...	2022-02-18	DISEASE PROGRESSION	PROGRESSIVE DISEASE	1.0	<=30	2022-02-18	1057.0	Yes	63113904	(30,35]

5 rows × 58 columns
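The labels in the call above suggest that each number is treated as an inclusive upper bound, which matches the output (an AGE of 40 lands in "(35,40]"). Under that assumption, the same binning can be sketched with pandas' own pd.cut; this is an illustration, not the package's implementation, and the sample ages are made up.

```python
import pandas as pd

ages = pd.Series([32, 49, 40, 40, 34, 30], name="AGE")

# upper bounds and labels taken from the call above;
# -inf stands in for the open lower edge of the first bin
edges = [float("-inf"), 30, 35, 40, 45, 50]
labels = ["(,30]", "(30,35]", "(35,40]", "(40,45]", "(45,50]"]

# right=True (the default) makes each bin include its upper bound
age_group = pd.cut(ages, bins=edges, labels=labels, right=True)
print(age_group.tolist())
```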

Plotting the distribution of a demographic feature

You can create a pie chart to show the distribution of a discrete variable using the pie function. If you want to save it, specify the path. If you use "." it will be saved in the same directory as your code file.

vis.pie(df=dm, col="AGE_group", path=".", name="age_pie_chart")

([<matplotlib.patches.Wedge at 0x7fa3ddc85650>,
  <matplotlib.patches.Wedge at 0x7fa3ddc85ed0>,
  <matplotlib.patches.Wedge at 0x7fa3ddc9e750>,
  <matplotlib.patches.Wedge at 0x7fa3ddc9e450>,
  <matplotlib.patches.Wedge at 0x7fa3dbc2b790>],
 [Text(0.5320908144466621, 0.9627457427490853, '(35,40]'),
  Text(-1.084458100654201, 0.18425696167440575, '(,30]'),
  Text(-0.30451921808270677, -1.0570090093363902, '(30,35]'),
  Text(0.7778173500684048, -0.7778175685419846, '(40,45]'),
  Text(1.084458098497778, -0.18425697436619265, '(45,50]')],
 [Text(0.2902313533345429, 0.525134041499501, '33.9%\n19'),
  Text(-0.5915226003568368, 0.10050379727694858, '26.8%\n15'),
  Text(-0.16610139168147642, -0.57655036872894, '19.6%\n11'),
  Text(0.42426400912822076, -0.42426412829562793, '14.3%\n8'),
  Text(0.5915225991806061, -0.10050380419974143, '5.4%\n3')])

_images/7ee2a5b0445f7d616ed09182d6dc44c7fce7d90de7509f92871c963293ef96a6.png
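The percentages on the wedges follow directly from the group sizes. As a quick check, this small sketch recomputes them from the counts reported in the output above, using only the standard library.

```python
from collections import Counter

# group sizes as reported in the pie chart output above
counts = Counter({"(35,40]": 19, "(,30]": 15, "(30,35]": 11,
                  "(40,45]": 8, "(45,50]": 3})

total = sum(counts.values())
shares = {group: round(100 * n / total, 1) for group, n in counts.most_common()}

for group, pct in shares.items():
    print(f"{group:>8}: {counts[group]:>2} rows, {pct}%")
```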

Plotting a boxplot grid

You can plot a boxplot grid of either three features (one numeric and two categorical) or of all numeric features using the boxplot_grid function.
Here we have plotted AGE by SEX and COUNTRY.

vis.boxplot_grid(dm, col1='COUNTRY', col2='SEX', col3='AGE')

/home/docs/checkouts/readthedocs.org/user_builds/hypehd/envs/latest/lib/python3.7/site-packages/seaborn/axisgrid.py:712: UserWarning: Using the boxplot function without specifying `order` is likely to produce an incorrect plot.
  warnings.warn(warning)

_images/d1ac19be715c56e63d95d457acb7a9306d7705f17e62cd4ea3898e5b4d7fffde.png

<seaborn.axisgrid.FacetGrid at 0x7fa3db3e6290>
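Each box in such a grid summarises one combination of the two categorical features by the quartiles of the numeric feature. The sketch below computes those quartiles per (COUNTRY, SEX) cell with the standard library. The records are made up for illustration, and the "inclusive" quartile method is an assumption; the plotting library may interpolate slightly differently.

```python
from collections import defaultdict
from statistics import quantiles

# made-up (COUNTRY, SEX, AGE) records, for illustration only
records = [
    ("CHN", "M", 32), ("CHN", "M", 35), ("CHN", "M", 40), ("CHN", "M", 49),
    ("CHN", "F", 26), ("CHN", "F", 34), ("CHN", "F", 30),
    ("RUS", "F", 30), ("RUS", "F", 40), ("RUS", "F", 36),
]

# collect the ages belonging to each (country, sex) cell
ages_by_cell = defaultdict(list)
for country, sex, age in records:
    ages_by_cell[(country, sex)].append(age)

# quartiles per cell: lower edge, middle line and upper edge of each box
summary = {}
for cell, ages in sorted(ages_by_cell.items()):
    q1, median, q3 = quantiles(ages, n=4, method="inclusive")
    summary[cell] = (q1, median, q3)
    print(cell, "Q1:", q1, "median:", median, "Q3:", q3)
```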

Generate demographic plots

The demo_graph function helps users explore the distribution and descriptive statistics of continuous and discrete variables at the same time. Here it generates plots of AGE and SEX by treatment group.

vis.demo_graph(var=["AGE", "SEX"], input_data=dm, group='TRT01P')

([<Figure size 1500x1000 with 1 Axes>, <Figure size 1500x1000 with 1 Axes>],
 [<AxesSubplot:title={'center':'Plot and summary table for Age'}, ylabel='AGE'>,
  <AxesSubplot:title={'center':'Plot and summary table for Sex'}, ylabel='SEX'>])

_images/ba11cd0a503a72d9090ba48cbeb1e356f67aed82de6568b42997a448f8f38a0e.png

_images/575137907b58f348f9cdebc9bb6e9ab71c2fc680f4a0e4b9b7571f892dcb004d.png
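If you only need the numbers behind such plots, a per-group summary can be sketched in plain pandas. The frame below is made up, including the group labels, and the chosen statistics are an assumption; the summary table drawn by demo_graph may contain different ones.

```python
import pandas as pd

# made-up data: two treatment groups with one continuous and one discrete variable
toy = pd.DataFrame({
    "TRT01P": ["arm A", "arm A", "arm B", "arm B", "arm B"],
    "AGE": [32, 40, 35, 26, 41],
    "SEX": ["M", "F", "F", "F", "M"],
})

# continuous variable: summary statistics per group
age_summary = toy.groupby("TRT01P")["AGE"].agg(["count", "mean", "min", "max"])

# discrete variable: counts of each level per group
sex_counts = pd.crosstab(toy["TRT01P"], toy["SEX"])

print(age_summary)
print(sex_counts)
```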

Plotting 3D clusters

You can find and plot the clusters among three numerical features using the cluster_3d function, choosing your preferred clustering method from the options available.
Here, we have plotted a three-dimensional cluster graph of AGE, BMRKR1 and study_duration_secs using k-means clustering.
If you only want a 3D graph without the clustering, you can use the graph_3d function.

vis.cluster_3d(df=dm_uncleaned, cols=['AGE','BMRKR1','study_duration_secs'],
               lab1 = 'Age', lab3='Study duration (s)', c_type="k-means", legend=True)

(<Figure size 1200x1200 with 1 Axes>,
 <Axes3DSubplot:xlabel='Age', ylabel='BMRKR1'>)

_images/4d79f0052d12b58c8f2c246f0838fe90c94df6d4bf60403539304ca652d9f87b.png
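For readers unfamiliar with k-means, the following is a minimal pure-Python sketch of the basic algorithm (Lloyd's iteration) on made-up three-dimensional points. It is not how cluster_3d is implemented; the function names, the fixed starting centroids and the data are all assumptions for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iterations=20):
    """Alternate between assigning points and moving centroids until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iterations):
        # assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # update step: each centroid moves to the mean of its members
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(axis) / len(members)
                                     for axis in zip(*members)))
            else:
                updated.append(old)  # leave an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# made-up points forming two well-separated groups
points = [(30, 5.0, 1.0), (32, 6.0, 1.2), (31, 5.5, 0.9),
          (48, 12.0, 3.0), (50, 13.0, 3.2), (49, 12.5, 2.9)]
labels, centres = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centres)
```

Because k-means relies on Euclidean distance, a feature measured on a much larger scale than the others can dominate the result, so it may be worth checking the ranges of the columns you pass in.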

Plotting 2D clusters

For two-dimensional clustering you can use the cluster_2d function in a similar fashion to the three-dimensional function.
Here we have plotted a two-dimensional cluster graph of AGE and BMRKR1 with DBSCAN clustering.

vis.cluster_2d(df=dm_uncleaned, cols=['AGE','BMRKR1'], c_type="dbscans", min_sample=10)

(<Figure size 1200x1200 with 1 Axes>,
 <AxesSubplot:xlabel='AGE', ylabel='BMRKR1'>)

_images/70a45a3e20944138deae9517f69d42ad84d4eb13ae075ffe51dceddad896927f.png
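DBSCAN groups points that sit in dense regions and marks isolated points as noise. The sketch below is a minimal pure-Python version on made-up two-dimensional points, intended only to show the roles of a neighbourhood radius and a minimum neighbour count. It is not the package's implementation; the names region, dbscan, eps and min_samples are assumptions, and in this sketch a point counts as its own neighbour.

```python
def region(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)          # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = -1                 # noise for now; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # former noise joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),        # dense group
          (5, 5),                                # isolated point
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)
```

Unlike k-means, the number of clusters is not chosen in advance; it follows from the radius and the minimum neighbour count.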

Change data type

You can change a column’s data type between String, Integer and Float types using the change_type function.
Here we have changed BMRKR1 from float to int.

dm_uncleaned = da.change_type(dm_uncleaned, 'BMRKR1', int)
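Converting floats to integers loses information, so it helps to know how the cast behaves. The sketch below shows plain pandas behaviour on made-up values; whether change_type behaves identically is an assumption. In plain pandas, casting a float column that contains NaN to int raises an error, so missing values need handling first.

```python
import pandas as pd

# made-up biomarker values
biomarker = pd.Series([5.85, 7.49, 3.02, 9.99], name="BMRKR1")

# casting float to int drops the fractional part (truncation, not rounding)
as_int = biomarker.astype(int)
print(as_int.tolist())

# round first if nearest-integer values are wanted
rounded = biomarker.round().astype(int)
print(rounded.tolist())
```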

Add a numeric coded column from a categorical column

The categorical_to_numeric function changes categorical data columns to numerical ones. Here we have coded the SEX column in the dm_uncleaned dataset as 0 and 1 and added the values as a separate column (SEX_num) by setting add to True; with False, the codes would replace the original values in the SEX column.

dm_uncleaned = da.categorical_to_numeric(dm_uncleaned, 'SEX', [['F', 0],['M', 1]], True)
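The same kind of recoding can be sketched in plain pandas with a dictionary and Series.map. This is an illustration on made-up data, not the package's implementation. Note that in plain pandas, values without a code (including missing values) become NaN, which turns the new column into floats.

```python
import pandas as pd

# made-up data with one missing value
toy = pd.DataFrame({"SEX": ["F", "M", "M", "F", None]})

# same [category, code] pairs as in the call above, turned into a lookup table
codes = dict([["F", 0], ["M", 1]])

# add the codes as a new column and keep the original one
toy["SEX_num"] = toy["SEX"].map(codes)
print(toy)
```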

Plotting the relationship between some columns

You can plot the relationship between columns using the relation function. It returns one or two heatmaps depending on the input: for categorical data, a heatmap of chi-square values; for numerical data, a heatmap of correlations.
Here, we drop a column flagged for too many null values, use the data_selection function to select the columns to plot, and then plot the relationship heatmaps.

# dropping columns with too many null values
# note: each pass starts again from dm_uncleaned, so only the last flagged
# column (ADTHAUT in the output above) ends up dropped; the columns selected
# below therefore stay available
for null in too_null_1[0]:
    dm_no_too_null = dm_uncleaned.drop(null[0], axis=1)

# choosing columns to plot
dm_no_too_null = da.data_selection(keep_col=["STRATA1", "ARMCD", "STRATA2", "SEX_num", "EOSDY",
                                             "REGION1", "ACTARM", "DCSREAS", "DTHADY", "BMRKR1",
                                             "DTHDT", "DTHFL", "SEX", "COUNTRY", "LDDTHELD",
                                             "ETHNIC", "DTHCAT", "DTHCAUS", "AGE"],
                                              input_data=dm_no_too_null)
# drawing the relationship plot 
vis.relation(dm_no_too_null, path='.')

_images/97bd2fdc0bb8daad97344a0965c210a725b77060f4481778e0e8f0f122de3bfb.png

_images/ad9511884ea19c1464cbde72b9698d0dc7687e9a0f7d883a5e74a9cb93b216c3.png

[[<Figure size 1400x1400 with 2 Axes>, <AxesSubplot:>], <AxesSubplot:>]
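To show what a single cell of each heatmap measures, here is a minimal pure-Python sketch of a chi-square statistic for two categorical columns and a correlation coefficient for two numerical columns. The data are made up, the helper names are hypothetical, and the use of Pearson's correlation is an assumption; the page does not say which correlation measure the relation function uses.

```python
from collections import Counter
from math import sqrt


def chi_square(xs, ys):
    """Pearson chi-square statistic for two categorical sequences."""
    n = len(xs)
    observed = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n   # count expected under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two numerical sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# made-up categorical columns
sex = ["M", "M", "F", "F", "M", "F", "M", "F"]
arm = ["A", "A", "B", "B", "A", "B", "B", "A"]
chi2 = chi_square(sex, arm)

# made-up numerical columns with an exact linear relationship
age = [30, 35, 40, 45, 50]
marker = [3, 5, 7, 9, 11]
r = pearson_r(age, marker)

print(chi2, r)
```

A larger chi-square value indicates a stronger departure from independence between two categorical columns, while a correlation close to 1 or -1 indicates a strong linear relationship between two numerical columns.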