Examples 2

Some additional examples to demonstrate the use of the HYPEHD package. The test dataset is open source data from https://github.com/insightsengineering/scda.2022 website.

%%capture
%pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple hypehd

Imports

from hypehd import visualization as vis
from hypehd import data_manipulation as da
import hypehd

Reading the data

# read into dataframe dm from package
my_file = hypehd.PACKAGEDIR / 'data' / 'demographic.csv'
dm=da.read("csv", my_file)
dm.head()
Unnamed: 0 STUDYID USUBJID SUBJID SITEID AGE AGEU SEX RACE ETHNIC ... DCSREAS DTHDT DTHCAUS DTHCAT LDDTHELD LDDTHGR1 LSTALVDT DTHADY ADTHAUT study_duration_secs
0 1 AB12345 AB12345-CHN-3-id-128 id-128 CHN-3 32 YEARS M ASIAN HISPANIC OR LATINO ... DEATH 2022-03-06 ADVERSE EVENT ADVERSE EVENT 22.0 <=30 2022-03-06 1106.0 Yes 63113904
1 2 AB12345 AB12345-CHN-15-id-262 id-262 CHN-15 35 YEARS M BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO ... NaN NaN NaN NaN NaN NaN 2022-03-17 NaN NaN 63113904
2 3 AB12345 AB12345-RUS-3-id-378 id-378 RUS-3 30 YEARS F ASIAN NOT HISPANIC OR LATINO ... NaN NaN NaN NaN NaN NaN 2022-03-11 NaN NaN 63113904
3 4 AB12345 AB12345-CHN-11-id-220 id-220 CHN-11 26 YEARS F ASIAN NOT HISPANIC OR LATINO ... NaN NaN NaN NaN NaN NaN 2022-03-26 NaN NaN 63113904
4 5 AB12345 AB12345-CHN-7-id-267 id-267 CHN-7 40 YEARS M ASIAN NOT HISPANIC OR LATINO ... NaN NaN NaN NaN NaN NaN 2022-03-15 NaN NaN 63113904

5 rows × 57 columns

Checking data null bias

You will need to check the bias in your data sometimes and some forms of bias checking are to be done manually but our package offers two forms of automated bias checking. To find columns with too much missing data and to check to see if the distribution of discrete data is as intended.
Here, we have checked the columns for too much missing data, in other words, columns that have too many null values. This is done using the check_bias function.

too_null_1 = da.check_bias(dm)
Columns with too many null values: 
 [['TRTEDTM', 73], ['TRT01EDTM', 73], ['TRT02SDTM', 73], ['TRT02EDTM', 73], ['AP01EDTM', 73], ['AP02SDTM', 73], ['AP02EDTM', 73], ['EOSDT', 73], ['EOSDY', 73], ['DCSREAS', 280], ['DTHDT', 330], ['DTHCAUS', 330], ['DTHCAT', 330], ['LDDTHELD', 330], ['LDDTHGR1', 330], ['LSTALVDT', 73], ['DTHADY', 330], ['ADTHAUT', 343]]

Dealing with too many null values

For this you can either remove the column with too many null values which is done later in this example file, or you can impute the missing data using the handle_null function which also gives the option to remove the rows with the missing values.
Here, we have removed the row with null values in columns that had too many nulls.

# make copy of unhandled version for future use
dm_uncleaned = dm.copy()

# remove rows with null values from columns that have too many nulls
for null in too_null_1[0]:
    dm = da.handle_null(dm, null[0], impute_type="remove")
dm.head()
Unnamed: 0 STUDYID USUBJID SUBJID SITEID AGE AGEU SEX RACE ETHNIC ... DCSREAS DTHDT DTHCAUS DTHCAT LDDTHELD LDDTHGR1 LSTALVDT DTHADY ADTHAUT study_duration_secs
0 1 AB12345 AB12345-CHN-3-id-128 id-128 CHN-3 32 YEARS M ASIAN HISPANIC OR LATINO ... DEATH 2022-03-06 ADVERSE EVENT ADVERSE EVENT 22.0 <=30 2022-03-06 1106.0 Yes 63113904
5 6 AB12345 AB12345-CHN-15-id-201 id-201 CHN-15 49 YEARS M ASIAN NOT HISPANIC OR LATINO ... DEATH 2022-02-22 ADVERSE EVENT ADVERSE EVENT 3.0 <=30 2022-02-22 1085.0 Yes 63113904
12 13 AB12345 AB12345-RUS-1-id-52 id-52 RUS-1 40 YEARS F ASIAN NOT HISPANIC OR LATINO ... DEATH 2022-02-20 DISEASE PROGRESSION PROGRESSIVE DISEASE 7.0 <=30 2022-02-20 1070.0 Yes 63113904
16 17 AB12345 AB12345-BRA-11-id-9 id-9 BRA-11 40 YEARS M ASIAN NOT HISPANIC OR LATINO ... DEATH 2022-03-20 DISEASE PROGRESSION PROGRESSIVE DISEASE 36.0 >30 2022-03-20 1091.0 Yes 63113904
18 19 AB12345 AB12345-CHN-15-id-245 id-245 CHN-15 34 YEARS F WHITE NOT HISPANIC OR LATINO ... DEATH 2022-02-18 DISEASE PROGRESSION PROGRESSIVE DISEASE 1.0 <=30 2022-02-18 1057.0 Yes 63113904

5 rows × 57 columns

Adding categories for a numerical column

You can change unumerical values to categorical ones useing the numeric_to_categorical function. Here we have changed AGE into categories. You can replace the previous column or create a new column for your categorical data, we have used True to indicate a new column called AGE_group.

dm = da.numeric_to_categorical(dm, 'AGE', [[30, '(,30]'], [35, '(30,35]'], 
                                        [40, '(35,40]'], [45, '(40,45]'],
                                        [50, '(45,50]']], True)
dm.head()
Unnamed: 0 STUDYID USUBJID SUBJID SITEID AGE AGEU SEX RACE ETHNIC ... DTHDT DTHCAUS DTHCAT LDDTHELD LDDTHGR1 LSTALVDT DTHADY ADTHAUT study_duration_secs AGE_group
0 1 AB12345 AB12345-CHN-3-id-128 id-128 CHN-3 32 YEARS M ASIAN HISPANIC OR LATINO ... 2022-03-06 ADVERSE EVENT ADVERSE EVENT 22.0 <=30 2022-03-06 1106.0 Yes 63113904 (30,35]
5 6 AB12345 AB12345-CHN-15-id-201 id-201 CHN-15 49 YEARS M ASIAN NOT HISPANIC OR LATINO ... 2022-02-22 ADVERSE EVENT ADVERSE EVENT 3.0 <=30 2022-02-22 1085.0 Yes 63113904 (45,50]
12 13 AB12345 AB12345-RUS-1-id-52 id-52 RUS-1 40 YEARS F ASIAN NOT HISPANIC OR LATINO ... 2022-02-20 DISEASE PROGRESSION PROGRESSIVE DISEASE 7.0 <=30 2022-02-20 1070.0 Yes 63113904 (35,40]
16 17 AB12345 AB12345-BRA-11-id-9 id-9 BRA-11 40 YEARS M ASIAN NOT HISPANIC OR LATINO ... 2022-03-20 DISEASE PROGRESSION PROGRESSIVE DISEASE 36.0 >30 2022-03-20 1091.0 Yes 63113904 (35,40]
18 19 AB12345 AB12345-CHN-15-id-245 id-245 CHN-15 34 YEARS F WHITE NOT HISPANIC OR LATINO ... 2022-02-18 DISEASE PROGRESSION PROGRESSIVE DISEASE 1.0 <=30 2022-02-18 1057.0 Yes 63113904 (30,35]

5 rows × 58 columns

Plotting the distribution of a demographic feature

You can create a pie chart to show the distribution a discrete variable using the pie function. If you want to save it, specify the path. If you use ”.” it will be saved in the same directory as your code file.

vis.pie(df = dm, col = "AGE_group",path = ".", name = "age_pie_chart")
([<matplotlib.patches.Wedge at 0x7fa3ddc85650>,
  <matplotlib.patches.Wedge at 0x7fa3ddc85ed0>,
  <matplotlib.patches.Wedge at 0x7fa3ddc9e750>,
  <matplotlib.patches.Wedge at 0x7fa3ddc9e450>,
  <matplotlib.patches.Wedge at 0x7fa3dbc2b790>],
 [Text(0.5320908144466621, 0.9627457427490853, '(35,40]'),
  Text(-1.084458100654201, 0.18425696167440575, '(,30]'),
  Text(-0.30451921808270677, -1.0570090093363902, '(30,35]'),
  Text(0.7778173500684048, -0.7778175685419846, '(40,45]'),
  Text(1.084458098497778, -0.18425697436619265, '(45,50]')],
 [Text(0.2902313533345429, 0.525134041499501, '33.9%\n19'),
  Text(-0.5915226003568368, 0.10050379727694858, '26.8%\n15'),
  Text(-0.16610139168147642, -0.57655036872894, '19.6%\n11'),
  Text(0.42426400912822076, -0.42426412829562793, '14.3%\n8'),
  Text(0.5915225991806061, -0.10050380419974143, '5.4%\n3')])
_images/7ee2a5b0445f7d616ed09182d6dc44c7fce7d90de7509f92871c963293ef96a6.png

Plotting a boxplot grid

You can plot a boxplot grid of either three features (one numeric and two categorical) or of all numeric features using the boxplot_grid function.
Here we have plotted AGE by SEX and COUNTRY.

vis.boxplot_grid(dm, col1='COUNTRY', col2='SEX', col3='AGE')
/home/docs/checkouts/readthedocs.org/user_builds/hypehd/envs/latest/lib/python3.7/site-packages/seaborn/axisgrid.py:712: UserWarning: Using the boxplot function without specifying `order` is likely to produce an incorrect plot.
  warnings.warn(warning)
_images/d1ac19be715c56e63d95d457acb7a9306d7705f17e62cd4ea3898e5b4d7fffde.png
<seaborn.axisgrid.FacetGrid at 0x7fa3db3e6290>

Generate demographic plots

There is another function can help users exploring the distribution and description statics of both continuous and discrete variables simultaneously using demo_graph() function. Here generates plots of AGE, SEX by different treatment group.

vis.demo_graph(var=["AGE", "SEX"], input_data=dm, group='TRT01P')
([<Figure size 1500x1000 with 1 Axes>, <Figure size 1500x1000 with 1 Axes>],
 [<AxesSubplot:title={'center':'Plot and summary table for Age'}, ylabel='AGE'>,
  <AxesSubplot:title={'center':'Plot and summary table for Sex'}, ylabel='SEX'>])
_images/ba11cd0a503a72d9090ba48cbeb1e356f67aed82de6568b42997a448f8f38a0e.png _images/575137907b58f348f9cdebc9bb6e9ab71c2fc680f4a0e4b9b7571f892dcb004d.png

Plotting 3D clusters

You can find and plot the clusters between three numerical features using the cluster_3d function and you can choose your prefered clustering method out of the options available.
Here, we have plotted a three dimentional cluster graph of AGE, BMRKR1 and study_duration_secs using k-means clustering.
If you only want a 3D graph without the clustering, you can use the graph_3d function.

vis.cluster_3d(df=dm_uncleaned, cols=['AGE','BMRKR1','study_duration_secs'],
               lab1 = 'Age', lab3='Study duration (s)', c_type="k-means", legend=True)
(<Figure size 1200x1200 with 1 Axes>,
 <Axes3DSubplot:xlabel='Age', ylabel='BMRKR1'>)
_images/4d79f0052d12b58c8f2c246f0838fe90c94df6d4bf60403539304ca652d9f87b.png

Plotting 2D clusters

For two dimentional clustering you can use the cluster_2d function in a similar fashion to the three dimentional function.
We have plotted a two dimentional cluster graph of AGE and BMRKR1 with DBSCAN clustering.

vis.cluster_2d(df=dm_uncleaned, cols=['AGE','BMRKR1'], c_type="dbscans", min_sample=10)
(<Figure size 1200x1200 with 1 Axes>,
 <AxesSubplot:xlabel='AGE', ylabel='BMRKR1'>)
_images/70a45a3e20944138deae9517f69d42ad84d4eb13ae075ffe51dceddad896927f.png

Change data type

You can change a column’s data type between String, Integer and Float types using the change_type function.
Here we have changed BMRKR1 from float to int.

dm_uncleaned = da.change_type(dm_uncleaned, 'BMRKR1', int)

Add a numeric coded column from a categorical column

The categorical_to_numeric function changes numerical data columns to categorical ones. Here we changed the SEX column in the dm_uncleaned dataset to 0 and 1 and added the values into a separate column (SEX_goup) by setting add to True, if we use False it will replace the previous values within the SEX column.

dm_uncleaned = da.categorical_to_numeric(dm_uncleaned, 'SEX', [['F', 0],['M', 1]], True)

Plotting the relationship between some cloumns

You can plot the relationship between columns using the relation function. It will return one or two heatmaps depending on the input. For categorical data, it will plot a heatmap of chi-square values. For numerical data, it will show the correlation.
Here, we have removed columns with too many null values and have used the data-selection function to select some columns to plot and then plotted the relationship heatmaps.

# removing all columns with too many null values
for null in too_null_1[0]:
    dm_no_too_null = dm_uncleaned.drop(null[0], axis=1)

# choosing columns to plot
dm_no_too_null = da.data_selection(keep_col=["STRATA1", "ARMCD", "STRATA2", "SEX_num", "EOSDY",
                                             "REGION1", "ACTARM", "DCSREAS", "DTHADY", "BMRKR1",
                                             "DTHDT", "DTHFL", "SEX", "COUNTRY", "LDDTHELD",
                                             "ETHNIC", "DTHCAT", "DTHCAUS", "AGE"],
                                              input_data=dm_no_too_null)
# drawing the relationship plot 
vis.relation(dm_no_too_null, path='.')
_images/97bd2fdc0bb8daad97344a0965c210a725b77060f4481778e0e8f0f122de3bfb.png _images/ad9511884ea19c1464cbde72b9698d0dc7687e9a0f7d883a5e74a9cb93b216c3.png
[[<Figure size 1400x1400 with 2 Axes>, <AxesSubplot:>], <AxesSubplot:>]