CTMeth - Convolutional-Transformer Methylation Analysis Framework

Introduction

CTMeth is a Python library developed as part of the dissertation "ANALIZA METYLACJI SEKWENCJI CPG W OPARCIU O UCZENIE MASZYNOWE I SIECI NEURONOWE" ("Analysis of CpG sequence methylation based on machine learning and neural networks") by Tomasz Falgowski. It requires primary data that have already been processed from raw IDAT files into β values stored in a comma-separated values (CSV) file. Analysis parameters are set in a separate YAML file, a human-readable data-serialization format commonly used for configuration, which keeps analyses easy to set up and reproducible.

The input to the algorithm is a table of β values in which rows represent individual CpG sequences and columns correspond to samples. The YAML file specifies how the samples are divided into a control group and an experimental group. Using the neural network, the script classifies each CpG sequence, separately for the control and experimental groups, into one of three numeric labels (0, 1, 2). The labels correspond to three methylation states: hypermethylated, hypomethylated, and undefined (partially methylated). The 'undefined' label is assigned to CpG sequences whose methylation status in a group cannot be clearly determined.

In the next step, sequences that are classified differently in the control and experimental groups are selected according to one of two variants, CTMeth-hh and CTMeth-hhi. The first variant (CTMeth-hh) selects CpGs whose labels differ between hypermethylation and hypomethylation and excludes those assessed as undefined. The second variant (CTMeth-hhi) also includes sequences for which the label 'undefined' was assigned in one of the groups, e.g., a hypermethylated control group and an undefined experimental group.

The output contains the selected CpG sequences, the β values of the individual samples, and the confidence of the network, i.e., a parameter describing how well the data fit a given category according to the developed algorithm. Data obtained in this way can easily be subjected to further analysis using other methods, such as those provided by the additional modules of the CTMeth library.
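The selection rule behind the two variants can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the library's code: the function name select_cpgs, the dictionary layout, and the use of 2 as the integer standing for 'undefined' are assumptions made for the example, since the text does not state which integer maps to which state.

```python
# Hypothetical label encoding, assumed for this sketch only.
HYPER, HYPO, UNDEFINED = 0, 1, 2

def select_cpgs(control_labels, experimental_labels, variant="hh"):
    """Return CpG ids whose labels differ between the two groups.

    variant="hh":  keep only hyper/hypo disagreements (no 'undefined').
    variant="hhi": also keep pairs in which one group is 'undefined'.
    """
    if variant not in ("hh", "hhi"):
        raise ValueError("variant must be 'hh' or 'hhi'")
    selected = []
    for cpg, ctrl in control_labels.items():
        expt = experimental_labels[cpg]
        if ctrl == expt:
            continue  # same state in both groups: nothing to report
        has_undefined = UNDEFINED in (ctrl, expt)
        if variant == "hh" and has_undefined:
            continue  # the hh variant ignores undefined calls
        selected.append(cpg)
    return selected

control = {"cg01": HYPER, "cg02": HYPO, "cg03": HYPER, "cg04": UNDEFINED}
experimental = {"cg01": HYPO, "cg02": HYPO, "cg03": UNDEFINED, "cg04": UNDEFINED}

print(select_cpgs(control, experimental, "hh"))   # ['cg01']
print(select_cpgs(control, experimental, "hhi"))  # ['cg01', 'cg03']
```

In this toy input, cg03 is hypermethylated in the control group and undefined in the experimental group, so it is reported only by the hhi variant.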

Installation and demo

  1. Download the Archive: Begin by downloading CTMeth.zip from the provided link.

  2. Extract the Archive: Once the download is complete, unzip the archive to access the contained files.

  3. Execute the Demo Script: Navigate to the extracted folder and run demo.py, which demonstrates the basic syntax and functionality of the library.

  4. Customize Settings: Adjust example_settings.yaml, or use your own settings file with your own script, according to your specific requirements. Detailed guidance on configuration and usage options is available below.

Requirements

Before using the library, make sure that the dependencies listed in requirements.txt are installed. The code was tested with PyTorch 1.13.1+cu117 on Windows 10 and should also work on Ubuntu 20.04 LTS.

pip install -r requirements.txt

Functions

evaluate()

Evaluation

This is the primary function, designed to facilitate methylation analysis.

Requirements

This function requires the user to provide a path to the settings file, which contains all necessary parameters, including input and output data. For more information on settings, please refer to the documentation below.
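As a rough illustration of the information the settings file has to carry, the sketch below reads a small β-value table (CpGs in rows, samples in columns) and splits its columns into a control and an experimental group. The group lists are written inline here in place of the YAML file; the function name, sample names, and β values are invented for the example, and the real setting keys are those used by the library's own settings files.

```python
import csv
import io

# Toy β-value table: rows are CpG sites, columns are samples (values invented).
BETA_CSV = """cpg,S1,S2,S3,S4
cg_a,0.91,0.88,0.12,0.15
cg_b,0.10,0.14,0.11,0.09
"""

# Stand-in for the group split that the YAML settings file would provide.
groups = {"control": ["S1", "S2"], "experimental": ["S3", "S4"]}

def split_by_group(csv_text, groups):
    """Return {group: {cpg: [β values of that group's samples]}}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    result = {name: {} for name in groups}
    for row in reader:
        for name, sample_ids in groups.items():
            result[name][row["cpg"]] = [float(row[s]) for s in sample_ids]
    return result

tables = split_by_group(BETA_CSV, groups)
print(tables["control"]["cg_a"])       # [0.91, 0.88]
print(tables["experimental"]["cg_a"])  # [0.12, 0.15]
```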

Additional Options

Additionally, users have the flexibility to:

  1. Utilize different neural network states

  2. Adjust training settings (if applicable) for specific state configurations

Example result:

| cpg | Label A | Label B | A confidence | B confidence | Mean confidence | Confidence sum | GSM4056740 | GSM4056718 | GSM4056710 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cg00822007 | 1 | 0 | 8.100913047790527 | 1.3944158554077148 | 4.747664451599121 | 9.495328903198242 | 0.339802 | 0.06109 | 0.123447 |
| cg01944137 | 1 | 0 | 9.891323 | 7.772444725036621 | 8.831884384155273 | 17.663768768310547 | 0.034067 | 0.040849 | 0.062983 |
| cg01792749 | 1 | 0 | 5.139315128326416 | 2.089003562927246 | 3.614159345626831 | 7.228318691253662 | 0.092127 | 0.4002287579999999 | 0.046828 |
| cg01752041 | 1 | 0 | 7.571489334106445 | 8.632214546203613 | 8.101852416992188 | 16.203704833984375 | 0.4304835829999999 | 0.065578 | 0.322883 |
| cg00743717 | 0 | 1 | 1.903188 | 1.9817233085632324 | 1.942455530166626 | 3.884911060333252 | 0.761593 | 0.795436 | 0.452464 |
| cg02243276 | 1 | 0 | 9.824440956115723 | 8.682537078857422 | 9.253488540649414 | 18.506977081298828 | 0.033265 | 0.041007 | 0.4590376379999999 |
| cg01151584 | 1 | 0 | 6.436424732208252 | 7.284523963928223 | 6.860474586486816 | 13.720949172973633 | 0.05857 | 0.042534 | 0.047261 |
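In the example rows, "Confidence sum" equals A confidence plus B confidence, and "Mean confidence" is half of that, to within roughly 1e-6. The short check below confirms this for three of the rows and shows one possible way of ranking results by mean confidence. The tuple layout and variable names are assumptions of this sketch, not part of the library.

```python
import math

# (cpg, A confidence, B confidence, mean confidence, confidence sum),
# copied from three rows of the example result.
rows = [
    ("cg00822007", 8.100913047790527, 1.3944158554077148,
     4.747664451599121, 9.495328903198242),
    ("cg00743717", 1.903188, 1.9817233085632324,
     1.942455530166626, 3.884911060333252),
    ("cg02243276", 9.824440956115723, 8.682537078857422,
     9.253488540649414, 18.506977081298828),
]

for cpg, conf_a, conf_b, mean_conf, conf_sum in rows:
    # The tolerance absorbs small rounding differences in the stored values.
    assert math.isclose(conf_a + conf_b, conf_sum, abs_tol=1e-5)
    assert math.isclose((conf_a + conf_b) / 2, mean_conf, abs_tol=1e-5)

# Rank CpGs from highest to lowest mean confidence.
ranked = sorted(rows, key=lambda r: r[3], reverse=True)
print([r[0] for r in ranked])
# ['cg02243276', 'cg00822007', 'cg00743717']
```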

training()

Advanced Usage

This function is intended for experienced users who wish to utilize advanced features.

Retraining and Fine-Tuning

This tool can be used for retraining or fine-tuning neural networks. By modifying the training_settings.yaml file, users can adjust basic parameters of the training process.

Current Limitations

At present, this function only supports training with synthetic generators. Future development is planned to expand its capabilities.
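To give a feel for what a synthetic generator might produce, the sketch below draws β values for the three methylation states from Beta distributions using only the Python standard library. The shape parameters, state names, and function name are illustrative assumptions; they are not taken from distribution_generator.py, whose actual distributions may differ.

```python
import random
from statistics import mean

# Hypothetical Beta(a, b) shape parameters per methylation state.
STATE_SHAPES = {
    "hypermethylated": (8.0, 2.0),  # most mass near 1
    "hypomethylated": (2.0, 8.0),   # most mass near 0
    "undefined": (5.0, 5.0),        # mass centred around 0.5
}

def synthetic_betas(state, n_samples, rng):
    """Draw n_samples β values in [0, 1] for one CpG in the given state."""
    a, b = STATE_SHAPES[state]
    return [rng.betavariate(a, b) for _ in range(n_samples)]

rng = random.Random(0)  # fixed seed so that the example is reproducible
samples = {state: synthetic_betas(state, 500, rng) for state in STATE_SHAPES}
for state, values in samples.items():
    print(f"{state}: mean β = {mean(values):.2f}")
```

Labelled draws of this kind could serve as training examples for a classifier, because the true state of every generated CpG is known by construction.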

 

hiercluster()

Hierarchical Clustering Module

This module provides hierarchical clustering functionality.

Usage

This module should be used after evaluation to further analyse and visualize the results.

Future Development

Future development plans include expanding the capabilities of this module to accommodate custom data inputs.

Output

The module generates hierarchical clustering heatmaps and calculates various scores, including:

  1. Rand index

  2. Mutual information-based scores

  3. Homogeneity

  4. Completeness

  5. V-measure

  6. Fowlkes-Mallows score

Example result:

[Image: example hierarchical clustering heatmap]

| Method | Values |
| --- | --- |
| Real labels | [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1] |
| Cluster labels | [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0] |
| Cophenetic correlation distance | 0.8965135190572778 |
| Rand index | 1 |
| Adjusted Rand index | 1 |
| Mutual information score | 0.2711893730418441 |
| Adjusted MIS | 1 |
| Normalized MIS | 1 |
| Homogeneity | 1 |
| Completeness | 1 |
| V-measure | 1 |
| Fowlkes-Mallows score | 1 |
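The example shows why a raw mutual information score of about 0.27 can coexist with perfect adjusted scores: the cluster labels are the real labels with 0 and 1 swapped, so the two partitions are identical. The pure-Python sketch below recomputes the Rand index and the mutual information (in nats) for label vectors with 72 and 6 samples per group, as in the example output. The helper names are assumptions for illustration, and the module itself may compute these scores differently, for instance with a library such as scikit-learn.

```python
import math
from collections import Counter
from itertools import combinations

def rand_index(a, b):
    """Fraction of sample pairs on which two labelings agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def mutual_information(a, b):
    """Mutual information between two labelings, in nats."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (n_xy / n) * math.log(n * n_xy / (count_a[x] * count_b[y]))
        for (x, y), n_xy in joint.items()
    )

real = [0] * 72 + [1] * 6
clusters = [1] * 72 + [0] * 6  # same partition, label names swapped

print(rand_index(real, clusters))                    # 1.0
print(round(mutual_information(real, clusters), 4))  # 0.2712
```

For identical partitions the mutual information equals the entropy of the labeling, which is small here because the groups are very unbalanced; the adjusted and normalized variants rescale it to 1.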

pca2d()

2D Principal Component Analysis (PCA) Module

This module performs dimensionality reduction using 2D Principal Component Analysis.

Output Options

The module allows users to either display the results as a plot or generate an image file for further analysis and visualization.

Usage

This module is intended for use after evaluation, allowing users to gain deeper insights into their data.

Future Development

Future development plans include expanding the capabilities of this module to accommodate custom data inputs.
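For readers unfamiliar with the technique, the sketch below projects a small samples-by-CpGs matrix onto its first two principal components using power iteration with deflation, in plain Python. It is not the code in apca2d.py: the function name, the toy β values, and the choice of samples as observations are assumptions, and a real analysis would normally rely on a numerical library.

```python
import math

def pca_2d(rows, iterations=200):
    """Return (components, scores) for the first two principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the features (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(2):
        v = [1.0 / (i + 1) for i in range(d)]  # deterministic start vector
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Eigenvalue estimate (Rayleigh quotient), then deflate the matrix
        # so that the next pass converges to the following component.
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
        components.append(v)
    scores = [[sum(c[j] * v[j] for j in range(d)) for v in components]
              for c in centered]
    return components, scores

# Toy β values: three control-like and three experimental-like samples.
samples = [
    [0.90, 0.10, 0.50], [0.85, 0.15, 0.45], [0.95, 0.12, 0.55],
    [0.10, 0.90, 0.50], [0.15, 0.85, 0.52], [0.12, 0.88, 0.48],
]
components, scores = pca_2d(samples)
for pc1, pc2 in scores:
    print(f"{pc1:+.3f} {pc2:+.3f}")
```

With this toy input the first component separates the two groups: the first three samples receive scores of one sign and the last three of the other.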

Example result:

[Image: example 2D PCA plot]

pca3d()

3D Principal Component Analysis (PCA) Module

This module performs dimensionality reduction using 3D Principal Component Analysis.

Output Options

The module allows users to either display the results as a plot or generate an image file for further analysis and visualization.

Usage

This module is intended for use after evaluation, allowing users to gain deeper insights into their data by visualizing high-dimensional relationships in 3D space.

Future Development

Future development plans include expanding the capabilities of this module to accommodate custom data inputs, enabling users to tailor the analysis to their specific needs.

Example result:

[Image: example 3D PCA plot]

heatmap()

Simple Heatmap Module

This module generates a basic heatmap visualization.

Usage

The heatmap module is intended for use after evaluation, allowing users to visualize and explore relationships between variables.

Future Development

Future development plans include expanding the capabilities of this module to accommodate custom data inputs, enabling users to tailor the analysis to their specific needs.

 

download_databases()

Required Datasets

To use the tool, you will need to download the following datasets:

  1. Biogrid Dataset: This dataset provides essential information for the analysis.

  2. Methylation Annotation Dataset: This dataset contains annotations necessary for methylation analysis.

Downloading Required Databases

The download_databases() function downloads these required databases, so that the data needed for a successful analysis are available.

 

yaml_generator()

YAML Editor GUI

A simple graphical user interface (GUI) is available for editing YAML files. This tool allows users to modify and customize their settings without requiring extensive programming knowledge.

Please note that this GUI is designed specifically for editing YAML files. Widgets are generated automatically when a new YAML file is loaded, so the tool can also be used independently with other YAML files. However, this functionality is still under development.

paths (cpg-gene-gene-cpg module)

Module for Analyzing CpG Interactions

This module performs analysis based on the connections between CpGs, which are annotated to specific genes, and the interactions between those genes.

Example

For instance, if two CpGs (Cpg1 and Cpg2) are identified, they may be annotated to genes Gene1 and Gene2, respectively. If these genes interact with each other, the module can identify the pathways involved in this interaction.

This module provides insight into the relationships between CpGs, genes, and their interactions, supporting a deeper understanding of epigenetic regulation and its impact on biological processes.
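The idea can be sketched with two lookup tables: one mapping CpGs to their annotated genes and one listing interacting gene pairs. The function below returns a CpG-gene-gene-CpG path for every pair of CpGs whose genes interact. All identifiers, gene names, and the data layout are hypothetical; the actual module works on the downloaded annotation and Biogrid datasets and may search differently.

```python
from itertools import combinations

def cpg_gene_paths(cpg_to_genes, interactions):
    """Return a list of (cpg1, gene1, gene2, cpg2) paths.

    A path is reported when gene1 (annotated to cpg1) and gene2
    (annotated to cpg2) appear together in the interaction list.
    """
    # Store interactions as unordered pairs so that direction does not matter.
    pairs = {frozenset(p) for p in interactions}
    paths = []
    for cpg1, cpg2 in combinations(sorted(cpg_to_genes), 2):
        for gene1 in cpg_to_genes[cpg1]:
            for gene2 in cpg_to_genes[cpg2]:
                if frozenset((gene1, gene2)) in pairs:
                    paths.append((cpg1, gene1, gene2, cpg2))
    return paths

# Hypothetical annotation and interaction data.
cpg_to_genes = {"Cpg1": ["Gene1"], "Cpg2": ["Gene2"], "Cpg3": ["Gene3"]}
interactions = [("Gene2", "Gene1")]

print(cpg_gene_paths(cpg_to_genes, interactions))
# [('Cpg1', 'Gene1', 'Gene2', 'Cpg2')]
```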

Example result:

[Image: example short paths for cg00151768 (cg00151768-3-_short_paths)]

YAML Settings Description

The following YAML settings are used to configure the project and its output. These settings enable users to customize their project and output files according to their specific needs.

Heatmap Settings

Clustering Settings

PCA 2D Settings

PCA 3D Settings

Paths - cpg-gene-gene-cpg module settings

Methylation Annotation Dataset Settings

Biogrid Dataset Settings

Roadmap

  1. Integrating analysis of IDAT files

  2. Adding a proper GUI

  3. Further development of the tools and making the scripts as automated as possible

Files and folders structure

root
neural
utils
__init__.py
workflow_control.py
CTmeth_neural.py
CTMeth.pth
distribution_generator.py
SynMethDatasetLoader.py
CTMeth_neural_training.py
training_settings.py
training_settings.yaml
training_module.py
apca2d.py
apca3.py
heatmaper.py
hierclusterer.py
paths_generator.py
read_write_csv.py
yaml_operator.py

 

Software Components

The CTMeth methylation analysis framework consists of several software components, each with its own specific functions.

  1. workflow_control.py: This script contains the methods, and the calls to methods, used to manage the workflow of the methylation analysis framework.

  2. CTMeth_neural.py: This script contains the neural network model for CTMeth, a key component of the framework.

  3. CTMeth.pth: This is the default state of the CTMeth neural network model.

  4. hierclusterer.py: This script contains the hierarchical clustering module, which enables users to analyse and visualize their data.

  5. heatmaper.py: This script contains a simple heatmap module, allowing users to create visualizations of their data.

  6. apca3d.py and apca2d.py: These scripts contain PCA (Principal Component Analysis) modules for 3D and 2D dimensionality reduction, respectively.

  7. yaml_operator.py: This script provides basic tools for manipulating YAML files, which are used to store settings and configurations in the framework.

  8. read_write_csv.py: This script contains basic reading and writing tools for CSV files, which are used internally within the framework.

  9. paths_generator.py: This script contains the cpg-gene-gene-cpg module, which enables users to analyse the relationships between CpGs, genes, and their interactions.

  10. distribution_generator.py and SynMethDatasetLoader.py: These scripts are used to generate synthetic training data for the neural network model.

  11. training_settings.yaml: This file contains draft training settings for the framework.

  12. training_module.py: This module is responsible for training and retraining the neural network model.

These software components work together to provide a comprehensive methylation analysis framework, enabling users to analyse and visualize their data with ease.

General view of code interaction

CTmeth
int
workflow_control
evaluate
start - gui ToDO
heatmap
hiercluster
gen_paths
pca2d
pca3d
return_params
download_databases
yaml_generator
training
SynMethDataset


License:

MIT License

Copyright (c) 2023 Tomasz Falgowski

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Results mentioned in dissertation titled "ANALIZA METYLACJI SEKWENCJI CPG W OPARCIU O UCZENIE MASZYNOWE I SIECI NEURONOWE"

Here are links to the results mentioned in the dissertation titled "ANALIZA METYLACJI SEKWENCJI CPG W OPARCIU O UCZENIE MASZYNOWE I SIECI NEURONOWE"

 

 

Files List