Alternative splicing is an inherent gene regulatory mechanism that allows a single gene to code for a multitude of proteins. Some 95% of human genes are alternatively spliced, and splicing can produce many different exon combinations. Intron retention is a class of alternative splicing that has largely eluded our understanding. It occurs when an intron, a region normally spliced out of the transcript before proteins are constructed, is instead included in the final mRNA transcript, at times drastically modifying the resulting protein. There is currently no established method for identifying intron retention events. This project investigated candidate intron retention events to reveal hallmarks of the splicing process. RNA-seq samples from lymphoblastoid cell lines generated as part of the GEUVADIS population variation project were analyzed, and features indicative of intron retention events were explored. In particular, the interplay between the portion of the intron covered by alignments and the sequencing read coverage was characterized. Candidate features were then assembled into a multi-dimensional matrix, dimensionality was reduced through principal component analysis (PCA), and samples were clustered using the K-means implementation in scikit-learn in Python. This analysis revealed that read coverage and the portion of the intron aligned can be used to identify and differentiate intron retention events.
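The PCA and K-means step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the feature matrix is synthetic placeholder data rather than GEUVADIS measurements, the third feature (intron-to-exon coverage ratio) and the helper name cluster_introns are hypothetical, and the standardization step and the choice of two components and two clusters are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def cluster_introns(features, n_components=2, n_clusters=2, seed=0):
    """Standardize features, project them with PCA, then cluster with K-means.

    Hypothetical helper: the scaling step and default parameters are
    assumptions, not settings reported by the project.
    """
    # Put features on a common scale so read depth does not dominate the PCA.
    scaled = StandardScaler().fit_transform(features)
    # Reduce the feature matrix to a few principal components.
    projected = PCA(n_components=n_components).fit_transform(scaled)
    # Group introns in the reduced space.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(projected)
    return projected, labels


# Synthetic placeholder data (NOT real measurements). Columns:
#   0: fraction of the intron covered by alignments
#   1: mean read coverage over the intron
#   2: intron-to-flanking-exon coverage ratio (hypothetical extra feature)
rng = np.random.default_rng(0)
spliced = rng.normal(loc=[0.10, 2.0, 0.05], scale=[0.03, 0.5, 0.02], size=(50, 3))
retained = rng.normal(loc=[0.90, 30.0, 0.80], scale=[0.05, 5.0, 0.10], size=(50, 3))
features = np.vstack([spliced, retained])

projected, labels = cluster_introns(features)
print(projected.shape)
print(np.bincount(labels))
```

On real data the groups would not be this cleanly separated, so the number of clusters and components would need to be chosen by inspecting the explained variance and the cluster structure.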
Cover photo by Raymond Huffman