One class of data considered in support of the equivalence of a generic to a reference listed drug is the comparison of amino-acid chain distributions. Sequences of amino acids with certain molar ratio characteristics are used to explore novel approaches for comparing these distributions. Different similarity measures, such as the Tanimoto distance, can produce a similarity matrix comparing the sequences, and these measures will be compared based on their performance. Furthermore, we search for important characteristics (features) that produce a meaningful separation of the sequences into clusters. This can be accomplished using weighted sampling, K-means and self-organizing maps (SOM). Additionally, clustering can be explored by building probability profiles for sequences of fixed length. In all these cases, a population of thousands of peptide chains from a single simulation results in hundreds of thousands of residue sequences. Data cleaning and organizing, and pattern identification across these sequences of equal length, are computationally intensive and are carried out using string detection functions such as ‘str_detect’ from the R package ‘stringr’.
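As an illustration of how a similarity matrix of this kind might be built, the following is a minimal Python sketch rather than the R workflow described here. It assumes each sequence is represented by its set of overlapping subsequences of length k (the abstract does not specify the feature representation) and uses the Tanimoto coefficient, the size of the intersection divided by the size of the union. The function names `kmers`, `tanimoto` and `similarity_matrix` are hypothetical.

```python
from itertools import combinations


def kmers(seq, k=2):
    """Return the set of overlapping length-k subsequences of seq (assumed feature set)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def tanimoto(a, b):
    """Tanimoto coefficient of two sets: |a & b| / |a | b|."""
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty feature sets are identical.
        return 1.0
    return len(a & b) / len(union)


def similarity_matrix(seqs, k=2):
    """Symmetric matrix of pairwise Tanimoto similarities, as a list of lists."""
    feats = [kmers(s, k) for s in seqs]
    n = len(seqs)
    matrix = [[1.0] * n for _ in range(n)]  # each sequence is identical to itself
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = tanimoto(feats[i], feats[j])
    return matrix


if __name__ == "__main__":
    # Toy sequences for illustration only; not data from the study.
    for row in similarity_matrix(["AKEY", "AKEA", "YYYY"], k=2):
        print(row)
```

The corresponding Tanimoto distance is one minus the similarity. For hundreds of thousands of sequences the full pairwise matrix grows quadratically, so a sketch like this would only be practical on samples or after grouping identical sequences.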
When circumstances necessitate cleavage of the amino-acid sequences at a certain residue, efficient code is needed to investigate the properties of the distributions of the cleaved sequences and their molecular weights. The cleavage and sequencing of such immense data sets are handled efficiently by the ‘rstring’ and ‘Biostrings’ R packages and by storage containers such as ‘AAStringSet’. This group of functions also facilitates building empirical probability distributions of all unique amino-acid sequences of a specified length.
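The two operations described above, cleavage at a residue and an empirical distribution over subsequences of a specified length, can be sketched in plain Python as follows. This is an illustrative analogue only, not the ‘Biostrings’ implementation. It assumes that cleavage occurs immediately after each occurrence of the chosen residue, which the abstract does not state, and the function names `cleave_after` and `kmer_distribution` are hypothetical.

```python
from collections import Counter


def cleave_after(seq, residue):
    """Split seq after every occurrence of residue (assumed cleavage rule)."""
    fragments, start = [], 0
    for i, aa in enumerate(seq):
        if aa == residue:
            fragments.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        # Keep the trailing fragment that does not end in the cleavage residue.
        fragments.append(seq[start:])
    return fragments


def kmer_distribution(seqs, k):
    """Empirical probability of each unique length-k subsequence across seqs."""
    counts = Counter(
        s[i:i + k] for s in seqs for i in range(len(s) - k + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


if __name__ == "__main__":
    # Toy chains for illustration only; not data from the study.
    chains = ["AKEKYA", "KAYEK"]
    fragments = [f for chain in chains for f in cleave_after(chain, "K")]
    print(fragments)
    print(kmer_distribution(fragments, 2))
```

Fragments shorter than k contribute nothing to the distribution, so the probabilities are normalized over the subsequences actually observed.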
The performance of the different metrics will be assessed, and all approaches will be discussed in the context of using amino-acid sequence similarity to demonstrate bioequivalence between a complex-molecule drug and its generic version. Furthermore, computationally efficient pathways for handling such data sets will be addressed.