ABSTRACT
The extracellular matrix (ECM) is a complex meshwork of proteins that forms the scaffold of all tissues in multicellular organisms. It plays crucial roles in all aspects of life – from orchestrating cell migration during development, to supporting tissue repair. It also plays critical roles in the etiology or progression of diseases. To study this compartment, we have previously defined the compendium of all genes encoding ECM and ECM-associated proteins for multiple organisms. We termed this compendium the ‘matrisome’ and further classified matrisome components into different structural or functional categories. This nomenclature is now largely adopted by the research community to annotate ‘-omics’ datasets and has contributed to advance both fundamental and translational ECM research. Here, we report the development of Matrisome AnalyzeR, a suite of tools including a web-based application and an R package. The web application can be used by anyone interested in annotating, classifying and tabulating matrisome molecules in large datasets without requiring programming knowledge. The companion R package is available to more experienced users, interested in processing larger datasets or in additional data visualization options.
INTRODUCTION
The extracellular matrix (ECM) is a complex meshwork of proteins that forms the scaffold of all multicellular organisms (Hynes and Naba, 2012). It plays critical roles in all aspects of life – from orchestrating cell migration and differentiation during development (Dzamba and DeSimone, 2018; Walma and Yamada, 2020), to supporting tissue growth and repair. It also plays critical roles in the etiology or progression of diseases (Theocharis et al., 2019).
‘Omic’ technologies (e.g. transcriptomics, proteomics and glycomics) have emerged as powerful approaches to profile at large scale, and often in an unbiased manner, the biomolecular landscape of cell and tissue states. However, to extract meaningful information and generate novel hypotheses, we need to develop comprehensive annotations and analytical methods to mine these complex inputs. Hence, to study the ECM using ‘-omic’ technologies, we first needed a compendium of all potential ECM components. Using de novo sequence analysis and unique features of ECM proteins, such as the presence of a signal peptide and of characteristic protein domains and motifs (Gebauer and Naba, 2020; Naba et al., 2012b, 2016), we have predicted the ‘matrisome’ of multiple organisms, including human (Naba et al., 2012a), mouse (Naba et al., 2012a), zebrafish (Nauroy et al., 2018), fruit fly (Davis et al., 2019) and nematode (Teuscher et al., 2019). We further classified matrisome genes into: (1) ‘core matrisome’ genes, which are the genes encoding structural components of the ECM including ECM glycoproteins, collagens and proteoglycans, and (2) ‘matrisome-associated’ genes, which are the genes encoding non-structural components of the ECM that either share structural similarities with core matrisome components (we termed these ‘ECM-affiliated proteins’) or are capable of modulating the structure (‘ECM regulators’) or signaling (‘secreted factors’) functions of the ECM proper (Table 1).
The matrisome lists have been deployed via different platforms to support data analysis, including the Molecular Signature Database (Subramanian et al., 2005; https://www.gsea-msigdb.org/gsea/msigdb), the Zebrafish Information Network (Bradford et al., 2022; https://zfin.org/) and FlyBase, the database of Drosophila Genes and Genomes (Gramates et al., 2022; https://flybase.org/). Used to annotate transcriptomic datasets, these matrisome lists have contributed, for example, to help identify the diverse cell populations expressing ECM genes in health and diseases (Bergmeier et al., 2018; Etich et al., 2019; Nauroy et al., 2017; Pietilä et al., 2021; Wietecha et al., 2020) and to identify networks of ECM genes characteristic of disease stages or of prognostic value (Izzi et al., 2018). When used to annotate proteomic datasets, these lists have enabled the definition of the ECM composition of tissues and organs across the pathophysiological spectrum (Naba, 2023; Randles et al., 2017; Shao et al., 2023).
To facilitate the use of the matrisome classification, we previously developed a web application capable of handling human and murine proteomic datasets (Naba et al., 2017). The previous iteration required users to extensively format their input datasets to be amenable, which hindered its diffusion to recently growing methodologies such as single-cell RNA-seq (sc-RNA-seq). Here, we report the development of Matrisome AnalyzeR, an augmented suite of versatile tools that includes a web-based Shiny application (https://sites.google.com/uic.edu/matrisome/tools/matrisome-analyzer) and a companion R package (https://github.com/Matrisome/MatrisomeAnalyzeR). The new intuitive web-based application can be used by anyone to obtain the annotation, classification and tabulation of matrisome molecules from any -omic datasets (e.g. genomic, transcriptomic, proteomic). In the Matrisome AnalyzeR application, results appear on screen in seconds and change dynamically in response to user actions, through a user-friendly, point-and-click interface requiring no programming knowledge. The companion Matrisome AnalyzeR package is available to more advanced users interested in processing larger files (>100 MB) and provides additional data visualization options and possibilities for integration with more complex pipelines. In their current versions, the Matrisome AnalyzeR web application and R package are capable of processing data from the following organisms: Homo sapiens (human), Mus musculus (mouse), Danio rerio (zebrafish), Drosophila melanogaster (fruit fly) and Caenorhabditis elegans (roundworm).
RESULTS
The web-based Matrisome AnalyzeR Shiny application
Data input
Fig. 1A illustrates the data input process. Matrisome AnalyzeR can handle a variety of data files including tab- or comma-separated (.tsv, .txt, .csv, .tabular) files as well as raw skyline (.sky) proteomic files and R Data Serialization (.rds) files. File specifications include column headers and a size not exceeding 100 MB. If a data file exceeds this limit, we recommend using the Matrisome AnalyzeR package (see below). Importantly, thresholding (such as excluding peptides or proteins not meeting a given false-discovery rate or proteins detected with less than two peptides in proteomic datasets) should be performed prior to inputting datasets to Matrisome AnalyzeR.
To help users familiarize themselves with the functionalities of the Matrisome AnalyzeR app, we are providing a gallery of test files accessible via the Matrisome AnalyzeR page of the Matrisome Project website (https://sites.google.com/uic.edu/matrisome/tools/matrisome-analyzer). These test files are also available for download via the web-based Shiny application (Fig. 1A). These and additional examples are also included with the R package available on GitHub (https://github.com/Matrisome/MatrisomeAnalyzeR; see below).
For illustration purpose, we will use a label-free quantitative proteomic dataset adapted from one of our previous studies (Renner et al., 2022) and that includes, for each protein entry and sample, total spectral counts, unique spectral counts and unique peptide numbers (Table S1).
Upon file upload, Matrisome AnalyzeR will automatically recognize number format, but we encourage using dots, and not commas, for decimals and avoiding formatting thousands. Matrisome AnalyzeR will also automatically populate the first box with column headers. Users will be asked to select, from the next two drop-down menus, the column containing the molecule identifiers to be used for the annotation and the species of interest. The tool is currently designed to accept gene symbols, NCBI gene (formerly Entrez Gene) and UniProt IDs (The UniProt Consortium, 2023) for all species. Additionally, Matrisome AnalyzeR accepts Ensembl Gene IDs for human and murine datasets, ZFIN IDs for zebrafish datasets, FlyBase ID for Drosophila datasets, and WormBase and Common Gene Name for C. elegans datasets. In the eventuality that no identifiers map to the application's database, an error message will prompt users to review input choices. After input selection, users will then select the workflow to process their data. Help buttons have been implemented to further facilitate data input (Fig. 1A).
Data annotation
The ‘Annotate’ workflow annotates the input file with matrisome divisions (i.e. core matrisome, matrisome-associated or non-matrisome) and categories (i.e. ECM glycoproteins, collagens, proteoglycans, ECM-affiliated, ECM regulators, secreted factors or non-matrisome). The output provides a .csv file that corresponds to the original input file, where column A lists the identifiers used for the annotation in alphabetical order, column B, the ‘Annotated Matrisome Division’, and column C, the ‘Annotated Matrisome Category’ (Table S2). To help users identify the nature of non-matrisome components present in their samples, we are also providing Gene Ontology annotations on Cellular Components (GO:CC) as part of the annotation workflow (Table S2, column D).
The output table is also visible and browsable on the main page upon completion of the Annotate workflow (Fig. 1B). Users can customize the number of entries displayed in the main window and can search the table using the search box (Fig. 1B). In addition, the output includes a .pdf file with bar graphs (or ‘matribars’) representing the total numbers of matrisome molecules (e.g. genes and proteins) classified according to matrisome divisions and categories across the entire dataset (Fig. S1). These bar graphs are also displayed on the main page upon completion of the Annotate workflow (Fig. 1B). Note that the output can change dynamically in response to user actions, through the user-friendly point-and-click interface.
Data analysis
The ‘Annotate+Analyze’ workflow does the above and then tabulates and sums the content of each numerical column in the input by matrisome divisions and categories. Here, the output is a .csv file where each row corresponds to a matrisome classification, where column A lists ‘Matrisome Annotations’, and where each subsequent columns report the tabulation of the numerical data according to these annotations (Table S3). This workflow allows users to evaluate, at a glance, the relative ECM content of each of their samples (e.g. number of reads if inputting RNA-seq data or number of peptides or spectra if inputting proteomic data, as shown in the example provided in Table S1). Users can then input these data in other statistical analysis software or data visualization software to pursue their analysis.
We caution users that not all tabulations might be relevant: for example, the test file provided contains representative proteomic data, listing in addition to quantitative metrics (e.g. total spectrum count or exclusive spectrum count), the molecular mass of each protein or protein identification probabilities that are numerical values if no unit is appended (e.g. kDa or %) and, thus, are tabulated by Matrisome AnalyzeR.
Importantly, the Matrisome AnalyzeR application implements a strict session-specific data policy: data uploaded by users are neither stored in our server, nor can the data leak through sessions. User data are purged upon user disconnection or at session timeout.
The Matrisome AnalyzeR package
For users familiar with R programming and wishing to analyze larger datasets (>100 MB, the limit imposed on data upload to the web application) or interested in additional data visualization options, as well as in the possibility of integrating matrisome annotation and analysis with other existing analysis pipelines, we have developed the Matrisome AnalyzeR R package, available at https://github.com/Matrisome/MatrisomeAnalyzeR.
The Matrisome AnalyzeR GitHub repository includes all the functions required to run the data processing workflow from the annotations to the tabulations as described above for the web application. Additional data visualization options are available such as donut charts (‘matrirings’), polar bar charts (‘matristars’) and alluvial charts (‘matriflows’). Fig. 2A shows examples of these visualization options for the data provided in the test file (Table S1).
The GitHub repository also features additional case studies to demonstrate the breadth of Matrisome AnalyzeR. In a second example, we show how the Matrisome AnalyzeR package can be applied to the analysis of whole-exome sequencing data obtained from the cBioPortal (Cerami et al., 2012; Gao et al., 2013) to identify matrisome genes presenting a high mutation frequency across four different breast cancer subtypes. Processing of the dataset using Matrisome AnalyzeR results in donut charts representing, in our example, the 250 genes presenting the highest mutation frequencies for each breast cancer subtype and their classification into matrisome categories (Fig. 2B). By selecting the donut chart representation (or matriring), users can easily visualize the contribution of matrisome genes to the query and for example identify a switch from the predominantly ECM-affiliated-proteins-rich profile for invasive ductal carcinoma to a more ECM glycoproteins- and ECM regulators-rich profile for invasive lobular carcinoma (Fig. 2B).
In a third example, we used single-cell RNA-seq data obtained from 2700 single peripheral blood mononuclear cells publicly available from 10× Genomics and used in the Seurat tutorial (Stuart et al., 2019). Pre-processing of the datasets identified nine clusters corresponding to the following cell types: B cells, memory CD4T cells, naïve CD4T, CD8T cells, CD14+ monocytes, FCGR3A+ monocytes, NK cells, dendritic cells and platelets. Tying the Seurat pipeline into Matrisome AnalyzeR enables the computation of the average expression of each gene for each single cell cluster, and the display as polar bar charts (or matristars) allows users to easily visualize the different matrisome categories arranged in a polar coordinate system, with the differences between categories being visualized through the length of their segments and the height of their bars (Fig. 2C). Users can, at a glance, appreciate the differential matrisome gene expression pattern across the different cell clusters, with B cells having the lowest number of ECM expressed genes (Fig. 2C, left panel) and platelets expressing a larger number of ECM genes encoding proteins involved in clotting (Fig. 2C, right panel). The complete analyzed dataset is available on the home page of the GitHub repository.
Upon completion of the workflow, users can extend their data analysis by using the output of the matriannotate and matrianalyze workflows to conduct comparative statistical analysis using the programs of their choice.
DISCUSSION
The identification of genes or proteins belonging to the same functional compartment provides important information about the processes happening in cells and tissues and is a critical step in the analysis of large -omic datasets. Here, we report the deployment of a suite of versatile tools to annotate, classify and tabulate ECM molecules in a variety of -omic datasets. Our goal was to develop tools accessible to non-ECM and ECM specialists alike, as well as novice and experts in big data analysis.
The current Matrisome AnalyzeR is designed to process data generated on the matrisomes of the five organisms the Naba laboratory and collaborators have predicted. In recent years, others have predicted the avian (Huss et al., 2019), planarian (Cote et al., 2019; Sonpho et al., 2021) and bovine (Listrat et al., 2023) matrisomes. It is our goal to test the robustness of these predictions and evaluate their adoption by the scientific community. Should the number of -omic datasets on samples from these organisms increase, we will release augmented versions of Matrisome AnalyzeR to include these organisms as well.
Importantly, the field of ‘matrisomics’ has significantly expanded in recent years, and we and others have developed additional tools to mine matrisomic datasets (Naba, 2023), such as MatrixDB, the database reporting ECM component interactions (http://matrixdb.univ-lyon1.fr/; Berthollier et al., 2021; Clerc et al., 2019), MatriNet, the database designed to explore network-scale changes in the ECM in pathophysiological conditions (https://www.matrinet.org/; Kontio et al., 2022) and the ECM proteomics database MatrisomeDB (https://matrisomedb.org; Shao et al., 2023). It is our goal to deploy, in the future, releases of Matrisome AnalyzeR that will create output that can directly be input to such databases to further advance ECM research and accelerate ECM biomarker discovery efforts.
MATERIALS AND METHODS
Matrisome lists of model organisms
The list of matrisome genes for the following model organisms were retrieved from their original publications: Homo sapiens (Naba et al., 2012a), Mus musculus (Naba et al., 2012a), Danio rerio (Nauroy et al., 2017), Drosophila melanogaster (Davis et al., 2019) and Caenorhabditis elegans (Teuscher et al., 2019). The lists are also available via the Matrisome Project website at https://sites.google.com/uic.edu/matrisome. The original gene identifiers were programmatically used to derive other general (NCBI gene, formerly Entrez Gene, and UniProt IDs) and species-specific identifiers [Ensembl Gene IDs for human and murine datasets, ZFIN IDs for zebrafish (Bradford et al., 2022), FlyBase ID for drosophila (Gramates et al., 2022), and WormBase and Common Gene Name for C. elegans datasets (Davis et al., 2022)], using the annotation packages ‘org.Hs.eg.db’, ‘org.Mm.eg.db’, ‘org.Dr.eg.db’, ‘org.Ce.eg.db’ and ‘org.Dm.eg.db’. The retrieved IDs were finally manually reviewed and curated.
Input file format stipulation
The only formatting requirement to files uploaded to the Matrisome AnalyzeR application is that they should contain column headers in their top row. Matrisome AnalyzeR accepts tab- and comma-separated (.tsv, .txt, .csv, .tabular) as well as R Data Serialization (.rds) and proteomics Skyline (.sky) files, and can automatically recognize number format, although we encourage using dots for decimals and avoiding formatting thousands. The file size limit is 100 MB. If a file exceeds 100 MB, we recommend using the Matrisome AnalyzeR package. If processing files using the Matrisome AnalyzeR package, the input format is a data.frame; the function will stop and issue a warning otherwise.
Algorithms
The Matrisome AnalyzeR Shiny application and package are produced with the R Project for Statistical Computing and Shiny language (https://shiny.rstudio.com/), and share a common set of functions and ‘logic’. Users are expected to input a tabular dataset (typically, a high-throughput or -omic dataset) and identify a column with gene or protein identifiers and species information. Upon inputting the information, the first function of the pipeline (matriannotate) compares the input against a large database of matrisome annotations including gene symbols, NCBI gene (formerly Entrez Gene) and UniProt IDs for all species, as well as species-specific annotations such as Ensembl Gene IDs for human and murine datasets, ZFIN IDs for zebrafish datasets, FlyBase ID for Drosophila datasets, and WormBase and Common Gene Name for C. elegans datasets. Matching gene, protein or other ID are then enriched with matrisome divisions and categories (Naba et al., 2012a,b), and non-matching values are returned as ‘non-matrisome’.
The output is organized to have the gene, protein or ID in the first column, followed by the annotated matrisome divisions, annotated matrisome categories and the rest of the columns from the input file in their original order. This output is the base for the second function of the pipeline, matrianalyze, which takes in any numerical value in the dataset and sums them column-wise and by matrisome annotation. Note that formats not directly coercible (e.g. percentages) will be excluded. The result is a per-column (typically, per-sample) table of the quantity (e.g. number of reads, protein abundance and spectral counts) of any matrisome division and category across the entire dataset, which can be further used, for example, for statistical testing. The results from the matriannotate function are also the base for the graphical functions of both the application and package.
Output file format
In the Matrisome AnalyzeR web application, the output on screen comprises a graphical and a tabular part. The graphical part is a bar chart, internally produced with the library ggplot2 (https://github.com/tidyverse/ggplot2) and customized to apply the color codes assigned to matrisome divisions and categories independently of the molecule IDs and species.
The tabular part is a browsable, scrollable and searchable data table, internally produced with the library DT (https://rstudio.github.io/DT/). Upon completion of the matriannotate and/or matrianalyze functions, four download buttons appear in the navigation bar pointing to a single, zipped bundle including the tabular output in .csv format and the plot as a .pdf, or each of the outputs individually.
In the Matrisome AnalyzeR package, additional graphical functions are provided. These include donut charts (matrirings), polar bar chart (matristars) and Sankey/alluvial charts (matriflows). All graphs are internally produced with the library ggplot2 and with ggalluvial for matriflows. All graphical functions plot to the screen by default, but this behavior can be changed by setting the ‘print.plot’ parameter to FALSE. In this case, the underlying ggplot2 objects are returned instead, allowing further customization, integration with other pipelines, for example, printing to non-standard graphical devices. All tabular results are returned as data.frame.
Acknowledgements
The authors would like to thank all the members of the Izzi and Naba laboratories for their feedback on Matrisome AnalyzeR and Monica Bassignana (www.monicabassignana.com) for her help with data visualization and the preparation of the graphs presented in the manuscript.
Footnotes
Author contributions
Conceptualization: V.I., A.N.; Methodology: P.B.P., V.I., A.N.; Software: P.B.P., V.I., A.N.; Validation: J.M.C.; Formal analysis: V.I.; Resources: P.B.P., J.M.C., V.I., A.N.; Data curation: J.M.C., V.I., A.N.; Writing - original draft: P.B.P., J.M.C., V.I., A.N.; Writing - review & editing: A.N.; Visualization: J.M.C., V.I., A.N.; Supervision: V.I., A.N.; Project administration: A.N.; Funding acquisition: V.I., A.N.
Funding
This work was supported in part by the National Institutes of Health (U01HG012680, R21CA261642 and R01CA232517 to A.N.) and by a start-up fund from the Department of Physiology and Biophysics of the University of Illinois Chicago (A.N.). This research is connected to the DigiHealth-project, a strategic profiling project at the University of Oulu (V.I.) and the Infotech Institute (V.I., P.B.P.). The project is supported by the Academy of Finland (DECISION 326291 to V.I.), the Cancer Foundation Finland (V.I.), the Finnish Cancer Institute, and K. Albin Johansson Cancer Research Fellowship fund (V.I.). Open Access funding provided by National Institutes of Health. Deposited in PMC for immediate release.
Peer review history
The peer review history is available online at https://journals.biologists.com/jcs/lookup/doi/10.1242/jcs.261255.reviewer-comments.pdf.
References
Competing interests
The authors declare no competing or financial interests.