The three-dimensional organisation of the genome plays a crucial role in developmental gene regulation. In recent years, techniques to investigate this organisation have become more accessible to labs worldwide due to improvements in protocols and decreases in the cost of high-throughput sequencing. However, the resulting datasets are complex and can be challenging to analyse and interpret. Here, we provide a guide to visualisation approaches that can aid the interpretation of such datasets and the communication of biological results.
The organisation of chromatin within the nucleus is emerging as an important factor in gene regulation, with consequences for development and disease (Stadhouders et al., 2019). Genomic changes that disrupt this organisation can affect the expression of developmental transcription factors and thus are associated with developmental defects and congenital disease, including limb malformation disorders and Fragile X syndrome (Franke et al., 2016; Ibn-Salem et al., 2014; Lupiáñez et al., 2015; Spielmann et al., 2018; Sun et al., 2018). For example, mutations that affect the cohesin complex, one of the key factors determining genome organisation, cause Cornelia de Lange syndrome and Roberts syndrome (Banerji et al., 2017). These developmental disorders have symptoms ranging from limb malformations and cardiac defects to developmental delay and intellectual disability (Kline et al., 2018; Skibbens et al., 2013). Defects in the nuclear lamina, which acts as a scaffold for the nucleus and has roles in chromatin organisation, can also cause premature ageing syndromes (Vidak and Foisner, 2016). Disruption of normal genome organisation is also associated with cancer (Flavahan et al., 2016; Hnisz et al., 2016). In addition, many studies have shown that changes in gene expression or chromatin state can impact different features of genome organisation (Bonev et al., 2017; Dixon et al., 2015; Hug et al., 2017; Joshi et al., 2015; Le Dily et al., 2014; Lin et al., 2012). There is therefore a clear link between chromatin conformation and gene regulation, making it an important factor to consider when studying developmental processes.
The past two decades have seen the rapid development of methods to study this organisation at high resolution. While the original chromosome conformation capture (3C) method that was developed in 2002 (Dekker et al., 2002) allowed analysis of interactions between pairs of genomic regions, combining this approach with high-throughput sequencing (Hi-C) now allows researchers to study the conformation of entire genomes (Lieberman-Aiden et al., 2009). Additional variations on this method (see Box 1) include 4C-seq (Splinter et al., 2011), which detects all interactions of a reference region of interest, and enrichment approaches such as ChIA-PET and HiChIP, which allow analysis of all interactions associated with the binding of a particular target protein (Fullwood et al., 2009; Mumbach et al., 2016). More recently, techniques have also been developed to specifically investigate multi-locus contacts (Allahyar et al., 2018; Ay et al., 2015; Jiang et al., 2016; Olivares-Chauvet et al., 2016; Oudelaar et al., 2018; Quinodoz et al., 2018), and to infer chromatin conformation from locus colocalisation in ultrathin cryosectioned slices of nuclei (Beagrie et al., 2017). In parallel, advances in labelling approaches and super-resolution microscopy techniques are increasing the resolution and number of loci that can be studied using imaging approaches (Bintu et al., 2018; Cardozo Gizzi et al., 2019; Fraser et al., 2015; Nir et al., 2018; Ou et al., 2017).
Chromosome conformation capture (3C) techniques are based on the principle of fragmenting the genome using a restriction enzyme, then ligating the cut ends of DNA to each other. Regions of the genome that are in close proximity within the nucleus are ligated together, producing chimeric pieces of DNA that originate from multiple different restriction fragments. The frequency of a chimeric fragment between regions A and B is proportional to the frequency that A and B are in proximity across the population of input cells. These chimeric fragments can be measured in different ways. 3C uses PCR to amplify specific chimeric fragments corresponding to pairs of regions of interest (Dekker et al., 2002), while Hi-C measures all fragments using sequencing (Lieberman-Aiden et al., 2009). Approaches such as 4C-seq (Splinter et al., 2011) and capture Hi-C (Franke et al., 2016) measure interactions of one or more regions of interest with all other regions in the genome. 5C interrogates interactions between specific pairs of regions (Dostie et al., 2006), typically allowing high-resolution analysis of a specific region of the genome. Approaches such as ChIA-PET and HiChIP (Fullwood et al., 2009; Mumbach et al., 2016) first enrich for chimeric DNA fragments associated with proteins of interest, then assay all interactions associated with these factors using sequencing. Multi-locus contacts can also be identified by sequencing chimeric fragments that contain three or more restriction fragments (Allahyar et al., 2018; Ay et al., 2015; Jiang et al., 2016; Olivares-Chauvet et al., 2016; Oudelaar et al., 2018). Alternatively, SPRITE identifies multi-locus contacts by barcoding clusters of crosslinked chimeric fragments (Quinodoz et al., 2018), while GAM measures colocalisation in ultra-thin slices of nuclei (Beagrie et al., 2017).
Since the development of Hi-C in 2009, the cost of the sequencing required for high-resolution analysis using Hi-C-based approaches has dramatically decreased (Scott, 2016). Techniques such as capture Hi-C and 5C also make it possible to target specific genomic regions of interest (Dostie et al., 2006; Franke et al., 2016), further reducing sequencing costs. New techniques for single-cell and low-input Hi-C have allowed researchers to apply these methods to primary human tissue and to early embryos with very low cell numbers (Díaz et al., 2018; Du et al., 2017; Flyamer et al., 2017; Hug et al., 2017; Ke et al., 2017), allowing chromatin organisation in rare and transient developmental cell types or diseased cells to be studied. Overall, these advances are opening up the possibilities of chromatin conformation analysis to a wider range of researchers and biological questions.
However, the analysis and interpretation of these data are not straightforward. Although many analysis tools are available (reviewed by Ay and Noble, 2015; Pal et al., 2019), they require computational expertise and significant resources to handle large datasets. Formats for 3C data are not yet standardised, which limits the interoperability of tools and the ease of sharing and reusing data (Marti-Renom et al., 2018). Visualisation, which is a crucial step in the analysis, interpretation and communication of data, is also a challenge. Researchers use visualisation throughout all stages of a research project, from initial quality control to sharing results with collaborators and communicating them to others in the form of figures in presentations and publications (see Box 2). Importantly, inspecting data visualisations can reveal unexpected features of datasets and provoke new hypotheses, as well as allowing researchers to assess the presence of expected features and compare biological conditions.
Researchers use visualisation throughout all stages of a research project, and for different analysis tasks. This list is partially based on the task lists found in Lekschas et al. (2018) and Shneiderman (1996). Nusrat et al. (2019) provide a comprehensive overview of visualisation tasks and techniques for different types of genomic data.
1. Overview of the data
2. Zoom in to a locus of interest
3. Inspect known structural features
4. Search for novel features
5. Compare conditions
6. Integrate with other data types
7. Share with collaborators
8. Communicate results in presentations and publications
Visualisation of Hi-C datasets can be particularly challenging due to the large size of sequencing datasets and the aforementioned requirements for computational resources and expertise. Many tools for visualisation of genome organisation include pre-processed public datasets, but visualising one's own datasets is either not possible or requires additional computational processing and specialised file formats. An additional challenge of visualising 3C data is that the scale of genome organisation features varies widely: features of interest might range from kilobase-scale loops to inter-chromosomal translocations. Moreover, interaction frequency varies with both the size of these features and the genomic distances that they span.
In this Review, we describe and discuss various approaches for visualising datasets from sequencing-based 3C techniques (summarised in Tables 1 and 2). Although we focus on Hi-C data, many of these approaches are applicable to data from other 3C approaches. These datasets represent three-dimensional (3D) genome organisation, but they are typically visualised in two dimensions (2D) to allow easy viewing on a computer screen and incorporation into publications. We therefore focus on how 3D genome organisation data can be represented by 2D figures. These data can also be projected into 3D models; however, these are difficult for humans to interpret accurately, and virtual reality systems and interactive figures are not yet commonly used in academic publications. Therefore, we do not discuss the construction and visualisation of 3D models of genome organisation, which requires specialised tools and has been reviewed elsewhere (for example by Lin et al., 2019). Imaging approaches to study genome organisation, which are commonly used to look at features ranging from chromosome territories to specific looping interactions (Bintu et al., 2018; Nir et al., 2018; Szabo et al., 2018; Williamson et al., 2016), are also not covered; these techniques provide orthogonal data types to sequencing-based methods and come with their own visualisation methods and challenges.
First, we provide an overview of the key features of genome organisation that can be analysed using 3C approaches. We then describe different ways of visualising data produced by Hi-C and related techniques, and discuss the advantages and limitations of these approaches, highlighting the features and types of data each is especially suited to. Finally, we discuss future opportunities to improve tools and approaches in order to create visualisations that facilitate the biological interpretation of 3C data and the effective communication of results.
Features of genome organisation
Various imaging and 3C-based approaches have revealed that chromatin organisation within the nucleus is structured at several scales (Fig. 1), and that the structural features of genome organisation are largely conserved across species. These topics have been extensively reviewed elsewhere (e.g. by Eagen, 2018; Fraser et al., 2015; Yu and Ren, 2017), but will be briefly described here to provide context.
At the whole-genome scale, chromosomes are organised into chromosome territories (Fig. 1A; Cremer and Cremer, 2010). Inter-chromosomal and long-range intra-chromosomal interactions reflect preferential interactions between regions with similar chromatin states, segregating the genome into active (‘A’) and inactive (‘B’) compartments (Fig. 1B; Lieberman-Aiden et al., 2009). Continuous regions of the genome that are assigned to the same compartment tend to be hundreds of kilobases to multiple megabases long. These regions have also been further classified into two types of active and four types of inactive sub-compartments, which exhibit distinct patterns of histone modifications (Rao et al., 2014).
Locally enriched interactions can organise chromatin into self-interacting domains, i.e. those that exhibit stronger interactions within themselves than with their neighbouring regions (Fig. 1C). These types of domains have been referred to variously as topological domains, topologically associating domains (TADs), contact domains or physical domains (Dixon et al., 2012; Nora et al., 2012; Rao et al., 2014; Sexton et al., 2012). These domains are typically smaller than compartments, although some studies suggest that compartmentalisation and domains occur on similar length scales (Rowley et al., 2017; Schwarzer et al., 2017).
Enriched interactions between specific non-neighbouring regions of the genome, often referred to as loops, can also occur. These can either be specific contacts between genomic regions mediated by protein complexes (Fig. 1D) or can represent preferential colocalisation across a cell population rather than direct contact. However, both types of interaction appear as enriched peaks in heatmap representations of Hi-C data. In mammals, the anchors of loops detected in Hi-C data are frequently associated with binding of the architectural protein CCCTC-binding factor (CTCF) (Ong and Corces, 2014); in particular, they exhibit CTCF binding motifs arranged in a convergent orientation (Rao et al., 2014). While CTCF-CTCF loops can be tens to hundreds of kilobases, the motifs that define their anchor points are only 14 bp. In contrast, Polycomb domains, which are bound by the Polycomb repressive complexes PRC1 and PRC2, and are enriched for H3K27me3, can be tens of kilobases long and also show enriched interactions in Hi-C data (Denholtz et al., 2013; Eagen et al., 2017; Joshi et al., 2015; Rhodes et al., 2019 preprint; Schoenfelder et al., 2015). In addition, high-resolution Hi-C and 4C-seq analysis has identified interactions between regulatory elements and their target genes, which typically lie within the same domain. These interactions can be associated with Polycomb binding or paused transcriptional machinery and are stable over different developmental stages (Cruz-Molina et al., 2017; Ghavi-Helm et al., 2014), or are associated with cell type-specific transcription factor binding and correlate with gene activation (Bonev et al., 2017).
Regions of the genome that are close together on the same chromosome have a higher interaction probability than those that are distant, because they are tethered together by the chromatin fibre. Therefore, features that span a shorter genomic distance, such as interactions within domains or at CTCF-CTCF loop peaks, are frequent in comparison with interactions over longer distances. Compartmentalisation interactions arise from the colocalisation of broad regions of active or inactive chromatin with other regions of a similar chromatin state. These regions can be far apart or on different chromosomes, and the pairs of regions that colocalise vary from cell to cell (Nagano et al., 2013; Stevens et al., 2017; Wang et al., 2016). Therefore, interaction frequencies between any two regions in the same compartment are low. However, it is important to note that typical 3C datasets represent average interactions across a population of hundreds of thousands or millions of cells. Imaging studies and single-cell Hi-C analyses suggest that even regions with a relatively high population Hi-C interaction frequency do not colocalise in all cells, depending on the distance threshold used to define colocalisation (Cattoni et al., 2017; Finn et al., 2017; Rao et al., 2014; Rhodes et al., 2019 preprint; Stevens et al., 2017).
Hi-C data are most often visualised as square matrices or heatmaps (Fig. 2A). In these visualisations, the x and y axes represent genomic position, and the bins of the heatmap represent the interaction frequency between each pair of genomic regions across a population of nuclei. Interaction frequency is mapped to colour. Interactions between neighbouring genomic regions produce a strong diagonal pattern. Domains appear as blocks of increased interaction frequency along the diagonal, while loops appear as defined off-diagonal enrichments. As the interaction matrix is symmetric, it is also common to visualise half of the matrix as a triangle or trapezoid (Fig. 2B). Heatmap-based visualisation of an entire genome or chromosome is frequently used to provide an overview of a dataset for quality control purposes (Box 2, visualisation task 1; Shneiderman, 1996), but is also an effective way of visualising specific regions of interest (Box 2, visualisation task 2). Heatmaps showing local regions are often used to visualise domains and loops (highlighted later in Fig. 4), while whole-genome or inter-chromosomal interaction heatmaps can highlight known genomic rearrangements (Díaz et al., 2018; Harewood et al., 2017) or features (Box 2, visualisation task 3), such as the Rabl configuration of chromosomes or the clustering of telomeres and centromeres in Plasmodium falciparum (Ay et al., 2014; Mizuguchi et al., 2015). Importantly, heatmap visualisation raises the possibility of discovering novel structural features of chromosome conformation data (Box 2, visualisation task 4), although visualisation parameter choices affect the genomic and interaction strength scale of features that can be detected.
Although heatmaps are popular because they are effective at different genomic scales, the choice of colour scale can strongly influence the visibility of different aspects of the data and influence the interpretation of the figure. As the interaction frequency of Hi-C data varies over orders of magnitude, a linear colour scale can only effectively show a subset of the range of interaction frequency. Depending on the choice of scale and the minimum and maximum values, details will be discernible nearer or further from the diagonal, but not both simultaneously (Fig. 2A,B). In contrast, a colour coding that scales with the log10 of the interaction frequency allows simultaneous inspection of features over a broader range of distances (Fig. 2C). For example, a logarithmic colour scale will highlight the ‘checkerboard’ pattern of compartmentalisation interactions.
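The effect of the colour scale can be illustrated with a minimal sketch that maps interaction counts onto the unit interval used by a colour map, once linearly and once on a log10 scale. The function names, the example counts and the chosen minimum and maximum are hypothetical and serve only as an illustration; plotting libraries provide equivalent normalisation options.

```python
import math


def linear_norm(value, vmin, vmax):
    """Map a value onto [0, 1] linearly, clipping to [vmin, vmax]."""
    clipped = min(max(value, vmin), vmax)
    return (clipped - vmin) / (vmax - vmin)


def log_norm(value, vmin, vmax):
    """Map a value onto [0, 1] on a log10 scale; vmin must be positive."""
    clipped = min(max(value, vmin), vmax)
    span = math.log10(vmax) - math.log10(vmin)
    return (math.log10(clipped) - math.log10(vmin)) / span


# Hypothetical counts at increasing distance from the diagonal.
counts = [10000, 1000, 100, 10, 1]
vmin, vmax = 1, 10000

linear = [round(linear_norm(c, vmin, vmax), 3) for c in counts]
logged = [round(log_norm(c, vmin, vmax), 3) for c in counts]

print("linear:", linear)
print("log10: ", logged)
```

On the linear scale, all but the largest counts are compressed into the bottom tenth of the colour range, whereas the logarithmic scale spreads each order of magnitude evenly across it.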
Another way to emphasise specific features of a dataset is to plot a transformation of the interaction frequency. For example, the expected interaction frequency depending on genomic distance can be calculated, and plotting the ratio of observed to expected interactions highlights both compartments and domains (Fig. 2D). Correlation matrices further emphasise compartmentalisation and are used to identify compartments (Lieberman-Aiden et al., 2009).
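As a rough sketch of the observed/expected transformation, the expected value at each genomic separation can be estimated as the mean of the corresponding matrix diagonal, and each bin is then divided by the expected value for its separation. This assumes a small, symmetric, single-chromosome matrix held as nested lists; the function names and the toy values are hypothetical, and published tools typically add matrix balancing and smoothing of the expected curve, which are omitted here.

```python
def expected_by_distance(matrix):
    """Mean interaction frequency for each bin separation (each diagonal)."""
    n = len(matrix)
    expected = []
    for d in range(n):
        values = [matrix[i][i + d] for i in range(n - d)]
        expected.append(sum(values) / len(values))
    return expected


def observed_over_expected(matrix):
    """Divide every bin by the expected value for its bin separation."""
    n = len(matrix)
    expected = expected_by_distance(matrix)
    oe = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            e = expected[abs(i - j)]
            # Leave bins at zero where no expectation can be computed.
            oe[i][j] = matrix[i][j] / e if e > 0 else 0.0
    return oe


# Toy symmetric matrix with one enriched neighbouring pair (bins 2 and 3).
toy = [
    [8.0, 3.0, 2.0, 1.0],
    [3.0, 8.0, 3.0, 2.0],
    [2.0, 3.0, 8.0, 6.0],
    [1.0, 2.0, 6.0, 8.0],
]

expected = expected_by_distance(toy)
oe = observed_over_expected(toy)
print(expected)
print(oe[2][3], oe[0][1])
```

Values above 1 mark bins that interact more often than other pairs at the same separation, which is why this transformation brings out features that the distance decay would otherwise dominate.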
A common task in visualisation is comparison between two or more conditions (Box 2, visualisation task 5). Multiple heatmaps are often displayed side by side for this purpose, although humans have a relatively poor ability to judge quantitative differences in colour intensity (Cleveland and McGill, 1987). Therefore, these may require additional annotations to draw the viewer's eye towards changes of interest. Alternatively, changes can be visualised by showing the log2 ratio or the difference in interactions between two datasets. This can serve to highlight differences in genome organisation between different biological conditions. However, these differences are difficult to interpret without context: an increase in interaction strength within a domain or between two neighbouring domains will likely have different interpretations. Therefore, visualisations should also show the untransformed interaction frequencies for one or both conditions.
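A log2 ratio matrix of the kind described above could be computed along the following lines. This is a sketch under stated assumptions: both matrices are taken to be already normalised to comparable sequencing depth, and the pseudocount, which avoids division by zero in empty bins, is an illustrative choice rather than a recommended value. The function name and the toy matrices are hypothetical.

```python
import math


def log2_ratio(treated, control, pseudocount=1.0):
    """Bin-wise log2((treated + pseudocount) / (control + pseudocount))."""
    n = len(control)
    return [
        [
            math.log2((treated[i][j] + pseudocount) / (control[i][j] + pseudocount))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy 2 x 2 example: the off-diagonal interaction increases in the treated sample.
control = [[7.0, 3.0], [3.0, 7.0]]
treated = [[7.0, 7.0], [7.0, 7.0]]

ratio = log2_ratio(treated, control)
print(ratio)
```

Positive values indicate bins with more interactions in the treated sample and negative values indicate fewer; as noted above, such a matrix is best shown alongside the untransformed interaction frequencies.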
An important step in visualisation and data interpretation is comparison of Hi-C data with other data types, such as transcription factor binding data, chromatin modification data and genomic annotations (Box 2, visualisation task 6). This enables the exploration of structural features in the context of previous knowledge, which helps to understand the roles that these features may play in biological processes. To achieve this, triangular heatmap representations can be used to place the diagonal of the matrix, where many features of interest are located, close to other linear genomic tracks. This makes it easier to compare Hi-C features with these tracks. However, both square and triangular heatmap representations take up large amounts of vertical space, which may explain their limited adoption by mainstream genome browsers (Goodstadt and Marti-Renom, 2017). Specialised genome browsers have been developed for viewing Hi-C datasets, such as the 3D Genome Browser (Wang et al., 2018), and ‘Hi-C-centric’ tools such as HiGlass and Juicebox have also been developed (Durand et al., 2016b; Kerpedjiev et al., 2018). In addition, other approaches to Hi-C visualisation (discussed below) can facilitate combination with linear genomic tracks.
Finally, as heatmaps use colour to encode interaction frequency, the type of colour scale used should be carefully chosen. Colour scales that are not perceptually uniform, such as rainbow colour scales, have been shown to introduce artificial transitions in data and should be avoided (Wong, 2010). The simplest approach is to choose a colour scale consisting of a single colour, where the saturation of the colour scales with interaction frequency (Gehlenborg and Wong, 2012). Colour scales incorporating multiple hues, such as the one used in Fig. 1, can also be effective. When choosing scales with multiple colours, one should consider the substantial proportion of the human population with colour vision deficiencies (Wong, 2011). Where multiple heatmaps are shown for direct comparison, they should have the same colour scale, with the same maximum and minimum values, to facilitate comparisons.
Quantitative linear tracks
Quantitative linear tracks are often used to display ChIP-seq read coverage, e.g. for CTCF, or chromatin accessibility, e.g. as measured by assays such as ATAC-seq (Fig. 3A,B), and this approach can also be applied to 3C-based data. 4C-seq and capture Hi-C datasets are often represented as linear tracks displaying the interaction frequency of a specific region of interest with other genomic regions (Cruz-Molina et al., 2017; Ghavi-Helm et al., 2014; Lupiáñez et al., 2015; Mifsud et al., 2015). In addition, there are multiple ways of calculating a linear track from a 2D Hi-C matrix to represent different features of the data. For example, compartmentalisation can be shown in a linear track by displaying the first eigenvector (representing the major axis of variation) of the correlation matrix calculated from Hi-C data normalised to expected interaction frequency (Fig. 3C). Domain organisation can be visualised by calculating an insulation score for each genomic region (Fig. 3D) (Crane et al., 2015), where the minima of the insulation score occur at domain boundaries. Other approaches for abstracting domain organisation from two dimensions to one dimension include the directionality index (Dixon et al., 2012). Another way to represent matrix data in a linear track is to display the interaction frequency of a single genomic region, effectively creating a one-versus-all representation or ‘virtual 4C’ from an all-versus-all matrix (Fig. 3E). This type of track can be used to detect looping interactions (Cruz-Molina et al., 2017; Ghavi-Helm et al., 2014; Mifsud et al., 2015), domain boundaries (Lupiáñez et al., 2015) or even genomic rearrangements (Díaz et al., 2018).
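Two of these linear tracks can be sketched in a few lines. The insulation calculation below is a simplified variant, not the exact published procedure: for each potential boundary it averages the interactions between a fixed number of bins upstream and downstream, then expresses the result as a log2 ratio to the mean score. The virtual 4C track is simply one row of the matrix. The function names, the window size and the toy matrix containing two domains are hypothetical.

```python
import math


def insulation_scores(matrix, window):
    """Mean interaction frequency across each potential boundary.

    Position i describes the boundary between bins i-1 and i. Positions
    too close to the matrix edge for a full window are left as None.
    """
    n = len(matrix)
    scores = [None] * n
    for i in range(window, n - window + 1):
        values = [
            matrix[a][b]
            for a in range(i - window, i)
            for b in range(i, i + window)
        ]
        scores[i] = sum(values) / len(values)
    return scores


def log2_normalise(scores):
    """Express each defined, positive score as log2(score / mean score)."""
    defined = [s for s in scores if s is not None]
    mean = sum(defined) / len(defined)
    return [
        math.log2(s / mean) if s is not None and s > 0 else None
        for s in scores
    ]


def virtual_4c(matrix, viewpoint):
    """One-versus-all profile: interactions of one bin with all others."""
    return list(matrix[viewpoint])


# Toy matrix: two 4-bin domains with strong internal and weak mutual contacts.
n_bins = 8
toy = [
    [10.0 if (i < 4) == (j < 4) else 1.0 for j in range(n_bins)]
    for i in range(n_bins)
]

raw = insulation_scores(toy, window=2)
normalised = log2_normalise(raw)
boundary = min(
    (i for i, s in enumerate(normalised) if s is not None),
    key=lambda i: normalised[i],
)
profile = virtual_4c(toy, viewpoint=1)
print(raw)
print(boundary, profile)
```

In this toy example the lowest score falls at the junction between the two domains. Repeating the calculation with a different window size would change which boundaries stand out, so the window is a parameter that should be reported with the track.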
An advantage of such linear tracks is that they can be viewed in a typical genome browser, such as the UCSC genome browser (Kent et al., 2002) or the Integrative Genomics Viewer (Thorvaldsdóttir et al., 2013), which facilitates interactive exploration of the data and integration with public datasets. This more compact representation also allows the simultaneous visualisation and comparison of multiple Hi-C datasets, such as those generated under different biological conditions. As human visual perception of size or area differences is more accurate than our judgement of differences in colour hue or saturation (Cleveland and McGill, 1987), linear tracks are better than heatmap representations for making quantitative comparisons.
However, it should be noted that linear tracks contain less information than heatmaps. Transforming a Hi-C matrix into a linear track requires choosing what type of track to calculate and, in some cases, what parameters to use. This has the potential to produce misleading results. For example, domains can be nested (Zufferey et al., 2018), such that calculating insulation scores using different window sizes would reveal different levels of domain organisation (Kruse et al., 2016). Moreover, a single linear track cannot show all features of a dataset. However, it is possible to simultaneously show multiple linear tracks representing different features or calculated with different parameters (as discussed below).
Discrete feature tracks
Discrete features, such as ChIP-seq peaks or gene annotations, are typically represented as blocks on a linear track. This approach is also used for some discrete features of genome organisation. In particular, the positions of domains, domain boundaries or compartments are often displayed in this way (Fig. 4A).
Significant interactions between two regions are another type of discrete genome organisation feature. These are typically represented by a thin line connecting two thicker bars that define the interacting regions (Fig. 4B). Such features usually represent looping interactions, such as those derived from ChIA-PET or HiChIP data, significant interactions from capture Hi-C or 4C-seq, or CTCF-CTCF loops (Fullwood et al., 2009; Lopes Novo et al., 2018), but could also represent significant changes between two datasets. Some tools specialised for 3D genome data visualisation show these interactions as arcs connecting the interacting regions, where the height, thickness or colour of the arc can be mapped to additional variables in the data, such as interaction significance (Harmston et al., 2015; Schofield et al., 2016). Other tools display arcs between regions relative to a circular representation of the genome sequence, such as Circos plots (Krzywinski et al., 2009) and Rondo (Clark et al., 2016). The circular representation can be more compact than linear representation of large regions of a genome, and this approach can therefore be particularly helpful for visualising long-range or inter-chromosomal interactions.
Discrete features can also be visualised in two dimensions, e.g. as an additional layer on top of a heatmap (Durand et al., 2016b; Kerpedjiev et al., 2018). Domains and loops, for example, can be outlined by squares to highlight features of interest in a dataset or differences between datasets. However, adding such annotation layers can obscure data in the heatmap layer or over-emphasise the annotated features. For interactive visualisation, toggling the annotations on and off is therefore helpful. For static visualisation, one option is to show a symmetric heatmap with the annotation only on one half of the matrix (e.g. as in Fig. 2A and Rao et al., 2014).
It should be emphasised again, however, that discrete definitions of structural genomic features inherently contain less information than quantitative tracks, and the size and position of these features can vary widely depending on the parameters chosen (Kruse et al., 2016). Indeed, parameter choice is key to achieving meaningful feature definitions. As for quantitative tracks, a single set of definitions may not fully reflect the complexity of genome organisation. Nevertheless, these simplifications can be extremely useful for highlighting specific features of genomic organisation and for facilitating both visual and computational comparisons between datasets.
The visualisation of average interactions aggregated across a dataset is a recent addition to the toolbox of Hi-C visualisations. This approach has been widely applied to loops, and also to domains and compartments in single-cell Hi-C data (Flyamer et al., 2017; Gassler et al., 2017; Rao et al., 2014). For loops, these averages are constructed by taking subsets of the Hi-C matrix around each peak, and averaging the signal across all subsets (Fig. 5A). Aggregate plots of domain boundaries can be created in the same way, while aggregating across domains requires first normalising the subset matrices to a uniform size (Fig. 5B). Compartmentalisation aggregates, by contrast, are created by dividing the genome into bins based on eigenvector values, and then plotting the average interaction frequency between regions in each pair of bins. This reveals the tendency of active regions in the A compartment to interact with other regions in the A compartment, and of inactive regions in the B compartment to interact with other regions in the B compartment (Fig. 5C).
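The loop aggregation described above can be sketched as follows: a square window is cut out of the matrix around each peak, and the windows are averaged bin by bin. This is a minimal illustration that assumes peak positions are given as bin indices and that windows running off the edge of the matrix are skipped. The function name, the flank size and the toy data are hypothetical, and in practice the input would usually be an observed/expected matrix rather than raw counts.

```python
def aggregate_submatrices(matrix, peaks, flank):
    """Average the (2 * flank + 1)-square windows centred on each peak."""
    size = 2 * flank + 1
    n = len(matrix)
    total = [[0.0] * size for _ in range(size)]
    used = 0
    for row, col in peaks:
        fits = (
            row - flank >= 0 and col - flank >= 0
            and row + flank < n and col + flank < n
        )
        if not fits:
            continue  # skip windows that would run off the matrix
        for di in range(size):
            for dj in range(size):
                total[di][dj] += matrix[row - flank + di][col - flank + dj]
        used += 1
    if used == 0:
        raise ValueError("no peak window fits inside the matrix")
    return [[value / used for value in line] for line in total]


# Toy 10 x 10 background of ones with two enriched peaks.
n_bins = 10
toy = [[1.0] * n_bins for _ in range(n_bins)]
for (i, j), value in {(2, 6): 5.0, (3, 8): 3.0}.items():
    toy[i][j] = toy[j][i] = value

# The third position lies at the matrix corner and is skipped.
peaks = [(2, 6), (3, 8), (0, 9)]
pile = aggregate_submatrices(toy, peaks, flank=1)
print(pile)
```

The centre of the resulting matrix holds the mean peak signal and the surrounding bins hold the mean local background, so the ratio between them gives one possible summary of overall loop strength.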
Averaging observed interactions across discrete feature definitions gives a global overview of compartmentalisation, domain organisation, boundary strength or loop strength, making it a useful tool for quality control and comparison of overall characteristics of datasets (Box 2, visualisation task 1; Díaz et al., 2018). This approach is particularly helpful for assessing the characteristics of low-resolution datasets such as single-cell Hi-C (Flyamer et al., 2017; Gassler et al., 2017). It can also be an especially effective way to visualise changes in organisation in response to biological conditions that result in a complete loss of a structural feature (Rao et al., 2017; Schwarzer et al., 2017), but can further be applied to subsets of features, such as interactions associated with a specific transcription factor of interest (de Wit et al., 2013; McLaughlin et al., 2019 preprint), or interactions between domain boundaries at different distances (Hug et al., 2017).
By definition, this approach requires a defined set of features to aggregate across. Increases or decreases in average signal relative to this reference set are easy to detect in this approach, but novel features cannot be detected. For example, the appearance of new domain boundaries in a treated sample will not be visible if the reference set of domain boundaries is defined using only the control sample. Therefore, care should be taken when choosing reference features and interpreting changes in interaction strength.
A further limitation of this approach at present is that it can be difficult to interpret in a quantitative way. While some implementations of this approach include a quantification of overall feature strength (Flyamer et al., 2017, 2019 preprint; Gassler et al., 2017; Rao et al., 2014), it is not clear how the magnitude of this quantification and changes in it should be biologically interpreted. One approach (used by Hug et al., 2017) is to compare interactions between features of interest with their interactions with appropriate control regions. This provides a reference point for interpretation of both the visualisation and the quantification of feature strength. Interpretation of changes in average strength is also aided by knowing how interaction strength varies across features, and across biological replicates, but neither is commonly shown alongside aggregate visualisations. One recent tool, HiPiler (Lekschas et al., 2018), allows exploration of variability in signal across groups of features, but is designed for interactive exploration rather than static presentation. Future implementations of aggregate feature analysis should consider how variability and appropriate controls can be displayed alongside average interaction frequency.
A different approach to aggregating interactions across a dataset is to simply calculate the average interaction frequency between regions at different genomic distances (Fig. 5D). This provides an overview of how the contact probability between two regions decreases as the distance between them increases, and reveals global properties of the dataset (Gassler et al., 2017; Naumova et al., 2013; Schwarzer et al., 2017). For example, compaction of chromatin during mitosis in early Drosophila embryos leads to an increase in interactions at a distance of ∼500 kb-5 Mb, compared with the interaction frequency seen in interphase (Fig. 5D). Plotting the derivative of this curve, i.e. the rate at which interaction frequency changes with genomic distance, can highlight such differences (Fig. 5E; Gassler et al., 2017).
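A distance-decay curve and its derivative can be sketched as follows. The sketch assumes a single-chromosome matrix with equally sized bins and non-zero signal at every separation, and it takes the derivative as a finite-difference slope in log-log space, which is one common convention and an assumption here rather than a description of any particular published implementation. The function names and the toy matrix, in which interaction frequency falls off as one over the separation, are hypothetical.

```python
import math


def contact_probability(matrix):
    """Mean interaction frequency per bin separation s >= 1, scaled to sum to 1."""
    n = len(matrix)
    means = []
    for s in range(1, n):
        values = [matrix[i][i + s] for i in range(n - s)]
        means.append(sum(values) / len(values))
    total = sum(means)
    return [m / total for m in means]


def log_log_slope(p):
    """Slope of log P(s) against log s between successive separations."""
    slopes = []
    for k in range(len(p) - 1):
        s1, s2 = k + 1, k + 2  # p[k] holds the value for separation k + 1
        rise = math.log(p[k + 1]) - math.log(p[k])
        run = math.log(s2) - math.log(s1)
        slopes.append(rise / run)
    return slopes


# Toy matrix in which interaction frequency decays as 1 / separation.
n_bins = 6
toy = [
    [1.0 / abs(i - j) if i != j else 0.0 for j in range(n_bins)]
    for i in range(n_bins)
]

p = contact_probability(toy)
slopes = log_log_slope(p)
print(p)
print(slopes)
```

For this toy matrix the slope is close to -1 at every separation; in real data, departures from a constant slope at particular distances are what make the derivative plot informative.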
Conclusions and perspectives
Current interpretation of 3D genome organisation data relies significantly on visualisation and qualitative assessment. However, visualisation of these data can be challenging because of the size of the datasets and the varied genomic scales of the features of interest. Nevertheless, numerous approaches are now available to help researchers visualise these data for exploration and for communication of results. In particular, transformations and abstractions of Hi-C data into linear tracks simplify visualisation and are especially helpful for comparisons between datasets. Aggregate visualisations provide useful summaries of selected properties of datasets and offer the potential for quantitative analyses of these properties.
Recent tool development (Durand et al., 2016b; Kerpedjiev et al., 2018) has made it much easier to interactively explore Hi-C data, to inspect known features and generate hypotheses, and to share interactive visualisation sessions with collaborators (Box 2, visualisation task 7). Although some authors now include links from papers to interactive visualisations (e.g. Falk et al., 2019; Schwarzer et al., 2017), interactive figures cannot yet be incorporated directly into publications (Box 2, visualisation task 8). Interactive data exploration is also difficult to reproduce, a consideration that is becoming more important as reproducibility is seen as a tool for increasing confidence in scientific results (Peng, 2011; Sandve et al., 2013). Therefore, while interactive visualisation of Hi-C data will likely increase in popularity, there is still a need for tools that produce static visualisations of Hi-C data and that can be incorporated into reproducible analysis pipelines.
As discussed above, future tools will also need to consider how to represent variability. Variation across features in a dataset provides an important reference point for interpreting the magnitude and biological significance of changes in genome organisation between samples. Future implementations of aggregate analyses could include visualisation of interaction strength variance across a set of features alongside their average interaction strength. Variation between biological replicates can provide a starting point for statistical analysis of Hi-C data. Quantifications, such as average feature strength, should be calculated and provided for individual biological replicates as well as merged datasets. Although there are quantitative approaches available for the analysis of discrete genomic structural features and quantitative tracks such as insulation score or compartmentalisation eigenvectors, many other analyses rely on qualitative visual comparisons or quantification of average interaction strength for features of interest. Just as showing individual data points rather than bar graphs of summary statistics gives a fuller picture of a dataset (Weissgerber et al., 2015), visualising variability in Hi-C data will allow researchers to better understand and critically evaluate the data. Future tools should consider how quantitative analyses can be integrated with qualitative visualisations. This will provide a foundation to further develop statistically robust frameworks for Hi-C analysis and effectively communicate the results of these analyses.
Additional remaining challenges include the need for standardised data exchange formats that efficiently store large datasets and allow fast access for use with interactive visualisation tools. While groups such as the 4D Nucleome Consortium have internally adopted the ‘cooler’ and ‘hic’ formats as standard file/storage formats (Abdennur and Mirny, 2019; Durand et al., 2016a), existing analysis and visualisation tools use and produce a wide range of formats. The wider community should endeavour to reach a consensus about which formats best fit their needs, and future analysis and visualisation tools should adopt these formats (Marti-Renom et al., 2018). In addition, current visualisation tools largely require the use of the command line, either to produce figures or to reformat data. This is in contrast to the genome browsers used widely to visualise other genomic data types, which accept standardised file formats and have graphical user interfaces, and are therefore accessible to a wider range of researchers. Although it is likely that Hi-C analysis will continue to require significant computational resources and some level of expertise, as 3C-based approaches become more widely used, future tools should make custom visualisations more accessible without command line skills.
Any new techniques that are developed will also require new approaches to aid their analysis and visualisation. Techniques to identify multi-locus contacts produce multi-dimensional data, which present further challenges for visualisation in two dimensions. Effective visualisation of these complex datasets will require development of new visualisation tools and approaches. In addition, as the number of individual cells analysed by single-cell Hi-C approaches increases (Nagano et al., 2017), there will be an increased need for visualisation tools that can provide overviews of organisation while preserving information about variation across the population.
Although visualisation of 3D genome organisation can be challenging, many approaches and tools are available to tackle these challenges and produce effective visualisations. As 3C techniques become more widely adopted, further developments in these tools and approaches will allow increasingly effective visualisation and integration with statistical analyses to answer biological questions and communicate results.
We thank K. Kruse and C.B. Hug for providing analytical and plotting software. We thank members of the Vaquerizas lab and VIZBI 2019 attendees for helpful discussions about Hi-C visualisation.
The authors’ research is funded by the Max-Planck-Gesellschaft and the Deutsche Forschungsgemeinschaft (DFG) Priority Programme SPP2202 Spatial Genome Architecture in Development and Disease (VA 1456/1). E.I.-S. is supported by a postdoctoral fellowship from the Alexander von Humboldt-Stiftung. J.M.V. is a member of the Deutsche Forschungsgemeinschaft Cells-in-Motion Cluster of Excellence (EXC 1003 – CiM), University of Muenster, Germany.
The authors declare no competing or financial interests.