In summary, two search categories are provided in KGD: gene search and batch query. The gene search option provides an interface for querying KGD with a gene ID or keyword associated with gene annotations. To facilitate the queries of genes and functional annotation data stored in KGD, we employed the Apache Solr search engine (http://lucene.apache.org/solr/) to build indexes for different sources of annotation information, including gene functions, GO terms, InterPro domains and homologs.
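As an illustration of how a keyword search against such indexes might be issued, the sketch below builds a request URL for Solr's standard `select` handler. It is not the KGD implementation: the base URL, the core name and the field names (`gene_function`, `go_term`, `interpro_domain`, `homolog`) are assumptions chosen to mirror the annotation sources listed above.

```python
from urllib.parse import urlencode

# Hypothetical field names, one per annotation source mentioned in the text.
DEFAULT_FIELDS = ("gene_function", "go_term", "interpro_domain", "homolog")


def build_solr_query(base_url, core, keyword, fields=DEFAULT_FIELDS, rows=20):
    """Build a Solr select URL that matches `keyword` in any of `fields`."""
    # Escape double quotes so the keyword stays inside the quoted phrase.
    escaped = keyword.replace('"', '\\"')
    # One quoted clause per field, combined with OR.
    q = " OR ".join(f'{field}:"{escaped}"' for field in fields)
    params = {"q": q, "rows": rows, "wt": "json"}
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Example with an assumed local Solr instance and an assumed core named "genes".
url = build_solr_query("http://localhost:8983/solr", "genes", "kinase")
print(url)
```

Keeping one field per annotation source lets a single keyword be matched against functions, GO terms, domains and homologs in one request.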
In addition to the gene search option under each genome page, a global search function is provided under the main menu of KGD. This function provides a quick query against all the records stored in the database and returns results in a tabular format including the gene ID, gene type, and gene description (Fig. 2a). From this table, users can browse the detailed feature page for each gene by clicking the corresponding gene link.
The batch query option allows users to retrieve sequences, annotations and other types of information (e.g., TFs and TRs) for a given list of genes. The batch query function in KGD was modified from the ‘Sequence Retrieval’ page of Tripal16.
To provide a homology search function, we implemented the Tripal BLAST UI extension module in KGD. All genome, mRNA, CDS and protein sequences of kiwifruit species stored in KGD are available for comparison through the BLAST program. To prevent users from selecting a BLAST program (BLASTN, BLASTP, BLASTX, tBLASTN or tBLASTX) that is incompatible with the chosen database, the list of available programs is set automatically according to the selected reference database (Fig. 2b). Options for filtering low-complexity sequences and selecting the maximum number of returned BLAST hits are provided. The BLAST function provides downloadable output files, ordered by E-value, in three formats: HTML, TSV and XML. The results page lists all the hits, and each hit is linked to a graphic output that shows the alignment coordinates between the query and the hit, together with a color-ranked bit score for the alignment (Fig. 2c).
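The program restriction described above can be sketched as a simple lookup. This is a minimal illustration, not code from the Tripal BLAST UI module: the database names are hypothetical, and the mapping relies only on the general rule that BLASTN, tBLASTN and tBLASTX search nucleotide databases, whereas BLASTP and BLASTX search protein databases.

```python
# BLAST programs grouped by the type of database they can search.
COMPATIBLE_PROGRAMS = {
    "nucleotide": ["blastn", "tblastn", "tblastx"],
    "protein": ["blastp", "blastx"],
}

# Hypothetical database catalogue (names are illustrative only).
DATABASE_TYPES = {
    "genome": "nucleotide",
    "mRNA": "nucleotide",
    "CDS": "nucleotide",
    "protein": "protein",
}


def programs_for(database):
    """Return the BLAST programs that can be run against `database`."""
    return COMPATIBLE_PROGRAMS[DATABASE_TYPES[database]]


for name in DATABASE_TYPES:
    print(name, programs_for(name))
```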
In KGD, we implemented JBrowse30, a widely used genome browser, to display genome sequences, gene models, and expression profiles. Currently, all publicly available kiwifruit genomes, predicted gene models, and gene expression profiles derived from RNA-Seq data have been imported into JBrowse. The tracks of a given gene in a reference genome are also embedded in the gene features page to provide a graphical and informative view of its sequence and structure (Fig. 1a). Additionally, the genome browser can support other data types, such as single-base-resolution genome variants, as they become available.
To view syntenic blocks and homologous gene pairs between different kiwifruit genome assemblies, we developed ‘SyntenyViewer’, an extension module of Tripal, in KGD. Syntenic blocks can be retrieved by selecting a query genome together with one or more subject genomes. ‘SyntenyViewer’ draws Circos plots to display syntenic blocks for every pair of query and subject genomes (Fig. 3a) and simultaneously generates a full list of the syntenic blocks. For a specific syntenic block, ‘SyntenyViewer’ creates an image to display the homologous gene pairs, and the view can be zoomed in or out as desired (Fig. 3b). The full list of genes included in the homologous gene pairs is provided with links to the detailed feature page of each gene (Fig. 1). In brief, the ‘SyntenyViewer’ module not only reveals syntenic blocks between any two genome sequences but also connects homologous gene pairs within syntenic blocks. With this module, homologs of genes of interest that are located in a specific region of one kiwifruit genome can be easily identified and intuitively viewed in the other kiwifruit genome.
a Syntenic blocks displayed in a Circos plot. The blue arc indicates the query chromosome, and the red arcs indicate the chromosomes of the compared genome. Gray lines between blue and red arcs indicate syntenic blocks identified between the two genomes. The lines of a syntenic block will become red when the user mouses over it. b Detailed view of a specific synteny block. The query and compared chromosomes of a specific synteny block are shown in orange and blue, respectively. The yellow and black lines within each chromosome indicate homologous gene pairs, which are connected by gray lines
Large-scale genomic studies typically result in large lists of genes of interest. Interpreting such gene lists to obtain biologically meaningful information is the basic premise for understanding the underlying regulatory mechanisms of important biological processes and biochemical pathways. Enrichment analysis is a powerful and frequently used method for identifying functional categories (e.g., GO terms and biochemical pathways) that are overrepresented in a gene list. We previously developed two custom-built extension modules of Tripal, ‘GO tool’ and ‘Pathway tool’, based on the hypergeometric test29. These two modules were also implemented in KGD to identify significantly enriched GO terms and biochemical pathways from a list of user-provided genes.
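For readers unfamiliar with the hypergeometric test, the following sketch computes the enrichment P-value for a single term. It illustrates the statistic only and is not the code of the ‘GO tool’ or ‘Pathway tool’ modules. The parameter names are assumptions: `N` is the number of annotated genes in the genome, `K` the number of those genes carrying the term, `n` the size of the user's gene list, and `k` the number of listed genes carrying the term.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Upper-tail probability P(X >= k) for X ~ Hypergeometric(N, K, n).

    This is the chance of seeing at least k term-annotated genes in a
    list of n genes drawn without replacement from N genes, K of which
    carry the term.
    """
    upper = min(n, K)  # X cannot exceed the list size or the term size
    total = comb(N, n)  # number of possible gene lists of size n
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy example: 10 genes in the genome, 5 carry the term, and all 5 genes
# in the user's list carry it.
print(hypergeom_enrichment_p(k=5, n=5, K=5, N=10))
```

When many terms are tested at once, the resulting P-values would normally be corrected for multiple testing; that step is not shown here.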
RNA-Seq expression analysis
KGD not only stores gene expression profiles derived from RNA-Seq datasets but also provides an ‘RNA-Seq’ module that allows users to perform RNA-Seq data analyses, including the identification of differentially expressed genes (DEGs) and the visualization of gene expression profiles. Two of the most popular DEG identification tools, edgeR31 and DESeq32, were integrated into the ‘RNA-Seq’ module in KGD. Users can set their desired cutoff values for the gene expression fold change and the adjusted P-value to determine the final DEGs. The result page for the DEG analysis includes the project description, the parameter settings, the top 100 DEGs ordered by adjusted P-value, and a download link to a file with all identified DEGs together with their relevant information (Fig. 4a). Furthermore, the result page provides links to other modules for downstream analyses of the identified DEGs, such as BLAST, batch query, GO term and pathway enrichment analyses, and gene functional classification.
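The cutoff-based selection described above can be sketched as follows. This is a simplified illustration that assumes the fold changes and adjusted P-values have already been computed (for example, by edgeR or DESeq); it is not the module's code. The record keys (`gene`, `fold_change`, `adj_p`) and the default cutoffs are hypothetical.

```python
from math import log2


def select_degs(rows, min_fold_change=2.0, max_adj_p=0.05, top_n=100):
    """Filter genes by fold change and adjusted P-value.

    `rows` is a list of dicts with the assumed keys "gene",
    "fold_change" (a ratio, not log-transformed) and "adj_p".
    Returns (top_n DEGs, all DEGs), both ordered by adjusted P-value.
    """
    threshold = abs(log2(min_fold_change))
    degs = [
        r for r in rows
        # Use |log2 FC| so that up- and down-regulation are treated
        # symmetrically; skip non-positive ratios, which have no log.
        if r["fold_change"] > 0
        and abs(log2(r["fold_change"])) >= threshold
        and r["adj_p"] <= max_adj_p
    ]
    degs.sort(key=lambda r: r["adj_p"])
    return degs[:top_n], degs


# Toy input with made-up values.
rows = [
    {"gene": "g1", "fold_change": 4.0, "adj_p": 0.001},
    {"gene": "g2", "fold_change": 1.2, "adj_p": 0.0001},
    {"gene": "g3", "fold_change": 0.25, "adj_p": 0.01},
    {"gene": "g4", "fold_change": 8.0, "adj_p": 0.2},
]
top, all_degs = select_degs(rows)
print([r["gene"] for r in all_degs])
```

Returning both the truncated and the full list mirrors the result page, which shows the top 100 DEGs and links to a file containing all of them.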