Plant materials and growth conditions
Seeds of cultivated melon accession, Harukei-3 (C. melo var. reticulatus), Natsukei-1 (var. reticulatus), Fuyukei (var. reticulatus), Honey dew (var. inodorus), Spicy (var. cantalupensis), Manshuu (var. makuwa), Ougon-9 (var. makuwa), Awamidori (var. conomon), and wild melon, JSS6 (C. agrestis), were obtained from the Genebank of the National Agriculture and Food Research Organization (NARO) in Japan. Melon plants were grown using the hydroponics method in the greenhouse of the University of Tsukuba in Japan as previously reported18. For genomic DNA sequencing, apexes of branched shoots were detached from plants and immediately frozen in liquid nitrogen. For tissue-wide RNA-seq study, tissues shown in Fig. 3a and Supplementary Fig. 6 were similarly obtained and frozen in liquid nitrogen. For field RNA-seq study, hole-punched leaf samples were also collected in a weekly manner from July to November as shown in Fig. 6. These samples were crushed to powdery frozen samples using the Multi-beads shocker instrument (Yasui Kikai Corporation, Osaka, Japan) and stored at −80 °C until use.
Genomic DNA isolation and DNA sequencing
Genomic DNA was isolated using Maxwell® 16 Tissue DNA Purification Kit (Code No. AS1030, Promega, Wisconsin, USA). Although this kit is designed to couple with an automated DNA extraction machine, we did not use this but manually isolated genomic DNA by hand to obtain long intact DNA. Isolated genomic DNA was further subjected to the Short read Eliminator XS kit (Circulomics, Maryland, USA) to remove short DNA fragments with <5 kb. For ONT sequencing, the DNA library was prepared using the Ligation sequencing kit (Code No. SQK-LSK109, ONT, Oxford, UK) according to the manufacturer’s protocol. DNA sequencing run was performed with a Nanopore Minion® device coupled with flow cell. For the genome sequencing of Harukei-3, both R9.4.1 and R10 flow cells were used, whereas only R9.4.1 was used in the sequencing of Natsukei-1, Fuyukei, Spicy, Honey dew, Awamidori, and JSS6. ONT flow cells were repetitively used at least twice with the same DNA library. To obtain DNA sequence data, basecalling was performed using a CUDA-enabled GPU server with ONT’s guppy ver. 3.3.0 software. For PacBio RSII and Illumina paired end (PE) sequencing, the outsourcing service of Macrogen Japan Co. Ltd (Kyoto, Japan). was used except the Illumina PE data of Harukei-3 that was obtained with a Hiseq-2000 sequencer at Cornel University. Illumina sequencing was performed with the 100 bp PE mode in Harukei-3 or the 150 bp PE mode in Honey dew, Spicy, Manshuu, Ougon-9, and JSS6. Illumina mate pair sequencing was performed with 5 kb and 10 kb insert libraries by using the outsourcing service of Hokkaido System Science Co. Ltd. (Sapporo, Japan).
RNA isolation and RNA-seq data acquisition
Total RNA was isolated using Maxwell® 16 LEV Plant RNA Kit (Code No. AS1430, Promega, Wisconsin, USA) according to the manufacturer’s protocol. For ONT direct RNA-seq and cDNA RNA-seq, libraries were prepared using the Direct RNA sequencing kit (Code No. SQK-RNA002, ONT, Oxford, UK) or the Direct cDNA sequencing kit (Code No. SQK-DCS108), respectively, according to the manufacturer’s protocol. To obtain RNA sequence data, basecalling was performed with ONT’s Guppy ver. 3.2.1 software. For Illumina RNA-seq with the 150 bp PE mode, the outsourcing service of Macrogen Japan Co. Ltd. was used.
Whole genome assembly
The procedures, datasets, and detailed parameters used for whole genome assembly of Harukei-3 are summarized in Supplementary Fig. 2. We used two kinds of sequence reads, ONT (R9.4.1 + R10 flow cells; 32.6 Gb) and PacBio RSII (19.5 Gb), for initial contig assembly. In the case of ONT, reads with ≥5 kb were selected and used. Contigs were separately assembled based on ONT or PacBio reads with the Canu ver. 1.8 pipeline54; then errors present in the contig sequences were corrected with Pion55 using 37 Gb of Illumina PE dataset. At this point, contig N50 values for ONT-based and PacBio-based contig assemblies were 8.6 Mb and 0.86 Mb, respectively. Then, scaffolds were assembled based on contigs by combining methods of Bionano Irys optical map and Illumina mate pair (Supplementary Fig. 2). For Bionano scaffolding, 86 Gb raw data were obtained using the outsourcing service of AS ONE Corp. (Osaka, Japan). They were first assembled to construct “cmap” with Irys solve ver. 3.2; then, cmap was used for both scaffolding and correction of chimeric contigs (incorrectly assembled contigs) using the same software. Scaffolding was also performed using 74 Gb of Illumina mate pair (5 kb and 10 kb insert sizes) with SSPACE ver. 3.056 at different “k” parameter values (from k = 160 to 80, 40, 20, 10, and 5). By using k values from larger to smaller in series, we tried to maximize the connections and minimize false positives. Chimeric scaffolds generated by the mate pair scaffolding were again corrected using Bionano cmap. At this point, scaffold N50 values for ONT-based and PacBio-based assemblies were 17.5 Mb and 11.4 Mb, respectively. ONT-based scaffolds were further updated using PacBio-based scaffolds as a hint. In this attempt, both scaffolds were first classified into each chromosome group by using linkage map information. Then, candidates of PacBio-based scaffolds that can connect two distinct ONT-based scaffolds were identified by BLAST-n search using the following conditions: p-value ≤ 1e–150, sequence identity ≥ 99%, blast score ≥ 1000, and alignment length ≥ 5000 bp (Supplementary Fig. 2). If both end of the PacBio-based scaffold had sequence alignments with distinct ONT-based scaffolds, it was used to connect them. Finally, the chromosome-scale pseudomolecule was constructed using 28 genomic scaffolds that were anchored and oriented by linkage map information16,30,57.
For the contig assembly in Natsukei-1, Fuyukei, Spicy, Honey dew, Ougon-9, Awamidori, and JSS6, ONT sequencing reads with ≥5 kb (R9.4.1 flow cell) were first subjected to the Canu ver. 1.8 pipeline. The resultant contig sequences were subjected to Racon58 and Medaka (https://nanoporetech.github.io/medaka/) to correct erroneous bases. To further determine the candidates of chimeric contigs, ONT reads were aligned to contig sequences with minmap259 using the following parameter: “-a -uf -k14 -A 2 -B 4 -O 4,24 -E 2,1”. Then, read depth data were obtained based on the read alignment information using the mpileup function of samtools60. Contigs were split at positions where the depth of ONT reads was less than four.
Procedures and datasets used for genome annotation are summarized in Supplementary Figs. 6, 7. For ONT-based gene prediction, datasets of 8.2 Gb ONT direct RNA-seq and 8.8 Gb ONT cDNA RNA-seq were combined and used. They were aligned to Harukei-3 ver. 1.41 genomic sequence with Minimap2 using the following parameters “-ax splice -uf -k14.” Then, transcript information with exon-intron structure was obtained by pinfish (https://github.com/nanoporetech/ont_tutorial_pinfish) using several “c” parameter values (c = 2, 3, 5, and 10; Supplementary Fig. 7). Predicted transcript sequences were obtained from the genomic sequence with gffread (https://ccb.jhu.edu/software/stringtie/gff.shtml#gffread) based on the General Feature Format (GFF) information of Pinfish; then, they were combined with transcript sequence information of DHL92 genome annotation CM4.0. Again, Minimap2 alignment and Pinfish prediction were performed based on the combined transcript sequences to obtain the merged GFF annotation. Next, the protein-coding open reading frame (ORF) was predicted in each transcript followed by hmmsearch. The best possible ORFs were kept by selecting those with the highest sum total hmmsearch scores. If no protein domain was found in any ORF candidates, the longest ORF was kept. Transcripts were further grouped into gene units based on the position of the exon(s) and the results of the self-BLAST search (both transcript and protein sequences); transcripts were grouped if both the transcript and protein sequences had homology to each other and the positions of exon(s) on the genome were consistent between them. To further select the best-possible ORF in each gene unit, hmmsearch as well as BLAST-p search against protein sequences of 9 plant genomes were performed again (a list of the genome references used for the purpose is shown in Supplementary Fig. 7d). The ONT-based method described above predicted 31,306 protein-coding genes. A perl pipeline designated “ONT4genepredict” was developed to automatically perform the information analysis described above. It is available in Melonet-DB (https://melonet-db.dna.affrc.go.jp/ap/dnl). To evaluate the completeness of predicted genes, we used BUSCO ver. 3.0 benchmark32. The protein BUSCO score for the ONT-based gene dataset was 1362 (94.6%) (Supplementary Fig. 7c). In addition to the ONT-based method, gene prediction was also performed with AUGUSTUS ver. 3.3.2 (ab initio method)61, Braker2 pipeline62, and Genome Threader63. For AUGUSTUS gene prediction, the parameter dataset was first trained and generated with the perl script autoAug.pl using the ONT-based annotation dataset described above. Then, genes were predicted with AUGUSTUS software using the default parameters. Braker2 was executed using the Illumina RNA-seq dataset of 45 tissue-wide samples (total 118 Gb; Fig. 3a). RNA-seq reads were first aligned to the Harukei-3 genome sequence, then the read alignment information was merged and used for Braker2 gene prediction. Transcript annotation was also obtained with StringTie64 software based on the read alignment information. Genome Threader was executed based on the protein sequences of 10 published plant genomes (listed in Supplementary Fig. 7d). Then, EvidenceModeler (EVM, https://evidencemodeler.github.io/) was used to integrate the results of StringTie, AUGUSTUS, Braker2, and Genome Threader with the weight score setting of 10, 8, 1, and 1. EVM produced the dataset of 59,613 protein-coding genes with complete BUSCO ver. 3.0 score = 1348 (93.6%). Finally, using the EVM-based dataset as a supplementary dataset, ONT-based annotation dataset was updated to obtain 33,829 protein-coding genes (40,363 transcripts, BUSCO ver. 3.0 score = 1,372 [95.3%]). InterProScan33 was also conducted to obtain GO and InterPro ID in each predicted protein amino acid sequence.
Identification of repetitive elements
Repetitive elements including DNA/RNA transposable elements were searched in both Harukei-3 ver. 1.41 and DHL92 CM3.6.1 genomes using RepeatModeler and RepeatMasker (http://www.repeatmasker.org/) using a repeat sequence dataset, dc20181026.
RNA-seq and co-expression data analysis
Alignment of Illumina RNA-seq paired end short reads was performed with HISAT265 using the following parameters: “–maxins 1000 – –score-min L,0,−0.12 –mp 2,2 –np 1 –rdg 1,1 –rfg 1,1.” Then, gene expression levels were calculated as FPKM values with StringTie. After removing non-expressing genes (e.g., FPKM < 0.1 in any of the samples), WGCNA was performed to obtain the co-expression dataset as described previously18. Pearson’s correlation coefficients were also calculated using R 3.2.3 (https://www.r-project.org/) based on FPKM values independently of WGCNA to distinguish positive and negative correlations. Co-expression datasets can be explored using the web-application tool “Co-expression viewer” in the Melonet-DB (https://melonet-db.dna.affrc.go.jp/ap/mds).
Three different melon genome references were used in this study: Harukei-3, DHL92 (CM3.6.1 genome sequence + CM4.0 annotation), and Payzawat (ASM976082v1 genome sequence). Illumina paired end short reads of Harukei-3, Honey dew, Spicy, Manshuu, Ougon-9, and JSS6 (see above) were aligned to genome sequence with bowtie266 using the following parameters, “–end-to-end –very-sensitive –score-min L,0,−0.12 –mp 2,2 –np 1 –rdg 1,1 –rfg 1,1.” Variant call was performed with the Genome analysis tool kit67, and mutation characterization was performed as described previously68.
Comparative genomics analysis
Genomic alignment was performed with LAST69 using the following parameters, “-e 25 -v -q 3 -j 4 -P 32 -a 1 -b 1 (for lastal)” and “-s 35 -v (for last-split)”. After file format conversion via maf-convert, plot graphs comparing distinct genomes were generated with R 3.2.3 based on the result of LAST alignment. To obtain information of orthologue partners, we performed bidirectional blast searches based on both transcript and protein sequences. Because it is difficult to determine one-to-one orthologue partners when genes are duplicated in either genome, we also used the information of transcript alignment obtained with blat42 as supporting information. When genes were paired in both bidirectional BLAST-n/p search and transcript alignment analysis, they were determined as one-to-one orthologue partners. Candidates of CNV and PAP were determined by integrating the results of bidirectional BLAST-n/p search with the information of gene position on the chromosome that could be obtained from genome annotation (GFF3 files). Information analysis described above has been automated with Perl scripts. Circos70 was used to visualize and compare genomes.
For PAP analysis of retrotransposon Gag-like sequences, DNA alignment analysis was performed for each sequence in nine melon genomes using the Harukei-3 genome as a standard reference. First, the genomic sequence containing the Gag-like sequence and its surrounding region (approx. 50–100 kb) was obtained from the Harukei-3 genome sequence. By using this sequence as a query, a BLAST-n search was performed against the whole genome sequence data of the target melon cultivar or accession. The specific region of contig or chromosome that showed the best homology to the query sequence was identified from the BLAST search result, then the DNA sequence of this region was obtained and further used for LAST alignment. Plot graphs comparing both sequences (Harukei-3 versus the target melon) were generated with R 3.2.3 based on the result of the LAST alignment. The PAP status of the Gag-like sequence in the target melon was also calculated as a numeric value based on the alignment ratio of the corresponding genomic region. For example, an alignment ratio of 1.0 means the complete presence of the Gag-like sequence in the target melon genome, while 0 means that the sequence is absent.
InterPro ID enrichment analysis
ID enrichment analysis was performed based on Fisher’s exact test using the R-exact2x2 module. To calculate q-value from p-value, R-qvalue package was used. This ID enrichment analysis is available in the “GO enrichment analysis tool” in Melonet-DB (https://melonet-db.dna.affrc.go.jp/ap/got).
Statistics and reproducibility
All statistical tests were performed using available softwares, packages, and online tools mentioned in the methods. Reproducibility can be accomplished using raw sequencing data deposited on public databases and the same command lines mentioned in the methods, where we used publicly available softwares for most of the analysis. Fisher’s exact test was used for testing enriched GO or InterPro terms. Both p-value and q-value was used to indicate statistical significance. The number of RNA-seq samples used for tissue-wide or leaf co-expression analyses were 45 and 75, respectively, which were determined according to the previous study18. Co-expression was evaluated based on the weight values calculated by R-WGCNA and pearson’s correlation coefficients (n = 45 or 75).
Further information on research design is available in the Nature Research Reporting Summary linked to this article.