TWGFD
The tetraploid wheat gene family database
1. Identification of gene families in tetraploid wheat
The amino acid sequences of T. dicoccoides and T. durum were downloaded from the Ensembl Plants database (https://plants.ensembl.org/index.html). The Hidden Markov Model (HMM) profiles were retrieved from the Pfam database (http://pfam.xfam.org/). Detailed information on the Pfam IDs is listed in Table 1. The HMMER v3.0 package was used to perform the HMM searches with default parameters. The candidates were validated using the NCBI-CDD (National Center for Biotechnology Information, Conserved Domains Database) (https://www.ncbi.nlm.nih.gov/cdd/), SMART (Simple Modular Architecture Research Tool) (http://smart.embl-heidelberg.de/), HMMER (https://www.ebi.ac.uk/Tools/hmmer/), and InterPro (http://www.ebi.ac.uk/interpro/search/sequence/) online tools. The physicochemical properties of the candidates, such as molecular weight (MW), theoretical isoelectric point (pI) and grand average of hydropathicity (GRAVY), were evaluated with the online ExPASy ProtParam tool (https://web.expasy.org/protparam/). Chromosomal localization information was obtained from the genome annotation files. InParanoid v8.0 was employed to identify orthologs in Arabidopsis thaliana and rice.
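As an illustration of one of the properties above, GRAVY is the mean Kyte-Doolittle hydropathy value over all residues of a protein. The following is a minimal sketch of that calculation, not the ProtParam implementation; the function name `gravy` and the handling of a trailing stop symbol are assumptions made for this example.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the grand average of hydropathicity of a protein sequence.

    Hypothetical helper: sums the hydropathy of every residue and divides
    by the sequence length. A trailing '*' (stop symbol) is dropped.
    """
    residues = sequence.upper().rstrip("*")
    if not residues:
        raise ValueError("empty sequence")
    unknown = set(residues) - KYTE_DOOLITTLE.keys()
    if unknown:
        # Ambiguous residues (X, B, Z, ...) have no defined hydropathy here.
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)
```

A negative GRAVY value indicates a predominantly hydrophilic protein, a positive value a predominantly hydrophobic one.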
2. Gene structure and conserved motif analysis
The exon-intron gene structure was visualized using GSDS (Gene Structure Display Server) (http://gsds.cbi.pku.edu.cn/) according to the gene transfer format (GTF) file. The 1.5 kb genomic sequences upstream of the coding regions were extracted and submitted to the PlantCARE database (http://bioinformatics.psb.ugent.be/webtools/plantcare/html/) to identify putative cis-acting regulatory elements in the promoter regions. The numbers of cis-acting regulatory elements were profiled using the pheatmap package in R. The online MEME (Multiple EM for Motif Elicitation) tool (https://meme-suite.org/meme/) was used to identify conserved protein motifs with the following parameters: the maximum number of motifs was set to 8, any number of repetitions was allowed, and the optimum width ranged from 6 to 250.
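The promoter extraction step depends on strand: for a gene on the minus strand, the upstream region lies at higher coordinates and must be reverse-complemented. The sketch below shows one way to compute the 1.5 kb window, assuming 1-based inclusive coordinates as used in GTF files; the function names and the clipping at chromosome ends are assumptions of this example, not a description of the scripts actually used.

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def upstream_window(start: int, end: int, strand: str,
                    chrom_length: int, flank: int = 1500
                    ) -> Optional[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) of the region upstream of a
    coding region, clipped to the chromosome; None if nothing is left."""
    if strand == "+":
        up_start, up_end = max(1, start - flank), start - 1
    elif strand == "-":
        up_start, up_end = end + 1, min(chrom_length, end + flank)
    else:
        raise ValueError("strand must be '+' or '-'")
    if up_start > up_end:
        return None
    return up_start, up_end


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (needed for minus-strand genes)."""
    return seq.translate(_COMPLEMENT)[::-1]


def promoter_sequence(chrom_seq: str, start: int, end: int,
                      strand: str, flank: int = 1500) -> str:
    """Slice the upstream window out of a chromosome sequence string."""
    window = upstream_window(start, end, strand, len(chrom_seq), flank)
    if window is None:
        return ""
    up_start, up_end = window
    region = chrom_seq[up_start - 1:up_end]  # convert to 0-based slice
    return reverse_complement(region) if strand == "-" else region
```

Genes close to a chromosome end yield a window shorter than 1.5 kb under this clipping rule.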
3. Multiple sequence alignment and phylogenetic relationships
Amino acid sequence alignments of the full-length proteins were generated using Clustal X v1.83. Evolutionary analyses were performed by constructing phylogenetic trees with the neighbor-joining method in MEGA X. Bootstrap analyses with 1000 replicates were performed to assess the statistical reliability of the trees. The syntenic relationships among Arabidopsis thaliana, Oryza sativa, Triticum aestivum, T. dicoccoides, and T. durum were analyzed using InParanoid v8.0. Circos v0.65 was employed to visualize the syntenic relationships.
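The bootstrap step resamples alignment columns with replacement, rebuilds a tree from each resampled alignment, and reports how often each clade recurs. The sketch below shows only the resampling part in plain Python; it is not MEGA X code, and the function name `bootstrap_alignment` and the dictionary representation of the alignment are assumptions of this example.

```python
import random
from typing import Dict


def bootstrap_alignment(alignment: Dict[str, str],
                        rng: random.Random) -> Dict[str, str]:
    """Return one bootstrap replicate of an aligned set of sequences.

    Columns are drawn with replacement, and the same column indices are
    applied to every sequence so that the alignment stays consistent.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def bootstrap_replicates(alignment: Dict[str, str],
                         n_replicates: int = 1000, seed: int = 0):
    """Yield n_replicates resampled alignments (1000 as in the text)."""
    rng = random.Random(seed)
    for _ in range(n_replicates):
        yield bootstrap_alignment(alignment, rng)
```

Each replicate would then be passed to the tree-building method, and the support for a clade is the fraction of replicate trees that contain it.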
4. Expression profile and nucleotide variation analysis
To estimate the expression profiles of the candidate genes, we retrieved RNA-seq samples from the NCBI (National Center for Biotechnology Information) SRA (Sequence Read Archive) database (https://www.ncbi.nlm.nih.gov/) for different tissues and developmental stages, as well as for plants exposed to various biotic and abiotic stresses. The HISAT2 v2.1.0 and StringTie v1.3.5 pipeline was employed to calculate the fragments per kilobase of transcript per million fragments mapped (FPKM) values. The heatmaps and hierarchical clustering were generated from the log2-transformed FPKM values using the pheatmap package in R. The nucleotide variation information was downloaded from the Genome Sequence Archive (https://bigd.big.ac.cn/gsa) under accession number CRA001951.
(Sample_information.xlsx)
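For reference, the FPKM value normalizes a fragment count by transcript length (in kilobases) and by sequencing depth (in millions of mapped fragments). The sketch below spells out that arithmetic and a log2 transform; it is not StringTie code, and the pseudocount of 1 added before the logarithm is an assumption of this example, since the text does not state how zero values were handled.

```python
import math


def fpkm(fragments: float, transcript_length_bp: int,
         total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments.

    FPKM = fragments / (length_bp / 1e3) / (total_mapped / 1e6)
         = fragments * 1e9 / (length_bp * total_mapped)
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def log2_fpkm(value: float, pseudocount: float = 1.0) -> float:
    """log2 transform; the pseudocount (an assumption) keeps zeros finite."""
    return math.log2(value + pseudocount)
```

For example, 100 fragments on a 1000 bp transcript in a library of one million mapped fragments gives an FPKM of 100.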