Keywords
ELIXIR, survey, database, bioinformatics tools
ELIXIR is an intergovernmental organization, primarily centred in Europe, established to host resources such as databases, software tools, training material and cloud storage for the scientific community under a single infrastructure. ELIXIR started in 2013 and now includes 22 member states and over 220 research institutions. By providing a one-stop shop for the scientific community, it aims to help researchers identify the most suitable bioinformatics resources (and the appropriate training material and workshops) to carry out their research tasks. Moreover, ELIXIR recognizes and facilitates the sharing of data and the exchange of expertise between its members, with the goal of agreeing on best practice.
In 2018, ELIXIR commissioned an international survey on the usage of databases and tools for annotating and curating human genomic variants, with the aim of improving ELIXIR resources. Here we present the results of the survey.
A 27-question survey (Extended data) was designed and agreed upon by several ELIXIR members from various European countries. The survey was constructed around six main themes (detailed in “Questionnaire structure”) aimed at exploring the usage of ELIXIR resources and tools. Six ELIXIR nodes/partners informed the construction of the survey. Two researchers/research centres were identified by each of the six ELIXIR nodes to participate in a pilot test. The survey was modified based on the feedback received. The final questionnaire was made available on-line using the Webropol survey website between September and December 2018. Responses were uploaded to a Finland-based university server. Members of the Finland ELIXIR node accessed the responses and stored them in a .csv file for analysis. Information about participation was displayed on the survey and completion of the survey was taken as confirmation of participation.
Considering the absence of identifying information in data published here, and the non-sensitive nature of the survey, no ethical approval was sought for this study. No information presented here can be used to identify survey participants.
For each country, an ELIXIR representative was asked to recruit prospective survey respondents from academia, clinical diagnostics, clinical research, industry and government within their country. No additional eligibility criteria were set. Participation in the survey was advertised by several means, such as the mailing lists of universities, research centers, private companies and research institutes within each European country. Prospective participants were asked to fill in the anonymous on-line questionnaire. A three-month deadline was given for completing the on-line questionnaire. Reminders to complete the survey were published in the regular newsletters of the recruited institutions. This approach did not allow calculation of the response rate.
The survey collected two types of data:
- quantitative data: this included category ranking metrics, as well as general frequency of use of the tools and databases surveyed. Whenever possible, a list of possible choices was provided for selection, to allow proper survey analysis and interpretation.
- qualitative data: this included participant comments.
The survey was divided into six sections (the full questionnaire with the list of tools, databases and resources that were surveyed is presented in Extended data):
Section 1 - Background information
Section 2 - Resources for annotating and curating human genomic variants (9 questions)
Section 3 - Next generation sequencing (3 questions), (If no, skip this section)
Section 4 - Mitochondria (4 questions), (If no, skip this section)
Section 5 - Proteins (5 questions), (If no, skip this section)
Section 6 - ELIXIR (3 questions)
Sections 1, 2 and 6 were open to all participants, whereas sections 3 to 5 were only open to participants who worked or had an interest in working in these specific fields. Accordingly, respondents were asked to skip the sections that were not pertinent to their work or that of their group.
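The skip logic described above can be sketched as a small routing function. This is only an illustration in Python, not part of the survey platform: the section titles come from the questionnaire structure, while the function name `sections_for` and its three boolean flags are hypothetical stand-ins for the respondent's screening answers.

```python
from typing import List

# Section titles as listed in the questionnaire structure.
SECTIONS = {
    1: "Background information",
    2: "Resources for annotating and curating human genomic variants",
    3: "Next generation sequencing",
    4: "Mitochondria",
    5: "Proteins",
    6: "ELIXIR",
}

# Sections 1, 2 and 6 were open to all participants.
ALWAYS_SHOWN = {1, 2, 6}


def sections_for(works_with_ngs: bool,
                 works_with_mito: bool,
                 works_with_proteins: bool) -> List[int]:
    """Return the section numbers a respondent is asked to complete.

    The three flags are hypothetical names for the screening answers
    ("If no, skip this section") of sections 3, 4 and 5.
    """
    optional = {3: works_with_ngs, 4: works_with_mito, 5: works_with_proteins}
    return [n for n in sorted(SECTIONS)
            if n in ALWAYS_SHOWN or optional.get(n, False)]


# A respondent who works with NGS only skips sections 4 and 5.
print(sections_for(True, False, False))
```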
The complete list of tools and databases surveyed is presented in Extended data.
Data analysis was performed using RStudio (version 0.98.1062) and R (R version 3.4.4)1. Quantitative data are presented as absolute and relative frequency. Qualitative data were analysed per theme.
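As an illustration of what reporting absolute and relative frequencies involves, the following minimal sketch tabulates the answers to a single question. It is written in Python rather than the R used for the actual analysis, the function name `frequency_table` is an assumption, and the answers are invented toy data, not survey responses.

```python
from collections import Counter


def frequency_table(answers):
    """Return (category, count, percent) rows, most frequent first."""
    counts = Counter(answers)
    total = sum(counts.values())
    return [(category, n, round(100 * n / total))
            for category, n in counts.most_common()]


# Hypothetical answers to one single-choice question (not survey data).
answers = ["daily"] * 5 + ["weekly"] * 3 + ["never"] * 2

for category, n, pct in frequency_table(answers):
    # Same "(percent, n=count)" style used to report results in the text.
    print(f"{category}: {pct}%, n={n}")
```

Percentages are rounded to whole numbers here, as in the reported results; a different rounding rule would be an equally valid choice.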
A total of 92 questionnaires were available for analysis. Data were collected from 18 European countries and the United States (Extended data: Figure S1). Finland contributed the largest share of answers (30%, n=28), followed by the United Kingdom (11%, n=10). The large majority of respondents (89%, n=82) identified themselves as belonging to academia or a research environment (Extended data: Table S1). Just over half of the respondents (51%, n=47) answered the questionnaire in relation to their own work and that of a small group of (<10) people. This was followed by answers given in relation to individual work (33%, n=30) and the work of a large group of (>10) people (16%, n=15). As the survey was sent to colleagues by the organisers, the respondents may well not be representative of the wider community. Additionally, most of the respondents were from European countries, and the responses could be substantially different elsewhere, for example in the USA.
Section 2 of the survey aimed at identifying key data resources and tools used by the scientific community, primarily in ELIXIR member states. Respondents were asked to rank 52 resources (listed in Extended data: Table S2). A total of 2055 responses were collected from 82 (88%) respondents. In total, 22 resources were considered critical/very important by more than 50% of the respondents who knew the resources and considered them relevant to their work (Figure 1). Three ELIXIR resources, Ensembl2, the Ensembl Variant Effect Predictor (VEP)3 and UniProt4, were within the top ten. Several additional resources were listed by respondents and are presented in Extended data: Tables S3 and S4.
A total of 78 responses were collected from 50 (54%) respondents on the standards and guidelines used for the interpretation of sequence variants. Of these respondents, 66% (n=33) followed the American College of Medical Genetics (ACMG) standards and guidelines for the interpretation of sequence variants5, 22% (n=11) followed national guidelines, and 8% (n=4/50) followed in-house guidelines. A further 4% (n=2) answered “other guidelines” and listed the European Leukemia Net (ELN) guidelines6 and the “Sequence Ontology base classification for variants”7. With regard to the human genome reference assembly version, more than one third (38%, n=31/81) of respondents used both GRCh37 and GRCh38, 44% (n=36/81) used GRCh37 only, 15% (n=12/81) used GRCh38 only, and 2% (n=2/81) used GRCh37, GRCh38 and older versions (Figure 2).
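The selection criterion used above (rated critical/very important by more than 50% of the respondents who knew a resource and considered it relevant) can be made concrete with a short sketch. The Python below is an assumption-laden illustration rather than the authors' analysis code: the answer labels, the function names `top_share` and `key_resources`, and the resource names are all hypothetical.

```python
# Hypothetical answer labels; the questionnaire's exact wording may differ.
TOP_RATINGS = {"critical", "very important"}
EXCLUDED = {"not relevant", "do not know"}


def top_share(ratings):
    """Share of top ratings among respondents who knew the resource
    and considered it relevant; None if no such respondent exists."""
    eligible = [r for r in ratings if r not in EXCLUDED]
    if not eligible:
        return None
    return sum(r in TOP_RATINGS for r in eligible) / len(eligible)


def key_resources(responses, threshold=0.5):
    """Names of resources whose top share exceeds the threshold,
    ordered from highest to lowest share."""
    shares = {name: top_share(r) for name, r in responses.items()}
    selected = [name for name, s in shares.items()
                if s is not None and s > threshold]
    return sorted(selected, key=lambda name: shares[name], reverse=True)


# Toy data for three invented resources (not survey data).
responses = {
    "resource_a": ["critical", "very important", "important", "do not know"],
    "resource_b": ["important", "not relevant", "critical", "not important"],
    "resource_c": ["do not know", "not relevant"],
}

print(key_resources(responses))
```

Excluding the "do not know" and "not relevant" answers from the denominator means that a little-known resource can still rank highly if the few respondents who use it rate it as essential.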
Four types of data/data formats used in human genome variation analysis (VCF/BCF, BAM, FASTQ, FASTA) were considered critical/very important by the majority of respondents (Extended data: Table S5). Additional data formats that were listed as critical/very important were: BED (cited 4 times), gVCF (cited twice), PLINK8, IMPUTE9 and Hail MatrixTable. Genome-wide association studies (GWAS) and rare disease studies were listed as the most important topics and analytical operations in human genome variation analysis performed by respondents to this survey (Extended data: Table S6).
One of the main goals of this survey was to identify user requirements in the use of resources for human genome variation annotation and curation. A total of 777 responses on the importance of basic features of databases and tools were analysed. The majority of respondents (>83%) rated the following as critical/very important: a resource being free or low cost, an open license to academia, the ability to assess the quality of its data, good user documentation and availability in English. An easy-to-use web browser and a privacy policy were deemed critical/very important by >77% of respondents (Table 1).
A total of 1173 responses were analysed on the importance of technical features in variation analysis, annotation tools and databases. Good curation of the database was the top scoring technical feature, ranked critical/very important by 92% of respondents. The following five technical features were considered critical/very important by >75% of respondents: i) fit-for-purpose data used by the resource; ii) the scientific coverage and comprehensiveness of the resource; iii) scalability to high-throughput analysis; iv) availability of datasets for download; and v) ability to analyse large datasets/queries (Table 2).
The next section of the survey focused on Next Generation Sequencing (NGS). Sixty-six (72%) respondents worked with NGS and ranked the 6 sequencing methods listed in Extended data: Table S7. Whole exome, panel DNA, whole genome and RNA sequencing were each used by more than 60% of respondents. Methylation sequencing was the least used technique (28% of respondents) but was the technique that the largest share of respondents would like to use (23% of respondents). Several other techniques were listed under “Other”, such as single-cell RNA sequencing, ribosome profiling, chromatin conformation techniques (3C, 4C, HiC, capture HiC), assay for transposase-accessible chromatin using sequencing (ATAC-seq), targeted RNA sequencing, fusion genes on cDNA, microbiome sequencing and GRO-seq.
The next section of the survey explored the use of resources related to the study of mitochondria. Only 15 (16%) respondents worked with mitochondrial DNA (mtDNA) alone or in combination with nuclear genes involved in mitochondrial functionality. Of these, 7 used NGS alone, 2 NGS and Sanger sequencing and 6 could not specify. Additional methods listed under “Other methods” were whole genome sequencing (WGS), whole exome sequencing (WES), RNA-seq and mtDNA included in WGS or WES.
Among the mitochondria-dedicated databases, MITOMAP10 was ranked as critical/very important by 65% of respondents, followed by HmtDB (Human Mitochondrial Database)11 by 44% and HmtVar (Human Mitochondrial Variants Database)12 by 33%. With the exception of MITOMAP, MitoCarta13 and HmtDB, more than half of the respondents were unaware of all other databases (Extended data: Table S8).
Among the six tools for mitochondria-related research that were ranked, MToolBox14, HaploGrep215, MitoTip16 and MitImpact217 were each reported as critical/very important by 18% of respondents (Extended data: Table S9). The following mitochondrial databases/tools not included in the survey were listed under “Other” and ranked critical/very important: MitoFates18 and an in-house annotation pipeline. mvTool (part of MSeqDR.org) and MitoMaster (part of MITOMAP) were listed among “Other tools” used for mitochondria-related research.
The next section of the survey explored the use of resources related to the study of protein structures. Of the 92 respondents, 21 (22.8%) worked with protein structure, whereas 5 (5.4%) did not but would like to. The results of this section are presented based on 182 answers from these 26 respondents. Databases of experimentally determined protein structures and complexes, such as PDB19 and PDBe20, were considered critical/very important by 65% (n=17/26) of respondents, followed by tools that report on the structural consequences of variants (62%, n=16/26). All 7 tools/databases that were surveyed were ranked critical/very important by over 45% of respondents (Table 3). The following tools/databases were not surveyed but were cited by respondents as key resources for structural modelling of variants: Pymol21, Rosetta22, KiNG23, Modeller24, PolyPhen-225, SNP3D26, SNPs&GO27, SAAPdap/SAAPpred28, HOPE29 and Yasara30.
Key limiting factors in the use of tools for modelling the structural consequences of variants were the lack of expertise in the area (54%, n=13/24), difficulties in using the tools (42%, n=10/24), and difficulties in interpreting the results (38%, n=9/24) (Figure 3). Lack of high-throughput capability was listed among “Other limitations”. The difficulty of translating protein dynamics into structural models was also cited.
The last section of the questionnaire covered questions related to ELIXIR as a platform that coordinates the tools and databases surveyed in this study and the benefits that ELIXIR, as a European intergovernmental organization, offers to the scientific community. The majority of respondents (52%, n=42/92) were not aware that the tools and databases surveyed in Q25 are part of ELIXIR (Extended data: Figure S2). However, the long-term sustainability of ELIXIR core resources and the verified scientific quality of datasets were considered critical/very important by >76% (n=68) of participants who answered this question. The international standards for describing and saving data, the verified quality and maintenance of ELIXIR resources and facilities to find the right research tools were also considered critical/very important by the majority of respondents (>60%) (Table 4). Additional challenges identified by respondents regarding the annotation and curation of human genetic variants were related to the quality of data: incorrect entries, conflicting annotation, limited access to expert curation and classification of variants obtained in routine diagnostic settings, and a limited number of genotype-phenotype annotations. An important caveat is ascertainment bias: the survey was primarily sent to colleagues of the authors, so the respondents are probably not a representative sample of the community involved in the annotation and curation of human genomic variations.
This survey shows that the scientific community in ELIXIR member states considers several resources supported by ELIXIR crucial or very important. Moreover, it shows that the work done by ELIXIR is greatly valued. In particular, most respondents acknowledged the importance of key features and benefits promoted by ELIXIR, such as the verified scientific quality and maintenance of ELIXIR-approved resources.
Open Science Framework: ELIXIR, https://doi.org/10.17605/OSF.IO/SWX4231.
This project contains the following extended data:
- Copy of the online survey
- Figure S1. Number of responses from the countries that participated in the survey
- Figure S2. Results for Q25: “Before the survey, were you aware that the resources listed below are part of ELIXIR?”
- Table S1. Answers to Q2: Place of work. More than one answer was allowed per participant
- Table S2. List of Databases and Tools surveyed in Q4
- Table S3. Additional Resources listed by Respondents under “Other” in Q4 and considered Critical or Very Important
- Table S4. Other resources listed by Respondents in Q5
- Table S5. The importance of data formats in human genome variations analysis (Q8)
- Table S6. List of the topics and analytical operations in human genome variation analysis considered most important in the Respondents’ work (Q9)
- Table S7. Answers to Q14 (Do you use these sequencing methods?)
- Table S8. Mitochondrial databases (Q18)
- Table S9. Mitochondrial tools (Q18)
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatics, high performance computing
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Big Data Integration and Analytics, Non-coding RNA Annotation and Curation
Version 1 (08 Oct 20): reports from invited reviewers 1 and 2.