About Standards and Replicon Genetics …

…objectives

Our aim is to create synthetic sequence standards that are useful for:

The creation of test data, such as ‘spike-in controls’, by bioinformaticians managing variant-calling bioinformatics pipelines
Diagnostic laboratories running such pipelines
Diagnostic developers validating their NGS bioinformatics pipelines
Situations where real tissue samples are unavailable
Students of bioinformatics creating bioinformatics pipelines and needing test datasets to work with.

We want you to have confidence that your bioinformatics pipelines are giving accurate insights from real patient data, and we want to help you support accurate patient diagnosis.

References about standards:

Hardwick, S., Deveson, I. & Mercer, T. Reference standards for next-generation sequencing. Nat Rev Genet 18, 473–484 (2017). https://doi.org/10.1038/nrg.2017.44

Blackburn, J., Wong, T., Madala, B.S. et al. Use of synthetic DNA spike-in controls (sequins) for human genome sequencing. Nat Protoc 14, 2119–2151 (2019). https://doi.org/10.1038/s41596-019-0175-1

O’Sullivan, D.M., Doyle, R.M., Temisak, S. et al. An inter-laboratory study to investigate the impact of the bioinformatics component on microbiome analysis using mock communities. Sci Rep 11, 10590 (2021). https://doi.org/10.1038/s41598-021-89881-2

…personalized standards for you

Synthetic Reads Generator (SyRGen) can be customised to generate a set of standards suitable for your analysis pipeline:

The haplotype data provided here can be used to create any spike-in control. Using the source code, anyone can create a custom set of haplotypes to represent different germline and somatic genotypes arising from normal and tumour cell lines (such as might result from a tissue sample). You can then use the browser-based app or Tkinter GUI to adjust the following parameters:

Depth of coverage
Read-length
Relative proportion of haplotypes
Range of data quality
Genomic, mRNA (“exome”) and CDS data
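
As a rough illustration of how the first three parameters interact, the sketch below uses the standard relationship that mean depth equals number of reads times read length divided by region length, and then splits the reads by haplotype proportion. It is not SyRGen's own code: the function name, argument names and example values are all hypothetical.

```python
def reads_per_haplotype(region_length, read_length, depth, proportions):
    """Split the reads needed for a target mean depth across haplotypes.

    Hypothetical helper for illustration only; not part of SyRGen or RG_exploder.
    `proportions` maps a haplotype name to a relative weight.
    """
    if region_length <= 0 or read_length <= 0:
        raise ValueError("region_length and read_length must be positive")
    total_weight = sum(proportions.values())
    if total_weight <= 0:
        raise ValueError("proportions must sum to a positive value")
    # Mean depth = (number of reads * read length) / region length,
    # so the number of reads needed is depth * region length / read length.
    total_reads = round(depth * region_length / read_length)
    # Share the reads out according to each haplotype's relative weight.
    return {
        name: round(total_reads * weight / total_weight)
        for name, weight in proportions.items()
    }


# Example values are made up: a 10 kb region at 30x with 100-base reads,
# with a 9:1 mix of a normal and a tumour haplotype.
counts = reads_per_haplotype(
    region_length=10_000,
    read_length=100,
    depth=30,
    proportions={"normal": 9, "tumour": 1},
)
print(counts)
```

Lowering the tumour weight in a mix like this is one way to picture a low-frequency somatic variant: the variant is still present, but far fewer reads carry it.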

Other options support:

Paired-end sequencing; dual-strand or single-strand sequencing
Selection of variant-only reads
Genomic-position annotation
CIGAR annotation
Annotation of haplotype source
Randomly generated reads vs all-possible reads (random-only for paired-end).
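
The difference between all-possible reads and randomly generated reads can be sketched as follows. This is a minimal illustration under stated assumptions, not the RG_exploder algorithm: the function names and the short example sequence are hypothetical, positions are 1-based, and the CIGAR string is written relative to the haplotype the read was cut from, so it is always a full match. Annotating against a reference containing insertions or deletions would need extra bookkeeping that is not shown here.

```python
import random


def all_possible_reads(sequence, read_length):
    """Yield (1-based start, read) for every window of read_length bases."""
    for start in range(len(sequence) - read_length + 1):
        yield start + 1, sequence[start:start + read_length]


def random_reads(sequence, read_length, count, seed=None):
    """Return `count` reads drawn from uniformly random start positions."""
    last_start = len(sequence) - read_length
    if last_start < 0:
        raise ValueError("read_length is longer than the sequence")
    rng = random.Random(seed)  # seeded so that a run can be repeated
    reads = []
    for _ in range(count):
        start = rng.randint(0, last_start)  # both ends inclusive
        reads.append((start + 1, sequence[start:start + read_length]))
    return reads


# Made-up 12-base haplotype, for illustration only.
haplotype = "ACGTACGGTCAT"
every = list(all_possible_reads(haplotype, 5))
sampled = random_reads(haplotype, 5, count=3, seed=1)

print(len(every), "possible reads of length 5")
for position, read in sampled:
    # A read copied directly from its haplotype matches it base for base.
    print(f"pos={position} cigar={len(read)}M read={read}")
```

All-possible reads give exhaustive, evenly tiled coverage, which suits deterministic tests; random reads look more like the uneven sampling of a real sequencing run.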

See the Help Page for more detail.

Services

This work is in the Public Domain.

We are no longer looking to provide any services, beyond enabling collaborative development in the Public Domain.

For suggestions and further development, please contact us at

syrgenreads@gmail.com

…origins

Replicon Genetics was set up on 21 August 2018 by two former-AZ scientists who saw the need for simulated NGS data standards in precision medicine to validate bioinformatics pipelines. Having worked with hospital diagnostic laboratories, they could see that sequence-data standards were being generated from wet-chemistry standards but often did not contain the mutations of interest, and that variables such as data quality and coverage were left uncontrolled.

Hearing comments like: “We don’t have a problem with our bioinformatics pipelines; it’s all those other laboratories with dodgy, poorly-validated pipelines that have the problem.” was another incentive.

But they wondered: “How can you be sure that your bioinformatics pipeline isn’t misdiagnosing hundreds of patients each week? Quality isn’t a problem… until it’s a BIG problem.”

After a conversation at Dunham Massey National Trust coffee shop, a former-AZ bioinformatician agreed to address the task on a “part-time basis”.

To avoid hosting large sequence-read files on a website, with the inevitable problem of slow download speeds, an obvious solution was a browser-based generator of reads, which we named “Synthetic Reads Generator”, or SyRGen for short. It is built on top of Python code called RG_exploder.

Acknowledgements:

We are very grateful to the following people who have been generous with their time to give us constructive feedback:

Dr Eleanor Baker and her team at North West Genomic Laboratory Hub for suggesting the AK2, CIITA and NCF1 variants data, and for running an evaluation using those data.

Dr Simon Patton at EMQN for comments on the 2021 and 2024 versions.

Christophe Roos at Euformatics, from June to December 2024, for comments and advice on paired-end exomic data.

It would be remiss not to acknowledge the countless contributions on Stack Overflow from the many people who have helped me out of a tricky corner, particularly with code for the Tkinter GUI, for which Bryan Oakley deserves a grateful shout of his own. The infuriatingly simple W3Schools tutorials have also provided useful examples.

Without Biopython, this would have been much harder; without Pyodide, providing a browser-based version would have been beyond our ability at the time we began.

Intellectual property and development

Original idea: Dr Gillian Ellison & Jane Theaker 2018

Copy on this website: “Benefits” and “about” page: Dr Gillian Ellison; first three columns on this “more-about” page: Jane Theaker

Algorithm development, Python Implementation, remainder of this website: Cary O’Donnell 2018-2025

Original Vue.js version and code for AWS hosting: Raven Biosciences 2021