QTrim

Introduction

QTrim is a quality trimming tool for next-generation sequencing (NGS) reads. It is designed primarily for 454/Roche sequence data but will work on any NGS reads that use Phred quality scores. It takes sequence reads in fastq format (or a .fasta file and a .qual file) as input and outputs quality-trimmed data in several user-defined formats.

QTrim performs quality trimming of sequence reads using a novel algorithm. As well as outputting trimmed reads in a number of formats, QTrim outputs statistics for the trimmed and untrimmed reads. If the plotting option is invoked, QTrim also outputs plots of read length, read quality and the spectrum of quality calls at each position across all reads.


Availability

QTrim is freely available for academic and non-commercial use; commercial users may need to pay a license fee (please contact us for more details).

 

Executables

Download the QTrim executables: QTrim for Linux/UNIX and QTrim for MacOS. The executables work on both 32-bit and 64-bit machines. For ease of installation and use, we strongly recommend using these executables.

 

Source Code

The QTrim source code is a standalone Python script whose use depends on a number of installed software packages (see the requirements below). It works on all flavours of Linux and Unix, including Mac OS X. Copy the script to a working directory and it is ready to run; we recommend a directory in your PATH (e.g. /usr/bin/).

 

Source Code Requirements

python (v2.6 and above) (http://www.python.org/download/releases/)

Biopython (v1.57 and above)  (http://www.biopython.org/wiki/Download)

 

Performing graphical plotting requires further installation of:

numpy (http://sourceforge.net/projects/numpy). NB: install numpy before matplotlib.

Matplotlib (v1.1.0+) (http://matplotlib.sourceforge.net/)  



 

Installation

The QTrim executables are standalone python executables and will work in all flavours of Linux and Unix including Mac OS X.  

Extraction of the QTrim executable creates a directory (QTrim_v1_1) containing the executable (QTrim_v1_1) as well as all of the files necessary for successful running of QTrim.   

QTrim can be run by calling the executable on the command line with its full path, e.g.

/Users/ram/QTrim_v1_1/QTrim_v1_1

 

OR

By adding the QTrim_v1_1 directory to the PATH and then calling QTrim_v1_1 by name.

Requirements

python (v2.6 and above, not version 3) (http://www.python.org/download/releases/)

 

Input Files

QTrim will take either:

A fastq file  (contains both sequence data and associated quality scores in a single file).

OR

A .fasta file containing the sequence data and a .qual file containing the associated quality scores.
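As a rough illustration of the fastq input described above, the following Python sketch reads four-line fastq records and converts each quality string into integer Phred scores. It assumes the common Sanger (Phred+33) ASCII encoding and unwrapped four-line records; both are assumptions, since the encoding QTrim expects is not stated here. The function name read_fastq is hypothetical and is not part of QTrim.

```python
import io

PHRED_OFFSET = 33  # assumption: Sanger / Phred+33 ASCII encoding


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record fastq stream."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records
        seq = handle.readline().rstrip("\r\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed fastq record: %r" % header)
        # Convert each quality character to its integer Phred score.
        yield header[1:], seq, [ord(c) - PHRED_OFFSET for c in qual]


example = io.StringIO("@read1\nACGT\n+\nII5#\n")
records = list(read_fastq(example))
```

With the offset assumed here, the characters I, 5 and # decode to Phred scores 40, 20 and 2 respectively.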

 

Example Input Files

Download example fastq files to test QTrim.  These files correspond to the poor and good quality data used in the original evaluation of QTrim.

 


Running QTrim

QTrim can be run with all of the default options by typing either of the following (the examples assume the QTrim directory has been added to the PATH):


QTrim_v1_1  -fastq InputFastqFilename

OR

QTrim_v1_1 -fasta InputFastaFilename -qual InputQualFilename   
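When QTrim is called from a script rather than typed by hand, the two command forms above can be assembled programmatically. The sketch below is one possible way to do this in Python; the helper names build_qtrim_command and run_qtrim and the file name reads.fastq are hypothetical. It only handles options that take a value (such as -l and -m, described under Options), not bare switches such as -v.

```python
import subprocess


def build_qtrim_command(fastq=None, fasta=None, qual=None,
                        executable="QTrim_v1_1", **options):
    """Return the argument list for a QTrim run without executing it."""
    if fastq is not None and fasta is None and qual is None:
        cmd = [executable, "-fastq", fastq]
    elif fastq is None and fasta is not None and qual is not None:
        cmd = [executable, "-fasta", fasta, "-qual", qual]
    else:
        raise ValueError("give either fastq, or fasta together with qual")
    # Append value-taking options, e.g. m=20 becomes "-m 20".
    for name in sorted(options):
        cmd.extend(["-" + name, str(options[name])])
    return cmd


def run_qtrim(cmd):
    """Run the command; assumes the QTrim executable is on the PATH."""
    subprocess.check_call(cmd)


cmd = build_qtrim_command(fastq="reads.fastq", m=20, l=50)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.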

 

Options

While QTrim can be run with its default options to output high-quality trimmed sequence reads, we provide a number of options with which users can refine the sensitivity of their trimming analysis.


Available Options

Each option is listed below with its description and, where one exists, its default value.

-h, -help, --help

Prints available options and example commands

 

-v

Prints a verbose output to the screen while processing and trimming sequence reads.

 

Input Options

 

 

-fastq

fastq file that contains both sequence data and quality scores. Quality scores should be in PHRED format.

 

-fasta

Input fasta file containing all sequence reads.

 

-qual

Input quality file containing the quality scores for the reads in the fasta file. (The -fasta and -qual options must be used together.)

 

Output Options

 

 

-o

Output filename.

Default: Outputfile

-out_format

Output file format

Options:

1: fastq format with quality scores as integer values.

2: fastq format with quality scores as ASCII characters.

3: separate sequence (.fasta) and quality (.qual) files with quality scores as integer values.

Default: 2

-seq_id_stat

Add the relevant trimming statistics to the sequence name of each read in the output fastq/fasta file.  These statistics are:

1)   Read length (after trimming)

2)   Mean quality score (after trimming)

3)   Number of bases trimmed from the 3’ end (and the 5’ end, if applicable) during trimming.

 

 

-plot

If this option is invoked, QTrim produces a number of plots of the statistics associated with the trimming (see below for further details).

Available output formats are: eps, pdf, svg, svgz.

If this option is not invoked, trimming proceeds without producing plots.

Default: pdf

Trim Options

 

 

-l

Minimum read length.  Any reads that do not satisfy this threshold prior to trimming are discarded.  Any reads that are trimmed to below this threshold during trimming are also discarded.

 

Default: 50

-m

Mean quality score threshold that each final trimmed read must satisfy. For example, at Q20 (-m 20) the mean quality score of each trimmed read will be ≥ 20. Recommended thresholds range from 15 (not stringent) to 30 (very stringent).

Default: 20

-w

Window size used during trimming. In most cases the default value (the value set with the -l option) will suffice.

Default: the value of -l

-mode

Execution mode. Four modes are available; set the mode with one of the integers 1, 2, 3 or 4 only.

Modes:

1)   Trims from the 3’ end only; removes any ambiguous bases (Ns) in the middle of each read.

2)   Trims from the 3’ end only; ambiguous bases in the middle of reads are left untouched.

3)   Trims from both the 3’ and 5’ ends; removes any ambiguous bases (Ns) in the middle of each read.

4)   Trims from both the 3’ and 5’ ends; ambiguous bases in the middle of reads are left untouched.

 

In the vast majority of cases mode 2 (default) is the most appropriate. 

 

Default: 2
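To make the -l and -m thresholds concrete, the sketch below checks whether an already trimmed read would satisfy them, and shows what a Phred score means as an error probability (P = 10^(-Q/10), the standard Phred definition). It only tests the end condition described above for the two options; it does not reproduce QTrim's trimming algorithm, and the function names are hypothetical.

```python
def error_probability(q):
    """Convert a Phred score to the estimated base-call error probability."""
    return 10 ** (-q / 10.0)


def mean_quality(quals):
    """Arithmetic mean of a list of integer Phred scores."""
    return sum(quals) / float(len(quals))


def passes_thresholds(quals, min_length=50, min_mean_q=20):
    """True if a read meets the minimum length (-l) and mean quality (-m)."""
    return len(quals) >= min_length and mean_quality(quals) >= min_mean_q


good = [30] * 60               # long enough, mean quality 30
short = [30] * 40              # good quality but shorter than 50 bases
poor = [30] * 30 + [5] * 30    # long enough, mean quality 17.5

results = [passes_thresholds(r) for r in (good, short, poor)]
```

With the default values of 50 and 20, only the first of the three example reads passes. A Phred score of 20 corresponds to an estimated error rate of 1 in 100 base calls.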

Outputs

 

Output Files

By default QTrim outputs a number of files. Unless the -o option has been invoked, the names of all of these files start with Outputfile:

 

Outputfile.fastq

This is the file containing the quality trimmed reads in fastq format with sequence quality scores in ASCII characters.  The format of this file can be changed using the -out_format option.


Outputfile_stat.txt

This file contains basic statistics about the trimming:

  • The total number of reads input.
  • The total number of reads output after trimming.
  • The mean read length in the output.
  • The minimum and maximum read lengths in the output.

 

Outputfile_Trimming_report

Details the total number of bases trimmed from each sequence read.


Outputfile_readlength_read_pair.txt

Contains the number of reads (second column) with a certain read length (first column) that were output in the final trimmed sequence data.  This file can be used to plot the range of read lengths present in the trimmed sequence data.   
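The two-column read length file can also be summarised without plotting. The following sketch assumes the columns are separated by whitespace (tabs or spaces), which is not confirmed above, and skips any line that does not hold two integers. The function name summarise_read_lengths is hypothetical.

```python
import io


def summarise_read_lengths(handle):
    """Return (total_reads, mean_length, min_length, max_length)."""
    counts = {}
    for line in handle:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or incomplete line
        try:
            length, n_reads = int(fields[0]), int(fields[1])
        except ValueError:
            continue  # e.g. a header line, if one is present
        counts[length] = counts.get(length, 0) + n_reads
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no read-length rows found")
    mean = sum(length * n for length, n in counts.items()) / float(total)
    return total, mean, min(counts), max(counts)


# First column: read length; second column: number of reads of that length.
table = io.StringIO("50\t2\n100\t6\n150\t2\n")
summary = summarise_read_lengths(table)
```

For a real run, the in-memory example would be replaced by the opened Outputfile_readlength_read_pair.txt file.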

 

 

Output Graphics

If the -plot option is invoked, QTrim outputs graphs of several statistics about the data. Separate graphs show the same statistics before and after quality trimming.

 

Quality Score Trends

Next-generation sequencing approaches produce data whose quality degrades roughly linearly along each read. These plots use box and whisker plots to represent the spectrum of quality across all reads in the data. To make viewing easier, the data is sampled every 10 base pairs. The second y-axis corresponds to the green line and reflects the coverage (number of reads) across the range of read lengths (x-axis).
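The data behind such a plot can be sketched as follows: at every tenth position, collect the quality scores of all reads that reach that position, then record the coverage and a summary of the scores. For brevity this sketch reports only the median rather than the full box and whisker statistics. Which positions QTrim actually samples is an assumption here, and the function name quality_trend is hypothetical.

```python
import statistics


def quality_trend(reads_quals, step=10):
    """Return [(position, coverage, median_quality)] sampled every `step` bases."""
    longest = max(len(q) for q in reads_quals)
    trend = []
    for pos in range(0, longest, step):
        # Only reads long enough to reach this position contribute.
        column = [q[pos] for q in reads_quals if len(q) > pos]
        trend.append((pos + 1, len(column), statistics.median(column)))
    return trend


# Three reads of lengths 25, 15 and 5 with constant quality scores.
reads = [[40] * 25, [30] * 15, [20] * 5]
trend = quality_trend(reads)
```

Coverage falls from 3 reads at position 1 to 1 read at position 21, which is the kind of decline the green line in the plot represents.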

Read Length Distribution

The distribution of read lengths can be used as an approximation of the quality of NGS sequence data. The read lengths of good quality sequence data generally cluster together with similar lengths (both before and after trimming) when plotted. These plots also give a graphical overview of how much data was removed during trimming and whether the level of trimming was consistent between reads.

 

Mean Quality Score Distribution

These plots show the distribution of mean quality scores for all reads. The x-axis indicates the mean quality score, while the y-axis indicates the number of sequence reads in the dataset that have each mean quality score. In the plots prior to trimming, the x-axis can theoretically range from 1 to 40, while in the post-trimming plot the minimum value of the x-axis corresponds to the threshold that each of the final trimmed reads satisfies (-m option, set to 20 by default).
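The counts behind such a distribution can be sketched in a few lines of Python. Rounding each mean down to an integer bin is an assumption made for illustration; how QTrim bins the scores is not stated here. The function name mean_quality_histogram is hypothetical.

```python
from collections import Counter


def mean_quality_histogram(reads_quals):
    """Count reads per mean quality score, rounded down to an integer bin."""
    hist = Counter()
    for quals in reads_quals:
        if quals:  # ignore empty reads
            hist[int(sum(quals) / float(len(quals)))] += 1
    return dict(hist)


# Mean quality scores of these reads are 25, 25, 40 and 10.
reads = [[20, 30], [25, 25], [40, 40, 40], [10]]
hist = mean_quality_histogram(reads)
```

The keys of the resulting dictionary correspond to the x-axis values of the plot and the counts to the y-axis values.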