VNTRfinder and PolyPredictR documentation

 

Brief summary:

polypredictr.pl detects potentially polymorphic tandem repeats using rules described by Wren et al. [1] and Naslund et al. [2]. It takes as input the output of the Tandem Repeats Finder (TRF) program run with the '-d' option. See http://tandem.bu.edu/trf/trf.html for further information on the use of TRF.

vntrfinder.pl aligns similar repeats between sequences. Two input files are supplied: one containing a list of references and one containing a list of targets. Repeats are detected in the references, and the flanks of these repeats are searched against the targets, highlighting any variation. For instance, you might use one bacterial strain as a reference and several others as the targets.

Both programs can also be downloaded and run locally. They are configured to run on UNIX with TRF version trf321.linux.exe and will not work on Windows or Mac. Running locally is probably the best option if you want to analyse a large number of sequences or very large sequences. To unpack the downloaded archive on UNIX, use tar -xzf.

If you have any questions about running the programs or this documentation, please do not hesitate to contact me:

A tutorial is also provided on this site. Internet Explorer users can view it at http://bioinformatics.rcsi.ie/vntrfinder/Tutorial.mht and it can also be downloaded from http://bioinformatics.rcsi.ie/vntrfinder/Tutorial.pps.

 


 

 

PARAMETERS:

The parameters of the Tandem Repeats Finder (TRF) program are discussed in detail on the TRF homepage - http://tandem.bu.edu/trf/trf.html and in the original publication by Benson[3].

 

Default parameters are:

50      - the minimum score to report an alignment
2,7,7  - match, mismatch, and indel scores. For the mismatch and indel scores, 7 is the least permissive of the available choices.
500    - maximum period of a repeat, e.g. 'TG' from the array 'TGTGTGTG' has a period of 2
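
As a rough illustration, the defaults listed above can be assembled into a TRF command line as sketched below. The executable name is the one mentioned in this documentation. The match and indel probabilities (80 and 10) are not discussed here and are assumed from the values commonly used with TRF, and build_trf_command is a hypothetical helper, not part of either script.

```python
def build_trf_command(fasta_path, match=2, mismatch=7, indel=7,
                      pm=80, pi=10, minscore=50, maxperiod=500,
                      executable="trf321.linux.exe"):
    """Return the TRF argument list for the given parameters.

    pm and pi (match and indel probabilities) are assumed values;
    they are not covered in this documentation.
    """
    numbers = [match, mismatch, indel, pm, pi, minscore, maxperiod]
    # '-d' asks TRF to write the data file that polypredictr.pl reads.
    return [executable, fasta_path] + [str(n) for n in numbers] + ["-d"]


command = build_trf_command("reference.fasta")
print(" ".join(command))
```

The list form can be passed directly to subprocess.run if you want to launch TRF from Python.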

 

The choice of parameters affects the repetitive pattern reported by the TRF program, so consider this if you have specific repeats in mind that you would like to analyse. Default parameters for the program are pre-selected on the webserver. With the defaults, a tandem repeat is reported if it is at least 25 bases in length, e.g. 5 copies of a pentamer (a minimum score of 50 at a match score of 2 requires 25 matching bases). The higher the mismatch and indel scores, the less likely it is that an inexact tandem repeat will be detected. To detect shorter, more inexact repeats, lower the minscore from 50 and the mismatch and indel scores from 7. Similarly, if you want to exclude longer repeats, reduce the maxperiod from 500.
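
The 25-base figure follows from the scoring arithmetic, and the same calculation shows how lowering minscore shortens the smallest detectable repeat. The sketch below assumes a perfect array in which every base contributes the match score; the function names are illustrative only.

```python
import math


def min_perfect_repeat_length(minscore=50, match=2):
    """Shortest perfectly matching array that can reach minscore."""
    return math.ceil(minscore / match)


def min_copies(unit_length, minscore=50, match=2):
    """Fewest whole copies of a unit needed to reach that length."""
    needed = min_perfect_repeat_length(minscore, match)
    return math.ceil(needed / unit_length)


print(min_perfect_repeat_length())        # default parameters
print(min_copies(5))                      # copies of a pentamer
print(min_copies(2, minscore=30))         # copies of a dinucleotide at minscore 30
```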

 

 


 

 

MISMATCH:

The mismatch value is the maximum number of mismatches that will be tolerated when the flanks of repeats from the reference sequence(s) are searched against the target sequence(s). The program starts with a mismatch value of zero. When no hit is found or ambiguity is detected, it increases the mismatch value by one and searches again. This process continues until a single, unambiguous hit to the target sequence is obtained, or until the mismatch value specified by the user is reached.

 

Note that increasing the mismatch limit will increase the time needed for the program to run.

Also note that if the flanks are too short, e.g. <10 nt, no hit might be found regardless of the mismatch parameter selected. This is because the program ignores ambiguous results, i.e. searches that return two or more hits for the same sequence, and short flanks are more likely to match in several places.
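
The escalation procedure described above can be sketched as a simple loop. This is only an illustration: the real program searches with e-PCR, whereas the sketch swaps in a naive mismatch-counting scan of a single flank, and both function names are hypothetical.

```python
def find_hits(flank, target, max_mismatches):
    """Return start positions where flank matches target within the limit."""
    hits = []
    for start in range(len(target) - len(flank) + 1):
        window = target[start:start + len(flank)]
        mismatches = sum(1 for a, b in zip(flank, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


def search_with_escalation(flank, target, user_limit):
    """Raise the mismatch allowance from 0 until exactly one hit is found.

    Returns (position, mismatches_used), or None if the user-specified
    limit is reached without a single unambiguous hit.
    """
    for allowed in range(user_limit + 1):
        hits = find_hits(flank, target, allowed)
        if len(hits) == 1:
            return hits[0], allowed
    return None


target = "AAACGTACGGTTTT"
print(search_with_escalation("ACGTACGA", target, user_limit=2))
print(search_with_escalation("ACGTACGA", target, user_limit=0))
```

Each extra round repeats the whole search, which is why a higher mismatch limit lengthens the run time.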

 

 


 

 

 

TARGET:

The target refers to the sequence(s) across which you want to look for repeat length variation. Flanking sequences to repeats detected in the reference(s) are searched against the target(s) and the lengths between the flanks are recorded, highlighting any potential length variation.

Please be considerate to other users. If you intend to do a large amount of analysis, please download the relevant scripts and run these locally.

 

 


 

 

REFERENCE:

The reference refers to the sequence(s) you're interested in. For instance, if you were studying a species of bacteria and wanted to see how tandem repeats in this species differ from other species, you would use this species as a reference.

Note:
The reference only denotes the sequence(s) in which the tandem repeats are detected. Flanking sequences to repeats in the reference(s) are used to search for matches in the target(s). Returned results will summarise the lengths of the repeat blocks in the reference and all targets.

 

Please be considerate to other users. If you intend to do a large amount of analysis, please download the relevant scripts and run these locally.

 

 


 

 

 

FLANKS:

The flank length criterion is the length of flanking sequence used to compare a repeat between the reference and the target sequence(s).

 

Flanking sequences of the specified length are taken from both sides of each repeat in the reference(s) and are searched against the target(s). Where ambiguity is detected, the mismatch parameter is increased and the search is repeated until a hit is obtained or the maximum mismatch value specified is reached.

NOTE: Unless your sequence is very short, this parameter should generally be no less than 10 because the shorter the flanks, the greater the chance of a false positive match. In addition, the program does not report a match for a flank if two or more matches are found (ambiguity). If the sequence is poorly conserved, try increasing the mismatch parameter first.
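
A minimal sketch of how the two flanks might be cut from a reference sequence is shown below. It assumes that blockstart and blockstop are 1-based, inclusive coordinates and that flanks are simply truncated at the ends of the sequence; extract_flanks is a hypothetical helper, not a function from vntrfinder.pl.

```python
def extract_flanks(sequence, blockstart, blockstop, flank_length):
    """Return (flank1, flank2) around a repeat block.

    blockstart and blockstop are assumed to be 1-based, inclusive
    coordinates; flanks are truncated at the sequence ends.
    """
    left_begin = max(0, blockstart - 1 - flank_length)
    flank1 = sequence[left_begin:blockstart - 1]
    flank2 = sequence[blockstop:blockstop + flank_length]
    return flank1, flank2


# A (TG)6 array with ten bases of unique sequence on either side.
seq = "GATTACAGGC" + "TG" * 6 + "CCATGGAAGT"
print(extract_flanks(seq, 11, 22, 10))
print(extract_flanks(seq, 11, 22, 4))
```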

 

The search itself is carried out with e-PCR [4]. For each repeat, an STS (Sequence-Tagged Site) is constructed and used as input for e-PCR. The number of mismatches tolerated in the flanks is the mismatch parameter specified by the user. The margin used is the length of the tandem repeat array. Thus the length of the repeat array in the target sequence is allowed to deviate from that of the reference by up to the length of the reference repeat array.
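
The effect of setting the margin to the array length can be worked through numerically. The sketch below assumes that the amplified product spans both flanks plus the repeat array; the function names are illustrative and nothing here calls e-PCR itself.

```python
def allowed_block_lengths(reference_block_length):
    """Range of target array lengths accepted when margin == array length."""
    margin = reference_block_length
    low = max(0, reference_block_length - margin)
    high = reference_block_length + margin
    return low, high


def allowed_product_sizes(reference_block_length, flank_length):
    """The same window expressed as flank-to-flank product sizes.

    Assumes the product spans both flanks plus the repeat array.
    """
    low, high = allowed_block_lengths(reference_block_length)
    return low + 2 * flank_length, high + 2 * flank_length


print(allowed_block_lengths(30))
print(allowed_product_sizes(30, flank_length=10))
```

In other words, a target array anywhere between complete loss of the repeat and twice the reference length falls inside the search window.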

 

 


 

 

 

HIT RETENTION:

The level of stringency used to report a repeat aligned between the reference and the target sequence(s) can be adjusted. One of four options can be chosen:

 

A)      Keep a result when the hit “represents length difference consistent with change in the repeat copy-number”

Here, the hit is scanned with Tandem Repeats Finder. If the hit sequence has the same unit motif and length as the reference, and the copy-number is consistent with the block length reported for the hit, the hit is retained. This is the most stringent option.

B)      Keep a result when the hit “has the same repeat unit length and motif”

As in the first option, the hit sequence is scanned with Tandem Repeats Finder. If the detected tandem repeat has the same unit length and motif, the hit is retained.

C)      Keep a result when the hit “has any repetitive sequence”

The hit sequence is scanned with Tandem Repeats Finder to ensure that it is repetitive, but whether the unit length and motif are the same as those of the reference is not assessed.

D)      Keep a result when the hit “represents any sequence”

The hit is reported regardless of whether or not the hit sequence is repetitive.
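
The four options form a ladder of decreasing stringency, which can be sketched as below. Several details are assumptions made for illustration: motifs are compared as exact strings, the consistency test for option A allows a deviation of one repeat unit (the tolerance argument), and the Repeat class and retain_hit function are hypothetical rather than taken from vntrfinder.pl.

```python
from dataclasses import dataclass


@dataclass
class Repeat:
    unit: str           # repeat unit motif, e.g. "TG"
    copynumber: float   # copies reported for the array
    block_length: int   # length of the repeat array


def retain_hit(option, reference, hit_repeat, hit_length, tolerance=1.0):
    """Decide whether a hit is kept under options "A" to "D".

    hit_repeat is the repeat detected in the hit sequence, or None if
    the hit is not repetitive. tolerance (in repeat units) is an assumed
    allowance for option A; the real criterion is not specified here.
    """
    if option == "D":
        return True                      # any sequence is reported
    if hit_repeat is None:
        return False                     # A, B and C all need a repeat
    if option == "C":
        return True                      # any repetitive sequence
    same_unit = hit_repeat.unit == reference.unit
    if option == "B":
        return same_unit                 # same unit length and motif
    if option == "A":
        expected = hit_repeat.copynumber * len(hit_repeat.unit)
        deviation = abs(expected - hit_length)
        return same_unit and deviation <= tolerance * len(hit_repeat.unit)
    raise ValueError(f"unknown option: {option}")


reference = Repeat("TG", 6, 12)
hit = Repeat("TG", 9, 18)
print([retain_hit(opt, reference, hit, 18) for opt in "ABCD"])
print([retain_hit(opt, reference, Repeat("CA", 9, 18), 18) for opt in "ABCD"])
```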

 

 


 

 

 

EXPLANATION OF THE OUTPUT:

 

VNTRfinder outputs data with the following fields:

 

ID - ID of reference sequence containing the repeats
unit_lgth - Length of repeat unit
block_lgth - Length of repeat array
blockstart - Start position of repeat in reference sequence
blockstop - Stop position of repeat in reference sequence
copynumber - Number of times the repeat unit is repeated in tandem in the array
flank1 - 5’ flanking sequence of the repeat block
flank2 - 3’ flanking sequence of the repeat block
Population(q/h/h…) - Length of the repeat block in the reference/target1/target2 etc.
Gene diversity/Heterozygosity - 1 minus the sum of the squared allele frequencies in the population (cf. Weir, B. S. (1996) Genetic Data Analysis II: Methods for discrete population genetic data. Sunderland, MA, Sinauer Assoc.)
Variant_or_not - Whether or not the repeat has been observed to be variant
mismatchesinhits - The number of mismatches in the aligned flanks of the hits
st_dev - Standard deviation of the different repeat block lengths across the population
st_error - Standard error of the different repeat block lengths
Unit - Detected repeat unit
Block - Tandem repeat array
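
The summary statistics can be recomputed from the Population field, as in the sketch below. It treats each distinct block length as an allele and takes gene diversity as 1 minus the sum of squared allele frequencies. The slash-separated format of the field and the use of the sample (n - 1) standard deviation are assumptions; the scripts may differ on either point.

```python
import math
import statistics
from collections import Counter


def gene_diversity(block_lengths):
    """1 minus the sum of squared allele (block length) frequencies."""
    counts = Counter(block_lengths)
    total = len(block_lengths)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def st_dev(block_lengths):
    """Sample standard deviation (n - 1 denominator, an assumption)."""
    return statistics.stdev(block_lengths)


def st_error(block_lengths):
    """Standard error of the mean block length."""
    return st_dev(block_lengths) / math.sqrt(len(block_lengths))


population = "12/12/16/20"   # reference/target1/target2/target3
lengths = [int(x) for x in population.split("/")]
print(gene_diversity(lengths), st_dev(lengths), st_error(lengths))
print(len(set(lengths)) > 1)   # variant or not
```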

 

 

 

PolyPredictR outputs data with the following fields:

 

sequenceid - ID of sequence containing the potentially polymorphic repeat
start - Start position of repeat array in the sequence
stop - Stop position of repeat array in the sequence
unitlength - Length of repeat unit
copynumber - Number of times the repeat unit is repeated in tandem in the array
consensuslength - Length of consensus repeat
pcmatch - Percentage of matches between repeat array and consensus
pcindels - Percentage of indels between repeat array and consensus
score - Alignment score
A - % composition
C - % composition
G - % composition
T - % composition
entropy - Entropy based on % composition
yesnowren - Whether or not this repeat was predicted to be potentially polymorphic using the rules described by Wren et al.
yesnonaslund - Whether or not this repeat was predicted to be potentially polymorphic using the rules described by Naslund et al.
naslund_prediction - Naslund prediction from logistic regression values
naslund_p - Significance of Naslund prediction
repeatunit - Sequence of the repeat unit
repeatblock - Sequence of the repeat array
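
For orientation, the composition and entropy fields can be reproduced from a repeat block roughly as follows. The sketch assumes Shannon entropy in bits (base-2 logarithm), so that values range from 0 for a single-base array to 2 for equal proportions of all four bases; the function names are illustrative.

```python
import math


def composition(block):
    """Percentage of each base in the repeat array."""
    block = block.upper()
    return {base: 100.0 * block.count(base) / len(block) for base in "ACGT"}


def entropy(percentages):
    """Shannon entropy in bits from percentage composition."""
    total = 0.0
    for pct in percentages.values():
        p = pct / 100.0
        if p > 0:                 # bases that do not occur contribute nothing
            total -= p * math.log2(p)
    return total


pct = composition("TGTGTGTG")
print(pct, entropy(pct))
print(entropy(composition("ACGTACGT")))
```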

 

 

A visual summary is also provided, with information on what the colours mean.

 

 

 


 



[1] Wren, J. D., E. Forgacs, et al. (2000). "Repeat polymorphisms within gene regions: phenotypic and evolutionary implications." Am J Hum Genet 67(2): 345-56.

[2] Naslund, K., et al. (2005). "Genome-wide prediction of human VNTRs." Genomics 85(1): 24-35.

[3] Benson, G. (1999). "Tandem repeats finder: a program to analyze DNA sequences." Nucleic Acids Res 27(2): 573-80.

[4] Schuler, G. D. (1997). "Sequence mapping by electronic PCR." Genome Res 7(5): 541-50.