VNTRfinder and PolyPredictR documentation
Brief summary:
polypredictr.pl detects potentially polymorphic
tandem repeats using rules described by Wren et al.[1]
and Naslund et al.[2].
It takes as input the output of the
Tandem Repeats Finder (TRF) program run with the ‘-d’ option.
See http://tandem.bu.edu/trf/trf.html
for further information on the use of TRF.
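As a rough illustration of the input PolyPredictR works from, the sketch below reads TRF '-d' output into records. It assumes the commonly seen layout of the TRF data file: a "Sequence:" line naming each sequence, followed by data lines of 15 whitespace-separated fields. The function name parse_trf_dat and the choice of field names (borrowed from the PolyPredictR output fields listed later in this document) are illustrative assumptions, not part of either program.

```python
from typing import Dict, Iterable, Iterator

# Assumed column order of a TRF data line; names follow the PolyPredictR output fields.
FIELDS = ["start", "stop", "unitlength", "copynumber", "consensuslength",
          "pcmatch", "pcindels", "score", "A", "C", "G", "T",
          "entropy", "repeatunit", "repeatblock"]

def parse_trf_dat(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one record per repeat line, tagged with the current sequence ID."""
    sequenceid = None
    for line in lines:
        line = line.strip()
        if line.startswith("Sequence:"):
            sequenceid = line.split(":", 1)[1].strip()
            continue
        parts = line.split()
        # Data lines have exactly 15 fields and begin with a numeric start position.
        if len(parts) == len(FIELDS) and parts[0].isdigit():
            record = dict(zip(FIELDS, parts))
            record["sequenceid"] = sequenceid
            yield record

# Made-up sample: 30 perfect copies of 'TG'.
sample = [
    "Sequence: seq1",
    "",
    "Parameters: 2 7 7 80 10 50 500",
    "",
    "1 60 2 30.0 2 100 0 120 0 0 50 50 1.00 TG " + "TG" * 30,
]
for rec in parse_trf_dat(sample):
    print(rec["sequenceid"], rec["start"], rec["stop"], rec["repeatunit"])
```

All values are kept as strings here; converting them to numbers is left to the caller.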
vntrfinder.pl aligns similar repeats between
sequences. Two input files are supplied: one containing a list of references and
one containing a list of targets. Repeats are detected in the references, and
the flanks of these repeats are searched against the targets, highlighting any variations.
For instance, you might use one bacterial strain as a reference and several
others as the targets.
Both programs can also be downloaded and run locally. They are configured
to run on UNIX with TRF version trf321.linux.exe
and will not work on Windows or Mac. Running locally is probably the best option if you want
to analyse a large number of sequences or very long sequences. The scripts can be downloaded from this site (use
tar -xzf in UNIX to unpack the contents).
If you have any questions about running the programs or this
documentation, please do not hesitate to contact me:
A tutorial is
also provided on this site. Internet Explorer users can view it at http://bioinformatics.rcsi.ie/vntrfinder/Tutorial.mht
and it can also be downloaded from http://bioinformatics.rcsi.ie/vntrfinder/Tutorial.pps.
PARAMETERS:
The parameters of the Tandem Repeats
Finder (TRF) program are discussed in detail on the TRF homepage - http://tandem.bu.edu/trf/trf.html
and in the original publication by Benson[3].
Default parameters are:
50 - the minimum score required to report an alignment
2,7,7 - match, mismatch, and indel scores. For the latter two parameters, 7 is the least permissive of the available values.
500 - maximum period of a repeat, e.g. 'TG' from the array 'TGTGTGTG' has a period of 2
It is
important to note that the choice of parameters affects the repetitive
pattern reported by the TRF program, so bear this in mind if you have
specific repeats that you would like to analyse.
Default parameters for the program are pre-selected on the webserver.
To detect shorter, more inexact repeats, lower the minscore from 50
and the mismatch and indel scores from 7. Similarly, to exclude
repeats with longer units, reduce the maxperiod from 500. With the default parameters,
a tandem repeat is reported if it is at least 25 bases in length, e.g. 5 copies
of a pentamer. The higher the mismatch and indel penalties, the less likely it
is that an inexact tandem repeat will be detected.
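The 25-base figure follows from the scoring: with a match score of 2, a perfect repeat needs 25 matched bases to reach a minimum score of 50. A minimal sketch of that arithmetic, assuming a perfect repeat with no mismatches or indels (the function name is hypothetical and this is not how TRF itself is implemented):

```python
import math

def min_perfect_repeat_length(minscore: int = 50, match: int = 2) -> int:
    """Smallest number of perfectly matching bases whose total score reaches minscore."""
    return math.ceil(minscore / match)

print(min_perfect_repeat_length())       # 25 with the default parameters
print(min_perfect_repeat_length(30, 2))  # 15: a lower minscore admits shorter repeats
```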
The
mismatch value refers to the maximum number of mismatches that will be
tolerated when the flanks of repeats from the reference
sequence(s) are searched against the target sequence(s).
The program starts with a mismatch value of zero. When no hit is found or
ambiguity is detected, it increases the mismatch value by one and searches again.
This process continues until a single, unambiguous hit to the target sequence
is obtained, or until the mismatch value specified by the user is reached.
Note that
increasing the mismatch limit will increase the time needed for the program to
run.
Also note
that if the flanks are too short, e.g. <10 nt, no hit might be found regardless of the mismatch parameter selected. This is because the program ignores ambiguous results, i.e. cases where a search returns two or more hits for the same sequence.
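The escalating search described above can be sketched as follows. This is only an illustration of the loop: it searches a single flank by counting substitutions, whereas the program itself searches flank pairs with e-PCR. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

def hamming_hits(flank: str, target: str, max_mismatches: int) -> List[int]:
    """Start positions where flank matches target with at most max_mismatches substitutions."""
    hits = []
    for i in range(len(target) - len(flank) + 1):
        window = target[i:i + len(flank)]
        mismatches = sum(1 for a, b in zip(flank, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits

def search_with_escalation(flank: str, target: str,
                           user_limit: int) -> Optional[Tuple[int, int]]:
    """Start at zero mismatches and raise the allowance by one until exactly one hit is found.

    Returns (position, mismatches_allowed), or None if the user-specified limit is
    reached without a single unambiguous hit.
    """
    for allowed in range(user_limit + 1):
        hits = hamming_hits(flank, target, allowed)
        if len(hits) == 1:   # single, unambiguous hit
            return hits[0], allowed
        # zero hits or two or more hits: try again with one more mismatch
    return None

# The flank occurs once in the target with one substitution.
print(search_with_escalation("ACGTACGTAC", "TTTTACGTACGTTCGGGG", 2))  # (4, 1)
```

A very short flank such as "AC" typically matches in several places at zero mismatches, which is why short flanks tend to return nothing.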
The target refers to the sequence(s) across which you want
to look for repeat length variation. Flanking sequences to repeats detected in
the reference(s) are searched against the target(s)
and the lengths between the flanks are recorded, highlighting any potential
length variation.
Please be considerate
to other users. If you intend to do a large amount of analysis, please download the relevant scripts and
run these locally.
The
reference refers to the sequence(s) you're interested in. For instance, if you
were studying a species of bacteria and wanted to see how tandem repeats in
this species differ from other species, you would use this species as a
reference.
Note: The reference only denotes the sequence(s) in which the tandem
repeats are detected. Flanking sequences to repeats in the reference(s) are
used to search for matches in the target(s). Returned results
will summarise the lengths of the repeat blocks in the reference and all
targets.
The
flanklength parameter sets the length of the flanking sequence used to compare a repeat
between the reference and the target
sequence(s).
Flanking sequences of the
specified length are taken from both sides of each repeat in
the reference(s) and searched against the target(s). Where no hit is found or ambiguity is
detected, the mismatch parameter is increased and the
search is repeated until a hit is obtained or the maximum mismatch value
specified is reached.
NOTE: Unless your sequence is very short, this parameter should
generally be no less than 10 because the shorter the flanks, the greater the
chance of a false positive match. In addition, the program does not report
matches for a flank if there are two or more of them (ambiguity). If the
sequence is poorly conserved, try increasing the mismatch parameter first.
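To make the flank definition concrete, here is a minimal sketch of cutting the two flanks out of a reference sequence. It assumes that blockstart and blockstop are 1-based, inclusive coordinates; the function name is hypothetical and the scripts may handle sequence ends differently.

```python
from typing import Tuple

def extract_flanks(sequence: str, blockstart: int, blockstop: int,
                   flanklength: int) -> Tuple[str, str]:
    """Return (flank1, flank2): the 5' and 3' sequences bordering the repeat block."""
    left_start = max(0, blockstart - 1 - flanklength)   # clip at the sequence start
    flank1 = sequence[left_start:blockstart - 1]
    flank2 = sequence[blockstop:blockstop + flanklength]  # slicing clips at the sequence end
    return flank1, flank2

reference = "AAAAACCCCC" + "TGTGTGTG" + "GGGGGTTTTT"
print(extract_flanks(reference, 11, 18, 5))  # ('CCCCC', 'GGGGG')
```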
Searching is carried out
with e-PCR[4].
For each repeat, an STS (Sequence-Tagged Site) is constructed and
used as input for e-PCR. The number of mismatches tolerated in the flanks
is the one specified by the user. The margin
used is the length of the tandem repeat array, so the length of the repeat
array in the target sequence may deviate from that of the reference
by up to the length of the reference repeat array.
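The effect of the margin can be worked through with a small calculation. The sketch below assumes that the expected product spans flank + repeat array + flank; that layout and the function names are assumptions made for illustration, not a description of the STS records the script actually writes.

```python
from typing import Tuple

def expected_product_size(flanklength: int, ref_block_length: int) -> int:
    """Assumed span of the STS: both flanks plus the reference repeat array."""
    return 2 * flanklength + ref_block_length

def accepted_size_range(flanklength: int, ref_block_length: int) -> Tuple[int, int]:
    """Product sizes accepted when the margin equals the reference array length."""
    margin = ref_block_length
    size = expected_product_size(flanklength, ref_block_length)
    return size - margin, size + margin

# 20 nt flanks around a 40 nt repeat array.
print(accepted_size_range(20, 40))  # (40, 120)
```

Under these assumptions the target array may be anywhere from 0 to 80 nt long, i.e. up to twice the reference array length.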
The level
of stringency used to report a repeat aligned between the reference
and the target sequence(s) can be adjusted. One of four options
can be chosen:
A) Keep a result when the hit “represents length difference consistent with change in the repeat
copy-number”
Here, the hit is scanned with Tandem Repeats Finder. If the
hit sequence has the same unit motif and length as the reference, and if the copy-number
is consistent with the block length reported for the hit, the hit is retained. This is
the most stringent option.
B) Keep a result when the hit “has the same repeat unit length and motif”
As in the first option, the hit sequence is scanned with Tandem
Repeats Finder. If the detected tandem repeat has the same unit length and motif,
the hit is retained.
C) Keep a result when the hit “has any repetitive sequence”
The hit sequence is scanned with Tandem Repeats Finder to ensure
that it is repetitive, but whether the unit length and motif are the same as
those of the reference is not assessed.
D) Keep a result when the hit “represents any sequence”
The hit is reported regardless of whether or not the hit sequence
is repetitive.
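The four options form a ladder from most to least stringent, which can be sketched as a single filter function. The HitRepeat record, the exact-string comparison of motifs, and the tolerance used for option A are all assumptions made for this illustration; the script's own consistency test is not specified in this document.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HitRepeat:
    """Hypothetical summary of what TRF reports when the hit sequence is scanned."""
    unit: str          # repeat unit motif found in the hit
    copynumber: float  # copy number reported for the hit

def keep_hit(option: str, ref_unit: str, hit_block_length: int,
             hit: Optional[HitRepeat], tolerance: float = 1.0) -> bool:
    """Decide whether a hit is retained under stringency option A, B, C or D.

    hit is None when TRF finds no repeat in the hit sequence.
    """
    if option not in ("A", "B", "C", "D"):
        raise ValueError("option must be one of A, B, C, D")
    if option == "D":            # any sequence
        return True
    if hit is None:              # A, B and C all require a repetitive hit
        return False
    if option == "C":            # any repetitive sequence
        return True
    if hit.unit != ref_unit:     # A and B require the same unit length and motif
        return False
    if option == "B":
        return True
    # Option A: copy number must agree with the hit's block length
    # (assumed here to mean within `tolerance` repeat units).
    implied_length = hit.copynumber * len(hit.unit)
    return abs(implied_length - hit_block_length) <= tolerance * len(hit.unit)

print(keep_hit("A", "TG", 10, HitRepeat("TG", 5.0)))  # True
print(keep_hit("B", "TG", 10, HitRepeat("CA", 5.0)))  # False
```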
EXPLANATION OF THE OUTPUT:
VNTRfinder outputs data with the
following fields:
ID | ID of reference sequence containing the repeats
unit_lgth | Length of repeat unit
block_lgth | Length of repeat array
blockstart | Start position of repeat in reference sequence
blockstop | Stop position of repeat in reference sequence
copynumber | Number of times the repeat unit is repeated in tandem in the array
flank1 | 5’ flanking sequence of the repeat block
flank2 | 3’ flanking sequence of the repeat block
Population(q/h/h…) | Length of the repeat block in the reference/target1/target2 etc.
Gene diversity/Heterozygosity | Heterozygosity (a.k.a. gene diversity): 1 minus the sum of the squared allele frequencies in the population (cf. Weir, B. S. (1996) Genetic Data Analysis II: Methods for discrete population genetic data)
Variant_or_not | Whether or not the repeat has been observed to be variant
mismatchesinhits | The number of mismatches in the aligned flanks of the hits
st_dev | Standard deviation of the different repeat block lengths across the population
st_error | Standard error of the different repeat block lengths
Unit | Detected repeat unit
Block | Tandem repeat array
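The summary statistics in this output can be reproduced from the list of repeat block lengths (reference first, then each target). The sketch below treats each distinct block length as an allele and computes gene diversity as one minus the sum of squared allele frequencies. Whether VNTRfinder applies a sample-size correction, or uses the sample rather than the population standard deviation, is not stated here, so those choices are assumptions; the function names are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import stdev
from typing import Dict, List, Union

def gene_diversity(lengths: List[int]) -> float:
    """1 minus the sum of squared allele frequencies (each distinct length is an allele)."""
    n = len(lengths)
    return 1.0 - sum((count / n) ** 2 for count in Counter(lengths).values())

def summarise(lengths: List[int]) -> Dict[str, Union[float, bool]]:
    """Summary fields for one repeat across the reference and all targets."""
    sd = stdev(lengths) if len(lengths) > 1 else 0.0  # sample standard deviation (assumed)
    return {
        "gene_diversity": gene_diversity(lengths),
        "variant": len(set(lengths)) > 1,
        "st_dev": sd,
        "st_error": sd / sqrt(len(lengths)),
    }

# Block lengths in the reference and three targets.
print(summarise([40, 40, 44, 48]))
```

For these four lengths the allele frequencies are 0.5, 0.25 and 0.25, giving a gene diversity of 0.625.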
PolyPredictR outputs data with the
following fields:
sequenceid | ID of sequence containing the potentially polymorphic repeat
start | Start position of repeat array in the sequence
stop | Stop position of repeat array in the sequence
unitlength | Length of repeat unit
copynumber | Number of times the repeat unit is repeated in tandem in the array
consensuslength | Length of consensus repeat
pcmatch | Percentage of matches between repeat array and consensus
pcindels | Percentage of indels between repeat array and consensus
score | Alignment score
A | % composition
C | % composition
G | % composition
T | % composition
entropy | Entropy based on % composition
yesnowren | Whether or not this repeat was predicted to be potentially polymorphic using the rules described by Wren et al.
yesnonaslund | Whether or not this repeat was predicted to be potentially polymorphic using the rules described by Naslund et al.
naslund_prediction | Naslund prediction from logistic regression values
naslund_p | Significance of Naslund prediction
repeatunit | Sequence of the repeat unit
repeatblock | Sequence of the repeat block
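The entropy field is described only as being based on % composition. One plausible reading is the Shannon entropy of the four base percentages, sketched below. The use of log base 2 (giving a range of 0 to 2) and the function name are assumptions, not a statement of how TRF computes the value.

```python
from math import log2

def composition_entropy(a: float, c: float, g: float, t: float) -> float:
    """Shannon entropy (bits) of the base composition; inputs are the A, C, G, T percentages."""
    total = a + c + g + t
    probs = [x / total for x in (a, c, g, t) if x > 0]  # skip absent bases (0 * log 0 = 0)
    return -sum(p * log2(p) for p in probs)

print(composition_entropy(25, 25, 25, 25))  # 2.0: all four bases equally common
print(composition_entropy(0, 0, 50, 50))    # 1.0: e.g. a pure TG repeat
```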
A visual summary is also provided, with
a key explaining what the colours mean.

[1] Wren, J. D., E. Forgacs, et al. (2000). "Repeat polymorphisms within gene regions: phenotypic and evolutionary implications." Am J Hum Genet 67(2): 345-56.
[2] Naslund, K., et al. (2005). "Genome-wide prediction of human VNTRs." Genomics 85(1): 24-35.
[3] Benson, G. (1999). "Tandem repeats finder: a program to analyze DNA sequences." Nucleic Acids Res 27(2): 573-80.