RePGP

Tutorial for VCF/BCF files

A single file can be downloaded with wget:

$ wget http://files.teamerlich.org/repgp/repgp.vcf.gz

The imputed per-chromosome files (chromosomes 1 to 22) can be downloaded in a loop:

$ for num in $(seq 22); do
    wget http://files.teamerlich.org/repgp/repgp-imputed.chr${num}.vcf.gz
done
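
Before starting 22 downloads it can help to see which URLs the loop will request. The following is a minimal sketch of such a dry run; the variable base_url and the function repgp_imputed_urls are hypothetical names introduced here for illustration, and the file-name pattern is taken from the loop above.

```shell
# Hypothetical helper: print one URL per chromosome (1-22), following the
# file-name pattern used in the download loop.
base_url="http://files.teamerlich.org/repgp"

repgp_imputed_urls() {
    for num in $(seq 22); do
        printf '%s/repgp-imputed.chr%s.vcf.gz\n' "$base_url" "$num"
    done
}

# Dry run: show the first and last URL without downloading anything.
repgp_imputed_urls | sed -n '1p;$p'

# To actually download, pass each URL to wget (left commented out here):
# repgp_imputed_urls | xargs -n 1 wget
```

Keeping the URL list separate from the download step makes it easy to check the list or to swap wget for another tool.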

Using curl:

$ curl -O http://files.teamerlich.org/repgp/repgp.vcf.gz

The -O option tells curl to save the file under its remote name. Without it, curl writes the file contents to standard output.

Viewing the vcf/bcf:

To read the VCF or BCF file, run the following command:

$ bcftools view repgp.vcf.gz | less -S
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20160712
##source=intersection_dbSNP_23andme
##dbSNP_BUILD_ID=141
##reference=GRCh37p13
##phasing=none
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=X>
##contig=<ID=MT>
##contig=<ID=Y>
##bcftools_mergeVersion=1.2-108-gd4d42ca+htslib-1.2.1-188-g6d2810c
##bcftools_mergeCommand=merge ./pgp_hu019BBA_238.clean.vcf.gz ./pgp_hu33F35D_119.clean.vcf.gz ./pgp_hu85E6EC_692.clean.vcf.gz ./pgp_huA27736_992.clean.vcf.gz ./pgp_huD87BF...
##bcftools_viewVersion=1.2-108-gd4d42ca+htslib-1.2.1-188-g6d2810c
##bcftools_viewCommand=view repgp.vcf.gz
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  pgp_hu019BBA_238        pgp_hu33F35D_119        pgp_hu85E6EC_692        pgp_huA27736_992        pgp...
1       82154   rs4477212       A       G       0       PASS    NS=1    GT:GQ:DP        0/0:0:0 0/0:0:0 0/0:0:0 0/0:0:0 ./.:.:. 0/0:0:0 0/0:0:0 ./.:.:. ./.:.:. ./.:.:. 0/0...
1       734462  rs12564807      G       A       0       PASS    NS=1    GT:GQ:DP        ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. 1/1:0:0 ./.:.:. ./.:.:. ./....
1       752566  rs3094315       G       A       0       PASS    NS=1    GT:GQ:DP        0/1:0:0 1/1:0:0 0/1:0:0 1/1:0:0 0/1:0:0 0/1:0:0 1/1:0:0 ./.:.:. 1/1:0:0 1/1:0:0 1/1...
1       752721  rs3131972       A       G       0       PASS    NS=1    GT:GQ:DP        0/1:0:0 1/1:0:0 0/1:0:0 1/1:0:0 ./.:.:. 0/1:0:0 1/1:0:0 0/1:0:0 1/1:0:0 ./.:.:. 1/1...
1       760998  rs148828841     C       A       0       PASS    NS=1    GT:GQ:DP        ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. 0/1:0:0 ./.:.:. ./.:.:. ./....
.       .       .               .       .       .       .       .       .               .      .       .       .       .       .       .       .       .       .       .
.       .       .               .       .       .       .       .       .               .      .       .       .       .       .       .       .       .       .       .
.       .       .               .       .       .       .       .       .               .      .       .       .       .       .       .       .       .       .       .

The -S option for less keeps each long line on a single screen line (no wrapping), which makes the columns easier to see. Use the left and right arrow keys to scroll horizontally.
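
Each sample column in the output above holds colon-separated values in the order given by the FORMAT column (here GT:GQ:DP). As a rough illustration, the sketch below pairs each key with its value for one sample field. The function name decode_sample_field is hypothetical, and only awk is assumed to be available.

```shell
# Pair each FORMAT key with the matching value of one sample column.
# Arguments: the FORMAT string and one sample field, both colon-separated.
decode_sample_field() {
    awk -v keys="$1" -v vals="$2" 'BEGIN {
        n = split(keys, k, ":")
        split(vals, v, ":")
        for (i = 1; i <= n; i++) printf "%s=%s\n", k[i], v[i]
    }'
}

# FORMAT and sample field copied from the rs3094315 record shown above.
decode_sample_field "GT:GQ:DP" "0/1:0:0"
```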

List all the samples in the vcf:

bcftools is a great program that can be used to extract information from the compressed file.

$ bcftools query -l repgp.vcf.gz
pgp_hu019BBA_238
pgp_hu33F35D_119
pgp_hu85E6EC_692
pgp_huA27736_992
.
.
.
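
The sample names are the columns from the tenth onward on the #CHROM header line. The sketch below reads them with awk from uncompressed VCF text, which may be useful on a machine without bcftools. The function name list_samples is hypothetical, and the inline VCF is a made-up two-sample excerpt for illustration only.

```shell
# Print the sample names (columns 10 onward of the #CHROM line) from
# uncompressed VCF text on standard input.
list_samples() {
    awk -F '\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Tiny inline VCF, made up for illustration, using two of the sample names.
printf '##fileformat=VCFv4.0\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tpgp_hu019BBA_238\tpgp_hu33F35D_119\n1\t82154\trs4477212\tA\tG\t0\tPASS\tNS=1\tGT:GQ:DP\t0/0:0:0\t0/0:0:0\n' | list_samples

# On the real file (requires gzip): gzip -dc repgp.vcf.gz | list_samples
```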

Display the genomic data for a single sample:

The following example will display the genomic data for sample pgp_hu019BBA_238

$ bcftools view -s pgp_hu019BBA_238 repgp.vcf.gz
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20160712
##source=intersection_dbSNP_23andme
##dbSNP_BUILD_ID=141
##reference=GRCh37p13
##phasing=none
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=X>
##contig=<ID=MT>
##contig=<ID=Y>
##bcftools_mergeVersion=1.2-108-gd4d42ca+htslib-1.2.1-188-g6d2810c
##bcftools_mergeCommand=merge ./pgp_hu019BBA_238.clean.vcf.gz ./pgp_hu33F35D_119.clean.vcf.gz ./pgp_hu85E6EC_692.clean.vcf.gz ./pgp_huA27736_992.clean.vcf.gz ./pgp_huD87BFC_58.cle
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewVersion=1.2-108-gd4d42ca+htslib-1.2.1-188-g6d2810c
##bcftools_viewCommand=view -s pgp_hu019BBA_238 repgp.vcf.gz
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  pgp_hu019BBA_238
1       82154   rs4477212       A       G       0       PASS    NS=1;AC=0;AN=2  GT:GQ:DP        0/0:0:0
1       734462  rs12564807      G       A       0       PASS    NS=1;AC=0;AN=0  GT:GQ:DP        ./.:.:.
1       752566  rs3094315       G       A       0       PASS    NS=1;AC=1;AN=2  GT:GQ:DP        0/1:0:0
1       752721  rs3131972       A       G       0       PASS    NS=1;AC=1;AN=2  GT:GQ:DP        0/1:0:0
1       760998  rs148828841     C       A       0       PASS    NS=1;AC=0;AN=0  GT:GQ:DP        ./.:.:.
.
.
.
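
In the single-sample output above, the INFO column carries AC and AN values that follow from the one remaining genotype. The sketch below shows that arithmetic for a single genotype string, assuming a diploid, biallelic site; the function name ac_an is hypothetical.

```shell
# Compute AC and AN for one genotype string such as 0/1, 1/1 or ./.
# Assumes a biallelic site; also accepts the phased separator "|".
ac_an() {
    printf '%s\n' "$1" | awk -F '[/|]' '{
        ac = 0; an = 0
        for (i = 1; i <= NF; i++) {
            if ($i == ".") continue   # missing allele: not counted
            an++                      # called allele
            if ($i != "0") ac++       # non-reference allele
        }
        printf "AC=%d;AN=%d\n", ac, an
    }'
}

ac_an "0/0"   # compare with the rs4477212 record
ac_an "0/1"   # compare with the rs3094315 record
ac_an "./."   # compare with the rs12564807 record
```

A missing genotype (./.) contributes nothing to either count, which is why those records show AC=0;AN=0.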

List of all SNP ids contained in vcf:

$ bcftools query -f '%ID\n' repgp.vcf.gz
rs4477212
rs12564807
rs3094315
rs3131972
.
.
.
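
A list of IDs like this can be piped into grep to check whether one particular SNP is in the file. The following is a minimal sketch; the function name has_snp is hypothetical, and the four IDs fed to it are copied from the listing above in place of real bcftools output.

```shell
# Succeed if the ID given as $1 appears as a whole line on standard input.
has_snp() {
    grep -qxF -- "$1"
}

# IDs copied from the listing, standing in for real bcftools output.
if printf 'rs4477212\nrs12564807\nrs3094315\nrs3131972\n' | has_snp rs3094315; then
    echo "rs3094315 is present"
fi

# On the real file: bcftools query -f '%ID\n' repgp.vcf.gz | has_snp rs3094315
```

The -x option makes grep match whole lines only, so a query for rs309 does not match rs3094315. This requires that the format string leaves no space between %ID and \n.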

Viewing a region:

To display a specific region for one sample without the header (this example uses chr17:41,194,290-41,275,478):

$ bcftools view -r 17:41194290-41275478 -s pgp_hu019BBA_238 -H repgp.vcf.gz
17      41196363        rs8176320       C       T       0       PASS    NS=1;AC=0;AN=2  GT:GQ:DP        0/0:0:0
17      41196408        rs12516 G       A       0       PASS    NS=1;AC=1;AN=2  GT:GQ:DP        0/1:0:0
17      41196795        rs1060921       T       A       0       PASS    NS=1;AC=0;AN=0  GT:GQ:DP        ./.:.:.
17      41196801        rs1060920       T       C       0       PASS    NS=1;AC=0;AN=2  GT:GQ:DP        0/0:0:0
17      41196914        rs8176319       G       A       0       PASS    NS=1;AC=0;AN=0  GT:GQ:DP        ./.:.:.
17      41197274        rs8176318       C       A       0       PASS    NS=1;AC=1;AN=2  GT:GQ:DP        0/1:0:0
17      41197423        rs8176317       T       C       0       PASS    NS=1;AC=0;AN=2  GT:GQ:DP        0/0:0:0
17      41197659        rs3092995       G       C       0       PASS    NS=1;AC=0;AN=0  GT:GQ:DP        ./.:.:.
.       .               .               .       .       .       .       .               .               .
.       .               .               .       .       .       .       .               .               .
.       .               .               .       .       .       .       .               .               .

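-H option suppresses the header in the output.
-r option specifies the region. Write the coordinates without commas (commas will not work). Region queries on a compressed file need an index, which can be created with bcftools index.
-s option specifies the sample or samples to display.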
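
Regions copied from a genome browser often look like chr17:41,194,290-41,275,478. The sketch below converts such a string into the form used in the command above. The function name to_bcftools_region is hypothetical; the "chr" prefix is removed because the contig IDs in this file's header are plain names such as 17.

```shell
# Convert a browser-style region such as chr17:41,194,290-41,275,478
# into the form used with -r: no "chr" prefix and no commas.
to_bcftools_region() {
    printf '%s\n' "$1" | sed -e 's/^chr//' -e 's/,//g'
}

region=$(to_bcftools_region "chr17:41,194,290-41,275,478")
echo "$region"

# Then, for example:
# bcftools view -r "$region" -s pgp_hu019BBA_238 -H repgp.vcf.gz
```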