Utilities for working with short tandem repeats (STRs)

Project description

str-analysis


This package contains scripts and utilities for analyzing short tandem repeats (STRs).

Installation

To install the scripts and utilities, run:

python3 -m pip install --upgrade str_analysis

Alternatively, you can use this docker image:

docker run -it weisburd/str-analysis:latest

call_non_ref_pathogenic_motifs

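This script takes a bam or cram file and determines which motifs are present at known pathogenic STR loci (such as RFC1, BEAN1, DAB1, etc.) where several motifs are known to segregate in the population. It can then optionally run ExpansionHunterDenovo, ExpansionHunter, and/or STRling and gather the relevant fields from their outputs, which users can then compare or use in downstream analyses. It can also run REViewer to generate read visualization images based on the ExpansionHunter output. Finally, it generates a json file for each locus that contains all the collected information, as well as a "call" field indicating whether pathogenic motifs were detected.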

Example command lines:

# basic command 
call_non_ref_pathogenic_motifs -R hg38.fasta -g 38 sample1.cram --locus RFC1

# run ExpansionHunter and REViewer on all 9 loci with known non-ref pathogenic motifs
call_non_ref_pathogenic_motifs -R hg38.fasta --run-expansion-hunter --run-reviewer -g 38 sample1.cram --all-loci

# for 2 specific loci, run ExpansionHunter and REViewer + provide an existing ExpansionHunterDenovo profile
call_non_ref_pathogenic_motifs -R hg38.fasta --run-expansion-hunter --run-reviewer --ehdn-profile sample1.str_profile.json -g 38 sample1.cram --locus RFC1 --locus BEAN1
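When the script is driven from a pipeline, the commands above can be assembled programmatically. The following is a minimal sketch, not part of the package: `build_command` is a hypothetical helper, and it only uses flags that appear in the examples and help text on this page.

```python
import shlex


def build_command(cram_path, reference_fasta, genome_version, loci=None,
                  run_expansion_hunter=False, run_reviewer=False):
    """Assemble an argument list for call_non_ref_pathogenic_motifs.

    Hypothetical helper: it only mirrors the flags shown in the examples.
    """
    if run_reviewer and not run_expansion_hunter:
        # The help text states that --run-reviewer needs --run-expansion-hunter.
        raise ValueError("--run-reviewer requires --run-expansion-hunter")

    cmd = ["call_non_ref_pathogenic_motifs",
           "-R", reference_fasta, "-g", str(genome_version)]
    if run_expansion_hunter:
        cmd.append("--run-expansion-hunter")
    if run_reviewer:
        cmd.append("--run-reviewer")
    cmd.append(cram_path)

    if loci:
        # --locus may be given more than once, one flag per locus.
        for locus in loci:
            cmd += ["--locus", locus]
    else:
        cmd.append("--all-loci")
    return cmd


cmd = build_command("sample1.cram", "hg38.fasta", 38, loci=["RFC1", "BEAN1"],
                    run_expansion_hunter=True, run_reviewer=True)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the tool is installed and on the PATH.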

Command line args:

positional arguments:
  bam_or_cram_path      bam or cram path.

optional arguments:
  -h, --help            show this help message and exit
  -g {GRCh37,hg19,hg37,37,GRCh38,hg38,38}, --genome-version {GRCh37,hg19,hg37,37,GRCh38,hg38,38}
                        Reference genome version
  -r REFERENCE_FASTA, --reference-fasta REFERENCE_FASTA
                        Reference fasta path.
  -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
                        Output filename prefix.
  -s SAMPLE_ID, --sample-id SAMPLE_ID
                        The sample id to put in the output json file. If not
                        specified, it will be retrieved from the bam/cram
                        header or filename prefix.
  --strling-genotype-table STRLING_GENOTYPE_TABLE
                        Optionally provide an existing STRling output file for
                        this sample. If specified, the script will skip
                        running STRling.
  --run-strling         Optionally run STRling and copy information relevant
                        to the locus from the STRling results to the json
                        output file.
  --strling-path STRLING_PATH
                        The path of the STRling executable to use.
  --strling-reference-index STRLING_REFERENCE_INDEX
                        Optionally provide the path of a pre-computed STRling
                        reference index file. If provided, it will save a step
                        and allow STRling to complete faster.
  --expansion-hunter-denovo-profile EXPANSION_HUNTER_DENOVO_PROFILE
                        Optionally copy information relevant to the locus from
                        this ExpansionHunterDenovo profile to the output json.
                        This is instead of --run-expansion-hunter-denovo.
  --run-expansion-hunter-denovo
                        Optionally run ExpansionHunterDenovo and copy
                        information relevant to the locus from
                        ExpansionHunterDenovo results to the output json.
  --expansion-hunter-denovo-path EXPANSION_HUNTER_DENOVO_PATH
                        The path of the ExpansionHunterDenovo executable to
                        use if --run-expansion-hunter-denovo is specified.
  --run-expansion-hunter
                        If this option is specified, this script will run
                        ExpansionHunter once for each of the motif(s) it
                        detects at the locus. ExpansionHunter doesn't
                        currently support genotyping multiallelic repeats such
                        as RFC1 where an individual may have 2 alleles with
                        motifs that differ from each other (and from the
                        reference motif). Running ExpansionHunter separately
                        for each motif provides a work-around.
  --expansion-hunter-path EXPANSION_HUNTER_PATH
                        The path of the ExpansionHunter executable to use if
                        --run-expansion-hunter is specified. This must be
                        ExpansionHunter version 3 or greater.
  --use-offtarget-regions
                        Optionally use off-target regions when counting reads
                        that support a motif, and when running
                        ExpansionHunter.
  --run-reviewer        Run the REViewer tool to visualize ExpansionHunter
                        output. --run-expansion-hunter must also be specified.
  --run-reviewer-for-pathogenic-calls
                        Run the REViewer tool to visualize ExpansionHunter
                        output only when this script calls a sample as having
                        PATHOGENIC MOTIF / PATHOGENIC MOTIF. --run-expansion-
                        hunter must also be specified.
  --all-loci            Generate calls for all these loci: RFC1, BEAN1, DAB1,
                        MARCHF6, RAPGEF2, SAMD12, STARD7, TNRC6A, YEATS2
  -l {RFC1,BEAN1,DAB1,MARCHF6,RAPGEF2,SAMD12,STARD7,TNRC6A,YEATS2}, --locus {RFC1,BEAN1,DAB1,MARCHF6,RAPGEF2,SAMD12,STARD7,TNRC6A,YEATS2}
                        Generate calls for this specific locus. This argument
                        can be specified more than once to call multiple loci.
  -v, --verbose         Print detailed log messages.

Command line args:

positional arguments:
  bam_or_cram_path      bam or cram path

optional arguments:
  -h, --help            show this help message and exit
  -g {GRCh37,hg19,hg37,37,GRCh38,hg38,38}, --genome-version {GRCh37,hg19,hg37,37,GRCh38,hg38,38}
  -R REFERENCE_FASTA, --reference-fasta REFERENCE_FASTA
                        Reference fasta path. The reference fasta is sometimes
                        necessary for decoding cram files.
  -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
                        Output filename prefix
  -s SAMPLE_ID, --sample-id SAMPLE_ID
                        The sample id to put in the output json file. If not
                        specified, it will be retrieved from the bam/cram
                        header or filename prefix.
  --run-expansion-hunter-denovo
                        Optionally run ExpansionHunterDenovo and copy
                        information relevant to the locus from
                        ExpansionHunterDenovo results to the output json.
  --expansion-hunter-denovo-path EXPANSION_HUNTER_DENOVO_PATH
                        The path of the ExpansionHunterDenovo executable to
                        use if --run-expansion-hunter-denovo is specified.
  --expansion-hunter-denovo-profile EXPANSION_HUNTER_DENOVO_PROFILE
                        Optionally copy information relevant to the locus from
                        this ExpansionHunterDenovo profile to the output json.
                        This is instead of --run-expansion-hunter-denovo
  -r, --run-expansion-hunter
                        If this option is specified, this script will run
                        ExpansionHunter once for each of the motif(s) it
                        detects at the locus. ExpansionHunter doesn't
                        currently support genotyping multiallelic repeats such
                        as RFC1 where an individual may have 2 alleles with
                        motifs that differ from each other (and from the
                        reference motif). Running ExpansionHunter separately
                        for each motif provides a work-around.
  --expansion-hunter-path EXPANSION_HUNTER_PATH
                        The path of the ExpansionHunter executable to use if
                        -r is specified. This must be ExpansionHunter version
                        3 or greater.  
  --use-offtarget-regions  
                        Optionally use off-target regions when counting reads 
                        that support a motif, and when running ExpansionHunter.
  --all-loci            Generate calls for all these loci: RFC1, BEAN1, DAB1,
                        MARCHF6, RAPGEF2, SAMD12, STARD7, TNRC6A, YEATS2
  -l {RFC1,BEAN1,DAB1,MARCHF6,RAPGEF2,SAMD12,STARD7,TNRC6A,YEATS2}, --locus {RFC1,BEAN1,DAB1,MARCHF6,RAPGEF2,SAMD12,STARD7,TNRC6A,YEATS2}
                        Generate calls for this specific locus. This argument
                        can be specified more than once to call multiple loci.
  --run-reviewer        Run the REViewer tool to visualize ExpansionHunter
                        output. --run-expansion-hunter must also be specified.
  --run-reviewer-for-pathogenic-calls
                        Run the REViewer tool to visualize ExpansionHunter
                        output only when this script calls a sample as having
                        PATHOGENIC MOTIF / PATHOGENIC MOTIF. --run-expansion-
                        hunter must also be specified.
  -v, --verbose         Print detailed log messages
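The help text says that --locus can be specified more than once and that --all-loci selects all nine loci. The sketch below shows one way such options can be declared with argparse. It is an illustration only and an assumption about the approach, not the package's actual parser; `parse_loci` is a hypothetical name.

```python
import argparse

# Locus names as listed in the help text above.
LOCI = ["RFC1", "BEAN1", "DAB1", "MARCHF6", "RAPGEF2",
        "SAMD12", "STARD7", "TNRC6A", "YEATS2"]


def parse_loci(argv):
    """Return the list of loci selected by --all-loci or repeated --locus flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--all-loci", action="store_true")
    # action="append" collects every occurrence of -l/--locus into a list;
    # choices= rejects locus names that are not in LOCI.
    parser.add_argument("-l", "--locus", action="append", choices=LOCI)
    args = parser.parse_args(argv)
    if args.all_loci:
        return list(LOCI)
    return args.locus or []


print(parse_loci(["--locus", "RFC1", "--locus", "BEAN1"]))
```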

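Summary of the *_motifs.json output file:

The call_non_ref_pathogenic_motifs script outputs a .json file containing many fields that summarize what it found, as well as the ExpansionHunter, REViewer, ExpansionHunterDenovo, and/or STRling outputs when the --run-expansion-hunter, --run-reviewer, --run-expansion-hunter-denovo, and/or --run-strling args are used.

The key fields are: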

call: (e.g. BENIGN MOTIF / PATHOGENIC MOTIF)
motif1_repeat_unit: (e.g. AAAGG)
motif2_repeat_unit: (e.g. AAGGG)
expansion_hunter_call_repeat_unit: (e.g. AAAGG / AAGGG) The same information as above, but in genotype form.
expansion_hunter_call_genotype: (e.g. 15/68) The number of repeats of the motif(s) above.
expansion_hunter_call_CI: (e.g. 15-15/55-87) The confidence intervals for the genotype.
expansion_hunter_call_reviewer_svg: (e.g. sample1.RFC1_AAGGG.expansion_hunter_reviewer.svg) The path of the REViewer read visualization image.
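As a rough illustration of how these key fields might be read, the sketch below parses a record and prints them. The record is hand-written from the example values on this page, not real tool output, and the field names use the expansion_hunter_* prefix. Check the exact spelling and capitalization against an actual *_motifs.json file before relying on it.

```python
import json

KEY_FIELDS = [
    "call",
    "motif1_repeat_unit",
    "motif2_repeat_unit",
    "expansion_hunter_call_repeat_unit",
    "expansion_hunter_call_genotype",
    "expansion_hunter_call_CI",
    "expansion_hunter_call_reviewer_svg",
]


def summarize(record):
    """Pick out the key fields; fields missing from the record map to None."""
    return {name: record.get(name) for name in KEY_FIELDS}


# Illustrative record only. For a real run, load the file the script wrote:
#     with open(path_to_motifs_json) as f: record = json.load(f)
example_json = """
{
  "call": "BENIGN MOTIF / PATHOGENIC MOTIF",
  "motif1_repeat_unit": "AAAGG",
  "motif2_repeat_unit": "AAGGG",
  "expansion_hunter_call_repeat_unit": "AAAGG / AAGGG",
  "expansion_hunter_call_genotype": "15/68",
  "expansion_hunter_call_CI": "15-15/55-87"
}
"""
record = json.loads(example_json)
for name, value in summarize(record).items():
    print(f"{name}: {value}")
```

Fields that are only written when an optional tool runs (for example the REViewer image path) come back as None here instead of raising a KeyError.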

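Output fields:

sample_id: If this value is not specified as a command line arg, it is parsed from the input bam/cram file header or filename prefix.
call: Describes the motifs detected at the RFC1/CANVAS locus. Its format is analogous to a VCF genotype. Possible values are:

  • PATHOGENIC MOTIF / PATHOGENIC MOTIF: only pathogenic motif(s) detected
  • BENIGN MOTIF / BENIGN MOTIF: only benign motif(s) detected
  • MOTIF OF UNCERTAIN SIGNIFICANCE / MOTIF OF UNCERTAIN SIGNIFICANCE: non-canonical motif(s) of unknown pathogenicity detected
  • BENIGN MOTIF / PATHOGENIC MOTIF: heterozygous for a benign motif and a pathogenic motif, implying carrier status
  • PATHOGENIC MOTIF / MOTIF OF UNCERTAIN SIGNIFICANCE: heterozygous for a pathogenic motif and a non-canonical motif of unknown pathogenicity
  • BENIGN MOTIF / MOTIF OF UNCERTAIN SIGNIFICANCE: heterozygous for a benign motif and a non-canonical motif of unknown pathogenicity
  • NO CALL: not enough evidence in the read data to support any of the above options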
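Because the call value is formatted like a genotype, downstream code can split it on the slash. A minimal sketch follows; the helper names `parse_call` and `count_pathogenic` are hypothetical and not part of the package.

```python
def parse_call(call):
    """Split a call string into its two allele labels, or None for NO CALL."""
    if call.strip() == "NO CALL":
        return None
    left, right = (part.strip() for part in call.split("/"))
    return left, right


def count_pathogenic(call):
    """Number of alleles labeled PATHOGENIC MOTIF (0, 1 or 2); None for NO CALL."""
    alleles = parse_call(call)
    if alleles is None:
        return None
    return sum(1 for label in alleles if label == "PATHOGENIC MOTIF")


print(parse_call("BENIGN MOTIF / PATHOGENIC MOTIF"))
print(count_pathogenic("PATHOGENIC MOTIF / PATHOGENIC MOTIF"))
```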

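motif1_repeat_unit: The repeat unit supported by the most reads.
motif1_read_count: The number of reads that support motif1.
motif1_normalized_read_count: The same as motif1_read_count, but normalized by the depth of coverage in the regions flanking the RFC1 locus.
motif1_n_occurrences: The total number of times motif1 occurs in the reads at the RFC1 locus.
motif1_read_count_with_offtargets: The number of reads that support motif1 within the off-target regions for this repeat unit. These are regions of approximately 1kb to which fully-repetitive (aka IRR) reads may mis-map, based on experiments with simulated data.
motif1_normalized_read_count_with_offtargets: The same as motif1_read_count_with_offtargets, but normalized by the depth of coverage in the regions flanking the RFC1 locus.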

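motif2_repeat_unit: If more than one motif is detected, this field contains the repeat unit with the next-most read support.
motif2_read_count: See the motif1_read_count description.
motif2_n_occurrences: See the motif1_n_occurrences description.
...
NOTE: the motif2_* fields are generated only when reads support more than one motif.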

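left_flank_coverage: The average read depth within a 2kb window immediately to the left of the RFC1 locus.
right_flank_coverage: The average read depth within a 2kb window immediately to the right of the RFC1 locus.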

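found_n_reads_overlap_rfc1_locus: The number of reads that overlap the AAAAG repeat in the reference genome at the RFC1 locus and have MAPQ > 2.
found_repeats_in_n_reads: The number of those reads that have some 5bp or 6bp repeat motif covering > 70% of the overlapping read sequence (including any soft-clipped bases).
found_repeats_in_fraction_of_reads: found_repeats_in_n_reads / found_n_reads_overlap_rfc1_locus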
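The fraction above is a simple ratio of two counts. The sketch below recomputes it from a record. Returning None when no reads overlap the locus is an assumption made for this illustration, since the page does not say what the script writes in that case.

```python
def found_repeats_fraction(found_repeats_in_n_reads, found_n_reads_overlap_rfc1_locus):
    """found_repeats_in_n_reads / found_n_reads_overlap_rfc1_locus.

    Assumption: return None (instead of dividing by zero) when no reads
    overlap the locus.
    """
    if found_n_reads_overlap_rfc1_locus == 0:
        return None
    return found_repeats_in_n_reads / found_n_reads_overlap_rfc1_locus


# Illustrative counts, not real data.
print(found_repeats_fraction(30, 40))
```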

NOTE: the expansion_hunter_* fields below are added when --run-expansion-hunter is used.

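expansion_hunter_motif1_json_output_file: The path of the ExpansionHunter output json file when run on motif1.
expansion_hunter_motif1_repeat_unit: The repeat unit passed to ExpansionHunter.
expansion_hunter_motif1_short_allele_genotype: The genotype (number of repeats) that ExpansionHunter output for the short allele.
expansion_hunter_motif1_long_allele_genotype: The genotype (number of repeats) that ExpansionHunter output for the long allele.
expansion_hunter_motif1_short_allele_CI_start: The lower bound of the genotype confidence interval that ExpansionHunter output for the short allele.
expansion_hunter_motif1_short_allele_CI_end: The upper bound of the genotype confidence interval that ExpansionHunter output for the short allele.
expansion_hunter_motif1_long_allele_CI_start: The lower bound of the genotype confidence interval that ExpansionHunter output for the long allele.
expansion_hunter_motif1_long_allele_CI_end: The upper bound of the genotype confidence interval that ExpansionHunter output for the long allele.
expansion_hunter_motif1_total_spanning_reads: The total number of spanning reads that ExpansionHunter reported as supporting the motif1 genotype.
expansion_hunter_motif1_total_flanking_reads: The total number of flanking reads that ExpansionHunter reported as supporting the motif1 genotype.
expansion_hunter_motif1_total_inrepeat_reads: The total number of IRR reads that ExpansionHunter reported as supporting the motif1 genotype.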
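The per-allele fields can be combined into the compact "short/long" notation used in the examples on this page (for instance 15/73 and 15-15/55-87). The sketch below is an illustration only: `format_genotype` and `format_ci` are hypothetical helpers, the record values are made up from the examples, and the field-name spelling should be checked against real output.

```python
def format_genotype(record, motif="motif1"):
    """Format the short and long allele repeat counts as 'short/long'."""
    prefix = f"expansion_hunter_{motif}_"
    short = record[prefix + "short_allele_genotype"]
    long_ = record[prefix + "long_allele_genotype"]
    return f"{short}/{long_}"


def format_ci(record, motif="motif1"):
    """Format the confidence intervals as 'start-end/start-end'."""
    prefix = f"expansion_hunter_{motif}_"
    short_ci = f"{record[prefix + 'short_allele_CI_start']}-{record[prefix + 'short_allele_CI_end']}"
    long_ci = f"{record[prefix + 'long_allele_CI_start']}-{record[prefix + 'long_allele_CI_end']}"
    return f"{short_ci}/{long_ci}"


# Illustrative values only.
example_record = {
    "expansion_hunter_motif1_short_allele_genotype": 15,
    "expansion_hunter_motif1_long_allele_genotype": 73,
    "expansion_hunter_motif1_short_allele_CI_start": 15,
    "expansion_hunter_motif1_short_allele_CI_end": 15,
    "expansion_hunter_motif1_long_allele_CI_start": 55,
    "expansion_hunter_motif1_long_allele_CI_end": 87,
}
print(format_genotype(example_record))
print(format_ci(example_record))
```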

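expansion_hunter_motif2_json_output_file: If a second motif is detected at the locus, this is the path of the ExpansionHunter output json file when run on the second motif. All the fields listed above for motif1 are also present for motif2.
...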

expansion_hunter_call_repeat_unit: The repeat unit(s) used to run ExpansionHunter. If more than one was detected, the format is AAAAG / AAGGG.
expansion_hunter_call_genotype: The combined genotype based on the results of the ExpansionHunter run(s).
expansion_hunter_call_CI: The confidence intervals for the combined genotype.

NOTE: the *_reviewer_svg fields below are added only when --run-reviewer is used.

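expansion_hunter_motif1_reviewer_svg: The path of the .svg image file that REViewer generated for this motif.
expansion_hunter_motif2_reviewer_svg: If a second motif is detected at the locus, this is the path of the .svg image file that REViewer generated for that other motif.
expansion_hunter_call_reviewer_svg: The final .svg image file. If more than one motif is detected, this contains a merged image with 1 short-allele panel and 1 long-allele panel selected from the motif1 and motif2 images above.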

NOTE: the ehdn_* fields below are added only when --run-expansion-hunter-denovo is used.

ehdn_motif1_repeat_unit
ehdn_motif1_anchored_irr_count
ehdn_motif1_paired_irr_count
ehdn_motif1_total_irr_count
ehdn_sample_read_depth
ehdn_motif1_n_anchored_regions
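Since each optional tool contributes fields under its own prefix, one way to inspect a record is to group its fields by prefix. This is a sketch only: `fields_with_prefix` is a hypothetical helper and the numbers in the record are made up for illustration.

```python
def fields_with_prefix(record, prefix):
    """Return the subset of a record whose field names start with the prefix."""
    return {name: value for name, value in record.items() if name.startswith(prefix)}


# Illustrative record; the values are invented for this example.
example_record = {
    "sample_id": "sample1",
    "motif1_repeat_unit": "AAGGG",
    "ehdn_motif1_repeat_unit": "AAGGG",
    "ehdn_motif1_anchored_irr_count": 12,
    "ehdn_motif1_paired_irr_count": 3,
    "ehdn_sample_read_depth": 31.5,
}

ehdn_fields = fields_with_prefix(example_record, "ehdn_")
if not ehdn_fields:
    print("No ehdn_* fields: ExpansionHunterDenovo results were not included.")
for name, value in sorted(ehdn_fields.items()):
    print(f"{name}: {value}")
```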

Example output:

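Say the script detects that sample1 contains two sets of reads at the RFC1 locus: some with the AAGGG motif and some with AAAAG. The script then runs ExpansionHunter for the AAGGG motif, and then runs it again for the AAAAG motif. Say ExpansionHunter outputs 15/73 as the genotype for AAGGG and 15/22 as the genotype for AAAAG. The script will then output: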

call: (BENIGN MOTIF / PATHOGENIC MOTIF)
expansion_hunter_call_repeat_unit: (AAAAG / AAGGG)
expansion_hunter_call_genotype: (15/73) Selected from the two genotypes above.
expansion_hunter_call_CI: (15-15/55-87) Selected from the two sets of confidence intervals.
expansion_hunter_call_reviewer_svg: (sample1.RFC1_AAGGG.expansion_hunter_reviewer.svg)
