I have now compiled a FASTA file of the spike protein sequences of SARS-like viruses from GenBank: https://drive.google.com/uc?export=download&id=1r9TzeL6jaQsV6JChQL8r9-WG9-3Y4Wgw. TSV metadata: https://drive.google.com/uc?export=download&id=1QVurMpmQfbZa2KEe57YSrWHjfvnouwiM. My file includes sequences like RmYN02 and LYRa3 for which GenBank has no whole-genome sequence.
I basically just entered the accession number of Wuhan-Hu-1's spike protein (QHD43416) into protein BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins. I entered "SARS-CoV-2 (taxid:2697049)" in the organism field and clicked the "exclude" checkbox next to it. Then I clicked "Algorithm parameters" and set "Max target sequences" to 500. Then I clicked "BLAST" and selected "FASTA (complete sequence)" from the "Download" menu. Finally I removed the last entries from the file, which weren't SARS-like viruses, removed most sequences marked as synthetic constructs, and so on.
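The cleanup step can be partly scripted. Below is a minimal sketch, not what I actually ran (I edited the file by hand and only removed most of the synthetic constructs): a hypothetical helper `fa_drop` that drops every FASTA record whose header contains a given phrase. The accessions and sequences in the demo are made up.

```shell
# Hypothetical helper: print only the FASTA records whose header line does NOT
# contain a phrase (case-insensitive; default phrase: "synthetic construct").
fa_drop() {  # usage: fa_drop FILE [PHRASE]
  awk -v pat="${2:-synthetic construct}" '
    /^>/ { keep = (index(tolower($0), tolower(pat)) == 0) }  # decide once per record
    keep                                                      # print header and sequence lines of kept records
  ' "$1"
}

# Toy input standing in for the BLAST download (made-up accessions and sequences).
demo_fa=$(mktemp)
printf '%s\n' \
  '>AAA00001.1 spike glycoprotein [Bat coronavirus X]' 'MFVFLVLL' 'PLVSSQCV' \
  '>BBB00002.1 spike protein [synthetic construct]' 'MFVFLVLL' \
  '>CCC00003.1 surface glycoprotein [Pangolin coronavirus Y]' 'MFVFFVLL' > "$demo_fa"

kept=$(fa_drop "$demo_fa")
printf '%s\n' "$kept"
rm -f "$demo_fa"
```

The phrase is matched as a literal substring of the header, so it only catches records that BLAST labels that way; anything else still has to be checked by eye.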
I found that there's a new sequence called BtSY2 which was added to GenBank in January 2023: https://www.ncbi.nlm.nih.gov/nuccore/OP963576.1. It doesn't have a full genome at GenBank but only the full CDS. In my alignment of the spike proteins, the number of letter changes from Wuhan-Hu-1 was 20 in BANAL-52, 33 in RaTG13, and 35 in BtSY2, but after that there was a huge gap until Pangolin coronavirus GX_P2V which had 98 letter changes.
If you look at the region of the spike protein from 100 residues before PRRA to 100 residues after PRRA, it has only one amino acid change from Wuhan-Hu-1 in RaTG13 and BtSY2 and two changes in BANAL-52. Most current strains of Omicron have 4 amino acid changes in the same region even though their whole genome has an order of magnitude fewer nucleotide changes from Wuhan-Hu-1.
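The coordinates of that region follow from the position of the insert: in the Wuhan-Hu-1 spike, PRRA is residues 681 to 684, so the window is 581 to 784, and the insert itself sits at offsets 101 to 104 inside the window. Here is a small sketch of that arithmetic in plain awk; `flank_window` is a hypothetical helper name and the toy sequence is made up.

```shell
# Hypothetical helper: take `flank` residues on each side of an insert that
# occupies 1-based positions start..end of a sequence, leaving the insert out.
flank_window() {  # usage: flank_window SEQ START END FLANK
  awk -v s="$1" -v a="$2" -v b="$3" -v f="$4" 'BEGIN {
    left  = substr(s, a - f, f)   # residues a-f .. a-1
    right = substr(s, b + 1, f)   # residues b+1 .. b+f
    print left right
  }'
}

# Coordinates for the Wuhan-Hu-1 spike: PRRA is residues 681-684.
start=681; end=684; flank=100
echo "window: $((start - flank)):$((end + flank))"
echo "insert inside window: $((flank + 1)):$((flank + end - start + 1))"

# Toy check: a 3-residue flank around PRRA at positions 9-12 of a made-up sequence.
toy=$(flank_window AAAAACDEPRRAFGHAAAAA 9 12 3)
echo "$toy"
```

Leaving the insert out matters because the bat and pangolin sequences don't have it, so a window that included PRRA would count it as extra changes.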
You can download a FASTA file for a global subsample of about 3000 SARS-CoV-2 sequences from Nextstrain: https://docs.nextstrain.org/projects/ncov/en/latest/reference/remote_inputs.html. I used Nextclade CLI to generate the protein sequences for all of them: https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextclade-cli.html. I checked the same region of 100 residues before and 100 residues after the PRRA insert to see how many amino acid changes it has from Wuhan-Hu-1. When I ignored positions with an X letter, 1917 out of 2929 sequences had 4 changes, 365 had 1 change, 60 had 2 changes, 38 had zero changes, 25 had 5 changes, and so on (the counts are per distinct sequence of the region, so the same number of changes can appear on several rows, and many of the sequences were old samples from 2020 or 2021):
$ curl https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.xz|xz -dc>global.fa
[...]
$ brew install --cask miniconda;conda init ${SHELL##*/};conda install -c bioconda nextclade
[...]
$ nextclade dataset get --name sars-cov-2 --output-dir sars-cov-2
[...]
$ nextclade run --input-dataset sars-cov-2 --output-all cladeout global.fa
[...]
$ curl -Lso sars2.spike.fa 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&id=YP_009724390'
$ dif1x(){ awk 'NR==1{split($0,a,"");l=length;next}{split($0,b,"");n=0;for(i=1;i<=l;i++)if(a[i]!="X"&&b[i]!="X"&&a[i]!=b[i])n++;print n}' "$@";} # number of letter changes between first sequence and other sequences, ignore X letters
$ cola(){ sed -f <(awk '{print"s/"$1"/\033[38;2;0;0;0m\033[48;2;"$2";"$3";"$4"m"$1"\033[0m/g"}' <(printf %s\\n 'A 242 121 121' 'C 242 182 121' 'D 242 242 121' 'E 182 242 121' 'F 121 242 151' 'G 121 242 222' 'H 121 182 242' 'I 121 121 242' 'K 182 121 242' 'L 242 121 242' 'M 255 191 191' 'N 255 223 191' 'P 255 255 191' 'Q 223 255 191' 'R 191 255 207' 'S 191 255 244' 'T 191 223 255' 'V 191 191 255' 'W 223 191 255' 'Y 255 191 255' '- 140 140 140')) "$@";}
$ sub=$(seqkit subseq -r 581:784 cladeout/nextclade_gene_S.translation.fasta|seqkit mutate -d 101:104|seqkit seq -s);seqkit subseq -r 581:784 sars2.spike.fa|seqkit mutate -d 101:104|seqkit seq -s|dif1x - <(echo "$sub")|paste - <(cola<<<"$sub")|sort|uniq -c|sort -r|head
1917 4 TLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEYVNNSYECDIPIGAGICASYQTQTKSRSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLKRALTGIAVEQDKNTQEVFAQ
365 1 TLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSRSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQ
60 2 TLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSRSVASQSIIAYTMSLGAENSVAYSNNSIAIPINFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQ
38 0 TLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSRSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQ
25 5 TLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEYVNSSYECDIPIGAGICASYQTQTKSRSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLKRALTGIAVEQDKNTQEVFAQ
21 3 TLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEYVNNSYECDIPIGAGICASYQTQTKSXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEQDKNTQEVFAQ
19 2 TLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEYVNNSYECDIPIGAGICASYQTQTNSRSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQ
18 5 TLEILDITPCSFGGVSVITPGTNTSNQVAVLYHGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEYVNNSYECDIPIGAGICASYQTQTKSRSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLKRALTGIAVEQDKNTQEVFAQ
18 2 TLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSRSVASQSIIAYTMSLGVENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQ
17 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
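The dif1x one-liner above is dense, so here is a more readable sketch of the same counting logic with a toy tally. `count_changes` is a hypothetical name, the 8-letter sequences are made up, and it uses substr instead of splitting on an empty string, which some awk versions may not support.

```shell
# Readable variant of dif1x: the first input line is the reference; every later
# line is compared to it position by position, skipping X in either sequence.
count_changes() {
  awk '
    NR == 1 { ref = $0; len = length(ref); next }
    {
      n = 0
      for (i = 1; i <= len; i++) {
        r = substr(ref, i, 1); q = substr($0, i, 1)
        if (r != "X" && q != "X" && r != q) n++
      }
      print n
    }
  ' "$@"
}

# Toy data: reference first, then four samples. Tally as changes:count pairs.
tally=$(printf '%s\n' ACDEFGHI ACDEFGHI ACDKFGHI XXDKFGHI ACDKFGHL |
  count_changes | sort -n | uniq -c | awk '{print $2 ":" $1}' | paste -sd' ' -)
echo "$tally"
```

Because X positions are skipped, a sequence that is X over the whole window is reported as having 0 changes, which is what the last row of the output above is.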
The commands below show the region around the Y?Y?Y pattern in BtSY2 and other sequences. It's LDSKVGGNYNYLYRLFRKS in Wuhan-Hu-1, BtSY2, BANAL-52, and a bunch of supposed pangolin viruses, but RaTG13 actually has five amino acid changes in the same short region: IDAKEGGNFNYLYRLFRKA. I guess it's recombination again...
$ curl -Lso spikes.fa 'https://drive.google.com/uc?export=download&id=1r9TzeL6jaQsV6JChQL8r9-WG9-3Y4Wgw'
$ curl -Lso spikes.tsv 'https://drive.google.com/uc?export=download&id=1QVurMpmQfbZa2KEe57YSrWHjfvnouwiM'
$ targ=QHD43416.1;sub=$(seqkit subseq -r 500:520 spikes.fa);seqkit seq -ni spikes.fa|awk -F\\t 'NR==FNR{a[$1]=$12"|"$7"|"$7"|"$9"|"$10"|"$11;next}{print a[$0]}' spikes.tsv -|cut -c-150|paste <(dif1x <(seqkit grep -rp $targ<<<"$sub"|seqkit seq -s;seqkit seq -s<<<"$sub")) <(dif1x <(seqkit grep -rp $targ spikes.fa|seqkit seq -s;seqkit seq -s spikes.fa)) -|column -ts$'\t'|paste -d' ' <(seqkit seq -s<<<"$sub"|cola) -
[...]
QDTG-----YYFYRSHRSTKL 15 259 Sarbecovirus sp. YN2021|2021-09-22|2021-09-22|China: Yunnan|Wu,Z....Jin,Q.|A comprehensive survey of bat sarbecoviruses across China for the origin tr
QDKG-----YYFYRSHRSTKL 15 258 Sarbecovirus sp. HN2021G|2021-09-22|2021-09-22|China: Hunan|Wu,Z....Jin,Q.|A comprehensive survey of bat sarbecoviruses across China for the origin tr
IDAKEGGNFNYLYRLFRKANL 5 33 Bat coronavirus RaTG13|2020-01-29|2020-01-29|China|Zhu,Y....Shi,Z.L.|A pneumonia outbreak associated with a new coronavirus of probable bat origin
LDSKVGGNYNYLYRLFRKSNL 0 20 Bat coronavirus BANAL-20-52/Laos/2020|2021-09-19|2021-09-19|Laos|Temmam,S....Eloit,M.|Bat coronaviruses related to SARS-CoV-2 and infectious for human
LDSKVGGNYNYLYRLFRKSNL 0 35 Bat SARS-like virus BtSY2|2023-01-24|2023-01-24|China: Yunnan|Wang,J....Shi,M.|
LDSKVGGNYNYLYRLFRKSNL 0 0 Severe acute respiratory syndrome coronavirus 2 Wuhan-Hu-1|2020-01-12|2020-01-12|China|Wu,F....Zhang,Y.-Z.|A new coronavirus associated with human res
QDALTGGNYGYLYRLFRKSKL 6 98 Pangolin coronavirus GX_P2V|2022-02-01|2022-02-01||Lu,S....Song,L.|
QDALTGGNYGYLYRLFRKSKL 6 98 Pangolin coronavirus PCoV_GX-P2V|2020-04-01|2020-04-01|China|Cao,W....Pei,G.|
QDALTGGNYGYLYRLFRKSKL 6 98 Pangolin coronavirus PCoV_GX-P5L|2020-04-01|2020-04-01|China|Cao,W....Pei,G.|
QDALTGGNYGYLYRLFRKSKL 6 99 Pangolin coronavirus PCoV_GX-P4L|2020-04-01|2020-04-01|China|Cao,W....Pei,G.|
QDALTGDNYGYLYRLFRKSKL 7 99 Pangolin coronavirus PCoV_GX-P5E|2020-04-01|2020-04-01|China|Cao,W....Pei,G.|
QDALTGGN--YLYRLFRKSKL 7 100 Pangolin coronavirus PCoV_GX-P1E|2020-04-01|2020-04-01|China|Cao,W....Pei,G.|
LDSKVGGNYNYLYRLFRKSNL 0 120 Bat coronavirus BANAL-20-103/Laos/2020|2021-09-19|2021-09-19|Laos|Temmam,S....Eloit,M.|Bat coronaviruses related to SARS-CoV-2 and infectious for huma
LDSKVGGNYNYLYRLFRKSNL 0 120 Bat coronavirus BANAL-20-236/Laos/2020|2021-09-19|2021-09-19|Laos|Temmam,S....Eloit,M.|Bat coronaviruses related to SARS-CoV-2 and infectious for huma
LDSKVGGNYNYLYRLFRKSNL 0 127 Pangolin coronavirus cDNA8-S|2020-08-01|2020-08-01|China|Shen,Y....Chen,W.|Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins
LDSKVGGNYNYLYRLFRKSNL 0 128 Pangolin coronavirus cDNA16-S|2020-08-01|2020-08-01|China|Shen,Y....Chen,W.|Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins
LDSKVXGNYNYLYRLFRKSNL 0 128 MAG: Pangolin coronavirus GD/M5-9/2019|2023-03-30|2023-03-30|China|Cui,X....Cui,X.|
LDSKVGGNYNYLYRLFRKSNL 0 127 Pangolin coronavirus MP789|2020-05-18|2020-05-18|China|Chen,J....Liu,P.|Are pangolins the intermediate host of the 2019 novel coronavirus (SARS-CoV-2)
LDSKVGGNYNYLYRLFRKSNL 0 127 MAG: Pangolin coronavirus GD/P44-9/2019|2023-03-30|2023-03-30|China|Cui,X....Cui,X.|
LDSKVGGNYNYLYRLFRKSNL 0 127 Pangolin coronavirus cDNA18-S|2020-08-01|2020-08-01|China|Shen,Y....Chen,W.|Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins
LDQG-----QYYYRSHRKTKL 13 316 Sarbecovirus sp. FJ2021E|2021-09-22|2021-09-22|China: Fujian|Wu,Z....Jin,Q.|A comprehensive survey of bat sarbecoviruses across China for the origin t
LDQG-----QYYYRSHRKTKL 13 315 Sarbecovirus sp. FJ2021M|2021-09-22|2021-09-22|China: Fujian|Wu,Z....Jin,Q.|A comprehensive survey of bat sarbecoviruses across China for the origin t
[...]