How to delete (remove) rows that are present in fewer than 10% of the columns in a text file?

Question 1

I'm new to Bash, so please bear with my (possibly silly) question. I have a text file like this (only a small part of it is shown here):

                       type test    test    test    test    test    test    test    test    test    test    test    test    control control control control control control control control control control control control control control control control
Actinomyces_odontolyticus   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.04306 0   0   0   0   0
Actinomyces_sp_HMSC035G02   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.00575 0   0   0   0   0
Actinomyces_sp_HPA0247  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.01802 0   0   0   0   0
Actinomyces_sp_ICM47    0   0   0   0   0   0   0   0   0.00244 0   0   0   0   0   0   0   0   0   0   0   0   0   0.00347 0   0   0   0   0
Actinomyces_sp_S6_Spd3  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.01421 0   0   0   0   0
Actinomyces_sp_oral_taxon_181   0   0   0.00045 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.01219 0   0   0   0   0
Aeriscardovia_aeriphila 0   0   0.00786 0.00471 0   0   0   0.00118 0.00645 0.00918 0.01208 0   0.00153 0   0   0   0   0.00923 0   0.01527 0   0.00719 0.00423 0.00177 0   0.00468 0.0047  0.01937
Alloscardovia_omnicolens    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Bifidobacterium_adolescentis    0.06235 0.05427 0.78772 0.11693 0.03352 0.17129 0.23957 0.25833 0.16216 0.18002 2.27324 0.23361 0.38109 0   0.59227 0   0.46423 1.06198 0.20985 0   0.26431 0.7178  0   0   0.04301 0.27795 0.06356 0.54188
Bifidobacterium_angulatum   0   0   0   0.02457 0   0.03637 0   0   0   0   0   0   0   0   0.03184 0   0   0   0   0   0   0   0   0   0.00368 0   0   0
Bifidobacterium_bifidum 0   0   0   0   0   0   0   0   0   0.08402 0   0   0   0   0.06594 0   0   0   0   0   0   0   0   0   0   0   0   0

I want to remove the rows (bacteria) that are not present in at least 10% of the columns (individuals). For example, if I have 70 individuals, I want to remove the bacteria that are present (i.e. have a non-zero value) in fewer than 7 individuals.

Can anyone help me with some Bash commands?

score 0 · Answer 1 · 21 August 2020 at 07:59

#!/bin/bash

s_DATA_FILE="remove_bacteria_sample_data.txt"

# Number of people = number of fields in the header row minus the "type" column
i_PEOPLE_NUMBER=$(head -n1 "${s_DATA_FILE}" | awk '{print NF-1}')
# Bash arithmetic is integer-only, so 10% of 28 people is truncated to 2
# I increased the example to 50% to get at least 2 results with your sample file
i_PERCENT=50
i_MAX_ZEROS=$((i_PERCENT*i_PEOPLE_NUMBER/100))
# Bacteria names: first field of every row except the header
s_BACTERIA_LIST=$(awk '!(NR==1) { print $1 }' "${s_DATA_FILE}")

echo "Found ${i_PEOPLE_NUMBER} People (test and control)"
echo "Max empty readings per Bacteria are ${i_PERCENT}%: ${i_MAX_ZEROS}"
echo

for s_BACTERIA in ${s_BACTERIA_LIST}
do
    # Match the name exactly against the first field, so that names sharing a prefix
    # (e.g. Actinomyces_sp_HPA0247 and Actinomyces_sp_HPA0247_2) are never confused,
    # whether the file is space- or tab-separated.
    # Count zeros only in fields 2..NF (field 1 is the name); n=0 makes awk print 0 rather than an empty string.
    i_COUNT_ZEROS=$(awk -v name="${s_BACTERIA}" '$1==name {n=0; for(i=2; i<=NF; i++) if ($i==0) n++; print n; exit}' "${s_DATA_FILE}")
    if [[ $i_COUNT_ZEROS -le $i_MAX_ZEROS ]]; then
        echo "* ${s_BACTERIA} meets the criteria with ${i_COUNT_ZEROS} people not being tested"
    else
        echo "- Not meeting the criteria ${s_BACTERIA} with ${i_COUNT_ZEROS} people not being tested"
    fi
done

Running it on your sample prints:

./remove_bacteria.sh 
Found 28 People (test and control)
Max empty readings per Bacteria are 50%: 14

- Not meeting the criteria Actinomyces_odontolyticus with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_HMSC035G02 with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_HPA0247 with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_ICM47 with 26 people not being tested
- Not meeting the criteria Actinomyces_sp_S6_Spd3 with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_oral_taxon_181 with 26 people not being tested
* Aeriscardovia_aeriphila meets the criteria with 13 people not being tested
- Not meeting the criteria Alloscardovia_omnicolens with 28 people not being tested
* Bifidobacterium_adolescentis meets the criteria with 5 people not being tested
- Not meeting the criteria Bifidobacterium_angulatum with 24 people not being tested
- Not meeting the criteria Bifidobacterium_bifidum with 26 people not being tested
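
The loop above starts a new process pipeline for every bacteria name, which can get slow on large tables. As a rough sketch only, the same kind of report could be produced in a single awk pass. The function name report_zero_counts and its compact output format (mark, name, zero count) are hypothetical and not part of the answer; the threshold uses the same truncating integer arithmetic as the script.

```shell
# Hypothetical helper (not from the answer above).
# Usage: report_zero_counts FILE MAX_ZERO_PERCENT
# Prints "<mark> <name> <zeros>" per data row, where mark is "*" when the
# row has at most MAX_ZERO_PERCENT% zero readings and "-" otherwise.
report_zero_counts() {
    awk -v pct="$2" '
        NR == 1 { next }                            # skip the header row
        {
            zeros = 0
            for (i = 2; i <= NF; i++)               # field 1 is the bacteria name
                if ($i == 0) zeros++
            max_zeros = int(pct * (NF - 1) / 100)   # truncate, like $(( ... )) in Bash
            mark = (zeros <= max_zeros) ? "*" : "-"
            print mark, $1, zeros
        }
    ' "$1"
}
```

It could be called as report_zero_counts remove_bacteria_sample_data.txt 50, assuming the same file name as in the script.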

score 0 · Answer 2 · 21 August 2020 at 07:59

You can do that with this awk command, where file is your initial file, and cleaned_file is the resulting file:

awk '{nzeros=0; for(col=2; col<=NF; col++) {if($col == 0) {nzeros++}} {if(nzeros < 0.9 * (NF - 1)) {print $0}}}' file > cleaned_file

Explanation:

nzeros=0: For each row, we reset a counter that stores the number of zeros in that row.
for(col=2; col<=NF; col++) {if($col == 0) {nzeros++}}: For every row, we loop from the second column (col=2; the first column is the bacteria name) to the last one (col<=NF; NF is the number of fields, that is, the total number of columns in the row). If a column's value is 0 (if($col == 0)), we increase nzeros by 1 (nzeros++).
if(nzeros < 0.9 * (NF - 1)) {print $0}: If the number of zeros is less than 90% (0.9) of the number of data columns (NF - 1, i.e. all columns except the first one), we print that row (print $0; $0 means the whole row in awk). The header row contains no zeros, so it is always printed.

The output for your sample is:

                       type test    test    test    test    test    test    test    test    test    test    test    test    control control control control control control control control control control control control control control control control
Aeriscardovia_aeriphila 0   0   0.00786 0.00471 0   0   0   0.00118 0.00645 0.00918 0.01208 0   0.00153 0   0   0   0   0.00923 0   0.01527 0   0.00719 0.00423 0.00177 0   0.00468 0.0047  0.01937
Bifidobacterium_adolescentis    0.06235 0.05427 0.78772 0.11693 0.03352 0.17129 0.23957 0.25833 0.16216 0.18002 2.27324 0.23361 0.38109 0   0.59227 0   0.46423 1.06198 0.20985 0   0.26431 0.7178  0   0   0.04301 0.27795 0.06356 0.54188
Bifidobacterium_angulatum   0   0   0   0.02457 0   0.03637 0   0   0   0   0   0   0   0   0.03184 0   0   0   0   0   0   0   0   0   0.00368 0   0   0
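
For readability, the one-liner can also be written as a small function with the threshold passed in as a parameter. This is a minimal sketch under a few assumptions: the name filter_rows is hypothetical, the first line of the file is treated as a header and always kept, and the threshold is expressed as the minimum percentage of columns in which a row must be non-zero. Unlike the one-liner, the comparison is inclusive (a row present in exactly 10% of the columns is kept) and uses integer arithmetic to sidestep floating-point rounding.

```shell
# Hypothetical helper (not from the answer above).
# Usage: filter_rows FILE MIN_PRESENT_PERCENT > cleaned_file
# Keeps the header plus every row whose value is non-zero in at least
# MIN_PRESENT_PERCENT% of the data columns.
filter_rows() {
    awk -v pct="$2" '
        NR == 1 { print; next }                 # always keep the header row
        {
            present = 0
            for (col = 2; col <= NF; col++)     # column 1 is the bacteria name
                if ($col != 0) present++
            # present / (NF - 1) >= pct / 100, rearranged to avoid division
            if (present * 100 >= pct * (NF - 1)) print
        }
    ' "$1"
}
```

Called as filter_rows file 10 > cleaned_file, this should keep the same three bacteria rows as shown above for the sample data, since each of them is non-zero in at least 3 of the 28 columns.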
