ASCII source file checker

Question 1

For the official Ubuntu documentation, where the English source files are in DocBook XML, there is a requirement to use ASCII characters only. We use a command-line "checker" (see here).

grep --color='auto' -P -n "[\x80-\xFF]" *.xml

However, the command has a flaw, apparently not on all computers: it misses some lines containing non-ASCII characters, which can lead to a false OK result.

Does anyone have a better suggestion for a command-line ASCII checker?

Anyone interested could use this file (a text file, not a DocBook XML file) as a test case. The first three lines with non-ASCII characters are lines 9, 14 and 18. Lines 14 and 18 were missed by the check:

$ grep --color='auto' -P -n "[\x80-\xFF]" install.en.txt | head -13
9:Appendix F, GNU General Public License.
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
394:1.1.1. Sponsorship by Canonical
402:1.2. What is Debian?
456:1.2.1. Ubuntu and Debian
461:1.2.1.1. Package selection
475:1.2.1.2. Releases
501:1.2.1.3. Development community
520:1.2.1.4. Freedom and Philosophy
534:1.2.1.5. Ubuntu and other Debian derivatives
555:1.3. What is GNU/Linux?

Question 2

If you want to look for non-ASCII characters, you should perhaps invert the search so that it excludes ASCII characters:

grep -Pn '[^\x00-\x7F]'

For example:

$ curl https://help.ubuntu.com/16.04/installation-guide/amd64/install.en.txt -s | grep -nP '[^\x00-\x7F]' | head
9:Appendix F, GNU General Public License.
14:(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
18:━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
368:  • Ubuntu will always be free of charge, and there is no extra fee for the "
372:  • Ubuntu includes the very best in translations and accessibility
376:  • Ubuntu is shipped in stable and regular release cycles; a new release will
380:  • Ubuntu is entirely committed to the principles of open source software

Lines 9, 330, 337 and 359 contain Unicode no-break spaces.
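
The same inverted check can also be done without relying on grep's locale handling. The following is a minimal Python sketch, not part of the command above: it reads raw bytes and reports every line that contains a byte outside 0x00-0x7F. The function name `non_ascii_lines` and the sample data are made up for illustration.

```python
import re

# Any byte outside the ASCII range 0x00-0x7F.
NON_ASCII = re.compile(rb"[^\x00-\x7F]")


def non_ascii_lines(data: bytes):
    """Return (line_number, line) pairs for lines containing non-ASCII bytes."""
    hits = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        if NON_ASCII.search(line):
            hits.append((number, line))
    return hits


if __name__ == "__main__":
    # Hypothetical sample: line 2 contains a UTF-8 encoded no-break space.
    sample = "plain line\nAppendix\u00a0F\nanother plain line\n".encode("utf-8")
    for number, line in non_ascii_lines(sample):
        print(f"{number}:{line.decode('utf-8', errors='replace')}")
```

Because the pattern is applied to bytes rather than decoded text, the result does not depend on the locale of the shell it is run from.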

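The particular output you are getting is possibly due to grep's UTF-8 support. For a Unicode locale, some of those characters might compare equal to a normal ASCII character. (A likely mechanism: in a UTF-8 locale the pattern is applied to decoded characters, so \x80-\xFF denotes the code points U+0080 to U+00FF rather than raw byte values; a no-break space, U+00A0, falls in that range, while ‘ and ━ do not.) Forcing the C locale will show the expected results in this case: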

$ LANG=C grep -Pn '[\x80-\xFF]' install.en.txt| head
9:Appendix F, GNU General Public License.
14:(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
18:━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
368:  • Ubuntu will always be free of charge, and there is no extra fee for the "
372:  • Ubuntu includes the very best in translations and accessibility
376:  • Ubuntu is shipped in stable and regular release cycles; a new release will
380:  • Ubuntu is entirely committed to the principles of open source software

$ LANG=en_GB.UTF-8 grep -Pn '[\x80-\xFF]' install.en.txt| head
9:Appendix F, GNU General Public License.
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
394:1.1.1. Sponsorship by Canonical
402:1.2. What is Debian?
456:1.2.1. Ubuntu and Debian
461:1.2.1.1. Package selection
475:1.2.1.2. Releases
501:1.2.1.3. Development community
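
The difference between the two runs can be reproduced outside grep. The sketch below is an analogy in Python, not a description of grep's internals: matching `[\x80-\xff]` against decoded text compares code points, so only characters in U+0080 to U+00FF match, whereas matching against the raw UTF-8 bytes catches every non-ASCII character. The helper name `matches` and the sample strings are illustrative.

```python
import re

TEXT_PATTERN = re.compile(r"[\x80-\xff]")   # applied to str: code points U+0080..U+00FF
BYTE_PATTERN = re.compile(rb"[\x80-\xff]")  # applied to bytes: byte values 0x80..0xFF


def matches(text: str):
    """Return (matched_as_text, matched_as_utf8_bytes) for one string."""
    as_text = TEXT_PATTERN.search(text) is not None
    as_bytes = BYTE_PATTERN.search(text.encode("utf-8")) is not None
    return as_text, as_bytes


if __name__ == "__main__":
    samples = {
        "no-break space": "Appendix\u00a0F",        # U+00A0
        "curly quotes": "\u2018Xenial Xerus\u2019",  # U+2018, U+2019
        "box drawing": "\u2501\u2501\u2501",         # U+2501
    }
    for name, text in samples.items():
        print(name, matches(text))
```

Only the no-break space matches in both modes; the curly quotes and the box-drawing characters match only as bytes, which mirrors lines 14 and 18 being reported under LANG=C but not under a UTF-8 locale.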

Question 3

Question 4

You can print all non-ASCII lines of a file using my Python 3 script, which I host on GitHub here:

GitHub: ByteCommander/encoding-check

You can either clone or download the whole repository, or simply save the file encoding-check and make it executable using chmod +x encoding-check.

Then you can run it like this, with the file to check as the only argument:

./encoding-check FILENAME if it is located in your current working directory, or...
/path/to/encoding-check FILENAME if it is located in /path/to/, or...
encoding-check FILENAME if it is located in a directory that is part of the $PATH environment variable, e.g. /usr/local/bin or ~/bin.

Without any additional arguments, it prints every line in which it found non-ASCII characters, together with its line number. At the end there is a summary line telling you how many lines the file had in total and how many of them contained non-ASCII characters.

This method is guaranteed to decode all ASCII characters properly and to detect everything that is definitely not ASCII.

Here is an example run on a file containing the first 20 lines of your install.en.txt:

$ ./encoding-check install-first20.en.txt
     9: Appendix��F, GNU General Public License.
    14: (codename "���Xenial Xerus���"), for the 64-bit PC ("amd64") architecture. It also
    18: ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
--------------------------------------------------------------------------------
20 lines in 'install-first20.en.txt', thereof 3 lines with non-ASCII characters.

The script also has some additional arguments for fine-tuning the encoding to check and the output format. Have a look at the help and try them out:

$ encoding-check -h
usage: encoding-check [-h] [-e ENCODING] [-s | -c | -l] [-m] [-w] [-n] [-f N]
                     [-t]
                     FILE [FILE ...]

Show all lines of a FILE containing characters that don't match the selected
ENCODING.

positional arguments:
  FILE                  the file to be examined

optional arguments:
  -h, --help            show this help message and exit
  -e ENCODING, --encoding ENCODING
                        file encoding to test (default 'ascii')
  -s, --summary         only print the summary
  -c, --count           only print the detected line count
  -l, --lines           only print the detected lines
  -m, --only-matching   hide files without matching lines from output
  -w, --no-warnings     hide warnings from output
  -n, --no-numbers      do not show line numbers in output
  -f N, --fit-width N   trim lines to N characters, or terminal width if N=0;
                        non-printable characters like tabs will be removed
  -t, --title           print title line above each file

Any codec that Python 3 knows is valid as --encoding. Just try one; in the worst case you get a short error message...
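
The general idea of a decode-based check with a selectable codec can be sketched in a few lines. This is an illustration of the approach only, not the code of encoding-check itself; the function name `undecodable_lines` is hypothetical, and splitting the raw bytes into lines assumes an ASCII-compatible encoding (it would not be appropriate for UTF-16, for instance).

```python
def undecodable_lines(data: bytes, encoding: str = "ascii"):
    """Return the 1-based numbers of lines that fail to decode with `encoding`."""
    hits = []
    for number, line in enumerate(data.splitlines(), start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError:
            hits.append(number)
    return hits


if __name__ == "__main__":
    # Hypothetical sample: line 2 contains curly quotes encoded as UTF-8.
    sample = "first\n\u2018Xenial Xerus\u2019\nlast\n".encode("utf-8")
    print("ascii:", undecodable_lines(sample))
    print("utf-8:", undecodable_lines(sample, "utf-8"))
```

Passing a codec name that Python does not know raises a LookupError from `bytes.decode`, which is the kind of error message mentioned above.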

Question 5

This Perl command mostly replaces that grep command (the only thing missing is the colors):

perl -ne '/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_)' *.xml

-n: causes Perl to assume the following loop around your program, which makes it iterate over the filename arguments somewhat like sed -n or awk:
```
LINE:
  while (<>) {
      ...             # your program goes here
  }
```
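-e: may be used to enter one line of program.
/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_): if the line contains a character in the range \x80-\xFF, prints the name of the current file, the current line number, a :\t^ string and the content of the current line. Note that $. keeps counting across files unless ARGV is closed at the end of each file (for example with close ARGV if eof), so with several file arguments the number is cumulative rather than per file.

Output on a sample directory containing the example file from the question and a file containing only ààààà and a newline character: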

% perl -ne '/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_)' file | head -n 10
file(9):    ^AppendixÂ F, GNU General Public License.
file(14):   ^(codename "â€˜Xenial Xerusâ€™"), for the 64-bit PC ("amd64") architecture. It also
file(18):   ^â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”â”
file(330):  ^when things go wrong. The Installation Howto can be found in AppendixÂ A, 
file(337):  ^ChapterÂ 1.Â Welcome to Ubuntu
file(359):  ^1.1.Â What is Ubuntu?
file(368):  ^  â€¢ Ubuntu will always be free of charge, and there is no extra fee for the "
file(372):  ^  â€¢ Ubuntu includes the very best in translations and accessibility
file(376):  ^  â€¢ Ubuntu is shipped in stable and regular release cycles; a new release will
file(380):  ^  â€¢ Ubuntu is entirely committed to the principles of open source software
% perl -ne '/[\x80-\xFF]/&&print($ARGV."($.):\t^".$_)' file1
file1(1):   ^ààààà
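
For comparison, the implicit loop that -n provides has a rough counterpart in Python's standard fileinput module. The sketch below is an assumption-laden equivalent, not a translation of the one-liner: it reads the files in binary mode, reports per-file line numbers via filelineno(), and imitates the file(N):\t^ output format. The function name `report` is made up for illustration.

```python
import fileinput
import re
import sys

# Any byte in the range 0x80-0xFF, as in the Perl pattern.
NON_ASCII = re.compile(rb"[\x80-\xFF]")


def report(paths):
    """Return (filename, per-file line number, raw line) for non-ASCII lines."""
    results = []
    with fileinput.FileInput(files=paths, mode="rb") as stream:
        for line in stream:
            if NON_ASCII.search(line):
                results.append((stream.filename(), stream.filelineno(), line))
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    for name, number, line in report(sys.argv[1:]):
        text = line.decode("utf-8", errors="replace")
        print(f"{name}({number}):\t^{text}", end="")
```

Saved under a name of your choice, it would be called with the files to check as arguments, for example *.xml.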

muru · Accepted Answer · 1 December 2019 at 13:09