File preprocessing
Out of the box, lychee supports HTML, Markdown, and plain text. HTML files are parsed as HTML5 using the html5ever parser, and Markdown files are treated as CommonMark using pulldown-cmark.
For any other file format, lychee falls back to a “plain text” mode, in which linkify attempts to extract URLs on a best-effort basis. If invalid UTF-8 byte sequences are encountered, the input file is skipped, on the assumption that it is in a binary format lychee cannot understand.
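To get a rough idea of which files would be skipped in plain text mode, one can test whether a file decodes as UTF-8. This is only a sketch: is_utf8 is a hypothetical helper name, it relies on the standard iconv utility, and it approximates lychee's check rather than reproducing it.

```shell
# Hypothetical helper (not part of lychee): succeed if the file
# decodes cleanly as UTF-8, fail otherwise.
is_utf8() {
    # iconv exits with a non-zero status when it meets an invalid sequence.
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}
```

A file for which is_utf8 fails is a likely candidate for preprocessing rather than plain text extraction.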
lychee allows file preprocessing with the --preprocess flag.
For each input file, lychee invokes the command given to --preprocess and checks that command's output instead of reading the input file directly.
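As a minimal sketch of what such a command can look like, the script below assumes that lychee passes the path of the input file as the first argument and reads the result from standard output. The function name strip_comment_lines and the idea of hiding lines that start with # are purely illustrative.

```shell
#!/usr/bin/env bash
# Illustrative preprocessor: hide commented-out lines from lychee.

strip_comment_lines() {
    # "$1" is the path of the input file; the filtered text goes to stdout.
    sed '/^[[:space:]]*#/d' "$1"
}

# Run only when a file path was passed on the command line.
if [ "$#" -gt 0 ]; then
    strip_comment_lines "$1"
fi
```

Saved as an executable file under a name of your choice, for example ./strip-comments.sh, it would be passed to lychee as --preprocess ./strip-comments.sh.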
The following examples show how to preprocess common file formats.
In most cases it’s necessary to create a helper script for preprocessing,
as no parameters can be supplied from the CLI directly.

    lychee files/* --preprocess ./preprocess.sh

The referenced preprocess.sh script could look like this:

    #!/usr/bin/env bash

    case "$1" in
    *.pdf)
        exec pdftohtml -i -s -stdout "$1"
        # Alternatives:
        # exec pdftotext "$1" -
        # exec pdftk "$1" output - uncompress | grep -aPo '/URI *\(\K[^)]*'
        ;;
    *.odt|*.docx|*.epub|*.ipynb)
        exec pandoc "$1" --to=html --wrap=none --markdown-headings=atx
        ;;
    *.odp|*.pptx|*.ods|*.xlsx)
        # libreoffice can't print to stdout unfortunately
        libreoffice --headless --convert-to html "$1" --outdir /tmp
        file=$(basename "$1")
        file="/tmp/${file%.*}.html"
        sed '/<body/,$!d' "$file" # discard content before body which contains libreoffice URLs
        rm "$file"
        ;;
    *.adoc|*.asciidoc)
        asciidoctor -a stylesheet! "$1" -o -
        ;;
    *.csv)
        # specify --delimiter if values not delimited by ","
        exec csvtk csv2json "$1"
        ;;
    *)
        # identity function, output input without changes
        # (reads the file named by "$1", like every other branch)
        exec cat "$1"
        ;;
    esac

For more examples and information, take a look at lychee-all, a repository dedicated to collecting use cases for file preprocessing. Feel free to open an issue if a file format you need is missing or if you have questions.
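Because lychee only sees what the preprocessor prints, it can help to run a helper script such as preprocess.sh on a sample file by hand before passing it to --preprocess. The following is a sketch under that assumption; try_preprocess is a hypothetical helper, not a lychee feature.

```shell
# Hypothetical helper: run a preprocessor on one file and report
# whether it succeeded and produced any output.
try_preprocess() {
    script=$1
    file=$2
    # The assignment takes the exit status of the preprocessor itself.
    if output=$("$script" "$file"); then
        if [ -n "$output" ]; then
            printf 'ok: %s\n' "$file"
        else
            printf 'empty: %s\n' "$file"
        fi
    else
        printf 'failed: %s\n' "$file"
        return 1
    fi
}
```

Calling try_preprocess ./preprocess.sh followed by a file path reports "empty" when a conversion silently produced nothing, in which case lychee would have no links to check for that file.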