• DocLang

    From Retrograde@fungus@amongus.com.invalid to comp.misc on Tue Jun 16 02:53:50 2026
    From Newsgroup: comp.misc

    From the «lacking intelligence enough to parse SGML» department:
    Feed: www.theregister.com - Articles
    Title: A modest proposal: Reformat everything to make documents more
    palatable to AI
    Date: Mon, 15 Jun 2026 23:23:21 +0000
    Link: https://www.theregister.com/ai-and-ml/2026/06/16/a-modest-proposal-reformat-everything-to-make-documents-more-palatable-to-ai/5255938

    Image[1]

    Websites are being redesigned for consumption by AI models, and now a
    coalition wants to extend the trend to digital documents.

    The LF AI & Data Foundation, under the Linux Foundation, has formed a
    working group to steer the development of DocLang, an AI-friendly
    document format that aims to help enterprises feed their files to AI
    systems.

    The DocLang group, founded by IBM, NVIDIA, Red Hat, ABBYY, HumanSignal,
    and Forgis, contends that existing formats like PDF, Markdown, HTML, and
    LaTeX are ill-suited for AI document parsing.

    In late 2024, IBM developed an open source toolkit called Docling to
    facilitate AI document parsing, not unlike Microsoft's MarkItDown or the
    Marker project. Docling provides a way to convert various file formats
    into structured AI-ready data. DocLang expands upon that foundation with
    a standard for exchanging structured output across different systems.

    "DocLang is designed to solve one of the foundational problems in
    enterprise AI: documents were built for humans, not machines," said
    Maxime Vermeir, VP of AI Strategy at AI automation biz ABBYY in a
    statement. "By introducing a minimal, standardized, and AI-native
    representation of document structure, layout, meaning and governance,
    DocLang creates a far more deterministic foundation for modern AI
    systems."

    The new DocLang format is necessary, the spec authors argue, because
    existing formats were designed for rendering and lose semantic
    information, structural relationships, or geometric context when AI
    models turn them into tokens. The specification explains that Markdown
    lacks sufficient scope, that HTML is excessively verbose, and that LaTeX
    allows too much ambiguity.

    Essentially, DocLang is optimized for LLM tokenizers through markup that
    maps between DocLang elements and LLM tokens on a 1-to-1 basis. The spec
    relies on a limited XML vocabulary that aligns with LLM tokenizers to
    produce optimized prompts. It is lossless, so the AI conversion doesn't
    do away with valuable info. It's designed to support common graphical
    elements like tables, formulas, charts, and multimodal content. And it's
    an open standard.

    DocLang could also help keep costs under control. According to AI Cost
    Check, having an AI model conduct an OCR scan on a PDF requires about
    1,200 input tokens and 150 output tokens as a baseline. That's
    inconsequential to corporate AI customers on a one-off basis but demands
    attention at scale.

    And because AI models have highly variable token costs, companies may
    find they are spending more than they anticipated to have their AI
    system ingest PDFs, particularly if the documents are long and
    complicated or an expensive frontier model is used.

    "PDFs were designed for rendering, not understanding," said Jon Knisley,
    AI Value and Enablement Lead at ABBYY, in an email to The Register.
    "Every time a PDF enters an AI pipeline, structure, meaning and layout
    get lost, so the model's accuracy ends up bottlenecked by document
    quality rather than model quality. Teams compensate by building custom
    parsers at every integration point, which results in brittle, one-off
    work, and a new engineering sprint for every new document type."

    According to Knisley, that has a measurable cost. "Ambiguous structure
    forces the model into guesswork, which drives up hallucination risk and
    burns tokens deciphering layout instead of extracting meaning," he
    explained. "With DocLang, customers can expect better accuracy, lower
    costs, fewer tokens consumed, faster performance and more consistent
    outputs. The exact savings depend on the use case and document
    complexity, but our initial benchmarks show 4x to more than 30x lower
    cost depending on the model evaluated."

    Knisley also cited governance advantages, noting that document
    provenance data and metadata can get stripped when documents get moved.
    DocLang, he said, keeps that information attached.

    ABBYY, which offers AI document processing, has created the DocLang
    Interactive Benchmark to illustrate the potential token savings of
    feeding DocLang documents to AI models. A PDF of IBM's 2025 annual
    report, for example, results in 8,421 input tokens and 512 output tokens
    while a DocLang version requires only 5,310 input tokens and 498 output
    tokens. What's more, the DocLang version results in lower latency (2.7s
    vs 4.2s) and delivers better quality (the AI missed one subsection and
    mangled a table merger in the PDF).

    "It's still early, and we won't overstate adoption," said Knisley. "The
    standard is open and free to build on, and the group is actively
    inviting more technology providers and enterprises to join. The early
    response has been encouraging, and we're optimistic about where it goes
    from here." ®
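
    For scale, the benchmark figures quoted in the article can be turned
    into relative savings with a few lines of Python. This is only a rough
    sketch: the token counts and latencies are the ones reported for the
    IBM annual report example, while the helper names (percent_reduction,
    summarize) are made up here for illustration and are not part of any
    DocLang or ABBYY tooling.

```python
# Figures as quoted in the article for the IBM 2025 annual report example.
PDF = {"input_tokens": 8421, "output_tokens": 512, "latency_s": 4.2}
DOCLANG = {"input_tokens": 5310, "output_tokens": 498, "latency_s": 2.7}


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction going from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


def summarize(before: dict, after: dict) -> dict:
    """Percent reduction for every metric, rounded to one decimal place."""
    return {
        key: round(percent_reduction(before[key], after[key]), 1)
        for key in before
    }


if __name__ == "__main__":
    for metric, pct in summarize(PDF, DOCLANG).items():
        print(f"{metric}: {pct}% lower with DocLang")
```

    On these figures alone, input tokens drop by roughly 37 percent, output
    tokens by under 3 percent, and latency by roughly 36 percent.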

    Links:
    [1]: https://image.theregister.com/?imageId=5255961&width=800 (image)

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.misc on Tue Jun 16 03:40:21 2026
    From Newsgroup: comp.misc

    On 16 Jun 2026 02:53:50 GMT, Retrograde wrote:

    > [from <https://www.theregister.com/ai-and-ml/2026/06/16/a-modest-proposal-reformat-everything-to-make-documents-more-palatable-to-ai/5255938>:]
    > "DocLang is designed to solve one of the foundational problems in
    > enterprise AI: documents were built for humans, not machines,"

    LOL at “documents were built for humans, not machines”. What was the “I” in “AI” supposed to stand for, again?
  • From not@not@telling.you.invalid (Computer Nerd Kev) to comp.misc on Wed Jun 17 08:24:53 2026
    From Newsgroup: comp.misc

    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    > On 16 Jun 2026 02:53:50 GMT, Retrograde wrote:
    >> [from <https://www.theregister.com/ai-and-ml/2026/06/16/a-modest-proposal-reformat-everything-to-make-documents-more-palatable-to-ai/5255938>:]
    >> "DocLang is designed to solve one of the foundational problems in
    >> enterprise AI: documents were built for humans, not machines,"

    > LOL at "documents were built for humans, not machines". What was the
    > "I" in "AI" supposed to stand for, again?

    From what I've seen, it's definitely "Idiot".
    --
    __ __
    #_ < |\| |< _#
  • From kludge@kludge@panix.com (Scott Dorsey) to comp.misc on Tue Jun 16 19:17:43 2026
    From Newsgroup: comp.misc

    Computer Nerd Kev <not@telling.you.invalid> wrote:
    > Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    >> On 16 Jun 2026 02:53:50 GMT, Retrograde wrote:
    >>> [from <https://www.theregister.com/ai-and-ml/2026/06/16/a-modest-proposal-reformat-everything-to-make-documents-more-palatable-to-ai/5255938>:]
    >>> "DocLang is designed to solve one of the foundational problems in
    >>> enterprise AI: documents were built for humans, not machines,"

    >> LOL at "documents were built for humans, not machines". What was the
    >> "I" in "AI" supposed to stand for, again?

    > From what I've seen, it's definitely "Idiot".

    Why can't people just use TeX markup like God and Knuth intended?
    --scott
    --
    "C'est un Nagra. C'est suisse, et tres, tres precis."
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.misc on Wed Jun 17 07:35:52 2026
    From Newsgroup: comp.misc

    On Tue, 16 Jun 2026 19:17:43 -0400 (EDT), Scott Dorsey wrote:

    > Why can't people just use TeX markup like God and Knuth intended?

    Because troff came first.
  • From kludge@kludge@panix.com (Scott Dorsey) to comp.misc on Wed Jun 17 18:41:06 2026
    From Newsgroup: comp.misc

    In article <110tion$1mg7k$2@dont-email.me>,
    Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
    > On Tue, 16 Jun 2026 19:17:43 -0400 (EDT), Scott Dorsey wrote:

    >> Why can't people just use TeX markup like God and Knuth intended?

    > Because troff came first.

    troff was just an updated runoff. TeX was a different order of magnitude;
    it was up with commercial typesetting systems like Xics.
    --scott
    --
    "C'est un Nagra. C'est suisse, et tres, tres precis."
  • From Bob Eager@throwaway0008@eager.cx to comp.misc on Wed Jun 17 22:44:53 2026
    From Newsgroup: comp.misc

    On Wed, 17 Jun 2026 18:41:06 -0400, Scott Dorsey wrote:

    > In article <110tion$1mg7k$2@dont-email.me>,
    > Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
    >> On Tue, 16 Jun 2026 19:17:43 -0400 (EDT), Scott Dorsey wrote:

    >>> Why can't people just use TeX markup like God and Knuth intended?

    >> Because troff came first.

    > troff was just an updated runoff. TeX was a different order of
    > magnitude; it was up with commercial typesetting systems like Xics.
    > --scott

    troff was a development of roff, which included some typesetting
    features. roff was named as a UNIX-style (shorter word) version of
    DEC's Runoff.
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.misc on Wed Jun 17 23:54:56 2026
    From Newsgroup: comp.misc

    On Wed, 17 Jun 2026 18:41:06 -0400 (EDT), Scott Dorsey wrote:

    > On Wed, 17 Jun 2026 07:35:52 -0000 (UTC), Lawrence D’Oliveiro wrote:

    >> On Tue, 16 Jun 2026 19:17:43 -0400 (EDT), Scott Dorsey wrote:

    >>> Why can't people just use TeX markup like God and Knuth intended?

    >> Because troff came first.

    > troff was just an updated runoff. TeX was a different order of
    > magnitude; it was up with commercial typesetting systems like Xics.

    troff appears: 4th ed Unix, 1973 <https://wiki.tuhs.org/doku.php?id=systems:4th_edition&s[]=troff>.
    Originally designed for the CAT phototypesetter from 1972 <https://en.wikipedia.org/wiki/Troff>.

    TeX didn’t come out till 1978, if I recall rightly.
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.misc on Thu Jun 18 13:48:20 2026
    From Newsgroup: comp.misc

    In article <n9gmb5Fdvd6U7@mid.individual.net>,
    Bob Eager <throwaway0008@eager.cx> wrote:
    > On Wed, 17 Jun 2026 18:41:06 -0400, Scott Dorsey wrote:

    >> In article <110tion$1mg7k$2@dont-email.me>,
    >> Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
    >>> On Tue, 16 Jun 2026 19:17:43 -0400 (EDT), Scott Dorsey wrote:

    >>>> Why can't people just use TeX markup like God and Knuth intended?

    >>> Because troff came first.

    >> troff was just an updated runoff. TeX was a different order of
    >> magnitude; it was up with commercial typesetting systems like Xics.
    >> --scott

    > troff was a development of roff, which included some typesetting
    > features. roff was named as a UNIX-style (shorter word) version of
    > DEC's Runoff.

    Early Unix roff was a cousin of DEC's Runoff; more immediately,
    it is a descendent of McIlroy's BCPL roff[*] on Multics, which
    in turn descends from Saltzer's RUNOFF on CTSS.

    DEC's version was inspired by the versions on CTSS and Multics
    (and apparently GENIE?), but is an independent implementation.

    Unix roff became nroff ("new roff"), and `troff` adapted it for
    use with a phototypesetter ("typesetter roff", hence "troff").

    - Dan C.

    [*] Wikipedia says that the Unix version was actually more
    directly influenced by a PL/1 version on Multics that came after
    McIlroy's (in 1974). I've never heard this before, and I think
    it is inaccurate: the timelines don't match up. Unix `roff`
    existed well before 1974, and BTL had pulled out of Multics by
    then, anyway.
  • From kludge@kludge@panix.com (Scott Dorsey) to comp.misc on Thu Jun 18 12:10:02 2026
    From Newsgroup: comp.misc

    Dan Cross <cross@spitfire.i.gajendra.net> wrote:
    > Early Unix roff was a cousin of DEC's Runoff; more immediately,
    > it is a descendent of McIlroy's BCPL roff[*] on Multics, which
    > in turn descends from Saltzer's RUNOFF on CTSS.

    Yes, and note that RUNOFF got ported to all kinds of other systems from
    OS/360 to CDC NOS. So a lot of people learned about the whole concept of
    text processing from RUNOFF and then went on to build much better systems.

    > DEC's version was inspired by the versions on CTSS and Multics
    > (and apparently GENIE?), but is an independent implementation.

    Yes, DSR (Digital Standard Runoff) was a bit weird but had features added
    that were well-adapted to printing out DEC manuals instead of academic papers.

    > Unix roff became nroff ("new roff"), and `troff` adapted it for
    > use with a phototypesetter ("typesetter roff", hence "troff").

    Yes.

    Meanwhile the guys at MIT were all using Scribe by 1980, and there were
    plenty of commercial systems out there for typesetting which were more
    sophisticated but had a higher bar to entry, like INTERPRESS and the
    previously mentioned XICS. IBM had ISIL/GML with a more fancy markup
    front end on top of that.
    --scott
    --
    "C'est un Nagra. C'est suisse, et tres, tres precis."
  • From Bob Eager@throwaway0008@eager.cx to comp.misc on Thu Jun 18 20:28:45 2026
    From Newsgroup: comp.misc

    On Thu, 18 Jun 2026 12:10:02 -0400, Scott Dorsey wrote:

    > Meanwhile the guys at MIT were all using Scribe by 1980

    We had a Scribe clone on EMAS in Edinburgh and Canterbury in the UK. I
    used it quite a bit.

    In 1984 I bought a cheap IBM compatible PC called an Advance 86B. It came
    with a 'Perfect' suite of programs.

    The text editor was an Emacs clone, without LISP and no optional key
    binding. But I liked it, and have used Emacs-like editors to this day.

    The text formatter was separate (no WYSIWYG). It was a Scribe clone,
    again much reduced but functional.