
llms.txt


benbyf

Went down a rabbit hole on this one...

Really curious why they chose a structureless document format like Markdown when there are rich and mature data standards like Schema.org. The foundational work is already done, the syntax is well established and widely adopted, and there are a lot of areas where the wheel wouldn't need to be reinvented. It already exists on millions of websites, and generators/parsers already exist for it, so devs and orgs could adopt it much more quickly with updates to existing packages/libraries. Almost all of the examples they give on the llmstxt website could be satisfied out of the box and, if not, remedied by extending the specification.

Maybe I'm missing the boat on this one, but is their rejection of an existing data structure due to the fact that they want LLMs to read the content "naturally"? Are LLMs incapable of reading structured data? If that's the case, should LLMs be giving anyone programming tips... I know I've veered off the topic of this post, but this proposal is 3 months old and I'm curious whether it has any legs.

ANYWAY. Considerations for a module...

How would it handle different information at different URL paths? Would it just dump all content into a single file? That may create a massive file that starts to introduce performance issues on larger sites with things like blogs. This issue filed on the proposal's GitHub repo brings up multiple URLs, but the proposal itself doesn't seem to have anything concrete that takes this into consideration.

Thinking about this out loud because creating a module to satisfy this need may end up being a more generalized module. I think this would end up turning into a Markdown generator, if not a library outright. I did a good amount of searching and there are tons of PHP packages that parse MD but I couldn't find any that generate MD from values. If that library existed, the module would be a lot easier to build.
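To make the "generate MD from values" idea concrete, here is a minimal sketch of what such a library could start out as. Everything in it is an assumption for illustration: the class name `MarkdownBuilder` and its method names are made up and don't come from any existing package or from the llms.txt proposal.

```php
<?php
// Hypothetical sketch of a value-to-Markdown generator.
// Class and method names are invented for illustration only.
final class MarkdownBuilder
{
    // Heading of level 1 to 6.
    public function h(int $level, string $text): string
    {
        return str_repeat('#', max(1, min(6, $level))) . ' ' . trim($text);
    }

    // Blockquote: prefix every line of the text with "> ".
    public function quote(string $text): string
    {
        $lines = explode("\n", trim($text));
        return implode("\n", array_map(fn (string $line) => '> ' . $line, $lines));
    }

    public function link(string $label, string $url): string
    {
        return sprintf('[%s](%s)', $label, $url);
    }

    // Unordered list from an array of strings.
    public function ul(array $items): string
    {
        return implode("\n", array_map(fn (string $item) => '- ' . $item, $items));
    }

    // Pipe table: $headers is a list of strings, $rows a list of lists.
    public function table(array $headers, array $rows): string
    {
        $line = fn (array $cells) => '| ' . implode(' | ', array_map(
            fn ($cell) => str_replace('|', '\\|', (string) $cell), // escape pipes in cells
            $cells
        )) . ' |';
        $divider = '| ' . implode(' | ', array_fill(0, count($headers), '---')) . ' |';

        return implode("\n", array_merge(
            [$line($headers), $divider],
            array_map($line, array_values($rows))
        ));
    }

    // Join non-empty blocks with a blank line between them.
    public function render(array $blocks): string
    {
        return implode("\n\n", array_filter($blocks, fn ($b) => $b !== '')) . "\n";
    }
}
```

Nothing in this is specific to ProcessWire or to LLMs, which is kind of the point: a module would only need to wrap something like this.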

In the case of this module, we'd essentially be building two versions of the same site, because Markdown is as concerned with presentation as it is with content, rather than just describing the data logically.

Each field would have to be configured for rendering in MD. The llmstxt example of Franklin's BBQ is a good illustration. They have an unordered list of weekly hours, but their menu is formatted as a table. In that example, either one could be rendered as a list or table. Assuming we are using a repeater for hours and a repeater for menu items, each field would need to have settings for how it should be rendered (list or table). In the case of a table, fields for table headers need to be mapped and the subfields in the repeater mapped to column values. I don't even know what the settings would look like to render the business hours as a list according to the example.
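For what it's worth, here is one rough guess at what per-field settings could look like if they were stored as plain arrays. This is only a sketch under assumptions: the setting keys (`as`, `format`, `columns`), the `renderRepeater()` function, and the subfield names are all invented, and the repeater rows are represented as plain arrays rather than real ProcessWire repeater items.

```php
<?php
// Hypothetical per-field settings, keyed by repeater field name.
$settings = [
    // Render each row as a list item using a placeholder format string.
    'operating_hours' => ['as' => 'list', 'format' => '{days}: {hours}'],
    // Render rows as a table: header label => repeater subfield name.
    'menu_items' => ['as' => 'table', 'columns' => ['Item' => 'title', 'Price' => 'price']],
];

// Render an array of rows (subfield name => value) according to one field's settings.
function renderRepeater(array $config, array $rows): string
{
    if ($config['as'] === 'list') {
        $lines = [];
        foreach ($rows as $row) {
            // Replace each {subfield} placeholder with the row's value.
            $lines[] = '- ' . preg_replace_callback(
                '/\{(\w+)\}/',
                fn (array $m) => (string) ($row[$m[1]] ?? ''),
                $config['format']
            );
        }
        return implode("\n", $lines);
    }

    // Otherwise render a table from the header => subfield mapping.
    $headers = array_keys($config['columns']);
    $lines = [
        '| ' . implode(' | ', $headers) . ' |',
        '| ' . implode(' | ', array_fill(0, count($headers), '---')) . ' |',
    ];
    foreach ($rows as $row) {
        $cells = array_map(
            fn (string $subfield) => (string) ($row[$subfield] ?? ''),
            array_values($config['columns'])
        );
        $lines[] = '| ' . implode(' | ', $cells) . ' |';
    }
    return implode("\n", $lines);
}
```

Even this toy version needs a format string for lists and a column mapping for tables, and that is for just one field type.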

I'm thinking that putting all of the configuration into a module would be a significant challenge. I'm not sure that this proposed standard lends itself well to creating content for the Markdown file via a user-friendly UI. It may need a developer to handle it all separately. This is one of the reasons I mentioned Schema data. It would be trivial to implement a Schema object; we already do for Google's structured data.
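As a rough illustration of how little code the Schema.org route takes, here is a sketch that builds a JSON-LD block for a restaurant's opening hours. The function name `restaurantSchema()` and the shape of the `$hours` rows are assumptions for this example; the `Restaurant` and `OpeningHoursSpecification` types and their `dayOfWeek`, `opens`, and `closes` properties are from the Schema.org vocabulary.

```php
<?php
// Hypothetical helper: builds a JSON-LD <script> tag for a restaurant.
// $hours rows are plain arrays here; on a real site they would come from fields.
function restaurantSchema(string $name, array $hours): string
{
    $data = [
        '@context' => 'https://schema.org',
        '@type' => 'Restaurant',
        'name' => $name,
        'openingHoursSpecification' => array_map(
            fn (array $row) => [
                '@type' => 'OpeningHoursSpecification',
                'dayOfWeek' => $row['days'],
                'opens' => $row['opens'],
                'closes' => $row['closes'],
            ],
            array_values($hours)
        ),
    ];

    return '<script type="application/ld+json">'
        . json_encode($data, JSON_UNESCAPED_SLASHES | JSON_PRETTY_PRINT)
        . '</script>';
}
```

The data stays a plain nested array the whole way through, so there is no list-versus-table decision to configure.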

The biggest lift would be writing a library that the developer uses to render the MD data while minimizing per-field configuration, and the module would probably end up as just a formatter that outputs Markdown using defined methods.

Here's a hypothetical implementation that uses page classes and an imaginary MarkdownGenerator module. This would render something like the Franklin's BBQ example in the link above:

<?php namespace ProcessWire;

// site/classes/HomePage.php
class HomePage extends DefaultPage
{
    public function renderLlmsMarkdown(): string
    {
        $md = wire('modules')->get('MarkdownGenerator');

        return $md->render([
            $md->h1($this->title),
            $md->quote($this->summary),
            $md->text('Here are our weekly hours'),
            $md->ul(
                array_map(
                    fn ($row) => "{$row->days}: {$row->hours}",
                    $this->pages->get(1012)->operating_hours->getArray(),
                ),
            ),
            $md->h2('Menus'),
            $md->ul(
                array_map(
                    // httpUrl already ends with a slash, matching the hook paths in init.php
                    fn ($menuPage) => $md->link($menuPage->title, "{$menuPage->httpUrl}llms.txt"),
                    $this->wire()->pages->find('template=menu')->getArray(),
                ),
            ),
        ]);
    }
}

// site/classes/MenuPage.php
class MenuPage extends DefaultPage
{
    public function renderLlmsMarkdown(): string
    {
        $md = wire('modules')->get('MarkdownGenerator');

        $markdownItems = array_map(function($menuSection) use ($md) {
            return $md->render([
                $md->h2($menuSection->title),
                $md->table(
                    ['Item', 'Price'],                    
                    array_map(
                        fn ($item) => [$item->title, $item->price],
                        $menuSection->menu_items->getArray(),
                    ),
                ),
            ]);
        }, $this->menu->getArray());

        return $md->renderBlocks($markdownItems);
    }
}

// site/init.php
foreach ($pages->find('template=home|menu') as $llmPage) {
    $wire->addHook(
        "{$llmPage->url}llms.txt",
        fn (HookEvent $e) => $e->pages->get($llmPage)->renderLlmsMarkdown()
    );
}

That should really leverage caching in the real world.
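A very small sketch of what that caching could look like, written as a generic file cache so it stands on its own. The function name `cachedMarkdown()`, the `llms-` file prefix, and the use of the system temp directory are all assumptions for illustration; on a real ProcessWire site the built-in cache would likely be the better fit.

```php
<?php
// Hypothetical helper: return cached Markdown for $key if it is fresher than
// $ttlSeconds, otherwise call $render, store the result, and return it.
function cachedMarkdown(string $key, int $ttlSeconds, callable $render, ?string $dir = null): string
{
    $dir = $dir ?? sys_get_temp_dir();
    $file = $dir . DIRECTORY_SEPARATOR . 'llms-' . md5($key) . '.md';

    // Serve the cached copy while it is still within its time to live.
    if (is_file($file) && (time() - filemtime($file)) < $ttlSeconds) {
        $cached = file_get_contents($file);
        if ($cached !== false) {
            return $cached;
        }
    }

    // Cache miss or expired: render and write the result back.
    $markdown = (string) $render();
    file_put_contents($file, $markdown, LOCK_EX);

    return $markdown;
}
```

In the hook above, the callback would then wrap the render call, something like `cachedMarkdown($llmPage->url, 3600, fn () => $llmPage->renderLlmsMarkdown())`, with the cache cleared whenever the page is saved.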

This approach will render an llms file at each individual URL for pages that are configured. The proposed standard seems to be taking a non-web approach and, as mentioned in the GitHub issue above, hasn't considered leveraging web URLs; instead it creates a stack of separate linked MD documents that an LLM reads like a book at the root URL. Since the standard doesn't say "look for llms.txt at every available URL", any page with llms data has to be specifically referenced either in the root llms.txt document or in another llms.txt document that is referenced in the root document, directly or through an ancestor. This follows the BBQ example, but uses actual URLs rather than generating a stack of separate MD documents at the root. I assume you could just hook file names that contain Page IDs or something, but this makes more sense to me.

Seems like an incredibly efficient way to build a whole new internet just for robots without any value provided to the people doing the work 🤔 At the very least I want a promise from someone that my life and the lives of my family will be spared when Skynet takes over.

tl;dr

  • Creating a module specifically to render llms data may not be the most efficient way to go about this
  • A module that puts configurations into the UI would have to be extremely complex and account for the many types of fields available in ProcessWire
  • If the module attempted to make everything configurable, each type of field would have to be limited to the kind of MD it can generate
  • The best way would probably be to create fields that will contain the content and then have your code do the rendering
  • This is basically just creating two websites. One for people and one for LLMs
  • Because this proposed standard has chosen markup over a logical data structure, it's probably always going to be on the shoulders of the developer unless they change something

Another challenge is their expectation of additional content management:

Quote

Remember, when constructing your llms.txt you should “use concise, clear language. When linking to resources, include brief, informative descriptions. Avoid ambiguous terms or unexplained jargon.”

If this is important enough, then there may be a need to manage LLM-consumable information separately, in fields that contain content sufficiently dumbed down for LLMs. Maybe the real module here is one that connects to an LLM API which auto-summarizes the content, to then be used when creating MD files that describe the content to other LLMs.

Solution: a library or Module that takes inputs and renders Markdown. Wouldn't be anything specific to AI.

Or this standard could be tossed and we can just render structured data on the web page so LLMs can use the internet as a natively hyperlinked set of documents located at stable and discoverable URLs...

Having thought this out I think even less of this standard than when I first read the proposal 🤣
