User:Vadcx/Modding Architecture

From Official Factorio Wiki

Content warning: prepare a lot of tea. LOL


Preface: This was written as part of a discussion about whether code that sets up a metatable in a script file's root is safe to use (that file is always loaded and that table is never saved into global).

What you will read here: Factorio's Lua API and modding architecture design decisions; my high-level overview and explanation of usage and inner workings of Lua API/mods.

What's left to do: generalize the text and provide easy to follow examples. Describe a way to practically store functions/metatables in global by using FunctionFactories that recreate necessary functions based on data-only descriptions saved in global. Make its own section for the glossary because the devs chose very poor names for crucial API functions.

EDITS ARE WELCOME: if you have ideas, go ahead and edit the text, without even asking. If I notice something I don't like, I will undo that single change.


Recommended reading order:

That’s all for now. I assume people reading understand the determinism and the lockstep networking architecture.

If you have spent a lot of time in pure Lua, you would expect all the handy-dandy properties of dynamic typing and functions being first-class values, much like in JavaScript. Ask a table about key MYFUNC and get that function in return, or a table that acts like a function thanks to the __call metamethod :)
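A minimal plain-Lua sketch of both cases. The names registry, MYFUNC and callable are made up for illustration and have nothing to do with Factorio's API.

```lua
-- Functions are ordinary values: store one under a key and fetch it back.
local registry = {}
registry.MYFUNC = function(x) return x * 2 end

-- A table can also behave like a function through the __call metamethod.
local callable = setmetatable({ factor = 3 }, {
    __call = function(self, x) return x * self.factor end,
})

local doubled = registry.MYFUNC(21)  -- calls the stored function
local tripled = callable(14)         -- calls the table via __call
```

Both the stored function and the __call metatable are exactly the kind of thing that cannot be written into a save.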

The foundational decision by Factorio devs was to use the standard Lua aka PUC-Rio Lua. This means accepting its shortcomings and long-standing limitations, like load, which provides no safety guarantees when loading bytecode directly.

For security reasons, the result is that Factorio will never try to serialize functions or code of the Lua Virtual Machine (VM). It would have been cool if the entire Lua VM state were saved/transferred/restored exactly as it was before: this would completely eliminate desync issues on our (modders’) side due to improper save handling. Yet it would open the loophole of deserializing untrusted code… and there goes the security… even if there are examples of other VM implementations that allow this, although maybe in a non-portable way.



If we are not allowed to serialize functions, it makes sense to erase metatables completely too, because they are mostly made of custom callback functions. Following these constraints leaves us with the current Factorio architecture:

  • All Lua code is plain text Lua.
  • No functions/metatables will be saved to meet the constraints
  • Only data state is saved and restored for game saves (and multiplayer that uses game saves under the hood)
  • Lua’s data persistence in game saves is achieved by only serializing the special global Factorio table
  • Factorio’s internal data is restored from the save game, data needed for Lua scripts is restored by recreating the global table as it was before the save game
  • Lua code is loaded afresh from the included script files (be it mods or scenario scripts)
  • Lua code makes sure to re-register metatables when its code loads, because they were erased, even for tables inside global
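The list above can be sketched in plain Lua. This is a toy model, not the game's serializer: global is an ordinary table standing in for Factorio's persisted table, simulate_save_load is a hypothetical helper that copies data only, and Counter is an invented metatable.

```lua
-- Stand-in for Factorio's persisted table (assumption: a plain Lua table here).
local global = { counter = { value = 5 } }

-- Hypothetical metatable; it lives in the script file, never in the save.
local Counter = {
    __index = {
        increment = function(self) self.value = self.value + 1 end,
    },
}
setmetatable(global.counter, Counter)

-- Toy "save + load": copy data only, dropping functions and metatables.
local function simulate_save_load(value)
    if type(value) == "function" then return nil end
    if type(value) ~= "table" then return value end
    local copy = {}
    for k, v in pairs(value) do
        copy[k] = simulate_save_load(v)
    end
    return copy -- no setmetatable here: the metatable is gone
end

global = simulate_save_load(global)
local lost = getmetatable(global.counter) == nil -- metatable did not survive

-- What the script file has to do again every time it loads:
setmetatable(global.counter, Counter)
global.counter:increment()
```

The data (value = 5) comes back from the "save", while the behaviour (increment) only exists again because the script re-attached it.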

If it helps, think of Factorio’s save system as OS hibernation. You have all the same code on disk, but must pay attention to load previously saved stuff from pagefile into memory so it appears seamless.

The high-level explanation of a desync in Factorio is that some state did not make it through global, so current players and players who rebuild their state from code + save data end up disagreeing. There are no other causes, because Factorio devs deliberately limit you in what you can do or reach from within Lua. That’s why there’s no asymmetrical I/O in Factorio where you would load a file on one player’s computer or the server. It’s a decision to limit the “desync surface” so to say.

And because Factorio does not go out of its way to serialize the Lua VM state, you only get the data stored in global, which is serialized between saves; all your local and “Lua global” and upvalue variables be damned (poof, gone)!
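A small sketch of what that means for a script file, runnable in plain Lua. Here global is a plain table pretending to be the restored save data, and kills, debug_name and restore_from_global are invented names.

```lua
-- Stand-in for the persisted table as it comes back from a save (assumption).
local global = { kills = 17 }

-- Values written in the script file: every loading player starts from these.
local kills = 0
local debug_name = "default"

-- Hypothetical restore step: overlay saved values on top of the file defaults.
local function restore_from_global()
    if global.kills ~= nil then kills = global.kills end
    if global.debug_name ~= nil then debug_name = global.debug_name end
end

restore_from_global()
```

Without the restore step, a joining player would compute with kills = 0 while everyone already in the game uses 17. Reading global.kills directly everywhere avoids the local copy altogether.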

Read Gangsir’s desync chapter again and the previous paragraphs if needed.



Only now does it make sense to read the more advanced articles:

It should make sense now why the “Heavy mode” of desync troubleshooting is there. The desyncs between players can only happen if you can’t restore (current state) from (persistent save data + script files)*. The only other culprit is Factorio itself introducing desyncs, but that’s not our problem.

  • data-lifecycle article, “on_load()” section
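As a rough illustration of the on_load idea, the sketch below re-subscribes a conditional handler based on a flag kept in global. The script, defines and global objects here are tiny stand-ins so the snippet runs outside the game; ticker_enabled and the handler names are hypothetical.

```lua
-- Minimal stand-ins so the sketch runs outside the game (assumptions, not the real objects).
local handlers = {}
local script = {
    on_event = function(event_id, fn) handlers[event_id] = fn end,
    on_load = function(fn) handlers.on_load = fn end,
}
local defines = { events = { on_tick = "on_tick" } }
local global = { ticker_enabled = true } -- as restored from the save

local function on_tick_handler(event)
    -- per-tick work would go here
end

-- Re-subscribe only if the saved flag says the handler was active before saving.
-- Only reads global; nothing is written during load.
local function register_conditional_handlers()
    if global.ticker_enabled then
        script.on_event(defines.events.on_tick, on_tick_handler)
    end
end

script.on_load(register_conditional_handlers)

-- Simulate the game firing on_load for a joining player.
handlers.on_load()
```

The decision to subscribe is driven purely by saved data, so a joining player ends up with the same set of handlers as players already in the game.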



I hope you can now understand and explain Gangsir’s desync examples:

  1. Using local variables. When a player loads (save data + script files), they start with the values from code. If you intend to use them, they must be loaded/overwritten from the global persistence table.
    • Just consider all data coming from script files immutable; values from global must be overlaid on top. (Linux examples: OverlayFS, Docker, Fedora Silverblue, SteamOS.)
  2. Conditional event subscribing. When your (save data + script file) loads, it must know whether it needs to re-subscribe to an event from before by checking your saved variable in the global persistence table. If you didn’t save it in global, the newly joined player has no idea that the event must have been subscribed to!
  3. Improper use of on_load. Carefully read the data-lifecycle article and its warnings, and don’t do anything crazy beyond loading the saved state from global to restore from “hibernation”.
  4. Comparison by reference. This is very tricky, because that’s exactly what makes Lua so easy to use. Even if functions are now out of the way (not saved-restored), there are still Lua tables and objects that came from Factorio’s API. The tutorial says:

Be cautious of comparing tables by reference. In multiplayer syncing, tables deserialized from the server state will be new objects, not equal by reference to any table initialized by client code. if a == b then and if a ~= b then

… they may have different results if a came from loading script file code and b came from deserialization (save file).

Note that LuaObjects provided by the game have their equality operator overwritten to prevent this behaviour, so code such as LuaEntityA ~= LuaEntityB will not desync. However, this does not apply when LuaObjects are used as keys in tables:

if table[LuaObject] then

This will desync in the same way as described for the plain tables a and b above. For entities it is recommended to use LuaEntity.unit_number as the table key instead of the whole entity.

Lua’s tables and the game’s objects are pretty much C pointers. They can’t be restored between save & load. The game tries to help when it comes to the equality operator of LuaObject, but when a LuaObject is used as a table key you rely on the pointer address, which will never be the same before a save and after a load.
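The difference can be reproduced in plain Lua with mock objects. The two tables below only imitate an entity wrapper before saving and after loading; by_object and by_unit_number are invented names, and global is a plain table.

```lua
-- Stand-in for the persisted table (assumption: plain Lua table).
local global = { by_object = {}, by_unit_number = {} }

-- Mock entity objects: each is a distinct table, like a fresh wrapper after load.
local before_save = { unit_number = 42 }
local after_load  = { unit_number = 42 } -- same entity, new Lua object

global.by_object[before_save] = "tracked"
global.by_unit_number[before_save.unit_number] = "tracked"

-- Lookup by object identity fails, lookup by the stable number succeeds.
local found_by_object = global.by_object[after_load]
local found_by_number = global.by_unit_number[after_load.unit_number]
```

Table keys are compared by identity for tables and userdata, but by value for numbers and strings, which is why the unit_number lookup survives.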

Remember:

When loading from save/joining multiplayer:

global[ FactorioGameObject ] = someValue will not be found after loading –> desync.

The following will work after loading because the serpent library takes care of it across serializations (only for all regular Lua objects in global):

global.savedLuaTable = {"I am a SavedLuaTable"}

-- time passes, empires fall, you join the game:
local persistedTable = global.savedLuaTable
global[ persistedTable ] = LuaValue

Next example will not work because when loading code, Lua is blissfully unaware and creates new tables:

local freshTable = require("make-me-a-table")
local missing = global[ freshTable ] -- NOT FOUND (nil), new object used as key here -> desync

-- in contrast, save it under a permanent string key:
local found = global["my-old-table"] --> always finds the saved value

Hopefully I understood and delivered that Lua example correctly.



After this short preface I can finally explain (to myself and to you) the metatable shenanigans here. The game trusts us that the script files on disk were untouched (no scenario updates) and that we handle on_init and on_load correctly when starting or loading a saved game. Since the game always loads script files afresh, each player will go through the code that attaches the metatable as part of the require of this file. This metatable is effectively static, in that it comes from the immutable script file code. I haven’t combed through the code to see how control.lua -> on_init -> comes to require and load session_data.lua, but it must be happening at some point. See the graph in data-lifecycle.

The reasons this metatable cannot desync:

  1. It is present for the player who started the save and loaded with default code for joining/loading players.
  2. On its own it does not modify any state. If some script relied on its modified trusted value, that script would necessarily modify global to deviate from default, thus also syncing it for save/load i.e. other players.
  3. The above is still a shaky ground, because the metatable always calls is_multiplayer, right? And this is the only security guarantee for both previous points: it is currently not possible to turn a Singleplayer into a Multiplayer session without reloading the game and vice-versa. So going from SP->MP / MP->SP (the only potential way for this metatable to affect some other code’s LOCAL VALUES) requires reloading the entire script + save state. Therefore all players start with same state persisted through global.

Now the last point also means this PR can be rewritten as a load-time if condition, because multiplayer won’t be enabled dynamically (much like compile-time in C):

if not game.is_multiplayer() then -- we can do this, because our parent (control.lua) is in "Runtime" stage aka live
    -- only attaches a metatable if scenario/game save started as Singleplayer

    -- Hacky way to ensure singleplayer is always "trusted" besides being an admin
    -- Since players are accessed by name, return true on missing keys.
    setmetatable(trusted, {__index = function()
        return true -- only for not existing values
    end})
end
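To see what this __index fallback does and does not do, here is a standalone sketch. The contents of trusted are invented; the only assumption carried over is that it maps player names to booleans.

```lua
-- Hypothetical contents: player names mapped to booleans.
local trusted = { alice = false }

setmetatable(trusted, { __index = function()
    return true -- only consulted for keys that are absent from the table
end })

local stored  = trusted.alice          -- key exists, __index is not called
local missing = trusted.bob            -- key is absent, __index supplies the default
local raw     = rawget(trusted, "bob") -- nothing was written into the table itself
```

An explicitly stored false stays false, and reading a missing key does not create it, so the metatable by itself adds no state that would have to be saved.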

Based on the above, this entire file is loaded on new game & joining game. But Lua’s require semantics also only ever execute a given code file once, so we can’t run into a problem of multiple different states here.
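The run-once behaviour of require can be checked in plain Lua. The module name session_data_demo is made up, and package.preload is used only so the sketch needs no file on disk.

```lua
-- Register a throwaway module loader so no file on disk is needed.
local load_count = 0
package.preload["session_data_demo"] = function()
    load_count = load_count + 1 -- the module body runs here
    return { loaded_at = load_count }
end

local first  = require("session_data_demo")
local second = require("session_data_demo") -- served from package.loaded, body not re-run
```

Both calls return the very same table, so any setmetatable call in the module body happens once per Lua state.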

I need to see how this ties into the testing/benchmarking I wanted it for. If anything, this new piece of code would be my preferred variant for now.

PS: Since the game saves always load Lua code from files and never deserialize code, it means we can sneak in code updates into save files by editing script files directly. Just make sure your new code can tolerate previously saved global persistence state and the other game objects previously created on Factorio’s side.
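One way to make new code tolerate older saved state is to fill in missing fields without touching what is already there. This is a sketch under assumptions: global is a plain table pretending to be an old save, and ensure_schema, settings and schema_version are hypothetical names.

```lua
-- Stand-in for a `global` restored from a save made by the older code (assumption).
local global = { scores = { alice = 3 } }

-- Hypothetical upgrade step for the newer code: add fields the old save lacks,
-- without overwriting anything that was already persisted.
local function ensure_schema()
    global.scores = global.scores or {}
    global.settings = global.settings or { difficulty = "normal" }
    global.schema_version = global.schema_version or 2
end

ensure_schema()
```

Where such a step may run in a real script depends on the data lifecycle; in particular it must not write to global from on_load.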

PPS: Almost 11k characters, 1700 words… quite an article. How many Ko-Fis can I demand now? <3