Help me decipher Python's marshaled code object. A .pyc file is almost the same thing, so this is also about the structure of a .pyc file.

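I have:


The code object compiled from the source code.
The marshaled representation of this code object.
A recursive disassembly of the code sections of the code object.
All of its field values.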

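Main goal:

I want to find out how different code objects are stored and how they reference each other. That is, how is the link to a child code object stored? The module should reference all of its functions, and a function should hold references to all the other functions that can be called from it. When the virtual machine stores a code object in a .pyc file, does it keep the object's id? I don't think so, because I can't see the id in the .pyc file. For example, the disassembly of my source contains this instruction: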

LOAD_CONST        2 (<code object baz at 0x7f380995e5d0, file "foo.py", line 7>)


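So:


How does the virtual machine find the baz code object? I can't see any of this information (0x7f380995e5d0, file "foo.py", line 7) in the marshaled string. Is the object id 0x7f380995e5d0 stored in the marshaled code, or is the object created anew on every run of the program?
If the id is not stored, how are the connections between objects preserved in the marshaled code object (.pyc file)?

I think I will investigate further with gdb, but perhaps this approach (deciphering the .pyc file) will do the job as well.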

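Current result:

I used all this information to create the file below: the first column is the marshaled binary representation of the code object, and the second is the meaning I have worked out for each byte sequence.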

b'
\xe3                    <don't know>
\x00\x00\x00\x00        <foo.py: co_argcount: 0>
\x00\x00\x00\x00        <foo.py: co_kwonlyargcount: 0>
\x00\x00\x00\x00        <foo.py: co_nlocals: 0>
\x03\x00\x00\x00        <foo.py: co_stacksize: 3>               
@\x00\x00\x00           <foo.py: co_flags = '@' = 0x40 = 64>
s.\x00\x00\x00          <foo.py: number of bytes for module instructions = '.' = 46>
d\x00                   <foo.py: co_code:  0 LOAD_CONST        0 (1)
Z\x00                   <foo.py: co_code:  2 STORE_NAME        0 (a)
d\x01                   <foo.py: co_code:  4 LOAD_CONST        1 (2)
Z\x01                   <foo.py: co_code:  6 STORE_NAME        1 (b)
e\x00                   <foo.py: co_code:  8 LOAD_NAME         0 (a)
e\x01                   <foo.py: co_code: 10 LOAD_NAME         1 (b)
\x17\x00                <foo.py: co_code: 12 BINARY_ADD
Z\x02                   <foo.py: co_code: 14 STORE_NAME        2 (c)
d\x02                   <foo.py: co_code: 16 LOAD_CONST        2 (<code object baz at 0x7f380995e5d0, file "foo.py", line 7>)
d\x03                   <foo.py: co_code: 18 LOAD_CONST        3 ('baz')
\x84\x00                <foo.py: co_code: 20 MAKE_FUNCTION     0
Z\x03                   <foo.py: co_code: 22 STORE_NAME        3 (baz)
e\x03                   <foo.py: co_code: 24 LOAD_NAME         3 (baz)
e\x00                   <foo.py: co_code: 26 LOAD_NAME         0 (a)
e\x01                   <foo.py: co_code: 28 LOAD_NAME         1 (b)
\x83\x02                <foo.py: co_code: 30 CALL_FUNCTION     2
Z\x04                   <foo.py: co_code: 32 STORE_NAME        4 (multiplication)
e\x04                   <foo.py: co_code: 34 LOAD_NAME         4 (multiplication)
d\x01                   <foo.py: co_code: 36 LOAD_CONST        1 (2)
\x13\x00                <foo.py: co_code: 38 BINARY_POWER
Z\x05                   <foo.py: co_code: 40 STORE_NAME        5 (square)
d\x04                   <foo.py: co_code: 42 LOAD_CONST        4 (None)
S\x00                   <foo.py: co_code: 44 RETURN_VALUE
)\x05                   <foo.py: co_const: size>
\xe9\x01\x00\x00\x00    <foo.py: co_const[0]: 1>
\xe9\x02\x00\x00\x00    <foo.py: co_const[1]: 2>
c                       <TYPE_CODE>
\x02\x00\x00\x00        <baz: co_argcount: 2>
\x00\x00\x00\x00        <baz: co_kwonlyargcount: 0>
\x02\x00\x00\x00        <baz: co_nlocals: 2>
\x02\x00\x00\x00        <baz: co_stacksize: 2>               
C\x00\x00\x00           <baz: co_flags = 'C' = 0x43 = 67>
s\x08\x00\x00\x00       <baz: co_code: size = 8 bytes>
|\x00                   <baz: co_code: 0 LOAD_FAST                0 (x) 
|\x01                   <baz: co_code: 2 LOAD_FAST                1 (y) 
\x14\x00                <baz: co_code: 4 BINARY_MULTIPLY                
S\x00                   <baz: co_code: 6 RETURN_VALUE                   
)\x01                   <baz: co_const: size>
N                       <baz: co_const[0]: None>
\xa9\x00                <don't know> 
)\x02                   <baz: co_varnames: size>
\xda\x01                <baz: number of characters of next item>
x                       <baz: co_varnames[0]: x>
\xda\x01                <baz: number of characters of next item>
y                       <baz: co_varnames[1]: y>
r\x03\x00\x00\x00       <baz: don't know. But the 'r' = 'TYPE_REF'>
r\x03\x00\x00\x00       <baz: don't know. But the 'r' = 'TYPE_REF'>
\xfa\x06                <baz: next item length>
foo.py                  <baz: co_filename>
\xda\x03                <baz: number of characters of next item>
baz                     <baz: co_name: 'baz'>
\x07\x00\x00\x00        <baz: co_firstlineno: 7>
s\x02\x00\x00\x00       <baz: co_lnotab: size = 2 >
\x00\x01                <baz: co_lnotab>
r\x07\x00\x00\x00       <foo.py: co_const[3]: reference to baz>
N                       <foo.py: co_const[4]: None>
)\x06                   <foo.py: co_names: size> 
\xda\x01                <foo.py: number of characters of next item>
a                       <foo.py: co_names[0]: a>
\xda\x01                <foo.py: number of characters of next item>
b                       <foo.py: co_names[1]: b>
\xda\x01                <foo.py: number of characters of next item>
c                       <foo.py: co_names[2]: c>
r\x07\x00\x00\x00       <foo.py: co_names[3]: reference to baz>
Z\x0e                   <foo.py: number of characters of next item>
multiplication          <foo.py: co_names[4]: multiplication>
Z\x06                   <foo.py: number of characters of next item>
square                  <foo.py: co_names[5]: square>
r\x03\x00\x00\x00       <foo.py: don't know>     
r\x03\x00\x00\x00       <foo.py: don't know>     
r\x03\x00\x00\x00       <foo.py: don't know>     
r\x06\x00\x00\x00       <foo.py: don't know>     
\xda\x08                <foo.py: number of characters of next item>
<module>                <foo.py: co_name>
\x03\x00\x00\x00        <foo.py: co_firstlineno>
s\n\x00\x00\x00         <foo.py: co_lnotab: size = '\n' = 0A>
\x04\x01                <foo.py: co_lnotab>
\x04\x01                <foo.py: co_lnotab>
\x08\x02                <foo.py: co_lnotab>
\x08\x07                <foo.py: co_lnotab>
\n\x01'                 <foo.py: co_lnotab>



Code snippets needed to reproduce this:

1) Source code: foo.py

a = 1 
b = 2 
c = a + b 

def baz(x,y):
    return x * y

multiplication = baz(a,b)
square = multiplication ** 2


2) Marshaled representation of foo.py.

import marshal

source_py = "foo.py"

with open(source_py) as f_source:
    source_code = f_source.read()

code_obj_compile = compile(source_code, source_py, "exec")

data = marshal.dumps(code_obj_compile)

print(data)


3) Full (recursive) disassembly of the code object.

import dis
import types

dis.dis(code_obj_compile)

for x in code_obj_compile.co_consts:
    if isinstance(x, types.CodeType):
        sub_byte_code = x
        func_name = sub_byte_code.co_name
        print('\nDisassembly of %s:' % func_name)
        dis.dis(sub_byte_code)


4) Field values of all code objects.

def print_co_obj_fields(code_obj):
    # Iterating through all instance attributes
    # and calling all having the 'co_' prefix
    for name in dir(code_obj):
        if name.startswith('co_'):
            co_field = getattr(code_obj, name)
            print(f'{name:<20} = {co_field}')

print_co_obj_fields(code_obj_compile)


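Answer #1

The answer below refers to Python 2.7.

How does the virtual machine find the baz code object? I can't see any of this information (0x7f380995e5d0, file "foo.py", line 7) in the marshaled string. Is the object id 0x7f380995e5d0 stored in the marshaled code, or is it created each time the program runs?


The baz code object lives inside the co_consts member. Take your example.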

>>> import marshal
>>> import dis
>>> 
>>> source_py = "foo.py"
>>> 
>>> with open(source_py) as f_source:
...     source_code = f_source.read()
>>> 

>>> code_obj_compile = compile(source_code, source_py, "exec")


If you disassemble the newly generated code object, you can find the reference to baz:

>>> dis.dis(code_obj_compile)
  1           0 LOAD_CONST               0 (7)
              3 STORE_NAME               0 (a)

  2           6 LOAD_CONST               1 (5)
              9 STORE_NAME               1 (b)

  3          12 LOAD_NAME                0 (a)
             15 LOAD_NAME                1 (b)
             18 BINARY_ADD
             19 STORE_NAME               2 (c)

  5          22 LOAD_CONST               2 (<code object baz at 0x7f1dcdb06bb0, file "foo.py", line 5>)
             25 MAKE_FUNCTION            0
... snip...


The baz code object is stored in the co_consts array of the parent code object, as shown below.

>>> code_obj_compile.co_consts[2]
<code object baz at 0x7f1dcdb06bb0, file "foo.py", line 5>


It can be disassembled as well.
>>> dis.dis(code_obj_compile.co_consts[2])
  6           0 LOAD_FAST                0 (x)
              3 LOAD_FAST                1 (y)
              6 BINARY_MULTIPLY
              7 RETURN_VALUE


The object is created anew every time the program runs, so its address changes from run to run.
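
A small sketch of this point, using only the standard marshal module on Python 3: loading the same marshaled bytes twice yields two distinct code objects with identical contents, so the address shown in the repr cannot be part of the serialized data. The inline source string and the helper name nested_code are assumptions made for illustration.

```python
import marshal
import types

# Hypothetical stand-in for foo.py, inlined so the sketch is self-contained.
source = "def baz(x, y):\n    return x * y\n"
module_code = compile(source, "foo.py", "exec")
data = marshal.dumps(module_code)

# Each load rebuilds the whole object tree from the bytes.
first = marshal.loads(data)
second = marshal.loads(data)

def nested_code(code):
    """Return the first code object stored in code.co_consts."""
    return next(c for c in code.co_consts if isinstance(c, types.CodeType))

baz_first = nested_code(first)
baz_second = nested_code(second)

print(baz_first is baz_second)                   # two separate objects
print(baz_first.co_code == baz_second.co_code)   # same bytecode
print(hex(id(baz_first)), hex(id(baz_second)))   # addresses assigned at load time
```

The ids printed on the last line will generally differ between runs, which is why they are not worth looking for in the dump.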


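If the id is not stored, how are the connections between objects preserved in the marshaled code object (.pyc file)?


Just to explain: if you look closely at the instruction, you will see that LOAD_CONST takes an offset as its argument (operand).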

  5          22 LOAD_CONST               2 (<code object baz at 0x7f1dcdb06bb0, file "foo.py", line 5>)


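The offset here is 2, which tells the Python virtual machine to load the item at index 2 (the third item, counting from zero) of the co_consts array onto the evaluation stack. So the "connection" is preserved through offsets into the other metadata members.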
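
The role of the operand can be checked with dis.get_instructions. The sketch below is for Python 3 (the answer above shows Python 2.7 output), and the inline source string is an assumption standing in for foo.py. It looks for the instruction whose constant is a code object and confirms that its operand is simply an index into the parent's co_consts.

```python
import dis
import types

# Hypothetical stand-in for foo.py.
source = "a = 1\ndef baz(x, y):\n    return x * y\n"
module_code = compile(source, "foo.py", "exec")

# Collect every instruction whose resolved constant is a nested code object.
loads = [
    ins for ins in dis.get_instructions(module_code)
    if isinstance(ins.argval, types.CodeType)
]

for ins in loads:
    # ins.arg is the raw operand; it indexes the parent's co_consts tuple.
    child = module_code.co_consts[ins.arg]
    print(ins.opname, ins.arg, child.co_name, child is ins.argval)
```

No address or id is involved at any point: the bytecode only carries the index, and the child code object sits at that position in the tuple.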

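Answer #2

The purpose of code object marshaling is to store a program to a file and to restore it from that file. It therefore needs an encoding scheme for everything a Python program is made of: objects, bytecode, names and so on. Otherwise the program could not be restored from the file.
For this reason it uses a number of type identifiers, which can be divided into four groups: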


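Single type: {type identifier}, 1 byte in size.
 Example: TYPE_NONE = 'N', TYPE_TRUE = 'T'.



Short type: {type identifier} + 1-byte length + value.
 Example: TYPE_SHORT_ASCII_INTERNED = 'Z'.



Long type: {type identifier} + 4-byte length + value.
 Example: TYPE_STRING = 's'.



Object type: {type identifier} + a combination of all the other types, including the object type itself. In other words, it has a recursive structure.
 Example: TYPE_CODE = 'c'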
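
These groups can be observed directly with marshal.dumps. The sketch below is written for Python 3. The reference flag (0x80) may or may not be set on the first byte depending on the interpreter, so it is masked off before comparing. The helper name type_code is made up for this illustration, and a tuple is used as the container because it is shorter than a code object while following the same recursive scheme.

```python
import marshal

FLAG_REF = 0x80  # high bit of the type byte, as defined in marshal.c

def type_code(data):
    """Return the type identifier of a marshaled value with FLAG_REF cleared."""
    return chr(data[0] & ~FLAG_REF)

# Single type: one byte, no payload.
print(marshal.dumps(None), marshal.dumps(True))

# Short type: identifier + 1-byte length + the characters.
short = marshal.dumps("baz")
print(type_code(short), short[1], short[2:])

# Long type: identifier + 4-byte little-endian length + the bytes.
long_ = marshal.dumps(b"\x00\x01")
print(type_code(long_), int.from_bytes(long_[1:5], "little"), long_[5:])

# Object type: a container whose payload is made of other marshaled values.
nested = marshal.dumps((None, "baz"))
print(type_code(nested), nested[1], nested[2:])
```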



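All the types can be seen here: cpython/Python/marshal.c
In addition, a code object has several int fields. They have no identifier in the marshaled string; each is written as a bare four-byte value.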
    int co_argcount;            /* #arguments, except *args */
    int co_kwonlyargcount;      /* #keyword only arguments */
    int co_nlocals;             /* #local variables */
    int co_stacksize;           /* #entries needed for evaluation stack */
    int co_flags;               /* CO_..., see below */
    int co_firstlineno;         /* first source line number */
    

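The full code object structure is here: cpython/Include/code.h
It is useful to know the order in which the fields of a code object are dumped, because then we can work out the offset of each field in the resulting string. For example, the first four bytes after the type byte are co_argcount, the next four are co_kwonlyargcount, and so on.
How a code object is dumped:
    /* PyCodeObject *co - pointer to the code object
     * p                - pointer to the file object that accumulates
     *                    the marshaled code object before it is
     *                    written to the file. */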
    
    W_TYPE(TYPE_CODE, p);
    w_long(co->co_argcount, p);
    w_long(co->co_kwonlyargcount, p);
    w_long(co->co_nlocals, p);
    w_long(co->co_stacksize, p);
    w_long(co->co_flags, p);
    w_object(co->co_code, p);
    w_object(co->co_consts, p);
    w_object(co->co_names, p);
    w_object(co->co_varnames, p);
    w_object(co->co_freevars, p);
    w_object(co->co_cellvars, p);
    w_object(co->co_filename, p);
    w_object(co->co_name, p);
    w_long(co->co_firstlineno, p);
    w_object(co->co_lnotab, p);
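
Because the integer fields are written without type identifiers, they can be read back with plain struct unpacking. The following is a minimal sketch rather than a full parser: it reads only the type byte and co_argcount, on the assumption that the set and order of the later fields listed above belong to one CPython version and differ in others. The inline source is an assumption standing in for foo.py.

```python
import marshal
import struct
import types

# Hypothetical stand-in for foo.py.
source = "def baz(x, y):\n    return x * y\n"
module_code = compile(source, "foo.py", "exec")
baz_code = next(c for c in module_code.co_consts if isinstance(c, types.CodeType))

data = marshal.dumps(baz_code)

# Byte 0: TYPE_CODE ('c'), possibly with the 0x80 reference flag set.
type_byte = chr(data[0] & 0x7F)

# Bytes 1..4: co_argcount as a little-endian 32-bit integer, no identifier.
(argcount,) = struct.unpack_from("<i", data, 1)

print(type_byte, argcount, baz_code.co_argcount)
```

Reading further fields the same way requires checking the w_long / w_object sequence in marshal.c for the interpreter version that produced the data.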

Result: the marshaled string for foo.py, fully deciphered:
b'
\xe3                    <foo.py: '\xe3' & ~0x80 (FLAG_REF cleared) = 'c' (TYPE_CODE)>
\x00\x00\x00\x00        <foo.py: co_argcount: 0>
\x00\x00\x00\x00        <foo.py: co_kwonlyargcount: 0>
\x00\x00\x00\x00        <foo.py: co_nlocals: 0>
\x03\x00\x00\x00        <foo.py: co_stacksize: 3>               
@\x00\x00\x00           <foo.py: co_flags = '@' = 0x40 = 64>
s.\x00\x00\x00          <foo.py: number of bytes for module instructions = '.' = 46>
d\x00                   <foo.py: co_code:  0 LOAD_CONST        0 (1)
Z\x00                   <foo.py: co_code:  2 STORE_NAME        0 (a)
d\x01                   <foo.py: co_code:  4 LOAD_CONST        1 (2)
Z\x01                   <foo.py: co_code:  6 STORE_NAME        1 (b)
e\x00                   <foo.py: co_code:  8 LOAD_NAME         0 (a)
e\x01                   <foo.py: co_code: 10 LOAD_NAME         1 (b)
\x17\x00                <foo.py: co_code: 12 BINARY_ADD
Z\x02                   <foo.py: co_code: 14 STORE_NAME        2 (c)
d\x02                   <foo.py: co_code: 16 LOAD_CONST        2 (<code object baz at 0x7f380995e5d0, file "foo.py", line 7>)
d\x03                   <foo.py: co_code: 18 LOAD_CONST        3 ('baz')
\x84\x00                <foo.py: co_code: 20 MAKE_FUNCTION     0
Z\x03                   <foo.py: co_code: 22 STORE_NAME        3 (baz)
e\x03                   <foo.py: co_code: 24 LOAD_NAME         3 (baz)
e\x00                   <foo.py: co_code: 26 LOAD_NAME         0 (a)
e\x01                   <foo.py: co_code: 28 LOAD_NAME         1 (b)
\x83\x02                <foo.py: co_code: 30 CALL_FUNCTION     2
Z\x04                   <foo.py: co_code: 32 STORE_NAME        4 (multiplication)
e\x04                   <foo.py: co_code: 34 LOAD_NAME         4 (multiplication)
d\x01                   <foo.py: co_code: 36 LOAD_CONST        1 (2)
\x13\x00                <foo.py: co_code: 38 BINARY_POWER
Z\x05                   <foo.py: co_code: 40 STORE_NAME        5 (square)
d\x04                   <foo.py: co_code: 42 LOAD_CONST        4 (None)
S\x00                   <foo.py: co_code: 44 RETURN_VALUE
)\x05                   <foo.py: co_const: size>
\xe9\x01\x00\x00\x00    <foo.py: co_const[0]: 1; '\xe9' & ~0x80 (FLAG_REF cleared) = 'i' (TYPE_INT)>
\xe9\x02\x00\x00\x00    <foo.py: co_const[1]: 2; '\xe9' & ~0x80 (FLAG_REF cleared) = 'i' (TYPE_INT)>
c                       <foo.py: co_const[2]: 'c' = TYPE_CODE>
\x02\x00\x00\x00        <baz: co_argcount: 2>
\x00\x00\x00\x00        <baz: co_kwonlyargcount: 0>
\x02\x00\x00\x00        <baz: co_nlocals: 2>
\x02\x00\x00\x00        <baz: co_stacksize: 2>               
C\x00\x00\x00           <baz: co_flags = 'C' = 0x43 = 67>
s\x08\x00\x00\x00       <baz: co_code: size = 8 bytes>
|\x00                   <baz: co_code: 0 LOAD_FAST                0 (x) 
|\x01                   <baz: co_code: 2 LOAD_FAST                1 (y) 
\x14\x00                <baz: co_code: 4 BINARY_MULTIPLY                
S\x00                   <baz: co_code: 6 RETURN_VALUE                   
)\x01                   <baz: co_const: size>
N                       <baz: co_const[0]: None>
\xa9\x00                <baz: co_names: size = 0; '\xa9' & ~0x80 (FLAG_REF cleared) = ')'>
)\x02                   <baz: co_varnames: size>
\xda\x01                <baz: number of characters of next item; '\xda' & ~0x80 (FLAG_REF cleared) = 'Z'>
x                       <baz: co_varnames[0]: x>
\xda\x01                <baz: number of characters of next item; '\xda' & ~0x80 (FLAG_REF cleared) = 'Z'>
y                       <baz: co_varnames[1]: y>
r\x03\x00\x00\x00       <baz: co_freevars: reference to empty tuple '()'>     
r\x03\x00\x00\x00       <baz: co_cellvars: reference to empty tuple '()'>
\xfa\x06                <baz: next item length>
foo.py                  <baz: co_filename>
\xda\x03                <baz: number of characters of next item>
baz                     <baz: co_name: 'baz'>
\x07\x00\x00\x00        <baz: co_firstlineno: 7>
s\x02\x00\x00\x00       <baz: co_lnotab: size = 2 >
\x00\x01                <baz: co_lnotab>
r\x07\x00\x00\x00       <foo.py: co_const[3]: reference to 'baz'>
N                       <foo.py: co_const[4]: None>
)\x06                   <foo.py: co_names: size> 
\xda\x01                <foo.py: number of characters of next item>
a                       <foo.py: co_names[0]: a>
\xda\x01                <foo.py: number of characters of next item>
b                       <foo.py: co_names[1]: b>
\xda\x01                <foo.py: number of characters of next item>
c                       <foo.py: co_names[2]: c>
r\x07\x00\x00\x00       <foo.py: co_names[3]: reference to 'baz'>
Z\x0e                   <foo.py: number of characters of next item>
multiplication          <foo.py: co_names[4]: multiplication>
Z\x06                   <foo.py: number of characters of next item>
square                  <foo.py: co_names[5]: square>
r\x03\x00\x00\x00       <foo.py: co_varnames: reference to empty tuple '()'>     
r\x03\x00\x00\x00       <foo.py: co_freevars: reference to empty tuple '()'>
r\x03\x00\x00\x00       <foo.py: co_cellvars: reference to empty tuple '()'>
r\x06\x00\x00\x00       <foo.py: co_filename: reference to 'foo.py'>     
\xda\x08                <foo.py: number of characters of next item>
<module>                <foo.py: co_name>
\x03\x00\x00\x00        <foo.py: co_firstlineno>
s\n\x00\x00\x00         <foo.py: co_lnotab: size = '\n' = 0A>
\x04\x01                <foo.py: co_lnotab>
\x04\x01                <foo.py: co_lnotab>
\x08\x02                <foo.py: co_lnotab>
\x08\x07                <foo.py: co_lnotab>
\n\x01'                 <foo.py: co_lnotab>
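
The r entries above can be reproduced on a small scale. In this sketch (Python 3, assuming the default marshal format version is 3 or later) the same string object appears twice in a tuple: its first occurrence is written in full and the second becomes a 5-byte r back-reference. Passing format version 2, which predates the reference mechanism, writes the string out twice. The variable names are made up for this illustration.

```python
import marshal

name = "foo.py"
pair = (name, name)  # the same object stored twice

with_refs = marshal.dumps(pair)        # default format version
without_refs = marshal.dumps(pair, 2)  # version 2 has no back-references

# With references, the text appears once and the dump ends in 'r' + 4-byte index.
print(with_refs)
print(with_refs.count(b"foo.py"), chr(with_refs[-5]),
      int.from_bytes(with_refs[-4:], "little"))

# Without references, the text is written out for each occurrence.
print(without_refs.count(b"foo.py"))

# On load, the back-reference resolves to the very same object.
restored = marshal.loads(with_refs)
print(restored[0] is restored[1])
```

The index after r counts the flagged objects in the order they were written, which is how r\x03, r\x06 and r\x07 in the dump map to the empty tuple, 'foo.py' and 'baz'.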

Useful information:
How to create a code object in python?