X-Git-Url: https://git.librecmc.org/?a=blobdiff_plain;f=docs%2Fkeep_data_small.txt;h=21d732674c64f3a456b7ca7d992234b053fe0465;hb=c2a06db69de7562024524a89a7b0f0f7e61c5999;hp=fcd8df4a93877ed1c9d0dde4d535bd4964f741bd;hpb=3d101dd4670e449a064bd8ea88d5343d83144f49;p=oweals%2Fbusybox.git

diff --git a/docs/keep_data_small.txt b/docs/keep_data_small.txt
index fcd8df4a9..21d732674 100644
--- a/docs/keep_data_small.txt
+++ b/docs/keep_data_small.txt
@@ -2,50 +2,64 @@
 
 When many applets are compiled into busybox, all rw data and
 bss for each applet are concatenated. Including those from libc,
-if static bbox is built. When bbox is started, _all_ this data
+if static busybox is built. When busybox is started, _all_ this data
 is allocated, not just that one part for selected applet.
 
 What "allocated" exactly means, depends on arch.
-On nommu it's probably bites the most, actually using real
+On NOMMU it's probably bites the most, actually using real
 RAM for rwdata and bss. On i386, bss is lazily allocated
 by COWed zero pages. Not sure about rwdata - also COW?
 
-In order to keep bbox NOMMU and small-mem systems friendly
+In order to keep busybox NOMMU and small-mem systems friendly
 we should avoid large global data in our applets, and should
 minimize usage of libc functions which implicitly use
-such structures in libc.
-
-Small experiment measures "parasitic" bbox memory consumption.
-Here we start 1000 "busybox sleep 10" in parallel.
-bbox binary is practically allyesconfig static one,
-built against uclibc:
-
-bash-3.2# nmeter '%t %c %b %m %p %[pn]'
-23:17:28 .......... 0 0 168M 0 147
-23:17:29 .......... 0 0 168M 0 147
-23:17:30 U......... 0 0 168M 1 147
-23:17:31 SU........ 0 188k 181M 244 391
-23:17:32 SSSSUUU... 0 0 223M 757 1147
-23:17:33 UUU....... 0 0 223M 0 1147
-23:17:34 U......... 0 0 223M 1 1147
-23:17:35 .......... 0 0 223M 0 1147
-23:17:36 .......... 0 0 223M 0 1147
-23:17:37 S......... 0 0 223M 0 1147
-23:17:38 .......... 0 0 223M 1 1147
-23:17:39 .......... 0 0 223M 0 1147
-23:17:40 .......... 0 0 223M 0 1147
-23:17:41 .......... 0 0 210M 0 906
-23:17:42 .......... 0 0 168M 1 147
-23:17:43 .......... 0 0 168M 0 147
+such structures.
+
+Small experiment to measure "parasitic" bbox memory consumption:
+here we start 1000 "busybox sleep 10" in parallel.
+busybox binary is practically allyesconfig static one,
+built against uclibc. Run on x86-64 machine with 64-bit kernel:
+
+bash-3.2# nmeter '%t %c %m %p %[pn]'
+23:17:28 .......... 168M 0 147
+23:17:29 .......... 168M 0 147
+23:17:30 U......... 168M 1 147
+23:17:31 SU........ 181M 244 391
+23:17:32 SSSSUUU... 223M 757 1147
+23:17:33 UUU....... 223M 0 1147
+23:17:34 U......... 223M 1 1147
+23:17:35 .......... 223M 0 1147
+23:17:36 .......... 223M 0 1147
+23:17:37 S......... 223M 0 1147
+23:17:38 .......... 223M 1 1147
+23:17:39 .......... 223M 0 1147
+23:17:40 .......... 223M 0 1147
+23:17:41 .......... 210M 0 906
+23:17:42 .......... 168M 1 147
+23:17:43 .......... 168M 0 147
 
 This requires 55M of memory. Thus 1 trivial busybox applet
-takes 55k of memory.
+takes 55k of memory on 64-bit x86 kernel.
+
+On 32-bit kernel we need ~26k per applet.
+
+Script:
+
+i=1000; while test $i != 0; do
+	echo -n .
+	busybox sleep 30 &
+	i=$((i - 1))
+done
+echo
+wait
+
+(Data from NOMMU arches are sought. Provide 'size busybox' output too)
 
 
 		Example 1
 
 One example how to reduce global data usage is in
-archival/libunarchive/decompress_unzip.c:
+archival/libarchive/decompress_unzip.c:
 
 /* This is somewhat complex-looking arrangement, but it allows
  * to place decompressor state either in bss or in
@@ -61,7 +75,7 @@ archival/libunarchive/decompress_unzip.c:
 (see the rest of the file to get the idea)
 
 This example completely eliminates globals in that module.
-Required memory is allocated in inflate_gunzip() [its main module]
+Required memory is allocated in unpack_gz_stream() [its main module]
 and then passed down to all subroutines which need to access 'globals'
 as a parameter.
 
@@ -85,9 +99,9 @@ and then declare that ptr_to_globals is a pointer to it:
 
 ptr_to_globals is declared as constant pointer.
 This helps gcc understand that it won't change, resulting in noticeably
-smaller code. In order to assign it, use PTR_TO_GLOBALS macro:
+smaller code. In order to assign it, use SET_PTR_TO_GLOBALS macro:
 
-	PTR_TO_GLOBALS = xzalloc(sizeof(G));
+	SET_PTR_TO_GLOBALS(xzalloc(sizeof(G)));
 
 Typically it is done in <applet>_main().
 
@@ -104,8 +118,12 @@ its needs. Library functions are prohibited from using it.
 
 #define G (*(struct globals*)&bb_common_bufsiz1)
 
-Be careful, though, and use it only if
-sizeof(struct globals) <= sizeof(bb_common_bufsiz1).
+Be careful, though, and use it only if globals fit into bb_common_bufsiz1.
+Since bb_common_bufsiz1 is BUFSIZ + 1 bytes long and BUFSIZ can change
+from one libc to another, you have to add compile-time check for it:
+
+if (sizeof(struct globals) > sizeof(bb_common_bufsiz1))
+	BUG_<applet>_globals_too_big();
 
 
 		Drawbacks
@@ -127,6 +145,11 @@ one of above methods is not worth the resulting code obfuscation.
 If you have less than ~300 bytes of global data - don't bother.
 
 
+		Finding non-shared duplicated strings
+
+strings busybox | sort | uniq -c | sort -nr
+
+
 		gcc's data alignment problem
 
 The following attribute added in vi.c:
@@ -135,7 +158,7 @@ static int tabstop;
 static struct termios term_orig __attribute__ ((aligned (4)));
 static struct termios term_vi __attribute__ ((aligned (4)));
 
-reduced bss size by 32 bytes, because gcc sometimes aligns structures to
+reduces bss size by 32 bytes, because gcc sometimes aligns structures to
 ridiculously large values. asm output diff for above example:
 
  tabstop:
@@ -154,3 +177,80 @@ ridiculously large values. asm output diff for above example:
 	.size term_vi, 60
 
 gcc doesn't seem to have options for altering this behaviour.
+
+gcc 3.4.3 and 4.1.1 tested:
+char c = 1;
+// gcc aligns to 32 bytes if sizeof(struct) >= 32
+struct {
+    int a,b,c,d;
+    int i1,i2,i3;
+} s28 = { 1 };    // struct will be aligned to 4 bytes
+struct {
+    int a,b,c,d;
+    int i1,i2,i3,i4;
+} s32 = { 1 };    // struct will be aligned to 32 bytes
+// same for arrays
+char vc31[31] = { 1 }; // unaligned
+char vc32[32] = { 1 }; // aligned to 32 bytes
+
+-fpack-struct=1 reduces alignment of s28 to 1 (but probably
+will break layout of many libc structs) but s32 and vc32
+are still aligned to 32 bytes.
+
+I will try to cook up a patch to add a gcc option for disabling it.
+Meanwhile, this is where it can be disabled in gcc source:
+
+gcc/config/i386/i386.c
+int
+ix86_data_alignment (tree type, int align)
+{
+#if 0
+  if (AGGREGATE_TYPE_P (type)
+       && TYPE_SIZE (type)
+       && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
+       && (TREE_INT_CST_LOW (TYPE_SIZE (type)) >= 256
+           || TREE_INT_CST_HIGH (TYPE_SIZE (type))) && align < 256)
+    return 256;
+#endif
+
+Result (non-static busybox built against glibc):
+
+# size /usr/srcdevel/bbox/fix/busybox.t0/busybox busybox
+   text    data     bss     dec     hex filename
+ 634416    2736   23856  661008   a1610 busybox
+ 632580    2672   22944  658196   a0b14 busybox_noalign
+
+
+
+		Keeping code small
+
+Set CONFIG_EXTRA_CFLAGS="-fno-inline-functions-called-once",
+produce "make bloatcheck", see the biggest auto-inlined functions.
+Now, set CONFIG_EXTRA_CFLAGS back to "", but add NOINLINE
+to some of these functions. In 1.16.x timeframe, the results were
+(annotated "make bloatcheck" output):
+
+function             old     new   delta
+expand_vars_to_list    -    1712   +1712 win
+lzo1x_optimize         -    1429   +1429 win
+arith_apply            -    1326   +1326 win
+read_interfaces        -    1163   +1163 loss, leave w/o NOINLINE
+logdir_open            -    1148   +1148 win
+check_deps             -    1148   +1148 loss
+rewrite                -    1039   +1039 win
+run_pipe             358    1396   +1038 win
+write_status_file      -    1029   +1029 almost the same, leave w/o NOINLINE
+dump_identity          -     987    +987 win
+mainQSort3             -     921    +921 win
+parse_one_line         -     916    +916 loss
+summarize              -     897    +897 almost the same
+do_shm                 -     884    +884 win
+cpio_o                 -     863    +863 win
+subCommand             -     841    +841 loss
+receive                -     834    +834 loss
+
+855 bytes saved in total.
+
+scripts/mkdiff_obj_bloat may be useful to automate this process: run
+"scripts/mkdiff_obj_bloat NORMALLY_BUILT_TREE FORCED_NOINLINE_TREE"
+and select modules which shrank.
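
Note on the ptr_to_globals hunk (-85,9 +99,9): the pattern it documents can be
tried outside busybox with a minimal sketch like the one below. This is not
busybox code. It assumes plain calloc() in place of xzalloc(), a non-const
pointer in place of the constant ptr_to_globals set via SET_PTR_TO_GLOBALS, and
the struct fields and the helpers init_globals(), remember_line() and
free_globals() are hypothetical names made up for illustration. The point it
shows is that the applet's state costs one pointer of bss until it actually runs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* All would-be global variables of a hypothetical applet live here. */
struct globals {
	unsigned line_count;
	char last_line[256];
};

/* Only this pointer sits in bss, not sizeof(struct globals) bytes.
 * (Assumption: plain pointer; busybox declares its pointer const.) */
static struct globals *ptr_to_globals;
#define G (*ptr_to_globals)

/* Allocate zeroed globals. Returns 0 on success, -1 on failure.
 * (Assumption: calloc() stands in for busybox's xzalloc().) */
static int init_globals(void)
{
	ptr_to_globals = calloc(1, sizeof(G));
	return ptr_to_globals ? 0 : -1;
}

/* Example user of the globals: count lines and keep the latest one. */
static void remember_line(const char *s)
{
	assert(ptr_to_globals != NULL);
	G.line_count++;
	strncpy(G.last_line, s, sizeof(G.last_line) - 1);
	G.last_line[sizeof(G.last_line) - 1] = '\0';
}

/* Release the globals and reset the pointer. */
static void free_globals(void)
{
	free(ptr_to_globals);
	ptr_to_globals = NULL;
}
```

A caller would run init_globals() once at the start of its main function, use
G.field wherever a global variable would otherwise appear, and call
free_globals() before exit if it cares about leak checkers.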