
[v2,bpf-next,0/3] bpf/verifier: subprog/func_call simplifications

Message ID d13f6b60-3632-3e4d-111a-2edb4198b473@solarflare.com

Message

Edward Cree March 29, 2018, 10:44 p.m. UTC
By storing subprog boundaries as a subprogno mark on each insn, rather than
 a start (and implicit end) for each subprog, we collect a number of gains:
* More efficient determination of which subprog contains a given insn, and
  thus of find_subprog (which subprog begins at a given insn).
* Number of verifier "full recursive walk" passes is reduced, since most of
  the work is done in the main insn walk (do_check()).  Leftover work in
  other passes is mostly linear scans (O(insn_cnt)) or, in the case of
  check_max_stack_depth(), a topological sort (O(subprog_cnt)).

Some other changes were also included to support this:
* Per-subprog info is stored in env->subprog_info, an array of structs,
  rather than several arrays with a common index.
* Call graph is now stored in the new bpf_subprog_info struct; used here
  for check_max_stack_depth() but may have other uses too.

Along with this, patch #3 puts parent pointers (used by liveness analysis)
 in the registers instead of the func_state or verifier_state, so that we
 don't need skip_callee() machinery.  This also does the right thing for
 stack slots, so they don't need their own special handling for liveness
 marking either.

Edward Cree (3):
  bpf/verifier: validate func_calls by marking at do_check() time
  bpf/verifier: update selftests
  bpf/verifier: per-register parent pointers

 include/linux/bpf_verifier.h                |  32 +-
 kernel/bpf/verifier.c                       | 631 +++++++++++++---------------
 tools/testing/selftests/bpf/test_verifier.c |  51 ++-
 3 files changed, 344 insertions(+), 370 deletions(-)

Comments

Edward Cree March 29, 2018, 10:50 p.m. UTC | #1
On 29/03/18 23:44, Edward Cree wrote:
> By storing subprog boundaries as a subprogno mark on each insn, rather than
>  a start (and implicit end) for each subprog, we collect a number of gains:
> * More efficient determination of which subprog contains a given insn, and
>   thus of find_subprog (which subprog begins at a given insn).
> * Number of verifier "full recursive walk" passes is reduced, since most of
>   the work is done in the main insn walk (do_check()).  Leftover work in
>   other passes is mostly linear scans (O(insn_cnt)) or, in the case of
>   check_max_stack_depth(), a topological sort (O(subprog_cnt)).
>
> Some other changes were also included to support this:
> * Per-subprog info is stored in env->subprog_info, an array of structs,
>   rather than several arrays with a common index.
> * Call graph is now stored in the new bpf_subprog_info struct; used here
>   for check_max_stack_depth() but may have other uses too.
>
> Along with this, patch #3 puts parent pointers (used by liveness analysis)
>  in the registers instead of the func_state or verifier_state, so that we
>  don't need skip_callee() machinery.  This also does the right thing for
>  stack slots, so they don't need their own special handling for liveness
>  marking either.
Whoops, forgot to add:
Changes from v1:
* No longer allows non-contiguous subprogs.
* No longer allows LD_ABS|IND and pseudo-calls in the same prog.

> Edward Cree (3):
>   bpf/verifier: validate func_calls by marking at do_check() time
>   bpf/verifier: update selftests
>   bpf/verifier: per-register parent pointers
>
>  include/linux/bpf_verifier.h                |  32 +-
>  kernel/bpf/verifier.c                       | 631 +++++++++++++---------------
>  tools/testing/selftests/bpf/test_verifier.c |  51 ++-
>  3 files changed, 344 insertions(+), 370 deletions(-)
>
Alexei Starovoitov April 3, 2018, 1:08 a.m. UTC | #2
On Thu, Mar 29, 2018 at 11:44:17PM +0100, Edward Cree wrote:
> By storing subprog boundaries as a subprogno mark on each insn, rather than
>  a start (and implicit end) for each subprog, we collect a number of gains:
> * More efficient determination of which subprog contains a given insn, and
>   thus of find_subprog (which subprog begins at a given insn).
> * Number of verifier "full recursive walk" passes is reduced, since most of
>   the work is done in the main insn walk (do_check()).  Leftover work in
>   other passes is mostly linear scans (O(insn_cnt)) or, in the case of
>   check_max_stack_depth(), a topological sort (O(subprog_cnt)).
> 
> Some other changes were also included to support this:
> * Per-subprog info is stored in env->subprog_info, an array of structs,
>   rather than several arrays with a common index.
> * Call graph is now stored in the new bpf_subprog_info struct; used here
>   for check_max_stack_depth() but may have other uses too.
> 
> Along with this, patch #3 puts parent pointers (used by liveness analysis)
>  in the registers instead of the func_state or verifier_state, so that we
>  don't need skip_callee() machinery.  This also does the right thing for
>  stack slots, so they don't need their own special handling for liveness
>  marking either.

I like patch 3 and am going to play with it.
How did you test it? Do you have processed_insn numbers for
cilium or selftests programs before and after?
There should be no difference, right?
Essentially it's trading the extra run-time cost of the skip_callee logic
for higher memory overhead due to parent pointers in every register
state. I guess that's ok, since it does clean things up nicely.

As far as patch 1, it was very difficult to review, since several logically
different things were lumped together. So breaking it apart:
- converting two arrays of subprog_starts and subprog_stack_depth into
  single array of struct bpf_subprog_info is a good thing
- tsort is interesting, but not sure it's correct. more below
- but main change of combining subprog logic with do_check is no good.
The verifier has to move towards compiler-style code analysis instead of
the other way around. It has to analyze the program in simple and
hopefully easy-to-understand passes instead of combining everything
into one loop. We _have_ to get rid of the do_check() approach and the
corresponding insn_processed counter. That was and still is
the biggest and most pressing issue we need to solve in bpf verification.
The verifier has to switch to compiler-style code analysis to scale.
The algorithms should be ones that scale with thousands of instructions
and thousands of functions. The only way I see of reaching that goal
is to do hierarchical program analysis in passes:
- identify subprograms
- identify basic blocks, build control flow graph
- build dom graph, ssa and so on
- and do per-basic block liveness and data flow analysis that propagates
  the combined result further into other BBs along cfg edges.
There will be no 'do_check() across whole program' walk.
Combining the subprog pass with do_check is going in the opposite direction
of this long-term work. Divide and conquer. Combining more things into
do_check is the opposite of this programming principle.
My short term plan is to split basic instruction correctness checks
out of do_check() loop into separate pass and run it early on.
It's necessary for bpf libraries to work, since the verifier cannot do
full data flow analysis at this point, but should build cfg and move
as much as possible out of instruction by instruction walk to scale
with number of bpf programs/libraries and sizes of them.

As far as tsort approach for determining max stack depth...
It's an interesting idea, but implementation is suffering from the same
'combine everything into one loop' coding issue.
I think updating total_stack_depth math should be separate from sorting,
since the same function can be part of different stacks with different max.
I don't see how updating global subprog_info[i].total_stack_depth can
be correct. It has to be different for every stack and the longest
stack is not necessarily the deepest. Maybe I'm missing something,
but the combined sort and stack_depth math didn't make it easy to review.
I find existing check_max_stack_depth() algo much easier to understand.
Edward Cree April 3, 2018, 1:39 p.m. UTC | #3
On 03/04/18 02:08, Alexei Starovoitov wrote:
> I like patch 3 and going to play with it.
> How did you test it ?
Just test_verifier and test_progs (the latter has a failure
 "test_bpf_obj_id:FAIL:get-prog-info(fd) err 0 errno 2 i 0 type 1(1) info_len 80(40) jit_enabled 0 jited_prog_len 0 xlated_prog_len 72"
 but that was already there before my patch).

> Do you have processed_insn numbers for
> cilium or selftests programs before and after?
> There should be no difference, right?
That's a good check, I'll do that.

> As far as patch 1 it was very difficult to review, since several logically
> different things clamped together. So breaking it apart:
> - converting two arrays of subprog_starts and subprog_stack_depth into
>   single array of struct bpf_subprog_info is a good thing
> - tsort is interesting, but not sure it's correct. more below
> - but main change of combining subprog logic with do_check is no good.
<snip>
> There will be no 'do_check() across whole program' walk.
> Combining subprog pass with do_check is going into opposite direction
> of this long term work. Divide and conquer. Combining more things into
> do_check is the opposite of this programming principle.
The main object of my change here was to change the data structures, to
 store a subprogno in each insn aux rather than using bsearch() on the
 subprog_starts.  I have now figured out the algorithm to do this in its
 own pass (previously I thought this needed a recursive walk which is why
 I wanted to roll it into do_check() - doing more than one whole-program
 recursive walk seems like a bad idea.)

> My short term plan is to split basic instruction correctness checks
> out of do_check() loop into separate pass and run it early on.
I agree with that short term plan, sounds like a good idea.
I'm still not sure I understand the long-term plan, though; since most
 insns' input registers will still need to be checked (I'm assuming
 majority of most real ebpf programs consists of computing and
 dereferencing pointers), the data flow analysis will have to end up
 doing all the same register updates current do_check() does (though
 potentially in a different order), e.g. if a function is called three
 times it will have to analyse it with three sets of input registers.
Unless you have some way of specifying function preconditions, I don't
 see how it works.  In particular something like
    char *f(char *p)
    {
        *p++ = 0;
        return p;
    }
    int main(void)
    {
        char *p = "abc"; /* represents getting a ptr from ctx or map */
        p = f(p);
        p = f(p);
        p = f(p);
        return 0;
    }
 seems as though it would be difficult to analyse in any way more
 scalable than the current full recursive walk.  Unless you somehow
 propagate register state _backwards_ as constraints when analysing a
 function...?  In any case it seems like there are likely to be things
 which current verifier accepts which require 'whole-program' analysis
 to determine the dataflow (e.g. what if there were some conditional
 branches in f(), and the preconditions on p depended on the value of
 some other arg in r2?)

> As far as tsort approach for determining max stack depth...
> It's an interesting idea, but implementation is suffering from the same
> 'combine everything into one loop' coding issue.
> I think updating total_stack_depth math should be separate from sorting,
> since the same function can be part of different stacks with different max.
> I don't see how updating global subprog_info[i].total_stack_depth can
> be correct. It has to be different for every stack and the longest
> stack is not necessarily the deepest. May be I'm missing something,
> but combined sort and stack_depth math didn't make it easy to review.
> I find existing check_max_stack_depth() algo much easier to understand.
The sort _is_ the way to compute total stack depth.  The sort order isn't
 being stored anywhere; it's being done just so that each subprog gets
 looked at after all its callers have been considered.  So when it gets
 selected as a 'frontier node', its maximum stack depth is known, and can
 thus be used to update its callees (note that we do a max_t() with each
 callee's existing total_stack_depth, thus getting the deepest stack of
 all call chains to the function).
It may help to imagine drawing the call graph and labelling each node with
 a stack depth as it is visited; sadly that's difficult to show in an email
 (or a code comment).  But I can try to explain it a bit better than
 "/* Update callee total stack depth */".

I will also try to split up patch #1 into more pieces.  I mistakenly thought
 that existing check_max_stack_depth() depended on some invariants that I was
 removing, but I guess that was only true while I had non-contiguous subprogs.

-Ed
Alexei Starovoitov April 3, 2018, 11:37 p.m. UTC | #4
On Tue, Apr 03, 2018 at 02:39:11PM +0100, Edward Cree wrote:
> On 03/04/18 02:08, Alexei Starovoitov wrote:
> > I like patch 3 and going to play with it.
> > How did you test it ?
> Just test_verifier and test_progs (the latter has a failure
>  "test_bpf_obj_id:FAIL:get-prog-info(fd) err 0 errno 2 i 0 type 1(1) info_len 80(40) jit_enabled 0 jited_prog_len 0 xlated_prog_len 72"
>  but that was already there before my patch).

hmm. that doesn't fail for me, and no other bots complained either.
Are you sure you're running the latest kernel and tests?

You may see the issue with buildid test:
test_stacktrace_build_id:FAIL:get build_id with readelf err -1 errno 2
that's due to actually missing buildid in the binary.

> > Do you have processed_insn numbers for
> > cilium or selftests programs before and after?
> > There should be no difference, right?
> That's a good check, I'll do that.
> 
> > As far as patch 1 it was very difficult to review, since several logically
> > different things clamped together. So breaking it apart:
> > - converting two arrays of subprog_starts and subprog_stack_depth into
> >   single array of struct bpf_subprog_info is a good thing
> > - tsort is interesting, but not sure it's correct. more below
> > - but main change of combining subprog logic with do_check is no good.
> <snip>
> > There will be no 'do_check() across whole program' walk.
> > Combining subprog pass with do_check is going into opposite direction
> > of this long term work. Divide and conquer. Combining more things into
> > do_check is the opposite of this programming principle.
> The main object of my change here was to change the data structures, to
>  store a subprogno in each insn aux rather than using bsearch() on the
>  subprog_starts.  I have now figured out the algorithm to do this in its
>  own pass (previously I thought this needed a recursive walk which is why
>  I wanted to roll it into do_check() - doing more than one whole-program
>  recursive walk seems like a bad idea.)

hmm. what's wrong with bsearch? It's trivial and fast.

> > My short term plan is to split basic instruction correctness checks
> > out of do_check() loop into separate pass and run it early on.
> I agree with that short term plan, sounds like a good idea.
> I'm still not sure I understand the long-term plan, though; since most
>  insns' input registers will still need to be checked (I'm assuming
>  majority of most real ebpf programs consists of computing and
>  dereferencing pointers), the data flow analysis will have to end up
>  doing all the same register updates current do_check() does (though
>  potentially in a different order), e.g. if a function is called three
>  times it will have to analyse it with three sets of input registers.
> Unless you have some way of specifying function preconditions, I don't
>  see how it works.  In particular something like
>     char *f(char *p)
>     {
>         *p++ = 0;
>         return p;
>     }
>     int main(void)
>     {
>         char *p = "abc"; /* represents getting a ptr from ctx or map */
>         p = f(p);
>         p = f(p);
>         p = f(p);
>         return 0;
>     }
>  seems as though it would be difficult to analyse in any way more
>  scalable than the current full recursive walk.  Unless you somehow
>  propagate register state _backwards_ as constraints when analysing a
>  function...?  In any case it seems like there are likely to be things
>  which current verifier accepts which require 'whole-program' analysis
>  to determine the dataflow (e.g. what if there were some conditional
>  branches in f(), and the preconditions on p depended on the value of
>  some other arg in r2?)

The main point is that we have to make the verifier scale. That's my #1 priority.
Even if we don't see the solution today we have to work towards it.
I believe adopting compiler style analysis is the right way forward.
The independent passes to determine and _keep_ info about subprograms,
basic blocks, cfg, dom graph, liveness, dataflow are necessary steps
not only for the main purpose of the verifier (which is the safety check),
but also for additional analysis that JITs, like NFP, have to do.
"Walk all instructions and apply a single transformation" is a _good thing_.
llvm has tens, if not hundreds, loops like:
for_each_bb_in_func(bb, func)
  for_each_insn_in_bb(insn, bb)
    do stuff with insn
Compiler designers could have combined multiple such passes into
fewer ones, but it's not done, because it increases complexity and
causes tough bugs where a pass is partially complete. The cfg and/or
dataflow are sometimes recomputed independently from the transformation,
instead of during the pass, for the same reasons.

As far as scaling the verifier, the number to shoot for is 1M bpf instructions.
For certain program types, like XDP, we probably will always keep the 4k insn limit,
but in some cases, where the program runs in user context, we can run it longer.
There 64k insns would be fine. The verifier is already capable of injecting code
into the program. It could easily inject cond_resched after every 4k or so.
We'd need to make progs preemptable and be smarter with rcu-based map lookups.
It's all solvable and easy to do as long as the core of the verifier scales.
The prime example where more than 4k instructions and loops are mandatory
is user space stack analysis inside the program. Walking a python stack, for
example, requires non-trivial pointer chasing. With 'pragma unroll' the stack
depth limit today is ~18. That's not really usable. Looping through 100 python
frames would require about 16k bpf assembler instructions. That's 16k JITed
cpu instructions. It's plenty fast and safe in user context, but if we increase
the bpf max_insn limit to 16k, I'm afraid, the verifier will start falling apart
due to the insn_processed counter.
Hence the do_check approach must go. The rough idea is to compute, per basic
block, a set of INs (registers and stack) that the basic block needs
to see to be safe, and a corresponding set of OUTs.
Then propagate this knowledge across cfg edges.
Once we have such a set per bpf function, it will essentially answer the question
'what arguments does this function need to see to be safe, and what does it return'.
To make bpf libraries scale we'd need to keep such information
around after the verification, so dynamic linking and indirect calls
are fast to verify.
It's very high level obviously. There are many gotchas to resolve.

> > As far as tsort approach for determining max stack depth...
> > It's an interesting idea, but implementation is suffering from the same
> > 'combine everything into one loop' coding issue.
> > I think updating total_stack_depth math should be separate from sorting,
> > since the same function can be part of different stacks with different max.
> > I don't see how updating global subprog_info[i].total_stack_depth can
> > be correct. It has to be different for every stack and the longest
> > stack is not necessarily the deepest. May be I'm missing something,
> > but combined sort and stack_depth math didn't make it easy to review.
> > I find existing check_max_stack_depth() algo much easier to understand.
> The sort _is_ the way to compute total stack depth.  The sort order isn't
>  being stored anywhere; it's being done just so that each subprog gets
>  looked at after all its callers have been considered.  So when it gets
>  selected as a 'frontier node', its maximum stack depth is known, and can
>  thus be used to update its callees (note that we do a max_t() with each
>  callee's existing total_stack_depth, thus getting the deepest stack of
>  all call chains to the function).
> It may help to imagine drawing the call graph and labelling each node with
>  a stack depth as it is visited; sadly that's difficult to show in an email
>  (or a code comment).  But I can try to explain it a bit better than
>  "/* Update callee total stack depth */".

Please do, since that's my concern with tsort.
The verifier is the key piece of bpf infra and to be effective maintainers
we need to thoroughly understand the verifier code.
We cannot just take the patch based on the cover letter. The author may
disappear tomorrow, and then what are we going to do with the code?
Hence today, with information I have, I prefer to keep the current
check_max_stack_depth() as-is.
In other words it looks to me that tsort complexity is not justified.
Edward Cree April 4, 2018, 11:58 p.m. UTC | #5
On 04/04/18 00:37, Alexei Starovoitov wrote:
> hmm. that doesn't fail for me and any other bots didn't complain.
> Are you sure you're running the latest kernel and tests?
Ah, test_progs isn't actually rebuilding because __NR_bpf is undeclared;
 something must be going wrong with header files.
Never mind.

> hmm. what's wrong with bsearch? It's trivial and fast. 
bsearch is O(log n), and the sort() call on the subprog_starts (which happens
 every time add_subprog() is called) is O(n log n).
Whereas reading aux->subprogno is O(1).
As far as I'm concerned, that's a sign that the latter data structure is the
 appropriate one.

> Even if we don't see the solution today we have to work towards it.
I guess I'm just not confident "towards" is the direction you think it is.

> Compiler designers could have combined multiple of such passes into
> fewer ones, but it's not done, because it increases complexity and
> causes tough bugs where pass is partially complete.
I'm not trying to combine together multiple 'for bb in prog/for insn in bb'-
 type passes.  The combining I was doing was more on 'for all possible
 execution paths'-type passes, because it's those that explode combinatorially.
Happily I think we can go a long way towards getting rid of them; but while I
 think we can get down to only having 1, I don't think we can reach 0.

> The prime example where more than 4k instructions and loops are mandatory
> is user space stack analysis inside the program. Like walking python stack
> requires non-trival pointer chasing. With 'pragma unroll' the stack depth
> limit today is ~18. That's not really usable. Looping through 100 python
> frames would require about 16k bpf assembler instructions.
But this would be solved by having support for bounded loops, and I think I've
 successfully shown that this is not inherently incompatible with a do_check()
 style walk.

> Hence do_check approach must go. The rough idea is to compute per basic
> block a set of INs (registers and stack) that basic block needs
> to see to be safe and corresponding set of OUTs.
> Then propagate this knowledge across cfg edges.
> Once we have such set per bpf function, it will essentially answer the question
> 'what arguments this function needs to see to be safe and what it returns'
> To make bpf libraries scale we'd need to keep such information
> around after the verification, so dynamic linking and indirect calls
> are fast to verify.
> It's very high level obviously. There are many gotchas to resolve.
I agree that if we can do this it'll be ideal.  But that's a big 'if'; my
 example code was intended to demonstrate that the "set of INs bb/func needs to
 see to be safe" can be an arbitrarily complicated disjunction, and that instead
 of a combinatorially exploding number of paths to walk (the do_check() approach)
 you now have combinatorially exploding IN-constraints to propagate backwards.

> Please do, since that's my concern with tsort.
> The verifier is the key piece of bpf infra and to be effective maintainers
> we need to thoroughly understand the verifier code.
> We cannot just take the patch based on the cover letter. The author may
> disappear tomorrow and what we're going to do with the code?
I entirely accept this argument.  Unfortunately, when writing explanations,
 it's difficult to know when one has reached something that will be
 understood, so inevitably there will have to be a few iterations to get
 there ;-)
Alexei Starovoitov April 5, 2018, 5:28 a.m. UTC | #6
On Thu, Apr 05, 2018 at 12:58:46AM +0100, Edward Cree wrote:
> On 04/04/18 00:37, Alexei Starovoitov wrote:
> > hmm. that doesn't fail for me and any other bots didn't complain.
> > Are you sure you're running the latest kernel and tests?
> Ah, test_progs isn't actually rebuilding because __NR_bpf is undeclared;
>  something must be going wrong with header files.
> Never mind.
> 
> > hmm. what's wrong with bsearch? It's trivial and fast. 
> bsearch is O(log n), and the sort() call on the subprog_starts (which happens
>  every time add_subprog() is called) is O(n log n).
> Whereas reading aux->subprogno is O(1).
> As far as I'm concerned, that's a sign that the latter data structure is the
>  appropriate one.

only if it can be done as a separate _first_ pass before cfg.

> > Even if we don't see the solution today we have to work towards it.
> I guess I'm just not confident "towards" is the direction you think it is.
> 
> > Compiler designers could have combined multiple of such passes into
> > fewer ones, but it's not done, because it increases complexity and
> > causes tough bugs where pass is partially complete.
> I'm not trying to combine together multiple 'for bb in prog/for insn in bb'-
>  type passes.  The combining I was doing was more on 'for all possible
>  execution paths'-type passes, because it's those that explode combinatorially.
> Happily I think we can go a long way towards getting rid of them; but while I
>  think we can get down to only having 1, I don't think we can reach 0.

we have to. That's the point. 'explore all possible paths' must be removed.
New code should not rely on that in a way that makes it difficult to remove
later. Subprog boundaries are a typical example where doing it as
part of 'explore all paths' is harmful long term, regardless of the extra
O(num_of_subprogs) complexity it brings short term.

> > The prime example where more than 4k instructions and loops are mandatory
> > is user space stack analysis inside the program. Like walking python stack
> > requires non-trival pointer chasing. With 'pragma unroll' the stack depth
> > limit today is ~18. That's not really usable. Looping through 100 python
> > frames would require about 16k bpf assembler instructions.
> But this would be solved by having support for bounded loops, and I think I've
>  successfully shown that this is not inherently incompatible with a do_check()
>  style walk.

No. It's worse. Your proposed approach to bounded loops relies completely
on 'explore all paths', whereas John's does not.
Can 'explore all paths' work with 1M bpf insns? No.
Whereas an approach that builds dom graph, detects natural loops
and loop invariants - can.

> > Hence do_check approach must go. The rough idea is to compute per basic
> > block a set of INs (registers and stack) that basic block needs
> > to see to be safe and corresponding set of OUTs.
> > Then propagate this knowledge across cfg edges.
> > Once we have such set per bpf function, it will essentially answer the question
> > 'what arguments this function needs to see to be safe and what it returns'
> > To make bpf libraries scale we'd need to keep such information
> > around after the verification, so dynamic linking and indirect calls
> > are fast to verify.
> > It's very high level obviously. There are many gotchas to resolve.
> I agree that if we can do this it'll be ideal.  But that's a big 'if'; my
>  example code was intended to demonstrate that the "set of INs bb/func needs to
>  see to be safe" can be an arbitrarily complicated disjunction, and that instead
>  of a combinatorially exploding number of paths to walk (the do_check() approach)
>  you now have combinatorially exploding IN-constraints to propagate backwards.

This sounds like arbitrary assumptions about what this new approach would do.
Instead of rejecting it outright try to solve this very hard problem.

Before this discussion gets carried away too far let's get back to this patch set.
Having seen all arguments I still think that only patch 3 is worth pursuing further.
Edward Cree April 5, 2018, 8:49 a.m. UTC | #7
On 05/04/18 06:28, Alexei Starovoitov wrote:
> On Thu, Apr 05, 2018 at 12:58:46AM +0100, Edward Cree wrote:
>> On 04/04/18 00:37, Alexei Starovoitov wrote:
>>> hmm. that doesn't fail for me and any other bots didn't complain.
>>> Are you sure you're running the latest kernel and tests?
>> Ah, test_progs isn't actually rebuilding because __NR_bpf is undeclared;
>>  something must be going wrong with header files.
>> Never mind.
>>
>>> hmm. what's wrong with bsearch? It's trivial and fast. 
>> bsearch is O(log n), and the sort() call on the subprog_starts (which happens
>>  every time add_subprog() is called) is O(n log n).
>> Whereas reading aux->subprogno is O(1).
>> As far as I'm concerned, that's a sign that the latter data structure is the
>>  appropriate one.
> only if it can be done as separate _first_ pass before cfg.
As I think I already said, I'm working on a next version of the patch that
 does just that.

> No. It's worse. Your proposed approach to bounded loops completely
> relying on 'explore all paths' whereas John's does not.
> Can 'explore all paths' work with 1M bpf insns? No.
> Whereas an approach that builds dom graph, detects natural loops
> and loop invariants - can.
... IF it's possible at all, which I'm still doubtful about.

> This sounds like arbitrary assumptions what this new approach would do.
> Instead of rejecting it outright try to solve this very hard problem.
I'm not rejecting it outright; I'm just saying, be prepared for the possibility
 that it can't be solved, and try to leave a line of retreat / have a plan B in
 the case that it proves to be subject to a no-free-lunch theorem.  Because my
 intuition is that an entropy argument may require all algos to have the same
 superexponential worst-case running time as 'explore all paths' does.

> Before this discussion gets carried away too far let's get back to this patch set.
> Having seen all arguments I still think that only patch 3 is worth pursuing further.
I think the next version of the patch series (coming real soon now) answers at
 least most of your objections (and indeed the above debate isn't relevant to it).
Jiong Wang April 5, 2018, 3:50 p.m. UTC | #8
On 03/04/2018 02:08, Alexei Starovoitov wrote:
> Combining subprog pass with do_check is going into opposite direction
> of this long term work. Divide and conquer. Combining more things into
> do_check is the opposite of this programming principle.

Agree. And for the redundant insn traversal issue in check_subprogs that
Edward is trying to fix, I am thinking we could do it by utilizing the
existing DFS traversal in check_cfg.

The current DFS could probably be improved into a generic instruction
information collection pass.

This won't make the existing DFS more complex, as it only does information
collection as a side job during traversal. The collected information
could then be used to build any other information to be consumed later,
for example subprogs, basic blocks, etc.

For the redundant insn traversal issue during subprog detection: like how
we mark STATE_LIST_MARK in the DFS, we could just call add_subprog for each
BPF_PSEUDO_CALL insn during the DFS.

i.e we change the code logic of check_cfg into:

check_cfg
{
   * DFS traversal:
     - detect back-edge.
     - collect STATE_LIST_MARK.
     - collect subprog destination.

   * check all insns are reachable.
   * check_subprogs (insn traversal removed).
}
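A rough sketch of that combined pass: one DFS over the insns that both detects back-edges (loops) and, as a side job, records pseudo-call destinations as subprog starts. The insn encoding, limits, and helper names below are invented for illustration; this is not the real BPF encoding or verifier code.

```c
#include <assert.h>
#include <stdbool.h>

enum op { OP_NOP, OP_JMP, OP_CALL, OP_EXIT };
struct insn { enum op op; int target; };  /* jump dest or callee start */

#define MAX_INSNS 16
enum node_color { WHITE, GREY, BLACK };
static enum node_color color[MAX_INSNS];
static bool subprog_start[MAX_INSNS];     /* collected during the DFS */

/* Returns true if a back-edge (i.e. a loop) is reachable from insn i. */
static bool dfs(const struct insn *prog, int len, int i)
{
	bool loop = false;

	if (i < 0 || i >= len)
		return true;              /* simplified: bad target = reject */
	if (color[i] == GREY)
		return true;              /* edge into the active path: loop */
	if (color[i] == BLACK)
		return false;             /* already fully explored */
	color[i] = GREY;
	switch (prog[i].op) {
	case OP_CALL:
		subprog_start[prog[i].target] = true;  /* the "side job" */
		loop = dfs(prog, len, prog[i].target) |
		       dfs(prog, len, i + 1);          /* callee + fallthrough */
		break;
	case OP_JMP:
		loop = dfs(prog, len, prog[i].target); /* unconditional only */
		break;
	case OP_EXIT:
		break;
	default:
		loop = dfs(prog, len, i + 1);
	}
	color[i] = BLACK;
	return loop;
}
```

Subprog detection then reduces to reading subprog_start[] after the single traversal, instead of a separate scan over all insns.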

I prototyped a patch for the above change; it looks easier to review to me.
However, there are 8 selftest regressions after this change: since
check_subprogs now runs after some of the checks in the DFS, some errors are
detected before the expected ones. These need to be cleaned up.

Does this direction sound good?

And if we want to build basic blocks later, we could just call a new add_bb
(similar to add_subprog) for jump targets etc. (some of those places are
actually STATE_LIST_MARK_JMPTARGET if we separate STATE_LIST_MARK into
STATE_LIST_MARK_JMPTARGET and STATE_LIST_MARK_HEURISTIC).

Regards,
Jiong
Edward Cree April 5, 2018, 4:29 p.m. UTC | #9
On 05/04/18 16:50, Jiong Wang wrote:
> On 03/04/2018 02:08, Alexei Starovoitov wrote:
>> Combining subprog pass with do_check is going into opposite direction
>> of this long term work. Divide and conquer. Combining more things into
>> do_check is the opposite of this programming principle.
>
> Agree. And for the redundant insn traversal issue in check_subprogs that
> Edward trying to fix, I am thinking we could do it by utilizing the
> existing DFS traversal in check_cfg.
>
> The current DFS probably could be improved into an generic instruction
> information collection pass.
<snip>
> And if we want to build basic-block later, we could just call a new add_bb
> (similar as add_subprog) for jump targets etc. (some of those places are
> actually STATE_LIST_MARK_JMPTARGET if we separate STATE_LIST_MARK as
> STATE_LIST_MARK_JMPTARGET and STATE_LIST_MARK_HEURISTIC).
>
> Regards,
> Jiong
>
This is broadly similar to my medium-term plan, except that I would use
 a Kahn's-algorithm-style tsort rather than the depth-first tsort
 algorithm used by the current check_cfg.
* chop up prog into functions and bbs, incidentally placing S_L_Ms
* create control flow graph similar to my function call graph
* tsort it to detect loops, similar to how my check_max_stack_depth()
  implementation detects recursion.
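The Kahn's-algorithm-style tsort mentioned above could look roughly like this: repeatedly dequeue zero-in-degree nodes; any node never dequeued sits on a cycle (a loop in a CFG, or recursion in a call graph). The toy adjacency-matrix representation and node limit are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8

/* Returns true if the graph contains a cycle. */
static bool has_cycle(bool edge[MAX_NODES][MAX_NODES], int n)
{
	int indeg[MAX_NODES] = { 0 };
	int queue[MAX_NODES], head = 0, tail = 0, visited = 0;

	for (int u = 0; u < n; u++)
		for (int v = 0; v < n; v++)
			if (edge[u][v])
				indeg[v]++;
	for (int u = 0; u < n; u++)
		if (!indeg[u])
			queue[tail++] = u;     /* sources seed the queue */
	while (head < tail) {
		int u = queue[head++];
		visited++;
		for (int v = 0; v < n; v++)
			if (edge[u][v] && !--indeg[v])
				queue[tail++] = v;
	}
	/* a node that was never dequeued sits on a cycle */
	return visited != n;
}
```

Unlike the depth-first version, this needs no recursion or colour marking, and the dequeue order is itself the topological order when no cycle is found.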

-Ed
Alexei Starovoitov April 6, 2018, 1:20 a.m. UTC | #10
On Thu, Apr 05, 2018 at 04:50:03PM +0100, Jiong Wang wrote:
> On 03/04/2018 02:08, Alexei Starovoitov wrote:
> > Combining subprog pass with do_check is going into opposite direction
> > of this long term work. Divide and conquer. Combining more things into
> > do_check is the opposite of this programming principle.
> 
> Agree. And for the redundant insn traversal issue in check_subprogs that
> Edward trying to fix, I am thinking we could do it by utilizing the
> existing DFS traversal in check_cfg.
> 
> The current DFS probably could be improved into an generic instruction
> information collection pass.
> 
> This won't make the existing DFS complexer as it only does information
> collection as a side job during traversal. These collected information
> then could be used to build any other information to be consumed later,
> for example subprog, basic blocks etc.
> 
> For the redundant insn traversal issue during subprog detection: like how
> we mark STATE_LIST_MARK in the DFS, we could just call add_subprog for
> each BPF_PSEUDO_CALL insn during the DFS.
> 
> i.e we change the code logic of check_cfg into:
> 
> check_cfg
> {
>   * DFS traversal:
>     - detect back-edge.
>     - collect STATE_LIST_MARK.
>     - collect subprog destination.
> 
>   * check all insns are reachable.
>   * check_subprogs (insn traversal removed).
> }

I don't think that will work.
Functions are independent of the CFG.
With bpf libraries we will have disjoint functions sitting in the kernel,
and check_cfg would need to be done separately from function boundaries.