diff mbox

Add a source_tree field to Project.

Message ID 1301520312.28156.124.camel@feioso
State Superseded
Headers show

Commit Message

Guilherme Salgado March 30, 2011, 9:25 p.m. UTC
On Wed, 2011-03-30 at 12:10 +0800, Jeremy Kerr wrote:
> Hi Guilherme,
> 
> >  apps/patchwork/models.py                      |    1 +
> >  lib/sql/migration/008-project-source-tree.sql |    3 +++
> >  2 files changed, 4 insertions(+), 0 deletions(-)
> >  create mode 100644 lib/sql/migration/008-project-source-tree.sql
> 
> Looks good. I'd like to wait until there is a user of this field before 
> merging the change though - or are you using this for linaro-internal things?

I'm using it on a script I'm writing to fetch the git history of every
project and scan that looking for patches that have been committed. Just
like the existing patchwork-update-commits script does but this one is
fully automated, and to make it more easily testable I'm experimenting
with python-dulwich to scan the git history. 

I haven't submitted it for feedback yet because it's still very much a
work in progress, but I would certainly do so. (I'm including what I
have so far here, though)

Oh, there's one extra DB schema change I'm considering... That will
allow us to keep track of the last seen commit ref so that we can scan
the git history incrementally.  I've added it as another field to
Project, named last_seen_commit_ref.  How does that sound to you?

Comments

Guilherme Salgado March 31, 2011, 9:32 p.m. UTC | #1
On Wed, 2011-03-30 at 18:25 -0300, Guilherme Salgado wrote:
> On Wed, 2011-03-30 at 12:10 +0800, Jeremy Kerr wrote:
> > Hi Guilherme,
> > 
> > >  apps/patchwork/models.py                      |    1 +
> > >  lib/sql/migration/008-project-source-tree.sql |    3 +++
> > >  2 files changed, 4 insertions(+), 0 deletions(-)
> > >  create mode 100644 lib/sql/migration/008-project-source-tree.sql
> > 
> > Looks good. I'd like to wait until there is a user of this field before 
> > merging the change though - or are you using this for linaro-internal things?
> 
> I'm using it on a script I'm writing to fetch the git history of every
> project and scan that looking for patches that have been committed. Just
> like the existing patchwork-update-commits script does but this one is
> fully automated, and to make it more easily testable I'm experimenting
> with python-dulwich to scan the git history. 

So, although I could do most things using python-dulwich it was failing
in some cases and I just don't have the time to chase down these
failures and fix them, so I resorted to running git on a subprocess. I'm
still writing the script in python as that allows me to have at least
some test coverage, which is something very valuable to me.

This new version of the script works similarly to the previous version.
There are a few things that need to be changed but it has some tests and
is able to scan the qemu tree and update the status of a couple patches.
Also note that it updates the commit_ref of a patch, which will allow us
to generate a link to the project's gitweb once we have the commit_url
field.

I'd appreciate some feedback on this; specially whether or not this is
something that's going to be useful upstream and, if so, if the current
approach is reasonable.

Cheers,
Dirk Wallenstein April 1, 2011, 8:34 a.m. UTC | #2
On Thu, Mar 31, 2011 at 06:32:33PM -0300, Guilherme Salgado wrote:
> On Wed, 2011-03-30 at 18:25 -0300, Guilherme Salgado wrote:
> > On Wed, 2011-03-30 at 12:10 +0800, Jeremy Kerr wrote:
> > > Hi Guilherme,
> > > 
> > > >  apps/patchwork/models.py                      |    1 +
> > > >  lib/sql/migration/008-project-source-tree.sql |    3 +++
> > > >  2 files changed, 4 insertions(+), 0 deletions(-)
> > > >  create mode 100644 lib/sql/migration/008-project-source-tree.sql
> > > 
> > > Looks good. I'd like to wait until there is a user of this field before 
> > > merging the change though - or are you using this for linaro-internal things?
> > 
> > I'm using it on a script I'm writing to fetch the git history of every
> > project and scan that looking for patches that have been committed. Just
> > like the existing patchwork-update-commits script does but this one is
> > fully automated, and to make it more easily testable I'm experimenting
> > with python-dulwich to scan the git history. 
> 
> So, although I could do most things using python-dulwich it was failing
> in some cases and I just don't have the time to chase down these
> failures and fix them, so I resorted to running git on a subprocess. I'm
> still writing the script in python as that allows me to have at least
> some test coverage, which is something very valuable to me.
> 
> This new version of the script works similarly to the previous version.
> There are a few things that need to be changed but it has some tests and
> is able to scan the qemu tree and update the status of a couple patches.
> Also note that it updates the commit_ref of a patch, which will allow us
> to generate a link to the project's gitweb once we have the commit_url
> field.
> 
> I'd appreciate some feedback on this; specially whether or not this is
> something that's going to be useful upstream and, if so, if the current
> approach is reasonable.

I think something like that would be really really useful.

I was wondering about a way to include patches that were slightly
modified when they were applied.  The author, author date, and subject
do not change in most cases (ok, maybe the subject is extended at the
start to add a section, and maybe space and case changes). 

I think if author and author date match, and if a case insensitive
space-less version of the mail-subject is contained in the
commit-subject it is the corresponding commit.
Dirk Wallenstein April 1, 2011, 11:01 a.m. UTC | #3
On Wed, Mar 30, 2011 at 06:25:12PM -0300, Guilherme Salgado wrote:
> On Wed, 2011-03-30 at 12:10 +0800, Jeremy Kerr wrote:
> > Hi Guilherme,
> > 
> > >  apps/patchwork/models.py                      |    1 +
> > >  lib/sql/migration/008-project-source-tree.sql |    3 +++
> > >  2 files changed, 4 insertions(+), 0 deletions(-)
> > >  create mode 100644 lib/sql/migration/008-project-source-tree.sql
> > 
> > Looks good. I'd like to wait until there is a user of this field before 
> > merging the change though - or are you using this for linaro-internal things?
> 
> I'm using it on a script I'm writing to fetch the git history of every
> project and scan that looking for patches that have been committed. Just
> like the existing patchwork-update-commits script does but this one is
> fully automated, and to make it more easily testable I'm experimenting
> with python-dulwich to scan the git history. 
> 
> I haven't submitted it for feedback yet because it's still very much a
> work in progress, but I would certainly do so. (I'm including what I
> have so far here, though)
> 
> Oh, there's one extra DB schema change I'm considering... That will
> allow us to keep track of the last seen commit ref so that we can scan
> the git history incrementally.  I've added it as another field to
> Project, named last_seen_commit_ref.  How does that sound to you?
> 
> 
> -- 
> Guilherme Salgado <https://launchpad.net/~salgado>

> commit dd10a245c50f651b38e7074d49413e91b2d82e14
> Author: Guilherme Salgado <guilherme.salgado@linaro.org>
> Date:   Fri Mar 25 15:59:28 2011 -0300
> 
>     Adds a script which goes through all registered projects looking for patches that have been committed already
>     
>     It does that by checking out the project's source code from its VCS of choice
>     (currently only git is supported, though), scanning the commits there and
>     comparing them to the patches in Patchwork.
> 
> diff --git a/apps/patchwork/bin/update-committed-patches.py b/apps/patchwork/bin/update-committed-patches.py
> new file mode 100755
> index 0000000..a444783
> --- /dev/null
> +++ b/apps/patchwork/bin/update-committed-patches.py
> @@ -0,0 +1,35 @@
> +#!/usr/bin/python
> +
> +import _pythonpath
> +from patchwork.models import Patch, Project, State
> +from patchwork.utils import (
> +    ensure_source_checkout_for_project, get_hashes_for_commits)
> +
> +
> +#for project in Project.objects.all():
> +for project in Project.objects.filter(linkname='linux-kernel'):
> +    if project.source_tree is None:
> +        continue
> +
> +    repo = ensure_source_checkout_for_project(project)
> +    if repo is None:
> +        print ("Skipping %s as we couldn't get a source checkout" %
> +               project.name)
> +        continue
> +
> +    hashes = get_hashes_for_commits(repo,
> +                                    stop_at=project.last_seen_commit_ref)
> +    for commit_id, patch_hash in hashes:
> +        # There may be multiple patches with the same hash. That's usually
> +        # the case when a second version of a patch series is submitted
> +        # and some of the patches in the series are identical in both
> +        # series.
> +        for patch in Patch.objects.filter(project=project, hash=patch_hash):
> +            patch.state = State.objects.get(name='Accepted')
> +            patch.commit_ref = commit_id
> +            print patch, patch.state
> +    else:
> +        print "No new %s commits to parse" % project.name
> +
> +    project.last_seen_commit_ref = repo.head()
> +    project.save()
> diff --git a/apps/patchwork/tests/__init__.py b/apps/patchwork/tests/__init__.py
> index 68fe563..e79331b 100644
> --- a/apps/patchwork/tests/__init__.py
> +++ b/apps/patchwork/tests/__init__.py
> @@ -23,3 +23,4 @@ from patchwork.tests.bundles import *
>  from patchwork.tests.mboxviews import *
>  from patchwork.tests.updates import *
>  from patchwork.tests.filters import *
> +from patchwork.tests.test_utils import *
> diff --git a/apps/patchwork/tests/test_utils.py b/apps/patchwork/tests/test_utils.py
> new file mode 100644
> index 0000000..0c60e74
> --- /dev/null
> +++ b/apps/patchwork/tests/test_utils.py
> @@ -0,0 +1,74 @@
> +
> +import tempfile
> +from time import time
> +from unittest import TestCase
> +
> +from dulwich.objects import Blob, Commit, parse_timezone, Tree
> +from dulwich.repo import Repo
> +
> +from patchwork.utils import get_hashes_for_commits
> +
> +
> +class TestGetHashesForCommits(TestCase):
> +
> +    def test_one_commit(self):
> +        repo = self.create_git_repo()
> +        commit = self.add_file_and_commit(repo, 'foo', 'Content1')
> +        # Here there are no hashes because get_hashes_for_commits() skips the
> +        # first one as it's unlikely to be of any interest to us.
> +        self.assertEqual(
> +            [], list(get_hashes_for_commits(repo, stop_at=None)))
> +
> +    def test_two_commits(self):
> +        repo = self.create_git_repo()
> +        commit = self.add_file_and_commit(repo, 'foo', 'Content1')
> +        commit2 = self.add_file_and_commit(repo, 'bar', 'Content2', commit)
> +        self.assertEqual(
> +            [(commit2.id, '5c010402c5673981ee3e1712e6a037de3ff9cae4')],
> +            list(get_hashes_for_commits(repo, stop_at=None)))
> +
> +    def test_empty_patch(self):
> +        repo = self.create_git_repo()
> +        commit = self.add_file_and_commit(repo, 'foo', 'Content1')
> +        head = self.add_file_and_commit(repo, 'bar', '', commit)
> +        self.assertEqual(
> +            [], list(get_hashes_for_commits(repo, stop_at=None)))
> +
> +    def test_stop_at(self):
> +        repo = self.create_git_repo()
> +        commit = self.add_file_and_commit(repo, 'foo', 'Content1')
> +        commit2 = self.add_file_and_commit(repo, 'bar', 'Content2', commit)
> +        commit3 = self.add_file_and_commit(repo, 'baz', 'Content3', commit2)
> +        self.assertEqual(
> +            [(commit3.id, '11d22fa0986b3bb341baa76b8a6a757a46a2f916')],
> +            list(get_hashes_for_commits(repo, stop_at=commit2.id)))
> +
> +    def create_git_repo(self):
> +        tmpdir = tempfile.mkdtemp()
> +        repo = Repo.init(tmpdir)
> +        return repo
> +
> +    def add_file_and_commit(self, repo, filename, data, parent=None):
> +        blob = Blob.from_string(data)
> +        parents = []
> +        tree = Tree()
> +        if parent is not None:
> +            tree = repo[parent.tree]
> +            parents = [parent.id]
> +        tree.add(0100644, filename, blob.id)
> +        commit = Commit()
> +        commit.tree = tree.id
> +        author = 'You <you@example.com>'
> +        commit.author = commit.committer = author
> +        commit.commit_time = commit.author_time = int(time())
> +        tz = parse_timezone('-0200')[0]
> +        commit.commit_timezone = commit.author_timezone = tz
> +        commit.encoding = "UTF-8"
> +        commit.message = "A commit"
> +        commit.parents = parents
> +        object_store = repo.object_store
> +        object_store.add_object(blob)
> +        object_store.add_object(tree)
> +        object_store.add_object(commit)
> +        repo.refs['refs/heads/master'] = commit.id
> +        return commit
> diff --git a/apps/patchwork/utils.py b/apps/patchwork/utils.py
> index e41ffb6..353147e 100644
> --- a/apps/patchwork/utils.py
> +++ b/apps/patchwork/utils.py
> @@ -17,8 +17,15 @@
>  # along with Patchwork; if not, write to the Free Software
>  # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
>  
> +import os
> +from StringIO import StringIO
>  
> -from patchwork.models import Bundle, Project, BundlePatch
> +from dulwich.client import get_transport_and_path
> +from dulwich.patch import write_tree_diff
> +from dulwich.repo import Repo
> +
> +from patchwork.parser import hash_patch, parse_patch
> +from patchwork.models import Bundle, BundlePatch
>  from django.shortcuts import get_object_or_404
>  
>  def get_patch_ids(d, prefix = 'patch_id'):
> @@ -137,3 +144,64 @@ def set_bundle(user, project, action, data, patches, context):
>      bundle.save()
>  
>      return []
> +
> +
> +def ensure_source_checkout_for_project(project):
> +    forest = '/home/salgado/src' # This is where we store the trees we checkout
> +    root = os.path.join(forest, project.linkname)
> +    if not os.path.exists(root):
> +        repo = Repo.init(root, mkdir=True)
> +    else:
> +        repo = Repo(root)
> +
> +    transport, path = get_transport_and_path(project.source_tree)
> +    refs = transport.fetch(path, repo)
> +    # XXX: Is this the appropriate thing to do?  will there always be a master
> +    # branch?
> +    repo.refs['refs/heads/master'] = refs['HEAD']
> +    return repo
> +
> +
> +def get_hashes_for_commits(repo, stop_at):
> +    # We don't care about the first commit, but if needed it should be
> +    # possible to diff it against an empty tree and yield its hash as well.
> +    commit = repo['HEAD']
> +
> +    while len(commit.parents) > 0:
> +        commit_id = commit.id
> +        if commit_id == stop_at:
> +            break
> +
> +        parent = repo[commit.parents[0]]

What about possible other parents?  I would say every merged branch has
to be inspected until the merge-base of the immediate predecessors --
maybe recursion works here.  I forgot to mention Git-Python in the other
mail.  It has a wrapper to execute git commands with python syntax.
That can facilitate such things.

https://github.com/gitpython-developers/GitPython

> +        diff = StringIO()
> +        # In the case of merges, this won't have the same behavior as 'git
> +        # show', which seems to omit files not changed since any of the
> +        # parents (thanks to 'git tree-diff --cc'), but I think this is not a
> +        # big deal as patches in Patchwork would never be identical to the
> +        # diff of a merge anyway, would they?
> +        try:
> +            write_tree_diff(
> +                diff, repo.object_store, parent.tree, commit.tree)
> +        except KeyError, e:
> +            # XXX: This happens in qemu because there's a commit that is
> +            # actually on a submodule
> +            # (8b06c62ae48b67b320f7420dcd4854c5559e1532) and the old commit_id
> +            # used for the submodule
> +            # (06d0bdd9e2e20377b3180e4986b14c8549b393e4) is gone (possibly
> +            # because of a rebase?), so dulwich crashes with a KeyError.
> +            raise
> +
> +        commit = parent
> +        diff.seek(0)
> +        try:
> +            patch, _ = parse_patch(diff.read().decode('utf-8'))
> +        except UnicodeDecodeError:
> +            # TODO: Need to find out why this happens and see if we really
> +            # need to skip such commits.
> +            print "Skipping %s" % commit.id
> +            continue
> +
> +        # When commits just add files or change permissions the diff will be
> +        # empty and thus parse_patch() will return None.
> +        if patch is not None:
> +            yield commit_id, hash_patch(patch).hexdigest()
Guilherme Salgado April 1, 2011, 1:43 p.m. UTC | #4
On Fri, 2011-04-01 at 13:01 +0200, Dirk Wallenstein wrote:
> On Wed, Mar 30, 2011 at 06:25:12PM -0300, Guilherme Salgado wrote:
> > On Wed, 2011-03-30 at 12:10 +0800, Jeremy Kerr wrote:
[...]
> > +def ensure_source_checkout_for_project(project):
> > +    forest = '/home/salgado/src' # This is where we store the trees we checkout
> > +    root = os.path.join(forest, project.linkname)
> > +    if not os.path.exists(root):
> > +        repo = Repo.init(root, mkdir=True)
> > +    else:
> > +        repo = Repo(root)
> > +
> > +    transport, path = get_transport_and_path(project.source_tree)
> > +    refs = transport.fetch(path, repo)
> > +    # XXX: Is this the appropriate thing to do?  will there always be a master
> > +    # branch?
> > +    repo.refs['refs/heads/master'] = refs['HEAD']
> > +    return repo
> > +
> > +
> > +def get_hashes_for_commits(repo, stop_at):
> > +    # We don't care about the first commit, but if needed it should be
> > +    # possible to diff it against an empty tree and yield its hash as well.
> > +    commit = repo['HEAD']
> > +
> > +    while len(commit.parents) > 0:
> > +        commit_id = commit.id
> > +        if commit_id == stop_at:
> > +            break
> > +
> > +        parent = repo[commit.parents[0]]
> 
> What about possible other parents?  I would say every merged branch has
> to be inspected until the merge-base of the immediate predecessors --

That's one of the reasons that led me to run git as a subprocess instead
of using python-dulwich. In a followup to this thread I sent a new
version of this patch which uses just git.

> maybe recursion works here.  I forgot to mention Git-Python in the other
> mail.  It has a wrapper to execute git commands with python syntax.
> That can facilitate such things.
> 
> https://github.com/gitpython-developers/GitPython

Oh, cool, I'll have a look at that.

Cheers,
Dirk Wallenstein April 1, 2011, 2:08 p.m. UTC | #5
On Fri, Apr 01, 2011 at 01:43:06PM +0000, Guilherme Salgado wrote:
> On Fri, 2011-04-01 at 13:01 +0200, Dirk Wallenstein wrote:
> > On Wed, Mar 30, 2011 at 06:25:12PM -0300, Guilherme Salgado wrote:
> > > On Wed, 2011-03-30 at 12:10 +0800, Jeremy Kerr wrote:
> [...]
> > > +def ensure_source_checkout_for_project(project):
> > > +    forest = '/home/salgado/src' # This is where we store the trees we checkout
> > > +    root = os.path.join(forest, project.linkname)
> > > +    if not os.path.exists(root):
> > > +        repo = Repo.init(root, mkdir=True)
> > > +    else:
> > > +        repo = Repo(root)
> > > +
> > > +    transport, path = get_transport_and_path(project.source_tree)
> > > +    refs = transport.fetch(path, repo)
> > > +    # XXX: Is this the appropriate thing to do?  will there always be a master
> > > +    # branch?
> > > +    repo.refs['refs/heads/master'] = refs['HEAD']
> > > +    return repo
> > > +
> > > +
> > > +def get_hashes_for_commits(repo, stop_at):
> > > +    # We don't care about the first commit, but if needed it should be
> > > +    # possible to diff it against an empty tree and yield its hash as well.
> > > +    commit = repo['HEAD']
> > > +
> > > +    while len(commit.parents) > 0:
> > > +        commit_id = commit.id
> > > +        if commit_id == stop_at:
> > > +            break
> > > +
> > > +        parent = repo[commit.parents[0]]
> > 
> > What about possible other parents?  I would say every merged branch has
> > to be inspected until the merge-base of the immediate predecessors --
> 
> That's one of the reasons that led me to run git as a subprocess instead
> of using python-dulwich. In a followup to this thread I sent a new
> version of this patch which uses just git.
> 
> > maybe recursion works here.  I forgot to mention Git-Python in the other
> > mail.  It has a wrapper to execute git commands with python syntax.
> > That can facilitate such things.
> > 
> > https://github.com/gitpython-developers/GitPython
> 
> Oh, cool, I'll have a look at that.

I would say, the commits to inspect are basically:
  gitpythonrepo.git.rev_list(no_merges=True, 'HEAD', '^' + stop_at)
Dirk Wallenstein April 3, 2011, 11:44 a.m. UTC | #6
On Fri, Apr 01, 2011 at 10:34:23AM +0200, Dirk Wallenstein wrote:
> On Thu, Mar 31, 2011 at 06:32:33PM -0300, Guilherme Salgado wrote:
> > On Wed, 2011-03-30 at 18:25 -0300, Guilherme Salgado wrote:
> > > On Wed, 2011-03-30 at 12:10 +0800, Jeremy Kerr wrote:
> > > > Hi Guilherme,
> > > > 
> > > > >  apps/patchwork/models.py                      |    1 +
> > > > >  lib/sql/migration/008-project-source-tree.sql |    3 +++
> > > > >  2 files changed, 4 insertions(+), 0 deletions(-)
> > > > >  create mode 100644 lib/sql/migration/008-project-source-tree.sql
> > > > 
> > > > Looks good. I'd like to wait until there is a user of this field before 
> > > > merging the change though - or are you using this for linaro-internal things?
> > > 
> > > I'm using it on a script I'm writing to fetch the git history of every
> > > project and scan that looking for patches that have been committed. Just
> > > like the existing patchwork-update-commits script does but this one is
> > > fully automated, and to make it more easily testable I'm experimenting
> > > with python-dulwich to scan the git history. 
> > 
> > So, although I could do most things using python-dulwich it was failing
> > in some cases and I just don't have the time to chase down these
> > failures and fix them, so I resorted to running git on a subprocess. I'm
> > still writing the script in python as that allows me to have at least
> > some test coverage, which is something very valuable to me.
> > 
> > This new version of the script works similarly to the previous version.
> > There are a few things that need to be changed but it has some tests and
> > is able to scan the qemu tree and update the status of a couple patches.
> > Also note that it updates the commit_ref of a patch, which will allow us
> > to generate a link to the project's gitweb once we have the commit_url
> > field.
> > 
> > I'd appreciate some feedback on this; specially whether or not this is
> > something that's going to be useful upstream and, if so, if the current
> > approach is reasonable.
> 
> I think something like that would be really really useful.
> 
> I was wondering about a way to include patches that were slightly
> modified when they were applied.  The author, author date, and subject
> do not change in most cases (ok, maybe the subject is extended at the
> start to add a section, and maybe space and case changes). 
> 
> I think if author and author date match, and if a case insensitive
> space-less version of the mail-subject is contained in the
> commit-subject it is the corresponding commit.

Too bad that the author-date does not match when authors set up a branch
to be pulled.  But at least this could work for the occasional
contributor who might not have a patchwork account.

Theoretically, it might be possible to solely base that determination on
author and subject if the chain of superseding patches is recorded.
That could also rely on author and subject triggered by the common
(AFAIK) v1,v2,...,vn tokens in the patch sign.  Such chains could link
mail threads together and form topics.  Maybe assemble lists for cases
in which these computations failed and optionally send an email back.  I
think, that for a lot of cases relying on the uniqueness and
monotonicity of subjects could work.  Maybe add support for an X-Header
"X-Patchwork-Previous-Subject" for the cases when it changes.

Linking threads into topics would be really nice, I think.  It would be
possible to bookmark topics and see if there have been changes to a
topic -- something that is not immediately visible in a mail client.

Just musings and I would love to hear about cases where this would not
work most of the time (given there is a "v#" token policy).
Jeremy Kerr April 14, 2011, 4:55 a.m. UTC | #7
Hi Guilherme,

> I'd appreciate some feedback on this; specially whether or not this is
> something that's going to be useful upstream and, if so, if the current
> approach is reasonable.

This is the approach I (and other patchwork maintainers) have been
using:

  git rev-list rev1..rev2 |
  while read commit
  do
      hash=$(git show $commit | pwparser --hash)
      pwclient update -h $hash -s Accepted -c $commit
  done

(`pwparser` being apps/patchwork/parser.py)

If you'd like to make this more automated, I'd much rather something
that doesn't require a full patchwork source tree to work, and can be
easily run on maintainers' machines. It's very unlikely that we'd run
these processes on ozlabs.org or kernel.org, due to the high processing
overhead.

Cheers,


Jeremy
Guilherme Salgado April 14, 2011, 4:09 p.m. UTC | #8
Hi Jeremy,

On Thu, 2011-04-14 at 12:55 +0800, Jeremy Kerr wrote:
> Hi Guilherme,
> 
> > I'd appreciate some feedback on this; specially whether or not this is
> > something that's going to be useful upstream and, if so, if the current
> > approach is reasonable.
> 
> This is the approach I (and other patchwork maintainers) have been
> using:
> 
>   git rev-list rev1..rev2 |
>   while read commit
>   do
>       hash=$(git show $commit | pwparser --hash)
>       pwclient update -h $hash -s Accepted -c $commit
>   done
> 
> (`pwparser` being apps/patchwork/parser.py)
> 
> If you'd like to make this more automated, I'd much rather something
> that doesn't require a full patchwork source tree to work, and can be
> easily run on maintainers' machines. It's very unlikely that we'd run
> these processes on ozlabs.org or kernel.org, due to the high processing
> overhead.

I've no idea about the sort of hw that those instances run on, but is it
really that big an overhead?  I'd expect it to be so the first time it
scans a given repo but all further runs of the script should be much
quicker/cheaper as it will start from the last commit parsed in the
previous run.  

Anyway, the biggest disadvantage of decoupling the commit scanner from
patchwork itself is that we'd have to go via pwclient/xml-rpc to get
anything from patchwork, and that's a lot less
flexible/easy-to-experiment-with than direct access to the DB.  However,
it should be possible to design the scanner in a way that the script
itself doesn't depend on patchwork and the functions that read/write
data from/to patchwork are easily interchangeable, so that we can have
one version of it with direct DB access and another using pwclient. I'd
be more than happy to write the two versions of the script if you like
that idea.

Cheers,
diff mbox

Patch

commit dd10a245c50f651b38e7074d49413e91b2d82e14
Author: Guilherme Salgado <guilherme.salgado@linaro.org>
Date:   Fri Mar 25 15:59:28 2011 -0300

    Adds a script which goes through all registered projects looking for patches that have been committed already
    
    It does that by checking out the project's source code from its VCS of choice
    (currently only git is supported, though), scanning the commits there and
    comparing them to the patches in Patchwork.

diff --git a/apps/patchwork/bin/update-committed-patches.py b/apps/patchwork/bin/update-committed-patches.py
new file mode 100755
index 0000000..a444783
--- /dev/null
+++ b/apps/patchwork/bin/update-committed-patches.py
@@ -0,0 +1,35 @@ 
+#!/usr/bin/python
+
+import _pythonpath
+from patchwork.models import Patch, Project, State
+from patchwork.utils import (
+    ensure_source_checkout_for_project, get_hashes_for_commits)
+
+
+#for project in Project.objects.all():
+for project in Project.objects.filter(linkname='linux-kernel'):
+    if project.source_tree is None:
+        continue
+
+    repo = ensure_source_checkout_for_project(project)
+    if repo is None:
+        print ("Skipping %s as we couldn't get a source checkout" %
+               project.name)
+        continue
+
+    hashes = get_hashes_for_commits(repo,
+                                    stop_at=project.last_seen_commit_ref)
+    for commit_id, patch_hash in hashes:
+        # There may be multiple patches with the same hash. That's usually
+        # the case when a second version of a patch series is submitted
+        # and some of the patches in the series are identical in both
+        # series.
+        for patch in Patch.objects.filter(project=project, hash=patch_hash):
+            patch.state = State.objects.get(name='Accepted')
+            patch.commit_ref = commit_id
+            print patch, patch.state
+    else:
+        print "No new %s commits to parse" % project.name
+
+    project.last_seen_commit_ref = repo.head()
+    project.save()
diff --git a/apps/patchwork/tests/__init__.py b/apps/patchwork/tests/__init__.py
index 68fe563..e79331b 100644
--- a/apps/patchwork/tests/__init__.py
+++ b/apps/patchwork/tests/__init__.py
@@ -23,3 +23,4 @@  from patchwork.tests.bundles import *
 from patchwork.tests.mboxviews import *
 from patchwork.tests.updates import *
 from patchwork.tests.filters import *
+from patchwork.tests.test_utils import *
diff --git a/apps/patchwork/tests/test_utils.py b/apps/patchwork/tests/test_utils.py
new file mode 100644
index 0000000..0c60e74
--- /dev/null
+++ b/apps/patchwork/tests/test_utils.py
@@ -0,0 +1,74 @@ 
+
+import tempfile
+from time import time
+from unittest import TestCase
+
+from dulwich.objects import Blob, Commit, parse_timezone, Tree
+from dulwich.repo import Repo
+
+from patchwork.utils import get_hashes_for_commits
+
+
+class TestGetHashesForCommits(TestCase):
+
+    def test_one_commit(self):
+        repo = self.create_git_repo()
+        commit = self.add_file_and_commit(repo, 'foo', 'Content1')
+        # Here there are no hashes because get_hashes_for_commits() skips the
+        # first one as it's unlikely to be of any interest to us.
+        self.assertEqual(
+            [], list(get_hashes_for_commits(repo, stop_at=None)))
+
+    def test_two_commits(self):
+        repo = self.create_git_repo()
+        commit = self.add_file_and_commit(repo, 'foo', 'Content1')
+        commit2 = self.add_file_and_commit(repo, 'bar', 'Content2', commit)
+        self.assertEqual(
+            [(commit2.id, '5c010402c5673981ee3e1712e6a037de3ff9cae4')],
+            list(get_hashes_for_commits(repo, stop_at=None)))
+
+    def test_empty_patch(self):
+        repo = self.create_git_repo()
+        commit = self.add_file_and_commit(repo, 'foo', 'Content1')
+        head = self.add_file_and_commit(repo, 'bar', '', commit)
+        self.assertEqual(
+            [], list(get_hashes_for_commits(repo, stop_at=None)))
+
+    def test_stop_at(self):
+        repo = self.create_git_repo()
+        commit = self.add_file_and_commit(repo, 'foo', 'Content1')
+        commit2 = self.add_file_and_commit(repo, 'bar', 'Content2', commit)
+        commit3 = self.add_file_and_commit(repo, 'baz', 'Content3', commit2)
+        self.assertEqual(
+            [(commit3.id, '11d22fa0986b3bb341baa76b8a6a757a46a2f916')],
+            list(get_hashes_for_commits(repo, stop_at=commit2.id)))
+
+    def create_git_repo(self):
+        tmpdir = tempfile.mkdtemp()
+        repo = Repo.init(tmpdir)
+        return repo
+
+    def add_file_and_commit(self, repo, filename, data, parent=None):
+        blob = Blob.from_string(data)
+        parents = []
+        tree = Tree()
+        if parent is not None:
+            tree = repo[parent.tree]
+            parents = [parent.id]
+        tree.add(0100644, filename, blob.id)
+        commit = Commit()
+        commit.tree = tree.id
+        author = 'You <you@example.com>'
+        commit.author = commit.committer = author
+        commit.commit_time = commit.author_time = int(time())
+        tz = parse_timezone('-0200')[0]
+        commit.commit_timezone = commit.author_timezone = tz
+        commit.encoding = "UTF-8"
+        commit.message = "A commit"
+        commit.parents = parents
+        object_store = repo.object_store
+        object_store.add_object(blob)
+        object_store.add_object(tree)
+        object_store.add_object(commit)
+        repo.refs['refs/heads/master'] = commit.id
+        return commit
diff --git a/apps/patchwork/utils.py b/apps/patchwork/utils.py
index e41ffb6..353147e 100644
--- a/apps/patchwork/utils.py
+++ b/apps/patchwork/utils.py
@@ -17,8 +17,15 @@ 
 # along with Patchwork; if not, write to the Free Software
 # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 
+import os
+from StringIO import StringIO
 
-from patchwork.models import Bundle, Project, BundlePatch
+from dulwich.client import get_transport_and_path
+from dulwich.patch import write_tree_diff
+from dulwich.repo import Repo
+
+from patchwork.parser import hash_patch, parse_patch
+from patchwork.models import Bundle, BundlePatch
 from django.shortcuts import get_object_or_404
 
 def get_patch_ids(d, prefix = 'patch_id'):
@@ -137,3 +144,64 @@  def set_bundle(user, project, action, data, patches, context):
     bundle.save()
 
     return []
+
+
+def ensure_source_checkout_for_project(project):
+    forest = '/home/salgado/src' # This is where we store the trees we checkout
+    root = os.path.join(forest, project.linkname)
+    if not os.path.exists(root):
+        repo = Repo.init(root, mkdir=True)
+    else:
+        repo = Repo(root)
+
+    transport, path = get_transport_and_path(project.source_tree)
+    refs = transport.fetch(path, repo)
+    # XXX: Is this the appropriate thing to do?  will there always be a master
+    # branch?
+    repo.refs['refs/heads/master'] = refs['HEAD']
+    return repo
+
+
+def get_hashes_for_commits(repo, stop_at):
+    # We don't care about the first commit, but if needed it should be
+    # possible to diff it against an empty tree and yield its hash as well.
+    commit = repo['HEAD']
+
+    while len(commit.parents) > 0:
+        commit_id = commit.id
+        if commit_id == stop_at:
+            break
+
+        parent = repo[commit.parents[0]]
+        diff = StringIO()
+        # In the case of merges, this won't have the same behavior as 'git
+        # show', which seems to omit files not changed since any of the
+        # parents (thanks to 'git tree-diff --cc'), but I think this is not a
+        # big deal as patches in Patchwork would never be identical to the
+        # diff of a merge anyway, would they?
+        try:
+            write_tree_diff(
+                diff, repo.object_store, parent.tree, commit.tree)
+        except KeyError, e:
+            # XXX: This happens in qemu because there's a commit that is
+            # actually on a submodule
+            # (8b06c62ae48b67b320f7420dcd4854c5559e1532) and the old commit_id
+            # used for the submodule
+            # (06d0bdd9e2e20377b3180e4986b14c8549b393e4) is gone (possibly
+            # because of a rebase?), so dulwich crashes with a KeyError.
+            raise
+
+        commit = parent
+        diff.seek(0)
+        try:
+            patch, _ = parse_patch(diff.read().decode('utf-8'))
+        except UnicodeDecodeError:
+            # TODO: Need to find out why this happens and see if we really
+            # need to skip such commits.
+            print "Skipping %s" % commit.id
+            continue
+
+        # When commits just add files or change permissions the diff will be
+        # empty and thus parse_patch() will return None.
+        if patch is not None:
+            yield commit_id, hash_patch(patch).hexdigest()