Wednesday, February 23, 2011

Size does matter, Crunching data for the cloud with a Google Protocol Buffer Serializer

So, most cloud services charge by some unit of size (Amazon charges by the gigabyte). If you’re working for a company that uses these services, then, well, duh... send around the smallest bits of data you can, right?

The way I’ve got Ellemy stuff for serialization now is that I have one little interface.

    public interface ISerializer
    {
        object Deserialize(string input, Type desiredType);
        object DeserializeObject(string input);
        string Serialize(object input);
    }

I hate the DeserializeObject thing, but, well, that’s for another time.


I have one concrete implementation of it currently, and it uses JSON. I won’t show it here, because that’s not the point of the post.


So, I’ve been looking at Google Protocol Buffers and the .NET project for it (protobuf-net), and I figured I’d write an ISerializer that uses protocol buffers. It looks pretty nice: apparently about half the size of JSON serialization and a bit quicker too. The not-so-nice part is that the data is not self-describing, so you MUST know the type of the object being requested. Honestly though, that’s not a problem for most apps I write. Just send the type in the message, and we’re good, right?
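Sending the type along with the message could look roughly like this. It is only a sketch: the Envelope class and its Wrap/Unwrap methods are names I made up for illustration, not something that exists in Ellemy or protobuf-net, and the serializer is passed in as a delegate so the example stands on its own.

```csharp
using System;

// Hypothetical envelope: carries the serialized payload plus the name of its
// .NET type, so the receiver knows which type to ask the serializer for.
public class Envelope
{
    public string TypeName { get; set; }
    public string Payload { get; set; }

    // serialize is any function with the shape of ISerializer.Serialize.
    public static Envelope Wrap(object message, Func<object, string> serialize)
    {
        return new Envelope
        {
            TypeName = message.GetType().AssemblyQualifiedName,
            Payload = serialize(message)
        };
    }

    // deserialize is any function with the shape of ISerializer.Deserialize.
    public object Unwrap(Func<string, Type, object> deserialize)
    {
        // Throws if the named type cannot be loaded on the receiving side.
        var type = Type.GetType(TypeName, true);
        return deserialize(Payload, type);
    }
}
```

With an ISerializer in hand, that would be used as Envelope.Wrap(message, serializer.Serialize) on the way out and envelope.Unwrap(serializer.Deserialize) on the way in. The assembly-qualified name is long, so a shorter agreed-upon type key would be a better fit if size is the whole point.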


So taking a look at the Protobuf-net wiki, I see that we’ll need data contracts for our messages.


Given a class like so.


    public class SomeThingToSerialize
    {
        public string SomeStringProperty { get; set; }
        public Guid SomeGuidProperty { get; set; }
        public int SomeIntProperty { get; set; }
    }

We’d need a class like this so that Protocol Buffer can do its thing.


    [ProtoContract]
    public class SomeThingToSerialize
    {
        [ProtoMember(1)]
        public string SomeStringProperty { get; set; }
        [ProtoMember(2)]
        public Guid SomeGuidProperty { get; set; }
        [ProtoMember(3)]
        public int SomeIntProperty { get; set; }
    }

I don’t want to muddy my problem domains with serialization-specific attributes, so I think what I’ll do is actually generate the classes for the messages on the fly. Let’s play in CodeDom a bit!


Test


    [TestFixture]
    public class ProtocolBufferGenerator_tests
    {
        private ProtocolBufferDataContractGenerator _generator;

        [SetUp]
        public void arrange()
        {
            _generator = new ProtocolBufferDataContractGenerator();
        }

        [Test]
        public void message_is_decorated_with_ProtoContract_attribute()
        {
            var random = new Random();
            var testThing = new SomeThingToSerialize
            {
                SomeGuidProperty = Guid.NewGuid(),
                SomeIntProperty = random.Next(),
                SomeStringProperty = "Blah"
            };
            var protoClass = _generator.GenerateProtoFor(testThing);
            var foundProtoContractAttribute = protoClass.GetType()
                .GetCustomAttributes(false)
                .Any(attribute => attribute.GetType().Name == "ProtoContractAttribute");

            Assert.IsTrue(foundProtoContractAttribute);
        }
    }

Yeah, just what the test says: it checks that the object this ProtocolBufferDataContractGenerator generates is decorated with the correct attribute. Let’s make it so.


   1:  public class ProtocolBufferDataContractGenerator
   2:      {
   3:          private const string _codeContractsNamespace = "Ellemy.CQRS.Serializers.GoogleProtocolBuffers.Contracts";
   4:          
   5:   
   6:          public object GenerateProtoFor<T>(T thing)
   7:          {
   8:              var nameSpace = new CodeNamespace(_codeContractsNamespace);
   9:              nameSpace.Imports.Add(new CodeNamespaceImport("ProtoBuf"));
  10:              nameSpace.Imports.Add(new CodeNamespaceImport(thing.GetType().Namespace));
  11:              var @class = new CodeTypeDeclaration(thing.GetType().Name)
  12:                               {
  13:                                   IsClass = true,
  14:                                   Attributes = MemberAttributes.Public
  15:                               };
  16:              var protoContractAttribute = new CodeAttributeDeclaration("ProtoContract");
  17:              @class.CustomAttributes.Add(protoContractAttribute);
  18:              nameSpace.Types.Add(@class);
  19:              var compileUnit = new CodeCompileUnit();
  20:              compileUnit.Namespaces.Add(nameSpace);
  21:              compileUnit.ReferencedAssemblies.Add("protobuf-net.dll");
  22:              var thingAssembly = thing.GetType().Assembly;
  23:              var assemblyToAdd = thingAssembly.GetName().Name + ".dll";
  24:              compileUnit.ReferencedAssemblies.Add(assemblyToAdd);
  25:              var parameters = new CompilerParameters {GenerateInMemory = true};
  26:   
  27:              var provider = new CSharpCodeProvider();
  28:              var results = provider.CompileAssemblyFromDom(parameters,compileUnit);
  29:              if(results.Errors.Count != 0)
  30:              {
  31:                  throw new InvalidOperationException(results.Errors[0].ErrorText);
  32:              }
  33:             
  34:              return
  35:                  results.CompiledAssembly.CreateInstance(string.Format("{0}.{1}", _codeContractsNamespace,thing.GetType().Name));
  36:   
  37:   
  38:          }
  39:   
  40:       
  41:          }

Lots of code there, but it does work! If you’re not familiar with CodeDom, it’s actually not very complicated. We’re just using C# to generate C#, and the code is fairly self-documenting.
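If you want to see what CodeDom is actually producing, you can ask the provider to write the source out as text instead of compiling it. Here is a small sketch of that. CodeDomPreview, ToSource and BuildSample are made-up names, and the "Sample.Contracts" namespace is a placeholder. BuildSample just rebuilds a cut-down version of the graph from GenerateProtoFor.

```csharp
using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;
using Microsoft.CSharp;

public static class CodeDomPreview
{
    // Renders a CodeDom graph as C# source text without compiling it.
    public static string ToSource(CodeCompileUnit compileUnit)
    {
        using (var provider = new CSharpCodeProvider())
        using (var writer = new StringWriter())
        {
            provider.GenerateCodeFromCompileUnit(compileUnit, writer, new CodeGeneratorOptions());
            return writer.ToString();
        }
    }

    // Builds an empty class decorated with ProtoContract, like the first
    // version of GenerateProtoFor does (namespace name is a placeholder).
    public static CodeCompileUnit BuildSample()
    {
        var nameSpace = new CodeNamespace("Sample.Contracts");
        nameSpace.Imports.Add(new CodeNamespaceImport("ProtoBuf"));
        var @class = new CodeTypeDeclaration("SomeThingToSerialize") { IsClass = true };
        @class.CustomAttributes.Add(new CodeAttributeDeclaration("ProtoContract"));
        nameSpace.Types.Add(@class);

        var compileUnit = new CodeCompileUnit();
        compileUnit.Namespaces.Add(nameSpace);
        return compileUnit;
    }
}
```

Calling Console.WriteLine(CodeDomPreview.ToSource(CodeDomPreview.BuildSample())) should print a namespace containing the attributed, empty class, which is handy for debugging when the compile step reports an error.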


Pretty neat that we’re leaving the assembly in memory, but I’m thinking that in the future, we might wanna save that assembly, and not pay the cost of genning that class multiple times, but we’ll get to that soon enough.
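Short of saving the assembly to disk, the cheapest way to avoid generating the same class over and over is to remember the generated type per message type. A rough sketch, assuming the generator can be expressed as a function from message type to contract type; ContractTypeCache and GetContractFor are hypothetical names.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical cache: remembers the generated contract type for each message type,
// so the CodeDom compile only has to happen once per type.
public class ContractTypeCache
{
    private readonly ConcurrentDictionary<Type, Type> _contracts =
        new ConcurrentDictionary<Type, Type>();
    private readonly Func<Type, Type> _generate;

    // generate is whatever builds and compiles the contract type (the expensive part).
    public ContractTypeCache(Func<Type, Type> generate)
    {
        _generate = generate;
    }

    public Type GetContractFor(Type messageType)
    {
        // Note: under contention GetOrAdd may run the factory more than once,
        // but only one result is kept and returned to every caller.
        return _contracts.GetOrAdd(messageType, _generate);
    }
}
```

Instances of the contract can then be created with Activator.CreateInstance on the cached type, which is far cheaper than compiling again.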


So far, we’re creating an empty class decorated with the ProtoContract attribute, which is perfectly worthless since we don’t have the properties. Let’s fix that.


First, make sure we’re adding properties on the message.


    [Test]
    public void properties_are_added()
    {
        // _protoClass is the generated object; the fixture is assumed to create it
        // in its SetUp method via _generator.GenerateProtoFor(...).
        var expectedNumberOfProperties = typeof(SomeThingToSerialize).GetProperties().Count();
        var actualNumberOfProperties = _protoClass.GetType().GetProperties().Count();
        Assert.AreEqual(expectedNumberOfProperties, actualNumberOfProperties);
    }

To make it pass I wrote this little method and called it on line 19 of the GenerateProtoFor method (the exact spot doesn’t matter, as long as it runs before the compile).


    private void AddProperties<T>(T thing, CodeTypeDeclaration @class)
    {
        foreach (var propertyInfo in thing.GetType().GetProperties().OrderBy(p => p.Name))
        {
            // Private backing field, e.g. _SomeIntProperty
            var field = new CodeMemberField
            {
                Type = new CodeTypeReference(propertyInfo.PropertyType.FullName),
                Attributes = MemberAttributes.Private,
                Name = "_" + propertyInfo.Name
            };
            @class.Members.Add(field);

            // Public property that reads and writes the backing field
            var @property = new CodeMemberProperty
            {
                Name = propertyInfo.Name,
                HasGet = true,
                HasSet = true,
                Type = new CodeTypeReference(propertyInfo.PropertyType.FullName),
                Attributes = MemberAttributes.Public,
            };
            var getter = new CodeSnippetStatement(String.Format("return _{0};", propertyInfo.Name));
            @property.GetStatements.Add(getter);
            var setter = new CodeSnippetStatement(String.Format("_{0} = value;", propertyInfo.Name));
            @property.SetStatements.Add(setter);
            @class.Members.Add(@property);
        }
    }

Ok, so now we need to decorate each property with a ProtoMember attribute. Here’s the test.


    [Test]
    public void every_property_on_the_message_is_decorated_with_a_ProtoMember_attribute()
    {
        foreach (var propertyInfo in _protoClass.GetType().GetProperties())
        {
            var foundProtoMemberAttribute = _protoClass.GetType().GetProperty(propertyInfo.Name)
                .GetCustomAttributes(false)
                .Any(attribute => attribute.GetType().Name == "ProtoMemberAttribute");
            Assert.IsTrue(foundProtoMemberAttribute);
        }
    }


And now let’s make it pass. I wrote this little method and call it from inside that loop in AddProperties.


    private void AddProtoMemberAttribute(CodeMemberProperty property, int memberNumber)
    {
        var protoBuffAttribute = new CodeAttributeDeclaration("ProtoMember");
        var attributeArgument = new CodeAttributeArgument(new CodePrimitiveExpression(memberNumber));
        protoBuffAttribute.Arguments.Add(attributeArgument);
        property.CustomAttributes.Add(protoBuffAttribute);
    }

Sweet, now we’re generating the DataContracts! We’re still not actually serializing objects though. We’re just making stuff that Google Protocol Buffers can work with.


Let’s add a new test.



    [TestFixture]
    public class using_the_GoogleProtocolBuffer_serializer
    {
        private Serializer _serializer;

        [SetUp]
        public void Arrange()
        {
            _serializer = new Serializer();
        }

        [Test]
        public void serialize_an_non_DataContract_class()
        {
            var testThing = new TestThing { Guid = Guid.NewGuid(), Int = 1, String = "Some String" };
            var output = _serializer.Serialize(testThing);
            Assert.IsNotNullOrEmpty(output);
            Console.WriteLine(output);
        }
    }


Ok, so not much going on here, we’re just testing that the output is actually not null, and (by extension), that the Serializer class doesn’t throw an exception.


Here’s what I did to make it work.


 

   1:   public class Serializer : ISerializer 
   2:      {
   3:          private readonly ProtocolBufferDataContractGenerator _protocolBufferDataContractGenerator;
   4:   
   5:          public Serializer()
   6:          {
   7:              _protocolBufferDataContractGenerator = new ProtocolBufferDataContractGenerator();
   8:          }
   9:          public object Deserialize(string input, Type desiredType)
  10:          {throw new NotImplementedException("patience is a virtue"); }
  11:   
  12:          public object DeserializeObject(string input)
  13:          {
  14:              throw new NotSupportedException("nope, dis dont werk ");
  15:          }
  16:          public string Serialize(object input)
  17:          {
  18:              var t = _protocolBufferDataContractGenerator.GenerateProtoFor(input);
  19:              foreach (var property in input.GetType().GetProperties())
  20:              {
  21:                  var setterForT = t.GetType().GetProperty(property.Name);
  22:                  var value = property.GetValue(input, null);
  23:                  setterForT.SetValue(t, value,null);
  24:              }
  25:              string data;
  26:              using (var writer = new MemoryStream())
  27:              {
  28:                  ProtoBuf.Serializer.NonGeneric.Serialize(writer, t);
  29:                  writer.Position = 0;
  30:                  using (var reader = new StreamReader(writer,Encoding.ASCII))
  31:                  {
  32:                      data = reader.ReadToEnd();
  33:                  }
  34:              }
  35:              return data;
  36:          }
  37:      }
  38:  }

On line 18, I simply get an instance of the DataContract class (we just saw what’s in there). I then loop through all the properties on the input object via reflection and copy each value onto the matching property of the DataContract instance.


On lines 26-34, we’re just using protobuf-net to serialize the object and return the results.


Not horribly complicated, but we’re still not done, because we can’t yet deserialize.


New test.


    [Test]
    public void serialize_then_deserialize()
    {
        var testThing = new TestThing { Guid = Guid.NewGuid(), Int = 1, String = "Some String" };
        var output = _serializer.Serialize(testThing);
        var result = (TestThing)_serializer.Deserialize(output, typeof(TestThing));
        Assert.AreEqual(testThing.Guid, result.Guid);
        Assert.AreEqual(testThing.String, result.String);
        Assert.AreEqual(testThing.Int, result.Int);
        Assert.AreEqual(testThing.Enum1, result.Enum1);
    }


Yeah, now we’re testing that it actually works. Let’s make this pass.


    public object Deserialize(string input, Type desiredType)
    {
        var bytes = ASCIIEncoding.ASCII.GetBytes(input);
        var @event = Activator.CreateInstance(desiredType);
        using (var stream = new MemoryStream(bytes))
        {
            var thisSucksINeedToFixIt = Activator.CreateInstance(desiredType);
            var protobufferType = _protocolBufferDataContractGenerator.GenerateProtoFor(thisSucksINeedToFixIt).GetType();
            var protobuffer = ProtoBuf.Serializer.NonGeneric.Deserialize(protobufferType, stream);

            // Copy every value from the generated contract back onto the real object
            foreach (var fieldInfo in protobuffer.GetType().GetProperties())
            {
                var setter = desiredType.GetProperty(fieldInfo.Name);
                var value = fieldInfo.GetValue(protobuffer, null);
                setter.SetValue(@event, value, null);
            }
        }
        return @event;
    }

Not too bad, huh?


Except for that thisSucksINeedToFixIt variable. I should defer to some IoC container or something there, but honestly, I don’t think events should have dependencies. Yeah, argument for a different day.


Ok, so the test passes except for the Guid assertion. Gunna go take a look at that now.


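Ok, after a lot of googling I realized that the issue was my ASCII encoding. Protocol buffer output is raw binary, and ASCII only covers 7 bits, so any byte above 127 gets mangled on its way through a string (and a Guid usually has plenty of those). I doubt anyone reading this blog cares too much, but what I did to fix it was to use the BitConverter.ToString() method when serializing and some code I ripped off from Stack Overflow to deserialize. If you’re interested, yeah, go git the code!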
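For the curious, the round trip looks something like the sketch below. This is not the exact code from the repo: HexText, Encode and Decode are names invented for this example, and the decode side is just one straightforward way of turning hex text back into bytes.

```csharp
using System;

public static class HexText
{
    // BitConverter.ToString yields "0A-FF-10"; the dashes are stripped to keep it shorter.
    public static string Encode(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", string.Empty);
    }

    // Reads the text two characters at a time and parses each pair as a base-16 byte.
    public static byte[] Decode(string hex)
    {
        if (hex.Length % 2 != 0)
            throw new ArgumentException("Hex text must have an even number of characters.", "hex");

        var bytes = new byte[hex.Length / 2];
        for (var i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        }
        return bytes;
    }
}
```

In Serialize that would mean returning HexText.Encode(writer.ToArray()) instead of reading the stream through a StreamReader, and in Deserialize starting from HexText.Decode(input). One thing to keep in mind: hex uses two characters per byte, which eats into the size savings that were the point of all this. Convert.ToBase64String and Convert.FromBase64String do the same job at roughly four characters per three bytes.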


So there we go, I like it, but there’s still a lot of work to go before it’s ready for prime time.


Some things I have to get done.



  • Handle versioning well (the order is very important: with the current implementation, if you added a new property in the middle of a class, the numbering of the properties after it would shift and break old stuff)

  • Maybe even output .proto files, and have the GoogleProtocolBufferGenerator use them if they exist, that way it’ll work by default with conventions, but allow customization, and it’ll save on the overhead of generating the contract.
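To make the versioning problem concrete, here is a small sketch of what happens to the member numbers, assuming each number is simply the property’s position in the name-sorted loop from AddProperties. FieldNumbering and ByName are made-up names, and SomeHeaderProperty is an invented new property. The sketch also sorts ordinally to keep it deterministic, whereas the real loop uses the default string ordering.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FieldNumbering
{
    // Mimics the current convention: sort the property names, then number them from 1.
    public static Dictionary<string, int> ByName(IEnumerable<string> propertyNames)
    {
        return propertyNames
            .OrderBy(name => name, StringComparer.Ordinal)
            .Select((name, index) => new { name, number = index + 1 })
            .ToDictionary(x => x.name, x => x.number);
    }
}
```

For SomeGuidProperty, SomeIntProperty and SomeStringProperty this gives 1, 2 and 3. Add SomeHeaderProperty and it lands at 2, pushing SomeIntProperty to 3 and SomeStringProperty to 4, so anything serialized with the old numbering no longer lines up with the new contract.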

Have fun with it! Thoughts?

6 comments:

  1. Cool stuff. For info, in v2 (not yet fully released, but stable) you **don't** need to muddy your objects - you can define all the metadata at runtime. Also, re adding properties - it shouldn't break anything, and there is an interface (or easier: base-class) for providing safe round-trip of data you don't know about explicitly.

  2. THE Marc Gravell!
    Thanks man!

    Also thanks for writing such a nice wrapper around Protocol Buffers!

    I love the idea of doing all that at runtime, since my little framework already has bootstrap specific stuff going on. When someone specifies the serialization mechanism, I could go run that code.

    Can you point me at an example of specifying the metadata at runtime? That's exciting stuff.

  3. So let me get this straight. You use CodeDom to generate some code that Marc's library then reflects on to pull out the values? Seems like you could skip a couple of steps and cancel out a lot of work. Looks like v2 makes that possible.

    I've studied Protocol Buffers, but I admit I haven't looked at protobuf-net. PB handles versioning with field numbers to detect added or missing fields. If you don't have the attributes or some other mechanism to assign unchanging numbers to fields, you'll break this mechanism. You can't just sequentially number the fields, since a code change will alter those numbers.

    So I think you will need to read the .proto files in order to preserve field identity. You can write them, too, but only as an aid to the developer. The dev has to be able to uniquely identify fields somehow, and the .proto file seems the best place to do it.

  4. Re runtime; TypeModel.Create; http://code.google.com/p/protobuf-net/source/browse/trunk/protobuf-net.unittest/Meta/Basic.cs

  5. @Michael,
    Yeah, I mentioned the versioning issue at the end of the blog.

    I'll go for "some other mechanism" over attributes to handle versioning though.

    I want my framework to be serialization-mechanism agnostic.

    I'm thinking that I'll output the actual generated contracts (either an assembly of classes that use Marc's stuff, or actual PB native proto files) to disk, then modify the "Generator" to use them but update as needed. That would handle the versioning. At least I think it would :)

    The point is to make using Protocol Buffers as frictionless as possible. If that means my object(s) that implement my ISerializer interface have a little more to do, that's ok by me, because

    a) it works
    b) it's easy to use (low friction to consumer)
    c) it's well tested


    @Marc,

    Thanks, how does TypeModel.Create handle versioning?

    Does the approach I just mentioned to Michael seem like it would work?

  6. Also, I didn't blog it, but I'm caching the generated classes, so we're only taking the CodeDom hit once per message type.
