Default Java Serialization
An
ObjectOutputStream
writes primitive data types and graphs of Java objects to anOutputStream
. The objects can be read (reconstituted) using anObjectInputStream
. Persistent storage of objects can be accomplished by using a file for the stream. If the stream is a network socket stream, the objects can be reconstituted on another host or in another process.
Only objects that support the
java.io.Serializable
interface can be written to streams. The class of each serializable object is encoded including the class name and signature of the class, the values of the object’s fields and arrays, and the closure of any other objects referenced from the initial objects.
The method
writeObject
is used to write an object to the stream. Any object, includingString
s andArray
s, is written withwriteObject
. Multiple objects or primitives can be written to the stream. The objects must be read back from the correspondingObjectInputstream
with the same types and in the same order as they were written.
Primitive data types can also be written to the stream using the appropriate methods from
DataOutput
.String
s can also be written using thewriteUTF
method.
The default serialization mechanism for an object writes the class of the object, the class signature, and the values of all non-transient and non-static fields. References to other objects (except in transient or static fields) cause those objects to be written also. Multiple references to a single object are encoded using a reference sharing mechanism so that graphs of objects can be restored to the same shape as when the original was written.
In order to serialize an object, a custom defined class object, with the default serialization mechanism,
simply extending the java.io.Serializable
, then the JVM will handle all of it.
And according to the above documentation from ObjectOutputStream
, the default java serialization will
using a reference sharing mechanism to record the graphs of objects, so that when multiple objects
refers to the same object, then only one object will be serialized by default.
And in the inheritance, if a super class doesn’t extend the java.io.Serializable
, to make its subclasses
serializable, the super class must have a constructor which takes no arguments so as to be called by
JVM to create the super part of the subclass in the deserialization phase.
serialVersionUID
serialVersionUID
is an ID which is stamped on object when it get serialized usually hashcode of
object, you can use tool serialver to see serialVersionUID
of a serialized object . serialVersionUID
is used for version control of object. you can specify serialVersionUID
in your class file also.
Consequence of not specifying serialVersionUID
is that when you add or modify any field in class
then already serialized class will not be able to recover because serialVersionUID
generated for new
class and for old serialized object will be different. Java serialization process relies on correct
serialVersionUID
for recovering state of serialized object and throws java.io.InvalidClassException
in case of serialVersionUID
mismatch.
Customized Serialization
One way to change the default behavior of serialization is to mark some fields as transient
and then
serialize those fields by your defined function.
For example, a Point
class,
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
public class Point extends Serializable {
private transient int x;
private transient int y;
...
private void writeObject(ObjectOutputStream oos) throws IOException {
oos.defaultWriteObject();
oos.writeInt(x);
oos.writeInt(y);
}
private void readObject(OjbectInputStream ois) throws IOException, ClassNotFoundException {
ois.defaultReadObject();
x = ois.readInt();
y = ois.readInt();
}
}
The above code will improve the deserialization performance. But the below code will produce wrong logic,
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
public class PointCollection extends Serializable {
private transient Point[] points;
...
private void writeObject(ObjectOutputStream oos) throws IOException {
oos.defaultWriteObject();
oos.writeInt(points.length);
for(int i = 0; i < points.length; ++i) {
oos.writeInt(points[i].getX());
oos.writeInt(points[i].getY());
}
}
private void readObject(OjbectInputStream ois) throws IOException, ClassNotFoundException {
ois.defaultReadObject();
int length = ois.readInt();
points = new Point[length];
for(int i = 0; i < length; ++i) {
points[i] = new Point(ois.readInt(), ois.readInt());
}
}
}
The problem is that in the original object, Point
instances in the points
field may share references
with each other referring to some Point
s, but after the deserialization, every Point
in the points
is a distinct object instance.
Extending java.io.Externalizable
By extending java.io.Serializable
, without implementing the logic, the JVM will automatically use
reflection to marshal and unmarshal the object. Before Java 1.3, reflection was very slow, so by
extending java.io.Externalizable
, and implementing the writeExternal
method and readExternal
method help to get around of the bottleneck caused by reflection.
And by extending java.io.Externalizable
, it means that it is a must to maintain the logic manually.
And also be careful with case of generating wrong logic in customized serialization above.
Kryo
Kryo is a 3rd party serialization library. And after some searching, it seems that there exists other 3rd party serialization library that claims achieving better performance compared with Kryo. And all of them seem to have some issue on stability.
Serialization of closure in Scala
Here are two pieces of useful resource on this topic, link1, and link2, and spark also implements how to clean the closure so as to serialize less stuff as link3.
In summary, when a function get serialized in Scala, then Scala will serialize the whole closure of that function, so it can recreate the function correctly.
1
2
3
4
5
6
7
object Foo {
val capturedValue = 11
def f() = (x: Int) => capturedValue * x
}
val f = Foo.f
In the above example, the function value f
is serializable, and since the variable Foo.capturedValue
is
involved in the closure, it will be serialized together.
But in the following exmaple, unless the class Foo
extends java.io.Serializable
, it is not serializable.
1
2
3
4
5
6
7
class Foo {
val capturedValue = 11
def f() = (x: Int) => capturedValue * x
}
val f = new Foo().f
In this case, Scala will serialize the entire instance of Foo class so as to serialize the f
function,
even the closure only includes one specific value of that instance.
For the static methods and non-static methods, JVM serializes them differently, see here and here.
So to avoid serialize the entire instance in this kind of cases, Spark does some work to eliminate unnecessary serializations.